Introduction and Business Understanding
This research builds and runs Neural Network and Boosted Neural Network models on the selected predictors to predict the alcohol involvement of the driver in a crash.
Data Understanding and Analysis
The overall analysis is based on the "Alcohol Prediction Data" dataset, using JMP with Principal Component Analysis (PCA). The data set consists of 92,924 records. Before the analysis could be performed, a significant amount of data preprocessing was necessary, including treatment of missing values and screening for outliers. In addition, across all the variables, we excluded the rows flagged by this screening to avoid bias.
Also, to avoid the curse of dimensionality, we had to reduce the number of levels for each variable and the number of continuous variables. Finally, the training set is used to fit the model, and the validation set is used to check whether the model holds up on a different data set. We split the data into training, validation, and test sets in proportions of 0.65, 0.20, and 0.15, respectively.
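The 0.65 / 0.20 / 0.15 split can be illustrated outside JMP with a minimal Python sketch. It assumes a simple random shuffle with a fixed seed. The function name `split_rows` and the seed are hypothetical and are not part of the JMP workflow used in this research.

```python
import random

def split_rows(rows, fractions=(0.65, 0.20, 0.15), seed=0):
    """Shuffle rows and cut them into training, validation, and test sets.

    Assumption: a plain random split; JMP may assign rows differently.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n = len(shuffled)
    n_train = round(n * fractions[0])
    n_valid = round(n * fractions[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]    # the remainder goes to the test set
    return train, valid, test

# Row indices stand in for the 92,924 records of the data set.
train, valid, test = split_rows(range(92924))
print(len(train), len(valid), len(test))
```

The test set takes whatever remains after the first two cuts, so the three sets always add up to the full data set even when the proportions do not divide the row count evenly.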
We created a new column called "DWI" with the formula: DWI = 1 if BAC >= 0.08, else 0.
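The same rule can be written as a short Python sketch. The function name `dwi_flag` is hypothetical, and the sample BAC values are made up for illustration only.

```python
def dwi_flag(bac, threshold=0.08):
    """Return 1 when BAC is at or above the threshold, else 0."""
    return 1 if bac >= threshold else 0

# Illustrative BAC values, not taken from the data set.
sample_bac = [0.00, 0.05, 0.08, 0.15]
sample_dwi = [dwi_flag(b) for b in sample_bac]
print(sample_dwi)  # [0, 0, 1, 1]
```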
To build the first model for this research, the Neural Network, we corrected our binning process by combining more variables, such as Race Description, Hour, and Week. As a result, we use 11 variables instead of the 10 used in the previous research: HitRunDescription, Injury, DrugDescription, AgeGroup, DayWeek, SexDescription, Month, AngleDescription, AnumOccs, DrugResults, PreDwiDescription.
We also fixed the binning to achieve better results, even though the Neural Network model does not need binning.
A Neural Network is a machine learning method that looks at data and tries to learn the function, or set of calculations, that turns the input variables into the output. Each neuron has associated weights, and training modifies these weights to adjust the strength of each input.
Neural networks learn from examples and exhibit some capability for generalization beyond the training data.
The objective is to build a model with a lower error rate, since no matter what we do, the specificity error is always more significant than the sensitivity error. The first model we built had only one layer and gave an overall error rate of 25%, which is high even compared to the Decision Tree models. As mentioned before, the best results so far came from a Regression Model, with an error rate of 21% after fixing the binning. On the other hand, even with the regression model, the area under the ROC curve was not much closer to 1.
JMP offers three activation functions for the hidden layer: TanH, Linear, and Gaussian. TanH is the hyperbolic tangent function, often used for nominal variables. Linear is most often used in conjunction with the nonlinear activation functions. Gaussian gives radial basis function behavior.
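To make the three activation functions concrete, here is a minimal Python sketch of one hidden layer. It assumes the Gaussian activation takes the form exp(-x^2). The function names, weights, and inputs are hypothetical and were not taken from our fitted model.

```python
import math

def tanh_act(x):
    """TanH: hyperbolic tangent, squashes values into (-1, 1)."""
    return math.tanh(x)

def linear_act(x):
    """Linear: identity, passes the weighted sum through unchanged."""
    return x

def gaussian_act(x):
    """Gaussian: assumed here to be exp(-x^2), peaking at 1 when x = 0."""
    return math.exp(-x * x)

def hidden_layer(inputs, weights, biases, activations):
    """Compute the output of each hidden node: activation(bias + weighted sum)."""
    outputs = []
    for node_weights, bias, act in zip(weights, biases, activations):
        z = bias + sum(w * x for w, x in zip(node_weights, inputs))
        outputs.append(act(z))
    return outputs

# Hypothetical example: two inputs feeding three nodes, one per activation type.
example = hidden_layer(
    inputs=[1.0, 2.0],
    weights=[[0.5, -0.25], [1.0, 1.0], [0.0, 0.0]],
    biases=[0.0, 0.0, 0.0],
    activations=[tanh_act, linear_act, gaussian_act],
)
print(example)
```

A setting such as TanH: 6, Linear: 0, Gaussian: 0 corresponds to six nodes that all use `tanh_act` in this sketch.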
We tried models with 2, 3, and 6 layers (Appendix A shows an example) and kept those with the lowest error rates. We kept the Transform Covariates option checked in every run. The best ROC curve (Appendix B) came from the activation settings TanH: 6, Linear: 0, Gaussian: 0. Even though we know that TanH is the most suitable function for nominal variables, we also tried other activation functions, but they did not give the results we wanted.
The overall error rate is 22%, obtained by dividing the number of false predictions by the total number of cases (Appendix C). For the alcohol-impaired cases, the model classifies correctly 65% of the time; this is the sensitivity. For non-alcohol-impaired cases, the model predicts correctly 92% of the time; this is the specificity. The false-positive rate, the percentage of positive (1-Yes) predictions that are wrong, is 14%, and the false-negative rate, the percentage of negative (0-No) predictions that are wrong, is 28%.
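The measures above can be computed from the four cells of a confusion matrix, as in the following Python sketch. The counts are hypothetical and are not the Appendix C values. The false-positive and false-negative rates follow the definitions used in this report (the share of positive or negative predictions that are wrong).

```python
def confusion_measures(tp, fn, fp, tn):
    """Compute the report's measures from confusion-matrix counts.

    tp: impaired cases predicted as impaired
    fn: impaired cases predicted as not impaired
    fp: non-impaired cases predicted as impaired
    tn: non-impaired cases predicted as not impaired
    """
    total = tp + fn + fp + tn
    return {
        "error_rate": (fp + fn) / total,          # false predictions / all cases
        "sensitivity": tp / (tp + fn),            # impaired cases classified correctly
        "specificity": tn / (tn + fp),            # non-impaired cases classified correctly
        "false_positive_rate": fp / (tp + fp),    # wrong share of positive predictions
        "false_negative_rate": fn / (tn + fn),    # wrong share of negative predictions
    }

# Hypothetical counts, for illustration only.
measures = confusion_measures(tp=40, fn=10, fp=5, tn=45)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```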
The ROC curve is used to compare models. It shows what happens to the sensitivity, plotted against 1 - specificity, as the cutoff value is decreased from 1 to 0. Both axes range from 0 to 1, so the AUC, the area under the curve, is standardized to be 1 for a model that classifies all cases correctly. In our case, the AUC on the validation data set is close to 0.85, the highest we have obtained. The closer the AUC is to 1, the better the model.
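As a rough illustration of how the ROC curve and AUC are obtained, the Python sketch below lowers the cutoff step by step and applies the trapezoid rule. The labels and predicted probabilities are made up, and the function names `roc_points` and `auc` are hypothetical. JMP produces these values directly, so this is only a sketch of the idea.

```python
def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs as the cutoff is lowered.

    Assumes labels are 0/1 and both classes are present.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # cutoff above every score: nothing predicted positive
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Made-up labels (1 = impaired) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(roc_points(labels, scores)), 3))  # 0.889
```

In this toy example one non-impaired case is ranked above one impaired case, which is why the AUC falls short of 1.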
The lift curve is another way to evaluate the gain in prediction. Our lift curve starts with a lift of more than 2.5. Ideally, the lift should decline through the first 30% of the data, which is a good feature of a model; in our case, it falls sharply over about the first 10% of the data and then declines steadily. The lift reaches 2 at about 40-45% of the ranked data, which is 5-10% better than the Decision Tree model.
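Lift at a given portion of the ranked data is the rate of impaired cases among the top-ranked rows divided by the overall rate. The Python sketch below shows the calculation on made-up data. The function name `lift_at` is hypothetical.

```python
def lift_at(labels, scores, portion):
    """Lift for the top `portion` of rows when ranked by predicted probability."""
    ranked = [y for _, y in sorted(zip(scores, labels),
                                   key=lambda pair: pair[0], reverse=True)]
    k = max(1, round(len(ranked) * portion))   # number of top-ranked rows
    top_rate = sum(ranked[:k]) / k             # impaired share in the top rows
    base_rate = sum(ranked) / len(ranked)      # impaired share overall
    return top_rate / base_rate

# Made-up data: 3 impaired cases out of 10.
lift_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
lift_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
for portion in (0.2, 0.5, 1.0):
    print(portion, round(lift_at(lift_labels, lift_scores, portion), 2))
```

The lift always ends at 1 when the whole data set is included, which is why the curve declines as the portion grows.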
Overall, this model performed better than any other in identifying non-alcohol-impaired cases.
For the next model, a boosted Neural Network, a second layer is not allowed, and we set the number of models to three.
For the first run, we used JMP's default settings; after inserting the values into the Excel formula, our error rate was around 24%. We then set the activation functions to TanH = 6 and kept the Transform Covariates option checked.
The overall error rate becomes 23% and the sensitivity 64%, which is not a big difference from the model without boosting. For non-alcohol-impaired cases, the model predicts correctly 90% of the time (the specificity). The false-positive rate is 13% and the false-negative rate is 29% (Appendix C). The AUC stays much the same, around 0.83, and the portion of ranked data with a lift of 2 drops by about 5-7% compared to the original model.
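The general idea behind boosting, fitting a sequence of small models where each one is fit to the residuals of the models before it, can be sketched in Python as follows. This is not JMP's implementation: a one-split stump is swapped in for the small neural network to keep the example short, and the learning rate of 0.5, the function names, and the data are all assumptions.

```python
def fit_stump(xs, residuals):
    """Hypothetical weak learner: the best single split on one numeric input."""
    best = None
    for split in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right) if right else left_mean
        sse = sum((r - (left_mean if x <= split else right_mean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or sse < best[0]:
            best = (sse, split, left_mean, right_mean)
    _, split, left_mean, right_mean = best
    return lambda x: left_mean if x <= split else right_mean

def boost(xs, ys, n_models=3, learning_rate=0.5):
    """Fit n_models weak learners in sequence, each on the current residuals."""
    models = []
    predictions = [0.0] * len(xs)
    for _ in range(n_models):
        residuals = [y - p for y, p in zip(ys, predictions)]
        model = fit_stump(xs, residuals)
        models.append(model)
        predictions = [p + learning_rate * model(x)
                       for x, p in zip(xs, predictions)]
    # The final model is the sum of the scaled small models.
    return lambda x: sum(learning_rate * m(x) for m in models)

# Made-up data: the target switches from 0 to 1 between x = 2 and x = 3.
boosted = boost([1, 2, 3, 4], [0, 0, 1, 1], n_models=3, learning_rate=0.5)
print(boosted(1), boosted(4))  # 0.0 0.875
```

With three models, as in our boosted run, each added model closes part of the remaining gap between the prediction and the target.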
Neural Network and Boosted Neural Network models were developed for the BAC data. We achieved even higher sensitivity than with the Decision Tree model, which is a result of the binning work. Also, the error rate fluctuated between 23% and 24% in the previous research, and we have now achieved better results. Overall, the ROC and lift curves show that our best results came from the Neural Network model. Compared to the regression model, we can still improve our model, but the improvement would depend on the binning process and the number of layers.
Research Provided by Andrey Fateev
Appendix A – Example of Neural Network Diagram.
Appendix B – ROC Curve & Lift Curve for Neural Network.
Appendix C – Confusion Matrix for Neural Network and Neural Network Boosted.
Appendix D – ROC Curve & Lift Curve for Neural Network Boosted.