Introduction and Business Understanding
This research aims to predict whether a driver involved in a crash was alcohol-impaired. We run a decision tree on each predictor to bin its values, and then fit decision tree, bootstrap forest, and boosted tree models on the binned predictors.
Data Understanding and Analysis
The analysis is based on a dataset named "Alcohol Prediction Data" and was carried out in JMP with Principal Component Analysis (PCA). The data set consists of 92,924 records. Before analysis could be performed, a significant amount of data preprocessing was necessary, including treatment of missing values and screening for outliers. In addition, across all variables, we excluded the rows with such problems to avoid bias.
Also, to avoid the curse of dimensionality, we had to reduce the number of levels of each variable and the number of continuous variables. Finally, the model is fitted on the training set and then applied to the validation set to check whether it holds up on different data. We split the data into training, validation, and test sets in proportions of 0.65, 0.20, and 0.15, respectively.
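The 0.65 / 0.20 / 0.15 split can be sketched outside JMP as follows. This is only an illustration: the function name, the seed, and the row count of 1,000 are assumptions, and JMP's own validation-column utility may assign rows differently.

```python
import random

def assign_split(n_rows, fractions=(0.65, 0.20, 0.15), seed=0):
    """Return one 'train' / 'validation' / 'test' label per row, in random order."""
    n_train = round(n_rows * fractions[0])
    n_valid = round(n_rows * fractions[1])
    n_test = n_rows - n_train - n_valid  # remainder goes to the test set
    labels = ["train"] * n_train + ["validation"] * n_valid + ["test"] * n_test
    random.Random(seed).shuffle(labels)  # fixed seed keeps the split reproducible
    return labels

# Hypothetical row count, used only to show the proportions.
labels = assign_split(1000)
counts = {name: labels.count(name) for name in ("train", "validation", "test")}
```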
We created a new column called "DWI" with the formula: DWI = 1 if BAC >= 0.08, else 0.
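Written as code, the DWI rule is a single comparison. The sketch below assumes BAC is available as a plain number; the sample readings are hypothetical, and how missing BAC values were handled is not covered here.

```python
def dwi_flag(bac, threshold=0.08):
    """Return 1 when BAC is at or above the threshold, else 0."""
    return 1 if bac >= threshold else 0

# Hypothetical BAC readings, for illustration only.
sample_bac = [0.00, 0.05, 0.08, 0.15]
dwi = [dwi_flag(b) for b in sample_bac]  # [0, 0, 1, 1]
```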
Our initial binning process aimed to achieve a lower error rate, and we also added an independent variable: instead of 10, we use 11 variables, which are: Injury, Gender, Age group, PrevDwiDescription, AnumOccs, HitRunDescription, Month, Hour, Day_Week, AngleDescription, and RestUse.
The first model is Partition, which builds a decision tree to predict the response variable. With this model we watch R-square, which is based on the difference between observed and expected cell frequencies. At each step we split on the variable with the highest LogWorth (the negative base-10 logarithm of the p-value) rather than looking at the p-value itself, and we keep splitting until R-square stops changing and the tree is stable. We chose DWI as the response variable and the 10 necessary or statistically significant variables as predictors: Hour, Day_Week, Age Group, Month, AnumOccs, PrevDwiDescription, HitRunDescription, DrugResultDescription, Sex Description, and Injury. The overall error rate is 23%, obtained by dividing the number of false predictions by the total number of cases (Appendix A). The model correctly classifies 61% of the alcohol-impaired cases; this is the sensitivity. For non-alcohol-impaired cases, the model predicts correctly 84% of the time; this is the specificity. The false-positive rate, measured here as the percentage of positive (1-Yes) predictions that are wrong, is 38%, and the false-negative rate, the percentage of negative (0-No) predictions that are wrong, is 17%.
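The rates described above all come from the four cells of a confusion matrix. The sketch below shows the arithmetic. The counts are hypothetical, chosen per 1,000 cases so that the rates land near the reported ones; they are not the actual confusion matrix from Appendix D, and the function and key names are ours.

```python
def classification_rates(tp, fn, fp, tn):
    """Compute the report's rates from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "error_rate": (fp + fn) / total,           # all wrong predictions / all cases
        "sensitivity": tp / (tp + fn),             # impaired cases correctly flagged
        "specificity": tn / (tn + fp),             # non-impaired cases correctly cleared
        "false_positive_share": fp / (fp + tp),    # share of 1-Yes predictions that are wrong
        "false_negative_share": fn / (fn + tn),    # share of 0-No predictions that are wrong
    }

# Hypothetical counts per 1,000 cases (NOT the study's real confusion matrix).
rates = classification_rates(tp=185, fn=119, fp=111, tn=585)
```

Note that the last two rates are computed per prediction (column of the matrix), whereas sensitivity and specificity are computed per actual class (row of the matrix).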
The ROC curve is used to compare models. It plots sensitivity against 1 - specificity as the cutoff value is decreased from 1 to 0. Both axes range from 0 to 1, and the AUC, the area under the curve, equals 1 for a model that classifies all cases correctly. In our case, the AUC for the validation data set is close to 0.80. The closer the AUC is to 1, the better the model.
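To make the construction of the ROC curve concrete, the sketch below sweeps the cutoff from high to low, records (1 - specificity, sensitivity) at each step, and integrates with the trapezoid rule. The labels and scores are hypothetical; JMP computes the curve internally and this is not its implementation.

```python
def roc_points(y_true, scores):
    """Return (1 - specificity, sensitivity) points as the cutoff is lowered."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        cutoff = ranked[i][0]
        # Cases that share a score cross the cutoff together.
        while i < len(ranked) and ranked[i][0] == cutoff:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical labels and predicted probabilities.
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
area = auc(roc_points(y, p))  # 0.75
```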
The lift curve is another way of evaluating the gain in prediction. Our lift curve starts at a lift of more than 2.5. Ideally, the lift should stay high through about 30% of the data before declining, which would be a good feature of the model. However, ours starts to fall at about 10% and then declines steadily. The lift remains around 2 for about 35-40% of the ranked data, which is decent. The model is better at predicting no alcohol involvement in crashes than at identifying alcohol-impaired drivers. The ROC and lift curves give a similar picture to the one we saw for the Prediction Model.
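Lift at a given portion of the data is the DWI rate among the top-ranked cases divided by the overall DWI rate. A minimal sketch, with hypothetical labels and scores and a function name of our own:

```python
def lift_at(y_true, scores, portion):
    """Lift in the top `portion` of cases when ranked by predicted probability."""
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda pair: -pair[0])]
    k = max(1, round(len(ranked) * portion))   # number of cases in the top portion
    top_rate = sum(ranked[:k]) / k             # DWI rate among the top-ranked cases
    base_rate = sum(ranked) / len(ranked)      # DWI rate in the whole data set
    return top_rate / base_rate

# Hypothetical data: 3 impaired drivers out of 10, scores already in ranked order.
y = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
lift_top_20 = lift_at(y, p, 0.20)
lift_all = lift_at(y, p, 1.00)
```

With a base rate of 30%, the highest possible lift is 1 / 0.30, about 3.3, and the lift always falls back to 1 once the whole data set is included.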
The next model is Bootstrap Forest, which builds a collection of decision trees on random samples of the data and averages their results to predict the response. Again, we used the default settings in JMP and entered the resulting values into our Excel formulas. The error rate was 24%, with non-alcohol-impaired cases predicted correctly 88% of the time (the specificity). The false-positive rate was 34% and the false-negative rate 21%.
For the ROC curve, the AUC did not reach 0.80, so this model is not as good as the previous one (Appendix B).
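The idea behind a bootstrap forest (resample the rows with replacement, fit a tree to each resample on a randomly chosen predictor, average the predictions) can be sketched in a few lines. This is a deliberately simplified stand-in, not JMP's algorithm: each "tree" here is a single split on one categorical predictor, and the rows, labels, and all names are hypothetical.

```python
import random

def fit_stump(rows, labels, feature):
    """One-split 'tree': the observed DWI rate for each level of one predictor."""
    totals, hits = {}, {}
    for row, y in zip(rows, labels):
        level = row[feature]
        totals[level] = totals.get(level, 0) + 1
        hits[level] = hits.get(level, 0) + y
    rates = {level: hits[level] / totals[level] for level in totals}
    overall = sum(labels) / len(labels)  # fallback for levels missing from the sample
    return feature, rates, overall

def predict_stump(stump, row):
    feature, rates, overall = stump
    return rates.get(row[feature], overall)

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    features = list(rows[0].keys())
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample, with replacement
        feature = rng.choice(features)              # random predictor for this tree
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def predict_forest(forest, row):
    """Average the predicted DWI probability over all trees."""
    return sum(predict_stump(s, row) for s in forest) / len(forest)

# Hypothetical binned data: impaired drivers appear only at night on weekends.
rows = ([{"Hour": "night", "Day_Week": "weekend"}] * 4
        + [{"Hour": "night", "Day_Week": "weekday"}] * 2
        + [{"Hour": "day", "Day_Week": "weekend"}] * 2
        + [{"Hour": "day", "Day_Week": "weekday"}] * 2)
labels = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]

forest = fit_forest(rows, labels)
p_high = predict_forest(forest, {"Hour": "night", "Day_Week": "weekend"})
p_low = predict_forest(forest, {"Hour": "day", "Day_Week": "weekday"})
```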
The last model is Boosted Tree, which builds a sequence of smaller decision trees to predict the response. As with the Bootstrap Forest model, we used the default settings. The error rate is 23%, with non-alcohol-impaired cases predicted correctly 88% of the time (the specificity). The false-positive rate is 33% and the false-negative rate 20%. For the ROC curve, the AUC was similar to that of the previous models (Appendix C).
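Boosting builds its sequence of small trees by fitting each new tree to what the previous trees still get wrong. The sketch below uses squared-error boosting with one-split stumps on a single numeric predictor. It is a simplified illustration under our own assumptions (the loss, the learning rate, and the toy data are all hypothetical) and not JMP's Boosted Tree implementation.

```python
def fit_stump(x, residuals):
    """Find the threshold on x that best splits the residuals (least squares)."""
    best = None
    for t in sorted(set(x))[:-1]:  # every threshold leaves both sides non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    return best[1:]

def boost(x, y, n_trees=20, learning_rate=0.1):
    """Fit stumps one after another, each on the residuals of the current model."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        t, left_mean, right_mean = fit_stump(x, residuals)
        stumps.append((t, left_mean, right_mean))
        pred = [pi + learning_rate * (left_mean if xi <= t else right_mean)
                for xi, pi in zip(x, pred)]
    return {"base": base, "rate": learning_rate, "stumps": stumps}

def predict(model, xi):
    total = sum(lm if xi <= t else rm for t, lm, rm in model["stumps"])
    return model["base"] + model["rate"] * total

# Hypothetical data: hour of crash and DWI flag.
hours = [1, 2, 3, 14, 15, 16]
dwi = [1, 1, 1, 0, 0, 0]
model = boost(hours, dwi)
night_score = predict(model, 2)
day_score = predict(model, 15)
```

Each stump only nudges the prediction by a fraction (the learning rate) of the remaining error, which is why many small trees are needed.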
Decision Tree, Bootstrap Forest, and Boosted Tree models were developed for the BAC data. Higher sensitivity was achieved with the Decision Tree model, and sensitivity was the one thing we wanted to improve in the overall results. The error rate ranged from 23% to 24%. We could improve the models by lowering the cutoff probability, but that would reduce specificity, which we tried to avoid; we could also improve the Regression model. The binning therefore has to be redone entirely for the chosen variables; we believe this could lower the error rate by 2-3 percentage points, to 20%-21%. Overall, we achieved the best results with the Regression model and the Decision Tree model, as their ROC and lift curves show. The models can still be improved, but that depends on the decisions described above.
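The trade-off between sensitivity and specificity when the cutoff probability is lowered can be seen on a small example. The labels and predicted probabilities below are hypothetical, and the function name is ours.

```python
def rates_at_cutoff(y_true, probs, cutoff):
    """Return (sensitivity, specificity) when cases with prob >= cutoff are predicted 1."""
    tp = sum(1 for y, p in zip(y_true, probs) if y == 1 and p >= cutoff)
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < cutoff)
    tn = sum(1 for y, p in zip(y_true, probs) if y == 0 and p < cutoff)
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical validation cases: 4 impaired, 6 not impaired.
y = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
p = [0.9, 0.7, 0.4, 0.35, 0.6, 0.45, 0.3, 0.2, 0.1, 0.05]

sens_default, spec_default = rates_at_cutoff(y, p, 0.5)  # sensitivity 0.5, specificity 5/6
sens_lower, spec_lower = rates_at_cutoff(y, p, 0.3)      # sensitivity 1.0, specificity 0.5
```

Lowering the cutoff from 0.5 to 0.3 catches every impaired case in this toy data, but half of the non-impaired cases are then flagged as well.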
Research Provided by Andrey Fateev
Appendix A – ROC Curve & Lift Curve for Partitioning.
Appendix B – ROC Curve & Lift Curve for Random Forest.
Appendix C – ROC Curve & Lift Curve for Boosted Tree.
Appendix D – Summary with Confusion Matrix of all 4 models.