Predicting Driver Alcohol Involvement in Crashes


Introduction and Business Understanding

This research aims to build a logistic regression model to predict whether the driver in a crash was alcohol-impaired.

Data Understanding and Analysis

The analysis is based on a dataset named “Alcohol Prediction Data” and was carried out in JMP, with Principal Component Analysis (PCA) used for exploration. The data set consists of 92,924 records. Before any analysis could be performed, a significant amount of data preprocessing was necessary, including treatment of missing values and screening for outliers.

MOD_YEAR, AnumOccs, HOUR, FATALS, and BAC contained a significant number of outliers, which we excluded: 162, 14, 658, 30, and 2, respectively.

BAC and Age Desc had 6,026 and 73 missing values, respectively, which we excluded as well.

Also, to avoid the curse of dimensionality, we reduced the number of levels for each categorical variable and the number of continuous variables. The training set is used to fit the model, and the validation set checks whether the model generalizes to a different data set. We split the data into training, validation, and test sets in proportions of 0.65, 0.20, and 0.15, respectively.
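The split described above was done in JMP, but as a minimal sketch (with made-up records, since the actual columns are not reproduced here), a 65/20/15 partition can be expressed as:

```python
import random

# Sketch of a 65/20/15 train/validation/test split, assuming the records
# are already loaded as a list of dicts (hypothetical sample data).
records = [{"BAC": random.random() * 0.2, "HOUR": random.randrange(24)}
           for _ in range(1000)]

random.seed(42)            # fixed seed so the split is reproducible
random.shuffle(records)    # shuffle before partitioning

n = len(records)
n_train = int(n * 0.65)
n_valid = int(n * 0.20)

train = records[:n_train]
valid = records[n_train:n_train + n_valid]
test  = records[n_train + n_valid:]

print(len(train), len(valid), len(test))  # 650 200 150
```

Shuffling before partitioning matters: without it, any ordering in the source file (e.g., by year or state) would leak into the split and bias validation.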

Our data set supports a binary response variable, coded 0 or 1 according to the outcome we want to predict. The legal blood alcohol concentration (BAC) limit in the United States is 0.08, meaning 0.08 g of alcohol per 100 mL of blood; a driver at or above this value is considered to be driving under the influence of alcohol. We therefore created a new column called “DWI”, where DWI = 1 if BAC >= 0.08, else 0.
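The DWI formula above is a simple threshold rule. As a sketch (the sample BAC values here are invented, only the 0.08 cutoff comes from the text):

```python
# Derive the binary DWI response from BAC (g of alcohol per 100 mL of blood).
# The 0.08 legal threshold is from the report; the sample values are made up.
bac_values = [0.00, 0.05, 0.08, 0.12, 0.21]

dwi = [1 if bac >= 0.08 else 0 for bac in bac_values]
print(dwi)  # [0, 0, 1, 1, 1]
```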

For the binning process, we examined specific variables: Age, Injury, Hour, Day of Week, BodyDescription, RestUseDescription, and AngleDescription of the drivers. We then built Fit Y by X charts with DWI on the Y-axis and Injury and the other variables on the X-axis. It is important to apply the Local Data Filter so that only the training data set is used; this avoids bias and prevents the overfitting that can occur when the validation and test sets are included while manipulating the data.
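Binning collapses a continuous predictor into a few levels. The cut points below are hypothetical illustrations, not the ones actually chosen in the JMP analysis:

```python
# Illustrative binning of two continuous predictors into coarse levels.
# These cut points are assumptions for the sketch, not the report's bins.
def bin_hour(hour):
    # group the 24 crash hours into four periods of the day
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "evening"
    return "night"  # hours 0-5

def bin_age(age):
    # group driver age into four bands
    if age < 25:
        return "under 25"
    if age < 45:
        return "25-44"
    if age < 65:
        return "45-64"
    return "65+"

print(bin_hour(2), bin_age(30))  # night 25-44
```

Fewer, well-chosen levels per variable is exactly what keeps the dimensionality of the model manageable, as noted earlier.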

To build the logistic regression model behind the confusion matrix, we chose DWI as the response variable together with the statistically significant predictors, 10 in total, where p-value < 0.05; not every variable had to be below 0.05, and the limit used in this research was 0.03. The target level was set to 1, since we want to predict DWI = 1, i.e., drivers involved in crashes with alcohol in their system. In Appendix D, each variable has a LogWorth value, a function of the p-value that represents the ranking of importance of each variable. The misclassification rate on the validation set is 0.21. It is essential to understand that the confusion matrix evaluates the effectiveness of the model's predictions; it is the basis for the various metrics available to evaluate a model. The matrix summarizes the predictions for 17,214 drivers: the model correctly classified 3,047 drivers as alcohol-impaired and 10,397 drivers as not alcohol-impaired (BAC < 0.08). It misclassified 2,418 alcohol-impaired drivers as having no alcohol involvement, and classified 1,352 drivers as alcohol-impaired who were not.

The initial overall error rate was 35%, obtained simply by dividing the number of false predictions by the total number of cases. After correcting the binning and substituting the right statistically significant variables, with no biased or unstable estimates, the error rate drops to 22%, which is good. The model correctly classifies alcohol-impaired cases 56% of the time, called sensitivity. For non-alcohol-impaired cases, the model predicts accurately 88% of the time, as defined by the specificity. The false-positive rate, the percentage of positive (1-Yes) predictions that are wrong, is 31%; the false-negative rate, the percentage of negative (0-No) predictions that are wrong, is 19%. In other words, the model is better at predicting the absence of alcohol involvement in crashes than at identifying alcohol impairment. (Appendix C).
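The metrics above follow directly from the four confusion-matrix counts reported for the validation set; recomputing them is a useful sanity check:

```python
# Recompute the reported validation metrics from the confusion matrix
# counts given in the text (target level DWI = 1).
tp, fn = 3047, 2418   # alcohol-impaired drivers: correctly caught / missed
tn, fp = 10397, 1352  # non-impaired drivers: correct / falsely flagged

total = tp + fn + tn + fp                # 17,214 drivers in total
misclassification = (fp + fn) / total    # share of wrong predictions
sensitivity = tp / (tp + fn)             # impaired cases caught
specificity = tn / (tn + fp)             # non-impaired cases correct
false_pos_rate = fp / (tp + fp)          # wrong share of "1" predictions
false_neg_rate = fn / (tn + fn)          # wrong share of "0" predictions

print(round(misclassification, 2), round(sensitivity, 2),
      round(specificity, 2), round(false_pos_rate, 2),
      round(false_neg_rate, 2))  # 0.22 0.56 0.88 0.31 0.19
```

The recomputed values match the rounded figures quoted in the text (22%, 56%, 88%, 31%, 19%), confirming the counts and the metrics are consistent.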

The ROC curve is used to compare models: it shows what happens to the sensitivity as the cutoff value is decreased from 1 to 0, plotted against 1 − specificity. Both axes range from 0 to 1, and the AUC, the area under the curve, equals 1 for a model that correctly classifies all cases. In our case, the AUC for the validation data set is close to 0.80; the closer the curve gets to 1, the better the model. (Appendix A).
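JMP reports the AUC directly, but as a minimal sketch of what the number means (using made-up scores, not the actual model output), the AUC equals the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case:

```python
# Rank-based AUC: the probability that a random positive case outscores
# a random negative one (ties count half). Toy labels/scores, not model output.
def auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.6, 0.5, 0.3, 0.2]
print(auc(labels, scores))
```

This pairwise interpretation makes clear why an AUC of 0.80 means the model ranks an impaired driver above a non-impaired one about 80% of the time.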

The lift curve in Appendix B is another method of evaluating the gain in prediction. Our lift curve starts above 2.5 and, ideally, would hold that level through the top 30% of the data, which would be a good feature of the model. In practice, it starts to fall after roughly the top 10% and then declines steadily; still, the lift stays at about 2 for roughly the top 35–40% of the ranked data, a decent percentage.
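A lift value compares the hit rate among the top-ranked cases with the overall positive rate. A minimal sketch with toy numbers (the labels below are invented and assumed already sorted by descending predicted probability):

```python
# Cumulative lift: hit rate in the top fraction of ranked cases divided by
# the overall positive rate. Toy labels, assumed sorted by descending score.
labels = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0,
          0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
overall_rate = sum(labels) / len(labels)  # 6/20 = 0.30

def lift_at(frac):
    k = int(len(labels) * frac)           # size of the top slice
    return (sum(labels[:k]) / k) / overall_rate

print(lift_at(0.10), lift_at(0.50))
```

Here the top 10% yields a lift above 3 while the top 50% drops toward 1.7, mirroring the declining shape described for the curve in Appendix B.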


In summary, a logistic regression model was developed for the BAC data.

The sensitivity needs to be higher than 56%; this could be achieved by adding variables, refining the binning, and choosing statistically significant predictors with more linear relationships. We could also improve sensitivity by lowering the cutoff probability, but that would decrease specificity, which we tried to avoid, since we also want the model to identify non-alcohol-impaired cases correctly. With an AUC of about 0.80, the ROC curve still leaves roughly 0.20 of room for improvement.

Research Provided by Andrey Fateev


Appendix A – ROC Curve.

Appendix B – Lift Curve.

Appendix C – Confusion Matrix.

Appendix D – Effect Summary.