Application of Machine Learning in Accident Data Analysis: A Case Study Using Self-report Questionnaire

Abstract

Background: Traffic accidents remain a critical global public health issue, causing numerous fatalities and injuries every year. Objectives: This study explores the application of machine learning (ML) to traffic accident data obtained from self-report questionnaires, with the aim of identifying factors that influence the incidence and severity of accidents. Methods: The study used a cross-sectional design. Approximately 660 participants completed the questionnaire; 43 responses were incomplete or invalid and were excluded, leaving 617 participants who answered all questions in full. Participants were recruited by convenience sampling from five districts of Shiraz, including outreach to taxi and heavy vehicle terminals, to increase the diversity of the sample. Data were collected through face-to-face questionnaires administered by trained researchers, and all responses were self-reported. The resulting dataset includes information on demographics, vehicle and road features, personality traits, driving habits, and risky driving behavior. The questionnaire incorporated multiple validated instruments capturing driving behavior, demographics (such as age, gender, marital status, education, and income), and habits (e.g., driving duration, cellphone use, fatigue, and substance use). Random forest models, interpreted with SHapley Additive exPlanations (SHAP) analysis, were used to identify factors influencing both the occurrence and severity of accidents. The C5.0 algorithm was used to extract specific patterns, and prediction tasks were addressed with random forest, support vector machine (SVM), logistic regression, and Naive Bayes algorithms.
Results: The random forest algorithm identified income, driving time, working time, age, duration of non-stop driving, type of law enforcement, openness, normlessness, sensation seeking, and vehicle safety as the most influential factors for the occurrence of accidents. For accident severity, important predictors included driving time, non-stop driving, working time, age, aggressive violations, income, road quality, type of law enforcement, driving while tired, vehicle safety, foreign car status, and vehicle comfort. The C5.0 algorithm also revealed specific patterns: high normlessness and extended driving hours increased the likelihood of accidents, whereas low normlessness and balanced income acted as protective factors. Conclusions: The findings highlight the impact of lifestyle and work-related factors, as well as certain personality traits of drivers, on the incidence and severity of accidents. Although the results should be interpreted with caution because they rely on self-reported data, the study supports the application of ML in the analysis of accident data. It also supports strategies such as social and economic interventions, psychological assessments, enhanced road safety education, and regulatory measures tailored to individual risk assessments to prevent traffic accidents.
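
Illustrative note: SHAP attributes a model's prediction to individual features using Shapley values, that is, each feature's marginal contribution averaged over all orders in which features can be added. The minimal sketch below computes exact Shapley values by brute force for a toy risk score. The function `risk_score`, the feature names, and all numbers are hypothetical and chosen only for illustration; they are not the study's fitted random forest, its data, or its results.

```python
from itertools import permutations


def risk_score(features):
    """Hypothetical toy accident-risk score (NOT the study's model)."""
    score = 0.1 * features["driving_hours"] + 0.2 * features["normlessness"]
    # Assumed interaction: long hours matter more when vehicle safety is low.
    if features["driving_hours"] > 8 and features["vehicle_safety"] < 3:
        score += 0.5
    return score


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution of each feature
    over all orderings, with absent features held at their baseline value."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]  # switch this feature "on"
            value = model(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


# Made-up reference driver and made-up driver to explain.
baseline = {"driving_hours": 6, "normlessness": 2, "vehicle_safety": 4}
driver = {"driving_hours": 10, "normlessness": 4, "vehicle_safety": 2}

phi = shapley_values(risk_score, driver, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+.2f}")
```

The contributions sum to the difference between the driver's score and the baseline score, which is the additivity property that gives SHAP its name. Brute-force enumeration scales factorially with the number of features, so practical analyses of tree ensembles rely on dedicated algorithms rather than this approach.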
