AN ARTIFICIAL NEURAL NETWORK MODEL FOR HIGHWAY ACCIDENT PREDICTION: A CASE STUDY OF ERZURUM, TURKEY



INTRODUCTION
Worldwide, 1.3 million people die and 20-50 million are injured each year in traffic accidents. According to World Health Organization (WHO) data collected from 178 countries, traffic accidents are the ninth most common cause of death among all age groups [1]. Owing to rapid and continuing economic development in Turkey, the number of vehicles, transportation demand, and the number of traffic accidents have all increased over recent decades. Traffic accidents cause not only injuries and deaths but also a loss of social wealth. Reducing highway traffic accidents is therefore very important in developing countries such as Turkey, and traffic safety should be improved through the analysis of accident characteristics.
The number of accidents on a given highway section during a certain period of time is probabilistic in nature and is a non-negative integer. Although accidents are random and unpredictable at the micro level, statistical models can give reliable estimates of expected accidents at the macro level by relating aggregate accident counts to explanatory measures of flow, site characteristics, and road geometry. Numerous empirical relationships between vehicle accidents and these explanatory variables have been established in previous studies [2,3,4]. The reliability of traffic accident prediction models is therefore important for improving traffic safety management.
Traffic accident prediction models have been developed to understand the factors affecting traffic accidents and, eventually, to reduce accidents by controlling and/or improving those factors [5]. As far as the authors are aware, the first accident-prediction models for multi-lane roads were devised by Persaud and Dzbik [6]. Relationships between crash data and traffic flow, expressed both as average daily traffic (ADT) and hourly volume (HV), were proposed. The analysis was based on generalized linear models. The results showed that crash rate increases with increasing traffic flow, whether expressed as ADT or HV [7].
Knuiman examined the effect of the median width of four-lane roads on crash rate using a Negative Binomial distribution. The findings indicated that crash rate decreases with increasing median width [8].
Fridstrøm related road accidents to four variables: traffic flow, speed limits, weather, and lighting conditions. The study used Negative Binomial regression. On the basis of the four independent variables, it was able to explain between 85 and 95% of the systematic component of variation [9].
Hadi proposed several accident-prediction models for both multi-lane and two-lane roads of rural or urban designation. The dependent variables were total crash rate or injury crash rate. The values of these accident indicators were estimated as a function of annual average daily traffic (AADT) and road environmental factors. Poisson and Negative Binomial regression models were considered. Examining the effect of traffic flow on crash rate, the study concluded that crash rate increases with increasing AADT on roads with higher levels of traffic, while it decreases with AADT on roads with lower traffic volumes [10].
Persaud presented one of the earliest studies to carry out separate analyses for curves and tangents, albeit limited to two-lane roads. The dependent variable was crash frequency, while the independent variables were traffic flow and road geometry. The regression models were calibrated using generalized linear modelling. A dummy variable for "flat" or "undulating" terrain was also used. For curves, crash frequency was found to increase with AADT, section length (L), and curvature (1/R). For tangents, the number of accidents per year increases with AADT and L. A higher number of accidents on undulating terrain than on flat terrain was also shown [11].
Abdel-Aty and Essam Radwan used the Negative Binomial distribution to predict crash frequency as a function of AADT, degree of horizontal curvature, section length, lane, shoulder and median widths, and urban/rural designation. The results showed that crash frequency increases with AADT, degree of horizontal curvature, and section length. Accident frequency decreases, however, with increasing lane, shoulder, and median width [12].
Hauer developed statistical road safety modelling using the Negative Binomial distribution. The dependent variable was the number of accidents per year, while the independent variables were geometric characteristics and traffic flow. The most innovative aspect of this study was the introduction of an alternative tool for measuring the goodness-of-fit of predictive models, the so-called Cumulative Residuals (CURE) method. This method consists of plotting the cumulative residuals as a function of the independent variable of interest; a good CURE plot is one that oscillates around zero [13].
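The CURE idea described above can be illustrated with a short sketch. It is not taken from [13]; the function name and the sample values are hypothetical, and the residual is assumed to be observed minus predicted accidents, accumulated in order of the chosen independent variable.

```python
def cumulative_residuals(x, observed, predicted):
    """Return x sorted ascending and the running sum of residuals in that order."""
    rows = sorted(zip(x, observed, predicted), key=lambda row: row[0])
    xs, cure, running = [], [], 0.0
    for xi, obs, pred in rows:
        running += obs - pred      # residual assumed to be observed - predicted
        xs.append(xi)
        cure.append(running)
    return xs, cure

# Hypothetical sections: AADT, observed accidents, model-predicted accidents.
aadt = [3000, 1000, 2000, 4000]
observed = [5, 2, 3, 6]
predicted = [4.5, 2.5, 3.0, 6.0]

xs, cure = cumulative_residuals(aadt, observed, predicted)
for xi, ci in zip(xs, cure):
    print(xi, ci)
```

Plotting `cure` against `xs` gives the CURE plot; a curve that drifts steadily away from zero would suggest a systematic bias over that range of the variable.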
In a subsequent paper, Hauer applied the abovementioned statistical model to estimate crash frequency on undivided four-lane urban roads. The proposed models evaluated the number of accidents per year and carriageway as a function of the following independent variables: AADT, percentage of trucks, degree and length of horizontal curves, grade of tangents and length of vertical curves, lane width, shoulder width and type, roadside hazard rating, speed limit, access points (e.g. signalized intersections, stop-controlled intersections, commercial driveways, and other driveways), and the presence and nature of both parking and two-way left-turn lanes. The findings showed that the significant variables were AADT, the number of commercial driveways, and speed limit [14].
More recently, ANN models have been developed based on traffic accident case studies. Chang analyzed the national freeway of Taiwan with an ANN model [15]. Another study used ANN to consider the relationship between the probability of involvement in a traffic accident and driver characteristics [16]. Akgüngör and Doğan developed traffic accident models for Turkey using non-linear multiple regression and ANN, and another prediction model for some large cities in Turkey using ANN and genetic algorithm approaches [17,18]. They also studied accident prediction using Modified Smeed, Adapted Andreassen, and ANN approaches [19]. Cansız estimated the number of fatalities in accidents using a non-linear accident model based on the Smeed equation and an ANN model [20]. In these studies, ANN models produced better results than the other models.
Statistical methods were used in most previous studies, including the Linear Regression, Logistic Regression, Poisson, Negative Binomial, Zero-inflated Negative Binomial, and Generalized Linear Regression models. These methods are subject to strong assumptions and limitations in application. In contrast, ANN has proven efficient and effective in many fields. In traffic safety, some studies have applied ANN to forecasting traffic accidents on highways, but few have used a large data set. Therefore, this paper establishes a traffic accident prediction model using ANN with a large data set and several parameters. The ANN model successfully reproduced the number of traffic accidents in the original data set.

Case study data
The accident data cover a period of eight years, from 2005 to 2012, and relate to the road network of the Province of Erzurum in Eastern Turkey. The data used in this study comprise 7,780 complete accident reports collected from the Directory of Erzurum Traffic Region. Each accident report contains information such as the date, accident location, pavement type, vehicle type, driver's gender, driver's age, road surface condition, the day and time, weather condition, day or nighttime, the number of deaths, the number of injured persons, the number of involved vehicles, and the number of damaged vehicles. In addition to these data, traffic and geometric characteristics of the highway, such as AADT, the degree of horizontal and vertical curvature in each section, and lane, median, and shoulder widths, were collected from the 12th Highway Regional Directorate of Erzurum. After eliminating missing and erroneous data, 7,285 accident reports were used in this research. These data were categorized into 18 variables, as shown in Table 1 [21].
In order to analyze traffic accidents on highways, one needs to select highways that possess a wide variety of geometric and traffic characteristics. The goal of this data collection exercise was to divide these highways into segments with homogeneous characteristics. After reviewing several highways around Erzurum, it was decided that the D950-03, D100-28, D052-03, and D100-29 four-lane median-divided highways were the most appropriate for this task [21].
The highways include a total of 152 km of major principal arterials that connect the east and the west of Erzurum. These arterials are long enough to produce an adequate number of segments to develop the model. The information on the highways includes geometric characteristics, such as gradient, horizontal and vertical curves, shoulder widths, and median widths, and traffic characteristics, such as annual average daily traffic. D950-03, D100-28, D052-03, and D100-29 were divided into 5, 3, 4, and 4 highway segments, respectively, with a new segment defined by any change in the geometric and/or highway variables (e.g. a new section would be identified when the median changed from 2 to 4 m). Thus, each highway segment is uniform with respect to all the possible geometric and traffic features (Figure 1). The routes are as follows:
- D950-03, Erzurum-Tortum (52 km);
- D100-28, Erzurum South Ring Road (20 km);
- D052-03, Erzurum North Ring Road (30 km); and
- D100-29, Erzurum-Köprüköy (50 km).
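The segmentation rule described above, in which a new segment starts whenever a geometric or traffic attribute changes, can be sketched as follows. This is only an illustration under assumed inputs: the per-kilometre records, the attribute tuple (median width, shoulder width), and the function name are hypothetical and do not reproduce the study's actual inventory data.

```python
from itertools import groupby

def homogeneous_segments(stations):
    """Group consecutive 1-km stations that share identical attributes."""
    segments = []
    for attrs, group in groupby(stations, key=lambda station: station[1]):
        kms = [km for km, _ in group]
        segments.append({
            "start_km": kms[0],
            "end_km": kms[-1] + 1,   # each record is assumed to cover 1 km
            "attrs": attrs,
        })
    return segments

# Hypothetical records: (km, (median width in m, shoulder width in m)).
stations = [
    (0, (2, 1.5)),
    (1, (2, 1.5)),
    (2, (4, 1.5)),   # median changes from 2 to 4 m: new segment
    (3, (4, 1.5)),
    (4, (4, 2.0)),   # shoulder width changes: new segment
]

segments = homogeneous_segments(stations)
for seg in segments:
    print(seg)
```

Because `groupby` only merges adjacent records, two stretches with the same attributes that are separated by a different stretch remain separate segments, which matches the idea of contiguous homogeneous sections.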

Artificial neural network
Modelling non-linear systems is far more difficult than modelling linear systems. Disturbances influencing the system make the modelling task even more difficult. Scientists have studied non-linear system modelling for years and have succeeded in teaching non-linear system dynamics to artificial neural networks without any explicit mathematical modelling [22,23,24].
Neural networks, with their remarkable ability to learn complicated relations from imprecise data, can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained neural network can be thought of as an "expert" in the category of information it has been given to analyze. Artificial neural networks can be used in a variety of powerful ways: to learn and reproduce rules or operations from given examples; to analyze and generalize from samples and make predictions; or to memorize features of given data and match or associate new data with old data [25].
Figure 2 shows that there are several nodes in the input layer and one node in the output layer, which is called the target value. To obtain a strong mapping ability, a differentiable sigmoid function is used in the intermediate layer and a linear function in the output layer. There is no rule or theoretical basis for determining the number of neurons in the intermediate layer; it is found by iterative calculation.
Network training is a process by which the connection weights and biases of the ANN are adapted through a continuous process of stimulation by the environment in which the network is embedded. The primary goal of training is to minimize an error function by searching for a set of connection strengths and biases that causes the ANN to produce outputs that are equal or close to the targets. In other words, training aims at estimating the parameters ($W_1$, $W_2$, $b_1$, and $b_2$) by minimizing an error function, such as the mean square error (MSE) of the output values, expressed as [26]:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(t_{mi} - t_{gi}\right)^2$$
where $N$ is the number of data, $t_{mi}$ is the i-th observation value, and $t_{gi}$ is the i-th model value.
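To make the quantity being minimized concrete, the following sketch evaluates a small network of the kind described above (logistic hidden units, one linear output) and its mean square error. All weights, inputs, and targets are hypothetical placeholders, and no training step is shown; this is not the study's fitted model.

```python
import math

def logistic(z):
    """Sigmoid transfer function used in the hidden layer."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Hidden layer: logistic(W1 x + b1); output layer: linear W2 h + b2."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(W2, hidden)) + b2

def mse(targets, outputs):
    """Mean square error between target and network output values."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs)) / len(targets)

# Hypothetical 2-input, 2-hidden-neuron, 1-output network.
W1 = [[0.5, -0.2], [0.1, 0.3]]
b1 = [0.0, 0.1]
W2 = [1.2, -0.7]
b2 = 0.05

inputs = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]   # already scaled to 0-1
targets = [0.2, 0.4, 0.6]

outputs = [forward(x, W1, b1, W2, b2) for x in inputs]
print(mse(targets, outputs))
```

A training algorithm such as Levenberg-Marquardt would repeatedly adjust `W1`, `b1`, `W2`, and `b2` to reduce the value returned by `mse`.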

Development and application of ANN models
Alyuda Neuro Intelligence is neural network software designed to give experts intelligent support in applying neural networks to real-world forecasting, classification, and function approximation problems: preprocessing data sets, finding an efficient architecture, analyzing performance, and applying the neural network to new data. With it, experts can create and test their solutions much faster, increase their productivity, and improve results [27].
A number of software packages can estimate an ANN model with the Levenberg-Marquardt algorithm; Alyuda Neuro Intelligence was chosen for this study. In the ANN model, independent variables are called inputs and dependent variables are called outputs. The input importance chart shows the relative importance of each input column. This chart can help the user decide which input columns can be removed without affecting the results. It also helps to identify the columns that had the greatest influence on the network. The importance of an input column is calculated as the degradation in network performance after the input is removed and no longer used by the network [28]. Of the eighteen parameters applied in modelling, eight were found to be significant based on this criterion. The significant parameters are years, highway sections, section length (km), annual average daily traffic (AADT), the degree of horizontal curvature, the degree of vertical curvature, traffic accidents with heavy vehicles (percentage), and traffic accidents that occurred in summer (percentage). These 8 input variables, which represent the potential risk factors for accidents, correspond to 31 input neurons.
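The notion of input importance as performance degradation can be sketched in a few lines. This is not the procedure implemented in Alyuda Neuro Intelligence, whose internals are not described here; as an assumption, an input is "removed" by replacing its column with the column mean, and the stand-in model, data, and function names are hypothetical.

```python
def model_mse(model, rows, targets):
    """Mean square error of a model over a list of input rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def input_importance(model, rows, targets):
    """Relative importance (%) of each column from the error increase
    observed when that column is neutralised (assumption: set to its mean)."""
    base = model_mse(model, rows, targets)
    degradation = []
    for j in range(len(rows[0])):
        mean_j = sum(r[j] for r in rows) / len(rows)
        masked = [r[:j] + [mean_j] + r[j + 1:] for r in rows]
        degradation.append(max(model_mse(model, masked, targets) - base, 0.0))
    total = sum(degradation) or 1.0     # avoid division by zero
    return [100.0 * d / total for d in degradation]

# Hypothetical stand-in for a trained network: only column 0 matters.
def model(row):
    return 2.0 * row[0]

rows = [[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 0.0]]
targets = [model(r) for r in rows]

importance = input_importance(model, rows, targets)
print(importance)
```

Columns whose removal barely changes the error receive a share near zero and are candidates for elimination, which is the kind of screening described in the paragraph above.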
Before training the network, both input and output variables were normalized to the range 0-1 using a minimax algorithm. Categorical columns were automatically encoded during data preprocessing by the Alyuda Neuro Intelligence program using the One-of-N method. One-of-N encoding means that a column with N distinct categories (values) is encoded into a set of N numeric columns, with one column for each category. For example, for a Capacity column with the values "Low", "Medium" and "High", "Low" is represented as {1,0,0}, "Medium" as {0,1,0}, and "High" as {0,0,1}. For normalization, the minimum and maximum of the data set were found and scaling factors were selected so that these were mapped to the desired minimum and maximum values. The minimax algorithm is as follows:
$$H_l = \frac{H - H_{\min}}{H_{\max} - H_{\min}}$$
where $H_{\max}$ and $H_{\min}$ indicate the largest and the smallest values of $H$, respectively, and $H_l$ is the normalized value of the corresponding $H$. Normalizing the data greatly improves learning speed and helps reduce the error of the trained network. The mathematical formulation of the ANN is
$$Y = f_2\left(W_2\, f_1\left(W_1 X + b_1\right) + b_2\right) \qquad (3)$$
In Equation 3, $Y$ is the number of accidents, $X$ is the vector of input variables, $W_1$ and $W_2$ are weight matrices, $b_1$ and $b_2$ are bias vectors, and $f_1$ and $f_2$ are the transfer functions of the hidden and output layers. The effectiveness of the back propagation training algorithm depends on the number of neurons in the hidden layer; various numbers of neurons (ranging from 1 to 29) in the hidden layer were tested. In this study, the single output is the number of accidents for all highways.
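The two preprocessing steps described above, scaling to the range 0-1 and One-of-N encoding, can be sketched as follows. The function names and the AADT values are hypothetical; the Capacity example follows the one given in the text.

```python
def min_max_scale(values):
    """Map values linearly so the minimum becomes 0 and the maximum 1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: nothing to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def one_of_n(values, categories):
    """Encode each value as N indicator columns, one per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

# Hypothetical AADT values for three sections.
aadt = [2000, 5000, 8000]
scaled = min_max_scale(aadt)

# Capacity column from the example in the text.
capacity = one_of_n(["Low", "High", "Medium"], ["Low", "Medium", "High"])

print(scaled)
print(capacity)
```

One-of-N encoding is the reason a model can have more input neurons than input variables: each categorical variable contributes one neuron per category.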
The fitness criterion specifies which network parameter should be used to identify the best network. The software automatically searches for the best architecture, offering the user graphs of the search process and details of every tested neural network [28]. In the analysis, the sum of squares was adopted as the error function, the activation function was logistic, and the search method was exhaustive search. The five best network architectures obtained from the software are shown in Figure 3. Considering Table 2 and Figure 3, the optimal network architecture was found to be 31x9x1. The data were divided into three sets: the training set, the verification set, and the test set. Training algorithms do not use the verification or test sets to adjust network weights. The verification set may optionally be used to track the network's error performance, to identify the best network, and to stop training if over-learning occurs. The test set is not used in training at all; it is designed to give an independent assessment of the network's performance once the entire network design procedure is completed. Seventy percent of the data set was used to train the network, while the remaining thirty percent was used for testing and verification. The assignment of cases to the training, verification, and test subsets can sometimes affect the performance of the training algorithms. Cases can be left in their original order or grouped together in the subsets, but shuffling them randomly between subsets avoids this effect. In this model, the cases were shuffled randomly between the subsets (training, test, and verification) [29].
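A random partition of the kind described above might look like the sketch below. The text states only that 70% of the data was used for training and the remaining 30% for testing and verification; the equal 15%/15% division of that remainder, the seed, and the function name are assumptions made for illustration.

```python
import random

def shuffle_split(cases, train=0.70, verify=0.15, seed=0):
    """Shuffle cases, then cut them into training, verification and test sets.

    The 15% verification share is an assumption; whatever remains after the
    training and verification cuts becomes the test set.
    """
    cases = list(cases)
    random.Random(seed).shuffle(cases)   # seeded for a repeatable partition
    n_train = round(len(cases) * train)
    n_verify = round(len(cases) * verify)
    return (cases[:n_train],
            cases[n_train:n_train + n_verify],
            cases[n_train + n_verify:])

# Hypothetical case identifiers standing in for the accident records.
train_set, verify_set, test_set = shuffle_split(range(100))
print(len(train_set), len(verify_set), len(test_set))
```

Shuffling before cutting means that records from every year and highway section can appear in each subset, rather than, for example, the test set containing only the last year of data.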

Evaluation of ANN model
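The coefficient of determination ($R^2$), the mean square error (MSE), and the root mean square error (RMSE) are the main criteria used to evaluate the performance of the ANN model. They are defined as follows:
$$R^2 = r^2$$
where the correlation ratio $r$ is
$$r = \frac{n\sum\left(\text{actual}\times\text{predicted}\right) - \sum\text{actual}\,\sum\text{predicted}}{\sqrt{\left[n\sum\text{actual}^2 - \left(\sum\text{actual}\right)^2\right]\left[n\sum\text{predicted}^2 - \left(\sum\text{predicted}\right)^2\right]}}$$
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(t_{mi} - t_{gi}\right)^2$$
$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$
where $t_{mi}$ is the i-th observation value; $t_{gi}$ is the i-th model value; $N$ is the number of trained data; and $m$ is the number of parameters in the model (the total number of weights and bias terms in the network structure).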
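The evaluation criteria named in this section can be computed as in the following sketch. It assumes the usual Pearson form of the correlation ratio and an MSE averaged over the number of cases; the sample values are hypothetical and are not the study's results.

```python
import math

def correlation(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    n = len(actual)
    sum_a, sum_p = sum(actual), sum(predicted)
    sum_ap = sum(a * p for a, p in zip(actual, predicted))
    sum_aa = sum(a * a for a in actual)
    sum_pp = sum(p * p for p in predicted)
    numerator = n * sum_ap - sum_a * sum_p
    denominator = math.sqrt((n * sum_aa - sum_a ** 2) * (n * sum_pp - sum_p ** 2))
    return numerator / denominator

def mse(actual, predicted):
    """Mean square error over all cases."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(mse(actual, predicted))

# Hypothetical values: predictions perfectly correlated but biased.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 4.0, 6.0, 8.0]

r = correlation(actual, predicted)
r_squared = r ** 2
print(r, r_squared, mse(actual, predicted), rmse(actual, predicted))
```

The example is chosen to show why the criteria are reported together: a correlation of 1 does not by itself imply small errors, so MSE and RMSE are needed alongside $R^2$.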

RESULTS
The conclusion summary in Table 3 gives the minimum, maximum, mean, and standard deviation of the target, output, absolute error (AE), and absolute relative error (ARE), along with the R² and correlation values of the tested network. To assess the performance of the ANN model, the target values were compared with the model outputs; Table 3 shows the comparison results. The information criteria of the model are given in Table 4. R² is used to measure the closeness of fit. A perfect fit would give R² = 1, a very good fit a value near 1, and a poor fit a value near 0. In the ANN model, the correlation ratio and R² are 0.991186 and 0.982452, respectively, and the MSE and RMSE values are 4.110521 and 2.027442, respectively. The low values of MSE and RMSE and the high values of the correlation coefficients indicate the good performance of the model and show that ANN is an appropriate methodology for analyzing traffic accidents. The results in Table 5 showed that four important features, namely the degree of horizontal curvature, annual average daily traffic, the degree of vertical curvature, and section length, were very effective in predicting the number of traffic accidents. The output graph displays a line graph of the actual and network output values for the records displayed in the output table. The horizontal axis displays the row number of the input data set and the vertical axis displays the range of the output values. In addition, the ANN model trends were close to the actual values as shown in

CONCLUSION
In this study, the factors that cause accidents were investigated with the aim of improving road safety, and an accident prediction model that captures the relations between these factors was established. The database was formed from the geometric features of the highway sections and the traffic accident reports for the years 2005 to 2012. The data obtained from the database were investigated using ANN as a forecasting technique. ANN was selected for modelling the traffic accident data because it is a more flexible, assumption-free methodology that is capable of evaluating and comparing all of the traffic accident characteristics. The low values of MSE and RMSE and the high values of the correlation coefficient and R² indicate the good performance of the ANN model.
The model results indicate that the degree of vertical curvature, with the highest percentage (54.88%), is the most important parameter affecting the number of accidents on the highways. AADT and the degree of horizontal curvature have almost the same effect and together rank second. Section length ranks third, while highway location, years, traffic accidents that occurred in summer, and traffic accidents with heavy vehicles have a small effect on the output parameter. In fact, traffic accidents that occurred in summer are more important than those in other seasons and, in the same way, traffic accidents involving heavy vehicles are more important than those involving light vehicles.
These results have implications for policy makers, transportation system designers, and researchers. Transportation safety designers cannot simply identify individual factors, recommend incremental changes in those factors, and hope to achieve major differences in accident levels. The problems have to be analyzed and attacked from a multidimensional perspective that covers a wide variety of geometric and traffic characteristics. Researchers may similarly adopt techniques such as neural networks for the analysis of such variables. This paper aims to demonstrate the great potential of neural network analysis through an example of an accident prediction model in the field of transportation engineering.

Recommendations for future research
The modelling results obtained using the annual data (the years from 2005 to 2012) encourage further research with expanded data sets. By setting up some random variables in the design parameters, it may be possible to predict the number of accidents in future years. Future work might focus on how to improve the prediction performance of ANN models. It would also be interesting for future studies to predict the number of accidents on highways using a genetic algorithm and to see whether the prediction performance could be improved.

M. Y. Çodur, A. Tortum: An Artificial Neural Network Model for Highway Accident Prediction: A Case Study of Erzurum, Turkey

Figure 3 - The top five networks

Figure 4 - Relationship between estimated and actual values

Table 1 - A summary of input variables

Table 2 - The fitness criteria of the best five networks

Table 3 - Conclusion summary

Table 4 - Information criteria of the model