PREDICTION OF FATAL AND MAJOR INJURIES OF DRIVERS, CYCLISTS, AND PEDESTRIANS IN COLLISIONS

Traffic-related deaths and severe injuries may affect every person on the roads, whether driving, cycling or walking. Toronto, the largest city in Canada and the fourth largest in North America, aims to eliminate traffic-related fatalities and serious injuries on city streets. The aim of this study is to build a prediction model using data analytics and machine learning techniques that learn from past patterns, providing additional data-driven decision support for strategic planning. A detailed exploratory analysis is presented, investigating the relationship between the variables and factors affecting collisions in Toronto. A learning-based model is proposed to predict the fatalities and severe injuries in traffic collisions through a comparison of two predictive models: Lasso Regression and Random Forest. Exploratory data analysis results reveal both spatio-temporal and behavioural patterns, such as the prevalence of collisions at intersections and in the spring and summer, and aggressive and inattentive driving behaviours. The prediction results show that the best predictor of injury severity for drivers, cyclists and pedestrians is Random Forest, with an accuracy of 0.80, 0.89, and 0.80, respectively. The proposed methods demonstrate the effectiveness of applying machine learning to traffic and collision data, both for exploratory and predictive analytics.


INTRODUCTION
The World Health Organization estimates that over 3,400 people die in traffic collisions on a daily basis, and tens of millions are injured and disabled on a yearly basis [1]. In 2016, Canada's collisions leading to personal injuries reached a total of 115,956, and collisions leading to fatalities reached 1,717 [2]. In that same year, Toronto traffic fatalities hit their highest number since 2002 [3]. As a result, collision prevention, analysis, and prediction have been crucial topics in the traffic and transportation discipline [4]. Collisions are studied from various angles, such as the development of Accident Prediction Models (APM), road safety measure assessment, user behaviour analysis and others [4]. Various initiatives have been developed in response to this issue. In Europe, the PRACT project (Predicting Road Accidents) was developed in 2013 with the purpose of building an accident prediction model framework applicable to different European roads and networks [4]. In 1997, Sweden launched the Vision Zero project [5], aimed at eliminating road fatalities and serious injuries. Many countries have adopted this project, such as Canada, Germany, the UK, the Netherlands, and the US [6]. In Canada, Vision Zero has been implemented in Edmonton [7], Vancouver [8], Ottawa [9], and Toronto [5,6]. Toronto, the largest city in Canada and the fourth largest city in North America, saw a recent increase in road fatalities [10]. The Toronto Vision Zero Safety Plan is a 5-year plan (2017-2021) aiming at identifying the factors contributing to this type of collision, with an ultimate goal of reducing collision fatalities and severe injuries to as close as possible to zero.
The goal in this study is to identify the patterns in Toronto severe and fatal collisions and to build a predictive model to estimate injury severity of individuals in a collision, that is, drivers, pedestrians and cyclists. This paper is organized as follows: in Section 2, the related literature is discussed, then in Sections 3 and 4, an overview of the dataset is presented and the methodology discussed. In Section 5, data mining is performed and rules and patterns in Toronto collisions are presented. Section 6 presents and discusses the results of the predictive models, performance and the variable implications in the models. The threats to validity are considered in Section 7 and the paper is concluded in Section 8.

BACKGROUND
Different types of research have been undertaken in relation to collision analysis and prevention. Outside Canada, many studies analyse the physical aspect of a collision, such as structure, weight, and velocity of a car with regard to cyclists' [11] and pedestrians' injuries [12]. Both studies proposed safety measures to dampen the severity of injuries resulting from such collisions. Additionally, children's injuries are analysed in the literature using data from China [13] and Norway [14]. Research in [14], for example, found that misuse of the seatbelt is a major contributor to injuries in child passengers. Drivers' characteristics and behaviour have also been extensively studied. The research in [15] demonstrates that attributes such as seatbelt misuse, speed higher than 111 km/h, female drivers and older drivers increase the probability of collision fatality. Similarly, studies on drivers' behaviour and personality traits reveal that impulsivity and aggressiveness, as well as driver fatigue, are significant contributors to traffic collision occurrence [16,17], and may lead to severe injuries [18].
Many studies use machine learning approaches to detect patterns and factors contributing to severe collisions. Research in [19] uses a decision-tree-based algorithm to extract rules from the Spanish rural highway dataset, whereas research in [20] performs an extensive analysis to explore the factors contributing to collision occurrences and constructs a Bayesian network to classify crash types. Other prediction models have been examined with regard to traffic collisions, such as artificial neural networks and support vector machines for predicting collision duration [21]; decision trees, Naïve Bayes, KNN and AdaBoost for predicting collision occurrence [22]; binary and skewed logistic regression [23], decision trees, multilayer perceptron [24], probabilistic neural net, Random Forests [25], and Bayesian networks [26] for predicting injury severity. Sensors and vehicle-to-vehicle communication [27], as well as genetic programming [28], are also investigated in the literature in the context of real-time collision prediction. Some studies have also taken a time series approach to analyse the fatalities in traffic collisions [29,30]. Real-time driving environmental data have been explored in [31], where data such as real-time traffic flow, weather, road design, and others were added to the Colorado State Patrol crash database. The authors in [31] build a crash prediction model using mixed logit models, and find that weather, road surface, and traffic conditions play an important role in crash prediction. Other studies focus on injury severity. In [32], the authors explore truck drivers' injury severity characteristics in single-vehicle and multi-vehicle accidents by building mixed logit models and evaluating the corresponding independent variables. Similarly, the authors in [33] analyse injury crashes versus non-injury crashes by building different spatio-temporal models and by evaluating the parameter estimates.
In Canada, a study was done in 2007, analysing the age and gender patterns in relation to collision injury, using the Canadian National Population Health Survey and Transport Canada data. It was found that injury rates between males and females are not significantly different; however, fatality rates in males are twice as high in Canada [34]. The children involved in collisions are also of interest in the Canadian research. Research [35] found that children in Canada are at a much higher risk of major injuries when involved in a back-over collision. The physical aspect of a collision is studied in a few cities in Canada, such as Edmonton [36] and Ottawa [37]. These studies analyse the proximity of two vehicles and its effect on the collisions.
In Toronto, a crash potential index (CPI)-based collision prediction model is built based on the proximity, velocity, and type of vehicles, using past collision data and data from loop detectors on the Gardiner Expressway [38]. Further, research [39] investigates pedestrians' injuries in collisions, but it is limited to collisions involving children and elderly pedestrians. Another study [40] focuses on cyclists' injuries and road type analysis in both Vancouver and Toronto. The injury severity prediction models in the Canadian collision datasets have not been discussed in the literature. This study aims to build models to predict the injury severity in collisions in the city of Toronto. For this type of problem, regression models are the most widely used algorithms, mainly logistic regression in a classification problem [20,23]. Due to the sparsity of the data, Lasso Regression is used to avoid overfitting and is compared with a tree-based model.

DATA DESCRIPTION
The KSI (Killed or Seriously Injured) dataset provided by the Toronto Police Service is used in this study. The dataset is now available at data.tps.on.ca as part of the Public Safety Data Portal [41]. It includes all traffic collision events in which at least one person was killed or seriously injured and covers the years from 2007 to 2017. The dataset includes 58 variables and 12,557 observations. The variables can be categorized into individual attributes and collision attributes. Individual attributes describe the characteristics and behaviour of each individual involved in the collision. Collision attributes describe the temporal, spatial and environmental conditions. Each row in the dataset represents an involvement type, that is, an individual involved in the collision, such as a driver or a pedestrian.
The focus is on the drivers, cyclists, and pedestrians. Drivers include automobile drivers, motorcycle drivers, or truck drivers. Cyclists include bicycle riders and moped drivers. Pedestrians include any pedestrian, in-line skater, or wheelchair user.
As for the variable selection, variables are chosen based on two selection measures: (1) a qualitative selection is performed to remove redundant variables; (2) a quantitative selection is performed using an analysis of Spearman correlation, in which highly correlated variables, including multi-collinear variables, are removed. Additionally, data engineering was performed to extract monthly information and to merge injury levels into two classes: 'fatal or major' and 'minimal, minor or none'. Prior to merging injury levels, there was a total of 542 fatal injuries, 3,598 major injuries, 465 minimal injuries, 566 minor injuries, and 3,751 none. Their definitions are as follows: (1) Fatal: person sustaining bodily injuries resulting in death. This includes only cases where death occurs in less than 366 days as a result of the collision (does not include death from natural causes such as heart attack, stroke, etc. or suicide). (2) Major: a non-fatal injury that is severe enough to require the injured person to be admitted to hospital, even if only for observation at the time of the collision (includes: fracture, internal injury, severe cuts, crushing, burns, concussions, severe general shocks). (3) Minor: a non-fatal injury requiring medical treatment at a hospital emergency room, but not requiring hospitalization of the involved person at the time of the collision. (4) Minimal: a non-fatal injury at the time of the collision, including minor abrasions, bruises, and complaints of pain, which does not require the injured person to go to hospital. (5) None: uninjured person.
The final dataset has 8,922 observations and 26 variables including both collision and individual related attributes (Table 1).
The data are divided into four subsets: collisions, drivers, cyclists, and pedestrians, each including their specific attributes. Each of the subsets is examined for missing values or data inconsistencies. Fifteen variables have some blank values and two variables have data inconsistencies. Inconsistent values are corrected accordingly, and each blank record is added to an existing or new category that is called either Other or Unknown. The final variables selected are described in Table 1, where S1-S5 represent spatial characteristics, E1-E3 represent environmental characteristics, T1-T3 represent temporal characteristics, and I1-I13 represent traffic participant characteristics, including age, actions, conditions, type of vehicle operated at the time of collision and injury levels. Only the two most frequent levels are reported due to space constraints.
An initial analysis was carried out and it was found that during the eleven years from 2007 to 2017, KSI collisions followed a general decreasing pattern, going from 453 collisions in the year 2007 down to 331 collisions in 2017, the lowest number of collisions since 2007 (Figure 1). Similarly, both fatal and major injuries, as well as minimal, minor and no injury instances were at their lowest in 2017, decreasing by 26% and 60%, respectively, since 2007. Meanwhile, the data obtained for all other collisions, including less serious collision types such as property-damage-only collisions, show an increase of 15% (Figure 1). In that same period, KSI collisions went down by 5%.
It was also found that the most frequent type of involvement were drivers, followed by pedestrians, cyclists, motorcycle drivers and truck drivers (sedan drivers) (Figure 2). Pedestrians, cyclists and motorcycle drivers are the most affected in collisions, with more than 90% of each of the above-mentioned involvement types having a major or fatal injury.

METHODOLOGY
We use the Apriori association rule technique to mine the dataset and uncover patterns and rules between the variables [42]. An association rule is of the form X ⇒ Y, where X = {x_1, x_2, …, x_n} and Y = {y_1, y_2, …, y_m} are two sets of mutually exclusive observations. For an association rule to be of interest, it must satisfy two interest measures: support and confidence. Support is an indication of how often an observation or a set of observations appears in the dataset and it equals P(X,Y). Confidence measures the strength of the rule and is equal to P(Y|X). A rule of the form {} ⇒ {Y} means that the observation in Y will appear with the probability given by the rule support (which equals confidence).
To predict the severity of an injury as one of the two classes, 'fatal or major' and 'minimal, minor or none', a classification approach was used.
In such a binary context, the authors in [43] stated one possible limitation related to endogeneity of the explanatory variables: "one potential concern […] is the possibility that the explanatory variables may be endogenous with respect to injury severity". The authors explained that a possible solution is to add more variables, which in turn can provide a better explanation of the overall picture. The authors gave the example of the airbag as an explanatory variable. They stated that drivers owning vehicles with airbags may also tend to be risk-averse [44]. As such, airbags can be coupled with a risk-averseness variable to avoid the endogeneity problem and over-estimation of the importance of the airbag variable.
In this study, the data used are very sparse, with high dimensionality. Due to sparsity, dimensionality reduction (quantitative and qualitative as mentioned in Section 3) was performed. Adding more variables will give our models a propensity to over-fit the data, resulting in inaccurate outcomes. Additionally, many of our variables inherently include other information, such as variable CYCLISTYPE, which has 22 levels (examples of this variable are found in Table 1). We wanted to keep the original levels for replicability purposes. Moreover, when the variable importance is discussed later, we list specific levels, and not just the variable itself, in order to distinguish the effect of different possible causes, and to avoid overestimating one particular variable. Two classification algorithms are used: Lasso Regression and Random Forest. Lasso Regression is a method for estimation in linear models that performs variable selection and regularization, which is an approach to fine-tuning model complexity. It is used to deal with the sparsity in our dataset. A sparse dataset implies high variance. As per the bias-variance trade-off, high variance in the dataset increases the model complexity and the mean squared error [42]. Lasso Regression adds a penalty (lambda) to the coefficients, and therefore reduces the model complexity [42]. Random Forest is a tree-based classifier that uses an impurity measure (Gini) to decide on the best split. Each variable is considered as a candidate for a splitting node. Splits are assessed and chosen using the Gini impurity measure. A split is pure if, after the split, all the instances in each branch belong to the same class [42]. A low Gini measure indicates that the split variable is important for data partitioning.
The main difference between the two models is how they deal with complexity and generalizability. In Random Forest, which is an ensemble method, complexity is decreased through the training process. In Lasso Regression, complexity is decreased through regularization, where an augmented error function is used [42].
The variable importance in each model is then analysed to detect which variables have the most weight on the models. For Lasso Regression, the coefficient t-test is used [42]. For Random Forest, the out-of-bag error is used [46]. The measures reported are scaled (0-100).
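The Gini impurity that Random Forest uses to assess candidate splits can be illustrated with a minimal sketch. The function names and the toy labels below are assumptions made for illustration only; they are not taken from the study's implementation or data.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum over classes of p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two branches of a binary split."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n


# Hypothetical toy labels: 'severe' = fatal or major, 'other' = minimal, minor or none.
parent = ["severe"] * 4 + ["other"] * 4
pure_split = (["severe"] * 4, ["other"] * 4)
mixed_split = (["severe", "severe", "other", "other"],
               ["severe", "severe", "other", "other"])

print(gini_impurity(parent))         # 0.5: classes are evenly mixed
print(split_impurity(*pure_split))   # 0.0: each branch holds a single class
print(split_impurity(*mixed_split))  # 0.5: the split does not separate the classes
```

A candidate variable whose split yields a lower weighted impurity separates the two injury classes better, which is the sense in which a low Gini measure marks an important split variable.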

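The support and confidence measures defined above for association rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names and the helper functions are hypothetical, and the sketch computes the measures for a given rule rather than running the full Apriori search.

```python
def count(records, items):
    """Number of records that contain every item in `items`."""
    items = set(items)
    return sum(1 for r in records if items <= r)


def support(records, antecedent, consequent):
    """P(X, Y): share of records containing both sides of the rule."""
    return count(records, set(antecedent) | set(consequent)) / len(records)


def confidence(records, antecedent, consequent):
    """P(Y | X): share of records with X that also contain Y."""
    n_x = count(records, antecedent)
    if n_x == 0:
        return 0.0
    return count(records, set(antecedent) | set(consequent)) / n_x


# Hypothetical collision records, each a set of attribute=value items.
records = [
    {"road=major_arterial", "loc=intersection"},
    {"road=major_arterial", "loc=intersection"},
    {"road=major_arterial", "loc=intersection"},
    {"road=major_arterial", "loc=mid_block"},
    {"road=local", "loc=intersection"},
]

x, y = {"road=major_arterial"}, {"loc=intersection"}
print(support(records, x, y))     # 0.6
print(confidence(records, x, y))  # 0.75

# For a rule with an empty antecedent, support and confidence coincide.
print(support(records, set(), y) == confidence(records, set(), y))  # True
```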
The advantage of Lasso Regression is its ability to take into account the correlation among the variables; its weakness, however, is that some features' coefficients can be reduced to 0 through regularization, so bias could be introduced in the model. Random Forest's advantage, on the other hand, is its ability to deal with complexity and generalization error. This is done by its training process, and also by pre-pruning the tree. Pre-pruning the tree ensures that a node is not split further if the number of observations reaching that node is smaller than a certain percentage of the training set [42].
For modelling, the dataset is divided into an 80% training set and a 20% test set; then a 10-fold cross validation is conducted on the training set. Because the dependent variable is imbalanced in each of the three subsets (as seen in Figure 2), it is treated using the Synthetic Minority Oversampling Technique (SMOTE) [45]. In the drivers' subset, 'fatal or major' instances are oversampled and 'minimal, minor or none' are undersampled. The opposite is done for pedestrians and cyclists.
To assess the performance of the proposed predictor, the performance measures for two-class problems (Table 2) are used. The number of true 'minimal, minor or none' estimations is denoted by TN, the number of false 'minimal, minor or none' estimations by FN, the number of false 'fatal or major' estimations by FP, and the number of true 'fatal or major' estimations by TP.
The accuracy measures the rate of correct estimations (Equation 1). The True Positive Rate is also known as sensitivity (Equation 2). The True Negative Rate is also known as specificity (Equation 3) [42].

DATA MINING
The Apriori algorithm was applied to the collisions subset. It was noticed that the majority of collisions occur on major arterials and/or at intersections (Rules 1, 2, Table 3). Within the collisions taking place on major arterials, 72% are located at intersections (Rule 3, Table 3).
Collisions in Toronto mostly occur in locations where there is a traffic signal or no traffic control at all. Ninety percent of collisions in Toronto took place in locations with either of those two traffic control characteristics (Rules 4, 5, Table 3).
It can be seen that the largest proportion of collisions happens under clear and dry conditions, and in daylight (Rules 6, 7, 8, Table 3). These three characteristics occur together 51% of the time (Rule 9, Table 3).
A related trend is noticed in the time patterns of collisions; that is, most collisions occur during the summer/spring season, seasons associated with dry and clear conditions. Additionally, one can see within the hourly patterns of collisions (Figure 3) that collisions peak between 4 p.m. and 7 p.m., a period usually associated with the end of a working day, and, in the summer and spring season, associated with daylight.

Figure 3 - Monthly and hourly collisions

To find the underlying cause of these time and location trends, the behavioural patterns within the most common collision dynamics are investigated. These are collisions between one driver and one pedestrian, which represent 40% of all collisions in the dataset (1,689 collisions), and collisions between two drivers, which represent 25% of collisions in the dataset (1,113 collisions).
In the one driver and one pedestrian collisions, intersections were found to be the most frequent collision locations (70% of all such collisions) (Rule 1, Table 4). It was also noticed that a collision that occurs while a pedestrian is crossing with right of way at an intersection is almost always associated with a driver failing to yield right of way; this happens 85% of the time (Rule 2, Table 4).
Failing to yield right of way is the most common aggressive driving behaviour. Aggressive driving is defined as any of the following actions [47]: exceeding speed limit, speeding too fast for the conditions, following too close, disobeying traffic control, failing to yield right of way, passing improperly. About one-third (31%) of the drivers in our dataset exhibited aggressive driving behaviour. Amongst these drivers, 55% failed to yield right of way, and 17% disobeyed traffic control; these are the two most common aggressive driving behaviours. The data show that failing to yield right of way is also a common action among inattentive drivers. The vast majority (85%) of inattentive drivers failed to yield right of way (Rule 3, Table 4). It was also observed that when a vehicle is turning right while a pedestrian is crossing with right of way, 88% of the time that driver failed to yield right of way while turning (Rule 4, Table 4). Similarly, when a driver is turning left, 78% of the time that driver failed to yield right of way (Rule 5, Table 4).
It can be seen that the turning left manoeuvre occurs in just over a quarter (27%) of the one driver and one pedestrian collisions (Rule 6, Table 4). In the one driver and one pedestrian collisions, the likelihood of a driver turning left given that a pedestrian is crossing with right of way at an intersection is 66% (Rule 7, Table 4).
In collisions between two drivers, 64% of the time, one driver simply goes ahead (Rule 1, Table 5). Whenever one driver fails to yield right of way on a left turn, the driver almost always collides with the driver going ahead (Rule 2, Table 5). It is also observed that when a driver makes an improper turn (Rule 3, Table 5), or turns left inattentively (Rule 4, Table 5), the other driver almost always goes ahead. However, it is also seen that a small portion of drivers going ahead disobey traffic control. Most of the time, these drivers collide with another driver who is, in turn, driving properly (Rule 5, Table 5). Similarly, drivers who follow too close or drive inattentively have a probability of 90% or more of colliding with a driver who drives properly (Rules 6, 7, Table 5).
In the first type of collision, which is one driver and one pedestrian collision, 1,695 individuals get fatally or majorly injured. In the second type, the collision between two drivers, there are 948 fatally or majorly injured. The patterns leading to major or fatal injuries amongst each subgroup of drivers, pedestrians and cyclists are analysed.
Amongst drivers, the majority of fatal or major injuries occur as a consequence of losing control of the vehicle, amounting to 422 collisions (Rule 1, Table 6), particularly on mid-blocks where there is no traffic control. In fact, losing control of a vehicle in such a location is associated with a 94% probability of fatal or major injury (Rule 2, Table 6).
On the other hand, the drivers' subset presents a new finding regarding motorcyclists. Motorcyclists have a 94% probability of a fatal or major injury (Rule 3, Table 6). More specifically, motorcyclists going ahead in an intersection where a traffic signal is located, and either on a major arterial or in normal condition, have a probability of 97% or more of a fatal or major injury (Rules 4, 5, 6, Table 6). Motorcyclists driving in Toronto East York during daylight also have a similar probability of fatal or major injury (Rule 7, Table 6).
It can be noticed that drivers with a medical or physical disability are also more prone to fatal or major injuries, particularly those driving an automobile or a station wagon (Rule 8, Table 6).
The risk of cyclists' fatal or major injury in the months of June and July exceeds 95% (Rules 1, 2, Table 7). As noted earlier, these months have a very high collision frequency (Figure 3).
Consistent with our previous findings, it was noticed that cyclists' fatal or major injuries occur primarily on major arterials or at intersections (Rules 3, 4, Table 7).
Many rules were found in which 100% of injuries were fatal or major. For example, all cyclists' collisions in ward 18 and ward 28 resulted in such severe injuries (Rules 5, 6, Table 7). Also, it appears […] Similarly, three types of collisions were detected that always result in a fatal or major injury. These are collisions that involve a cyclist and a driver travelling in the same direction where one vehicle sideswipes the other, a motorist turning left across the cyclist's path, and cyclists struck by an opened vehicle door (Rules 8, 9, 10, Table 7).
When it comes to pedestrians, we see that one-fourth of pedestrians are fatally or majorly injured on mid-blocks (Rule 1, Table 8), particularly on major arterials (Rule 2, Table 8).
Areas with no traffic control also result in a high probability of pedestrian fatal or major injury, particularly in cases where pedestrians are hit at mid-block or when pedestrians are crossing in areas with no traffic control (Rules 3, 4, Table 8).
Intersections are also risky areas when it comes to pedestrian injuries. In Toronto East York, for example, a vehicle turning left at an intersection while a pedestrian is crossing with right of way is associated with a 96% probability of fatal or major injury (Rule 5, Table 8). This finding is consistent with the rules discovered earlier regarding the one driver and one pedestrian collision type, where it was found that many drivers fail to yield right of way on a left turn.
It was noticed that pedestrians' injury level is affected by the weather. At an intersection, a rainy day and wet surface condition result in a major or fatal injury 95% of the time or more (Rules 6, 7, Table 8). In general, a wet road surface condition and a dark lighting condition (the time between sunset and sunrise) are associated with a 96% probability of major or fatal injury (Rule 8, Table 8).

RESULTS OF PREDICTION MODELS
The performance measures were used to assess how well each algorithm predicts the injury severity, and the analysis of variance to test for statistical difference between the models. The test showed that the two models are statistically different for each subset (p-value<0.05). Both Random Forest and logistic regression resulted in a good prediction with a minimum of 76% accuracy and maximum accuracy of 89%. However, it is observed that the Random Forest algorithm consistently generates higher overall accuracy for all the subsets (Table 9). Random Forest, as a non-linear model, uses the mean decrease Gini statistic as the basis for deciding on the splitting node. In this way, Random Forest captures the importance of each variable in classification.
To understand which variables affect the models the most, the top 20 most important variables in the models are listed.
Within the driver model, motorcycle has the most weight in both the Random Forest and Lasso Regression models. There exist other common variables between the two models; these are medical or physical driver disability, losing control of the vehicle, and failing to yield right of way (Figure 4).
As for cyclists and pedestrians, it can be seen that Random Forest captured behavioural variables, whereas Lasso Regression captured mostly locations and hours. The common variable between the two models in the cyclists' subset is age-related; it is cyclists aged 50 to 54. Within the pedestrian subset, there are no common variables (Figures 5 and 6).
It was also noticed that Random Forest captures many of the patterns presented in the data mining section. Within the drivers' subset, ten out of the 20 variables listed in Random Forest are discussed in Section 4, such as motorcycle, losing control of the vehicle and intersection. Logistic regression only includes three out of 20.
Within cyclists, eight out of the 20 variables listed in the Random Forest model are discussed, such as major arterial, intersection and motorist turning left across the cyclist's path. Logistic regression only includes three of the 20 variables listed.
Within pedestrians, logistic regression does not list any of the variables discussed in Section 4, whereas Random Forest lists seven, such as pedestrian hit at mid-block, crossing at no traffic control area and rain.
However, when analysing the 20 most important variables in logistic regression, it can be seen that many of the variables are associated with 100% fatal or major injury probability. For example, in the case of pedestrians, the following variables are associated with 100% fatal or major injury: ward 4, ward 12, ward 26, hour 5, snow, age 90 to 94, and a person getting on/off a vehicle. Another example is the wards discussed in the cyclists' association rules, where it was found that wards 28 and 18 are associated with 100% fatal or major injury. These instances, however, represent fewer than 35 cases within the pedestrians' and cyclists' subsets.
[…] generalizability. We observed that the Lasso Regression model gave much importance to features that are associated with 100% fatal or major injuries, such as specific wards or specific manoeuvres. For example, there are only three observations of the disabled manoeuvre in our drivers' subset, yet our Lasso Regression model considered this feature as one of the top five most important features; that is likely due to the fact that all three observations are associated with fatal or major injury. In that sense, we can say that Lasso Regression's feature importance selection is very precise in terms of selecting the features that best distinguish the fatal or major instances from the minimal, minor or none instances. However, overall, Random Forest generalizes better, with the most important features reflecting 'big patterns' in the dataset as highlighted in Section 5; that is due to the Random Forest training process.
Based on the findings in our data mining section and the prediction results section, the following summary conclusion is drawn: (a) The temporal and environmental characteristics of severe collisions can be summarized as follows: as shown in Figure 3, severe collisions in Toronto occur most frequently in the summer and spring, particularly in clear and dry conditions. Cyclists sustain major and fatal injuries particularly during the months of June and July, whereas pedestrians' risk of fatal or major injury increases in rainy conditions, on wet surfaces and in dark lighting, as shown in both the data mining section and the feature analysis in the prediction model section; (b) The spatial characteristics can be summarized as follows: in both data mining and the prediction models we see severe collisions recurring on major arterials and at intersections. Intersections are particularly high-risk locations for pedestrians. These, along with mid-block, traffic signal and no traffic control, represent the riskiest spatial features of severe collision occurrences for all the traffic participants (drivers, cyclists, pedestrians). Pedestrians are highly at risk of severe injuries in collisions taking place at mid-blocks and in no traffic control areas, whereas motorcyclists are at high risk of severe injuries at intersections where a traffic signal is present; (c) Behavioural characteristics, including drivers' actions and conditions, are summarized as follows: we see a recurrent pattern of aggressive and inattentive driving behaviours. In aggressive driving, the most common behaviour is failing to yield right of way, mostly at left turns, but also at right turns. Together with inattentive […]

THREATS TO VALIDITY
Internal Validity. The dataset under consideration in this study had some missing values. We were informed by the Toronto Police Service that there could be some cases where police officers may have skipped some items in the questionnaire, especially when the conditions were normal. The removal of the missing records or their imputation by mean or mode could cause concern for internal validity. In order to mitigate this effect, we performed a very detailed exploratory analysis and used information within the dataset to impute the missing values.
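The two treatments of blank values mentioned in this paper, assigning them to an Unknown category and imputing them by the mode, can be contrasted with a small sketch. The helper names and the toy column below are hypothetical and serve only to show why mode imputation can distort a categorical distribution; they do not reproduce the study's preprocessing.

```python
from collections import Counter


def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def fill_unknown(values, placeholder="Unknown"):
    """Keep missing entries visible by assigning them to their own category."""
    return [placeholder if is_blank(v) else v for v in values]


def impute_mode(values):
    """Replace missing entries with the most frequent observed value."""
    observed = [v for v in values if not is_blank(v)]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if is_blank(v) else v for v in values]


# Hypothetical weather column with two blank entries.
visibility = ["Clear", "", "Rain", "Clear", None]

print(fill_unknown(visibility))  # ['Clear', 'Unknown', 'Rain', 'Clear', 'Unknown']
print(impute_mode(visibility))   # ['Clear', 'Clear', 'Rain', 'Clear', 'Clear']
```

In the toy column, mode imputation raises the share of 'Clear' from two of three observed values to four of five, whereas the Unknown category leaves the observed distribution untouched.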
External Validity. To analyse the whole dataset without any sub-setting of drivers, cyclists and pedestrians could result in an external threat to validity since our model would not be generalizable. To ensure generalizability of our results, we ensured that each of the involved types was treated separately.
Construct Validity. In a binary classification setting, in our case fatal/major vs. non-fatal/minor outcome, a class imbalance affects the impact that a given exploratory variable has on the outcome, which can cause construct validity. To overcome this problem, we treated our data for imbalance prior to applying the models.
Statistical Conclusion Validity. The association rules have been qualitatively selected due to the high number of rules (exceeding 10,000). As such, the findings presented are not exhaustive of all possible rules. However, we ensured that the rules selected were based on both the highest support and confidence, and a lift greater than 1.

DISCUSSION AND CONCLUSION
This paper analyses and predicts the collision injury severity in Toronto using both data mining techniques (association rules), and classification algorithms (Lasso regression and Random Forest).
Severe collision prevention measures can be tackled by spreading more awareness among the drivers, pedestrians and cyclists. We found that drivers tend to get involved in severe collisions when the following characteristics are exhibited: aggressive driving, particularly failing to yield right of way and improper turns, and inattentiveness. We found that pedestrians are at a much higher risk of severe injuries when crossing at mid-blocks, whereas cyclists are at high risk of severe injuries particularly when colliding with motorcyclists.
The prediction of such injuries through Lasso regression model and a Random Forest tree-based model is promising. We found that Random Forest's accuracy consistently exceeded Lasso regression's accuracy for all three subsets: drivers, cyclists and pedestrians.
Moreover, we noticed that Random Forest was able to generalize better as observed in Section 6. As mentioned in Section 4, the two algorithms differ in how they deal with complexity and driving at intersections, these characteristics constitute the majority of severe collisions in Toronto. This is applicable to both collisions where one driver and one pedestrian are involved, and where two drivers are involved. Another aggressive driving behaviour appears in two drivers' collisions, that is, following too close and disobeying traffic control. Drivers also seem to be at high risk of severe injuries in collisions when they lose control of the vehicle or when they have a medical or physical disability. Although medical and physical disability observations are low in our data, we were informed by TPS that these may be much higher due to the fact that not all drivers disclose that information to the police officer. As for collisions where cyclists suffer major or fatal injuries, we noticed that these are mostly associated with the following actions and conditions: driver sideswipes cyclists while driving in the same direction, motorist turning left across the cyclist's path and cyclists struck by the opened vehicle door.
The goal of such a comprehensive study of the different risk factors affecting drivers, cyclists and pedestrians, including temporal, environmental, spatial and behavioural characteristics, is to highlight the features involved in severe collisions in order to facilitate decision making on effective traffic safety and injury prevention measures. These can be translated into decisions such as dispatching more officers on the roads given specific temporal, environmental and spatial characteristics, designing traffic safety campaigns run by the Toronto Police Service, including strategic messaging, and spreading more awareness about aggressive and inattentive driving.
Moving forward, we aim to include more datasets from the Toronto Police Service and the City of Toronto to make the results more generalizable.