EXTREME GRADIENT BOOSTING (XGBOOST) MODEL FOR VEHICLE TRAJECTORY PREDICTION IN CONNECTED AND AUTONOMOUS VEHICLE ENVIRONMENT

Connected and autonomous vehicles (CAVs) can receive information on their leading vehicles through multiple sensors and vehicle-to-vehicle (V2V) technology and then predict their future behaviour, thereby improving roadway safety and mobility. This study presents an innovative algorithm for connected and autonomous vehicles to determine their trajectory considering surrounding vehicles. For the first time, the XGBoost model is developed to predict the acceleration rate that the object vehicle should take based on the current status of both the object vehicle and its leading vehicle. Next Generation Simulation (NGSIM) datasets are utilised for training the proposed model. The XGBoost model is compared with the Intelligent Driver Model (IDM), which is a prior state-of-the-art model. Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) are applied to evaluate the two models. The results show that the XGBoost model outperforms the IDM in terms of prediction errors. The analysis of the feature importance reveals that the longitudinal position has the greatest influence on vehicle trajectory prediction results.


INTRODUCTION
The technology of connected and autonomous vehicles has developed rapidly in recent years. It is believed that CAVs could lead to a significant improvement of roadway safety and mobility. One important reason is that CAVs can receive information on their leading vehicles through multiple sensors and V2V technology. Thus, CAVs are able to predict their future behaviour accordingly.
The Intelligent Driver Model (IDM) is a traditional method to predict the acceleration rate for the object vehicle based on the current status of the object vehicle and its leading vehicle. The IDM is a widely used car-following model which utilises an intelligent braking strategy to transition vehicle behaviour between acceleration and deceleration and creates crash-free roadway dynamics [1]. However, when the gap between the two vehicles decreases significantly, the IDM generates a strong braking manoeuvre for the object vehicle, which is unrealistic and not possible in the real world. In this study, we propose the XGBoost model, a relatively new machine learning method, to predict the acceleration rate of the object vehicle.
Machine learning (ML) has proved to be an effective statistical tool to solve regression and classification problems [2]. ML technology can explore potential correlations between input features and output labels by learning from the training dataset without programming in advance. As one of the ML methods, the XGBoost algorithm has gained much popularity in machine learning competitions in Kaggle, which is an online community of data scientists and machine learners [3]. The XGBoost model can generate higher prediction accuracy while taking less processing time compared to other ML methods [4]. The XGBoost model is trained historical vehicle trajectory data is determined for safe and unsafe classification by stability analysis. Support Vector Machine can also predict the lane changing manoeuvre by considering the lateral position and the heading error of the vehicle [16].
Ju et al. [23] proposed a multi-layer architecture, Interaction-aware Kalman Neural Networks (IaK-NN), to analyse high-dimensional traffic environmental problems. Nikhil and Morris [24] predicted human trajectory based on a convolutional neural network (CNN). A CNN can extract existing human trajectory data and assign importance to various aspects of the data. It supports increased parallelism and temporal representation, and predicts human trajectory efficiently and precisely. Han et al. [25] employed a Support Vector Machine (SVM) algorithm to predict the action of the preceding vehicle. Artificial neural network approaches have also been proposed for trajectory prediction since the trajectory data can be viewed as a time series [26]. Xing et al. [27] presented a joint time-series model based on the Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) to predict the leading vehicle trajectory. RNNs have shown promising results in sequence modelling, such as natural language processing. LSTM is a particular implementation of RNN that can model long-term dependencies between input features. Therefore, LSTM is used to model vehicle manoeuvres and predict vehicle trajectory.
As a relatively new ML approach, XGBoost has been used by many researchers in various topics [28]. Parsa et al. [29] presented the XGBoost model to predict accidents and analyse the contributing factors. Dong et al. [30] used the XGBoost model to monitor structural health by predicting concrete electrical resistivity. Lim and Chi [31] used the XGBoost method to analyse the damage level of bridges. XGBoost is an improved gradient boosting decision tree (GBDT) model. Compared to GBDT and other methods, such as Random Forest, XGBoost uses a regularisation term to reduce the potential for overfitting. However, to the knowledge of the authors, the performance of the XGBoost model for vehicle trajectory prediction is still unknown. To fill this gap, the XGBoost model is developed in this study by formulating vehicle trajectory prediction as a regression problem and evaluated using the NGSIM US-101 dataset, which contains vehicle trajectories collected from Californian freeways [5]. The prediction results will be compared with the outcomes based on the IDM in terms of the RMSE and MAE. The feature importance will also be analysed to identify the critical attributes and their relative influence on vehicle trajectory prediction through the proposed XGBoost model.

RELATED RESEARCH
Various vehicle trajectory prediction techniques have been developed in existing studies. Motion models are traditional ways for trajectory prediction [6,7]. However, motion models are unreliable for non-linear problems, and vehicle trajectory is potentially non-linear. The Intelligent Driver Model is a previous state of the practice, which has been widely used in the microsimulation of car-following movements. As one of the deterministic models, the IDM produces unrealistic behaviour [8]. Recently, machine learning methods have become prevalent in trajectory prediction topics. ML methods can predict vehicle trajectory from the large amounts of traffic data made available by data acquisition technologies such as GPS and roadside cameras. Some ML approaches have already been applied for trajectory prediction, including hidden Markov models [9][10][11], Gaussian process regression models [12,13], Bayesian networks [14], Support Vector Machine [15][16][17], and Long Short-Term Memory [18][19][20][21][22].
Hidden Markov model-based trajectory prediction is able to describe the position and behaviour of vehicles in a network-constrained environment [9]. The hidden states and observation states were extracted from existing vehicle trajectory data and used to predict the optimal future trajectory accordingly. Gaussian Process Regression models are able to learn motion patterns from two-dimensional unlabelled trajectory patterns [12]. The captured spatio-temporal characteristics of traffic patterns are then used to predict the future trajectories of vehicles in the system. Bayesian networks can model high-level driving manoeuvres, which are inferred for each vehicle in the traffic scene via Bayesian inference [14]. Irrational driving behaviour can be detected and subsequently used to predict manoeuvre-based probabilistic trajectories. Support Vector Machine for regression is employed to achieve accurate and reliable vehicle positioning [15].

where $I_j = \{ i \mid q(x_i) = j \}$ is the set of data point indices belonging to the j-th leaf. Since the same score is assigned to all the data points on the same leaf, the index of the summation in the second line can be revised. The terms $g_i$ and $h_i$ denote the first and second derivatives of the loss function. Let $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$; then the final objective function is changed to a quadratic function as follows:

$$Obj^{(t)} = \sum_{j=1}^{T} \left[ G_j \omega_j + \frac{1}{2} (H_j + \lambda) \omega_j^2 \right] + \gamma T$$

Finally, the optimal solution of the optimised objective function can be generated:

$$\omega_j^* = -\frac{G_j}{H_j + \lambda}, \qquad Obj^* = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$

When using the XGBoost model for regression, each regression tree maps an input data point to one of its leaves that contains a continuous score. The training process is conducted by adding new trees that predict the residuals of prior trees and combining each new tree with the previous trees to make the final prediction. For a given status of the object vehicle and its leading vehicle, the current status is mapped to a leaf of the XGBoost model. The XGBoost model will find the best tree split that generates the minimum residual for the objective function. 
In doing so, the best prediction value of acceleration rate of the object vehicle could be calculated. Figure 1 shows the flowchart of the proposed XGBoost model.
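The leaf-score and split-selection steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a squared-error loss L = (y - ŷ)²/2 (the paper does not state which loss is used), for which the first derivative is ŷ - y and the second derivative is 1. All function names are hypothetical, and the split gain shown is the standard reduction in the optimal objective when one leaf is divided into two.

```python
from typing import List, Sequence, Tuple


def squared_error_grad_hess(
    y_true: Sequence[float], y_pred: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """First and second derivatives of L = (y - y_hat)^2 / 2 w.r.t. y_hat.

    The squared-error loss is an assumption made for this sketch.
    """
    g = [p - y for y, p in zip(y_true, y_pred)]
    h = [1.0] * len(g)
    return g, h


def optimal_leaf_weight(g: Sequence[float], h: Sequence[float], lam: float) -> float:
    """Optimal leaf score: -G / (H + lambda), with G and H summed over the leaf."""
    return -sum(g) / (sum(h) + lam)


def leaf_objective(g: Sequence[float], h: Sequence[float], lam: float) -> float:
    """Contribution of one leaf to the optimal objective: -G^2 / (2 (H + lambda))."""
    G, H = sum(g), sum(h)
    return -0.5 * G * G / (H + lam)


def split_gain(
    g_left: Sequence[float],
    h_left: Sequence[float],
    g_right: Sequence[float],
    h_right: Sequence[float],
    lam: float,
    gamma: float,
) -> float:
    """Objective reduction from splitting a leaf in two, minus gamma for the extra leaf."""
    parent = leaf_objective(list(g_left) + list(g_right), list(h_left) + list(h_right), lam)
    children = leaf_objective(g_left, h_left, lam) + leaf_objective(g_right, h_right, lam)
    return parent - children - gamma
```

With λ = 0 and this loss, the optimal leaf score reduces to the mean residual of the data points on the leaf, which matches the description of each new tree fitting the residuals of the previous trees.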

Intelligent Driver Model
The Intelligent Driver Model (IDM) produces better realism than most of the deterministic car-following models [33]. The basis of the IDM is to calculate the acceleration rate of the object vehicle by considering both the ratio of desired velocity versus actual velocity and the ratio of desired headway versus actual headway of the vehicle. The calculation of the acceleration rate is expressed as follows:

$$a = a_m \left[ 1 - \left( \frac{v}{v_0} \right)^{\delta} - \left( \frac{s^*(v, \Delta v)}{s} \right)^2 \right],$$

$$s^*(v, \Delta v) = s_0 + s_1 \sqrt{\frac{v}{v_0}} + T v + \frac{v \, \Delta v}{2 \sqrt{a_m b}}$$
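As an illustration of the IDM calculation, the following Python sketch evaluates the standard IDM acceleration from the speed, the speed difference, and the headway. The function name and the parameter values in `PLACEHOLDER_PARAMS` are hypothetical placeholders chosen only to make the example runnable; they are not the values of Table 1. The speed difference is taken here as the object vehicle speed minus the leading vehicle speed, which is an assumption about the sign convention.

```python
import math


def idm_acceleration(
    v: float,       # current speed of the object vehicle (m/s), must be >= 0
    dv: float,      # speed difference: object speed minus leader speed (m/s), assumed sign
    s: float,       # current headway to the leading vehicle (m), must be > 0
    *,
    a_max: float,   # maximum acceleration rate a_m (m/s^2)
    v0: float,      # desired speed (m/s)
    delta: float,   # acceleration exponent
    s0: float,      # linear jam distance (m)
    s1: float,      # non-linear jam distance (m)
    T: float,       # desired headway (s)
    b: float,       # comfortable deceleration rate (m/s^2)
) -> float:
    """Acceleration of the object vehicle under the standard IDM formulation."""
    if s <= 0:
        raise ValueError("headway s must be positive")
    # Desired minimum headway s*(v, dv).
    s_star = s0 + s1 * math.sqrt(v / v0) + T * v + v * dv / (2.0 * math.sqrt(a_max * b))
    # Free-road term minus interaction term.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)


# Placeholder values for illustration only; NOT the values of Table 1.
PLACEHOLDER_PARAMS = dict(a_max=1.0, v0=30.0, delta=4.0, s0=2.0, s1=0.0, T=1.5, b=2.0)

# A vehicle at 20 m/s closing on its leader at 5 m/s with a 10 m headway.
braking = idm_acceleration(20.0, 5.0, 10.0, **PLACEHOLDER_PARAMS)
```

Because the interaction term grows with the square of the ratio of desired to actual headway, small headways produce very large decelerations, which is the strong braking behaviour of the IDM noted in the Introduction.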

XGBoost algorithm
XGBoost is a prevalent boosting tree algorithm employed in industry because of its accuracy and high efficiency in prediction. XGBoost is developed from the GBDT algorithm and employed in classification and regression problems with multiple decision trees [32]. XGBoost can prevent over-fitting by regularising the objective function. The details of the model are illustrated as follows.
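A dataset is assumed as $D = \{(x_i, y_i)\}$ ($i = 1, 2, \ldots, n$), and the model has $K$ trees. The result ($\hat{y}_i$) of the model is expressed as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F,$$

where $F$ is the hypothesis space, and $f(x)$ denotes a regression tree:

$$F = \{ f(x) = \omega_{q(x)} \},$$

where $\omega_{q(x)}$ represents the score of the leaf node to which $x$ is assigned, and $q(x)$ is the index of that leaf node. When a new tree is developed to fit residual errors of the last tree, the predicted score for the t-th tree can be calculated as follows:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$$

The objective function is as follows:

$$Obj^{(t)} = \sum_{i=1}^{n} L\left(y_i, \hat{y}_i^{(t)}\right) + \sum_{k=1}^{t} \Omega(f_k),$$

where $L$ is a loss function, $\Omega$ is a penalising term to reduce the complexity of the model, and:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2,$$

where $\gamma$ is a parameter that represents the complexity of the leaf; $T$ denotes the number of the leaves; $\lambda$ is a parameter scaling the penalty; and $\omega$ is the vector of scores on each leaf. Unlike the general gradient boosting methods, XGBoost applies the second-order Taylor expansion to the loss function. The objective function is then simplified as follows, where $C$ collects the penalty terms of the earlier trees, which are constant at step $t$:

$$Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ L\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) + C$$

Then, after removing the constant terms, the final objective function can be calculated as follows:

$$\begin{aligned} Obj^{(t)} &= \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} \omega_j^2 \\ &= \sum_{j=1}^{T} \left[ \Big( \sum_{i \in I_j} g_i \Big) \omega_j + \frac{1}{2} \Big( \sum_{i \in I_j} h_i + \lambda \Big) \omega_j^2 \right] + \gamma T \end{aligned}$$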

Model comparison
The XGBoost model is compared to the GBDT model and the IDM regarding the prediction errors. Root mean square error (RMSE) and mean absolute error (MAE) are two of the most common metrics used to evaluate model performance. Both RMSE and MAE express the average model prediction error and are negatively-oriented scores, meaning that lower values are better. RMSE is the square root of the average of squared errors between predicted values and actual values and is calculated as:

$$RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( y_i^* - y_i \right)^2 }$$

MAE is calculated by averaging the absolute errors between predicted values and actual values:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i^* - y_i \right|$$

where $N$ is the number of data points; $y_i^*$ and $y_i$ represent the predicted and actual values.
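The two error metrics can be computed directly from paired predicted and actual values. The following is a small, dependency-free Python sketch; the function names are illustrative and the sample values are made up for the example.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root mean square error between predicted and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def mae(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean absolute error between predicted and actual values."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


# Made-up acceleration values: the errors are 3 and 4.
example_rmse = rmse([3.0, 5.0], [0.0, 1.0])
example_mae = mae([3.0, 5.0], [0.0, 1.0])
```

Because the errors are squared before averaging, RMSE weights large errors more heavily than MAE, so RMSE is never smaller than MAE on the same data.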

Dataset
In this study, the Next Generation Simulation

where: $a$ is the acceleration rate of the object vehicle; $a_m$ is the maximum acceleration rate; $v$ is the current speed of the object vehicle; $v_0$ is the desired speed; $\delta$ is the acceleration exponent; $s^*(v, \Delta v)$ is the desired minimum headway; $\Delta v$ is the speed difference between the object vehicle and the leading vehicle; $s$ is the current headway between the object vehicle and the leading vehicle; $s_0$ is the linear jam distance; $s_1$ is the non-linear jam distance; $T$ is the desired headway; and $b$ is the comfortable deceleration rate. Table 1 presents the values of all the parameters in the proposed IDM in this study. The parameters are adopted from one of the authors' previous studies [8].

Performance of the models
In this study, RMSE and MAE are employed to evaluate the prediction accuracy of the XGBoost model, the GBDT model, and the IDM. Table 2 shows the RMSE and MAE values for the proposed models. As we can see from the table, the RMSE and

prediction related studies [34][35][36][37][38]. More specifically, we consider a 15-minute segment of vehicle trajectories on the US-101 highway. Since different vehicle types have different car-following behaviour, only passenger cars are involved in the analysis. The time period is between 7:50 am and 8:05 am on 15 June 2005. In total, the selected dataset includes trajectories for 1,993 individual vehicles, recorded at 10 Hz. The information on the leading vehicles (including the location, speed, and acceleration rate) is extracted and attached to their following vehicles accordingly. As an example, one vehicle follows another vehicle for 243 seconds. Figure 2 shows an example of the speed difference between the leading vehicle and the following vehicle. Figure 3 shows the headway between the two vehicles. To evaluate the performance of the XGBoost model, 80% of the vehicles in the selected dataset are used as the training set and the remaining 20% are used in the testing phase.
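A vehicle-level 80/20 split such as the one described above can be sketched in Python as follows. The paper does not say how vehicles were assigned to each set, so the seeded random shuffle here is an assumption, and `split_vehicles` is a hypothetical helper name. Splitting by vehicle identifier rather than by record keeps all records of a given vehicle in the same set.

```python
import random
from typing import Iterable, Set, Tuple


def split_vehicles(
    vehicle_ids: Iterable[int], train_fraction: float = 0.8, seed: int = 0
) -> Tuple[Set[int], Set[int]]:
    """Randomly assign unique vehicle IDs to a training set and a testing set.

    The random assignment is an assumption of this sketch; the source only
    states the 80%/20% proportions.
    """
    ids = sorted(set(vehicle_ids))       # deduplicate, fix order for reproducibility
    rng = random.Random(seed)            # local generator, does not touch global state
    rng.shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return set(ids[:n_train]), set(ids[n_train:])


# 1,993 vehicles, as in the selected dataset; the IDs themselves are made up.
train_ids, test_ids = split_vehicles(range(1993))
```

Each trajectory record would then be routed to the training or testing set according to the set that contains its vehicle identifier.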

Feature Extraction
The NGSIM dataset provides vehicle speed, position, acceleration rate, and headway of each individual vehicle. In this study, the objective is to predict the acceleration rate for the object vehicle, which is the determining factor of vehicle trajectory. Under the CAV environment, the object vehicle can receive information from its leading vehicle. The acceleration rate of the object vehicle is then

CONCLUSIONS
This study presents an innovative algorithm for connected and autonomous vehicles to determine their trajectory considering surrounding vehicles. For the first time, the XGBoost model is developed to predict vehicle trajectories in a connected and autonomous vehicle environment. The proposed model is compared with an earlier machine learning model, GBDT, and the existing deterministic model, the IDM. The NGSIM dataset is utilised to train and test the proposed XGBoost model. The prediction results show that the proposed XGBoost model achieves higher prediction accuracy than the IDM while taking much less processing time than the GBDT. The longitudinal position of the object vehicle is the most important feature for predicting the vehicle trajectory. The results of this study could help guide machine learning approaches in the area of vehicle trajectory prediction. The proposed model can be extended by considering more surrounding vehicles while predicting vehicle trajectory. The proposed model could also be applied in other domains, such as driver behaviour prediction and pedestrian detection.
MAE of the XGBoost model are 3.9953 and 2.6950, respectively, which are similar to the errors of the GBDT (i.e., 3.9647 and 2.7146) and smaller than those of the IDM (i.e., 6.2748 and 4.7164). The execution time for the XGBoost model is 102.9 seconds, while the execution time for the GBDT is 3,733.1 seconds. This illustrates the superiority of the XGBoost model in the prediction of vehicle trajectory. Figure 4 shows the predicted and observed values over a prediction horizon of 30 seconds. As can be seen in the figure, the XGBoost model can effectively predict the acceleration rate of the object vehicle. The prediction results of the IDM are inferior to those of the XGBoost model. By comparing the prediction results, we can conclude that the XGBoost model is more reliable for vehicle trajectory prediction than the IDM.

Feature importance
To further explore the impact of each feature on the vehicle trajectory prediction, the relative importance of the eight input features in the XGBoost model is calculated. The feature importance is ranked based on the F score, which measures how frequently a variable is selected for splitting. A feature gets a higher score if it is used more frequently to make decisions in the decision trees. The importance ranking of the input features is displayed in Figure 5. As can be seen from

There are some limitations of this study. The dataset we use is from a freeway section, and lane change situations are not considered. Future research efforts will investigate other machine learning models to predict vehicle trajectory considering lane changing as well as different roadway scenarios. To get more accurate results from the IDM, a sensitivity analysis should be carried out for its parameters. The dataset in this study represents the congested traffic condition during the peak hour. The proposed model should also be tested under non-congested traffic conditions.