TRAFFIC SPEED PREDICTION FOR HIGHWAY OPERATIONS BASED ON A SYMBOLIC REGRESSION ALGORITHM

Due to the increase of congestion on highways, providing real-time information about the traffic state has become a crucial issue. Hence, it is the aim of this research to build an accurate traffic speed prediction model using symbolic regression to generate significant information for travellers. It is built based on genetic programming using Pareto front technique. With real world data from microwave sensor, the performance of the proposed model is compared with two other widely used models. The results indicate that the symbolic regression is the most accurate among these models. Especially, after an incident occurs, the performance of the proposed model is still the best which means it is robust and suitable to predict traffic state of highway under different conditions.


INTRODUCTION
According to the Urban Mobility Scorecard Annual Report in 2015 by INRIX, traffic congestion continues to be one of the biggest challenges in many major cities worldwide.For example, congestion caused each commuter in Washington DC to travel 82 hours more and burn 35 gallons of fuel extra which amounted to an average value of $1,834 per traveller.In order to reduce the congestion, in the last decades, a large number of roads were built.However, this approach did not give the expected results and did not entirely solve the problem.A better solution would be to fully use the existing roads and provide real-time traffic information to the travellers.With the useful information, travellers could choose when to start their trips and which routes to take the least times.For traffic information to have beneficial effects, short-term traffic speed prediction is important, because in the Advanced Traveler Information System (ATIS) the traffic state is represented by the speed.In fact, current traveller information activities are being hampered by the lack of capacity to predict future traffic state [1].
Formally, given a sequence of values {X 1 ,X 2 ,…,X t } of some observed traffic parameters (such as speed or occupancy) from sensors in the transportation network, the prediction problem is defined to predict the future values X t+i (i=1,2,…n) and the objective is to build a prediction model that can minimize the gaps between prediction values and actual values.Throughout the years, many researchers have developed a wide variety of traffic prediction models from different perspectives.In general, the models can be grouped into statistics-based and computational intelligence-based models [2].
In statistics-based models, the linear regression basically assumes a linear combination of covariates and has a simple structure that can be found in [3][4][5].In these studies, some techniques were implemented to enhance the accuracy such as stepwise method and varying coefficients.On the other hand, time-series approaches were widely used based on the works of Box and Jenkins who popularized a specific class of the method.They comprise combinations and generalizations of autoregressive moving average (ARMA) models [6].Starting from this work, different types of models originated including autoregressive integrated moving average (ARIMA) [1,7], seasonal ARIMA (SARI-MA) [8,9], vector ARMA (VARMA) [10,11], space-time ARIMA (STARIMA) [12], etc. Kalman filter is another statistical model that has been used to predict traffic parameters by various researchers [13][14][15].This approach uses time series methods predicting the future

TRAFFIC SPEED PREDICTION FOR HIGHWAY OPERATIONS BASED ON A SYMBOLIC REGRESSION ALGORITHM
egies, colony optimization, differential evolution, and many others [27].Among these techniques, genetic programming developed by Koza is the most popular because it is easy to understand and uses probabilistic selection rules, not the deterministic ones [28].
There are various successful applications of symbolic regression in prediction.For example, it was applied to predict wind energy production and got a reliable prediction for the renewable energy [29].It was also used to predict the oil production and validated it on both synthetic and real data, and made reliable results [30].A novel symbolic regression was introduced to predict elastic modulus of self-compacting concrete [27].In [31] it was used to predict syngas yield production.After comparison the results showed that multigene genetic programming is better than single-gene genetic programming.In the area of transportation, GP was used to build models for real-time crash prediction on highways [32].It was used to input the missing value in the intelligent transportation system [33].Also, GP was applied to evaluate the performance of the pavement, where it was used to develop models to predict pavement rutting [34].
To reduce the congestion caused by accidents, one can look for the influencing factors [35] and the severity of the damages.In this paper a different approach is taken.A task is set to build some models that can effectively utilize the huge amount of data and accurately predict the traffic state.We propose a novel symbolic regression method for traffic speed prediction which is built based on a popular evolution algorithm, genetic programming [28].The contributions of this study are: (1) the proposed model does not only make up for the disadvantages of both statistics-based models and computational intelligence-based models, but also makes full use of their advantages; (2) the proposed model is evaluated with the data from the real world and the sensitivity of the input is analysed.

METHODOLOGY 2.1 General description of SR
Let y be the vector of true values and y ˆ the vector of prediction values for the traffic data.The structure of the symbolic regression can be presented as in Figure 1: where for each id{1,…,G}: b 0 and b i are the parameters that can be estimated by the least squares method; t i is the (N•1) vector of outputs from the i-th GP tree; N is the sample size.
The symbolic regression shown in Figure 1 can be applied through the following four simple steps: (1) generate the initial random population; (2) evaluate the fitness of individuals; (3) conduct genetic operators; (4) select the best model among outputs.The detailed explanations of each procedure are given in the following: traffic parameter by continuously updating the selected state variables.With the transition equation, the selected state vector is developed based on the past and current observations along with an optimal estimation state vector [16].From the previous studies, it can be summarized that the statistics-based models have solid and widely accepted mathematical foundations that give them ability to provide more insight in the relationship among variables.However, its disadvantages are also obvious.The main drawback of this approach are the strong assumptions which are not practically relevant to the realistic traffic flow [17].
In recent years, the applications of computational intelligence-based models were the focal point in literature including Neural and Bayesian Networks [18,19], Support Vector Regression (SVR) [20], k-nearest neighbours [21,22], regression trees [23,24], etc.This type of models was considered as inevitable, particularly as most classical approaches have been shown inadequate under unstable traffic conditions and complex road settings when and where it is needed [25].Research has proven that computational intelligence-based models are statistically superior to linear statistical models because they are more flexible, more robust and more appropriate for capturing the complexity and uncertainty of traffic flow than traditional statistics-based models [17].However, despite the numerous research papers in traffic prediction that utilize computational intelligence-based models, researchers often used them blindly, ignoring some of their shortcomings.First of all, this type of models are "black-box" because they cannot provide specific algorithms and the relationship between variables is difficult to interpret.In addition, the over-fitting problem is quite common when applying computational intelligence-based models with complex structure to training set with relatively small sample size.Getting a nearly perfect performance for the training set can lead to a large performance drop for the testing set [26].
Symbolic regression (SR) uses a function discovery approach for modelling relationships in the dataset.Unlike traditional regression methods in statistics that fit parameters to an equation with a predefined form, symbolic regression tries to develop the algorithm by simultaneously searching both the parameters and the forms.As a result, the output form of a symbolic regression is a formula that is simple and interpretable.Moreover, it is not necessary to give a prior assumption between the explanatory variables and the response variable as well as on the model structure.Additionally, it is not limited by the sample size and the important variables are automatically selected into the model.In recent years, a variety of evolution algorithms have been used to identify symbolic regression including genetic programming (GP), evolution strat-

Complexity
The complexity of an expression is decided by its structure such as tree depth, tree nodes, component function nonlinearity and number of variables.F. Smits proposed using the sum of the complexity of the tree structure and all sub trees as a metric that can reflect the characteristic of the model [37].The criterion is defined as the total number of nodes in the structure and each sub tree.The less complex the expression is, of course, the better.Figure 3 shows the computing processes of two different simple structures with three nodes.

Figure 3 -Example of the computing processes
The complexity of expression xdΩ is computed by where p j (x) is the total number of nodes in the j-th sub tree of expression x, j=1,…,m.In Figure 3, the second structure is more complex with f 1 (x)=6.

Fitness
The fitness evaluates how good a model is.Basically, two types of measures, root mean squared error (RMSE) and the R 2 , are usually used during building the model.R is the correlation coefficient of observed values and predicted values for expression x: 1) Generate the initial random population A population of random individuals is generated using the input variables as well as a specified function set such as sin, tan, cos, log, or exp, and so on.The parameters are randomly generated from the predefined bounded interval [-10, 10] [36].In this stage, node functions, size of population, maximum depth of tree and maximum number of genes need to be set to control the generation.

2) Evaluate the fitness of individuals
Then each generation individual is evaluated by calculating the correlation coefficient R of observed values and predicted values and the bigger R corresponds to the better individual.

3) Conduct genetic operators
Once the fitness of individual is obtained, the genetic operations are conducted.The operators including crossover and mutation are shown in Figure 2. Next, the new individuals are sent to step 2. In this stage, the total number of the cycles needs to be set.

4) Select the best model among outputs
Finally, it returns a large number of models which differently describe the specifics of the datasets.In the last stage, it is a difficult task to select the best model while balancing a trade-off between performance and complexity.In this paper, we convert this task into a multi-objective optimization problem and solve it using Pareto front.The two objectives of the optimization problem are to minimize the complexity f 1 (x) and the fitness f 2 (x) of the expression xdΩ, where Ω is a feasible region, i.e. a set of all feasible solutions.The data used in this paper are extracted from the monitoring system of Xi-Cheng freeway, Jiangsu, China which is 22 km and has three lanes in each direction.The point in Figure 5 represents the position of a microwave sensor which is selected as the object of this study.

Microwave sensor
Figure 5 -Location of the microwave sensor (Source: Openstreetmap) In the monitoring system, the speed is measured every five minutes in each lane.To obtain the speed v of a road section, an ensemble average algorithm is presented as follows: where V 1 , V 2 , V 3 is speed in lane 1, lane 2 and lane 3, respectively; V 1 , V 2 , V 3 is volume in lane 1, lane 2 and lane 3, respectively.The final dataset includes 5 weekdays from June 8 th 2015 to June 12 th 2015.However, in the dataset, 30 records (2.08%) are missing.To minimize the effect of the missing values, they are filled by the average of values measured at the same time of the day from other weekdays.The summarization of the data are shown in Table 1.Full dataset is divided into two parts: a training dataset and a testing dataset.The training dataset, including the first four days, is used to build the :: where y i and y i ˆ are the i-th prediction value and measured value.y ¨ and y ˉ are the averages of prediction values and measured values, respectively.
When the measured value is low, RMSE always becomes sensitive, and therefore R 2 is selected in this study and the fitness is defined as f 2 (x)=1-R 2 for expression xdΩ, to qualify the accuracy of the model.

Pareto front
To find the balance of complexity and fitness, Pareto front is widely used in the multi-objective optimization.In the last stage of the symbolic regression, a large number of models is obtained.The Pareto front considers all models to be equally important and the aim is to identify the non-dominated ones which can be conducted after some definitions.

Definition of Pareto dominance
We define a binary relation of Pareto dominance and write x 1 x 2 saying that the solution (expression) x 1 dominates the solution (expression) x 2 if both following conditions are true:

Definition of Pareto optimality
If no other solution dominates x * , it is said to be Pareto optimal.Notice that it is not necessarily unique.In the mathematical form where Ω is a feasible region.

Definition of Pareto front
For a given multi-objective optimization problem, a Pareto front (frontier, boundary) P is a set consisting of all Pareto optimal vectors: The Pareto front is a set of non-dominated points.In other words, there are no other points better than the points of the Pareto front in both complexity and fitness.In particular, the point in the lower left of the Pareto front (high accuracy and low complexity) is the best solution to the multi-objective optimization problem.An example of the technique is presented in Figure 4.All models are plotted with 1-R 2 as the vertical axis, and the expressional complexity as the horizontal axis.The Pareto front is highlighted in green and the point with a red circle is the solution to the problem.to compare against SR for the following two reasons.First, ARIMA is a statistics-based model and SVR is a computational intelligence-based model.They can represent traffic prediction models from the main perspectives.Second, ARIMA and SVR were proven to be more accurate than other models among the above two types of models in previous studies.The SR model is detailed in section 3 because it is the focus of this paper.The formulations of ARIMA and SVR were not given because they were only used for comparison.We added references in which the readers can find the detail procedures of the ARIMA and SVR models used in this study [17].

Parameter selection
The parameters of ARIMA model are p, d, qp is the order of autoregression; d is the order of differentiation, and q represents the order of the moving average.They can be obtained by the autocorrelation function and partial autocorrelation (PACF) plots.In the SVR, radial basis function is selected as kernel and two parameters γ, C are set with grid search prediction models.The last day presents a testing dataset to evaluate the performance of the models with root mean square error (RMSE) and mean absolute percentage error (MAPE).RMSE and MAPE are accuracy measurements commonly used to evaluate traffic prediction models.They are defined in the following expressions: where v i is the measured value of the i-th time interval and v ˜i is the predicted value of the i-th time interval; N is the total number of the testing intervals.

Experiment design
Input is the data before the time when prediction starts and the length must be known in advance (as is shown in Figure 6).For ARIMA, its strength is to capture the temporal characteristic of the traffic speed, and the length is determined by parameters p and q.For SR and SVR, the input is set by the researchers.The length of input vector significantly affects the prediction accuracy.It is necessary to analyse the influence of the length of the input data.In the first experiment, SR and SVR are built with different vectors having 1, 2…, 10 lagged elements to study the relation between length of input and prediction accuracy.
In the second experiment, three different models are compared.ARIMA model is executed in R with the package 'forecast'.SVR is executed in R with the package 'e1071'.The code for the presented model SR is written in Matlab using the GPTIPS 2 tool box, which is available online [36].These two models are used  However, in some specific points the error of the three models is different.When the speed fluctuates within a small range but dramatically, the performance of the ARIMA appears to be worse shown by the green oval (Figure 8 a).SVR outperforms ARIMA and it seems to capture the sudden change of the speed (Figure 8 b).Obviously, SR cannot only capture the pattern in normal period but also performs well under incident (Figure 8 c).
To evaluate the performance of the models under different traffic conditions, the speed is divided into four regimes: 20-40 km/h, 40-60 km/h, 60-80 km/h and 80-100 km/h.For each regime, VAPE and RMSE are calculated (Table 3).The lowest RMSE and MAPE reflect that SR outperforms the ARIMA and SVR in all regimes, especially in 20-40 km/h and 40-60 km/h.
In Figure 8 SR line of the predicted values follows the measured values while two other models show more discrepancy and higher MAE after an incident occurs.

CONCLUSION
In this study, a simple and interpretable symbolic regression based on genetic programming is proposed to predict the traffic speed.Using real world dataset from Xi-Cheng freeway, Jiangsu, China, the model is compared with the traditional models SVR and ARIMA.The results show SR is the most accurate.It cannot only take advantage of the computational intelligence-based models but also it is interpretable.To evaluate the performance of the models in abnormal condition, the test day including an incident was used.The result shows that the presented model is good in predicting the traffic speed during all traffic states.In summary, the proposed traffic speed prediction model outperforms SVR and ARIMA.
In fact, traffic information retrieved by just one sensor, as considered in this paper, is not enough to accurately represent traffic conditions at an entire freeway segment, unless such segment is small enough to ensure homogenous characteristics of traffic and infrastructure.In the future, we will obtain data from lots of sensors and the spatial information of the traffic speed will be taken into consideration to enhance the accuracy.method.Coefficient C governs the relative importance of the penalty function; γ controls the width of the kernel function.Initially, parameters are set in a large interval and then, with k-fold cross-validation, the interval can be narrowed.In several cases, the best parameters are obtained.The parameters of SR are set considering the requirement of the prediction interval and to capture the non-linear pattern.Parameters of the algorithms are shown in Table 2.

Experiment 1: Analysis of the input length
The results of experiment 1 are presented in Figure 7.When the length of the input vector equals 6, the MAPE of SR is the lowest.The SVR gets its lowest MAPE as the length of the input vector is 5.At the beginning (length of input vector is less than 3), the MAPE of SR is higher than SVR.However, as the length increases after 3, the MAPE of SR becomes lower than the MAPE of SVR and both fluctuate in a narrow range.
From the time-consuming aspect, the training time and the testing time will increase with the increase of the input vector length.In practice, the prediction model of intelligent transportation system has strict requirements for its real-time.Accordingly, to consider the time-accuracy trade-off of the models and make a fair comparison, in the following section two models are trained with the same vector having 3 elements as input.

Figure 6 -
Figure 6 -Input of the models

Figure 7 -
Figure 7 -The influence of the length of input vector

ACKNOWLEDGEMENT 3 TimeFigure 8 -
Figure 8 -Comparison of the three different models

Table 1 -
Summarization of the initial data

Table 2 -
Parameters of the symbolic regression