GPS DATA BASED NON-PARAMETRIC REGRESSION FOR PREDICTING TRAVEL TIMES IN URBAN TRAFFIC NETWORKS

A model for predicting travel times by mining spatiotemporal data acquired from vehicles equipped with Global Positioning System (GPS) receivers in urban traffic networks is presented. The proposed model, which uses k-nearest neighbour (kNN) non-parametric regression, is compared with models that use historical averages and the seasonal autoregressive integrated moving average (ARIMA) model. The main contribution is the provision of a methodology for mining GPS data that involves examining areas that cannot be covered with conventional fixed sensors. The work confirms that the method that predicts traffic conditions most accurately on motorways and highways (namely seasonal ARIMA) is not optimal for travel time prediction in the context of GPS data from urban travel networks. In all the examined cases, the kNN approach yields a mean absolute percentage error that is twice as good as that of ARIMA, while in some cases it even yields a mean absolute percentage error that is an order of magnitude better. The merit of the model is demonstrated using GPS data collected by vehicles travelling through the road network of the city of Zagreb. To evaluate the performance, the models' mean absolute percentage error, mean error, and root mean square error are calculated. A non-parametric ranked Friedman ANOVA is used to test groups of three or more models, and the Wilcoxon matched pairs test is used to test significance between two models. The alpha levels are adjusted using the Bonferroni correction. Today's commercial fastest-route guidance systems can readily incorporate the proposed model. Since the model yields travel times that are dependent on dynamic factors, these commercial systems can be made dynamic. Furthermore, the model can also be used to generate pre-trip information that will help users to save time.


INTRODUCTION
One of the main tasks in today's urban traffic control and route planning systems is to forecast various traffic conditions such as traffic flow, mean speed, and travel time. Travel time prediction has been recognised as one of the most valuable elements, especially for Advanced Traveller Information Systems (ATIS) and Advanced Traffic Management Systems (ATMS) in the context of intelligent transport systems. Since traffic conditions are significantly time-dependent, route guidance systems must be dynamic. For instance, the routes that have higher speed limits may not be the optimal choice during certain times of day such as rush hour. Dynamic guidance systems try to find the fastest route by using algorithms that generate a travel time that changes according to the trip start time. The most commonly used algorithms are modifications of Dijkstra's shortest-path algorithm [1,2].
There are many explanatory models that describe traffic conditions: DynaMIT [3], VISUM-online [4], Schreckenberg's cellular automata model [5], and Kerner's jam front propagation model [6]. In addition, there have been many attempts to estimate future conditions using data mining. Some are parametric linear and non-linear regression models [7][8][9][10], non-parametric regression models [11], ARIMA models [12], space-time ARIMA models [13][14][15], ATHENA models [16], Kalman filters [17], artificial neural networks [18][19][20][21][22], and support vector machines [23]. Emerging traffic data collection techniques make these extrapolation-based models easier to use. Older techniques, such as roadside sensors, cannot collect sufficient traffic data on spatially complex traffic networks due to coverage limitations. With the rise of GPS technology, vehicles travelling through road networks collect useful traffic data. Data mining can be used to predict future conditions. While significant work has been performed for motorways and highways, only limited work has been attempted for urban networks, where temporal dependence of travel time is more complex.
The main contributions considered are:
1. The travel times are predicted from GPS data. GPS data help applications cover spatially complex networks, which roadside sensors cannot do. This enables the exploration of large urban traffic networks.
2. A method for mining GPS data is developed. Since predicting travel times is not the primary motivation behind GPS technology, it is found that GPS data must be preprocessed before any travel time methods can be applied. The proposed preprocessing step involves map matching (that is, linking the GPS records with actual digital maps), temporal outlier detection (filtering records with unusually high travel times due to vehicles making stops), and reduction of travel time variability to ensure more accurate forecasts. The preprocessing step reduces travel time variability by using a non-equidistant aggregation interval approach.
3. The non-equidistant aggregation intervals approach is proposed as a novel way of handling the missing values. It enables the usage of GPS data even when coverage is low. This issue is critical when the data are collected from GPS transponders onboard delivery vehicles, as in the studied case.
4. Urban traffic networks are investigated. Urban traffic behaves differently from other types of traffic networks: specifically, travel times can show higher variability and chaotic behaviour [24]. GPS data enable the investigation of urban traffic networks, but both the GPS data and the nature of urban traffic networks (specifically, their volatility) introduce additional issues. The volatility of urban traffic and appropriate confidence bounds can significantly impact real-time traffic forecasting [25].
Three fundamentally different methods are used for travel time prediction: the historical averages method, the seasonal Autoregressive Integrated Moving Average (ARIMA) model, and the non-parametric k-nearest neighbour (kNN) model. The historical averages method is used as a baseline method and can be expected to produce the least accurate results. The seasonal ARIMA model is used because it is referred to in the literature as the most accurate approach. The non-parametric kNN model is used because it is expected to be appropriate for urban traffic networks: that is, it is expected to be able to capture the chaotic behaviour associated with travel times.
The most suitable method for analysing GPS data and urban traffic networks is identified by exploring the case study data. The forecast performance of the models is investigated by using the mean absolute percentage error, mean error, and root mean square error. Statistical significance between the models is determined by the mean rank from the Friedman ranked ANOVA for groups of more than two models, and by the Wilcoxon matched pairs test for groups of two models. Additionally, the Bonferroni correction is performed.
All the analysed data came from 297 courier service vehicles travelling during a period of approximately 6 months (from October 2005 to April 2006) on the roads of the city of Zagreb, the capital of Croatia.
The original reasons to collect the data were to track a fleet of courier service vehicles and to construct and update a digital road map. The motivation in this paper was to examine the possibility of using the data for another application (i.e., to predict travel times). Subsequently, the predicted travel time can be used with fastest-route guidance systems, either en route (during driving) or to confirm pre-trip information. Because of the original motivation for data collection, the sampling was defined spatially and not temporally. Specifically, sampling was not performed at constant time intervals, but rather using constant spatial intervals (100 metres).
Section 2 explains the data used for the analysis and describes the data preprocessing procedures. Section 3 gives theoretical foundations for the selected methods. Section 4 presents the experimental results.

GPS DATA AND SPATIO-TEMPORAL PREPROCESSING
A global positioning system is a positional and navigational system that can be used to determine the location (and speed) of any GPS receiver. GPS data have already been used to estimate traffic congestion [26], to record information about traffic delays and to use this data for traffic monitoring and route planning applications [27].
In the study, GPS receivers produce a tabular log of record time, speed, latitude, longitude, course and GPS status. The record time is the time at which the record is generated. Generally, it is expressed in coordinated universal time, UTC (i.e., as the number of seconds from 1 January 1970). The speed is the speed of the vehicle monitored by the GPS receiver in km/h. The latitude and longitude in the WGS84 geodetic system determine the location of the vehicle. The course is the angle at which the vehicle is travelling with reference to the North. Access to the GPS status, which indicates the data accuracy, is also available. A poor GPS status indicates records of questionable accuracy since they are generated from a small number of satellites or in the context of unsuitable satellite configurations. Each record also includes the identification number of the GPS receiver device. As each car has one receiver, this can also be interpreted as a vehicle identification number.
The devices in the vehicles were programmed to transmit information to the servers periodically. If the vehicle is moving, then the GPS device sends a position fix every 100 metres. If the vehicle is stationary, then the data are sent every five minutes. The initial amount of GPS data included 51,835,560 records.
The first step in data preprocessing is to eliminate records that have low GPS status. The second step is to map-match the positions, that is, to link records with the appropriate road segments.

Map matching
Due to the limited accuracy of GPS and constraints on GPS signal reception in the urban environment (for example, multipath signal bouncing), GPS data are normally associated with a measurement error [28]. Both surrounding objects (such as buildings) and atmospheric conditions influence this error.
Collected GPS data feature these measurement errors. Certain data points are off-road even though all vehicles travelled on the roads at all times. To accurately determine vehicle location, these data must be preprocessed to match the trajectory of the vehicle movement to the link that the vehicle travelled through. This technique is known as map matching [28].
The digital road network used is represented in the database by vectors. Onboard systems use information about road networks to map current vehicle positions onto appropriate road segments. These systems represent the vehicle trajectory as a sequence of historical positions. For real-time applications, the task of map matching can be quite time-consuming. Accordingly, in a trade-off between speed and accuracy, entire trajectories are not used, but instead only the most recent positions are used. In addition, if the onboard system is navigational as well as positional, the destination can be known in advance. This information can be used to ensure effective mapping.
Many map matching algorithms are currently in use (for more information, see [28]). The authors did not develop a map-matching algorithm, since one was already available with the data. In the experiments, a map-matching algorithm that was developed by the Mireo Company [29] was used. Originally it was used during the creation of digital maps from GPS data, yielding maps with ±5 metre accuracy in 95% of cases [29].
The most important step in the preprocessing stage is to identify the outliers. Outliers are observations that are numerically distant from the rest of the samples. In a sense, map matching can be said to be a process of identifying spatial outliers and correcting their values. After map matching has been done, temporal outliers must be removed. These outliers are sample travel time values that should not be used in the process of modelling.

Temporal outlier detection
Data used for the analysis were acquired by courier service vehicles that make frequent stops. For that reason, some of the sample travel times have disproportionately higher values than other samples obtained for the same link.
The values that do not follow the characteristic distribution of the data are referred to as outliers. Outliers are not necessarily error values: they can indicate unusual behaviour within the underlying process and highlight anomalies. Identifying outliers is one of the main challenges associated with data mining. In a modelling process, outliers can negatively impact the accuracy of the final model. Specifically, in a regression analysis, where the sums of the squares of the distances are minimised to form a model, outliers can significantly influence the regression line. Because of this, outliers must be detected before developing a model for travel time prediction.
One of the most widely used methods to detect outliers is the Box Plot technique, which has already been applied in travel time prediction [30][31]. Bajwa et al. used the technique on highway data (Tokyo Metropolitan Expressway) to reduce variability and achieve higher estimation accuracy. They used the 25th percentile as the lower quartile and the 75th percentile as the upper quartile, and the interquartile range to model lower and upper boundaries. They used 1.5 times the interquartile range to define lower and upper boundaries (that is, they used inner fences). For the experiments described in this paper, to make sure that the modelling stage receives more data, outer fences (that is, three times the interquartile range) are used. The boundaries are defined as:
lower boundary = lower quartile - 3 (upper quartile - lower quartile),
upper boundary = upper quartile + 3 (upper quartile - lower quartile),
where the lower quartile is the 25th percentile and the upper quartile is the 75th percentile.
Every sample time below the lower boundary or above the upper boundary is marked as an outlier and is excluded from our modelling of travel time.
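The outer-fence rule can be illustrated with a minimal Python sketch. The function names, the sample values and the choice of the "inclusive" quartile method are assumptions made for illustration; the text does not state how the percentiles are interpolated.

```python
from statistics import quantiles


def outer_fences(travel_times, multiplier=3.0):
    """Return (lower, upper) box-plot boundaries for one link's samples."""
    # quantiles(..., n=4) returns the three quartile cut points.
    # The 'inclusive' interpolation method is an assumption.
    q1, _, q3 = quantiles(travel_times, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def remove_temporal_outliers(travel_times, multiplier=3.0):
    """Keep only the samples that lie within the fences."""
    lower, upper = outer_fences(travel_times, multiplier)
    return [t for t in travel_times if lower <= t <= upper]


# Hypothetical travel times in seconds; the last one mimics a delivery stop.
samples = [12.0, 13.5, 11.8, 14.2, 12.9, 13.1, 95.0]
print(remove_temporal_outliers(samples))
```

Passing multiplier=1.5 gives the inner fences used by Bajwa et al., which would discard more samples than the outer fences chosen here.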

Reduction of travel time variability
There are two types of temporal travel time variability: short- and long-term variability. Short-term variability of vehicle travel time is the result of traffic signal phases in urban networks. Long-term variability is the result of evolving traffic patterns during the day (i.e., congestion). While preserving long-term variability is crucial, a reduction of short-term travel time variability plays a key role in our ability to accurately estimate travel time.
Torday and Dumont [32] have used microsimulations and the floating car data technique to show how to reduce short-term variability in urban networks using appropriate sub-link definitions. However, this approach is not used, because the amount of data does not allow it. Using simulations, they have also shown that minimising the aggregation interval reduces variability [32]. They suggested that the aggregation period should be a multiple of the duration of the traffic light signal switch periods.
As a result of the small sample size, it is not possible to model all the effects that may be present in analysed travel time data. Therefore, the main motivation is to bind the long-term variability caused by congestion. Luckily, if there are two intervals with the same duration but at different times of day, the one compiled during congestion will feature more samples. The other may not even have a single sample (e.g., on Sundays or during the night, when traffic moves more fluidly).
This reality inspired the use of a non-equidistant aggregation intervals approach. Experiments with different durations and placements of time intervals are performed, and finally, the following settings are selected. There are several intuitive reasons for such day-part divisions. Intuitively, during the night (i.e., from 20:00 to 06:00) there is no congestion and all the samples can be aggregated into a single value; from 06:00 to 10:00, when congestion is severe, aggregation intervals are set to 15 minutes; between 10:00 and 15:00, to take into account medium-term variability during the day, aggregation intervals of 1 hour are used; from 15:00 to 18:00 aggregation periods are again 15 minutes; and finally from 18:00 to 20:00 aggregation is performed in 1-hour intervals. Since there are very few data for Saturdays and Sundays (the courier service makes few weekend deliveries), a 24-hour aggregation interval is used. In cases when there was an aggregation interval that contained zero samples, the travel time duration was set to equal the median travel time of the corresponding link.
Since the investigated regression methods require time series with a fixed time step, all the aggregation intervals are divided into equally sized segments, i.e., the size of the shortest time interval (15 minutes). For instance, if the duration of a certain aggregation interval is one hour, it is broken into four 15-minute intervals for which the time series value is the same as for the original aggregation interval.
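The day-part settings listed above can be expressed as a small lookup. The following Python sketch is illustrative only: the function name and the DAY_PARTS table are hypothetical, and the way the night interval is anchored across midnight (and across the weekday/weekend boundary) is an assumption, since the text does not specify it.

```python
from datetime import datetime, timedelta

# Weekday day-parts from the text: (start hour, end hour, interval in minutes).
DAY_PARTS = [(6, 10, 15), (10, 15, 60), (15, 18, 15), (18, 20, 60)]


def aggregation_interval(ts):
    """Return (start, end) of the aggregation interval that contains ts."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if ts.weekday() >= 5:
        # Saturday or Sunday: a single 24-hour interval.
        return midnight, midnight + timedelta(days=1)
    for start_h, end_h, minutes in DAY_PARTS:
        if start_h <= ts.hour < end_h:
            # Round the minute-of-day down to a multiple of the interval.
            offset = (ts.hour * 60 + ts.minute) // minutes * minutes
            start = midnight + timedelta(minutes=offset)
            return start, start + timedelta(minutes=minutes)
    # Night (20:00 to 06:00) is one interval. Assumption: it is anchored to
    # the evening on which it starts and runs into the next calendar day.
    if ts.hour >= 20:
        start = midnight + timedelta(hours=20)
    else:
        start = midnight - timedelta(hours=4)
    return start, start + timedelta(hours=10)


# 3 October 2005 was a Monday, within the data collection period.
print(aggregation_interval(datetime(2005, 10, 3, 7, 20)))
```

All sample travel times whose timestamps map to the same interval would then be averaged into one value for that link.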
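A minimal Python sketch of this resampling step is given below. The function name and the input layout (a list of interval durations with their aggregated values) are assumptions; empty intervals, marked here with None, are filled with the median travel time of the link as described in the text.

```python
from statistics import median

STEP_MINUTES = 15  # length of the shortest aggregation interval


def to_fixed_step(intervals, link_samples):
    """Expand (duration_minutes, value) pairs into a 15-minute series.

    A value of None marks an interval without samples; it is replaced
    by the median of all sample travel times observed on the link.
    """
    fallback = median(link_samples)
    series = []
    for duration, value in intervals:
        if duration % STEP_MINUTES != 0:
            raise ValueError("interval length must be a multiple of the step")
        filled = fallback if value is None else value
        # Repeat the aggregated value once per 15-minute step.
        series.extend([filled] * (duration // STEP_MINUTES))
    return series


# Hypothetical values: one 1-hour interval, one 15-minute interval, one empty.
print(to_fixed_step([(60, 42.0), (15, 38.0), (15, None)], [30.0, 40.0, 50.0]))
```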

TRAVEL TIME PREDICTION METHODS
Although there are a number of methods to predict traffic conditions, most of the work has been performed on data collected using roadside sensors [11,30,31]. Additionally, most such research is concerned with motorway traffic and not with urban traffic conditions. Given the nature of urban networks and the reality that all of the analysed travel time data were collected by vehicles equipped with GPS receivers, three methods to predict travel times are selected: the historical averages method, the seasonal ARIMA model and the non-parametric kNN model.
The historical averages method is used as a baseline method. It is used because of the issues with sample size. It is challenging to acquire large sample sizes for every aggregation period, for each past date and for every link in a large-scale urban network.
Since the literature states that the seasonal ARIMA model outperforms neural networks and kNN non-parametric regression [11,12], it is included among the implemented models. On the other hand, it is questionable whether the seasonal ARIMA model would give satisfactory results, given that it models processes that are non-deterministic with linear state transitions. Disbro and Frame [24] showed that traffic flow behaves chaotically, especially in cases frequently found in urban traffic networks (i.e., during congestion periods). Chaotic systems are described by processes that are deterministic and feature non-linear state transitions, and this motivated the use of the non-parametric kNN model.

Historical averages method
The historical averages method is a very simple model in which every weekday is one case that can be described by a series of aggregated values. Figure 1 shows a graph in which every weekday is described by an appropriate time series.
Time series are stored as records in the database. Each record has a weekday attribute defining the day of the week, a timestamp defining 15-minute periods during the day, and a duration giving the average travel time for a given link. The user supplies the values for the required link identifier, weekday and time of day.
The returned value denotes the predicted link travel time.
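The lookup described above can be sketched in Python as follows. The class name, the in-memory dictionary that stands in for the database, the slot index (0 to 95 for the 15-minute periods of a day) and the sample values are all assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean


class HistoricalAverages:
    """Average travel time per (link, weekday, 15-minute slot)."""

    def __init__(self):
        self._samples = defaultdict(list)

    def add(self, link_id, weekday, slot, travel_time):
        """Store one observed travel time (slot is 0..95)."""
        self._samples[(link_id, weekday, slot)].append(travel_time)

    def predict(self, link_id, weekday, slot):
        """Return the historical average for the requested key."""
        values = self._samples.get((link_id, weekday, slot))
        if not values:
            raise KeyError("no historical data for this link, weekday and slot")
        return mean(values)


model = HistoricalAverages()
# Link 2619, Mondays (weekday 0), slot 30 = 07:30 to 07:45; values are made up.
for seconds in (40.0, 44.0, 48.0):
    model.add(2619, 0, 30, seconds)
print(model.predict(2619, 0, 30))  # 44.0
```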

Seasonal ARIMA model
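The seasonal ARIMA model was proposed by Box and Jenkins [33,34] for analysing time series $\{X_t\}$. In order to define a seasonal ARIMA process formally, the backshift operator $B$ with order of differencing $j$ is used to transform the time series:

$$B^j X_t = X_{t-j}. \qquad (1)$$

The seasonal differencing with seasonal period $s$ is given as $(1 - B^s) X_t = X_t - X_{t-s}$. The seasonal differencing with seasonal period $s$ and order of seasonal differencing $D$ is written as $(1 - B^s)^D X_t$. In terms of the backshift operator, the non-seasonal differencing is defined in a similar manner as $(1 - B) X_t = X_t - X_{t-1}$, and the non-seasonal differencing of order $d$ as $(1 - B)^d X_t$. The seasonal ARIMA$(p,d,q)(P,D,Q)_s$ process is then defined by

$$\varphi(B)\,\Phi(B^s)\,(1 - B)^d (1 - B^s)^D X_t = \theta(B)\,\Theta(B^s)\, e_t. \qquad (2)$$

The backshift operator $B$ is given by Eq. (1). The functions $\varphi$, $\Phi$, $\theta$, and $\Theta$ are polynomials defined as

$$\varphi(z) = 1 - \sum_{i=1}^{p} \varphi_i z^i, \quad \Phi(z) = 1 - \sum_{i=1}^{P} \Phi_i z^i, \quad \theta(z) = 1 + \sum_{i=1}^{q} \theta_i z^i, \quad \Theta(z) = 1 + \sum_{i=1}^{Q} \Theta_i z^i,$$

respectively. The coefficients $\varphi_i$, $\Phi_i$, $\theta_i$, and $\Theta_i$ are unknowns and have to be found. The time series $\{e_t\}$ corresponds to errors, known as white noise, that can be found from the standard ARMA$(p,q)$ model

$$X_t = e_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{i=1}^{q} \theta_i e_{t-i}.$$

Additionally, $p$ is the non-seasonal autoregressive order, $q$ is the non-seasonal moving average order, $P$ is the seasonal autoregressive order and $Q$ is the seasonal moving average order. Similarly, $d$ is the order of non-seasonal differencing and $D$ is the order of seasonal differencing.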
Background on the theoretical seasonal ARIMA process and its usage in forecasting traffic conditions is also given by Williams and Hoel [35] and by Smith et al. [11].
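The differencing operators used in the definition above can be illustrated with a short Python sketch. The function name and the toy series with seasonal period 4 are assumptions chosen for readability; for the 15-minute travel time series, a weekly seasonal period corresponds to a lag of 672.

```python
def difference(series, lag=1):
    """Apply the operator (1 - B^lag): x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# Toy series: a repeating pattern of period 4 plus a slow upward drift.
toy = [10, 20, 30, 40, 11, 21, 31, 41, 12, 22, 32, 42]

seasonal = difference(toy, lag=4)       # (1 - B^4) X_t removes the pattern
print(seasonal)                         # [1, 1, 1, 1, 1, 1, 1, 1]
print(difference(seasonal, lag=1))      # (1 - B)(1 - B^4) X_t removes the drift
```

Each application of the operator shortens the series by the lag, which is why a long seasonal period requires several seasons of history before a model can be fitted.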

kNN non-parametric regression
In the previous section, a linear parametric model (ARIMA) was introduced. There, the main idea was to form a model that could satisfactorily approximate the entire instance space. On the other hand, in instance-based learning, as represented by the kNN model, only a local approximation of the target function that applies in the neighbourhood of the new forecast instance needs to be constructed [36]. For that reason, there are no restrictions on the data being modelled (specifically, there is no requirement for stationarity, unlike in the ARIMA model). The model consists of past (historical) values that are stored and subsequently used to determine the values for new instances.
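An arbitrary instance $x$ is represented by attributes (or features) denoted as $a_i$, $i = 1, 2, \ldots, n$, and its feature vector or state space is $[a_1, a_2, a_3, \ldots, a_n]$. The observed instance can then be viewed as a point in an n-dimensional space represented by the values of the attributes $(a_1(x), a_2(x), \ldots, a_n(x))$. Each training sample $y_i$ has a known target function value $f(y_i)$ and can be written as the pair $\langle y_i, f(y_i) \rangle$. The forecast for a new instance $x$ is formed from the target values of its $k$ nearest training samples $y_1, \ldots, y_k$ by a function $h$:

$$\hat{f}(x) = h\big(f(y_1), \ldots, f(y_k)\big). \qquad (5)$$

If $h$ is a simple average function then the regression problem forecasting is given by Eq. (6):

$$\hat{f}(x) = \frac{1}{k} \sum_{i=1}^{k} f(y_i). \qquad (6)$$

Alternatively, $h$ can weight the contribution of each neighbour by a weight $w_i$. In this case, the regression problem forecasting is given by Eq. (7):

$$\hat{f}(x) = \frac{\sum_{i=1}^{k} w_i f(y_i)}{\sum_{i=1}^{k} w_i}. \qquad (7)$$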
The distance between the test point $x$ and the training sample $y_i$ is determined by the standard Euclidean distance

$$d(x, y_i) = \sqrt{\sum_{j=1}^{n} \big(a_j(x) - a_j(y_i)\big)^2}.$$

Distance metrics can also be weighted in such a way that some features contribute more or less to the overall distance. There is an infinite number of distance metrics, and the standard Euclidean distance is chosen as the measure of the distance between the instances for the purpose of forecasting travel time.
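The kNN regression described above can be sketched in a few lines of Python. The function names, the data layout (a list of state vector and target pairs) and the handling of zero distances in the weighted variant are assumptions made for this illustration, not the implementation used in the study.

```python
import math


def euclidean(x, y):
    """Standard Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_forecast(query, training, k, weighted=False):
    """Forecast the target for query from its k nearest training samples.

    training is a list of (state_vector, target_value) pairs.
    """
    scored = sorted(
        ((euclidean(query, vector), target) for vector, target in training),
        key=lambda pair: pair[0],
    )[:k]
    if not weighted:
        return sum(target for _, target in scored) / len(scored)
    # Assumption: if any neighbour coincides with the query, average those
    # exact matches to avoid dividing by a zero distance.
    exact = [target for dist, target in scored if dist == 0.0]
    if exact:
        return sum(exact) / len(exact)
    weights = [1.0 / dist for dist, _ in scored]
    total = sum(w * target for w, (_, target) in zip(weights, scored))
    return total / sum(weights)


# Hypothetical one-dimensional training data: (state, next travel time in s).
history = [([0.0], 10.0), ([1.0], 20.0), ([3.0], 40.0)]
print(knn_forecast([0.5], history, k=2))                  # 15.0
print(knn_forecast([0.25], history, k=2, weighted=True))
```

A linear scan over all stored samples is used here for clarity; for large histories a spatial index would be a natural replacement.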
Different state vectors can be used for the kNN regression. More precisely, there is an infinite number of possible state vectors. The most reasonable features to be used are present and time-lagged values of the time series, $x(t) = [V(t), V(t-1), \ldots, V(t-d)]$, where $d$ is the selected lag. However, in forecasting traffic flow, Smith et al. [11] have shown that using past average values yields more accurate forecasts. They used a hybrid model $x(t) = [V(t), V(t-1), V(t-2), V_{hist}(t), V_{hist}(t+1)]$. If their traffic flow is considered in the context of travel time, then $V(t)$ is the travel time at the present interval and $V_{hist}(t)$ and $V_{hist}(t+1)$ are the historical average travel times for the weekday and the time of day associated with times $t$ and $t+1$. There is a sound justification for the use of past average values. The attractor of the chaotic system is the value to which the system settles when time approaches infinity. The kNN approach tries to rebuild the attractor of the process that generates the time series [37], and the average of past values puts each instance on the cyclic pattern of the attractor. Various state spaces are investigated, and Section 4 shows the results. The mean absolute percentage error (MAPE; for the formal definition see Section 4) is used to determine which state space should be used, that is, the state space that produces the lowest MAPE for forecasting purposes.
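The two kinds of state vector discussed above can be built as in the following Python sketch. The function names are hypothetical, and it is assumed that the historical average series is indexed in step with the observed series, so that hist[t] is the historical average for the weekday and time of day of interval t.

```python
def lagged_state(series, t, lag):
    """Return [V(t), V(t-1), ..., V(t-lag)] for the observed series."""
    if t < lag or t >= len(series):
        raise ValueError("not enough history for the requested lag")
    return [series[t - j] for j in range(lag + 1)]


def hybrid_state(series, hist, t, lag):
    """Lagged values followed by the historical averages for t and t+1."""
    if t + 1 >= len(hist):
        raise ValueError("historical average for t+1 is not available")
    return lagged_state(series, t, lag) + [hist[t], hist[t + 1]]


observed = [31.0, 33.0, 36.0, 40.0, 38.0]     # made-up travel times in seconds
historical = [30.0, 32.0, 35.0, 39.0, 37.0]   # made-up historical averages
print(lagged_state(observed, 3, 2))           # [40.0, 36.0, 33.0]
print(hybrid_state(observed, historical, 3, 1))  # [40.0, 36.0, 39.0, 37.0]
```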
The required number of neighbours k must be determined experimentally. This is done by determining the MAPE for models with different numbers of neighbours and selecting the one with the lowest value for forecasts. A small number of neighbours could have too much variance and could result in loss of generality, while too large a number of neighbours could introduce too much bias into the forecast [38].
In the context of regression analyses, there is an infinite number of possible forecast estimations. The most common ones are straight averages (Eq. (8)) and averages that are weighted by the inverse of the distance (Eq. (9)). Other forecasts include heuristics to assure more accurate estimates. While forecasting traffic flow, Smith et al. [11] obtained the best kNN forecasts with the hybrid approach, which adjusts by both $V_{hist}(t)$ and $V_{hist}(t+1)$, and weights by the inverse of distances (Eq. (10)). Again, to find the most accurate forecast estimation, MAPE is used.
In Eqs. (8)-(10), $k$ is the number of nearest neighbours and $\hat{V}(t+1)$ is the forecast time series value at time $t+1$ (corresponding to the forecast value introduced in Eq. (5)).

To develop the model and to test its performance, 20 random links out of the 100 links with the greatest number of matched records are selected. Table 1 gives descriptive statistics for the links used to build the models. Furthermore, Section 4.6 presents results for four additional links used to illustrate the evaluation process for the model.

Forecast performance measures
The measures used for the model's forecast performance are the mean absolute percentage error (MAPE), the mean error (ME), and the root mean squared error (RMSE):

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right|, \qquad \mathrm{ME} = \frac{1}{n} \sum_{i=1}^{n} (A_i - F_i), \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (A_i - F_i)^2},$$

where $n$ is the number of samples, $A_i$ is the known (observed) value of the i-th sample, and $F_i$ is the forecast value of the i-th sample. MAPE is used to estimate the size of the forecasting error, ME is used to determine whether the forecasts are biased, and RMSE is used to determine whether the error distribution features outliers. Although MAPE gives guidance as to which method might be better, it does not offer any statistical confidence. For that, the non-parametric Friedman ranked ANOVA [34] tests whether there is a significant difference in absolute percentage errors between the methods. For every forecast point and for every method, the absolute percentage error is calculated. The H0 hypothesis is that the medians of the errors for all the methods are equal. If the p-value is small enough (below α = 0.05 in all cases), then there is evidence that the H0 hypothesis can be rejected. Similarly, to test the difference between two methods, the Wilcoxon matched pairs test is performed on the absolute percentage errors. Additionally, the Bonferroni correction adjusts the alpha values.
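The three performance measures can be computed as in the following Python sketch. The function names and sample values are hypothetical. MAPE is returned as a fraction, in line with the values reported in this section, and the sign convention for ME (observed minus forecast) is an assumption.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, expressed as a fraction."""
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def mean_error(actual, forecast):
    """Mean error; assumption: observed minus forecast."""
    return sum(a - f for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error, which penalises large individual errors."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


observed = [100.0, 200.0]   # made-up travel times in seconds
predicted = [110.0, 180.0]
print(mape(observed, predicted), mean_error(observed, predicted), rmse(observed, predicted))
```

The per-sample terms inside mape are the absolute percentage errors on which the Friedman and Wilcoxon tests operate.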

Historical averages results
Table 2 gives the results from the historical averages model. For all 20 random links, the mean MAPE equals 0.1738, which is relatively good, but the maximum MAPE of 0.3409 suggests that for certain links, this method performs unsatisfactorily. Maximum values of ME (9.0364 s) and RMSE (26.3311 s) show that for some links this method is both biased and sensitive to extreme values. Overall, the historical averages model results show that there are some effects that cannot be modelled as time-of-day and day-of-the-week dependencies.

Seasonal ARIMA results
Using the Box and Jenkins procedure [33,34], travel time is determined to be an ARIMA$(1,0,1)(0,1,1)_{672}$ process that matches the results obtained by Smith et al. [11] for traffic flow forecasting. A seasonal lag of 672 corresponds to one week, because one week encompasses 672 15-minute intervals. From the investigations of all 20 random links, travel time is described by the same ARIMA$(1,0,1)(0,1,1)_{672}$ model. Table 3 lists the results. The minimum MAPE of 0.0096 suggests that some links can be modelled quite accurately by seasonal ARIMA. The maximum MAPE of 0.3362, however, suggests that for some links, seasonal ARIMA may not be the most suitable model. The mean value (0.1315 s) and standard deviation (2.9892 s) of ME suggest that forecasts for 20 random links are, in general, not strongly biased.

kNN non-parametric regression results
Experiments with a range of lagged values in state spaces and with different numbers of neighbours are performed. The simulations include lag values from 0 to 10 and from 1 to 30 nearest neighbours. Additionally, straight averages, weightings by the inverse of distance, and a hybrid state space are also used. In total, for all 20 random links and all possible kNN parameters, 11 × 30 × 3 × 20 = 19,800 executions are performed, each for two weeks of 15-minute data (i.e., 1344 forecasting points). It is found that kNN with a hybrid state space yields smaller MAPE values.
Figure 2 shows the dependence of MAPE, ME and RMSE on the number of lagged values in the state space and the number of neighbours when a hybrid state space is used. It can be seen that, generally, a high number of lagged values results in higher MAPE and ME in a manner that is independent of the number of neighbours, while generally, a low number of neighbours results in higher RMSE values.
The kNN model with the smallest MAPE values is proposed as the preferred model. Specifically, kNN with a hybrid state space, with one lagged value (lag = 1), and 26 neighbours (k = 26) is proposed. Table 4 gives the results for all 20 random links, obtained using the proposed kNN model. The mean (0.0218), maximum (0.0423), and minimum (0.0018) values of MAPE show that the proposed kNN performs very well for all 20 random links. Moreover, low values of ME and RMSE indicate that kNN forecasts are neither biased nor sensitive to extreme values.
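The size of this parameter sweep can be reproduced with a short Python sketch. The variant labels and the helper best_configuration are hypothetical names; the MAPE values passed to the helper would come from running each configuration on the two-week test period.

```python
from itertools import product

LAGS = range(0, 11)            # 0 to 10 lagged values
NEIGHBOURS = range(1, 31)      # 1 to 30 nearest neighbours
VARIANTS = ("straight_average", "inverse_distance", "hybrid")  # hypothetical labels
N_LINKS = 20
STEPS_PER_WEEK = 7 * 24 * 4    # 15-minute intervals in one week

grid = list(product(LAGS, NEIGHBOURS, VARIANTS))
print(len(grid) * N_LINKS)     # 19800 executions
print(2 * STEPS_PER_WEEK)      # 1344 forecasting points per execution


def best_configuration(mape_by_config):
    """Return the (lag, k, variant) key with the lowest MAPE."""
    return min(mape_by_config, key=mape_by_config.get)


# Made-up MAPE values for three configurations of one link.
example = {(1, 26, "hybrid"): 0.02, (0, 1, "straight_average"): 0.15, (5, 10, "hybrid"): 0.04}
print(best_configuration(example))
```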

Forecasting performance of the models
For all 20 random links, the results obtained across all models are compared. For all the links, with respect to ranked Friedman ANOVA, the null hypothesis (that the medians of the errors for all the models are equal) is rejected. Additionally, the Wilcoxon matched pairs test is performed. This result is used to determine inter-group differences in means. The Bonferroni correction of the α-value is performed, and this results in an α-value of 0.016666667. For 5 links out of 20, the Wilcoxon matched pairs test null hypothesis cannot be rejected at either the original or the Bonferroni-corrected α significance level. For all five of these links, the null hypothesis for historical averages and the seasonal ARIMA model cannot be rejected. For historical averages and the proposed kNN model, as well as for the seasonal ARIMA and proposed kNN models, the hypothesis can be rejected at the Bonferroni-corrected α significance level for all 20 random links.
Figure 3 (a) shows the MAPE for 20 random links, as well as the mean for all the links obtained with historical averages, seasonal ARIMA and the proposed kNN model. It can be seen that the proposed kNN yields a lower MAPE for all the links. The maximum MAPE is 0.04229. In most cases, ARIMA yields lower values than historical averages do, and this approach reaches a maximum value of 0.3362 while the historical average model reaches a maximum value of 0.3409 for MAPE. Figure 3 (b) gives the calculated mean Friedman rank for 20 random links, and the mean rank for performance on all links, with respect to historical averages, the seasonal ARIMA and the proposed kNN model. For the links where the Wilcoxon matched pairs test null hypothesis cannot be rejected, the obtained p-values are shown. For all the examined cases, the proposed kNN yields the lowest mean rank. Additionally, when compared to the other two methods using the Wilcoxon matched pairs test, the null hypothesis can be rejected. Figure 4 shows the ME and the RMSE for examined links obtained with historical averages, seasonal ARIMA and the proposed kNN model. It can be seen that the proposed kNN in some cases yields a higher ME than both historical averages and the seasonal ARIMA model. However, the maximum ME values for both historical averages and the seasonal ARIMA model are higher than the maximum ME for the proposed kNN. Overall, for the proposed kNN, the ME is positive for all the examined cases, but its absolute value is never more than three seconds. In addition, in all of these cases, the proposed kNN model yields lower RMSE values.
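The corrected significance level follows from dividing the original level by the number of pairwise comparisons among the three models. A minimal Python sketch, with hypothetical variable names and made-up p-values in the usage lines, is:

```python
from itertools import combinations

ALPHA = 0.05
MODELS = ("historical averages", "seasonal ARIMA", "kNN")

# Three models give three pairwise Wilcoxon comparisons.
pairs = list(combinations(MODELS, 2))
alpha_corrected = ALPHA / len(pairs)
print(round(alpha_corrected, 9))   # 0.016666667


def is_significant(p_value, alpha=alpha_corrected):
    """Reject the null hypothesis only below the corrected level."""
    return p_value < alpha


print(is_significant(0.03))   # False: below 0.05 but not below the corrected level
print(is_significant(0.001))  # True
```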

Evaluation of the model on selected cases
To evaluate the proposed model, four selected links are used. Table 5 lists the properties of the selected links, while Table 6 shows the results. The results are presented for historical averages, seasonal ARIMA, best-performing kNN and the proposed kNN model. Again, to find the best-performing kNN model for a given link, from 0 to 10 lagged values, from 1 to 30 neighbours, and the straight average, weighted by inverse of distance, and a hybrid state space are used. The main purpose of this experiment is to determine how similarly the proposed kNN performs to the optimal kNN for a given link. For all four selected links, the proposed kNN performs better than both historical averages and the seasonal ARIMA model with respect to the mean rank according to Friedman, MAPE, and RMSE. For link 2619, the proposed kNN results in an ME higher than the one for ARIMA, but the acquired MAPE is almost seven times lower. In addition, a substantially lower RMSE can be observed.
When the proposed kNN is evaluated against the best-performing kNN, it can be seen that the differences for all four links with respect to mean rank, MAPE, ME, and RMSE are relatively small.The greatest difference is for link 2775, where the proposed kNN gives a 3.4 % greater MAPE than the best-performing kNN.However, it is still 4 % lower than the MAPE associated with the seasonal ARIMA model.
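The search for the best-performing kNN on a given link can be sketched as a grid search over the number of lagged values (0 to 10), the number of neighbours (1 to 30), and the two averaging schemes. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, Euclidean distance is assumed, MAPE on a hold-out segment is assumed as the selection criterion, the hybrid state space is omitted, and the series is synthetic.

```python
import math


def states_and_targets(series, lag):
    """State at t = current value plus `lag` lagged values; target = value at t + 1."""
    idx = range(lag, len(series) - 1)
    return [series[t - lag:t + 1] for t in idx], [series[t + 1] for t in idx]


def ranked_neighbours(states, targets, x):
    """(distance, target) pairs sorted by assumed Euclidean distance to x."""
    return sorted((math.dist(x, s), v) for s, v in zip(states, targets))


def forecast(neigh, k, weighted):
    """Straight average or inverse-distance weighted average of the k nearest targets."""
    top = neigh[:k]
    if not weighted:
        return sum(v for _, v in top) / len(top)
    w = [1.0 / (d + 1e-9) for d, _ in top]  # small constant avoids division by zero
    return sum(wi * v for wi, (_, v) in zip(w, top)) / sum(w)


def grid_search(train, valid, lags=range(0, 11), ks=range(1, 31)):
    """Return (MAPE, lag, k, weighted) of the configuration with the lowest MAPE."""
    full = train + valid
    best = None
    for lag in lags:
        states, targets = states_and_targets(train, lag)
        errors = {}  # (k, weighted) -> list of absolute percentage errors
        for t in range(len(train) - 1, len(full) - 1):
            neigh = ranked_neighbours(states, targets, full[t - lag:t + 1])
            for k in ks:
                for weighted in (False, True):
                    err = abs(forecast(neigh, k, weighted) - full[t + 1]) / abs(full[t + 1])
                    errors.setdefault((k, weighted), []).append(err)
        for (k, weighted), errs in errors.items():
            score = sum(errs) / len(errs)
            if best is None or score < best[0]:
                best = (score, lag, k, weighted)
    return best


# Synthetic periodic travel times in seconds (hypothetical, period of 24 steps).
series = [60.0 + 20.0 * math.sin(2.0 * math.pi * t / 24.0) for t in range(240)]
train, valid = series[:192], series[192:]
best = grid_search(train, valid)
print(best)
```

Ranking the neighbours once per validation point and reusing that ranking for every k keeps the search cheap, since only the averaging step depends on k and on the weighting scheme.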

CONCLUSION
One of the main tasks of this paper is to define a model that can predict spatio-temporally dependent travel times for urban road networks from GPS data used for automatic digital road map creation. The majority of the work presented in the literature on travel time prediction has been performed on data collected using roadside sensors and other techniques. Additionally, the data collected to date have focused on motorways. In this paper, a model based on GPS data collected for urban road networks is presented. In this framework, methods for preparing GPS data for modelling, map matching, outlier detection and reducing travel time variability are demonstrated. The non-equidistant aggregation intervals approach is implemented to handle insufficient GPS data coverage.
Three different travel time prediction methods are investigated and implemented. The most basic method, the historical averages method, is used only for reference and, as expected, it produced very poor results. The seasonal ARIMA model and the kNN models are the other two methods that are explored. Seasonal ARIMA is used because the available literature commonly presents it as the most suitable approach for the prediction of traffic conditions on motorways and dual carriageways. Since there are some effects that are typical for urban networks, seasonal ARIMA was expected not to be the most suitable method for this type of data. Surprisingly, in all the examined cases, kNN proved to be the most accurate method.
All the analysed links are part of the urban traffic network of Zagreb. The proposed kNN model is determined by analysing 20 random links out of the 100 links that featured the greatest data coverage. Then the proposed model is evaluated on four selected links. Specific links are chosen to illustrate different construction and congestion issues. The analysed data are collected for a period greater than six months. The proposed model can also be applied to predict travel times in other cities. The size of the city is not an issue: for large urban environments, a grid computer could be used to ensure fast performance. The only limit of the model is the coverage of the GPS data. This is not, however, an issue for most developed urban environments, where it is often necessary to use a GPS.
The experiments provide justification for the use of the kNN method in travel time prediction. To the best of the authors' knowledge, no other published research has shown that the kNN approach can perform better than seasonal ARIMA. There are two reasons for this. Firstly, in this study, GPS data are used for travel time prediction, and secondly, the data are for urban traffic networks. Since seasonal ARIMA and kNN non-parametric regression are usually used to model different systems (non-deterministic with linear state transitions as opposed to deterministic with non-linear state transitions), this contribution may suggest that traffic in urban networks behaves chaotically.
Because of the lack of coverage and the way in which the GPS data are sampled in the study, the authors were unable to apply certain very interesting methods. One such method is space-time ARIMA (STARIMA). Future work should attempt to determine whether STARIMA would be the most appropriate method, since it can model the influences that neighbouring links exert on each other. Such broader perspectives, enabled by additional GPS data, may also include examining the performance of the proposed model when an entire route map is analysed.

Figure 1 - Weekday profiles of an exemplar link

where $n$ is the dimensionality of the state vector, and $a_r(x)$ and $a_r(y_i)$ are features of the test point and the training sample, respectively.
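For orientation, one distance that is consistent with these symbols is the Euclidean metric between the test point $x$ and the $i$-th training sample $y_i$, with $a_r(\cdot)$ denoting the $r$-th feature of a state vector of dimensionality $n$. The choice of metric is an assumption made here for illustration; only the symbol definitions are given above.

```latex
d(x, y_i) = \sqrt{\sum_{r=1}^{n} \bigl( a_r(x) - a_r(y_i) \bigr)^{2}}
```

Under this assumption, the nearest neighbours of $x$ are the training samples with the smallest $d(x, y_i)$.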

Figure 2 - Dependence of MAPE (a), ME (b), and RMSE (c) on the number of lagged values (lag) and the number of nearest neighbours (k) for kNN in the context of a hybrid state space. The 3D surface plot is generated with the use of a distance-weighted least squares fit.

Figure 3 - Comparison of MAPE values (a) and mean Friedman rank (b) as generated using historical averages, seasonal ARIMA, and the proposed kNN model

The links shown in Figure 5 are chosen because they are elements of roads in different parts of the city, each with different characteristics. Link 4562 (a) is a section of a bridge. It has only one input and one output connecting link. Link 2619 (b) has more than one dominant output link, so it represents the opposite situation. The other links, 2775 (c) and 947 (d), are somewhere between those two extreme cases. Link 4562 is the only one of the four selected links that is also among the aforementioned 20 random links.

Figure 4 - Comparison of ME (a) and RMSE (b) as generated using historical averages, seasonal ARIMA, and the proposed kNN model

The k-nearest neighbour forecasting can then be defined as follows. Given the test point $x$ and $N$ training [...]

[...] $V(t+1)$ is the historic average value for the i-th nearest neighbour time series value aggregated by weekday and time of the day with respect to $t+1$.

[...] April 2006 has been studied. The data are divided into two groups. The first group contains data from 1 October 2005 to 7 April 2006, and is used to develop the model. The other group contains data from 7 April 2006 to 21 April 2006, and is used to evaluate the model. It should be noted that the validation group contains only two weeks' worth of data. Two weeks are chosen because this time period corresponds to two seasonal lags in the obtained ARIMA model. Non-equidistant time intervals are used to average the travel time. The final time series resolution is 15 minutes.
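The k-nearest neighbour forecast introduced above can be illustrated with a minimal sketch that builds state vectors from lagged travel times and averages the next-step values of the k closest historical states. The names `build_states` and `knn_forecast` are hypothetical, Euclidean distance is assumed, the series is synthetic, and the hybrid state space based on historic averages is not reproduced.

```python
import math


def build_states(series, lag):
    """Pair each state (current value plus `lag` lagged values) with the value at t + 1."""
    states, targets = [], []
    for t in range(lag, len(series) - 1):
        states.append(series[t - lag:t + 1])
        targets.append(series[t + 1])
    return states, targets


def knn_forecast(states, targets, x, k, weighted=False):
    """Forecast the next value for state x from its k nearest historical states."""
    ranked = sorted(
        ((math.dist(x, s), v) for s, v in zip(states, targets)),
        key=lambda pair: pair[0],
    )[:k]
    if not weighted:
        # Straight average of the neighbours' next-step values.
        return sum(v for _, v in ranked) / len(ranked)
    # Inverse-distance weights; the small constant avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in ranked]
    return sum(w * v for w, (_, v) in zip(weights, ranked)) / sum(weights)


# Hypothetical 15-minute travel times in seconds (not data from the study).
series = [10.0, 12.0, 14.0, 12.0, 10.0, 12.0, 14.0, 12.0, 10.0, 12.0]
states, targets = build_states(series, lag=1)

straight = knn_forecast(states, targets, [10.0, 12.0], k=2)
weighted = knn_forecast(states, targets, [10.0, 12.0], k=2, weighted=True)
print(straight, weighted)
```

With one lagged value the state distinguishes a rising 12 from a falling 12, which a state made of the current value alone could not do; this is the reason for including lagged values in the state vector.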

Table 1 - Descriptive statistics for the links used to build the models.

Table 2 - MAPE, ME and RMSE calculated by historical averages.

Table 3 - MAPE, ME and RMSE obtained for the XXX_FORMULA_XXX model.

Table 4 - MAPE, ME and RMSE for our proposed kNN model.

Table 6 gives the mean rank in the context of Friedman, MAPE, ME and RMSE values. In all cases, the null hypothesis with respect to both the ranked Friedman ANOVA and the Wilcoxon matched pairs test at the Bonferroni corrected α significance level is rejected.
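The mean Friedman rank and the Bonferroni-corrected significance level referred to here can be sketched as follows. The function names, the sample errors, the base level of 0.05 and the number of pairwise comparisons are all assumptions for illustration; they are not values taken from the study.

```python
def rank_row(values):
    """Rank values in ascending order (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for m in range(i, j + 1):
            ranks[order[m]] = average
        i = j + 1
    return ranks


def friedman_mean_ranks(errors_by_model):
    """Mean rank of each model over all forecasts (lower means more accurate)."""
    names = list(errors_by_model)
    n = len(errors_by_model[names[0]])
    totals = [0.0] * len(names)
    for t in range(n):
        row = [errors_by_model[name][t] for name in names]
        for idx, rank in enumerate(rank_row(row)):
            totals[idx] += rank
    return {name: totals[idx] / n for idx, name in enumerate(names)}


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance level after the Bonferroni correction."""
    return alpha / n_comparisons


# Hypothetical absolute forecast errors in seconds for four forecasts.
errors = {
    "historical": [5.0, 4.0, 6.0, 3.0],
    "arima": [3.0, 4.0, 2.0, 3.0],
    "knn": [1.0, 1.0, 1.0, 2.0],
}
mean_ranks = friedman_mean_ranks(errors)
alpha = bonferroni_alpha(0.05, 3)  # assumed: three pairwise comparisons
print(mean_ranks, alpha)
```

The test statistics themselves are not computed in this sketch; in Python they are available as `scipy.stats.friedmanchisquare` and `scipy.stats.wilcoxon`, whose p-values would then be compared against the corrected level.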

Table 5 - Properties of the links used to evaluate the models

Table 6 - Mean Friedman rank and MAPE, ME and RMSE for selected links as given by historical averages, seasonal ARIMA, the proposed kNN and the best-performing kNN

For the links presented in this paper, the forecasting mean absolute percentage error for the baseline method (historical averages) ranges from 7.27% to 39.41%, for the seasonal ARIMA model from 0.96% to 33.62%, and for the proposed kNN from 0.18% to 5.20%. Additionally, the mean error and root mean square error of the forecasts show that the historical averages model gives the least accurate forecasts, the proposed kNN model gives the most accurate forecasts, and seasonal ARIMA gives forecasts with intermediate accuracy.