PASSENGER FLOWS ESTIMATION OF LIGHT RAIL TRANSIT (LRT) SYSTEM IN IZMIR, TURKEY USING MULTIPLE REGRESSION AND ANN METHODS

Passenger flow estimation of transit systems is essential for new decisions about additional facilities and feeder lines. For increasing the efficiency of an existing transit line, stations which are insufficient for trip production and attraction should be examined first. Such investigation supports decisions for feeder line projects which may seem necessary or futile according to the findings. In this study, passenger flow of a light rail transit (LRT) system in Izmir, Turkey is estimated by using multiple regression and feedforward back-propagation type of artificial neural networks (ANN). The number of alighting passengers at each station is estimated as a function of boarding passengers from other stations. It is found that ANN approach produced significantly better estimations specifically for the low passenger attractive stations. In addition, ANN is found to be more capable for the determination of trip-attractive parts of LRT lines.


INTRODUCTION
Passenger flow modelling of Light Rail Transit (LRT) systems is a rarely studied area of public transportation. On the other hand, such modelling is essential, especially for developing countries like Turkey where new rail transit projects are still under construction and are built gradually, requiring a long period for completion. The completed and opened-to-service parts of these projects may guide the next steps in determining the location of new stations and mode integration with existing public transit systems. It is also important to ensure that any infrastructure investment will have beneficial effects on the overall transport system and those affected by it [1]. Hence, the modelling approach will be very beneficial in the decision-making process.
Izmir is the third biggest metropolitan city of Turkey with a population of over 3 million. The city has a new transportation master plan proposing many supplementary transit lines. Therefore, the locations of new transit stations and transfer points have to be examined by using the statistics of existing systems. Izmir LRT is one of the newly constructed modern rail transit systems, located at the south-east of the city centre with an approximately linear track between the west and the east. The current LRT system is a small range transit application having 11.6 km of total line length, 10 stations and a feasibility capacity of 11,000 passengers per hour per direction. Although the daily demand of Izmir LRT is about 100,000 passengers, it is expected to show a considerable increase when the other supplementary transit lines are opened to service in a few years. A general map of the current system in service is given in Figure 1. The dashed lines in the figure show the intersecting rail transit project which is being constructed for the North-South connection for the public transportation of Izmir. The LRT line also has extension projects connected by Bornova, Üçyol, and Halkapinar stations.
The operational plan of Izmir LRT system is based on some statistical data of passenger flows. The statistical data are obtained from the prepaid ticket machines at the stations and include time, usage and station locations. By using these data, the passenger numbers getting into every track according to time and the day of week are obtained for a minimum time period of 5 minutes. The assigned time intervals between tracks are checked to see whether the total boarding passengers of each station exceed the physical capacity of the tracks. However, there is an important detail which cannot be neglected in this operation planning. There is no information available on how many of the passengers come to the stations and in which direction on the line they go. This is because all of the passengers who enter use the same prepaid ticket machines regardless of trip direction. The number of passengers getting off at any station is also neglected in this application. Therefore, the statistics used may not reflect the real situation and the operational plans depend on trial-and-error approximation. This is another reason why passenger flow modelling of Izmir LRT system is needed: it can make the operational plan more reasonable by predicting the number of alighting passengers at each station.
There are some studies in the literature about passenger flows in public transportation facilities. However, rather than flow prediction, these studies generally focused on passenger flow management of busy stations or on station stop time and departure time optimization [2,3,4]. Lee et al. studied the modelling of the flow weight distribution and found a power-law behaviour for Seoul subway system [5]. There are some other studies involving the application of ANNs for predicting the daily trend of total public transit flows that provide practical benefits for operational planning and decision support [6,7].
ANN is one of the recently explored technologies which show promise in the area of transportation engineering. Neural networks have the ability to learn from their environment and to adapt to it in an interactive manner similar to their biological counterparts. This is an exciting prospect because of the vast possibilities that exist for performing certain functions with ANN [8]. Therefore, the use of ANNs in passenger flow prediction may reduce the dependency on probabilistic approaches used for the flow weight distributions and increase the significance of the past flow statistics in planning practice.
In this study, the Izmir LRT trip flow predictions by the regression and ANN models are explored. The estimation performance of trip-productive stations, which has great importance in decision-making for feeder line projects, has also been investigated.

DATA
In the study, the daily total numbers of boarding and alighting passengers have been used for the models of each station. The total number of alighting passengers at each station is estimated by using the total number of boarding passengers of other stations. The data belong to nine consecutive months from October 2007 to June 2008 and a total of 240 days are included, for each of which 20 items of record (10 boarding and 10 alighting sums) are arranged. The data set also includes weekend days in which travel demand changes dramatically. On the other hand, the data of some specific days like national holidays in the given interval have been eliminated to prevent the inclusion of extreme observations which may decrease the estimation performance. Thus, a reasonable heterogeneity of data is obtained which can make the distinction clear between the estimation capabilities of regression and ANN approaches.
Although a peak and off-peak hour distinction for trip flow estimation is necessary for a more rational analysis, it is not possible in practice because the alighting passenger numbers are recorded from mechanical counters at the end of the day for each exit of all stations. While a detailed record can be easily obtained from entry gates through the electronic ticket machines, the exit gates cannot be monitored. This can lead to reduced prediction capability since the observation of boarding and alighting activity necessitates short time lags between the two activities. A dynamic passenger flow analysis cannot be realized with this kind of data. On the other hand, this study, which aims at predicting the source stations of the general outputs of each station for a general trip flow analysis, can also show the prediction capability of the developed models under limited conditions.
The histograms and closeness to normal distribution of the data set can be seen in Figure 2. The abbreviation "B" means boarding data and "A" corresponds to alighting data. As can be seen, the data of Basmane, Halkapinar and Stadyum stations are the farthest from normal distribution and the values seem

REGRESSION MODELS
For a dynamic passenger flow model, one can easily think that the passengers boarding from a specific station split to groups going to other stations and there is a simple linear relationship between the total boarding number and the divided alighting number. Therefore, the regression analysis may seem the only sound method for demonstrating this linear relationship. However, for the common case in which the dynamic boarding and alighting data are not available, the splitting percentages of each boarding station may not be obtained easily.
The regression analysis utilized for the passenger flow is based on the linear estimate of alighting passenger numbers of a specific station depending on the number of boarding passengers of other stations:
$$Y_{a,i} = \beta_0 + \sum_{j=1,\, j \neq i}^{n} \beta_j X_{b,j} + \varepsilon_{a,i} \qquad (1)$$
where $n$ is the number of stations, $Y_{a,i}$ the number of alighting passengers at station $i$, $X_{b,j}$ the number of boarding passengers at the other stations $j$, $\beta_0$ the constant term, $\beta_j$ the coefficients of explanatory variables and $\varepsilon_{a,i}$ the regression residual. The ordinary least squares method is applied by varying the data type, the number of explanatory boarding stations and the consideration of the constant term. Eight different regression types are obtained. Since the ranges of the passenger numbers using each station are considerably different, data standardization can increase the efficiency of the regression. Therefore, besides the regressions with raw data, the standardized data are also applied for the purpose of comparison. Eq. 2 is utilized for the standardization, which is also used for the ANN applications.
$$x_{s} = 1.8\,\frac{x_{n} - x_{\min}(i)}{x_{\max}(i) - x_{\min}(i)} - 0.9 \qquad (2)$$
where $x_{\max}(i)$ and $x_{\min}(i)$ are the maximum and minimum values of the "i"th station. The "s" and "n" indices indicate the standardized and natural cases respectively. The standardization method used compresses the data into the [-0.9, 0.9] range, which is intended to prevent the upper and lower limit saturation problem in ANN analysis. The same standardization method as in the ANN approach is preferred to get homogeneity in performance comparison.
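The standardization step can be illustrated with a minimal sketch. It assumes the linear min-max form x_s = 1.8 (x_n - x_min) / (x_max - x_min) - 0.9, which is the linear map consistent with the stated [-0.9, 0.9] range; the function names and the sample boarding counts are hypothetical and are not taken from the study.

```python
def standardize(values, x_min=None, x_max=None):
    """Scale values linearly into [-0.9, 0.9] (assumed form of Eq. 2)."""
    x_min = min(values) if x_min is None else x_min
    x_max = max(values) if x_max is None else x_max
    span = x_max - x_min
    if span == 0:
        raise ValueError("x_max and x_min must differ")
    return [1.8 * (x - x_min) / span - 0.9 for x in values]


def destandardize(scaled, x_min, x_max):
    """Invert standardize(): map scaled values back to natural units."""
    span = x_max - x_min
    return [(s + 0.9) * span / 1.8 + x_min for s in scaled]


# Hypothetical daily boarding counts at one station.
boardings = [1200.0, 4500.0, 7800.0]
scaled = standardize(boardings)
restored = destandardize(scaled, min(boardings), max(boardings))
print(scaled)
print(restored)
```

Passing explicit `x_min` and `x_max` makes it possible to scale one data set with the limits of another, for example when test outputs are transformed back with limits taken from training data only.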
The regression models are also utilized for both cases of eliminated and non-eliminated boarding stations. The regressions are also diversified by the inclusion and exclusion of the constant term, which may be significant in the case of ineffective boarding stations.
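As an illustration of the regression variants described above (with and without the constant term), the following is a minimal ordinary least squares sketch using only the standard library. It solves the normal equations by Gaussian elimination; the function names and the small synthetic data set are assumptions made for the example and do not reproduce the study's data or software.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ols(x_rows, y, with_constant=True):
    """Least-squares coefficients from the normal equations (X'X) b = X'y.

    With with_constant=True the first returned coefficient is the constant term.
    """
    rows = [([1.0] + list(r)) if with_constant else list(r) for r in x_rows]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear(xtx, xty)


# Synthetic example: boardings at two other stations (columns) over five days,
# and alightings generated as 50 + 0.3 * x1 + 0.1 * x2.
x_rows = [(100.0, 400.0), (200.0, 300.0), (300.0, 500.0), (400.0, 100.0), (150.0, 250.0)]
y = [50.0 + 0.3 * x1 + 0.1 * x2 for x1, x2 in x_rows]

coef_with_constant = ols(x_rows, y, with_constant=True)
coef_without_constant = ols(x_rows, y, with_constant=False)
print(coef_with_constant)
print(coef_without_constant)
```

Eliminating a boarding station corresponds to dropping its column from `x_rows` before calling `ols`.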
The estimation performances of the regression models are compared by using Root Mean Square Errors (RMSE) (Eq. 3) and Efficiency Factor (EF) (Eq. 4). RMSE is a frequently used measure of the differences between the predicted values and the actual (observed) values and serves to aggregate the residuals into a single measure of predictive power.
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_i^{obs} - Y_i^{pre}\right)^2} \qquad (3)$$
$$\mathrm{EF} = 1 - \frac{\sum_{i=1}^{N}\left(Y_i^{obs} - Y_i^{pre}\right)^2}{\sum_{i=1}^{N}\left(Y_i^{obs} - \bar{Y}\right)^2} \qquad (4)$$
where $Y_i^{obs}$ and $Y_i^{pre}$ are the observed and predicted values of the "i"th alighting passenger observation, $N$ is the number of observations and $\bar{Y}$ is the mean of alighting passengers for each model. EF measures the model errors relative to the errors obtained by using the mean of the observed data set as the estimate, and ranges from minus infinity to 1.0. "EF = 1" corresponds to a perfect match of modelled alighting passenger numbers to the observed data. "EF = 0" indicates that the model predictions are as accurate as the mean of the observed data, and an efficiency less than zero ($-\infty < \mathrm{EF} < 0$) shows worse prediction than the mean.
For an impartial comparison, RMSE is given as a ratio of the observed mean (Table 1). The coefficients of determination (R²) of the regressions are not provided because they may be misleading for the regressions without a constant term.
When the table is examined station by station, six stations (Ucyol, Konak, Cankaya, Halkapinar, Stadyum and Bornova) show successful estimations with EF values close to "1". The performances of different regression types are also similar for these stations. However, this cannot be said for the regressions of the other four stations (Basmane, Hilal, Sanayi and Bolge). The common property of these stations is their considerably low number of passengers for both boarding and alighting flows. Besides, Basmane station is close to the biggest social and commercial fair area of Izmir, which causes high fluctuation in trip demand depending on the time and size of the activity at the fair.
In general, there is no remarkable difference between the given performance statistics of the regression types. However, it can be said that the exclusion of the constant term decreases the estimation capability, especially for the mentioned four stations. The elimination of the boarding stations also has a minor decreasing effect on the performance. It means that the flow effective stations can be easily distinguished. The statistics of the post regressions between observed and predicted values of the regression models are presented in Table 2.
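The two performance measures can be sketched as follows. The sketch assumes the usual definitions: RMSE as the square root of the mean squared residual, and EF as one minus the ratio of the residual sum of squares to the sum of squared deviations of the observations from their mean, which matches the stated range (minus infinity to 1) and the interpretation of EF = 0. The function names and sample numbers are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def relative_rmse(observed, predicted):
    """RMSE expressed as a ratio of the observed mean."""
    return rmse(observed, predicted) / (sum(observed) / len(observed))


def efficiency_factor(observed, predicted):
    """EF = 1 - SSE / SST; 1 is a perfect fit, 0 is no better than the mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


# Hypothetical alighting counts for three days.
obs = [100.0, 200.0, 300.0]
pre = [110.0, 190.0, 300.0]
print(rmse(obs, pre), relative_rmse(obs, pre), efficiency_factor(obs, pre))
```

Dividing RMSE by the observed mean makes stations with very different passenger volumes comparable on one scale.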
As is known, the squared R and the slope of the post regression (β1) should be close to "1" and the constant term (β0) should be close to "0" for a sound model. According to these criteria, the regression models of Halkapinar station give the most reliable estimations.
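The post-regression check can be illustrated with a short sketch. It assumes that the post regression is a simple linear regression of the predicted values on the observed values (the text does not state which variable is regressed on which), and returns the constant term, the slope and the squared correlation. The function name and the sample values are hypothetical.

```python
def post_regression(observed, predicted):
    """Fit predicted = b0 + b1 * observed and return (b0, b1, r_squared)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    sxy = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_o
    r_squared = (sxy * sxy) / (sxx * syy)
    return intercept, slope, r_squared


# Hypothetical observed and predicted alighting counts.
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [120.0, 210.0, 290.0, 380.0]
b0, b1, r2 = post_regression(observed, predicted)
print(b0, b1, r2)
```

A model that reproduces the observations exactly gives a constant term of 0 with slope and squared R equal to 1, which is the reference point for the criteria above.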
Halkapinar is located at the middle of the LRT line and it is the centre of inter-modal public transit. Therefore, this considerable success at Halkapinar station is very important to make inferences about the efficiency of transfer points. For the models constructed with standardized data, the exclusion of the constant term indicates smaller decrease in the performance compared with the models of the raw data. The four stations un-
Consequently, it can be said that the regression models give high estimation capability for the sections of LRT line where the trip demand demonstrates stable and relatively higher trend compared to its average. However, as presented above, the regression models perform poorly for the stations where there are fluctuations in trip demand and therefore a more reliable modelling approach may be required.

ANN MODELS
Neuro-computing is concerned with processing information, which first involves a learning process within an artificial neural network architecture that adaptively responds to inputs according to a learning rule. After the neural network has learned what it needs to know, the trained network can be used to perform certain tasks depending on the particular applications [8].
ANN can have one or more layers consisting of many neural cells which are connected by connection links having a certain direction determined by the network architecture. Each connection link has an associated weight that represents its connection strength and each neuron typically applies a nonlinear transformation, called an activation function, to its net input to determine its output signal. The network is trained by using an expected output in such a manner that the weights of connection links are updated according to the selected learning method in a typical iteration step called an epoch [9].
Neural networks as a global approximation tool have been widely used due to their ability to process and map external data and information based on past experience to generate successful forecasts. One of the developing application areas of ANN is transportation engineering. Murat and Ceylan investigated the applicability of ANN models in forecasting of transport energy demand and found consistent results [10]. Zhang et al. used ANN for the reconstruction of vehicle crash accidents and they claim that the pre-impact velocity of vehicles without tyre marks could be predicted by an ANN model [11]. Murat used ANN to estimate vehicle delay for non-uniform and over-saturated conditions [12].
In this study, the "Feed Forward" perceptron with "Back Propagation" training algorithm (FFBP) type of ANN, which is the most widely used type and a remarkable alternative to the regression approach, is chosen for the application. In a Feed Forward ANN, the nodes are arranged in layers and they are connected to those in the next layer, but not to those in the same layer. The information flows only in the forward direction, from the input layer to the hidden and output layers.
The Back Propagation is a supervised training algorithm in which an input-output training set is used and it consists of mainly two activities: forward pass and backward pass. In the forward pass, the training pairs from the input data sets are selected and fed into the input neurons and the activity is propagated from the input layer to the hidden and then output layers. In the backward pass, the propagation occurs in the reverse direction and the errors are computed for each output unit. Layer by layer, the error for each hidden unit is computed by propagating the errors. The weights are updated by the generalized delta rule, which is based on steepest gradient descent with the direction vector being set to the negative of the gradient vector. Consequently, the solution often follows a zigzag path while trying to reach a minimum error position. Therefore, it is sometimes possible to get trapped in a local minimum. The "gradient descent with momentum" technique is a successful way to avoid this problem, in which the weights of the next epoch are determined by including the effect of the weight difference between the past two epochs [13]:
$$\Delta w_{ij}(n) = -\eta\,\frac{\partial E}{\partial w_{ij}} + \alpha\,\Delta w_{ij}(n-1) \qquad (5)$$
where $\Delta w_{ij}(n)$ are the present iteration differences of the weights, $\Delta w_{ij}(n-1)$ the past iteration differences of the weights, $\eta$ the learning rate, $E$ the error function depending on the weights and $\alpha$ the momentum factor.
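The momentum update can be sketched on a toy problem. The sketch assumes the standard form of the rule, in which each weight change is the negative learning rate times the gradient plus the momentum factor times the previous weight change. The quadratic error function is purely illustrative and stands in for the network error; the function names are hypothetical, and only the default learning rate (0.05) and momentum factor (0.9) are taken from the training settings reported in this paper.

```python
def momentum_step(weights, grads, prev_delta, learning_rate=0.05, momentum=0.9):
    """One update: delta(n) = -eta * dE/dw + alpha * delta(n - 1)."""
    delta = [-learning_rate * g + momentum * d for g, d in zip(grads, prev_delta)]
    new_weights = [w + d for w, d in zip(weights, delta)]
    return new_weights, delta


def toy_gradient(w):
    """Gradient of the illustrative error E(w) = (w0 - 1)^2 + 10 * (w1 + 2)^2."""
    return [2.0 * (w[0] - 1.0), 20.0 * (w[1] + 2.0)]


weights = [0.0, 0.0]
delta = [0.0, 0.0]
for _ in range(500):  # each pass plays the role of one epoch
    weights, delta = momentum_step(weights, toy_gradient(weights), delta)
print(weights)
```

On this toy error surface the iterates approach the minimum at (1, -2); in a real network the gradient would come from the backward pass instead of a closed-form expression.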
It is known that the extrapolation capability of ANN is relatively weak compared with interpolation [14]. Therefore, an attempt is made to distribute the minimum and maximum values comparably between the training and testing parts of the data set. For this purpose, a rank number is assigned to each day in each of the 20 data columns in such a manner that the maximum value of the column has the biggest rank. Then, the ranks are summed up for each row (day of record) and the data are sorted according to the summation. The sorted data are distributed to the training and testing sets one by one for each row and consequently 120 training and 120 testing pairs are obtained.
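The rank-based split described above can be sketched as follows. The sketch assumes ordinal ranks within each column (ties broken by row order, which the text does not specify) and assigns the sorted rows alternately to the training and testing sets, starting with training. The function name and the four-row example are hypothetical.

```python
def rank_sum_split(rows):
    """Return (train_indices, test_indices) from an alternating rank-sum split.

    rows: list of equal-length records, one per day.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    rank_sums = [0] * n_rows
    for c in range(n_cols):
        # Rank 1 goes to the smallest value, so the column maximum gets the biggest rank.
        order = sorted(range(n_rows), key=lambda r: rows[r][c])
        for rank, r in enumerate(order, start=1):
            rank_sums[r] += rank
    ordered = sorted(range(n_rows), key=lambda r: rank_sums[r])
    return ordered[0::2], ordered[1::2]


# Four hypothetical days with two recorded columns each.
days = [[10.0, 1.0], [40.0, 4.0], [20.0, 2.0], [30.0, 3.0]]
train_idx, test_idx = rank_sum_split(days)
print(train_idx, test_idx)
```

Because neighbouring rows in the sorted order go to different sets, both sets span a similar range of low and high demand days, which limits the amount of extrapolation required at test time.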
The data are standardized by using Eq. 2, which is compatible with the chosen hyperbolic tangent activation function. This method compresses the data set to the range of -0.9 to 0.9 instead of -1 to 1, which prevents the upper and lower limit saturation. The saturation problem may cause insufficient learning because the activation functions give the values cumulated around "0" and "1" especially for the data having repeated patterns at minimum and maximum limits [15]. The independent variables, which are the boarding passengers of nine stations, are standardized by using the maximum and minimum values of the whole data set. However, for back transformation of the dependent variable, which is the number of alighting passengers at the model station, the maximum and minimum values of the training data set are used. In this way, the output of the test data is treated as unobserved.
In this study, a two-hidden-layered network architecture is employed. The number of neurons in the first hidden layer was obtained by the trial and error procedure while 5 neurons were fixed in the second hidden layer. Consecutively 5, 10, 15, 20, 25 and 30 neurons are tried for the first hidden layer of the ANN model. Thus, six different trainings and tests are applied for each station. The performance measures of post regression, RMSE and EF are calculated for the simulation results of both the test data and the whole data. The numbers of neurons that give the best performance for each station are shown in Table 3 for the testing data, which are more critical than the training results. As can be seen from the table, the testing data results give "15" as the optimum number of neurons. The resulting network structure is shown in Figure 5.
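The forward pass through such an architecture can be sketched as below, for 9 inputs, 15 neurons in the first hidden layer, 5 in the second and a single output. The sketch makes several assumptions that the text does not confirm: hyperbolic tangent is applied in every layer including the output, each neuron has a bias stored as its last weight, and the weights are drawn at random rather than trained. All names are hypothetical.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create n_out neurons, each with n_in weights plus one trailing bias."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_out)]


def forward(layers, inputs):
    """Propagate inputs layer by layer with tanh activations."""
    activations = list(inputs)
    for layer in layers:
        activations = [
            math.tanh(sum(w * a for w, a in zip(neuron[:-1], activations)) + neuron[-1])
            for neuron in layer
        ]
    return activations


rng = random.Random(0)  # fixed seed so the sketch is repeatable
sizes = [9, 15, 5, 1]   # inputs, first hidden, second hidden, output
network = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]

# Nine standardized boarding values (hypothetical), all within [-0.9, 0.9].
sample = [0.1, -0.3, 0.5, 0.0, 0.9, -0.9, 0.2, -0.4, 0.7]
output = forward(network, sample)
print(output)
```

Trying 5, 10, 15, 20, 25 or 30 neurons in the first hidden layer only changes the second entry of `sizes`; the training itself (back propagation with momentum) is outside the scope of this sketch.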
For the first stage, all of the boarding stations are included in the input layer of the network (ANN-AS). The network was successfully trained with 2,500 epochs, a learning rate of 0.05 and a momentum factor of 0.9. The results of the first stage analysis obtained by using the test data are given in Table 4. Besides the mentioned performance statistics, the percentages of discrepancy ratio (DR) are also presented in the table. DR values of the estimations are calculated by Eq. 6 for each observed and predicted pair:
$$DR = \log_{10}\left(\frac{Y^{pre}}{Y^{obs}}\right) \qquad (6)$$
Generally, an estimation is accepted as good if its DR value is between -0.1 and 0.1, corresponding to approximately 25% deviation from the observations. In the table, the percentage of the estimations having DR below -0.10 is indicated as "low estimation ratio" (LER), and the percentage over 0.10 DR as "high estimation ratio" (HER). The estimations with DR between -0.10 and 0.10 are indicated as "proper estimation ratio" (PER).
The PER percentages of the ANN models are satisfactory in general. A tendency towards low prediction is dominant for most of the stations. Minimum PER percentages are obtained for Basmane and Sanayi, which are the stations having the lowest and most inconsistent passenger activity.
When the slopes of the post regressions (β1) of the test data simulations are compared, it can be said that Konak station gives the best result for trip flow prediction. Ucyol, Halkapinar, Sanayi and Cankaya stations follow consecutively. The efficiency factors (EF) and coefficients of determination (R²) also indicate good prediction performance for the Ucyol and Halkapinar stations. However, Bornova station takes the place of Sanayi and Cankaya for EF and R² values. Thus, the ANN model including all boarding stations gives the highest performance for the three critical points of the LRT line (the two ends and the main transfer point) and reasonable estimation capability for the other stations.
In the second stage of the ANN analysis, the estimation capability for flow effective stations is tested by eliminating some stations from the nine boarding stations (ANN-ES). One by one, a station is eliminated from the nine input stations and the corresponding performance is evaluated. This procedure is applied for all the stations; however, for the sake of brevity, only the results for Ucyol station are presented in Table 5. As seen in Table 5, for example, the constant term of the post regression (β0) gets closer to "0" with the single elimination of the boarding data of the 4th, 6th and 10th stations (Basmane, Halkapinar and Bornova). According to these performance improvements indicated by different statistics in the table, the stations which have higher improvements and occur more frequently are selected for the combined elimination. The elimination is gradually continued as long as only a negligible decrease is observed in the estimation performance. For example, the 4-6, 4-6-5 and 4-6-5-2-10 combinations are eliminated gradually for the Ucyol station.
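The discrepancy ratio statistics can be illustrated with a short sketch. It assumes DR = log10(predicted / observed), a form consistent with the deviations quoted in this paper (a DR of 0.1 corresponds to about 25% and a DR of 0.01 to about 2.3%) and with negative DR meaning under-estimation. The function names and the sample pairs are hypothetical.

```python
import math


def discrepancy_ratio(observed, predicted):
    """Assumed form of the discrepancy ratio: log10(predicted / observed)."""
    return math.log10(predicted / observed)


def dr_to_percent(dr):
    """Percentage deviation of the prediction implied by a DR value."""
    return 100.0 * (10.0 ** dr - 1.0)


def estimation_ratios(observed, predicted, limit=0.10):
    """Return (LER, PER, HER) as percentages of all observed-predicted pairs."""
    drs = [discrepancy_ratio(o, p) for o, p in zip(observed, predicted)]
    n = len(drs)
    ler = 100.0 * sum(d < -limit for d in drs) / n
    her = 100.0 * sum(d > limit for d in drs) / n
    per = 100.0 - ler - her
    return ler, per, her


# Four hypothetical observed and predicted alighting counts.
obs_counts = [100.0, 100.0, 100.0, 100.0]
pre_counts = [70.0, 95.0, 105.0, 140.0]
ler, per, her = estimation_ratios(obs_counts, pre_counts)
print(ler, per, her)
print(dr_to_percent(0.1), dr_to_percent(0.01))
```

Tightening `limit` from 0.10 to 0.01 gives the stricter band used later when the regression and ANN models are compared.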

LER: low estimation ratio, PER: proper estimation ratio, HER: high estimation ratio
The performance results of the combined elimination are summarized in Table 6. The results reveal that the combined elimination produces markedly different results from the single elimination. A reasonable decrease in the prediction performance can be seen for Ucyol, Stadyum and Sanayi stations (see Table 4 and Table 6). On the other hand, Konak, Cankaya and Hilal stations indicate better performance after the elimination. Consequently, the western part of the LRT line, which is closer to the central business district (CBD), exhibits distinguishable performance with the ANN model after the selection of trip-effective stations. Hence, the CBD-based trips can be evaluated as having more predictable flow for LRT lines.
The trip flow scheme for ANN-ES model is shown in Figure 6. When it is compared with Figure 4 given for RE

COMPARISON OF REGRESSION AND ANN MODELS
Since the different variations of the multiple regression models give similar estimation performance, two of them (RA and RE) are selected for the comparison with ANN models (ANN-AS and ANN-ES). The efficiency factors (EF) of the four mentioned models are given in Table 7. As expected, the elimination of the boarding stations slightly decreases the EF values, which should be close to "1" for a proper estimation, for both the regression and ANN approaches. Except for Ucyol and Stadyum stations, the ANN approach increases the prediction efficiency, specifically for Basmane and Hilal stations which have poor estimations with the regression models. Accordingly, it can be said that the ANN approach produces considerably high capability of trip flow prediction for the cases in which the multiple regression is inadequate.
The difference between the two approaches is clearer when the DR percentages are compared. Figure 7 shows the DR percentages of model predictions within the -0.01 to 0.01 range, which indicates the ratio of the predictions within a deviation of 2.3%. As can be seen from the figure, considerably high success is obtained for the ANN models, for which 60% of all predictions fall within this range. This is only 30% for the regression models. For the first five stations, which have been constructed in the CBD, the ANN-ES model indicates higher performance than the ANN-AS model. Hence, rather than the regression models, the ANN models can allow the selection of trip-attractive stations for the LRT lines in the CBD.
The statistics of the predictions having DR percentages outside the -0.01 to 0.01 range are also important for evaluating the estimation performances. The percentages given in Figure 8 are obtained by the difference in

CONCLUSION
The most distinguishable difference between the two examined trip flow estimation approaches for Izmir LRT arises from the flow schemes represented in Figures 4 and 6. ANN model yields considerably different results in the selection stage of trip-effective stations. For the stations where the regression models produce poor estimations, ANN models show considerably high performance by the inclusion of some boarding stations. The ANN approach necessitates more explanatory variables (boarding stations), especially for the line section in the CBD of Izmir. The station selections of the ANN approach can be evaluated as more reliable for Izmir LRT because the discrepancies between the observed and predicted pairs produce findings in favour of this approach.
When the numbers of arrows are compared in the figures (Figure 4 and while Ucyol, Konak and Bornova stations show higher attraction. Accordingly, Izmir LRT system in its current form is found to be effective only for the trips between the two ends of the line. The inner trips having shorter distance are not reasonably supported by the system. Therefore, some feeder lines around the middle section of the system can provide higher travel demand and increase the efficiency of the LRT system in public transportation of Izmir. The regression models provide better estimation capability for the sections of LRT line where the trip demand demonstrates stable and relatively higher trend compared to its average. However, the case of fluctuating demand and low trip attractions may cause a dramatic decrease in the estimation capability.
The ANN approach is more capable for the determination of trip-attractive stations because of unbiased DR values after the elimination of the stations. In addition, it is more reliable for the LRT sections constructed in the CBD. Hence, it can be concluded that ANN is an effective tool for trip flow estimation.
The multiple regression models can be evaluated as preferable from the simplicity and manageability points of view. Generally, in cases where the ANN model is ineffective, the regression models have better performance. The opposite of this case is also true according to the results.
In the light of these critiques, it can be concluded that the ANN approach should be considered as a "rescuer" technique when the data used are unsuitable for the regression analysis. Otherwise, the regression analysis can be a more practical and more user-friendly method for trip flow prediction of LRT lines.

ACKNOWLEDGEMENT
The authors would like to thank the personnel of Izmir Metro Inc., Ilgaz Candemir, Emre Oral and Nurten Caliskan for providing the data of the study. Besides, Mustafa Özuysal appreciates TUBITAK, The Scientific and Technological Research Council of Turkey, for the doctorate scholarship.

Figure 3 - The histograms of standardized residuals for RE model

Figure 4 represents the flow scheme of Izmir LRT obtained by the stepwise elimination of the stations of RE model. Ucyol, Konak and Cankaya stations seem to be the most effective stations for trip attraction and production. The stations at the middle section of the LRT line, like Basmane, Hilal, Halkapinar and Stadyum, indicate lower dependence on other stations. Konak, Cankaya and Bornova stations show attractiveness for long trips rather than the short ones.

Figure 4 - Trip flow scheme for RE model

Figure 5 - The network architecture (example for Cankaya station)

M. Özuysal et al.: Passenger Flows Estimation of Light Rail Transit (LRT) System in Izmir, Turkey Using Multiple Regression and ANN Methods
of them (RA and RE) are selected for the comparison with ANN models (ANN-AS and ANN-ES). The efficiency factors (EF) of the four mentioned models are given in Table
LRT arises from the flow schemes represented in Figures 4 and 6. ANN model yields considerably different results in the selection stage of trip-effective stations.

Table 1 - Performance of the regression models
The standardized residual histograms of RE model compared with normal distributions are given in Figure 3. It is clear that RE model fairly satisfies the criterion for most of the stations, specifically Ucyol, Konak, Cankaya, Halkapinar and Bornova. Some stations like Hilal, Stadyum, Sanayi and Bolge deviate slightly from the normal distribution. However, the estimation residuals of Basmane station indicate a distribution considerably far from the normal distribution.
R: raw data, S: standardized data, A: including all stations, E: eliminated stations, C: including constant term
der question exhibit more sensitivity for the elimination of boarding stations. When the models with elimination are compared, RE model can be stated as the most successful in general. In order to recognize a regression estimator as a model, the residuals should provide some important criteria like fitness of normal distribution.

Table 2 - Post-regression statistics of the regression models
R: raw data, S: standardized data, A: including all stations, E: eliminated stations, C: including constant term

Table 3 - The number of neurons giving the best performance for each station

Table 4 - The performance of network simulation by using the testing data

Table 5 - Performances of ANN model after single station eliminations for Ucyol

Table 6 - The performance of network simulation by using eliminated boarding stations