REINFORCEMENT LEARNING TECHNIQUE IN MULTIPLE MOTORWAY ACCESS CONTROL STRATEGY DESIGN



INTRODUCTION
Recurrent and non-recurrent motorway congestion leads to delays, reduced traffic safety, increased fuel consumption, and serious air pollution. Such congestion limits motorway throughput when it is most needed, i.e. during the peak hour. Throughput becomes even more critical when non-recurrent congestion occurs. Building new motorways will leave current motorway infrastructure insufficiently utilized. By contrast, traffic flows can be positively influenced by numerous intelligent transportation system (ITS) techniques.
Examples of motorway access control systems are numerous. ALINEA [5,11,17,18] is the first control strategy on a local level and is based on direct implementation of classical feedback control theory. Other efforts include a genetic fuzzy approach, artificial neural networks, and a two-level motorway access control approach [9,21,22].
All the existing motorway access control algorithms, although traffic responsive, are not truly adaptive to traffic parameter changes [19,20,14]. Most of them are of the local regulatory type [4,5]. The term adaptive is used here in a sense that differs from its common, and contested, interpretation in the literature. It means more than giving a real-time traffic response: the control policy also changes itself in response to the inherent characteristics of the system. In other words, in order to be truly adaptive, the system should be capable of learning continuously [4].
In this respect, by applying a specific artificial intelligence technique, a truly adaptive strategy for multiple motorway access control can be designed and developed. The main research hypothesis is that motorway access control can be a completely adaptive and optimal closed-loop control strategy that minimizes total travel time on the corridor.
This paper attempts to go a step further and use the adaptive control strategy when the level of traffic density to be maintained is not predefined, a situation in which the strategy itself learns how to minimize the total travel time spent in the system. Furthermore, the agents continuously learn by themselves and adapt to changes in the environment accordingly.

ARTIFICIAL INTELLIGENCE TECHNIQUE USED
Reinforcement learning (RL) is a machine learning technique that does not require supervised training, as is the case with other learning techniques such as neural networks. It is based on goal-directed learning from interaction with an environment, i.e. learning what to do, or how to map situations (states) to actions, in order to maximize a numerical reward signal. By trying, exploring, and exploiting actions in an iterative process, the learner, the so-called autonomous agent, senses its environment and learns how to choose the optimal action, i.e. the actions that yield the greatest cumulative reward.
More specifically, the agent and the environment interact at each of a sequence of discrete time steps $t = 0, 1, 2, 3, \ldots$. At each time step $t$, the agent receives some representation of the environment's state, here expressed as $s_t \in S$ (where $S$ is the set of possible states), and accordingly selects an action, here expressed as $a_t \in A(s_t)$ (where $A(s_t)$ is the set of actions available in state $s_t$). At each time step, the agent implements a mapping from state representations to the probabilities of selecting each possible action. This mapping is called the agent's policy. The most important features of the agent are trial-and-error search and delayed reward.
In RL, the agent's goal is formalized in terms of a special signal called a reward that passes from the environment to the agent. The agent tries to select actions so that the sum of the discounted rewards it receives is maximized, here expressed as

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$

where $R_t$ is the expected discounted reward, $r_t$ is the reward in the time step $t$, and $\gamma$ is the discount rate.
In particular, it chooses $a_t$ to maximize the expected discounted return, where $\gamma$ is a parameter between zero and one.
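The discounted sum described above can be illustrated with a short sketch. It assumes a finite reward sequence (the sum in the text is in general infinite), and the function name `discounted_return` and the sample rewards are illustrative only, not taken from the experiments.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k] over a finite reward sequence.

    rewards[0] plays the role of the first reward received after the
    current time step; gamma is the discount rate, between 0 and 1.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical reward sequence, for illustration only.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(example)  # 1 + 0.5 + 0.25 = 1.75
```

With a discount rate close to zero the agent weighs almost only the immediate reward; with a rate close to one, later rewards count nearly as much as early ones.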
Almost all reinforcement learning algorithms are based on estimating value functions, i.e. functions of states (or state-action pairs) that estimate how good it is for the agent to be in a given state. This is explained in 2.1.

Q-Learning
One of the most important improvements in RL was the development of an off-policy Temporal Difference (TD) control algorithm known as Q-learning. This algorithm, developed by Watkins, has been researched most frequently, both theoretically and practically. This is mainly due to its origins in the concepts and principles of Dynamic Programming (DP) [1]. Being related to DP in this way, Q-learning integrates planning and learning, unlike other reinforcement learning algorithms [2]. One of the most important features of this algorithm is that it does not require a pre-specified model of the environment upon which to base its action selection. Instead, only relationships between states, actions, and rewards are learned. Almost all traffic control methods, except the recent ones, require pre-specified models of traffic flow to generate short-term predictions of traffic conditions or to assess the impacts of possible control decisions [3].
The Q-learning task can be defined as acquiring the optimal policy $\pi^*$ by learning the value function $V^*$ of the optimal policy $\pi^*$, provided there is perfect knowledge of the immediate reward function $r$ and the state transition function $\delta$. When the agent knows the functions $r$ and $\delta$ used by the environment to respond to its actions, it can calculate the optimal action for any state $s$ as

$$\pi^*(s) = \arg\max_{a} \left[ r(s,a) + \gamma V^*(\delta(s,a)) \right].$$

If the evaluation function $Q(s,a)$ represents the reward received for executing action $a$ from state $s$, to which the value discounted by $\gamma$ is added, here expressed by

$$Q(s,a) = r(s,a) + \gamma V^*(\delta(s,a)),$$

then the agent will select optimal actions even when it has no knowledge of the functions $r$ and $\delta$, that is to say

$$\pi^*(s) = \arg\max_{a} Q(s,a). \tag{5}$$

In this case, independently of the policy being followed, the learned action-value function $Q$ directly approximates $Q^*$, that is to say the optimal action-value function.
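The point that a greedy choice over the evaluation function gives the same action as a choice made with full knowledge of the reward and transition functions can be checked on a toy problem. The sketch below is not the motorway model: the three-state chain, its rewards, the discount rate of 0.9, and all function names are hypothetical and chosen only to keep the example small.

```python
GAMMA = 0.9  # assumed discount rate, for illustration only

# Hypothetical deterministic chain: states 0 -> 1 -> 2, state 2 absorbing.
STATES = [0, 1, 2]
ACTIONS = ["stay", "advance"]


def delta(s, a):
    """Deterministic state transition function."""
    if s == 2:
        return 2
    return s + 1 if a == "advance" else s


def reward(s, a):
    """Immediate reward: 10 for entering state 2, otherwise 0."""
    return 10.0 if s != 2 and delta(s, a) == 2 else 0.0


def optimal_values(sweeps=100):
    """Value iteration: repeatedly back up the best one-step value."""
    v = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        v = {
            s: max(reward(s, a) + GAMMA * v[delta(s, a)] for a in ACTIONS)
            for s in STATES
        }
    return v


V = optimal_values()

# Evaluation function: immediate reward plus discounted value of successor.
Q = {(s, a): reward(s, a) + GAMMA * V[delta(s, a)]
     for s in STATES for a in ACTIONS}


def policy_from_model(s):
    """Optimal action computed with knowledge of reward() and delta()."""
    return max(ACTIONS, key=lambda a: reward(s, a) + GAMMA * V[delta(s, a)])


def policy_from_q(s):
    """Optimal action computed from the Q table alone."""
    return max(ACTIONS, key=lambda a: Q[(s, a)])


for s in STATES:
    print(s, policy_from_model(s), policy_from_q(s))
```

Once the table of Q values is available, the agent needs neither the reward function nor the transition function to act, which is what makes the approach model-free at decision time.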
Under certain conditions in a deterministic world (for an MDP), the estimated value $\hat{Q}$ converges to the true $Q$ value. Different authors have modified the original algorithm by introducing a learning rate $\alpha$, expressed by

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') \right],$$

where $Q(s,a)$ is the action-value function, $\alpha$ is the learning rate ($0 < \alpha < 1$), $\gamma$ is the discount rate parameter, and $Q(s',a')$ is the value of the new action $a'$ in the new state $s'$.

The learning rule used in this research is the Q-learning algorithm by Watkins for non-deterministic processes [16]. It applies because the probability distributions of both the reward function $r(s,a)$ and the transition function $\delta(s,a)$ depend on $s$ and $a$ only. They do not depend on previous states or actions, as the process is a non-deterministic Markov decision process (MDP). Since traffic is a stochastic process, the learning rule has to accommodate a non-deterministic environment. The action-value function $Q(s,a)$ is redefined as the expected value of the quantity previously defined for the deterministic case. The rule then becomes

$$\hat{Q}_n(s,a) \leftarrow (1-\alpha_n)\,\hat{Q}_{n-1}(s,a) + \alpha_n \left[ r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a') \right]. \tag{8}$$

In equation (8), $\hat{Q}_n(s,a)$ is the estimate, in the n-th iteration, of the expected value of action $a$ in state $s$, $\alpha_n$ is the learning rate, and $\hat{Q}_{n-1}(s',a')$ is the corresponding estimate for the new action $a'$ in the new state $s'$. The learning rate $\alpha_n$ is expressed by

$$\alpha_n = \frac{1}{1 + \mathrm{visits}_n(s,a)}.$$

In the above equation, $s$ and $a$ are the state and action updated during the n-th iteration, and $\mathrm{visits}_n(s,a)$ is the total number of times that this state-action pair has been visited up to and including the n-th iteration. The rule reduces to the deterministic case when $\alpha_n$ is 1. As $n$ increases, $\alpha_n$ decreases. By reducing $\alpha_n$ at an appropriate rate during training, convergence of the Q values can be achieved. In order to speed up the learning process, a fixed $\alpha_n$ was used in our experiments.
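A minimal sketch of the non-deterministic update rule follows, assuming the learning rate is one divided by one plus the visit count of the state-action pair, with the count including the current visit. The class `QTable`, its attribute names, the optional `fixed_alpha` argument, and the sample states and rewards are hypothetical; the actual state and action encoding used in the experiments is not reproduced here.

```python
from collections import defaultdict


class QTable:
    """Tabular Q-learning with a visit-based or fixed learning rate."""

    def __init__(self, actions, gamma=0.9, fixed_alpha=None):
        self.actions = list(actions)
        self.gamma = gamma              # discount rate (assumed value)
        self.fixed_alpha = fixed_alpha  # None -> use 1 / (1 + visits)
        self.q = defaultdict(float)     # Q estimates, default 0.0
        self.visits = defaultdict(int)  # visit counts per (state, action)

    def update(self, s, a, r, s_next):
        """Apply one update for the transition (s, a) -> (r, s_next).

        Returns the learning rate that was used.
        """
        # Count this visit first, so the count includes the current one.
        self.visits[(s, a)] += 1
        if self.fixed_alpha is not None:
            alpha = self.fixed_alpha
        else:
            alpha = 1.0 / (1.0 + self.visits[(s, a)])
        # Best estimated value obtainable from the new state.
        best_next = max(self.q[(s_next, b)] for b in self.actions)
        target = r + self.gamma * best_next
        # Blend the old estimate with the new sample.
        self.q[(s, a)] = (1.0 - alpha) * self.q[(s, a)] + alpha * target
        return alpha


# Hypothetical usage: two visits to the same state-action pair.
table = QTable(actions=["a"], gamma=0.9)
first_alpha = table.update("s", "a", 10.0, "end")   # alpha = 1/2
second_alpha = table.update("s", "a", 10.0, "end")  # alpha = 1/3
print(first_alpha, second_alpha, table.q[("s", "a")])
```

Passing a value for `fixed_alpha` corresponds to the fixed learning rate mentioned in the text; the value itself is not stated in the paper, so any number used here would be an assumption.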

MODEL TESTING
In order to test the control strategy, scenarios were grouped into two test cases according to the traffic parameters:
- the first test case: coordinated control with parameter measurements taken at the motorway exit and known traffic demand on the main line (Figure 1);
- the second test case: coordinated control with measurements taken downstream at each motorway entry and unknown traffic demand on the main line (Figure 2).
During the second test case, two types of scenarios were developed: (1) testing when there is no traffic congestion, and (2) testing when there is traffic congestion in the corridor.
In order to estimate the feasibility of the suggested strategy for optimal adaptive coordinated control of the motorway entry ramps, the results from the learning agents were compared with the results from the case with no control strategy and with those from the case with ALINEA control, the widely implemented control strategy used as a regulator.
The results from the simulations with no control strategy were taken as the base case against which the other results were compared. Testing was conducted after a sufficient number of iterations with different numbers of states and after convergence of the Q-values [4].

The strategy presented above for optimal adaptive coordinated motorway access control uses a so-called look-up table [4].

DISCUSSION
Within the first test case (coordinated motorway access control, measurements at the exit of the corridor, traffic demand known), improvements were as follows:
- travel time savings of up to 14.50%;
- delay decreased by 26%;
- average stop time per vehicle decreased by 37%;
- average number of stops per vehicle decreased by 35%;
- the number of vehicles exiting the network increased by 14%.
It is evident that this type of control strategy needs a longer learning phase for the agents, which makes the strategy insufficiently efficient. Therefore, localized motorway entry access was implemented, with traffic parameters measured on the mainline downstream of each access (the second test case). During this test case, two types of testing (scenarios) were performed: 1. testing with no traffic congestion present; 2. testing with traffic congestion present.
After performing tests with data showing no traffic congestion (Scenario 1), significant improvements were noticeable in:
- delay (decreased by 30%) (Figure 3);
- average stop time per vehicle (reduced by 78%);
- average number of stops per vehicle (reduced by 80%), demonstrating the smoothness of traffic flow;
- longer trips, with an evident decrease in travel time and delay and a significant difference after one hour of travel.
There was very little improvement in:
- travel time (reduced by 3.29%);
- number of vehicles exiting the corridor (increased by 3%);
- speed (increased by only 0.33%).
It was noticeable that the strategy followed real-time changes in traffic parameters, particularly during the transition from the congested state to the normal state. The results from implementing ALINEA were similar, for the same measures of effectiveness, to the corresponding results of the suggested control strategy. This similarity could be explained by the fact that there was no recurrent congestion on the corridor, which made this strategy inferior as compared to ALINEA.
Regarding travel time savings, speed increase, and the number of vehicles exiting the corridor, the results gained with ALINEA were not very promising. This is important because implementing the ALINEA strategy requires parameter calibration for the particular motorway and for the corresponding traffic demand. However, the coordinated control strategy described above can be tested on unknown traffic demand. Therefore, in the case with no traffic congestion, the suggested strategy could be implemented with learning performed on traffic demand similar to the one preceding the implementation.
During the second test case (with traffic congestion on the corridor and with unknown traffic demand), the Q-learning strategy showed extraordinarily good results after a relatively small number of iterations (about 1500). The results were as follows:
- travel time savings of 15%;
- delay decreased by 26% (Figure 4);
- average stop time per vehicle decreased by 38%;
- average number of stops per vehicle decreased by 35%;
- the number of vehicles exiting the network increased by 10%;
- speed increased by 9.85%.
Improvements were almost doubled compared to the results with ALINEA for the same measures of effectiveness (8.41%, 13%, 20%, 19%, 6.22%, and 3.55%, respectively). The strategy evidently adjusted itself to the traffic conditions, i.e. it is adaptive and responds to the real-time traffic demand. Thus, the main research hypothesis stated at the very beginning has been confirmed [4].
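The comparison with ALINEA can be made explicit by dividing each Q-learning improvement by the corresponding ALINEA figure. The sketch below uses only the percentages reported above for the congested scenario; pairing the ALINEA values with the measures in the listed order is an assumption based on the word "respectively", and the variable names are illustrative.

```python
# Measures of effectiveness, in the order they are reported in the text.
measures = [
    "travel time savings",
    "delay decrease",
    "average stop time per vehicle decrease",
    "average number of stops per vehicle decrease",
    "vehicles exiting the network increase",
    "speed increase",
]

# Reported improvements over the no-control base case, in percent.
q_learning = [15.0, 26.0, 38.0, 35.0, 10.0, 9.85]
alinea = [8.41, 13.0, 20.0, 19.0, 6.22, 3.55]

# Ratio of the Q-learning improvement to the ALINEA improvement.
ratios = [q / a for q, a in zip(q_learning, alinea)]

for name, ratio in zip(measures, ratios):
    print(f"{name}: {ratio:.2f}x the ALINEA improvement")
```

The ratios range from roughly 1.6 to 2.8, with most of them between 1.8 and 2.0, which is consistent with the statement that the improvements were almost doubled.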
The best improvement was achieved in the case of control implementation with data showing no congestion (for the average stop time per vehicle and average number of stops per vehicle).
Regarding all the measures of effectiveness, the best results were gained when the control strategy was implemented on unknown traffic demand with congestion. This shows that the suggested strategy is feasible for coordinated motorway access control that is optimal, adaptive, and traffic responsive.
After the testing with data where there is traffic congestion and unknown traffic demand on the corridor, the strategy that uses Q-learning showed extraordinarily good results after a relatively small number of iterations. Thus, its feasibility and efficiency have been confirmed as well.
The suggested coordinated control strategy proves better than ALINEA in relation to the average stop time per vehicle and the average number of stops per vehicle during the peak hour. The evidence of this lies in the smoothness of the traffic flow, with no stop-and-go interruptions. This leads to reduced fuel consumption per vehicle, reduced air pollution, and reduced environmental pollution as well [4].

CONCLUSION
Bearing in mind the results of the model testing, it can be concluded that an optimal adaptive coordinated motorway access control is feasible for performing multiple motorway access control.
This research opens broad possibilities for implementing the reinforcement learning technique in traffic control. Some of the next steps in this research are to deal with coordinated control for non-congested traffic, traffic signal control at isolated intersections, and examination of the model efficiency after implementation.
This research shows the implementation of a real-time traffic control strategy. Several facts confirm its uniqueness:
1. the strategy requires no environment modeling;
2. the strategy is truly adaptive;
3. supervision is not necessary;
4. there is no need for traffic parameter prediction;
5. the optimal control strategy, based on the current traffic state only and on the current control conditions, simplifies the approach;
6. the strategy can be implemented in real time, since the model requires neither simulation steps to be performed nor any calculations to be made during the implementation phase.
Taking the above into account, the conclusion follows that the strategy is a firm basis for further research in the area of self-learning adaptive coordinated traffic corridor control.

Figure 1 -First test case layout