FUNDAMENTAL DIAGRAM ESTIMATION BASED ON RANDOM PROBE PAIRS ON SUB-SEGMENTS

A new statistical algorithm is proposed in this paper with the aim of estimating fundamental diagram (FD) in actual traffic and dividing the traffic state. Traditional methods mainly focus on sensor data, but this paper takes random probe pairs as research objects. First, a mathematical method is proposed by using probe pairs data and the jam density to determine the FD on a stationary segment. Second, we applied it to the near-sta-tionary probe traffic state set through linear regression and expectation maximisation iterative algorithm, estimating the free flow speed and the backward wave speed and dividing the traffic state based on the 95% confidence interval of the estimated FD. Finally, simulation and empirical analyses are used to verify the new algorithm. The simulation analysis results show that the estimation error corresponding to the free flow speed and the backward wave speed are 1.0668 km/h and 0.0002 km/h respectively. The empirical analysis calculates the maximum capacity of the road and divides the traffic into three states (free flow state, breakdown state, and congested state), which demonstrates the accuracy and practicability of the research in this paper, and provides a reference for urban traffic monitoring and government decision-making.


INTRODUCTION
In traffic flow theory, three macroscopic variables of flow, density, and speed are usually used to describe dynamic traffic, and there is a basic relationship q=kv between them [1]. In stationary traffic (all vehicles keep the constant speed and headway), the empirical relationship diagrams between the three variables are called fundamental diagram (FD) [2]. If the FD is known, as long as one variable of the three is known too, then the other two variables can be determined easily. The observational study of Yan [3] also showed that an obvious relationship exists between the near-stationary state, the flow, and the actual traffic density. The FD can not only describe the relationship between macroscopic variables, but also explain the microscopic vehicle to a certain extent [4][5][6]. What is more, it can provide a basis for the classification of traffic states and can be widely used in fields of transportation science and traffic engineering [7][8][9].
To estimate the FD in actual traffic, the functional form of the FD should first be assumed. There are two commonly used functional forms of the FD; one is the original the FD proposed by Greenshield and the other is the triangular FD, the latter being more suitable for dynamic traffic modelling. In addition, the FD with five parameters was proposed by Wu [10] in 2002, but it needed to estimate more parameters, and the estimation process was vehicles. In some other studies, the FD is derived by considering the expected headway [23], reaction time [24] of artificially driven vehicles, safety intervals, and permeability [25]. Abhinav et al. [26] proposed using a recursive binary splitting method to estimate the traffic state based on probe vehicles data, but it was a complicated process when considering the penetration of probe vehicles under different conditions. The purpose of this paper is to establish a computable method of FD estimation using only probe pairs data, and divide the traffic state on the basis of the estimated FD. Each probe pair was randomly matched by any two internet vehicles on the same sub-segment, which greatly reduced the time and cost of the data collection process. In addition, as long as FD is estimated, the traffic state can be easily divided, avoiding the complicated process of traffic state estimation. In order to achieve the purpose of estimation, this paper relies on the minimum exogenous hypothesis, namely the functional form of FD and the jam density. The structure of the paper is as follows. In Section 2, discusses the basic idea of FD estimation using probe pairs data, and illustrates the identifiability of triangular flow-density FD on a stationary segment mathematically. In Section 3, based on Section 2, a new computational algorithm for FD estimation using actual and noisy traffic data is proposed. In this new algorithm, roads are subdivided into an appropriate number of sub-segments. According to the data collected by multiple probe pairs on each sub-segment, the maximum likelihood estimation is used to estimate the most probable FD. In section 4, the accuracy of the new algorithm is verified by simulation analysis. In Section 5, we verified the practicability of the new algorithm by empirical analysis. Section 6 concludes the research results and the prospect of future research.

DETERMINED FD ON A STATIONARY SEGMENT
In this section, it is proposed to use probe pairs data to determine FD from a mathematical perspective in the idealised traffic state (homogeneous traffic) [1].

Assumptions and theorems
Assumption 1: The study is based on the LWR model [27][28] with a homogeneous FD and First-In-First-Out (FIFO) condition, and without source or sink, expressed as follows: complicated. Then, parameters of the FD were estimated by data fitting curves. Since 1935, FD estimation using Greenshield model has been studied extensively. The usual sensors can measure traffic flow and occupancy at their locations, which were closely related to speed, so the FD cannot be estimated in places where sensors are not installed. As the costs of installation and operation for sensors are usually high, it is not always possible to determine the FD on every road.
Compared with conventional sensors that can only collect data at finite discrete locations, probe vehicles [11][12] are distinguished by the fact that they collect spatial continuous data from a wide range of areas. Moreover, probe vehicles can collect data during daily trips, which is a highly cost-effective tool for traffic data collection [13]. Given the advantage of probe vehicles, it is possible to estimate FDs almost everywhere. At present, there are few studies using probe vehicles data to estimate the FD. Many methods based on sensor data use traffic data summarised in a short period (for example, traffic count within 30 s to 1 min) as the input of estimation for the FD, from which statistical relationships are extracted [14]. There are also ways to group the data and then consider the vehicles in each combination separately [15][16]. These methods are based on sensors that can easily obtain the flow, density, or trajectory of all vehicles, so the FD estimation method based on sensor data is not suitable for probe vehicles data. Seo et al. [17][18] proposed two methods for estimating traffic state and FD in 2015 and 2019, respectively. The first method was based on the probe vehicles position and headway data, and the second method used the asteroid remote sensing and probe vehicles data to achieve estimate. It is obvious that both methods require not only probe vehicles location data, but also other data. Herrera et al. [19] proposed that under certain conditions, by manually measuring the shock wave speed between uncongested and congested state, the backward wave speed in triangular FD could be obtained, and then other parameters of triangular FD could be determined by the given jam density. Zhang Nan et al. [20] established the FD based on the distributions of shockwave speed. In actual traffic, it is not easy to measure and determine the speed of shock wave, but the speed can be estimated by using probe vehicles data [21]. Seo [22] also proposed the method of estimating the FD only using probe pairs data, but it required data collection based on a pre-set probe pair, that is, it required higher requirements for probe where a m is a time-space region enclosed by the traveling track of mand m + (the region includes m + , but does not includes m -). In addition, c m -1 denotes the number of non-probe vehicles between probe pair m, and c m is always a constant regardless of the shape, time or position of a m . The steady PTS of probe pair m meet above conditions 1 and 2.
The detailed derivation of Theorem 3 refers to the third section of the literature by Seo et al. [21]. The relationship described by Equation 5 is called the probe fundamental diagram (PFD) of m, which can also be expressed as follows: θ m represents a specific parameter vector. Since FD is similar to PFD, so θ m is composed of the free flow speed u, the backward wave speed w and the jam density κ/c m of PFD.
Assuming that a group of probe pairs M pass through the same segment, probe pairs are matched by any two internet vehicles in the segment, and all the probe pairs are of the same type (the vehicles have the same size, such as length, etc.). And if any region a m enclosed by the traveling track of each probe pair m (m!M) on the segment satisfies conditions in Theorem 3, then the segment is called to be a stationary segment. In other words, any region a m (m!M) on the stationary segment meets conditions in Theorem 3. If there is a region a m on a segment that does not meet conditions in Theorem 3, then it is not a stationary segment. If the set of all probe pairs passing through a stationary segment in one day is considered as a group of probe pairs, which is denoted by M, Equation 5 holds for any probe pair m, and m!M.

A mathematical method to determine FD
If the parameter vector θ of FD is determined based on a traffic state set (TSs) or a probe traffic state set (PTSs), then the parameter vector θ m of the PFD can also be determined by using a given data set (a PTSs of all probe pairs M on a stationary segment). From a mathematical point of view, it is clear that the following proposition holds.
Proposition: A triangular FD can be determined if (1) one or more steady TS data in uncongested state is available, and (2) two or more different steady TS data in congested state are available.
where q(t,x) and k(t,x) denote the flow and density at the time-space point (t,x) respectively, Q represents the flow-density FD function, and θ is the parameter vector of the FD. Assumption 2: The functional form of the flow-density FD is a triangle [4], which is expressed as follows: where u denotes the free flow speed, w denotes the backward wave speed, and κ denotes the jam density, and all of them compose the parameter θ. Equation 2 has two states, uncongested state and congested state.
Theorem 1: According to the generalised definition of Edie [29], traffic state (TS) in a time-space region A is defined as Theorem 2: Consider a probe pair m which was composed of two probe vehicles, the first probe vehicle is denoted by mand the second probe vehicle is denoted by m + , and there may be any or an unknown number of vehicles between a probe pair. The probe traffic state (PTS) of the probe pair is defined as follows: PTS only represents the flow, density, and speed of probe vehicle m + in region A. Theorem 3: If a PTS meets (1) the traffic is stable in region a m and (2) each vehicle in region a m has the same traveling distance, it is called steady PTS. Steady PTS of probe pair m follows a similar relationship of the geometric to the FD in the flow-density plane, and the ratio is 1/c m , then available, (2) two or more different steady PTS data from congested state are available, and (3) the jam density is known.
Note that prior to the analysis, PTSs should be roughly divided into uncongested and congested state based on speed, and the final state of PTSs is divided by the FD recognition results. Figure 1 and Figure 2 further explain Corollary 2. According to the daily visible traffic phenomena, the traffic state is roughly divided into uncongested and congested, as shown in the space-time diagram of a stationary segment in Figure 1. With boundary of the red dotted line, the lower half represents an uncongested state, and the upper half represents a congested state. When the traffic state is transformed from uncongested to congested, there will According to Equation 5, if all probe vehicles in the probe pairs group M are the same type, and the number of non-probe vehicles between every probe pairs is equal to c m -1, then the following corollary can be drawn from the proposition.

Geometric interpretation of FD determination on a stationary segment
Corollary 1: On a stationary segment, a triangular PFD can be determined if (1) one or more steady PTS data from uncongested state is available and (2) two or more different steady PTS data from congested state are available.
This means that the values of the free flow speed and the backward wave speed can be determined by proper pairs of probe vehicle data. Once the PFD is determined and the jam density is given, the corresponding FD can be determined according to Equation 5. If this truth is combined with Corollary 1, the following corollary is obtained.
Corollary 2: On a stationary segment, the triangular FD can be determined if (1) one or more steady PTS data from uncongested state is Density k

ESTIMATION ALGORITHM OF FD IN ACTUAL TRAFFIC
The actual traffic data may have various traffic phenomena due to different drivers and vehicle types. The data always contains measurement noise, so all actual traffic data can be regarded as non-steady data. In addition, different PFDs may be estimated from different sub-segments, that is, it is difficult to obtain a consistent FD. These situations are not completely conformed to the LWR model theory, so Corollary 2 proposed in Section 2.2 cannot be applied to actual traffic directly.
In this section, a triangular FD estimation algorithm based on actual traffic data is presented from statistics. Specifically, Section 3.1 presents a method to extract the near-steady PTSs from probe pairs data, Sections 3.2 and 3.3 propose a method to estimate the free flow speed and the backward wave speed based on the near-steady PTSs.
In order to simplify the mathematical representation, the features of triangular flow-density FD are described as the free flow speed u, the backward wave speed w, and the y-intercept of congested part α, and α≡κw always applies. As shown in Figure 3, Figure 3a is the three-dimensional FD and PFD, and the other three graphs are the decomposition of Figure 3a from three planes, respectively, where q c is the capacity flow and k c is the capacity density.

Extraction of near-steady PTS set
According to Equation 4, multiple PTSs S n (a nm ) corresponding to m and n are calculated from probe pairs data, n represents any sub-segment on a road, m represents any probe pair on sub-segment n, and a nm is the traveling area of probe pair m on sub-segment n (m!M), and M represents the group of all probe pairs on sub-segment n), and the PTSs is called the original PTSs.
In order to satisfy the conditions of Corollary 2, it is necessary to extract near-steady PTS sets. In this paper, a two-step extraction method is adopted.
Step 1: Extract near-stationary segment and consider whether each segment satisfies the following conditions or not: 1) There is no ramp or auxiliary road on the segment, that is, the number of lanes remains the same all the time. 2) All vehicles on the segment adopt uniform speed limit rules and follow FIFO rules.
be a transition state, which is defined as the breakdown state. With the definition of breakdown state, the traditional uncongested and congested state can be divided into free flow, breakdown, and congested state. In Figure 1, the solid line shows the connection track of all probe pairs passing through a stationary segment within a day and the dotted line represents the connection track of all non-probe vehicles. The blue part represents free flow state, the orange part represents congested state, and the colourless area between blue and orange represents the breakdown state. a 1 , a 2 and a 3 are three small regions, representing the traveling region of a probe pair in uncongested state and two different probe pairs in congested state, respectively. Figure 2 is the flow-density FD corresponding to The flow-density PFD is determined by u, w and κ/c m , and has a vertex at the point (0,0). If given the blue point S 1 (a 1 ), as two points determine a line, then the slope of the line u can be calculated easily. In the same way, w and κ/c m are determined by two orange points S 2 (a 2 ) and S 3 (a 3 ). In other words, a triangle that satisfies the following conditions can be identified uniquely: the vertex at the point (0,0), one side on the line q=0, and other two sides respectively passing through the blue point and two orange points.

The jam density acquisition
The jam density is one of the indispensable parameters in the FD and PFD. The jam density must be known before estimate FD, but it cannot be determined by probe vehicles data directly. The jam density is a variable independent of time and space, and the common acquisition methods of it include experience and remote sensing technology. To be more specific, it is possible to query existing studies to set the jam density, or to set the jam density based on values observed in nearby locations, or to directly measure the jam density using satellite remote sensing directly. Moreover, even if the value of the jam density is unknown, the method can still calculate the free flow speed and the backward wave speed, as explained in Corollary 1.
Therefore, the near-steady PTSs are obtained following the two steps above. All segments included in the near-steady PTSs are denoted as N, and the probe pairs on the segment n (n!N) that collect data from the near-steady PTSs are denoted as M n .

The backward wave speed estimation
In this paper, the backward wave speed is estimated from the free flow speed separately, and the uncongested and congested states are roughly divided according to the lower bound u min of the free flow speed. Under the congested state, condition 2 in Corollary 2 indicates that there is a negative linear correlation between the probe flow and the probe density. So the estimated value of the backward wave speed w can be considered as the mean of the regression results of all probe flow and probe density in the congested state. Specifically, the backward wave speed is estimated by linear regression of (k(a nm ),q(a nm )) for each n!N, m!M n and 3) All probe vehicles on the segment are the same type, and the number of the non-probe vehicles between each probe pair is the same. If the conditions above all meet at the same time, this is called a near-stationary segment.
Step 2: Determine whether the PTSs {S n (a nm )|6m} corresponding to each near-stationary segment satisfy conditions in Corollary 2. If {S n (a nm )|6m} does not contain S n (a nm ) with a speed lower than u min or higher than u min , where u min represents a given lower bound of free flow speed, then the data in near-stationary segment n are discarded, and these correspond to conditions 1 and 2 in Corollary 2. In addition, if the correlation between probe flow and probe density in {S n (a nm )|6m,v n (a nm )≤u min } is no less than θ corr , then the data in near-stationary segment n are discarded. v n (a nm ) is the speed of probe pair m in region a nm on near-stationary segment n and θ corr represents the given threshold value, which also corresponds to condition 2 in Corollary 2.  where π s denotes contribution rate of state s in mixed distribution, L is the likelihood function of the near-steady PTSs when parameters π, u, w, {α n }, and {σ s } are given. In order to estimate the parameters π, u, w, {α n }, and {σ s } from the near-steady PTSs (q nm ,k nm ), the latent variable z nms is introduced. If S nm belongs to region s, then the latent variable z nms is 1, otherwise it is 0, then Equation 8 is transformed as follows: where u min and u max are nonnegative constants. In order to make Equation 12a satisfy Equation 12b, Lagrange multiplier method is adopted to solve the maximisation problem Equation 12, which is defined as follows: where λ with a lower index is the Lagrange multiplier corresponding to each constraint in Equation 12b. The Karush-Kuhn-Tucker condition [30][31] is introduced to solve the problem, and finally the following results are obtained: v n (a nm )≤u min , and the final estimate is the mean of all regression estimates. The calculation process is as follows: Step 1: Constructing a linear regression model for the PTSs of the congested state on each sub-segment, to estimate the regression coefficients, that is, w n ←LR(q(a nm ), k(a nm )|n!N,m!M n ,v(a nm )≤u min ).
Step 2: After Step 1, the backward wave speed of each sub-segment was estimated, which is composed of a parameter vector w=(w 1 ,...w n ,..., w N ).
Step 3: Calculate the mean of the parameter vector w=mean(w), where w is the backward wave speed to be estimated.

The free flow speed estimation by using EM algorithm
The free flow speed is estimated by constructing an iteration algorithm of Expectation Maximisation (EM). In this algorithm, the free flow speed u is estimated based on the near-steady PTSs, combined with y-intercept value α n and other auxiliary variables (for example, standard deviations in uncongested and congested state of FD).
It is assumed that the flow in the near-steady PTS set follows a normal distribution in which the mean is the true PFD and the standard deviation is r n σ s under state s (s=U represents uncongested state, s=C represents congested state). σ s denotes the standard deviation between the flow and the mean of FD under state s. r n denotes the specific parameter of segment n, which equals to q ̅ n / q ̅ , q ̅ n denotes the mean of the flow of all probe pairs on sub-segment n, and q ̅ denotes the mean of the flow of all probe pairs on all sub-segments. Therefore, the likelihood function of the PTS set of probe pair m on sub-segment n in state s is expressed as follows: where f(x,σ) denotes the probability density function of normal distribution with the mean 0 and the standard deviation σ. The parameters u, α n , s, and σ s are unknown, whereas the remaining parameters are known.
The mixed distribution likelihood function of the near-steady PTSs is as follows:  (19) where n denotes the number of sub-segments, setting parameters n=50 (indicating that there are 50 sub-segments that meet the conditions), the free flow speed is u=80 km/h, the backward wave speed is w=15 km/h, and the jam density is κ 1 =50 veh/km. ε n (q) is a random disturbance item related to flow, used to describe the noise that may exist in the system or observation. The value of σ n on different sub-segments is different, ensuring the flow of each period is no less than 0, setting σ n =min(uk n ,w(κ-k n )). According to general experience, there are usually morning and evening rushes on roads at 07:00~10:00 and 16:00~20:00, so the day is divided into five time periods (before the morning rush, during the morning rush, after the morning rush, and before the evening rush, during the evening rush, and after the evening rush) to simulate probe pairs data. All sub-segments are recorded for every 5 min.

Parameters setting for simulated sensor data
Using the same simulation method as above, the data collected by the sensors on the complex road by Equation 19 are stimulated. In the simulation of the sensor data, setting n=1, the jam density is κ 2 =200 veh/km, and other parameters are all consistent with the settings in the probe pairs data in simulation. In the end, a total of 288 TS data were obtained on this complex road.

Parameters setting for the estimation algorithm
The cleaning parameter of the PTSs is set to be θ corr =-0.8. The upper and lower bounds of the free flow speed are set to u max =120 km/h and u min =60 km/h, respectively. For each n and m, if v n (a nm )≤u min then the initial values of EM algorithm are set to be p nm (U)=0 and p nm (C)=1, respectively; otherwise set p nm (U)=1 and p nm (C)=0. The parameter λ to determine the convergence of EM algorithm is set to be 0.0001.

Estimated results of the simulation analysis
Under the setting of parameters, a total of 46 near-stationary segments are extracted from 50 sub-segments, and a total of 13,248 near-steady PTS data meet the conditions of Corollary 2. The If p nm (U)=0 holds for all n and m, then the right side of the Equation 14 does not exist. In other words, if no data are observed from uncongested state, the free flow speed cannot be determined. This corresponds to condition 1 in Corollary 2.
For the parameters that need to be input, their values can be determined easily. Any value of θ corr is acceptable, as long as it can generate a sufficient number of near-steady PTS data from the original PTSs. The upper and lower limit of the free flow speed u max and u min do not have to be similar to the actual free flow speed. The upper limit u max is used to eliminate unrealistically high values of u (for example, more than 500 km/h), and the lower limit u min is to determine the speed of congested traffic, which is significantly slower than the free flow speed, so lower values can be accepted (for example, 60 km/h for highways). In addition, the jam density can reasonably be determined according to Section 2.4.

SIMULATION ANALYSIS
This section verifies the effectiveness of the algorithm by simulating the actual traffic data of multiple sub-segments on a road, and divides the traffic state based on the estimated FD.

Parameters setting for simulated probe pairs data
Assuming a road, there are multiple sub-segments that meet conditions 1, 2, and 3 in Step 1 of section 3.1, using the triangular flow-density FD functional form, and simulating the actual traffic data of multiple sub-segments on a road in a day. The functional form is expressed as follows: Comparing the simulated sensor data with the estimated FD, results are shown in Figure 4d. The grey point is the simulated sensor data, the black solid line represents the estimated FD, and the green dashed line represents the true FD. It can be seen from the graph that the true FD is very similar to the estimated FD, and it can almost be considered as overlapping, whereas the sensor data shows little fluctuation in the estimated FD. This fluctuation area can be considered as a transition area between uncongested state and congested state, and can be called a traffic breakdown state. Due to the subtle changes, it is omitted here, and the following empirical analysis section is reflected. According to the change characteristics of the three FDs, the traffic state is divided on the flow-density FD. The region (k<k 2 ) on the left of the point (q 2 ,k 2 ) estimated free flow speed and backward wave speed are shown in Table 1. After comparison, the estimation errors between the true value and the estimated value are small, and the standard deviations are ̂σ U =18.7249, ̂σ C =1.5492.
According to the estimation results, Figures 4a-4c can be drawn. As shown in these graphs, the point (q 2 ,k 2 ) represents the maximum capacity of the road, that is, when the density is k 2 =31.5789 veh/ km, the maximum capacity is q 2 =2526.3160 veh/h. The estimated PFD is the scaling of the estimated FD, and the scaling ratio is 1/4 which exactly equals to the ratio of the jam density of probe pairs data to the sensor data. As k 1 / k 2 =1/4, q 1 /q 2 =1/4, so c m =4, it means there are 3 non-probe vehicles between each probe pair, showing the consistency of the description in Section 2 intuitively.  Figure 4 -The estimated FD and PFD between flow, density, and speed cies in data collection process. In order to ensure that the data are scientific and reliable, they need to be cleaned to remove noise before analysis. Probe pairs data used in this article have abnormal values which need to be assessed and processed. The abnormal data removal method is divided into two steps. The first step is to remove the obvious error data by using the threshold method and the second step is to remove the hidden error data by using the traffic mechanism method. Combined with the data characteristics used in this part, the logical judgment and processing criteria of abnormal data are determined as follows: 1) If the flow is bigger than 600 veh/h, the corresponding PTS data are abnormal and need to be removed. 2) If the speed is higher than 88 km/h, the corresponding PTS data are abnormal and need to be removed. 3) If one of the flow, speed, and density equals to 0, the corresponding PTS data are abnormal and need to be removed.

Parameters setting
Considering that the driving speed of the Beihuan road shall not be lower than 50 km/h or higher than 80 km/h, and the speed which exceeds the limit for 10% will be penalised, the upper and lower limits of the free flow speed are set to u max =88 km/h and u min =50 km/h, respectively. According to the distribution of the near-steady PTSs in Figure 6, the jam density should be set as κ 3 =20 veh/lane/km of probe pairs. The settings of others parameters are consistent with Section 4.2.
indicates free flow state, the region (k>k 2 ) on the right of the point (q 2 ,k 2 ) indicates congested state, and the nearby region of k 2 is traffic breakdown state. In summary, it can be considered that the proposed new algorithm in this paper can estimate the FD accurately, effectively, and quickly.

EMPIRICAL ANALYSIS
In order to verify the application of the new algorithm in actual traffic which was proposed in Section 3, this section applies it to the probe pairs data of multiple sub-segments on the Beihuan road in Shenzhen, China to estimate the FD and divide traffic state. Then the operability of the new algorithm is verified in practice.

Datasets introduction
The empirical analysis data come from the data of probe vehicles passing through 42 sub-segments of the Beihuan road on 26 March 2018, and it is published on the website of the Shenzhen Shanglong Mathematical Centre. There are no ramps or auxiliary roads on any sub-segments. The number of one-way lanes in each sub-segment is 4, and the speed limit is 80 km/h. Probe pairs data are collected by internet vehicles, and the data is recorded every 10 minutes. Time-space trajectories of sub-segment E1 were drawn in the area on the day (as shown in Figure 5), and it can be seen clearly that the traveling distance between 07:00~10:00 and 16:00~20:00 is smaller than in other periods. This is because the day is Monday and the road is congested due to rush hours.
There may be missing, abnormal, and redundant phenomena in data due to the collection equipment, collection method, weather, and various emergen-

Estimated results of empirical analysis
Under the given parameters, a total of 15 near-stationary segments are extracted from 42 sub-segments, and a total of 2,145 near-steady PTS data meet the conditions of Corollary 2. The estimation of the free flow speed is û=63.0311 km/h, the backward wave speed is ŵ=20.6471 km/h, and the standard deviations are ̂σ U =10.8464, ̂σ C =123.2465. Efficiency of the estimated PFD is shown in Figure 6: the solid line indicates the estimated PFD, the dotted line indicates the estimated PFD plus or minus 2 times standard deviations, and the grey point indicates the near-steady PTSs data. It can be seen from the figure that most of the points are near the estimated PFD, but there are still a few points in the congested part that have large deviations from the estimated results. This is because these points are abnormal themselves. The data cleaning method used in this section still has defects in the identification of this type of abnormal. But in general, almost 95% of

CONCLUSION
The paper proposes an FD estimation method which is suitable for the actual traffic data containing noise. In this new algorithm, the backward wave speed and free flow speed are estimated by linear regression and EM algorithm, which is based on the data of any probe pairs on the sub-segments by taking the appropriate number of sub-segments. Through simulation and empirical analysis, the conclusions are as follows: 1) The behaviour of the probe vehicles in this paper is similar to the general traffic situation. As long as the number of sub-segments is sufficient and the number of probe vehicles on each sub-segment is sufficient and sampled randomly, the estimated results can be used as a representative of the whole traffic. 2) In addition to estimating the free flow speed and the backward wave speed of the triangular FD, other auxiliary variables (such as ̂σ U and ̂σ C ) are also estimated, indicating that the new algorithm captures the randomness of the FD. 3) Some steps of the new algorithm proposed in this paper are completely dependent on speed.
Although it is acceptable in terms of its empirical performance, other parameters also need to be taken into consideration in future studies, such as density, time, and so on. 4) The following aspects also need to be considered in future research: firstly, the research on data cleaning to minimise the impact of noise data on the estimated results. Secondly, in order to make the research method independent of the exogenous hypothesis, it is necessary to consider acquiring the jam density from satellite remote sensing. Finally, it should be extended to other forms of the FD function, for example, Wu's FD.