UNDERSTANDING DAILY TRAVEL PATTERNS OF SUBWAY USERS – AN EXAMPLE FROM THE BEIJING SUBWAY

The daily travel patterns (DTPs) present short-term and timely characteristics of the users’ travel behaviour, and they help subway planners to better understand the travel choices and regularity of subway users (SUs) in detail. While several well-known subway travel patterns have been detected, such as commuting modes and shopping modes, specific features of many patterns are still unclear or overlooked. Now, based on the automatic fare collection (AFC) system, a data-mining procedure to recognize DTPs of all SUs has become possible and effective. In this study, DTPs are identified by the station sequences (SSs), which are modelled from smart card transaction data of the AFC system. The data-mining procedure is applied to a large weekly sample from the Beijing Subway to understand DTPs. The results show that more than 93% of SUs of the Beijing Subway travel in 7 DTPs, which are remarkably stable in share and distribution. Different DTPs have their own unique characteristics in terms of time distribution, activity duration and repeatability, which provide a wealth of information to calibrate different types of users and characterize their travel patterns.


INTRODUCTION
Urban rail transit has become an indispensable option for daily travel in China, especially for commuters in metropolises such as Beijing and Shanghai [1,2]. In 2015 the Beijing Subway carried 3.32 billion passenger trips, and the average daily passenger volume reached 9.11 million trips [3]. In Shanghai, rail transit accounted for 53% of public transportation passengers, while the proportion of bus transit decreased to 33% [4].
With millions of people choosing rail transit as their primary travel mode, congestion occurs in the peak hours. Many measures for passenger inflow control have been taken in the Beijing and Shanghai Subway to limit the number of passengers entering the subway station and waiting at the platform [5,6]. Other measures such as pre-trip information are also used to help travellers select routes and avoid crowded stations. For example, a mobile phone app function has been launched for passengers to check the real-time congestion of the Beijing Subway. However, this does not fundamentally solve the problem of subway congestion. Only with an in-depth understanding of the users' travel behaviour characteristics and travel patterns can the nature of subway congestion be discovered and a reasonable solution be proposed [7,8]. Understanding the daily travel patterns (DTPs) of subway users (SUs) at the individual level can provide richer and more useful information to reveal the regularity and characteristics of travellers, and to enhance system performance or better assess network investments [9][10][11].
Travel patterns of transit passengers have been discussed with the automatic fare collection (AFC) data in many research papers. Every transit rider's travel pattern is detected based on the temporal and spatial characteristics by trip chains [8,11,12], multi-week activity sequences [13] or other factors [14,15]. The authors in [8] have classified the travel patterns into five clusters (very low, low, medium, high, very high) with regularity levels of transit customers by the DBSCAN algorithm and found that approximately 41% of transit riders fall into the categories of high and very high (namely commuting patterns), which are further studied in [12]. The authors in [11] focused on detecting the detailed analysis of the characteristics of four user groups (exclusive commuters, non-exclusive commuters, leisure travellers and non-commuter residents), and the results indicate that visitors and registered users have different characteristics in temporal and spatial variability, activity patterns, sociodemographic characteristics, and mode choices. The authors in [13] have investigated passenger heterogeneity and defined eleven clusters (including conventional 9-to-5 commuters, a variety of non-working routines, and non-conventional work routines) as the activity patterns of London public transport users. According to the user's travel map, the authors in [14] used the BP model to classify public transport passengers into the commuting and non-commuting ones.
Public transport passengers from the Beijing subway are divided into five categories in [15], including standard commuter passengers, flexible commuter passengers, high-frequency frequent passengers, life class passengers and short-term low-frequency passengers, and travel intensity, time dimension and spatial dimension have been detected to illuminate the travel characteristics at the aggregate level.
It can be concluded that the current research on passenger travel patterns focuses on passenger classification methods, while the analysis of different types of passengers is mainly concentrated on the level of aggregation. These analyses of trip patterns put emphasis on the identification of different modes rather than the fine-grained analysis of trip characteristics [16]. Although the time and space regularity of different travel types are detected, the daily travel characteristics of subway users are still not clear enough for subway operation optimization and line network planning. Meanwhile, while these works highlight the potential of clustering algorithm to classify the travel patterns, the approaches are still limited with the clustering variables which might ignore or be affected tremendously by some abnormal data, as well as the number of clusters [17]. Moreover, it takes a significant amount of work to reduce the noise and extract indicators before the travel pattern processing [18,19].
Therefore, a simple data mining method was launched to analyse DTPs of SUs from a more detailed perspective, and to mine travel information that can be used for daily operation monitoring and future network planning of the subway system. The existing studies neither fully utilized all AFC records for mining user daily travel patterns nor explored the detailed characteristics for understanding the typical DTPs. These features distinguish this paper from the others. The main contribution of this study is twofold. First, this paper attempts to propose an easy way of identifying and interpreting DTPs of SUs. DTPs will be represented by station sequence (SS), which is obtained by comparison of the stations in SU daily subway trips. Second, the data mining approach is used to understand the daily popular travel behaviours of the Beijing Subway with a week AFC data. DTPs proposed in this paper can more clearly understand the user's travel characteristics, such as the choice preferences of travel time and stations, one-way trip or round trip, random trip or long-lasting trip. Through the analysis of the characteristics of DTPs, the travel habits and needs of most users can be further understood.
The remainder of this paper is organized as follows. The real datasets used in the study are introduced and the proposed data mining process is explained in the following Section. The experimental result is presented and discussed in Section 3. Finally, the paper is concluded by summarizing the research findings and suggesting directions for future research.

Data foundation
At present the Beijing Subway adopted an all-purpose system for billing, wherein the users need to swipe the smart card both inbound and outbound. With the exception of the airport line, the transfers between lines do not require re-swipe. When a passenger swipes into the station, the AFC system records the information of the entry time, line and station of the inbound. When the card is swiped out, the AFC system will fill in the previous record with the information of the exit time, line and station of the outbound [20]. Other information such as deal type, card type, ticket type, device ID, etc. are also recorded in the transaction record.
In order to present the daily travel patterns, seven main fields of the Beijing Subway AFC data are extracted in the data pre-processing. Table 1 shows a random user's extracted transaction records in a week, which are comprised of five subway trips (each trip a line) with the user's card number, the entry and exit time, inbound and outbound line numbers and station numbers.

Station sequence extraction
A trip chain comprising temporal and spatial information is considered a useful way to present and analyse travellers' behaviours [21,22]. In order to depict the individual activity sequence, the stops (e.g. home and work, origin and destination) are used to define the travel patterns [23][24][25]. In this paper, the station sequence (SS), a string of numbers consisting of the station code (SC), is selected to represent the users' trip chains. SC refers to the user's inbound and outbound station by comparing whether the station is the same with the previous ones. The main process is shown in Figure 1, and the specific steps are as follows:
Step 1: Process the site number. The site number (SN) mainly consists of two swipe records, the line number and the station number. These two indicators need to be joined to name SN (shown in Equation 1). The line number and the station number will be changed to character strings, and then they can be joined together instead of simple numeric additions. When the line or station number is less than 10, a 0 is automatically added to the front to form two digits. SN has four characters, of which the first two digits indicate the line number and the last two digits indicate the station number.
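As a minimal sketch of the site number construction in Step 1, the zero-padded string join could look as follows in Python. The function name `make_sn` is hypothetical, and line and station numbers are assumed to be integers between 0 and 99.

```python
def make_sn(line_no, station_no):
    """Join a line number and a station number into a four-character SN.

    Assumption: both inputs are integers in the range 0-99; the function
    name is illustrative and not taken from the paper.
    """
    # Zero-pad each part to two digits, then concatenate as strings
    # (a string join, not a numeric addition).
    return f"{line_no:02d}{station_no:02d}"


print(make_sn(1, 5))    # line 1, station 5
print(make_sn(10, 12))  # line 10, station 12
```

Under these assumptions, line 1 and station 5 give the SN "0105", whereas a numeric addition would collapse the pair to 6 and lose the line information.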

Figure 1 - Station sequence extraction method of SUs
Step 2: Process the transfer site number. Since there are two or more lines passing through the transfer site, SN is different at the same location of different lines. Therefore, SN has to be unified into the smallest one when there are two or more SNs of the transfer site.
Step 3: The transaction records are sorted by card code and renumbered as i (1, 2, 3, ..., n), and the i-th user's transaction records are extracted. n refers to the total number of SUs in a day.
Step 4: The i-th user's transaction records are ranked by entry time and numbered as j (1, 2, 3, ..., m), where m refers to the total number of the i-th user's transaction records.
Step 5: SCs are tagged by comparison of SNs, shown in Figure 2. And the core steps are described from Step 5.1 to 5.3.
where SC(1) and SN(1) refer to SC and SN of the entry station in the j=1 transaction record, SC(2) and SN(2) refer to SC and SN of the exit station in the j=1 transaction record.
Step 5.2: SCs from 2nd to m-th transaction records are tagged. Check the rest of the trip records whether the entry or exit station are the same as the previous SNs.
The abbreviations used in Figure 2 are described as follows: SC(2j-1) and SN(2j-1) refer to SC and SN of the entry station in the j-th transaction record; SC(2j) and SN(2j) refer to SC and SN of the exit station in the j-th transaction record.
Step 5.3: all transaction records of the i-th user are tagged and SS(i) is output.
where SS(i) refers to the i-th user's station sequence.
Step 6: Complete the station sequence extraction for all SUs.
Taking the user's transaction record in Table 1 as an example, the user's SS is identified and shown in Table 2. Some trip characteristics of this user can be easily determined with SS, e.g., the rider travelled in only two different stations during the week, as well as the rider travelled in a round trip on the 17th and 18th but a single journey on the 19th. Other information such as the frequency of stations, travel days, travel time, and duration of trips and between trips can also be calculated by SS at the individual level or at the aggregate level. The empirical data mining analysis indicates that the identified SS is a succinct and visual way to understand SUs' travel choices.
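The tagging described in Steps 2 to 6 can be sketched in Python as follows. This is an illustrative reading of the steps rather than the authors' program: the function names, the record layout `(card, entry_time, entry_sn, exit_sn)` and the `transfer_groups` input are assumptions, and station codes are assumed to be assigned in order of first appearance (1 for the first distinct station, 2 for the next new one, and so on).

```python
from collections import defaultdict


def unify_transfer(sn, transfer_groups):
    """Step 2: map an SN of a transfer site to the smallest SN of that site.

    transfer_groups is a hypothetical list of sets, each holding the SNs
    that denote one physical transfer station.
    """
    for group in transfer_groups:
        if sn in group:
            return min(group)  # zero-padded strings sort like numbers
    return sn


def tag_station_codes(sns):
    """Step 5: give each SN a station code (SC) by comparing it with earlier SNs.

    Assumption: a station seen before reuses its code; a new station gets
    the next integer. Codes are concatenated into the SS string, which is
    only unambiguous while a user visits fewer than ten distinct stations.
    """
    codes = {}
    for sn in sns:
        if sn not in codes:
            codes[sn] = len(codes) + 1
    return "".join(str(codes[sn]) for sn in sns)


def extract_station_sequences(records, transfer_groups=()):
    """Steps 3, 4 and 6: group by card, sort by entry time, tag every user."""
    by_card = defaultdict(list)
    for card, entry_time, entry_sn, exit_sn in records:
        by_card[card].append((entry_time, entry_sn, exit_sn))

    sequences = {}
    for card in sorted(by_card):
        sns = []
        for _, entry_sn, exit_sn in sorted(by_card[card]):
            sns.append(unify_transfer(entry_sn, transfer_groups))
            sns.append(unify_transfer(exit_sn, transfer_groups))
        sequences[card] = tag_station_codes(sns)
    return sequences


# Made-up one-day records: card A makes a round trip, card B a one-way trip.
records = [
    ("A", "18:05", "0210", "0105"),
    ("A", "08:10", "0105", "0210"),
    ("B", "09:30", "0403", "0408"),
]
print(extract_station_sequences(records))
```

With these made-up records, card A (out and back between the same two stations) is tagged 1221 and card B is tagged 12. A dictionary lookup replaces the pairwise comparison with all previous SNs, which gives the same codes under the first-appearance assumption.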

CASE ANALYSIS AND DISCUSSION
Using the station sequence identification method proposed in Section 2, the Beijing Subway AFC data are identified and SSs are calculated by a Python program. To demonstrate DTPs, the data were collected in a typical week from Sunday, October 16 to Saturday, October 22, 2016. There were no special holidays within 7 days before and after the selected week, so the data could avoid any holiday effect. There are 7,181,371 smart cards with 32,743,509 transaction records in the selected week. It only takes a few minutes to finish the data mining work on about 2,729,934 smart cards with 4,677,644 transaction records per day. About 3,800 to 6,300 types of SS per day are obtained through data mining. There are about 4,000 types of DTPs in average on weekdays, and the number rises to around 6,000 on the weekends.
More detailed information is detected and shown in the following section. DTPs features of Beijing SUs are introduced first. The most popular DTPs are shown and discussed in Section 3.1, where DTPs are presented by SS.
For the high similarity of different DTPs in a day, the detailed distribution of entry time and activity duration with one-day data (Tuesday) are investigated in the subsequent analysis, then the repeated travel days with a week of data are extracted to mine information of regular riders in Section 3.2. The inbound time distribution mainly reflects the peak hours of different trips in different DTPs, which is helpful in optimization of the vehicle scheduling. The activity duration characterizes the length of activity of different DTPs, which can be used to determine the travel destination or activity nature of the traveller. With regard to the repeated travel days, it can reflect the repeatability and regularity of the user's travel in a week, and provide basic data support for managers to identify and optimize the commuter travel services. The typical trips of Top 7 DTPs are summarized and discussed in Section 3.3 to understand the behaviour of SUs and to provide more useful finer features for daily management optimization.
Table 3 shows the results of the main DTPs proportions of the Beijing Subway users. Ti is used to present the i-th DTP sorted based on the proportion of all users each day. Only seven DTPs account for more than 1%, while the remaining types are less than 1% (no comment in this study). The proportion of Top 7 DTPs has not changed in this week's ranking, and their total proportion keeps around 93%. About 40% SUs take a one-way trip (T1) in a day, and the proportions of T1 on workdays are smaller than on weekends. About 30% SUs take a round trip (T2) in a day, and the proportions of T2 on workdays are higher than on weekends. The standard deviations of the Top 7 DTPs are extremely small, which indicates that the proportions of the most popular travel patterns are highly stable, although there are some differences between workdays and weekends.
It can also be calculated from Table 3: (1) 73% of SUs travel in two different stations in a day, 90% less than three and about 94% less than four, which implies that most subway riders often choose and travel in fixed stations; (2) most SUs choose the alighting station of the last trip as the boarding station of the next trip.
Meanwhile, the distributions of Top 7 DTPs on workdays and weekends are also highly similar and stable. Taking T1 as an example, the distribution of the entry time of T1 in a week is presented in Figure 3. The horizontal axis in the figure indicates the entry time of trips in hours (1 means 0:00-1:00, 2 means 1:00-2:00 and so on), and the vertical axis indicates the number of trips. The distribution curves of T1 trips on workdays are almost the same with pinnacles from 7:00 to 9:00 and from 17:00 to 19:00. On the other hand, the distribution curve of T1 trips on Sunday is similar to the curve on Saturday, all of which are without significant peaks.

Distributions of entry time analysis
Distributions of entry time analysis present the percentages of Top 7 DTPs based on the entry time, which clearly show the passenger's preference for travel time. In Figures 3 and 4, the majority of Top 7 users depart around morning peak hours (7:00 am~9:00 am) and return during evening peak hours (5:00 pm~7:00 pm). In particular, T6 and T7 have three peak hours (Figure 4b), wherein the first peak hours are the same as T2~T5, the second peak hours [...] (2) one's first trip aims to commute, the second and third trips to return by passing another stop for a short visit.

Activity duration analysis
Durations of stay at the same station have been employed as the simplest attributes correlating with the trip purposes [13,21]. Activity duration analysis in this section focuses on activity intervals between trips of SUs, which is comprised of the transfer duration (TD) and the whole duration (WD). TD refers to the activity duration between trips, which is counted from the exit time of the trip to the entry time of the next trip. WD refers to the activity duration between the first trip and the last trip, which is counted from the entry time of the first trip to the exit time of the last trip in a day. Figure 5 presents the conceptual definition of TD and WD in the sketch of a SU one-day trips, whose DTP is T5, TD is 8 h and WD is 10 h.
Figure 6 shows the distributions of TD and WD of T2~T7 on a working day. Some superficial conclusions can be observed: (1) TD and WD have the same distribution trend and WD is generally one hour later than TD; (2) TD and WD of T2 and T4 are unimodal distributions, while high peaks of TD are clustered at 10 h and WD at 11 h; (3) TD and WD of T3 and T5 are bimodal distributions, with a smaller peak at 4 h to 5 h comparing with T2 and T4; (4) TD and WD of T6 and T7 are equilibrium distributions, though TD-1st is bimodal (peaks at 2 h and 10 h) and TD-2nd is a unimodal (a peak at 2 h). These duration characteristics indicate that the average length of travel for these DTPs is about 1 hour, and the duration of travel between trips is about 9 hours. The interval between two trips in the morning period is relatively small, suggesting that only simple activities have been done here, such as dropping off children to school.
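The TD and WD definitions given in this section translate directly into a small computation. The sketch below is illustrative only: the function name and the `(entry_time, exit_time)` pair layout are assumptions, and the sample times are invented to mirror a two-trip day with an 8 h stay between trips.

```python
from datetime import datetime


def activity_durations(trips):
    """Return (list of TDs, WD) in hours for one user's trips on one day.

    trips: list of (entry_time, exit_time) datetime pairs (assumed layout).
    TD: exit of one trip to entry of the next; WD: first entry to last exit.
    """
    ordered = sorted(trips)  # order trips by entry time
    tds = [
        (nxt_entry - cur_exit).total_seconds() / 3600
        for (_, cur_exit), (nxt_entry, _) in zip(ordered, ordered[1:])
    ]
    wd = (ordered[-1][1] - ordered[0][0]).total_seconds() / 3600
    return tds, wd


# Invented example: 08:00-09:00 outbound, 17:00-18:00 return.
day = [
    (datetime(2016, 10, 18, 8, 0), datetime(2016, 10, 18, 9, 0)),
    (datetime(2016, 10, 18, 17, 0), datetime(2016, 10, 18, 18, 0)),
]
print(activity_durations(day))  # ([8.0], 10.0)
```

For a user with a single trip the TD list is empty and WD reduces to the length of that trip, which is consistent with restricting the duration analysis to DTPs that contain at least two trips.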

Repeated travel days analysis
SUs' travel patterns may differ daily, because they would change their routes up to different purposes, and even if they go to the same place, they might choose other modes of transportation [25,26]. The repeated travel days analysis tries to assess the loyalty of SUs with the variable R, which refers to the number of days in a week that the user travels with the same DTP. Table 4 shows the proportions of the repeated travel days of Top 7 DTPs, which declines significantly as R increases. Most users do not maintain the same DTP every day; however, there is still a certain percentage of users who insist on the same DTP. About 16% of T1 users and 32% of T2 users travel in the same DTP more than 3 days. The authors further extract the data (R≥5) and the distribution of the repeated travel days of T1 and T2 is shown in Figure 7. The horizontal axis indicates the number of travel days, and the vertical axis shows the number of users who travel continuously. The lines in the figure connect the days when the user travels in the same DTP. It can be easily found that most users have the same T1 and T2 from Monday to Friday (R=5, continuous 5-day-travel with the same DTP), then from Monday to Saturday and from Sunday to Friday (R=6). This distribution is consistent with the characteristics of office workers and student groups in Beijing, whose travel is very regular, with a fixed number of travel days and a fixed DTP.

Typical trips of DTPs
With features summarized from detailed analysis of DTPs, the typical trips of Top 7 DTPs are presented in Figure 8 to better understand the travel preferences of Top 7 DTPs. The boarding and alighting time shown in the figure are not the exact times but the approximate times. The similarities and differences between Top 7 DTPs are visually demonstrated in the diagrams. The first trips of Top 7 DTPs in the morning peak hour are similar and the return trips are of uniqueness, which indicates that most SUs' first trips in the morning are more fixed in travel time than the return trips in the afternoon or in the evening, having been proved by [11,18,19,22,25].
The typical trips of DTPs can be used to show finer and common characteristics of different types of SUs, and provide a visual representation of the specific characterization to better understand DTPs [27]. For example, T1 users only travel a single trip by subway one day, implying that they might choose trips by other modes of travel, such as taking a bus, a customized taxi or a neighbourhood carpool which has become popular in China, or they might find it not easy to reach the subway stations by walking [28]. In this case, the station distribution of T1 may provide a wealth of information about inconvenient or insufficient subway stations, which can be used to optimize the subway network.

CONCLUSION
A data mining methodology is proposed for analysing the daily travel pattern features observed by smart card data. Daily travel patterns (DTPs) are reflected by station sequence (SS), which is calculated from station code (SC). By the use of the Beijing Subway's weekly smart card data, the authors have found that the Beijing Subway DTPs are concentrated in seven types. The characteristics of the Top 7 DTPs are discussed in three aspects: the distribution of entry time, the activity duration, and the repeated travel days. In summary, the typical trips of the Top 7 DTPs are extracted and demonstrated in the sketch of the spatial-temporal distribution. The results show that most types of SUs have remarkable stability in share and distribution of all trips, even though most individual users do not travel at the same DTP every day. Meanwhile, different DTPs have their own unique characteristics in terms of time distribution, activity duration and repeatability, which provide a wealth of information to calibrate different types of users and characterize their travel patterns.
By applying the proposed method, the subway managers are able to understand the features of the most popular DTPs. For example, this study represents the users' segments who travel to fixed stations and who contribute the same DTPs in a week. The proposed method can also be applied to the estimation of a weekly travel pattern (WTP) of SUs or other kinds of transportation with the same trip information, such as actual origins and destinations. In terms of data applications, DTP can be extracted in all AFC systems with transaction records regardless of the entry mode (via smart card, QR code, or face) selected by the passenger. However, DTPs might have different manifestations while the subway networks structure and urban land development and utilization differ [29,30]. In addition, the dataset used in this study does not include the users' personal attributes nor the changes of the rail transit schedules and transport operators' policies [31]. In the future, the proposed method will be applied to the assessment of users' travel patterns on the rail transit control centre with more details identified.

ACKNOWLEDGEMENT
This research was funded by the National Natural Science Foundation of China (NSFC), Grant number 51578028 and the Education Department of Fujian Province, China, Grant number JAT160167.