MULTICLASS CLASSIFICATION WITH IMBALANCED DATASETS FOR CAR OWNERSHIP DEMAND MODEL – COST-SENSITIVE LEARNING

In travel demand prediction with a household car ownership model, using imbalanced data to support transportation policy via a machine learning model negatively affects the algorithm training process. The household car ownership data obtained from the expressway preparation study project in Khon Kaen Province (2015) was an imbalanced dataset; in other words, the number of members of the minority class is lower than that of the other answer classes, which biases the data classification. Consequently, this research suggests balancing the datasets with cost-sensitive learning methods applied to the decision tree, k-nearest neighbors (kNN), and naive Bayes algorithms. Before creating the 3-class model, a k-fold cross-validation method was applied to classify the datasets and define the true positive rate (TPR) for the model's performance validation. The outcome indicated that the kNN algorithm demonstrated the best performance for minority-class prediction compared to the other algorithms. For the rural and suburban area types, which are the region types with very different imbalance ratios, it provided TPRs of 46.9% and 46.4% before balancing the data; after balancing the data (MCN1), the TPR values were 84.4% and 81.4%, respectively.


INTRODUCTION
Data classification is an analysis method used to define data patterns, classification models, and classification rules. It predicts different data types, either present or future, such as travel demand. Several component models are used, including household car ownership models, trip generation models, tour generation models, trip distribution models, travel time choice models, and travel route choice models [1], with either the trip or the tour as the unit of analysis [2]. There are several techniques for data classification [3], e.g. the decision tree (DT), which represents different logical conditions; k-nearest neighbors (kNN), which uses mathematical distance or weight calculations; and naive Bayes, which estimates probabilities from the training data. The selection of a high-performing technique should rely on the parameters indicating data classification performance, e.g. accuracy, precision, recall, and F1-score. Still, these techniques do not work well on every dataset: some work more effectively on balanced data, in which the classes contain similar numbers of samples [4], than on imbalanced data, in which the classes contain different numbers of samples. Imbalanced data classification is therefore a challenging issue, because some minority classes contain significant or noteworthy data. Consequently, for more effective data analysis, the model's ability to classify the minority class needs to be improved before algorithm training, using parameters suitable for the imbalanced data [5,6].
In imbalanced data, the class sizes can be completely different. This class imbalance is a critical issue often found in the research fields of medical science [7], marketing, banking, and the production industry [8,9]. However, it is still rare in transportation planning, especially when such data are used with machine learning models, which are popular, state-of-the-art approaches [10], to predict household car ownership.
To address this problem, several methods have been invented to fix imbalanced data at the data level and at the algorithm level in order to improve the classification of the minority class.


CLASS DISTRIBUTION BALANCING
This section will explain the problem that might exist due to the imbalanced data distribution in each target class and the classification performance indicators for the imbalanced data. The final part is a review of the CSL methods.

The class imbalance problem
Imbalanced data can practically be seen as unequal numbers of samples in each target class; most classification problems studied involve two categories, as seen in Figure 1. Specifically, this research focuses on imbalanced datasets with a 3-class problem, as found in transportation engineering studies, e.g. travel mode choice [15]. In other words, one class contains fewer samples than the other classes in the same dataset. Accordingly, in the literature review, the minority class is the one that most catches our attention [16,17]. When transportation problem data are imbalanced, most standard algorithms cannot classify the information correctly, because they were designed as accuracy-oriented models. The results are biased toward the majority classes, which are easier for algorithm training.
The correct classification of the majority class, or negative instances, affects the accuracy metrics more than the correct prediction of the minority class, or positive class examples [11]. Therefore, a positive class example might be ignored (or treated as noise), since the standard prediction rule notably achieves a higher accuracy rate on negative class examples. Precisely, at the data level, imbalanced data can be addressed via sampling techniques. Meanwhile, at the algorithm level, the algorithm's performance can be improved with a helpful technique during the training process so that it effectively predicts unseen data during testing, such as cost-sensitive learning methods (CSL) [12]. The classification performance at both levels was similar [13], but with growing data, CSL methods performed better than the sampling methods [14]. Consequently, this research aimed to improve the minority class with a cost matrix table with two cost-adjustment settings.
This research proposed a useful technique to improve the algorithm's performance in classifying the household car ownership demand model with a 3-class problem. The study used CSL methods to counter the imbalanced data and its negative effect on classification performance on the minority class, and feature selection, a feature-level data management technique, to find the top ten parameters with optimal weights. Finally, the data classification performance was assessed with the true positive rate (TPR), F1-score, accuracy, false negative rate (FNR), and false positive rate (FPR).
The paper is organized as follows: after the introduction, Section 2 focuses on class distribution balancing, performance indicators, and solutions to the class imbalance problem at the algorithm level. Section 3 presents the algorithms selected for the study. The description of the experiment can be found in Section 4, while the obtained results and discussion are presented in Section 5. The concluding remarks and future work are outlined in Section 6.
If the F1-score was high, both precision and TPR would be high too.
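These formulas can be collected into a small helper; the function name and the confusion counts in the example are illustrative, not taken from the study:

```python
def class_metrics(tp, fp, tn, fn):
    """Per-class metrics (in percent) from one-vs-rest confusion counts."""
    accuracy = (tp + tn) / (tp + fn + fp + tn) * 100
    precision = tp / (tp + fp) * 100
    tpr = tp / (tp + fn) * 100          # recall / sensitivity
    fpr = fp / (fp + tn) * 100
    fnr = fn / (fn + tp) * 100
    f1 = 2 * precision * tpr / (precision + tpr)
    return {"accuracy": accuracy, "precision": precision,
            "tpr": tpr, "fpr": fpr, "fnr": fnr, "f1": f1}

# Hypothetical counts for the minority Class 0:
m = class_metrics(tp=30, fp=10, tn=150, fn=10)
```

Note that the F1-score, as the harmonic mean of precision and TPR, is high only when both components are high.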

Cost-sensitive learning, CSL
The cost-sensitive approach assigns unequal weights to each class so that the minority class receives more weight and the majority class less. In effect, the CSL method weights all predictions using a cost matrix. Like the confusion matrix, its numbers of rows and columns equal the number of classes; incorrect predictions are assigned higher weights than correct predictions, and correct predictions have a cost of 0. The model takes these weights into account and minimizes the total cost.
To balance the data with the CSL method, the researcher built the cost matrix table by randomly adjusting the false negative (FN) and false positive (FP) costs, with a minimum of 5.0 and a maximum of 25. This adjustment was repeated until the cost matrix parameter was lower or stable [21].
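As a minimal sketch (not the paper's RapidMiner implementation), a cost matrix can be applied at prediction time by choosing the class with the lowest expected cost; the probabilities and penalty values below are illustrative:

```python
def cost_sensitive_predict(probs, cost):
    """Pick the class j minimizing the expected misclassification cost,
    where probs[i] estimates P(true class = i) and cost[i][j] is the
    penalty for predicting j when the true class is i."""
    n = len(cost)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return min(range(n), key=lambda j: expected[j])

# Illustrative 3-class cost matrix (rows = true class, columns = predicted,
# diagonal = 0 as in the paper): FN errors on the minority Class 0 cost 5x.
COST = [
    [0.0, 5.0, 5.0],  # true Class 0 predicted as Class 1 or 2 (minority FN)
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# An accuracy-oriented rule would pick Class 1 (highest probability) here,
# but the cost-sensitive rule picks the minority Class 0:
pred = cost_sensitive_predict([0.40, 0.45, 0.15], COST)
```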
From the above, this article aimed to improve classification performance on the minority class, i.e., households without a car (Class 0), whereas families with one car and with 2+ cars (Class 1 and Class 2) formed the majority class. The researcher also used the imbalance ratio (IR), defined as the number of negative class examples (the majority class) divided by the number of positive class examples (the minority class), as the reference value for each area type. To be exact, if the IR is higher than 9 [18], the dataset is highly imbalanced; if it is lower than 9, the imbalance is either moderate or low.
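A small helper for this ratio (the label counts below are hypothetical, not the study's Table 3 values):

```python
def imbalance_ratio(labels, minority=0):
    """IR = (# negative / majority examples) / (# positive / minority examples).
    Values above 9 indicate a highly imbalanced dataset [18]."""
    pos = sum(1 for y in labels if y == minority)
    neg = len(labels) - pos
    return neg / pos

# Hypothetical household labels: 0 = no car, 1 = one car, 2 = two or more cars
labels = [0] * 50 + [1] * 180 + [2] * 80
ir = imbalance_ratio(labels)   # (180 + 80) / 50 = 5.2
```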

Performance evaluation
A validation technique was necessary to affirm the algorithm's classification performance appropriately. It could guide the model creation; therefore, this research intended to suggest a useful method to validate the algorithm's classification performance on the 3-class imbalanced data for each target class. On this matter, both accurate and inaccurate results would be directly recorded in a confusion matrix table, as presented in Table 1 adapted from [19].
As presented in the confusion matrix table, several values are regularly used to validate a model's classification performance, including accuracy, precision, recall (also called sensitivity or TPR), FPR, FNR, and F1-score. Explicitly, accuracy represents the model's correctness considering all classes. Precision indicates the model's correctness considering each class separately, as does TPR. FPR represents the ratio of the majority class misclassified as the minority, FNR the ratio of the minority class misclassified as the majority, and the F1-score results from the precision-recall evaluation, considering each class one at a time. However, if the data are imbalanced, the model's classification performance on the minority class is typically evaluated with the TPR, since the TPR can reflect the actual travel distribution data [20]. The table presents TP, FP, TN, and FN, where TP is data correctly predicted as the target class; FP is data classified as Class 0 but belonging to other classes; TN is data correctly predicted as classes other than Class 0; and FN is data classified into other classes but belonging to Class 0. Note that TN is the counterpart of TP, and FN of FP. These values give accuracy = (TP+TN)/(TP+FN+FP+TN)·100; precision = TP/(TP+FP)·100; TPR = TP/(TP+FN)·100; FPR = FP/(FP+TN)·100; FNR = FN/(FN+TP)·100; and F1-score = (2·precision·TPR)/(precision+TPR).
To avoid overfitting, trees are generally pruned to improve the predictability of decision structures (see [25,26] for more details).

k-Nearest neighbors (kNN)
The kNN algorithm compares an unknown sample with the k training samples that are its closest neighbors. Preliminary theoretical results can be found in [27], and a comprehensive overview in [28]. The first step in applying the kNN algorithm to a new example is to find its k closest training examples. "Proximity" is determined by a distance in the n-dimensional space defined by the number of attributes of the training examples.
Different metrics, such as the Euclidean distance, can be used to calculate the distance between the new example and the training examples. Because the distance is based on absolute magnitudes, it is necessary to normalize the data before training and using the kNN algorithm.
In the next step, the kNN algorithm classifies the unknown sample by a majority vote of the neighbors it finds. In the case of regression, the predicted value is the average of the neighbors' values.
In an imbalanced training dataset, examples of a small class occur sparsely in the data space. For a test example, the k closest neighbors found have a high probability of coming from a prevalent class, so test cases from small classes are likely to be classified incorrectly. The research in [29] and [30] reports this observation.
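A minimal sketch of this neighbor-voting rule, with made-up points rather than the study's data:

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k=5):
    """Classify x by majority vote among its k nearest neighbors, using the
    Euclidean distance; train_X is assumed to be normalized already."""
    nearest = sorted(range(len(train_X)),
                     key=lambda i: math.dist(train_X[i], x))[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Two made-up clusters standing in for Class 0 and Class 1:
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = [0, 0, 0, 1, 1, 1]
label = knn_predict(X, y, [0.5, 0.5], k=3)
```

The vote counting also makes the imbalance problem visible: if Class 1 dominated the training set, it would dominate most neighborhoods as well.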

Naive Bayes
Naive Bayes is a technique for constructing classifiers: high-bias, low-variance classifiers that build a good model even with a small dataset. It is a probabilistic classifier based on Bayes' theorem. Naive Bayes classifiers assume that the value of a specific attribute is independent of the values of the other attributes, given the class variable. Bayes' theorem: P(C|A) = P(A|C)·P(C)/P(A), where P(C|A) is the probability that data with attribute A has class C, P(A|C) is the likelihood of attribute A given class C in the training data [31], P(A) is the probability of attribute A, and P(C) is the probability of class C.

ALGORITHMS SELECTED
In this section, the researcher presents the algorithms selected for the study and the critical problem that standard machine learning algorithms cannot work effectively with imbalanced data.
Many machine learning algorithms use the class distribution in the training datasets to estimate the likelihood of each class when predicting data. Accordingly, in this research, several machine learning algorithms, e.g. decision trees (DT), k-nearest neighbors (kNN), and naive Bayes, are made to treat the minority class as being as important as the majority class.
However, other machine learning algorithms are also used to classify information, for instance the artificial neural network (ANN), a technique based on computer simulations of human brain activity [22]; this network is a processing unit that produces either a linear or a nonlinear mapping between input and output variables. Support vector machines (SVM) are among the most popular and most discussed machine learning algorithms; their learning strategy is to find the optimal separating hyperplane that maximizes the margin and reduces training errors, based only on the margin data points [23,24].

Decision trees
DT is an explanatory technique that summarizes facts, or the related data, to construct the rules of the decision tree. It is among the most frequently implemented techniques, since it makes the model interpretable and the data more understandable. The model is created using repeated attribute partitioning.
At each level of the tree (starting from the root node), the algorithm finds the information gain ratio (IG) of each attribute or feature against the class and selects the attribute with the highest IG as the node: the selected attribute splits the data examples so that, if possible, each branch holds examples of the same class (maximizing class homogeneity). The ultimate goal of the decision tree algorithm is to separate all data into subgroups with the same answers or classes, i.e., a sequence of data splits that generates appropriate if-then rules; the resulting rules, read from root to leaf, can explain an example whose information is complete. In other words, this process is repeated until the last (leaf) node, where every node classifies the samples into individual subgroups with a homogeneous class; after that, the process stops, and the decision tree model is created.
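The gain-ratio criterion described above can be sketched as follows (a C4.5-style computation, assumed rather than taken from the paper's software tool):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """C4.5-style gain ratio of a categorical split: information gain
    normalized by the entropy of the split itself."""
    n = len(labels)
    groups = {}
    for v, lab in zip(feature_values, labels):
        groups.setdefault(v, []).append(lab)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0
```

A feature that perfectly separates the classes scores 1.0, while a feature unrelated to the classes scores 0.0; the attribute with the highest score becomes the node.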
Although it assumes the unrealistic condition that attribute values are conditionally independent, it performs surprisingly well on substantial datasets where this assumption approximately holds [32].
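A minimal categorical naive Bayes sketch with Laplace correction (the attributes and data are made up for illustration):

```python
from collections import Counter

def naive_bayes_predict(X, y, x_new, laplace=1.0):
    """Pick the class maximizing P(C) * prod_j P(a_j | C), estimating each
    term from counts with Laplace correction (attribute independence assumed)."""
    class_counts = Counter(y)
    n = len(y)
    best_class, best_score = None, -1.0
    for c, nc in class_counts.items():
        score = nc / n  # prior P(C)
        for j, a in enumerate(x_new):
            match = sum(1 for xi, yi in zip(X, y) if yi == c and xi[j] == a)
            n_values = len({xi[j] for xi in X})  # distinct values of attribute j
            score *= (match + laplace) / (nc + laplace * n_values)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical attributes: (area type, holds a driving licence) -> car class
X_nb = [("urban", "yes"), ("urban", "yes"), ("rural", "no"), ("rural", "no")]
y_nb = [1, 1, 0, 0]
pred_nb = naive_bayes_predict(X_nb, y_nb, ("urban", "yes"))
```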

Parameters
This section presents the default parameters taken from the RapidMiner Studio Educational 9.7 software tool for each algorithm: the decision tree defaults (criterion: gain ratio; maximal tree depth: 20; confidence level: 0.25; minimal gain of a node: 0.1; minimal leaf size: 2.0); the kNN defaults used to measure the distance between the predicted data point and its k neighboring data points [33] (k = 5; measure type: Euclidean distance); and the naive Bayes default (Laplace correction). The researcher also defined the weights in the cost matrix table to observe their effects, as presented in Table 2.
Before and after data balancing, the data were used to create and test the performance of the household car ownership demand model with each machine learning algorithm. All results were then compared using statistical significance tests (t-test, alpha = 0.05).

DATASETS
In this research, the travel data from the Engineering, Economical, Financial, and Environmental Feasibility Study for the Khon Kaen Expressway Master Plan 2015 (Thailand) was used. Because the study area population has similar characteristics, systematic random sampling was used to reach 2,015 households (2% of the total households in the target area; 4,757 people provided travel information and 616 did not). The data collection was conducted through face-to-face interviews. The participants were chosen from 73 zones, and a GIS database was used to categorize the study area; 10 more zones from the suburban and urban areas were added, for a total of 83 zones. Residential density classified these zones into 4 area types: central business district (CBD), urban, suburban, and rural, as shown in Figure 2. These area types indicate the travel characteristics of each household car and are among the variables indicating the origin region type, as well as the primary destination location of each trip under one tour.

Data balancing (algorithm-level)
In this section, the researcher attempted to solve the imbalanced data at the algorithm level, based on the imbalance of each area type, using cost-sensitive learning methods (CSL) and the kNN algorithm, the best-performing algorithm in this study. The study strictly aimed to improve the model's performance in classifying the minority, or "positive", class with a higher TPR. Additionally, the researcher defined the cost matrix table (Table 2).
This research began with a no-cost classification (NMC) that classifies the imbalanced data without CSL adjustment, meaning that the cost of every prediction error was equal. The MCN1-5 cases attempted to reduce the FN-error by defining a higher penalty for it than for other mistakes, with penalties of 5, 10, 15, 20, and 25; the MCNP1 case attempted to reduce errors from both FN and FP, where the FN-error should be lower than the FP-error. In this regard, both MCN1-5 and MCNP1 considered the model's performance in predicting Class 0. Figures 4 and 5 illustrate the impact on the TPR and the FPR of Class 0, respectively, in the FN-error reduction case. Figure 4 shows that the TPR in all area types was higher for every dataset when different costs were defined; still, the FPR in Figure 5 also increased. Significantly, once the cost adjustment reached a certain level, the TPR stabilized, indicating that when data with an IR from 2.83 to 5.20 had been balanced with CSL and the kNN algorithm was used to create the model, the classification of the minority class (Class 0) became more accurate. Consequently, an appropriate cost selection helped maximize the model's prediction performance on the minority class (Class 0), but an increase in the FPR was still unavoidable.
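The MCN cost matrices described here can be sketched as follows; the 3-class layout (rows = true class, columns = predicted class) is an assumption about how the table in Table 2 is organized:

```python
def mcn_cost_matrix(penalty):
    """Cost matrix for the 3-class problem in an MCN-style case: correct
    predictions cost 0, ordinary errors cost 1, and only the FN errors on the
    minority Class 0 (row 0, off-diagonal) carry the extra penalty."""
    cost = [[0.0 if i == j else 1.0 for j in range(3)] for i in range(3)]
    cost[0][1] = cost[0][2] = float(penalty)
    return cost

# MCN1-5 sweep the FN penalty over 5, 10, 15, 20 and 25 and keep the setting
# where the cross-validated TPR of Class 0 stops improving:
candidates = [mcn_cost_matrix(p) for p in (5, 10, 15, 20, 25)]
```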
Table 3 presents the household car ownership dataset summary and the imbalance ratio [34] across the study area. Class 0 was the minority class, and the rest formed the majority class.

RESULTS
To test the performance of the DT, kNN, and naive Bayes algorithms on the imbalanced data, the researcher compared the predicted results of each algorithm on each area type before using the best-performing algorithm to create the model while balancing the data with cost-sensitive learning methods (CSL). Meanwhile, k-fold cross-validation was used to develop and validate the model's performance before and after data balancing. The default parameters of each algorithm and the ten parameters selected by weight optimization were used for model creation and validation.
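A minimal sketch of the k-fold split underlying this validation (index bookkeeping only, no model):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Split n sample indices into k disjoint folds; each fold serves once as
    the test set while the remaining folds train the model."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), sorted(fold)) for fold in folds]

splits = kfold_indices(20, k=5)
```

Averaging the per-fold TPR of Class 0 over all k folds gives the cross-validated estimate compared before and after balancing.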

An experimental comparison
The findings indicated that the kNN algorithm provided a high TPR, with a higher accuracy rate in classifying the minority class (Class 0) at every imbalance ratio (Figure 3a). It also provided a low error rate in assigning datasets to classes other than Class 0, compared to the other algorithms (FNR) (Figure 3b). Apparently, when the imbalanced data were classified by standard classification algorithms, the results were completely biased toward the majority classes, Class 1 and Class 2; hence, the FNR was close to 100%, for instance for the DT model in the suburban area (IR = 5.20), whereas the kNN algorithm gave the lowest FNR at every IR in each area type. To highlight the classification performance, the researcher therefore used the kNN algorithm to create and validate the model on every imbalanced dataset in each area type, with CSL methods applied to balance each dataset before training.
The reduction of both the FN-error and the FP-error (MCNP1) for better prediction performance on the minority class (Class 0) is given in Figures 6 and 7. The figures show that the defined cost matrix table decreased both the TPR and the FPR compared to the FN-error reduction (MCN1). This resulted from the dataset used to create and validate the model containing more of the majority classes (Class 1 and 2) than the minority class (Class 0); accordingly, the kNN classification also returned the majority classes more often. The FN-error reduction thus gave the majority classes a higher chance of being chosen in prediction, while the minority class remained less likely to be chosen despite its higher cost.
After considering the F1-score for every area type (IR = 2.83-5.20) in Figure 8, the FN-error reduction (defining a higher penalty in the FN cells of the cost matrix table than for the other errors) provided a higher F1-score for the minority class (Class 0) at MCN1 than in the non-balanced case (NMC). However, at MCN2-5, the F1-score seemed to decrease. It usually happened that while the prediction performance on Class 0 kept increasing until it finally stabilized, precision was continuously decreasing; accordingly, the F1-score of Class 0, an overall performance indicator, declined gradually.
When comparing the case reducing both the FN-error and the FP-error in Figure 9 with the FN-error reduction alone, the F1-score was higher for the area types with low and moderate imbalance ratios (IR = 2.83-5.09). However, when the IR of an area type rose to 5.20 (the suburban area type), the F1-score of the minority class (Class 0) was the exception and did not increase.
The research solved the imbalance problem at the algorithm level using CSL methods according to the imbalance of each area type: rural, total, urban, CBD, and suburban. With the default parameters of the kNN algorithm, MCN1 provided the best predictive performance on Class 0 (the minority class) after balancing the data with the best cost matrix, higher than NMC in all area types; the TPR values for these area types were 84.4%, 86.3%, 86.4%, 85.2%, and 81.4%, respectively. The results show that balancing datasets before processing is beneficial: the model had a higher TPR (a lower learning error rate), and choosing the appropriate cost table improved the predictive performance on the minority class (Class 0).
For kNN, the next key point is to find the best k parameter in the MCN1 case, namely the k giving the highest F1-score and the lowest FPR. The search starts from the default k = 5 and then tries different k values up to 100; k values lower than 5, such as 1 or 3, are not considered because they may not discriminate well [35].
For all datasets within the study area, the results confirm the suitability of k = 5: as shown in Figure 10, which plots validity against k, k = 5 gives the best classification accuracy.
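This search can be sketched as a simple argmax over a caller-supplied score; the score curve below is purely hypothetical, standing in for the cross-validated results behind Figure 10:

```python
def best_k(evaluate, k_values=range(5, 101, 5)):
    """Return the k maximizing a caller-supplied score, e.g. the
    cross-validated F1-score of Class 0 in the MCN1 case."""
    return max(k_values, key=evaluate)

# Purely hypothetical score curve that peaks at k = 5, mirroring the
# behaviour reported for these datasets:
scores = {k: 1.0 / k for k in range(5, 101, 5)}
k_star = best_k(lambda k: scores[k])
```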

CONCLUSION AND FUTURE WORK
Explicitly, this research aimed to build a useful model for household car ownership demand prediction in five target area types. The imbalance ratio ranged from 2.83 to 5.20, which negatively affected the model's ability to predict the minority class (Class 0). This research also highlighted the significance of data preparation: the parameters were selected from trip-based and tour-based models via weight optimization to find the top ten parameters with optimal weights for model creation. A cross-validation method was then used to create the model and test whether it performed well when classifying the data with standard algorithms.
Conclusively, the research outcome revealed that when the best-performing kNN algorithm was used to create the household car ownership demand model with a 3-class problem, and the data were balanced by a cost-sensitive learning method at the algorithm level, the model's classification performance on the minority class (Class 0) improved significantly when only the FN-error was reduced. When both the FN-error and the FP-error were reduced through the cost matrix table, however, the TPR of the minority class decreased significantly at each imbalance ratio. Moreover, when the TPR of the minority class increased, the other indicators could either increase or decrease depending on the specific case: for example, the F1-score decreased when only the FN-error was reduced, but increased at every imbalance ratio when both the FN-error and the FP-error were reduced, with IR = 5.20 as the exception. As a result, the cost matrix table must be defined carefully, since it can raise or lower the model's classification performance on household car ownership demand. Hence, future work should focus on developing a better-performing model to solve class imbalance with sampling techniques at the data level (under-sampling, over-sampling, and their combinations) and with ensemble and semi-supervised classifiers at the algorithm level. These will be tested to increase prediction and policy formulation opportunities for better urban transportation planning through a machine learning model with appropriate household characteristics.

ACKNOWLEDGMENTS
I would like to thank Mr. Sorasak Seawsirikul, a lecturer at the Department of Civil Engineering, Faculty of Engineering at the Rajamangala University of Technology Isan, Khon Kaen Campus, for his assistance with the coding process. I would also like to express gratitude to the Faculty of Engineering, Khon Kaen University, for supporting this work by providing the Khon Kaen Expressway Master Plan (Thailand) for 2015, which contributed significantly to this article.