Multiclass Classification with Imbalanced Datasets for Car Ownership Demand Model – Cost-Sensitive Learning

Patiphan Kaewwichian

doi:10.7307/ptt.v33i3.3728

Patiphan Kaewwichian, Faculty of Engineering, Rajamangala University of Technology Isan, Khon Kaen, Thailand

Keywords: cost matrix, decision trees, k-nearest neighbors (kNN), cross-validation, tour-based model

Abstract

When travel demand is predicted from a household car ownership model, using imbalanced data in a machine learning model that supports transportation policy negatively affects the algorithm's training process. The household car ownership data obtained from the study project for the expressway preparation in Khon Kaen Province (2015) form an imbalanced dataset: the minority class has fewer members than the other answer classes, which biases the classification. This research therefore proposes compensating for the imbalance with cost-sensitive learning applied to decision tree, k-nearest neighbors (kNN), and naive Bayes algorithms. Before the 3-class model was built, k-fold cross-validation was used to evaluate the classifiers, with the true positive rate (TPR) as the measure of model performance. The kNN algorithm predicted the minority class better than the other algorithms. For the rural and suburban area types, which have very different imbalance ratios, its TPR before balancing was 46.9% and 46.4%, respectively. After balancing the data (MCN1), the TPR values were 84.4% and 81.4%, respectively.
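The idea behind cost-sensitive kNN and the per-class TPR can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the toy data, and the cost values are all hypothetical, and the cost matrix shown is not MCN1. The sketch estimates class probabilities from the k nearest neighbours, predicts the class with the lowest expected misclassification cost, and computes the TPR of each class.

```python
from collections import Counter


def knn_class_probs(train_X, train_y, x, k, classes):
    """Estimate class probabilities as each class's share of the k nearest neighbours."""
    # Squared Euclidean distance is enough for ranking neighbours.
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return [votes[c] / k for c in classes]


def min_cost_class(probs, cost, classes):
    """Return the class with the lowest expected misclassification cost.

    cost[i][j] is the (hypothetical) cost of predicting classes[j]
    when the true class is classes[i]; the diagonal is zero.
    """
    n = len(classes)
    expected = [sum(probs[i] * cost[i][j] for i in range(n)) for j in range(n)]
    return classes[expected.index(min(expected))]


def tpr_per_class(y_true, y_pred, classes):
    """True positive rate (recall) of each class: correct predictions / actual members."""
    rates = {}
    for c in classes:
        actual = [p for t, p in zip(y_true, y_pred) if t == c]
        rates[c] = sum(p == c for p in actual) / len(actual) if actual else 0.0
    return rates


if __name__ == "__main__":
    classes = [0, 1, 2]  # assumed labels, e.g. 0, 1, and 2+ cars per household
    # Hypothetical cost matrix: missing the rare class 2 costs ten times more.
    cost = [
        [0, 1, 1],
        [1, 0, 1],
        [10, 10, 0],
    ]
    # Toy one-feature training set in which class 2 is the minority.
    train_X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
    train_y = [0, 0, 0, 1, 1, 2]
    probs = knn_class_probs(train_X, train_y, [4.6], k=3, classes=classes)
    print("probabilities:", probs)
    print("cost-sensitive prediction:", min_cost_class(probs, cost, classes))
    print("TPR:", tpr_per_class([0, 0, 1, 2], [0, 1, 1, 0], classes))
```

With equal off-diagonal costs the rule reduces to ordinary majority voting. Raising the cost of misclassifying the minority class shifts borderline predictions toward it, which is how cost-sensitive learning can raise the minority-class TPR without resampling the data.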
