Thai Text-based Classification for Small and Imbalanced Dataset
DOI:
https://doi.org/10.22399/ijcesen.3015
Keywords:
Text Classification, Imbalanced Data, Data Augmentation, Thai Text
Abstract
Insufficient and imbalanced data are critical issues in developing a text-based classifier because they degrade its performance, leading to underfitting and inaccuracy. This work presents a method that applies text data generation techniques to systematically increase the amount of data and reduce the difference in sample counts between classes. Data augmentation methods, such as synonym replacement, and data synthesis methods are used to generate additional Thai text data from existing data. For training classification models, the generated data and the original data are merged to address the small and uneven dataset issues and to improve classification performance. In the experimental results, the overall F1 score obtained from all datasets with generated data is significantly higher than the baseline obtained with only the original dataset. The major categories gain higher precision, while the minor categories show greatly improved recall. Among single generation methods, data augmentation by synonym replacement produces the highest F1 score, 0.83. The combination of synonym replacement and GPT-4 yields the best average classification performance, with an F1 score of 0.86.
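The idea of generating minority-class text by synonym replacement and merging it with the original data can be illustrated with a minimal Python sketch. It is not the authors' implementation: the synonym table `SYNONYMS`, the helper names `synonym_replace` and `balance_with_augmentation`, the choice to grow every class to the size of the largest one, and the assumption that the Thai text is already segmented into word tokens are all hypothetical choices made for illustration. The GPT-4 based synthesis mentioned in the abstract is not covered.

```python
import random
from collections import Counter

# Hypothetical toy synonym table (assumption): a real system would draw
# synonyms from a Thai lexical resource rather than a hand-written dict.
SYNONYMS = {
    "ดี": ["เยี่ยม", "เลิศ"],
    "แย่": ["เลว", "ห่วย"],
}


def synonym_replace(tokens, synonyms, rng, n_swaps=1):
    """Return a copy of tokens with up to n_swaps words swapped for a synonym."""
    out = list(tokens)
    # Positions whose word has at least one known synonym.
    candidates = [i for i, tok in enumerate(out) if tok in synonyms]
    rng.shuffle(candidates)
    for i in candidates[:n_swaps]:
        out[i] = rng.choice(synonyms[out[i]])
    return out


def balance_with_augmentation(samples, synonyms, seed=0):
    """Merge original samples with augmented copies of the smaller classes.

    samples is a list of (tokens, label) pairs. Each class is grown to the
    size of the largest class (an assumed target, not taken from the paper).
    A sample with no replaceable word yields an unchanged duplicate.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    merged = list(samples)  # original data is kept as-is
    for label, count in counts.items():
        pool = [tokens for tokens, lab in samples if lab == label]
        for _ in range(target - count):
            new_tokens = synonym_replace(rng.choice(pool), synonyms, rng)
            merged.append((new_tokens, label))
    return merged


# Pre-tokenized toy dataset: three "pos" samples, one "neg" sample.
original = [
    (["อาหาร", "ดี"], "pos"),
    (["บริการ", "ดี"], "pos"),
    (["ร้าน", "ดี"], "pos"),
    (["อาหาร", "แย่"], "neg"),
]
merged = balance_with_augmentation(original, SYNONYMS)
print(Counter(label for _, label in merged))
```

In this toy run the single "neg" sample is augmented twice so that both classes end with three samples each. The merged list would then be passed to whatever classifier is being trained, and per-class precision, recall, and F1 compared against a model trained on `original` alone.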
License
Copyright (c) 2025 International Journal of Computational and Experimental Science and Engineering

This work is licensed under a Creative Commons Attribution 4.0 International License.