Text Classification of Persian Documents With Deep Learning

This chapter evaluates four deep learning models (RNN, LSTM, GRU, and CNN) on Persian document classification using the Hamshahri dataset, which comprises 345 MB of news text: 166,774 documents across 82 categories. The models were implemented in Python in the Google Colab environment, with the NLTK and Hazm libraries used for text preprocessing. Preprocessing steps included tokenization, normalization, stemming, and stop word removal. Performance was assessed with metrics such as precision, accuracy, and F1 score. The CNN model exhibited superior performance and stability across the preprocessing scenarios, while the RNN was highly sensitive to preprocessing changes. The chapter also investigates the effects of different preprocessing methods, finding that stemming and stop word removal significantly affected model performance. Overall, the CNN model adapted best to the linguistic characteristics of Persian text, underscoring its efficacy for this task.

Year of publication:	2024
Authors:	Aghighi, Ramin ; Bashiri, Hassan
Published in:	Advanced Interdisciplinary Applications of Deep Learning for Data Science. - IGI Global Scientific Publishing, ISBN 9798369347614. - 2024, p. 143-170

Type of publication:	Article
Type of publication (narrower categories):	chapter
Language:	English
Other identifiers:	10.4018/979-8-3693-4759-1.ch006 [DOI]
Source:	Other ZBW resources

Persistent link: https://www.econbiz.de/10015537667