Text Classification of Persian Documents With Deep Learning
This chapter evaluates four deep learning models (RNN, LSTM, GRU, and CNN) on Persian document classification using the Hamshahri dataset, which comprises 345 MB of news text: 166,774 documents across 82 categories. The models were implemented in Python in the Google Colab environment, with the NLTK and Hazm libraries used for text preprocessing. Preprocessing steps included tokenization, normalization, stemming, and stop word removal. Performance was measured with precision, accuracy, and F1 score. The CNN model performed best and remained stable across preprocessing scenarios, whereas the RNN was highly sensitive to preprocessing changes. The chapter also examines the effect of individual preprocessing methods and finds that stemming and stop word removal significantly affected model performance. Overall, the CNN adapted best to the linguistic characteristics of Persian text among the models tested, underscoring its efficacy for this task.
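The preprocessing scenarios mentioned above can be illustrated with a minimal sketch in which each step is switched on or off independently. This is not the chapter's implementation: the chapter uses NLTK and Hazm, whereas the sketch below is dependency-free and uses toy stand-ins. The names `Scenario`, `toy_stem`, and `preprocess`, the five-word stop list, and the two plural suffixes are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Map Arabic code points that often appear in Persian text to their
# Persian forms: Arabic yeh (U+064A) -> Persian yeh (U+06CC),
# Arabic kaf (U+0643) -> Persian kaf (U+06A9).
CHAR_MAP = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})

# Tiny illustrative stop word list (assumption, not the chapter's list):
# "va" (and), "dar" (in), "be" (to), "az" (from), "ke" (that).
STOP_WORDS = {
    "\u0648",
    "\u062F\u0631",
    "\u0628\u0647",
    "\u0627\u0632",
    "\u06A9\u0647",
}

# Toy plural suffixes "-haye" and "-ha"; a real stemmer handles far more.
SUFFIXES = ("\u0647\u0627\u06CC", "\u0647\u0627")


@dataclass(frozen=True)
class Scenario:
    """One preprocessing configuration; each flag toggles a single step."""
    normalize: bool = True
    stem: bool = True
    remove_stop_words: bool = True


def toy_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least two characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def preprocess(text: str, scenario: Scenario) -> list[str]:
    """Apply the enabled steps in order and return the resulting tokens."""
    if scenario.normalize:
        text = text.translate(CHAR_MAP)
    tokens = text.split()  # whitespace tokenization as a simple stand-in
    if scenario.remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if scenario.stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


# Compare the full pipeline against variants with one step disabled.
scenarios = {
    "all steps": Scenario(),
    "no stemming": Scenario(stem=False),
    "no stop word removal": Scenario(remove_stop_words=False),
}
sample = "\u0643\u062A\u0627\u0628\u0647\u0627 \u0648 \u0645\u0642\u0627\u0644\u0647"
for name, scenario in scenarios.items():
    print(name, preprocess(sample, scenario))
```

Running the same classifier on the output of each scenario is one way to measure how sensitive a model is to a given preprocessing step, which is the kind of comparison the chapter reports for RNN and CNN.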