Until now, I have used newspaper articles in my experiments. This time I decided to use texts extracted from another platform, so I collected texts from Ekşisözlük, which is a kind of local Reddit. I ran a comparison experiment using Turkish gerunds as features.
Here are the components of my experiment:
-
Corpus: An Ekşisözlük dataset of 5 authors, identified by their nicknames, with 100 texts per author. The average word count is 461. 80% of the dataset is used as training data and 20% as test data, which gives 20 test documents per author.
-
Features: The features are Turkish gerunds, which are words derived from verbs but used as nouns, adjectives, or adverbs in a sentence. I listed the most widely used verbs in Turkish and then derived gerunds from them by attaching gerund suffixes. In the end, I obtained 590 verbal nouns, 587 verbal adjectives, and 916 verbal adverbs (including the vowel-harmony variants of each suffix).
-
Algorithms: Linear SVM, Multi-Layer Perceptron (MLP), Naive Bayes (NB), k-Nearest Neighbor (kNN), and Decision Tree.
The results are as follows.
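The feature step described above (deriving gerunds from verb stems, then counting them in a text) can be sketched roughly as follows. This is only a minimal illustration and not the code used for the experiment. The suffix templates are a small assumed subset, since the full suffix list is not given here. The function names (`attach`, `gerund_vocabulary`, `gerund_frequencies`) are hypothetical. Matching exact tokens and normalising by text length are also assumptions. Consonant changes such as git → giden are ignored.

```python
import re
from collections import Counter

BACK_VOWELS = "aıou"
FRONT_VOWELS = "eiöü"
VOWELS = BACK_VOWELS + FRONT_VOWELS

# Four-fold harmony: last stem vowel -> high vowel used in the suffix.
FOUR_FOLD = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
             "o": "u", "u": "u", "ö": "ü", "ü": "ü"}

# Assumed, illustrative subset of gerund suffixes (not the full list).
# "A" stands for a/e, "I" for ı/i/u/ü, "(y)" for the buffer consonant
# inserted after a stem that ends in a vowel.
SUFFIX_TEMPLATES = {
    "verbal_noun": ["mAk", "mA"],
    "verbal_adjective": ["(y)An"],
    "verbal_adverb": ["(y)ArAk", "(y)Ip"],
}


def last_vowel(stem):
    """Return the last vowel of a verb stem."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel in stem: {stem!r}")


def attach(stem, template):
    """Attach a suffix template to a stem, applying vowel harmony."""
    vowel = last_vowel(stem)
    low = "a" if vowel in BACK_VOWELS else "e"   # two-fold harmony
    high = FOUR_FOLD[vowel]                       # four-fold harmony
    buffer = "y" if stem[-1] in VOWELS else ""
    suffix = template.replace("(y)", buffer).replace("A", low).replace("I", high)
    return stem + suffix


def gerund_vocabulary(stems):
    """Derive every gerund form for the given stems, sorted for stable indexing."""
    forms = set()
    for stem in stems:
        for templates in SUFFIX_TEMPLATES.values():
            for template in templates:
                forms.add(attach(stem, template))
    return sorted(forms)


def turkish_lower(text):
    """Lowercase with the Turkish dotted/dotless i distinction."""
    return text.replace("I", "ı").replace("İ", "i").lower()


def gerund_frequencies(text, vocabulary):
    """Relative frequency of each vocabulary gerund in the text."""
    tokens = re.findall(r"\w+", turkish_lower(text))
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[form] / total for form in vocabulary]


vocabulary = gerund_vocabulary(["gel", "oku"])
features = gerund_frequencies("Gelip okuyan gelip gitti.", vocabulary)
```

With the stems gel and oku this yields forms such as gelmek, gelen, gelip, okumak, okuyan, and okuyup. A sketch like this only matches bare gerund forms, so further inflected forms (for example gelmeyi) would need additional handling.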
-
SVM
The performance of SVM with gerund frequencies as features is not satisfactory: it correctly attributed at least 12 of the 20 test documents for only 3 of the 5 authors.
-
MLP
MLP with gerund features performs slightly better than SVM: it correctly attributed at least 12 of the 20 test documents for 4 of the 5 authors.
-
NB
The performance of NB is average and close to the other results: it correctly attributed at least 12 of the 20 test documents for 3 of the 5 authors.
-
Decision Tree
The performance of the decision tree is insufficient, with an average F1-score of 0.39. It did not produce a satisfactory number of correct matches.
-
kNN
The performance of kNN is also insufficient, though slightly better than the decision tree, with an average F1-score of 0.44. It reached 16 correct matches out of 20 test documents for only one of the 5 authors.
In conclusion, NB, kNN, and decision tree are not suitable algorithms for this approach; SVM and MLP performed better than the others.
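For reference, the two measures quoted in the results (how many authors reach a minimum number of correctly matched test documents, and the average F1-score) can be computed along these lines. This is a sketch under assumptions: "average F1-score" is taken to mean the unweighted (macro) average over authors, the function names are hypothetical, and the labels in the example are toy data, not the experiment's predictions.

```python
from collections import Counter


def per_author_correct(y_true, y_pred):
    """Number of correctly attributed test documents for each author."""
    correct = Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            correct[truth] += 1
    return {author: correct[author] for author in sorted(set(y_true))}


def authors_above_threshold(y_true, y_pred, min_correct=12):
    """How many authors have at least `min_correct` correct matches."""
    counts = per_author_correct(y_true, y_pred)
    return sum(1 for c in counts.values() if c >= min_correct)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-author F1-scores."""
    scores = []
    for author in sorted(set(y_true)):
        tp = sum(t == author and p == author for t, p in zip(y_true, y_pred))
        fp = sum(t != author and p == author for t, p in zip(y_true, y_pred))
        fn = sum(t == author and p != author for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two hypothetical authors and three test documents each.
y_true = ["a", "a", "a", "b", "b", "b"]
y_pred = ["a", "a", "b", "b", "b", "b"]
correct = per_author_correct(y_true, y_pred)
n_good = authors_above_threshold(y_true, y_pred, min_correct=3)
score = macro_f1(y_true, y_pred)
```

Because every author has the same number of test documents here, the macro average and a support-weighted average of F1 give the same value.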