I drew a short diagram of the history of authorship attribution.
Gerunds are words that are derived from verbs but are used as other parts of speech in a sentence. In Turkish, gerunds are created by adding derivational suffixes to verbs. Depending on the derivational suffix, a gerund can be used as a noun, an adjective or an adverb in the sentence.
Suffixes that form verbal nouns (infinitives): -me, -mek, -ma, -mak, -iş, -ış, -uş, -üş (the variants follow the vowel harmony of Turkish). For example:
- Kardeşim okumayı öğrendi. (reading)
- Bu bakışından hoşlanmadım. (looking)
- Yarın okula gitmek istemiyorum. (going)
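A minimal sketch of how vowel harmony selects the suffix variant for the verbal noun suffixes listed above. The function name `verbal_nouns` and its dictionary keys are hypothetical and chosen only for illustration. The sketch assumes lowercase stems and does not handle consonant alternations (for example git- becoming gidiş) or irregular stems.

```python
BACK_VOWELS = "aıou"
FRONT_VOWELS = "eiöü"
VOWELS = BACK_VOWELS + FRONT_VOWELS

# Four-way harmony: last stem vowel -> high vowel used in -iş/-ış/-uş/-üş.
FOUR_WAY = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
            "o": "u", "u": "u", "ö": "ü", "ü": "ü"}


def last_vowel(stem):
    """Return the last vowel of a lowercase verb stem."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError(f"no vowel found in stem {stem!r}")


def verbal_nouns(stem):
    """Derive the three verbal noun forms of a stem (illustrative only)."""
    v = last_vowel(stem)
    low = "a" if v in BACK_VOWELS else "e"       # two-way harmony: -mek/-mak, -me/-ma
    high = FOUR_WAY[v]                           # four-way harmony: -iş/-ış/-uş/-üş
    buffer = "y" if stem[-1] in VOWELS else ""   # buffer consonant after a vowel-final stem
    return {
        "mek": stem + "m" + low + "k",
        "me": stem + "m" + low,
        "iş": stem + buffer + high + "ş",
    }


for stem in ["bak", "gel", "gör", "oku"]:
    print(stem, verbal_nouns(stem))
```

For the stems bak-, gel-, gör- and oku-, the last entry comes out as bakış, geliş, görüş and okuyuş, which covers all four variants of the suffix.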
Suffixes that form verbal adjectives (participles): -an, -ası, -mez, -ar, -dik, -ecek, -miş, -di(ği) and their vowel-harmony variants. For example:
- Gelecek yıl işe başlayacak. (next year)
- Polisler olası kazaları önlemek için kontrolü sağlıyorlardı. (possible accidents)
- Salonda hep bildik yüzler vardı. (familiar faces)
Suffixes that form verbal adverbs: -esiye, -ip, -meden, -ince, -ken, -eli, -dikçe, -erek, -ir … -mez, -diğinde, -e … -e, -meksizin, -cesine and their vowel-harmony variants. For example:
- Bu mülâkat için ölesiye hazırlandım. (to death)
- Yemeğimi bitirir bitirmez gelirim. (as soon as I finish)
- Ödevin bitince parkta buluşalım. (when you finish your homework)
Turkish lends itself to deriving gerunds because the language has many gerund suffixes. Starting from this point, I listed the most widely used verbs in Turkish and then derived gerunds from them by using the gerund suffixes. In the end, I obtained 590 verbal nouns, 587 verbal adjectives and 916 verbal adverbs (including the vowel-harmony variants).
I implemented some functions that process the gerunds as features for the classification method, and I used them with an SVM on the Radikal dataset. The program produced 2662 features on this dataset.
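A rough sketch of how gerund frequencies could be turned into a feature vector. The helper names `turkish_lower`, `tokenize` and `gerund_features`, the three-word lexicon and the exact-match counting are all assumptions made for illustration. The actual functions and matching rules used in the experiment are not described here.

```python
import re
from collections import Counter


def turkish_lower(text):
    # str.lower() maps "I" to "i"; Turkish needs "I" -> "ı" and "İ" -> "i".
    return text.replace("I", "ı").replace("İ", "i").lower()


def tokenize(text):
    # Keep runs of letters only (no digits, underscores or punctuation).
    return re.findall(r"[^\W\d_]+", turkish_lower(text))


def gerund_features(text, lexicon):
    """Relative frequency of each lexicon entry in the text, in lexicon order."""
    tokens = tokenize(text)
    wanted = set(lexicon)
    counts = Counter(t for t in tokens if t in wanted)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[g] / total for g in lexicon]


# Hypothetical three-entry lexicon; the real one has thousands of forms.
lexicon = ["gitmek", "bitince", "ölesiye"]
text = "Yarın okula gitmek istemiyorum. Ödevin bitince parkta buluşalım."
print(gerund_features(text, lexicon))
```

One vector per document, built this way, could then be passed to any SVM implementation together with the author labels. Exact matching misses further inflected forms such as bakışından, so a real pipeline would likely need some suffix-aware matching.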
Here are my first results:
In this first experiment, the gerund features give F1-scores between 0.68 and 0.82. Compared with the Turkish studies I reviewed, these results are promising, because the average F1-score of 0.76 was obtained from gerund frequencies alone.
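For reference, a small sketch of how a per-author F1-score and its unweighted average can be computed from true and predicted labels. The labels "A" and "B" are made-up toy data, not results from the Radikal experiment. Whether the 0.76 average above is an unweighted (macro) average is an assumption.

```python
def f1_per_class(y_true, y_pred):
    """F1 = 2 * P * R / (P + R), computed separately for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores[label] = 2 * precision * recall / (precision + recall)
        else:
            scores[label] = 0.0
    return scores


def macro_f1(scores):
    """Unweighted mean of the per-class F1-scores."""
    return sum(scores.values()) / len(scores)


# Toy data: two hypothetical authors, four documents.
y_true = ["A", "A", "B", "B"]
y_pred = ["A", "B", "B", "B"]
scores = f1_per_class(y_true, y_pred)
print(scores, macro_f1(scores))
```

In the toy data, author A has precision 1 and recall 0.5, and author B has precision 2/3 and recall 1, so their F1-scores are 2/3 and 0.8.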
I have been preparing a survey on author detection in Turkish for a while. I gathered twelve studies and examined them with respect to the stylometric features and the algorithms they use. There are eight types of stylometric features: token-based, vocabulary richness, word frequency, word n-gram, character-based, character n-gram, part of speech and function words.
The numbers on the Y axis show how many studies use each feature. The most used feature is word frequency, and the second is the token-based feature.
As for the algorithms, eight groups are preferred in the Turkish author detection studies: Naive Bayesian, Neural Networks, SVM, Decision Tree, Random Forest, k-NN, k-Means and others (Gaussian classifier, histogram, similarity-based methods, etc.).
As the graph shows, the most preferred algorithm is Naive Bayesian, the second is SVM, and the third is Random Forest.