Turkish Gerunds as Features

Gerunds (verbals) are words derived from verbs but used as other parts of speech in a sentence. In Turkish, gerunds are created by adding derivational suffixes to verbs. Depending on the suffix, a gerund can be used as a noun, an adjective or an adverb in the sentence.

Suffixes that form verbal nouns (infinitives): -me, -mek, -ma, -mak, -iş, -ış, -uş, -üş (the variants follow Turkish vowel harmony). For example:

  • Kardeşim okumayı öğrendi. (reading)
  • Bu bakışından hoşlanmadım. (looking)
  • Yarın okula gitmek istemiyorum. (going)

Suffixes that form verbal adjectives (participles): -an, -ası, -mez, -ar, -dik, -ecek, -miş, -di(ği) and their vowel-harmony variants. For example:

  • Gelecek yıl işe başlayacak. (next year)
  • Polisler olası kazaları önlemek için kontrolü sağlıyorlardı. (possible accidents)
  • Salonda hep bildik yüzler vardı. (familiar faces)

Suffixes that form verbal adverbs: -esiye, -ip, -meden, -ince, -ken, -eli, -dikçe, -erek, -ir … -mez, -diğinde, -e … -e, -meksizin, -cesine and their vowel-harmony variants. For example:

  • Bu mülâkat için ölesiye hazırlandım. (to death)
  • Yemeğimi bitirir bitirmez gelirim. (as soon as I finish)
  • Ödevin bitince parkta buluşalım. (when you finish your homework)

Turkish is well suited to deriving gerunds because the language has many gerund suffixes. Starting from this point, I listed the most widely used verbs in Turkish and then derived gerunds from them with the gerund suffixes. In the end, I obtained 590 verbal nouns, 587 verbal adjectives and 916 verbal adverbs (including the vowel-harmony variants).
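The derivation step can be illustrated with a minimal sketch. It is not my original program: the names `attach`, `derive` and the suffix templates are hypothetical, and the rules are deliberately simplified. It assumes that "A" in a template stands for a/e and "I" for ı/i/u/ü, that each suffix vowel harmonizes with the vowel before it, and that a buffer "y" is inserted between two vowels. It ignores consonant softening (git → gidip) and other irregularities.

```python
BACK_VOWELS = "aıou"
FRONT_VOWELS = "eiöü"
VOWELS = BACK_VOWELS + FRONT_VOWELS

# Four-way harmony: which high vowel follows each preceding vowel.
FOUR_WAY = {"a": "ı", "ı": "ı", "e": "i", "i": "i",
            "o": "u", "u": "u", "ö": "ü", "ü": "ü"}


def last_vowel(stem):
    """Return the last vowel of a verb stem."""
    for ch in reversed(stem):
        if ch in VOWELS:
            return ch
    raise ValueError("stem has no vowel: %r" % stem)


def attach(stem, template):
    """Attach a suffix template to a stem using simplified vowel harmony.

    In the template, "A" means a/e and "I" means ı/i/u/ü; all other
    characters are copied literally.
    """
    vowel = last_vowel(stem)
    suffix = []
    for ch in template:
        if ch == "A":
            vowel = "a" if vowel in BACK_VOWELS else "e"
            suffix.append(vowel)
        elif ch == "I":
            vowel = FOUR_WAY[vowel]
            suffix.append(vowel)
        else:
            suffix.append(ch)
    suffix = "".join(suffix)
    # Simplification: insert a buffer "y" between two vowels.
    if stem[-1] in VOWELS and suffix[0] in VOWELS:
        suffix = "y" + suffix
    return stem + suffix


def derive(stem, templates):
    """Derive one gerund form per suffix template."""
    return [attach(stem, t) for t in templates]


# Hypothetical template list for a few verbal-noun suffixes.
NOUN_TEMPLATES = ["mA", "mAk", "Iş"]

print(derive("oku", NOUN_TEMPLATES))  # ['okuma', 'okumak', 'okuyuş']
print(attach("bit", "IncA"))          # bitince
```

A full derivation would need a much longer template list and the exceptions mentioned above, so this only shows the idea of generating the vowel variants automatically.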

I implemented some functions that process the gerunds as features for classification, and I used these features with an SVM on the Radikal dataset. The program produced 2662 features on the Radikal dataset.
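As a rough illustration of what "gerunds as features" can mean, the sketch below turns a text into a vector of relative gerund frequencies. The function names and the tiny vocabulary are hypothetical, and exact token matching is an assumption. The real feature functions may differ, for example in how inflected forms are handled.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into word tokens.

    Note: str.lower() is not Turkish-aware (I/ı, İ/i), which is ignored here.
    """
    return re.findall(r"\w+", text.lower())


def gerund_features(text, gerund_vocab):
    """Return one relative frequency per gerund in gerund_vocab."""
    tokens = tokenize(text)
    wanted = set(gerund_vocab)
    counts = Counter(t for t in tokens if t in wanted)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[g] / total for g in gerund_vocab]


# Hypothetical three-word vocabulary; the real list is much longer.
vocab = ["gitmek", "okumayı", "bitince"]
print(gerund_features("Yarın okula gitmek istemiyorum.", vocab))
# [0.25, 0.0, 0.0]
```

Each text becomes one such vector, and the vectors are what the classifier is trained on.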

Here are my first results:

Class   Precision   Recall   F1-Score
AH      0.87        0.67     0.76
AO      0.76        0.76     0.76
BO      0.72        0.78     0.75
EB      0.66        0.70     0.68
FT      0.79        0.84     0.82
OC      0.71        0.80     0.75
TE      0.83        0.76     0.79
AVG.    0.76        0.76     0.76
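For readers who want to check the table, the F1 score is the harmonic mean of precision and recall. The small helper below is only an illustration (the name `f1_score` is mine, not part of the original program). Because the table shows rounded values, recomputing F1 from the rounded precision and recall can differ from the reported value by 0.01 in some rows.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# AH row of the table: precision 0.87, recall 0.67.
print(round(f1_score(0.87, 0.67), 2))  # 0.76
```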

In this first experiment, gerund features give per-class F1 scores between 0.68 and 0.82. Compared with the Turkish studies I reviewed, these results are promising, because the average F1 score of 0.76 was obtained from gerund frequencies alone.

Popular stylometric features of Turkish author detection

I have been preparing a survey on author detection in Turkish for a while. I gathered twelve studies and examined them with regard to the stylometric features and the algorithms they use. There are eight types of stylometric features: token-based, vocabulary richness, word frequency, word n-gram, character-based, character n-gram, part of speech and function words.

[Figure: number of studies using each type of stylometric feature]

The numbers on the Y axis show how many studies use each feature. The most used feature is word frequency, followed by token-based features.

Likewise, eight groups of algorithms are the most preferred in Turkish author detection studies: Naive Bayes, Neural Networks, SVM, Decision Tree, Random Forest, k-NN, k-Means and others (Gaussian classifier, histogram, similarity-based methods, etc.).

[Figure: number of studies using each algorithm]

As the graph shows, the most preferred algorithm is Naive Bayes, the second is SVM and the third is Random Forest.

Authorship detection using n-gram features with SVM

In an authorship detection example, I used combined tf-idf vectors of word bigrams and character bigrams as features.
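To make the feature representation concrete, here is a from-scratch sketch of tf-idf over a combined vocabulary of word and character bigrams. It is not the code used in the example. The function names, the "w:" and "c:" prefixes, the smoothed idf formula and the L2 normalization are all assumptions chosen for illustration. The resulting vectors would then be passed to a linear SVM, which is not shown.

```python
import math
from collections import Counter


def word_bigrams(text):
    """Word bigrams, prefixed with 'w:' to keep them apart from char bigrams."""
    words = text.lower().split()
    return ["w:" + " ".join(words[i:i + 2]) for i in range(len(words) - 1)]


def char_bigrams(text):
    """Character bigrams (spaces included), prefixed with 'c:'."""
    s = text.lower()
    return ["c:" + s[i:i + 2] for i in range(len(s) - 1)]


def combined_bigrams(text):
    """All word and character bigrams of a text in one list."""
    return word_bigrams(text) + char_bigrams(text)


def tfidf_vectors(docs):
    """Return (vocab, vectors) with one L2-normalized tf-idf vector per doc."""
    term_lists = [combined_bigrams(d) for d in docs]
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))  # count each term once per document
    vocab = sorted(doc_freq)
    n_docs = len(docs)
    # Assumed smoothed idf: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


vocab, vectors = tfidf_vectors(["the cat sat", "the dog sat"])
print(len(vocab), len(vectors))
```

Bigrams that occur in only one document, such as "w:the cat" here, get a higher weight than bigrams shared by both documents, such as "c:th". That is the property that makes tf-idf useful for telling authors apart.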

What is an n-gram? An n-gram is a contiguous sequence of n items from a given sequence of text or speech, where n is an integer greater than zero. Language models that take advantage of the ordering of words are called n-gram language models. An n-gram model can be pictured as sliding a small window over the text, through which only n words are visible at a time. The simplest n-gram model is the unigram model, in which n is one, so the window shows only one word at a time. Models with n equal to two (bigrams) or three (trigrams) are generally more informative than unigrams. [1]
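The sliding-window picture translates almost directly into code. The helper below is a minimal sketch (the name `ngrams` is my own choice); it works on any sequence, so the same function yields word n-grams from a list of words and character n-grams from a string.

```python
def ngrams(items, n):
    """Return all contiguous n-item windows of a sequence as tuples."""
    if n < 1:
        raise ValueError("n must be an integer greater than zero")
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


words = "the quick brown fox".split()
print(ngrams(words, 1))  # unigrams: [('the',), ('quick',), ('brown',), ('fox',)]
print(ngrams(words, 2))  # bigrams: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams("fox", 2))  # character bigrams: [('f', 'o'), ('o', 'x')]
```

When the sequence is shorter than n, the window does not fit and the result is an empty list.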

The authorship detection example uses the Reuter_50_50 dataset as training and test data. [2] The dataset contains 50 authors, with 50 texts per author in each of the training and test sets. To begin with, the example takes the texts of only eight authors: 'AaronPressman', 'AlanCrosby', 'AlexanderSmith', 'BenjaminKangLim', 'BernardHickey', 'BradDorfman', 'DarrenSchuettler', 'DavidLawder'.

Here are the results:
n_samples: 400, n_features: 32885 for both train and test data.
The table shows the precision, recall and F1 score of the L1-penalty SVC and the L2-penalty SVC for each of the eight authors.

Author             Penalty   Precision   Recall   F1 Score
AaronPressman      L2        0.96        0.98     0.97
                   L1        0.94        0.94     0.94
AlanCrosby         L2        0.98        1.00     0.99
                   L1        0.94        0.94     0.94
AlexanderSmith     L2        0.96        0.98     0.97
                   L1        0.94        0.90     0.92
BenjaminKangLim    L2        1.00        1.00     1.00
                   L1        0.98        0.96     0.97
BernardHickey      L2        1.00        0.98     0.99
                   L1        0.98        0.98     0.98
BradDorfman        L2        0.69        1.00     0.82
                   L1        0.61        0.96     0.74
DarrenSchuettler   L2        1.00        0.90     0.95
                   L1        1.00        0.94     0.97
DavidLawder        L2        1.00        0.62     0.77
                   L1        0.93        0.50     0.65

References:
1. http://nlpwp.org/book/chap-ngrams.xhtml
2. http://archive.ics.uci.edu/ml/datasets/Reuter_50_50