Popular stylometric features in Turkish author detection

I have been preparing a survey about author detection in Turkish for a while. I gathered twelve studies and examined them with respect to the stylometric features and algorithms they use. There are eight types of stylometric features: token-based, vocabulary richness, word frequency, word n-gram, character-based, character n-gram, part of speech, and function words.


The numbers on the Y axis indicate how many studies use each feature. The most frequently used feature is word frequency; the second is token-based features.

On the other hand, eight algorithms are most preferred in Turkish author detection studies: Naive Bayes, Neural Networks, SVM, Decision Tree, Random Forest, k-NN, k-Means, and others (Gaussian classifier, histogram, similarity-based, etc.).


As shown in the graph, the most preferred algorithm is Naive Bayes; the second is SVM, and the third is Random Forest.


Authorship detection with n-gram features and SVM

In an authorship detection example, I used the combined tf-idf vectors of word bigrams and character bigrams as features.

First of all, what is an n-gram? An n-gram is defined as a contiguous sequence of n items from a given sequence of text or speech, where n is an integer greater than zero. Language models that take advantage of the ordering of words are called n-gram language models. An n-gram model can be envisioned as sliding a small window over the text, through which only n words are visible at a time. The simplest n-gram model is the unigram model, where n is one; the window shows only one word at a time. The more complicated models, the bigram (n is two) and the trigram (n is three), are usually more informative than the unigram. [1]

The authorship detection example uses the Reuter_50_50 dataset for the training and test data [2]. The Reuter_50_50 dataset contains 50 authors with 50 texts per author. To begin with, the example takes the texts of only eight authors: 'AaronPressman', 'AlanCrosby', 'AlexanderSmith', 'BenjaminKangLim', 'BernardHickey', 'BradDorfman', 'DarrenSchuettler', 'DavidLawder'.
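A minimal sketch of such a pipeline with scikit-learn, assuming a FeatureUnion of two TfidfVectorizer instances; the toy corpus and labels below stand in for the Reuter_50_50 texts and are purely illustrative:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.svm import LinearSVC

# Combine word-bigram and character-bigram tf-idf features into one matrix.
vectorizer = FeatureUnion([
    ("word_bigrams", TfidfVectorizer(analyzer="word", ngram_range=(2, 2))),
    ("char_bigrams", TfidfVectorizer(analyzer="char", ngram_range=(2, 2))),
])

# Tiny stand-in corpus (two "authors"); the real example loads Reuter_50_50.
train_texts = ["the quick brown fox jumps", "a lazy dog sleeps soundly",
               "quick foxes jump very high", "lazy dogs sleep all day"]
train_labels = ["AuthorA", "AuthorB", "AuthorA", "AuthorB"]

X_train = vectorizer.fit_transform(train_texts)

# L2-penalty linear SVC; use penalty="l1", dual=False for the L1 variant.
clf = LinearSVC(penalty="l2")
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["a quick fox jumps"])))
```

At prediction time, new texts are transformed with the same fitted vectorizer so they land in the same combined feature space.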

Here are the results:
n_samples: 400, n_features: 32885 for both the training and test data.
The table shows the precision, recall, and F1 score of the L1-penalty SVC and the L2-penalty SVC for each of the eight authors.

Author            Penalty    Precision  Recall  F1 Score
AaronPressman     L2         0.96       0.98    0.97
                  L1         0.94       0.94    0.94
AlanCrosby        L2         0.98       1.00    0.99
                  L1         0.94       0.94    0.94
AlexanderSmith    L2         0.96       0.98    0.97
                  L1         0.94       0.90    0.92
BenjaminKangLim   L2         1.00       1.00    1.00
                  L1         0.98       0.96    0.97
BernardHickey     L2         1.00       0.98    0.99
                  L1         0.98       0.98    0.98
BradDorfman       L2         0.69       1.00    0.82
                  L1         0.61       0.96    0.74
DarrenSchuettler  L2         1.00       0.90    0.95
                  L1         1.00       0.94    0.97
DavidLawder       L2         1.00       0.62    0.77
                  L1         0.93       0.50    0.65
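A per-author table of precision, recall, and F1 like this one can be produced with scikit-learn's classification_report; the true and predicted labels below are made up just to illustrate the shape of the output, not the actual evaluation:

```python
from sklearn.metrics import classification_report

# Illustrative labels only; the real ones come from the fitted SVC's predictions.
y_true = ["AaronPressman", "AlanCrosby", "AaronPressman", "AlanCrosby"]
y_pred = ["AaronPressman", "AlanCrosby", "AlanCrosby", "AlanCrosby"]

print(classification_report(y_true, y_pred))
```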

1. http://nlpwp.org/book/chap-ngrams.xhtml
2. http://archive.ics.uci.edu/ml/datasets/Reuter_50_50