Popular stylometric features in Turkish author detection

I have been preparing a survey on author detection in Turkish for a while. I gathered twelve studies and examined them in terms of the stylometric features and algorithms they use. There are eight types of stylometric features: token-based, vocabulary richness, word frequency, word n-gram, character-based, character n-gram, part of speech, and function words.


The numbers on the Y axis show how many studies use each feature. The most used feature is word frequency; the second is token-based features.
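To make the feature types more concrete, here is a minimal sketch of how a few of them could be computed for a single text. The helper names, the regex tokenizer, and the choice of type-token ratio as the vocabulary richness measure are my assumptions for illustration; the surveyed studies may define these features differently.

```python
import re
from collections import Counter


def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


def stylometric_features(text, n=2):
    """Compute a few illustrative stylometric features (hypothetical helper)."""
    lowered = text.lower()
    # Naive tokenizer: runs of word characters (\w is Unicode-aware in Python 3,
    # so Turkish letters such as "ü" and "ş" are kept inside tokens).
    tokens = re.findall(r"\w+", lowered)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        # Token-based features
        "token_count": len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / total,
        # Vocabulary richness, here measured as type-token ratio
        "type_token_ratio": len(set(tokens)) / total,
        # Word frequency
        "word_freq": Counter(tokens),
        # Word n-grams and character n-grams
        "word_ngrams": Counter(ngrams(tokens, n)),
        "char_ngrams": Counter("".join(g) for g in ngrams(lowered, n)),
    }


features = stylometric_features("Uzun bir süre sonra kendime geldim. Uzun bir yol.")
print(features["token_count"], features["word_freq"]["uzun"])
```

Part-of-speech and function-word features are left out here because they need a tagger and a word list, respectively.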

As for algorithms, eight are preferred most often in the Turkish author detection studies: Naive Bayes, Neural Networks, SVM, Decision Tree, Random Forest, k-NN, k-Means, and others (Gaussian classifier, histogram, similarity-based, etc.).


As the graph shows, the most preferred algorithm is Naive Bayes, the second is SVM, and the third is Random Forest.
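To illustrate how the most used feature (word frequency) and the most preferred algorithm (Naive Bayes) could work together, here is a small dependency-free sketch. The toy documents, the author labels, and the function names are invented for the example, and add-one smoothing is my assumption; none of this is taken from the surveyed studies.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, author). Returns the counts needed for prediction."""
    word_counts = defaultdict(Counter)  # author -> word frequencies
    doc_counts = Counter()              # author -> number of documents
    vocab = set()
    for tokens, author in docs:
        word_counts[author].update(tokens)
        doc_counts[author] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, tokens):
    """Return the author with the highest log posterior for the tokens."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_author, best_score = None, -math.inf
    for author in doc_counts:
        # Log prior: share of training documents written by this author
        score = math.log(doc_counts[author] / total_docs)
        total_words = sum(word_counts[author].values())
        for token in tokens:
            # Add-one (Laplace) smoothing so unseen words do not zero the score
            count = word_counts[author][token] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_author, best_score = author, score
    return best_author


# Toy training data (invented for illustration)
training = [
    ("uzun bir süre sonra geldim".split(), "author_a"),
    ("uzun bir yol yürüdüm".split(), "author_a"),
    ("kısa ve net bir yazı".split(), "author_b"),
    ("kısa bir not yazdım".split(), "author_b"),
]
model = train_naive_bayes(training)
print(predict(model, "uzun bir süre".split()))
```

Working in log space keeps the product of many small probabilities from underflowing on longer texts.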


What a year of community

Hello from a delayed post. In this post I'm going to write about my year of community. I joined several communities and attended related events over the past year. Below is a chronological list of the groups and events I was a member of or took part in. Some of them are local events for Turkish speakers.

    1. 17th Jan 2015, I attended the We listen to women in IT event, organized by the Google Anita Borg Scholarship Community.
    2. 23rd Jan 2015, I officially became a member of Kadın Yazılımcı (Women Techmakers). Kadın Yazılımcı is a community where women share their ideas and experiences, about computing or otherwise, to encourage those who come after them. In the last year I published 3 posts about Python, 2 posts about algorithms, and 1 post with my PyCon comments on the Kadın Yazılımcı website.
    3. 15th Mar 2015, I attended the Women Techmakers Conference, where I also staffed the Kadın Yazılımcı booth.
    4. 10th Apr 2015, I attended PyCon 2015. I received a financial grant from the organization and did some volunteer work, such as staffing the PyLadies booth.
    5. 11th Jul 2015, I gave a presentation, Text classification via scikit-learn, at a PyIstanbul event. PyIstanbul is a group of Istanbul-based Python developers.
    6. 25th Jul 2015, I attended PhpKonf and was one of the panelists on the Kadın Yazılımcı panel at the conference.
    7. 13th Sep 2015, I was one of the organizers and mentors of the first DjangoGirls Istanbul event. I was also one of the proofreaders of the Turkish translation of the DjangoGirls tutorial.
    8. 12th Dec 2015, I was a mentor, organizer, and participant at DjangoGirls Istanbul. That was an amazing event!

I hope to write more often this year.

Homemade Turkish POS Tagger

As you can easily see, the rapid increase in the number of online texts has accelerated research on information retrieval. In particular, the content generated on social platforms grows day by day, and these platforms have opened the way for a large number of texts in every language. Given this trend, I decided to study authorship detection on Turkish texts. Unfortunately, there is far less work on authorship attribution for Turkish than for English, so I was forced to develop some basic tools myself. For example, I could not find a suitable POS tagger, so I developed my own tagger for Turkish using the Brill tagger.

Here is my Turkish POS tagger code.

I read the training data from a treebank file (METU-SABANCI).

I use NLTK's unigram, bigram, and trigram taggers as backoff taggers.
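The idea behind a backoff chain is to try the most specific model first and fall back to a simpler one when it has no answer. In NLTK this is done by passing one tagger as the `backoff` argument of another. The sketch below is not my tagger's code and not NLTK's implementation; it is a dependency-free illustration with only a bigram and a unigram level, toy training sentences, and an assumed default tag of `Noun` for unseen words.

```python
from collections import Counter, defaultdict


def train_backoff_tagger(tagged_sents, default_tag="Noun"):
    """Build bigram and unigram lookup tables and return a tagging function."""
    unigram = defaultdict(Counter)  # word -> tag counts
    bigram = defaultdict(Counter)   # (previous tag, word) -> tag counts
    for sent in tagged_sents:
        prev_tag = None
        for word, tag in sent:
            unigram[word][tag] += 1
            bigram[(prev_tag, word)][tag] += 1
            prev_tag = tag

    def tag_sentence(words):
        tagged, prev_tag = [], None
        for word in words:
            if (prev_tag, word) in bigram:   # most specific context first
                tag = bigram[(prev_tag, word)].most_common(1)[0][0]
            elif word in unigram:            # back off to the word alone
                tag = unigram[word].most_common(1)[0][0]
            else:                            # last resort for unseen words
                tag = default_tag
            tagged.append((word, tag))
            prev_tag = tag
        return tagged

    return tag_sentence


# Toy training data (invented; the real tagger is trained on the treebank)
train = [
    [("uzun", "Adj"), ("bir", "Det"), ("süre", "Noun")],
    [("bir", "Det"), ("süre", "Noun"), ("sonra", "Adv"), ("geldim", "Verb")],
]
tagger = train_backoff_tagger(train)
print(tagger(["sonra", "uzun", "bir", "yol"]))
```

A trigram level would follow the same pattern, keyed on the two previous tags and the word.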

I applied 5-fold cross-validation to my tagger and got 90%-93% accuracy.
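For readers unfamiliar with k-fold cross-validation, here is a rough sketch of the procedure. It is not my evaluation script: the function names are hypothetical, the folds are interleaved rather than shuffled, and a trivial majority-tag model on made-up data stands in for the tagger so that the example runs on its own.

```python
from collections import Counter


def k_fold_splits(items, k=5):
    """Yield (train, test) pairs where each item is in exactly one test fold."""
    for fold in range(k):
        test = items[fold::k]  # every k-th item, starting at offset `fold`
        train = [x for i, x in enumerate(items) if i % k != fold]
        yield train, test


def cross_validate(items, train_fn, accuracy_fn, k=5):
    """Train on k-1 folds, score on the held-out fold, return per-fold scores."""
    return [accuracy_fn(train_fn(train), test)
            for train, test in k_fold_splits(items, k)]


# Stand-in model for the demo: always predict the most frequent training tag.
def train_majority(pairs):
    return Counter(tag for _, tag in pairs).most_common(1)[0][0]


def accuracy(majority_tag, pairs):
    return sum(tag == majority_tag for _, tag in pairs) / len(pairs)


# Made-up (word, tag) pairs; with a real tagger the items would be sentences.
data = [("w%d" % i, "Noun" if i % 3 else "Verb") for i in range(30)]
scores = cross_validate(data, train_majority, accuracy)
print(scores, sum(scores) / len(scores))
```

Reporting the per-fold scores, rather than only their mean, is what gives a range such as the one quoted above.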

    # Example usage (Python 2); TRTagger is the class from my tagger code.
    sentence = "Uzun bir süre sonra kendime geldim ."
    decoded_sentence = sentence.decode('utf-8')  # byte string -> unicode
    tr_brill = TRTagger()
    print tr_brill.turkish_pos_tagger(decoded_sentence)
    # Output:
    # uzun-Adj bir-Det süre-Noun sonra-Adv kendime-Pron geldim-Verb .-Punc