The rapid growth in the volume of online text has accelerated research in information retrieval. Content generated on social platforms in particular grows day by day, and these platforms have opened the way for large amounts of text in every language. This motivated me to study authorship detection on Turkish texts. Unfortunately, authorship attribution has been studied far less for Turkish than for English, so I was forced to develop some basic tools myself. For example, I could not find a suitable POS tagger, so I developed my own tagger for Turkish using the Brill tagger.
Here is my Turkish POS tagger code.
I read the training data from a treebank file (METU-SABANCI).
I use nltk’s unigram, bigram and trigram taggers as backoff taggers.
I applied 5-fold cross-validation to my tagger and got 90%-93% accuracy.
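As an illustration of the backoff idea described above, here is a minimal sketch of how such a chain can be built with NLTK and then used as the initial tagger for Brill training. It is not the actual tagger code. The two toy sentences stand in for the treebank data, and the helper names `build_backoff_tagger` and `build_brill_tagger`, the default tag `"Noun"`, the `fntbl37` template set and the `max_rules` value are all assumptions chosen for the example. The sketch is written for Python 3.

```python
# Sketch only: toy data stands in for sentences read from the treebank file.
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import brill, brill_trainer

# Hypothetical training data: a list of sentences, each a list of (word, tag) pairs.
train_sents = [
    [("uzun", "Adj"), ("bir", "Det"), ("süre", "Noun"), ("sonra", "Adv"),
     ("kendime", "Pron"), ("geldim", "Verb"), (".", "Punc")],
    [("bir", "Det"), ("süre", "Noun"), ("sonra", "Adv"),
     ("geldim", "Verb"), (".", "Punc")],
]


def build_backoff_tagger(sents, default_tag="Noun"):
    """Chain the taggers so each one falls back to the next simpler one."""
    t0 = DefaultTagger(default_tag)          # last resort: one fixed tag (assumed)
    t1 = UnigramTagger(sents, backoff=t0)    # most frequent tag per word
    t2 = BigramTagger(sents, backoff=t1)     # uses the previous tag as context
    t3 = TrigramTagger(sents, backoff=t2)    # uses the two previous tags as context
    return t3


def build_brill_tagger(sents, max_rules=50):
    """Learn Brill transformation rules on top of the backoff chain."""
    initial = build_backoff_tagger(sents)
    trainer = brill_trainer.BrillTaggerTrainer(initial, brill.fntbl37(), trace=0)
    return trainer.train(sents, max_rules=max_rules)


tagger = build_brill_tagger(train_sents)
tagged = tagger.tag("uzun bir süre sonra geldim .".split())
print(tagged)
```

The trigram tagger is consulted first. When it has no entry for a context it defers to the bigram tagger, then to the unigram tagger, and finally to the fixed default tag, so every token always receives some tag before the Brill rules are applied.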
    sentence = "Uzun bir süre sonra kendime geldim ."
    decoded_sentence = sentence.decode('utf-8')
    tr_brill = TRTagger()
    print tr_brill.turkish_pos_tagger(decoded_sentence)
    '''
    uzun-Adj bir-Det süre-Noun sonra-Adv kendime-Pron geldim-Verb .-Punc
    '''
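The 5-fold cross-validation mentioned above can be sketched in plain Python as follows. This is an illustrative outline under stated assumptions, not the evaluation script behind the reported 90%-93% figure. The function names `k_fold_accuracies`, `token_accuracy` and `train_most_frequent_tag` are hypothetical, the toy sentences stand in for the treebank, and the most-frequent-tag baseline is only a placeholder for the real training function.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(train_sents, default_tag="Noun"):
    """Placeholder trainer (assumption): tag each word with its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    # Return a function that maps a list of words to (word, tag) pairs.
    return lambda words: [(w, table.get(w, default_tag)) for w in words]


def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [w for w, _ in sent]
        for (_, predicted), (_, gold) in zip(tag_fn(words), sent):
            correct += predicted == gold
            total += 1
    return correct / total if total else 0.0


def k_fold_accuracies(sents, train_fn, k=5):
    """Train on k-1 folds, test on the held-out fold, and repeat for every fold."""
    scores = []
    for i in range(k):
        test = sents[i::k]  # every k-th sentence, starting at index i
        train = [s for j, s in enumerate(sents) if j % k != i]
        scores.append(token_accuracy(train_fn(train), test))
    return scores


# Toy corpus (assumption): ten copies of one tagged sentence.
toy_sents = [[("bir", "Det"), ("süre", "Noun"), (".", "Punc")] for _ in range(10)]
scores = k_fold_accuracies(toy_sents, train_most_frequent_tag, k=5)
print(scores, sum(scores) / len(scores))
```

To evaluate a real tagger, `train_fn` would be replaced by a function that trains on the given sentences and returns a tagging function. If the treebank sentences are ordered by source or topic, shuffling them once before splitting is a common precaution.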