Homemade Turkish POS Tagger

The rapid growth in the number of online texts has accelerated research on information retrieval. Content generated on social platforms in particular keeps growing day by day, and these platforms have made large volumes of text available in almost every language. With this in mind, I decided to study authorship detection on Turkish texts. Unfortunately, there is far less work on authorship attribution for Turkish than for English, so I was forced to develop some basic tools myself. For example, I could not find a suitable POS tagger, so I developed my own tagger for Turkish using the Brill tagger.

Here is my Turkish POS tagger code.

I read the training data from a treebank file (the METU-Sabancı treebank).

I use NLTK's unigram, bigram and trigram taggers as backoff taggers.
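
To show what such a backoff chain can look like, here is a minimal sketch written for Python 3 and NLTK 3. It is not my actual tagger code: the two training sentences, the `build_backoff_tagger` helper, the `Noun` default tag, the `brill24` template set and `max_rules=50` are all assumptions made for illustration. Each n-gram tagger falls back to the next simpler one when it has not seen the context, and the whole chain is then used as the initial tagger for Brill training.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag import brill, brill_trainer

# Toy training data (hypothetical); a real tagger would be trained on
# the tagged sentences read from the treebank file.
train_sents = [
    [("uzun", "Adj"), ("bir", "Det"), ("süre", "Noun"), ("sonra", "Adv"),
     ("kendime", "Pron"), ("geldim", "Verb"), (".", "Punc")],
    [("bir", "Det"), ("kitap", "Noun"), ("aldım", "Verb"), (".", "Punc")],
]


def build_backoff_tagger(sents, default_tag="Noun"):
    """Chain trigram -> bigram -> unigram -> default tag (hypothetical helper)."""
    tagger = DefaultTagger(default_tag)  # last resort for unseen words
    for tagger_class in (UnigramTagger, BigramTagger, TrigramTagger):
        # Each new tagger consults the previous one when its context is unseen.
        tagger = tagger_class(sents, backoff=tagger)
    return tagger


backoff_tagger = build_backoff_tagger(train_sents)

# Use the backoff chain as the initial tagger for Brill's
# transformation-based learner; brill24() is one of NLTK's template sets.
trainer = brill_trainer.BrillTaggerTrainer(backoff_tagger, brill.brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=50)

# "bekledim" does not occur in the toy data, so it falls through to the default tag.
print(brill_tagger.tag(["uzun", "bir", "süre", "bekledim", "."]))
```

With so little data the Brill trainer has almost nothing to learn from, so the output is essentially that of the backoff chain. On a real treebank the learned rules correct some of the initial tagger's systematic mistakes.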

I applied 5-fold cross-validation to my tagger and got 90%-93% accuracy.
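
The evaluation procedure can be sketched in plain Python 3 as follows. This is an illustration under assumptions, not the code that produced the numbers above: the function names are hypothetical, the folds are taken by striding without shuffling, and a simple most-frequent-tag baseline stands in for the real tagger so that the example runs without any treebank.

```python
from collections import Counter, defaultdict


def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; fold i holds out every k-th item starting at i."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_most_frequent(train_sents, default="Noun"):
    """Stand-in tagger: give each word its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    best = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return lambda words: [(w, best.get(w, default)) for w in words]


def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        predicted = tag_fn([word for word, _ in sent])
        correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
        total += len(sent)
    return correct / total if total else 0.0


def cross_validate(sents, train_fn, k=5):
    """Train on k-1 folds, score on the held-out fold, and return all k scores."""
    return [token_accuracy(train_fn(train), test)
            for train, test in k_fold_splits(sents, k)]


# Toy data (hypothetical): two tagged sentences repeated to get ten.
toy_sents = [
    [("uzun", "Adj"), ("bir", "Det"), ("süre", "Noun")],
    [("bir", "Det"), ("kitap", "Noun"), ("aldım", "Verb")],
] * 5

scores = cross_validate(toy_sents, train_most_frequent, k=5)
print(scores)
```

Reporting the lowest and highest of the five fold scores gives a range like the one quoted above. To evaluate a real tagger, pass a `train_fn` that trains it on the training folds and returns its tagging function.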


    # Python 2 example; TRTagger is the class defined in my tagger code.
    sentence = "Uzun bir süre sonra kendime geldim ."
    decoded_sentence = sentence.decode('utf-8')  # byte string -> unicode
    tr_brill = TRTagger()
    print tr_brill.turkish_pos_tagger(decoded_sentence)
    # Output:
    # uzun-Adj bir-Det süre-Noun sonra-Adv kendime-Pron geldim-Verb .-Punc