Homemade Turkish POS Tagger

As you can easily see, the rapid increase in the number of online texts has accelerated research on information retrieval. Content generated on social platforms in particular is growing day by day, and these platforms have opened the way for large volumes of text in every language. Based on this progress, I decided to study authorship detection on Turkish texts. Unfortunately, there is far less work on authorship attribution for Turkish than for English, so I was forced to develop some basic tools myself. For example, I could not find a suitable POS tagger, so I developed my own tagger for Turkish using the Brill tagger.

Here is my Turkish POS tagger code.

I read the training data from a treebank file (the METU-SABANCI treebank).

I use NLTK’s unigram, bigram and trigram taggers as the backoff taggers.
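A backoff chain like this can be sketched with NLTK roughly as follows. This is only a minimal illustration, not my actual tagger: the toy `train_sents`, the `build_backoff_tagger` helper, the default tag `Noun` and the `brill24` template set are all assumptions made for the example, and the real training data comes from the treebank.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger, TrigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Toy training data (hypothetical); the real tagger is trained on the treebank.
train_sents = [
    [("uzun", "Adj"), ("bir", "Det"), ("süre", "Noun"),
     ("sonra", "Adv"), ("geldim", "Verb"), (".", "Punc")],
    [("kısa", "Adj"), ("bir", "Det"), ("kitap", "Noun"),
     ("okudum", "Verb"), (".", "Punc")],
    [("sonra", "Adv"), ("eve", "Noun"), ("geldim", "Verb"), (".", "Punc")],
]


def build_backoff_tagger(sents, default_tag="Noun"):
    """Chain default <- unigram <- bigram <- trigram taggers.

    Each tagger falls back to the previous one when it has not seen
    the current context during training.
    """
    tagger = DefaultTagger(default_tag)
    for tagger_class in (UnigramTagger, BigramTagger, TrigramTagger):
        tagger = tagger_class(sents, backoff=tagger)
    return tagger


backoff_tagger = build_backoff_tagger(train_sents)

# The Brill trainer learns correction rules on top of the backoff chain.
# brill24() is one of NLTK's built-in template sets (an assumed choice here).
trainer = BrillTaggerTrainer(backoff_tagger, brill24(), trace=0)
brill_tagger = trainer.train(train_sents, max_rules=10)

tokens = ["uzun", "bir", "araba", "."]
print(backoff_tagger.tag(tokens))
# [('uzun', 'Adj'), ('bir', 'Det'), ('araba', 'Noun'), ('.', 'Punc')]
print(brill_tagger.tag(tokens))
```

The unseen word "araba" falls all the way through the chain to the default tagger, which is why it comes out as `Noun`.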

I applied 5-fold cross-validation to my tagger and got 90%-93% accuracy.
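For readers who have not used k-fold cross-validation on tagged sentences, here is a small standard-library sketch of the idea. It is not the evaluation code I ran: `k_fold_splits`, `train_most_frequent_tag` and `cross_validate` are hypothetical helper names, the fold assignment (every k-th sentence) is an assumption, and a simple most-frequent-tag baseline stands in for the real tagger.

```python
from collections import Counter, defaultdict


def k_fold_splits(items, k=5):
    """Yield (train, test) pairs; fold i tests every k-th item starting at i."""
    if len(items) < k:
        raise ValueError("need at least k items")
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def train_most_frequent_tag(train_sents, default_tag="Noun"):
    """Stand-in tagger: give each word its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    best = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return lambda words: [(w, best.get(w, default_tag)) for w in words]


def cross_validate(sentences, train_fn, k=5):
    """Return the token-level accuracy of each of the k folds."""
    scores = []
    for train, test in k_fold_splits(sentences, k):
        tag = train_fn(train)
        correct = total = 0
        for sent in test:
            predicted = tag([word for word, _ in sent])
            correct += sum(p == g for (_, p), (_, g) in zip(predicted, sent))
            total += len(sent)
        scores.append(correct / total)
    return scores
```

Calling `cross_validate(tagged_sentences, train_most_frequent_tag)` returns one accuracy per fold; a range such as the 90%-93% above is the spread of those per-fold scores.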


    # Python 2 code: the byte string is decoded to unicode before tagging.
    sentence = "Uzun bir süre sonra kendime geldim ."
    decoded_sentence = sentence.decode('utf-8')
    tr_brill = TRTagger()
    print tr_brill.turkish_pos_tagger(decoded_sentence)
    # Output:
    # uzun-Adj bir-Det süre-Noun sonra-Adv kendime-Pron geldim-Verb .-Punc

My PyCon notes

I was at PyCon, and it was my first one, so I’ll talk about it now. PyCon is, beyond doubt, a great event. Before my notes on the talks, I want to mention the organisation’s financial aid. I received a grant to cover my transoceanic travel expenses, yay! I also volunteered twice during the conference. First, I helped the registration desk staff. Second, I worked at the PyLadies stand, where I sold approximately 20 PyLadies t-shirts. I also met great people while volunteering.

Now the talks can take the stage. On the first day I mostly sat in on machine learning related talks. (The talk titles refer to pyvideo.org links, so you can easily watch them.)

Machine Learning 101. Topics: pandas, scikit-learn, gensim, Theano, Continuum packages for machine learning.
“Words, words, words”: Reading Shakespeare with Python. Topics: text analysis, metadata, rhyme distribution. (*It is a similar but lighter version of my authorship detection project.)
Data Science in Advertising: Or a future when we love ads. Topics: Real-Time Bidding (RTB) advertising, Click-Through Rate (CTR) prediction, auto-bidding systems, traffic prediction.
Grids, Streets and Pipelines: Building a linguistic street map with scikit-learn. Topics: geojson, hyperparameters, geopandas.
How to interpret your own genome using (mostly) Python. Topics: gemini, genome sequence.
Losing your Loops: Fast Numerical Computing with NumPy. Topics: aggregation functions, universal functions, broadcasting, and fancy indexing. (*That is my favourite! It’s so clear, simple and useful.)
How to build a brain with Python. Topics: simulating the brain, Nengo, Spaun.
Keynote – Guido van Rossum. Topics: Python 3, diversity.
A Beginner’s Guide to Test-driven Development. Topics: TDD.
Cutting Off the Internet: Testing Applications that Use Requests. Topics: requests, vcr, httpretty, mock, and betamax.
Techniques for Debugging Hard Problems. Topics: always read source, read all source.
Finding Spammers & Scammers through Rate Tracking with Python & Redis. Topics: velocity engine, keyspaces and facets.

I should also talk about the poster session. I like clear and simple projects, and I saw a few clear and simple poster projects that I liked. Great job!


In praise of FeatureUnion

In the dark age of my project, I needed to extract multiple features from my dataset in parallel, and I found the right scikit-learn tool for it: FeatureUnion. It concatenates the results of multiple transformer objects. I needed to extract both part-of-speech tag and punctuation features, so I decided to use FeatureUnion in my project. I worked out that the combination of the POS tag vectors looks like this:

[Figure: combination of the POS tag vectors]

and of course I applied the same solution for punctuation vectors.

[Figure: combination of the punctuation vectors]

All of the FeatureUnion code is below.


    from sklearn.pipeline import FeatureUnion

    # Each *_vector is a transformer (defined elsewhere) for one POS tag.
    combined_features_pos = FeatureUnion([("noun", noun_vector),
                                          ("verb", verb_vector),
                                          ("adjective", adjective_vector),
                                          ("adverb", adverb_vector),
                                          ("pronoun", pronoun_vector),
                                          ("conjunction", conjunction_vector),
                                          ("number", number_vector)])

    # The same pattern, with one transformer per punctuation mark.
    combined_features_punct = FeatureUnion([("comma", comma_vector),
                                            ("period", period_vector),
                                            ("colon", colon_vector),
                                            ("semicolon", semicolon_vector),
                                            ("question", question_mark_vector),
                                            ("exclamation", exclamation_mark_vector),
                                            ("triple_dot", triple_dot_vector)])

That was not enough for me, so I combined the two unions with another FeatureUnion:


    combined_features = FeatureUnion([("pos", combined_features_pos),
                                      ("punct", combined_features_punct)])

Finally, these are my final combined features.
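To show what a nested FeatureUnion produces, here is a small self-contained sketch. It does not use my real transformers: `SubstringCounter` is a hypothetical stand-in that counts how often a substring occurs in each document, and the two example documents are made-up tagger output in the same word-Tag format as above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion


class SubstringCounter(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: one column counting a substring per document."""

    def __init__(self, substring):
        self.substring = substring

    def fit(self, X, y=None):
        # Nothing to learn; counting is stateless.
        return self

    def transform(self, X):
        return np.array([[doc.count(self.substring)] for doc in X])


# Two inner unions, one for POS tags and one for punctuation marks.
pos_union = FeatureUnion([("noun", SubstringCounter("-Noun")),
                          ("verb", SubstringCounter("-Verb"))])
punct_union = FeatureUnion([("comma", SubstringCounter(",-Punc")),
                            ("period", SubstringCounter(".-Punc"))])

# The outer union concatenates the columns of both inner unions.
combined = FeatureUnion([("pos", pos_union), ("punct", punct_union)])

docs = [
    "uzun-Adj bir-Det süre-Noun sonra-Adv kendime-Pron geldim-Verb .-Punc",
    "kitap-Noun ,-Punc kalem-Noun ve-Conj defter-Noun aldım-Verb .-Punc",
]
features = combined.fit_transform(docs)
print(features.tolist())
# [[1, 1, 0, 1], [3, 1, 1, 1]]  (columns: noun, verb, comma, period)
```

Because a FeatureUnion is itself a transformer, it can be nested inside another one, and the column order follows the order in which the transformers are listed.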