My PyCon notes

This was my first PyCon, so I'll talk about it right now. Without a doubt, PyCon is a great event. Before my notes on the talks, I want to mention the organisation's financial aid: I received a financial grant to cover my transoceanic travel expenses, yay! I also volunteered twice during the conference. First, I helped out at the registration desk. Second, I worked at the PyLadies stand, where I sold approximately 20 PyLadies t-shirts. I also met great people while volunteering.

Now the talks can take the stage. On the first day I mostly sat in on machine learning related talks. (The talk titles link to the recordings, so you can watch them easily.)

Machine Learning 101 [pandas, scikit-learn, gensim, Theano, Continuum packages for machine learning]
“Words, words, words”: Reading Shakespeare with Python [text analysis, metadata, rhyme distribution] (*it is a similar but lighter version of my authorship detection project)
Data Science in Advertising: Or a future when we love ads [Real-Time Bidded (RTB) advertising, Click Through Rate (CTR) prediction, auto-bidding systems, traffic prediction]
Grids, Streets and Pipelines: Building a linguistic street map with scikit-learn [geojson, hyperparameters, geopandas]
How to interpret your own genome using (mostly) Python [gemini, genome sequence]
Losing your Loops: Fast Numerical Computing with NumPy [aggregation functions, universal functions, broadcasting, and fancy indexing] (*that is my favourite! it’s so clear, simple and useful)
How to build a brain with Python [simulating the brain, Nengo, Spaun]
Keynote – Guido van Rossum [Python 3, diversity]
A Beginner’s Guide to Test-driven Development [TDD]
Cutting Off the Internet: Testing Applications that Use Requests [requests, vcr, httpretty, mock, and betamax]
Techniques for Debugging Hard Problems [always read the source, read all of the source]
Finding Spammers & Scammers through Rate Tracking with Python & Redis [velocity engine, keyspaces and facets]

I should also mention the poster session. I like clear and simple projects, and I saw a few clear and simple poster projects that I liked. Great job!

In praise of FeatureUnion

During the dark ages of my project, I needed to extract several kinds of features from my dataset in parallel, and I found the right scikit-learn tool for the job: FeatureUnion. This tool concatenates the results of multiple transformer objects. I needed to extract both part-of-speech (POS) tag features and punctuation features, so from that point on I decided to use FeatureUnion in my project. I combined the individual POS tag vectors into one union, and of course I applied the same solution to the punctuation vectors.

All of the FeatureUnion code is below:

from sklearn.pipeline import FeatureUnion

# Each *_vector is a transformer object defined elsewhere in the project.
# One union for the part-of-speech tag features...
combined_features_pos = FeatureUnion([("noun", noun_vector),
                                      ("verb", verb_vector),
                                      ("adjective", adjective_vector),
                                      ("adverb", adverb_vector),
                                      ("pronoun", pronoun_vector),
                                      ("conjunction", conjunction_vector),
                                      ("number", number_vector)])

# ...and one union for the punctuation features.
combined_features_punct = FeatureUnion([("comma", comma_vector),
                                        ("period", period_vector),
                                        ("colon", colon_vector),
                                        ("semicolon", semicolon_vector),
                                        ("question", question_mark_vector),
                                        ("exclamation", exclamation_mark_vector),
                                        ("triple_dot", triple_dot_vector)])

That was not enough for me, so I combined the two combined feature sets with another FeatureUnion:

combined_features = FeatureUnion([("pos", combined_features_pos), 
                                  ("punct", combined_features_punct)])

Finally, that gives me my final combined feature set.
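To make the concatenation concrete, here is a minimal sketch of nested FeatureUnions on toy data. The three single-character vectorizers and the two sample sentences are stand-ins I made up for illustration, not the vectorizers from my project; the only thing the sketch is meant to show is that each transformer contributes its own columns, in the order the unions list them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Toy documents (illustrative only).
docs = ["Wait, what? No!", "Yes, yes, yes."]

# Hypothetical stand-ins: each vectorizer counts one fixed character.
comma_vector = CountVectorizer(analyzer="char", vocabulary=[","])
period_vector = CountVectorizer(analyzer="char", vocabulary=["."])
question_vector = CountVectorizer(analyzer="char", vocabulary=["?"])

# Inner union: comma and period counts side by side.
pause_features = FeatureUnion([("comma", comma_vector),
                               ("period", period_vector)])

# Outer union: the inner union plus the question mark counts.
combined = FeatureUnion([("pause", pause_features),
                         ("question", question_vector)])

# Columns are [comma, period, question], one row per document.
matrix = combined.fit_transform(docs).toarray()
print(matrix)
```

Each row has three columns because the outer union simply stacks the two inner columns next to the question mark column.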

Custom vectorizer for scikit-learn

Scikit-learn provides capable text vectorizers, utilities that build feature vectors from text documents, such as CountVectorizer, TfidfVectorizer, and HashingVectorizer. A vectorizer converts a collection of text documents to a matrix of the intended features: CountVectorizer gives a matrix of token counts, HashingVectorizer gives a matrix of token occurrences, and TfidfVectorizer gives a matrix of tf-idf features.
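As a quick illustration of the three vectorizers, here is a small sketch on two toy sentences. The sentences and the n_features=16 setting are arbitrary choices for the example; real projects usually keep the much larger default for HashingVectorizer.

```python
from sklearn.feature_extraction.text import (CountVectorizer,
                                             HashingVectorizer,
                                             TfidfVectorizer)

# Toy documents (illustrative only).
docs = ["the cat sat", "the cat sat on the mat"]

# Token counts: one column per distinct token in the learned vocabulary.
count_vector = CountVectorizer()
counts = count_vector.fit_transform(docs)

# tf-idf weights: same columns as the counts, but reweighted and normalised.
tfidf = TfidfVectorizer().fit_transform(docs)

# Hashed tokens: no vocabulary is stored, the width is fixed up front.
hashed = HashingVectorizer(n_features=16).fit_transform(docs)

print(sorted(count_vector.vocabulary_))  # tokens in column order
print(counts.toarray())
print(tfidf.shape, hashed.shape)
```

The count and tf-idf matrices share the same shape because they learn the same vocabulary, while the hashed matrix always has the width you ask for.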

I needed a custom vectorizer for my project: a punctuation vectorizer that gives a matrix containing only the punctuation counts for a given collection of text data. Implementing it was simpler than I expected. I derived my vectorizer class from CountVectorizer and do all the work in a prepare_doc method, so prepare_doc is the key part of the class.

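def prepare_doc(self, doc):
    # Punctuation characters to keep; every other character is dropped.
    punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
    # Replace the literal character sequence \r\n (backslash, r, backslash, n) with a space.
    doc = doc.replace("\\r\\n", " ")
    # Keep only the punctuation characters, in their original order.
    return "".join(character for character in doc if character in punc_list)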

The full code of PunctVectorizer is below:

from sklearn.feature_extraction.text import CountVectorizer

class PunctVectorizer(CountVectorizer):

    def __init__(...all parameters are here...):
        super(PunctVectorizer, self).__init__()

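    def prepare_doc(self, doc):
        # Punctuation characters to keep; every other character is dropped.
        punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
        # Replace the literal character sequence \r\n (backslash, r, backslash, n) with a space.
        doc = doc.replace("\\r\\n", " ")
        # Keep only the punctuation characters, in their original order.
        return "".join(character for character in doc if character in punc_list)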

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        return lambda doc : preprocess(self.decode(self.prepare_doc(doc)))

A small example of what PunctVectorizer produces is below:

punct_vector = PunctVectorizer()
data_matrix = punct_vector.fit_transform(data)

print punct_vector.vocabulary_
output: {u'"': 0, u'$': 1, u"'": 3, u'&': 2, u')': 5, u'(': 4, u'+': 6, u'-': 8, u',': 7, u'/': 10, u'.': 9, u';': 12, u':': 11, u'=': 13, u'?': 14}

print data_matrix.getrow(0)
output: (0, 9)	25
  (0, 8)	12
  (0, 7)	22
  (0, 6)	2
  (0, 5)	1
  (0, 4)	1
  (0, 3)	4
  (0, 0)	6

The last output shows the counts of punctuation characters in the first document of the given collection of text data. For example, (0, 9) 25 and (0, 8) 12 mean that the first document contains 25 ‘.’ characters, 12 ‘-’ characters, and so on.
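Reading the (row, column) pairs against vocabulary_ by hand gets tedious, so here is a small sketch that turns a row back into per-character counts. Note that it swaps in a different technique from my class above: instead of subclassing, it passes a callable analyzer to a plain CountVectorizer. The punctuation_analyzer function and the sample sentences are made up for this example.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer


def punctuation_analyzer(doc):
    """Hypothetical analyzer: one feature per punctuation character in doc."""
    return [ch for ch in doc if ch in string.punctuation]


# Toy documents (illustrative only).
data = ["Well... is it - really - done?", "Yes, it is."]

punct_vector = CountVectorizer(analyzer=punctuation_analyzer)
data_matrix = punct_vector.fit_transform(data)

# vocabulary_ maps character -> column index; invert it to read a row.
index_to_char = {index: char
                 for char, index in punct_vector.vocabulary_.items()}

# First document as a sparse row, then as a plain {character: count} dict.
first_row = data_matrix[0]
first_doc_counts = {index_to_char[col]: int(count)
                    for col, count in zip(first_row.indices, first_row.data)}
print(first_doc_counts)
```

The dictionary holds the same information as the sparse row printout, just keyed by the character instead of its column index.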