In praise of FeatureUnion

In the dark ages of my project, I needed multiple, parallel feature extractions from my dataset, and then I found the proper scikit-learn tool: FeatureUnion. This tool concatenates the results of multiple transformer objects. I had to extract both part-of-speech (POS) tag and punctuation features, so I decided to use FeatureUnion in my project. I figured out that the combination of the POS tag vectors looks like this:

[figure: combining the POS tag vectors with FeatureUnion]

and of course I applied the same solution to the punctuation vectors:

[figure: combining the punctuation vectors with FeatureUnion]

All of the feature union code is below:


from sklearn.pipeline import FeatureUnion

combined_features_pos = FeatureUnion([("noun", noun_vector),
                                      ("verb", verb_vector),
                                      ("adjective", adjective_vector),
                                      ("adverb", adverb_vector),
                                      ("pronoun", pronoun_vector),
                                      ("conjunction", conjunction_vector),
                                      ("number", number_vector)])

combined_features_punct = FeatureUnion([("comma", comma_vector),
                                        ("period", period_vector),
                                        ("colon", colon_vector),
                                        ("semicolon", semicolon_vector),
                                        ("question", question_mark_vector),
                                        ("exclamation", exclamation_mark_vector),
                                        ("triple_dot", triple_dot_vector)])

That was still not enough for me, so I combined the two combined features via another FeatureUnion:


combined_features = FeatureUnion([("pos", combined_features_pos), 
                                  ("punct", combined_features_punct)])

Finally, here is my complete combined feature set.
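
Using it is a single call. A minimal usage sketch, assuming data is a list of raw text documents and every underlying vectorizer works on raw text:


combined_matrix = combined_features.fit_transform(data)

# one row per document; the POS columns are followed by the punctuation
# columns, horizontally stacked by FeatureUnion
print(combined_matrix.shape)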

A custom vectorizer for scikit-learn

Scikit-learn provides capable text vectorizers, which are utilities to build feature vectors from text documents, such as CountVectorizer, TfidfVectorizer, and HashingVectorizer. A vectorizer converts a collection of text documents into a matrix of the intended features: the count vectorizer gives a matrix of token counts, the hashing vectorizer gives a matrix of token occurrences, and the tf-idf vectorizer gives a matrix of tf-idf features.
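
A toy comparison on a two-document corpus shows the difference between the three:


from sklearn.feature_extraction.text import (CountVectorizer,
                                             TfidfVectorizer,
                                             HashingVectorizer)

corpus = ["the cat sat on the mat", "the dog sat"]

# raw token counts per document
print(CountVectorizer().fit_transform(corpus).toarray())

# tf-idf weights: tokens that appear everywhere, like "the", are down-weighted
print(TfidfVectorizer().fit_transform(corpus).toarray())

# hashed occurrences: no vocabulary is stored, tokens are hashed into 16 buckets
print(HashingVectorizer(n_features=16).fit_transform(corpus).toarray())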

I needed a custom vectorizer for my project: a punctuation vectorizer, meaning one that produces a matrix of only punctuation counts for a given collection of text data. Implementing it turned out to be simpler than I expected. I inherited my vectorizer class from CountVectorizer and do all the work in the prepare_doc method, so prepare_doc is the key point of the class.


def prepare_doc(self, doc):
        # keep only punctuation: drop every character that is not in this list
        punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/',
                     ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
        # my data stores line breaks as the literal text "\r\n"
        doc = doc.replace("\\r\\n", " ")
        for character in doc:
            if character not in punc_list:
                doc = doc.replace(character, "")
        return doc
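
For intuition, here is the same filtering as a standalone function (keep_punctuation is just a throwaway name for this sketch) applied to a made-up sentence:


def keep_punctuation(doc):
    punc_list = set('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
    return "".join(character for character in doc if character in punc_list)

print(keep_punctuation("Hello, world... Does it work? Yes; it does!"))
# output: ,...?;!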

All of the code of PunctVectorizer is below:


from sklearn.feature_extraction.text import CountVectorizer

class PunctVectorizer(CountVectorizer):

    def __init__(self, **kwargs):
        # forward any CountVectorizer parameters unchanged
        super(PunctVectorizer, self).__init__(**kwargs)

    def prepare_doc(self, doc):
        # keep only punctuation: drop every character that is not in this list
        punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/',
                     ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
        # my data stores line breaks as the literal text "\r\n"
        doc = doc.replace("\\r\\n", " ")
        for character in doc:
            if character not in punc_list:
                doc = doc.replace(character, "")
        return doc

    def build_analyzer(self):
        # the analyzer returns a string of punctuation characters;
        # CountVectorizer iterates over that string, so every single
        # character becomes a token and is counted in the vocabulary
        preprocess = self.build_preprocessor()
        return lambda doc: preprocess(self.decode(self.prepare_doc(doc)))

A small example of PunctVectorizer's results is below:


punct_vector = PunctVectorizer()
data_matrix = punct_vector.fit_transform(data)

print punct_vector.vocabulary_
output: {u'"': 0, u'$': 1, u"'": 3, u'&': 2, u')': 5, u'(': 4, u'+': 6, u'-': 8, u',': 7, u'/': 10, u'.': 9, u';': 12, u':': 11, u'=': 13, u'?': 14}

print data_matrix.getrow(0)
output: (0, 9)	25
  (0, 8)	12
  (0, 7)	22
  (0, 6)	2
  (0, 5)	1
  (0, 4)	1
  (0, 3)	4
  (0, 0)	6

The last output shows the counts of the punctuation characters in the first document of the given collection of text data. Read against the vocabulary above, (0, 9) 25 and (0, 8) 12 mean that the first document contains 25 ‘.’ characters, 12 ‘-’ characters, and so on.
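
Reading those (row, column) pairs against the vocabulary by hand gets tedious; inverting the vocabulary prints the same row with the actual characters (a small convenience sketch):


# invert the vocabulary: column index -> punctuation character
index_to_punct = dict((index, punct)
                      for punct, index in punct_vector.vocabulary_.items())

row = data_matrix.getrow(0)
for index, count in zip(row.indices, row.data):
    print(index_to_punct[index] + " " + str(count))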