Custom vectorizer for scikit-learn

Scikit-learn provides powerful text vectorizers, utilities that build feature vectors from text documents: CountVectorizer, TfidfVectorizer, and HashingVectorizer. A vectorizer converts a collection of text documents into a feature matrix. CountVectorizer gives a matrix of token counts, HashingVectorizer gives a matrix of token occurrences, and TfidfVectorizer gives a matrix of tf-idf features.
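
To make "a matrix of token counts" concrete, here is a minimal pure-Python sketch. It is only an illustration: it splits on whitespace, which is a simplifying assumption and not how scikit-learn tokenizes, and the name count_matrix is made up for this example.

```python
from collections import Counter


def count_matrix(docs):
    """Return (vocabulary, rows); rows[i][j] counts token j in document i."""
    # Naive tokenization: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    # Map each distinct token to a column index, in sorted order.
    all_tokens = sorted({tok for toks in tokenized for tok in toks})
    vocabulary = {tok: idx for idx, tok in enumerate(all_tokens)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok, n in Counter(toks).items():
            row[vocabulary[tok]] = n
        rows.append(row)
    return vocabulary, rows


vocabulary, rows = count_matrix(["the cat sat", "the cat saw the dog"])
print(vocabulary)
print(rows)
```

Each row is one document and each column is one vocabulary entry, which is the same layout the real vectorizers produce in sparse form.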

I needed a custom vectorizer for my project: a punctuation vectorizer that, given a collection of text documents, returns a matrix containing only punctuation counts. Implementing it turned out to be simpler than I expected. I derived my vectorizer class from CountVectorizer and did all the work in a prepare_doc method, which strips every non-punctuation character from a document, so prepare_doc is the key part of the class.


def prepare_doc(self, doc):
    # Punctuation characters to keep; everything else is stripped.
    punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
    # Drop literal "\r\n" escape sequences so their backslashes are not counted.
    doc = doc.replace("\\r\\n", " ")
    for character in doc:
        if character not in punc_list:
            doc = doc.replace(character, "")
    return doc
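
The same filtering can probably be written more compactly with the standard library. The sketch below is a standalone function, not the method from the post; the name keep_punctuation is hypothetical, and it assumes that string.punctuation (the 32 ASCII punctuation characters) is the set you want to keep.

```python
import string

# Set lookup is O(1) per character; string.punctuation is the ASCII set
# !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
PUNCTUATION = frozenset(string.punctuation)


def keep_punctuation(doc):
    """Return only the punctuation characters of doc, in original order."""
    # Remove literal "\r\n" escape text first so its backslashes are not kept.
    doc = doc.replace("\\r\\n", " ")
    # Single pass over the document instead of one replace() per character.
    return "".join(ch for ch in doc if ch in PUNCTUATION)


print(keep_punctuation("Hello, world! (test)"))
```

This makes one pass over the document, whereas calling replace() inside the loop rescans the whole string for every character.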

The full code of PunctVectorizer is below:


from sklearn.feature_extraction.text import CountVectorizer

class PunctVectorizer(CountVectorizer):

    def __init__(...all parameters are here...):
        super(PunctVectorizer, self).__init__()

    def prepare_doc(self, doc):
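        # Punctuation characters to keep; everything else is stripped.
        punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
        # Drop literal "\r\n" escape sequences so their backslashes are not counted.
        doc = doc.replace("\\r\\n", " ")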
        for character in doc:
            if character not in punc_list:
                doc = doc.replace(character, "")
        return doc

    def build_analyzer(self):
        preprocess = self.build_preprocessor()
        return lambda doc : preprocess(self.decode(self.prepare_doc(doc)))
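
An alternative that avoids subclassing is to pass a callable as the analyzer parameter of CountVectorizer, which scikit-learn accepts. The sketch below is a different technique from the class above, not a rewrite of it: punctuation_analyzer is a hypothetical helper, the two sample documents are invented, and it assumes string.punctuation is the character set of interest.

```python
import string

from sklearn.feature_extraction.text import CountVectorizer

PUNCTUATION = frozenset(string.punctuation)


def punctuation_analyzer(doc):
    """Return the punctuation characters of doc, one feature per character."""
    return [ch for ch in doc if ch in PUNCTUATION]


# Invented sample data, only for illustration.
docs = ["Hello, world!", "Wait... what?!"]

# A callable analyzer receives each raw document and returns its features.
vectorizer = CountVectorizer(analyzer=punctuation_analyzer)
matrix = vectorizer.fit_transform(docs)

print(vectorizer.vocabulary_)
print(matrix.toarray())
```

Because the analyzer is just a function, the filtering logic can be tested on its own, without constructing a vectorizer.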

A small example of PunctVectorizer output is below:


punct_vector = PunctVectorizer()
data_matrix = punct_vector.fit_transform(data)

print punct_vector.vocabulary_
output: {u'"': 0, u'$': 1, u"'": 3, u'&': 2, u')': 5, u'(': 4, u'+': 6, u'-': 8, u',': 7, u'/': 10, u'.': 9, u';': 12, u':': 11, u'=': 13, u'?': 14}

print data_matrix.getrow(0)
output: (0, 9)	25
  (0, 8)	12
  (0, 7)	22
  (0, 6)	2
  (0, 5)	1
  (0, 4)	1
  (0, 3)	4
  (0, 0)	6

The last output shows the punctuation counts for the first document in the collection. For example, (0, 9) 25 and (0, 8) 12 mean that the first document contains 25 '.' characters and 12 '-' characters, and so on.
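
Reading the row by hand means looking up each column index in the vocabulary. A small sketch of that lookup, using the values printed above, might look like this. The function describe_row is hypothetical, and the row is typed in as a plain dict rather than taken from the sparse matrix.

```python
def describe_row(vocabulary, row_counts):
    """Map {column index: count} to {punctuation character: count}."""
    # Invert the vocabulary: column index -> character.
    index_to_char = {idx: ch for ch, idx in vocabulary.items()}
    return {index_to_char[idx]: n for idx, n in row_counts.items()}


# Vocabulary and first row copied from the output shown in this post.
vocabulary = {'"': 0, '$': 1, "'": 3, '&': 2, ')': 5, '(': 4, '+': 6,
              '-': 8, ',': 7, '/': 10, '.': 9, ';': 12, ':': 11, '=': 13,
              '?': 14}
first_row = {9: 25, 8: 12, 7: 22, 6: 2, 5: 1, 4: 1, 3: 4, 0: 6}

counts = describe_row(vocabulary, first_row)
print(counts)
```

With a real SciPy sparse row, the index and count pairs should be available from the row's indices and data attributes.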


