Custom vectorizer for scikit-learn

Scikit-learn provides several text vectorizers, utilities that build feature vectors from text documents, such as CountVectorizer, TfidfVectorizer, and HashingVectorizer. A vectorizer converts a collection of text documents to a matrix of the intended features: CountVectorizer gives a matrix of token counts, HashingVectorizer gives a matrix of token occurrences, and TfidfVectorizer gives a matrix of tf-idf features.
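As a quick illustration of the built-in vectorizers, here is a minimal sketch on a made-up three-document corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the cat ran", "the dog sat"]

# CountVectorizer: one row per document, one column per vocabulary token
count_matrix = CountVectorizer().fit_transform(docs)
print(count_matrix.shape)  # (3, 5): 3 documents, 5 distinct tokens

# TfidfVectorizer: same shape, but tf-idf weights instead of raw counts
tfidf_matrix = TfidfVectorizer().fit_transform(docs)
print(tfidf_matrix.shape)  # (3, 5)
```

Both vectorizers share the same fit_transform interface, which is what makes subclassing one of them attractive for a custom feature extractor.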

I needed a custom vectorizer for my project: a punctuation vectorizer, which should give a matrix of only punctuation counts for a given collection of text documents. Implementing it turned out to be simpler than I had figured. I inherited my vectorizer class from CountVectorizer and do all the work in the prepare_doc method, so that method is the key point of the class.


def prepare_doc(self, doc):
    # Keep only punctuation characters; everything else is stripped out.
    punc_list = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/',
                 ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
    doc = doc.replace("\r\n", " ")
    for character in doc:
        if character not in punc_list:
            doc = doc.replace(character, "")
    return doc

The full code of PunctVectorizer is below:


from sklearn.feature_extraction.text import CountVectorizer

class PunctVectorizer(CountVectorizer):

    def __init__(self, **kwargs):
        # Pass any CountVectorizer parameters straight through.
        super(PunctVectorizer, self).__init__(**kwargs)

    def prepare_doc(self, doc):
        # Keep only punctuation characters; everything else is stripped out.
        punc_list = ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/',
                     ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
        doc = doc.replace("\r\n", " ")
        for character in doc:
            if character not in punc_list:
                doc = doc.replace(character, "")
        return doc

    def build_analyzer(self):
        # The analyzer returns a string of punctuation characters; iterating
        # over it yields each character as a separate token to count.
        preprocess = self.build_preprocessor()
        return lambda doc: preprocess(self.decode(self.prepare_doc(doc)))

A small example result of PunctVectorizer is below:


punct_vector = PunctVectorizer()
data_matrix = punct_vector.fit_transform(data)

print punct_vector.vocabulary_
output: {u'"': 0, u'$': 1, u"'": 3, u'&': 2, u')': 5, u'(': 4, u'+': 6, u'-': 8, u',': 7, u'/': 10, u'.': 9, u';': 12, u':': 11, u'=': 13, u'?': 14}

print data_matrix.getrow(0)
output: (0, 9)	25
  (0, 8)	12
  (0, 7)	22
  (0, 6)	2
  (0, 5)	1
  (0, 4)	1
  (0, 3)	4
  (0, 0)	6

The last output shows the counts of punctuation characters in the first document of the given collection. For example, (0, 9) 25 and (0, 8) 12 mean that the first document contains 25 '.' characters, 12 '-' characters, and so on.


Authorship detection with n-gram features and SVM

I used a tf-idf vectorizer combining both word and character bigrams as features in an authorship detection example.

First of all, what is an n-gram? An n-gram is a contiguous sequence of n items from a given sequence of text or speech, where n is an integer greater than zero. Language models that take advantage of the ordering of words are called n-gram language models. N-gram models can be envisioned as sliding a small window over the given text, through which only n words are visible at a time. The simplest n-gram model is the unigram model, where n is one; the window shows only one word at a time. The more complicated models, the bigram (n is two) and trigram (n is three), are commonly more informative than the unigram.[1]
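The sliding-window view of n-grams can be sketched in a few lines (word_ngrams is a made-up helper name, not part of any library):

```python
def word_ngrams(text, n):
    """Slide a window of n words over the text and collect what it shows."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "the quick brown fox"
print(word_ngrams(sentence, 1))  # unigrams: [('the',), ('quick',), ('brown',), ('fox',)]
print(word_ngrams(sentence, 2))  # bigrams: [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```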

The authorship detection example uses the Reuter_50_50 dataset as train and test data.[2] Reuter_50_50 contains 50 authors, with 50 texts belonging to each author. As a start, the example takes the texts of only eight authors: 'AaronPressman', 'AlanCrosby', 'AlexanderSmith', 'BenjaminKangLim', 'BernardHickey', 'BradDorfman', 'DarrenSchuettler' and 'DavidLawder'.
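The combined word- and character-bigram tf-idf features can be sketched with scikit-learn's FeatureUnion. The tiny inline corpus below stands in for the Reuter_50_50 training split, and the exact vectorizer parameters of the original run are an assumption:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

# Word bigrams and character bigrams stacked side by side into one feature matrix
combined = FeatureUnion([
    ("word_bigrams", TfidfVectorizer(analyzer="word", ngram_range=(2, 2))),
    ("char_bigrams", TfidfVectorizer(analyzer="char", ngram_range=(2, 2))),
])

clf = Pipeline([
    ("features", combined),
    ("svc", LinearSVC()),  # L2 penalty by default; penalty="l1" requires dual=False
])

# Stand-in documents and author labels (the real data comes from Reuter_50_50)
train_texts = ["Prices rose slowly in the first quarter.",
               "The central bank cut interest rates again."]
train_authors = ["AaronPressman", "AlanCrosby"]
clf.fit(train_texts, train_authors)
print(clf.predict(["The bank cut rates once more."]))
```

With the real dataset, the same pipeline is fit on the training texts of the eight authors and evaluated on the held-out test texts.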

Here are the results:
n_samples: 400, n_features: 32885 for both train and test data.
The table shows the precision, recall and F1 score of the L1-penalty and L2-penalty SVCs for each of the eight authors.

Author            Penalty  Precision  Recall  F1 Score
AaronPressman     L2       0.96       0.98    0.97
                  L1       0.94       0.94    0.94
AlanCrosby        L2       0.98       1.00    0.99
                  L1       0.94       0.94    0.94
AlexanderSmith    L2       0.96       0.98    0.97
                  L1       0.94       0.90    0.92
BenjaminKangLim   L2       1.00       1.00    1.00
                  L1       0.98       0.96    0.97
BernardHickey     L2       1.00       0.98    0.99
                  L1       0.98       0.98    0.98
BradDorfman       L2       0.69       1.00    0.82
                  L1       0.61       0.96    0.74
DarrenSchuettler  L2       1.00       0.90    0.95
                  L1       1.00       0.94    0.97
DavidLawder       L2       1.00       0.62    0.77
                  L1       0.93       0.50    0.65

References:
1. http://nlpwp.org/book/chap-ngrams.xhtml
2. http://archive.ics.uci.edu/ml/datasets/Reuter_50_50

Classification of Text Documents

This post is about classification of text documents via Support Vector Machines. Before text classification, I will try to give a general overview of classification. Machine learning is a subfield of computer science and statistics that studies algorithms which learn from data.[1a] Machine learning can be divided into two types: supervised and unsupervised learning.

1) Supervised learning works with example inputs and their desired outputs, and the task is to learn a general rule that maps inputs to outputs. Supervised learning is divided into two branches: classification and regression. We will talk about classification; simply put, classification tries to label given input variables with the correct category or class.[2]

2) Unsupervised learning works without pre-specified dependent attributes, trying to find hidden patterns in unlabeled data.[1b]

I also want to add some basic information about the confusion matrix, which shows the predicted versus the actual classifications.

Confusion matrix

  • The accuracy (AC) is the proportion of the total number of predictions that were correct.
  • The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified.
  • The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive.
  • The true negative rate (TN) is the proportion of negative cases that were classified correctly.
  • The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative.
  • Precision (P) is the proportion of the predicted positive cases that were correct.[3]

There is another metric, the F1 score, which gives a single score based on the precision and recall for the class.

The precision, recall and F1 score are all equal to 1 in a perfect classification.
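Those definitions can be checked with a few lines of Python (a small sketch; the helper name and the counts are made up):

```python
def prf_scores(tp, fp, fn):
    """Precision, recall and F1 score from the cells of a binary confusion matrix."""
    precision = tp / float(tp + fp)
    recall = tp / float(tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# A perfect classifier: no false positives, no false negatives
print(prf_scores(tp=10, fp=0, fn=0))  # (1.0, 1.0, 1.0)

# A classifier that makes some mistakes: precision = recall = f1 = 0.8
print(prf_scores(tp=8, fp=2, fn=2))
```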

I want to classify text documents with the Support Vector Machines (SVM) classification algorithm.[4] There are many classification algorithms, but SVMs consistently achieve good performance on categorization tasks and considerably outperform existing methods.[5] I also found an example of text classification that uses common classification algorithms.[6] That example uses a dataset consisting of 20 different newsgroups.[7]
I will talk about SVM details and an example of text classification in the next post.

References
1a, 1b. Ron Kohavi; Foster Provost (1998). "Glossary of terms". Machine Learning 30: 271–274. (http://ai.stanford.edu/~ronnyk/glossary.html)
2. http://scikit-learn.org/stable/tutorial/basic/tutorial.html
3. http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
4. Cortes, C.; Vapnik, V. (1995). “Support-vector networks”. Machine Learning 20, 273-297.
5. http://www.cs.cornell.edu/people/tj/publications/joachims_97b.pdf
6. http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
7. http://qwone.com/~jason/20Newsgroups/