SVM text classification example

This post continues from the first part. Now we can move on to the second part of the text classification example.

I simplified the scikit-learn example 'Classification of text documents using sparse features' so that it runs only three types of SVC classifiers on four categories. The simplified example uses L1 penalty SVC, L2 penalty SVC, and LinearSVC with L1-based feature selection. The 'alt.atheism', 'comp.graphics', 'sci.space' and 'talk.religion.misc' text categories are used as train and test data.
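The setup described above could look roughly like the following. This is a minimal sketch, not the exact code behind this post: the helper names load_data and build_classifiers, the tol=1e-3 setting, and the use of Pipeline with SelectFromModel for the L1-based feature selection are my assumptions. load_data downloads the 20 newsgroups data on first use, so it is only defined here and not called.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# The four categories named in the post.
CATEGORIES = ["alt.atheism", "comp.graphics", "sci.space", "talk.religion.misc"]


def load_data():
    """Fetch the train and test subsets for the four categories (needs network)."""
    train = fetch_20newsgroups(subset="train", categories=CATEGORIES)
    test = fetch_20newsgroups(subset="test", categories=CATEGORIES)
    return train, test


def build_classifiers():
    """Return the three linear SVC variants, keyed by a display name."""
    return {
        # LinearSVC requires dual=False when penalty="l1".
        "L1 penalty": LinearSVC(penalty="l1", dual=False, tol=1e-3),
        "L2 penalty": LinearSVC(penalty="l2", dual=False, tol=1e-3),
        # An L1-penalized model selects features, then a second LinearSVC classifies.
        "L1LinearSVC": Pipeline([
            ("feature_selection",
             SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3))),
            ("classification", LinearSVC(penalty="l2", dual=False)),
        ]),
    }


classifiers = build_classifiers()
print(sorted(classifiers))
```

Keeping the classifiers in a dictionary makes it easy to loop over them and collect one row of results per classifier.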

The example uses 2034 documents as the training set and 1353 documents as the test set.
Using a tf-idf sparse vectorizer, it extracts n_samples: 2034, n_features: 17260 from the training dataset and n_samples: 1353, n_features: 17260 from the test dataset. n_features is the number of distinct words in the bag-of-words representation. It is the same for both sets because the test documents are transformed with the vocabulary learned from the training documents.
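To make the shapes above concrete, here is a small sketch of tf-idf vectorization. The tiny corpus is made up for illustration, and the vectorizer settings (sublinear_tf, English stop words) are assumptions rather than the settings behind the numbers in this post.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus standing in for the newsgroup posts.
train_docs = [
    "the shuttle reached orbit after launch",
    "render the polygon with a shader",
    "the orbit of the satellite decayed",
    "the shader draws each pixel of the image",
]
test_docs = ["a new satellite launch", "an unseen word like telescope"]

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
X_train = vectorizer.fit_transform(train_docs)  # learns the vocabulary
X_test = vectorizer.transform(test_docs)        # reuses it; unseen words are ignored

print("n_samples: %d, n_features: %d" % X_train.shape)
print("n_samples: %d, n_features: %d" % X_test.shape)
```

Both matrices are sparse and have the same number of columns, one per term in the vocabulary learned from the training documents.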

Now we show the train and test times of the SVC types used in our simplified example.
L1 penalty SVC: train time 0.251s, test time 0.002s.
L2 penalty SVC: train time 0.119s, test time 0.002s.
LinearSVC with L1-based feature selection: train time 0.287s, test time 0.007s.
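Train and test times like the ones above can be measured by timing fit and predict separately. The following is a minimal sketch on made-up toy data; the benchmark helper is a hypothetical name, and the timings it prints will not match the numbers above.

```python
from time import perf_counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC


def benchmark(clf, X_train, y_train, X_test):
    """Fit clf, predict on X_test, and return (pred, train_time, test_time)."""
    t0 = perf_counter()
    clf.fit(X_train, y_train)
    train_time = perf_counter() - t0

    t0 = perf_counter()
    pred = clf.predict(X_test)
    test_time = perf_counter() - t0
    return pred, train_time, test_time


# Toy data standing in for the vectorized newsgroup posts.
docs = [
    "the shuttle reached orbit after launch",
    "the satellite orbit decayed slowly",
    "render the polygon with a shader",
    "the shader draws every pixel",
]
labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
X_new = vec.transform(["satellite launch", "pixel shader"])

pred, train_time, test_time = benchmark(
    LinearSVC(penalty="l2", dual=False), X, labels, X_new
)
print("train time: %0.3fs" % train_time)
print("test time:  %0.3fs" % test_time)
```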
The results of the simplified example are below.

Category            Classifier   Precision  Recall  F1 Score
alt.atheism         L2 penalty   0.87       0.83    0.85
alt.atheism         L1 penalty   0.86       0.76    0.80
alt.atheism         L1LinearSVC  0.84       0.80    0.82
comp.graphics       L2 penalty   0.92       0.98    0.95
comp.graphics       L1 penalty   0.90       0.97    0.94
comp.graphics       L1LinearSVC  0.91       0.96    0.93
sci.space           L2 penalty   0.95       0.95    0.95
sci.space           L1 penalty   0.93       0.94    0.94
sci.space           L1LinearSVC  0.92       0.94    0.93
talk.religion.misc  L2 penalty   0.83       0.80    0.81
talk.religion.misc  L1 penalty   0.76       0.78    0.77
talk.religion.misc  L1LinearSVC  0.80       0.76    0.78
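For readers who want to see how the three columns relate, here is a small pure-Python sketch of per-category precision, recall and F1. The function names and the toy labels are made up for illustration. Since the scores above are rounded to two decimals, recomputing F1 from the rounded precision and recall can differ in the last digit.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall and F1 for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up labels: one sci.space post is misclassified as comp.graphics.
y_true = ["sci.space"] * 4 + ["comp.graphics"] * 4
y_pred = ["sci.space"] * 3 + ["comp.graphics"] * 5

for label in ("sci.space", "comp.graphics"):
    p, r, f = precision_recall_f1(y_true, y_pred, label)
    print("%-14s precision=%.2f recall=%.2f f1=%.2f" % (label, p, r, f))

# Check against the alt.atheism / L2 penalty row: precision 0.87, recall 0.83.
print("%.2f" % f1_score(0.87, 0.83))
```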

All three classifiers give similar results in each text category. The sci.space and comp.graphics categories score higher than the other two (F1 scores of 0.93 to 0.95, versus 0.77 to 0.85 for alt.atheism and talk.religion.misc). In my opinion, these technical texts contain more characteristic words (technical terms), and these terms make the categories easier to distinguish during classification.
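One way to look into this is to list the terms with the largest weights for each category in a fitted linear model. The sketch below does that on a made-up toy corpus. The top_terms helper is a hypothetical name, and nothing here reproduces the models from this post.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up documents, two per category.
docs = [
    "the shuttle reached orbit after launch",
    "the satellite orbit decayed slowly",
    "render the polygon with a shader",
    "the shader draws every pixel",
    "the debate about belief and faith",
    "faith and belief were discussed",
]
labels = ["sci.space", "sci.space", "comp.graphics", "comp.graphics",
          "talk.religion.misc", "talk.religion.misc"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
clf = LinearSVC(penalty="l2", dual=False).fit(X, labels)

# Terms ordered by their column index in X.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def top_terms(clf, terms, k=3):
    """Return the k highest-weighted terms for each class of a linear model."""
    return {
        label: [terms[j] for j in np.argsort(clf.coef_[i])[-k:][::-1]]
        for i, label in enumerate(clf.classes_)
    }


for label, words in top_terms(clf, terms).items():
    print("%s: %s" % (label, " ".join(words)))
```

With more than two classes, LinearSVC fits one weight vector per class, so each row of coef_ can be read as that category's word weights.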
