Classification of Text Documents

This post is about classifying text documents with Support Vector Machines. Before turning to text classification, I will give a general overview of classification. Machine learning is a subfield of computer science and statistics that studies algorithms which learn from data.[1a] Machine learning can be divided into two types: supervised and unsupervised learning.

1) Supervised learning works with example inputs and their desired outputs; the task is to learn a general rule that maps inputs to outputs. Supervised learning is divided into two branches: classification and regression. We will focus on classification, which, put simply, tries to label a given set of input variables with the correct category or class.[2]

2) Unsupervised learning works without pre-specified dependent attributes and tries to find hidden patterns in unlabeled data.[1b]

I also want to add some basic information about the confusion matrix, which shows the predicted and actual classifications.

Confusion matrix

  • The accuracy (AC) is the proportion of the total number of predictions that were correct.
  • The recall or true positive rate (TP) is the proportion of positive cases that were correctly identified.
  • The false positive rate (FP) is the proportion of negative cases that were incorrectly classified as positive.
  • The true negative rate (TN) is the proportion of negative cases that were classified correctly.
  • The false negative rate (FN) is the proportion of positive cases that were incorrectly classified as negative.
  • Precision (P) is the proportion of the predicted positive cases that were correct.[3]
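
The definitions above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a binary problem where the four cells of the confusion matrix are available as raw counts. The function name `confusion_metrics` and the argument names `tp`, `fp`, `tn`, `fn` are my own choices (here they are counts, not rates), and the numbers in the usage line are made up.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute the rates listed above from raw confusion-matrix counts.

    tp, fp, tn, fn are hypothetical names for the counts of true positives,
    false positives, true negatives and false negatives.
    """
    total = tp + fp + tn + fn
    positives = tp + fn            # actual positive cases
    negatives = tn + fp            # actual negative cases
    predicted_positives = tp + fp  # cases predicted as positive

    def ratio(num, den):
        # Assumed convention: return 0.0 when the denominator is zero.
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, total),
        "recall": ratio(tp, positives),               # true positive rate
        "false_positive_rate": ratio(fp, negatives),
        "true_negative_rate": ratio(tn, negatives),
        "false_negative_rate": ratio(fn, positives),
        "precision": ratio(tp, predicted_positives),
    }


# Made-up counts, for illustration only.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
```

Note that recall and the false negative rate always sum to 1 when there is at least one positive case, and likewise for the true negative rate and the false positive rate.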

Another metric is the F1 score, which combines the precision and recall for a class into a single score (their harmonic mean).

The precision, recall and F1 score are all equal to 1 in a perfect classification.
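
As a small sketch of how the F1 score behaves, the snippet below computes it as the harmonic mean of precision and recall, F1 = 2 * P * R / (P + R). The function name `f1_from_precision_recall` is hypothetical, and returning 0.0 when both inputs are zero is an assumed convention rather than something taken from the references.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall: 2 * P * R / (P + R)."""
    if precision + recall == 0:
        # Assumed convention: define F1 as 0.0 when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


perfect = f1_from_precision_recall(1.0, 1.0)     # perfect classification
unbalanced = f1_from_precision_recall(0.5, 1.0)  # high recall, low precision
```

Because it is a harmonic mean, the F1 score is pulled toward the smaller of the two values, so a classifier cannot score well by maximizing recall alone.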

I want to classify text documents with the Support Vector Machine (SVM) classification algorithm.[4] There are many classification algorithms, but SVMs have been reported to achieve consistently good performance on text categorization tasks and to outperform existing methods considerably.[5] I therefore looked for an example of text classification that uses common classification algorithms.[6] This example uses a dataset consisting of 20 different newsgroups.[7]
I will cover the details of SVMs and an example of text classification in the next post.

References
1a, 1b. Ron Kohavi; Foster Provost (1998). “Glossary of terms”. Machine Learning 30: 271–274. (http://ai.stanford.edu/~ronnyk/glossary.html)
2. http://scikit-learn.org/stable/tutorial/basic/tutorial.html
3. http://www2.cs.uregina.ca/~dbd/cs831/notes/confusion_matrix/confusion_matrix.html
4. Cortes, C.; Vapnik, V. (1995). “Support-vector networks”. Machine Learning 20, 273-297.
5. http://www.cs.cornell.edu/people/tj/publications/joachims_97b.pdf
6. http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html
7. http://qwone.com/~jason/20Newsgroups/