Authorship detection using n-gram features with SVM

I used a combined tf-idf vectorizer over both word and character bigrams as the feature set in an authorship detection example.

First of all, what is an n-gram? An n-gram is a contiguous sequence of n items from a given sequence of text or speech, where n is an integer greater than zero. Language models that take advantage of the ordering of words are called n-gram language models. An n-gram model can be pictured as sliding a small window over the text so that only n words are visible at a time. The simplest n-gram model is the unigram model, in which n is one, so the window shows only one word at a time. More complex models, such as the bigram (n = 2) and the trigram (n = 3), are usually more informative than the unigram model. [1]
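The sliding-window picture can be sketched in a few lines of Python. This is only an illustration: the helper name `ngrams` and the sample sentence are my own and are not part of the original example.

```python
def ngrams(items, n):
    """Return every run of n adjacent items, as seen through a sliding window."""
    if n < 1:
        raise ValueError("n must be an integer greater than zero")
    # The window starts at each position where n items still fit.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Hypothetical sample text, split into words.
words = "the quick brown fox".split()

unigrams = ngrams(words, 1)  # window shows one word at a time
bigrams = ngrams(words, 2)   # window shows two adjacent words

# The same window works on characters instead of words.
char_bigrams = ["".join(gram) for gram in ngrams("fox", 2)]

print(bigrams)       # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(char_bigrams)  # ['fo', 'ox']
```

Word bigrams capture short phrases, while character bigrams capture spelling and punctuation habits, which is why combining the two can be useful for telling authors apart.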

The authorship detection example uses the Reuter_50_50 dataset as training and test data. [2] The Reuter_50_50 dataset contains 50 authors and 50 texts belonging to each author. As a starting point, the example uses the texts of only eight authors: 'AaronPressman', 'AlanCrosby', 'AlexanderSmith', 'BenjaminKangLim', 'BernardHickey', 'BradDorfman', 'DarrenSchuettler', and 'DavidLawder'.
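A setup of this kind could look roughly like the sketch below. It assumes scikit-learn, which the post does not name explicitly: `FeatureUnion` joins two `TfidfVectorizer` instances (word bigrams and character bigrams), and `LinearSVC` stands in for the SVC with a selectable L1 or L2 penalty. The function name `build_model`, the toy texts, and the labels `author_a` and `author_b` are made up for illustration and are not the Reuter_50_50 data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

def build_model(penalty="l2"):
    """Combine word-bigram and character-bigram tf-idf features, then a linear SVM."""
    features = FeatureUnion([
        ("word_bigrams", TfidfVectorizer(analyzer="word", ngram_range=(2, 2))),
        ("char_bigrams", TfidfVectorizer(analyzer="char", ngram_range=(2, 2))),
    ])
    # dual=False is needed for the L1 penalty and also works for L2.
    return make_pipeline(features, LinearSVC(penalty=penalty, dual=False))

# Hypothetical toy data standing in for the real training texts.
train_texts = [
    "the central bank raised interest rates again",
    "the central bank kept interest rates on hold",
    "shares in the car maker fell sharply today",
    "shares in the car maker rose sharply today",
]
train_authors = ["author_a", "author_a", "author_b", "author_b"]

model = build_model(penalty="l2")
model.fit(train_texts, train_authors)

predictions = model.predict([
    "the central bank cut interest rates",
    "shares in the car maker were flat",
])
print(predictions)
```

Switching to `build_model(penalty="l1")` gives the sparse variant. On the real data, per-author precision, recall, and F1 could then be obtained with `sklearn.metrics.classification_report`.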

Here are the results:
n_samples: 400, n_features: 32885 for both the training and the test data.
The table shows the precision, recall, and F1-score of the L1-penalty SVC and the L2-penalty SVC for each of the eight authors.

Author             Penalty   Precision   Recall   F1-score
AaronPressman      L2        0.96        0.98     0.97
AaronPressman      L1        0.94        0.94     0.94
AlanCrosby         L2        0.98        1.00     0.99
AlanCrosby         L1        0.94        0.94     0.94
AlexanderSmith     L2        0.96        0.98     0.97
AlexanderSmith     L1        0.94        0.90     0.92
BenjaminKangLim    L2        1.00        1.00     1.00
BenjaminKangLim    L1        0.98        0.96     0.97
BernardHickey      L2        1.00        0.98     0.99
BernardHickey      L1        0.98        0.98     0.98
BradDorfman        L2        0.69        1.00     0.82
BradDorfman        L1        0.61        0.96     0.74
DarrenSchuettler   L2        1.00        0.90     0.95
DarrenSchuettler   L1        1.00        0.94     0.97
DavidLawder        L2        1.00        0.62     0.77
DavidLawder        L1        0.93        0.50     0.65
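As a reminder of how the last column relates to the other two, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes two F1 values from the rounded precision and recall figures reported above. The helper name `f1_from` is my own.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded precision and recall values taken from the L2-penalty rows.
brad_dorfman_l2 = round(f1_from(0.69, 1.00), 2)
david_lawder_l2 = round(f1_from(1.00, 0.62), 2)

print(brad_dorfman_l2)  # 0.82
print(david_lawder_l2)  # 0.77
```

Because the inputs are already rounded to two decimals, a recomputed value can differ from a reported one by 0.01 in some rows.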


