Turkish Gerunds as Features

Gerunds are derived from verbs but behave like other parts of speech in a sentence. In Turkish, gerunds are created by adding derivational suffixes to verbs; depending on the suffix, the resulting word can serve as a noun, an adjective, or an adverb in the sentence.

The suffixes that make nouns (infinitives) are -me, -mek, -ma, -mak, -iş, -ış, -uş, -üş (the variants follow Turkish vowel harmony). For example:

  • Kardeşim okumayı öğrendi. (reading)
  • Bu bakışından hoşlanmadım. (looking)
  • Yarın okula gitmek istemiyorum. (going)

The suffixes that make adjectives (participles) are -an, -ası, -mez, -ar, -dik, -ecek, -miş, -di(ği), and their vowel-harmony variants. For example:

  • Gelecek yıl işe başlayacak. (next year)
  • Polisler olası kazaları önlemek için kontrolü sağlıyorlardı. (possible accident)
  • Salonda hep bildik yüzler vardı. (familiar face)

The suffixes that make adverbs are -esiye, -ip, -meden, -ince, -ken, -eli, -dikçe, -erek, -ir … -mez, -diğinde, -e … -e, -meksizin, -cesine, and their vowel-harmony variants. For example:

  • Bu mülâkat için ölesiye hazırlandım. (to death)
  • Yemeğimi bitirir bitirmez gelirim. (as soon as I finish)
  • Ödevin bitince parkta buluşalım. (when you finish your homework)

Turkish lends itself to deriving gerunds because the language has many gerund suffixes. Starting from this point, I listed the most widely used Turkish verbs and derived gerunds from them with the suffixes above. In the end, I obtained 590 verbal nouns, 587 verbal adjectives, and 916 verbal adverbs (including vowel-harmony variants).
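
To give a flavor of the derivation, the sketch below shows the two-way vowel harmony that chooses between suffix pairs such as -mak/-mek. It is a simplification (it ignores four-way harmony, buffer consonants, and stem alternations), and the function names are mine, not from the project:

# -*- coding: utf-8 -*-
BACK_VOWELS = set(u"aıou")
FRONT_VOWELS = set(u"eiöü")

def last_vowel(stem):
    # Scan from the end of the stem and return its final vowel.
    for ch in reversed(stem):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    return None

def infinitive(stem):
    # Two-way harmony: -mak after a back vowel, -mek after a front vowel.
    suffix = u"mak" if last_vowel(stem) in BACK_VOWELS else u"mek"
    return stem + suffix

# infinitive(u"oku") -> u"okumak"; infinitive(u"gel") -> u"gelmek"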

I implemented some functions that process these gerunds as features for the classification step, and I used them with an SVM on the Radikal dataset. The program produced 2662 features on that dataset.
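
A minimal sketch of how such a lexicon can be turned into count features for an SVM with scikit-learn is below; the file names, train_texts, and train_labels are placeholders, not the project's actual code:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical lexicon files holding the derived gerund forms, one per line.
lexicon = []
for path in ("verbal_nouns.txt", "verbal_adjectives.txt", "verbal_adverbs.txt"):
    with open(path) as f:
        lexicon.extend(line.strip() for line in f if line.strip())

# Fixing the vocabulary restricts the features to gerund counts only.
vectorizer = CountVectorizer(vocabulary=sorted(set(lexicon)))
X_train = vectorizer.fit_transform(train_texts)  # train_texts: list of documents

clf = LinearSVC()
clf.fit(X_train, train_labels)  # train_labels: one class label per document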

Here are my first results:

Class   Precision   Recall   F1-Score
AH      0.87        0.67     0.76
AO      0.76        0.76     0.76
BO      0.72        0.78     0.75
EB      0.66        0.70     0.68
FT      0.79        0.84     0.82
OC      0.71        0.80     0.75
TE      0.83        0.76     0.79
AVG.    0.76        0.76     0.76

In this first attempt, the gerund features give per-class F1-scores between 0.68 and 0.82. Compared with the Turkish studies I have reviewed, these results are promising: the average F1-score is 0.76, and it comes from gerund frequencies alone.
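
Per-class scores like those in the table can be produced with scikit-learn's classification_report; a minimal sketch, assuming the vectorizer and classifier from the sketch above and a held-out test set (test_texts, y_test are placeholders):

from sklearn.metrics import classification_report

X_test = vectorizer.transform(test_texts)  # test_texts: held-out documents
y_pred = clf.predict(X_test)
print classification_report(y_test, y_pred)  # per-class precision/recall/F1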


Custom vectorizer for scikit-learn

Scikit-learn provides capable text vectorizers, utilities that build feature vectors from text documents, such as CountVectorizer, TfidfVectorizer, and HashingVectorizer. A vectorizer converts a collection of text documents to a matrix of the intended features: CountVectorizer gives a matrix of token counts, HashingVectorizer gives a matrix of token occurrences, and TfidfVectorizer gives a matrix of tf-idf features.
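
As a quick illustration of the stock behavior, here is a toy CountVectorizer example (not from the project):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ran away"]
vec = CountVectorizer()
counts = vec.fit_transform(docs)  # sparse matrix of token counts

print vec.vocabulary_
# output: {'away': 0, 'cat': 1, 'ran': 2, 'sat': 3, 'the': 4}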

I need a custom vectorizer for my project: a punctuation vectorizer that gives a matrix of punctuation counts only, for a given collection of text documents. Implementing it turned out to be simpler than I expected. I inherited my vectorizer class from CountVectorizer and do all the work in the prepare_doc method, so prepare_doc is the key point of the class.


def prepare_doc(self, doc):
    # Punctuation characters to keep (string.punctuation).
    punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
    # Drop escaped line-break sequences so their backslashes are not counted.
    doc = doc.replace("\\r\\n", " ")
    # Remove every non-punctuation character from the document.
    for character in doc:
        if character not in punc_list:
            doc = doc.replace(character, "")
    return doc

The full code of PunctVectorizer is below:


from sklearn.feature_extraction.text import CountVectorizer

class PunctVectorizer(CountVectorizer):

    def __init__(self, **kwargs):
        # Forward any CountVectorizer parameters to the parent class.
        super(PunctVectorizer, self).__init__(**kwargs)

    def prepare_doc(self, doc):
        # Punctuation characters to keep (string.punctuation).
        punc_list = ['!', '"', '#', '$', '%', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']
        # Drop escaped line-break sequences so their backslashes are not counted.
        doc = doc.replace("\\r\\n", " ")
        # Remove every non-punctuation character from the document.
        for character in doc:
            if character not in punc_list:
                doc = doc.replace(character, "")
        return doc

    def build_analyzer(self):
        # The analyzer reduces each document to its punctuation characters;
        # iterating over the returned string yields one character per token.
        preprocess = self.build_preprocessor()
        return lambda doc: preprocess(self.decode(self.prepare_doc(doc)))
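
To see what prepare_doc returns before counting happens, here is a toy input of my own:

v = PunctVectorizer()
print v.prepare_doc(u"Merhaba, dunya! Nasilsin?")
# output: ,!?  (only the punctuation characters survive, in their original order)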

A small example of PunctVectorizer in action is below:


punct_vector = PunctVectorizer()
data_matrix = punct_vector.fit_transform(data)  # data: a collection of text documents

print punct_vector.vocabulary_
output: {u'"': 0, u'$': 1, u"'": 3, u'&': 2, u')': 5, u'(': 4, u'+': 6, u'-': 8, u',': 7, u'/': 10, u'.': 9, u';': 12, u':': 11, u'=': 13, u'?': 14}

print data_matrix.getrow(0)
output: (0, 9)	25
  (0, 8)	12
  (0, 7)	22
  (0, 6)	2
  (0, 5)	1
  (0, 4)	1
  (0, 3)	4
  (0, 0)	6

The last output shows the counts of punctuation characters in the first document of the given collection. For instance, (0, 9) 25 and (0, 8) 12 mean that the first document contains 25 '.' characters, 12 '-' characters, and so on.
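
If the raw sparse output is hard to read, the column indices can be mapped back to characters; a small convenience snippet of my own, not part of the original code:

# Invert the vocabulary so column indices map back to punctuation characters.
inverse = dict((index, char) for char, index in punct_vector.vocabulary_.items())
row = data_matrix.getrow(0).toarray()[0]
for index, count in enumerate(row):
    if count:
        print inverse[index], count  # e.g. ". 25" for the first document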