Email classification with limited labeled data
Engström, Peter (2020)
Åbo Akademi
This publication is protected by copyright. The work may be read and printed for personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi-fe2020110989759
Abstract
In this thesis I evaluate different ways of classifying email messages in the absence of a large number of pre-classified training messages. When using machine learning to classify incoming emails, this lack of pre-labeled emails is problematic. Labels can only be acquired by going through each email and annotating it manually, which for a large number of emails can take a very long time.
There are existing studies on machine learning for email classification, but the amount of labeled data available for such studies is generally large. However, there have been studies of non-email text classification with smaller amounts of labeled data, namely in the field of semi-supervised learning.
I experiment with supervised and semi-supervised (specifically, self-training) algorithms. The supervised methods used are random forest, naïve Bayes, support-vector machine, and AdaBoost. The experiments are conducted using 1000 labeled training emails. The data is classified into three classes: class “DB” represents the group that processes emails involving database stored procedures; class “CIM”, emails involving the Customer Interaction Management system; and class “Integration”, emails involving data integration.
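As an illustration of the self-training idea described above (not code from the thesis), the following sketch uses scikit-learn's `SelfTrainingClassifier` wrapped around a support-vector machine; synthetic feature vectors stand in for the actual email features, and the 10% labeling rate and 0.8 confidence threshold are illustrative assumptions, not values from the experiments:

```python
# Illustrative self-training sketch; synthetic data stands in for emails.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

# Synthetic stand-in for email feature vectors: 1000 samples, 3 classes.
X, y = make_classification(n_samples=1000, n_classes=3, n_informative=6,
                           random_state=0)

# Hide most labels (-1 marks "unlabeled"), mimicking scarce labeled email.
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) > 0.1] = -1   # keep roughly 10% of labels

# The base learner must expose predict_proba, so SVC needs probability=True.
# Self-training repeatedly fits the SVM, then pseudo-labels the unlabeled
# samples whose predicted-class probability exceeds the threshold.
base = SVC(probability=True, random_state=0)
model = SelfTrainingClassifier(base, threshold=0.8)
model.fit(X, y_partial)

n_labeled = int(np.sum(y_partial != -1))
accuracy = model.score(X, y)
print(f"fit on {n_labeled} labeled examples, accuracy: {accuracy:.2f}")
```

The key design point is the confidence threshold: a high value adds only pseudo-labels the current model is sure about, which limits the error-reinforcement risk inherent to self-training.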
The results show that self-training offers no significant advantage over the supervised methods for this task. Of the supervised methods, the support-vector machine achieves the best performance for each of the three classes.