Clustering system based on text mining using the K-means algorithm : news headlines clustering
Lama, Prabin (2013)
Lama, Prabin
Turun ammattikorkeakoulu
2013
All rights reserved
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-2013123122159
Abstract
The growing scope of the web and the large volume of electronic data accumulating on it have prompted efforts to extract hidden information from its text content.
News articles published on news portals across the web are one source of such information, and they are also good subjects for text mining research. Finding the same news story across multiple news portals by hand is tedious and time-consuming. Clustering similar news headlines and presenting them on a single platform, with links to the corresponding news portal sites, is an efficient alternative.
This thesis presents a model that analyzes news headlines from different news portals, applies document preprocessing techniques and creates clusters of similar news headlines.
Data available on the web are structured, semi-structured or unstructured. Web pages are usually semi-structured because of the presence of HTML tags. The XML representation of semi-structured data facilitates the clustering of similar documents using distance-based clustering techniques.
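As an illustration of how semi-structured headline data might look in XML, the following is a minimal sketch using Python's standard library. The element name `headline`, the attributes `portal` and `link`, and the sample entries are assumptions made for this example and are not taken from the thesis.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout: the element and attribute names below are
# assumptions for illustration, not the schema used in the thesis.
SAMPLE_XML = """
<headlines>
  <headline portal="portal-a" link="https://example.com/a/1">Government announces new budget</headline>
  <headline portal="portal-b" link="https://example.org/b/7">New budget announced by government</headline>
</headlines>
"""


def load_headlines(xml_text):
    """Return a list of (title, portal, link) tuples parsed from XML text."""
    root = ET.fromstring(xml_text.strip())
    return [
        (node.text.strip(), node.get("portal"), node.get("link"))
        for node in root.iter("headline")
    ]


headlines = load_headlines(SAMPLE_XML)
for title, portal, link in headlines:
    print(f"{portal}: {title} ({link})")
```

Keeping the link next to each headline means a cluster of headlines can later be displayed together with the pages they came from.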
News headlines from different news portals are extracted and stored in an XML file. The XML file is then preprocessed using techniques such as tokenization, stop word removal, lemmatization and synonym expansion. The selected news headlines are then represented using the vector space model and a term-frequency weighting scheme. Finally, the K-means clustering algorithm is applied to find similarities among the news headlines and create clusters of similar news headlines. A sample webpage is used to display the clusters of the news headlines with their corresponding links.
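The pipeline described above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the thesis implementation: the stop word list, the sample headlines, the choice of k = 2 and the use of Euclidean distance are all assumptions, and lemmatization and synonym expansion are left out for brevity.

```python
import math
import random
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the thesis's list).
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "and", "by", "is"}


def tokenize(headline):
    """Lowercase the headline, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", headline.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tf_vectors(headlines):
    """Build raw term-frequency vectors over a shared, sorted vocabulary."""
    docs = [Counter(tokenize(h)) for h in headlines]
    vocab = sorted(set().union(*docs))
    return [[doc[term] for term in vocab] for doc in docs], vocab


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors, k, iterations=20, seed=0):
    """Basic K-means: assign points to the nearest centroid, then recompute."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [
            min(range(k), key=lambda c: euclidean(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical sample headlines, invented for this example.
sample_headlines = [
    "Government announces new budget",
    "New budget announced by government",
    "Local team wins football final",
    "Football final won by local team",
]

vectors, vocab = tf_vectors(sample_headlines)
labels = kmeans(vectors, k=2)
for label, headline in zip(labels, sample_headlines):
    print(label, headline)
```

Because no lemmatization is applied here, "announces" and "announced" count as different terms. The two budget headlines still share enough words to end up in the same cluster, but on real data the lemmatization and synonym expansion steps would matter more.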