A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database

Fahed Yoseph; Markku Heikkilä

doi:10.1109/iCMLDE49015.2019.00023

Research output: Chapter in Book/Conference proceeding › Conference contribution › Scientific › peer-review

3 Citations (Scopus)

69 Downloads (Pure)

Abstract

Finding outliers, rare events in a collection of patterns, has become an emerging issue in the area of machine learning concerned with detecting and eventually removing anomalous objects in data. A key challenge in outlier/anomaly detection is that it is not a well-formulated problem. Outliers are defined as extreme values that deviate from the overall patterns in data; they may indicate experimental errors, variability in measurement, or novelty. Detecting outliers in large databases can lead to the discovery of hidden knowledge. In addition, identifying and removing outliers often helps to ensure that the observations represent the problem correctly. Although there are several techniques for detecting outliers/anomalies in a given database, no single technique has proven to be the standard universal choice. Depending on the nature of the target application, different implementations require different outlier detection methods. Clustering is a powerful method in the field of machine learning that defines outliers in terms of their distance to the cluster centers. In this study, we propose a clustering-based approach to identifying outliers in a retail point-of-sales dataset. To select the best clustering algorithm for the purpose, two algorithms are applied: K-means for hard (crisp) clustering and Fuzzy C-means (FCM) for soft clustering. The experimental results show that the K-means algorithm outperforms the FCM algorithm in terms of outlier detection efficiency and that it is an effective outlier detection solution.
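
The abstract describes clustering-based detection as scoring each record by its distance to the cluster centers. The following is a minimal Python sketch of that general idea, not the authors' implementation. The toy (quantity, amount) records, the choice of the first k points as initial centers, and the cutoff of mean plus two standard deviations are all assumptions made for illustration; this page does not state the paper's features, number of clusters, or threshold.

```python
import math
import statistics


def assign(points, centers):
    """Return, for each point, the index of its nearest center (Euclidean)."""
    return [
        min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
        for p in points
    ]


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's K-means on a list of equal-length numeric tuples."""
    # Assumption for reproducibility: seed the centers with the first k points.
    centers = [tuple(float(v) for v in p) for p in points[:k]]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # New center is the coordinate-wise mean of the cluster members.
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                # Keep the center of an empty cluster where it was.
                new_centers.append(centers[j])
        if new_centers == centers:
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return centers, assign(points, centers)


def flag_outliers(points, centers, labels, n_std=2.0):
    """Flag points whose distance to their own center exceeds the cutoff.

    The cutoff (mean + n_std * population std of all distances) is an
    illustrative assumption, not a rule taken from the paper.
    """
    dists = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    cutoff = statistics.mean(dists) + n_std * statistics.pstdev(dists)
    return [p for p, d in zip(points, dists) if d > cutoff]


# Hypothetical (quantity, amount) sales records: two groups plus one extreme.
POINTS = [
    (1, 10), (10, 100), (2, 11), (1, 12), (2, 9), (1, 11),
    (11, 102), (9, 98), (10, 101), (11, 99), (10, 400),
]

centers, labels = kmeans(POINTS, k=2)
print(flag_outliers(POINTS, centers, labels))
```

Note that the extreme record still pulls its cluster's center toward itself before it is flagged, which is one reason the threshold and the number of clusters need care in practice. A soft-clustering variant such as FCM would replace the hard assignment with per-cluster membership degrees.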

Original language	English
Title of host publication	2019 International Conference on Machine Learning and Data Engineering (iCMLDE)
Editors	Phill Kyu Rhee, Kuo-Yuan Hwa, Tun-Wen Pai, Daniel Howard, Md Rezaul Bashar
Publisher	IEEE
Pages	65–71
ISBN (Print)	978-1-7281-0404-1
DOIs	https://doi.org/10.1109/iCMLDE49015.2019.00023
Publication status	Published - 2019
MoE publication type	A4 Article in a conference publication
Event	International Conference on Machine Learning and Data Engineering (iCMLDE) - 2019 International Conference on Machine Learning and Data Engineering (iCMLDE). Duration: 2 Dec 2019 → 4 Dec 2019

Conference

Conference	International Conference on Machine Learning and Data Engineering (iCMLDE)
Period	02/12/19 → 04/12/19

Keywords

Clustering
Noise
Outlier detection
Point-of-sales analysis

Access to Document

10.1109/iCMLDE49015.2019.00023

A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database.pdf
Accepted author manuscript, 897 KB
Licence: Publisher rights policy

http://urn.fi/URN:NBN:fi-fe2020100883280

Cite this

@inproceedings{8da0e13ab0214f5894a2cdc40441ccb9,

title = "A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database",

abstract = "Finding outliers, rare events in a collection of patterns, has become an emerging issue in the area of machine learning concerned with detecting and eventually removing anomalous objects in data. A key challenge in outlier/anomaly detection is that it is not a well-formulated problem. Outliers are defined as extreme values that deviate from the overall patterns in data; they may indicate experimental errors, variability in measurement, or novelty. Detecting outliers in large databases can lead to the discovery of hidden knowledge. In addition, identifying and removing outliers often helps to ensure that the observations represent the problem correctly. Although there are several techniques for detecting outliers/anomalies in a given database, no single technique has proven to be the standard universal choice. Depending on the nature of the target application, different implementations require different outlier detection methods. Clustering is a powerful method in the field of machine learning that defines outliers in terms of their distance to the cluster centers. In this study, we propose a clustering-based approach to identifying outliers in a retail point-of-sales dataset. To select the best clustering algorithm for the purpose, two algorithms are applied: K-means for hard (crisp) clustering and Fuzzy C-means (FCM) for soft clustering. The experimental results show that the K-means algorithm outperforms the FCM algorithm in terms of outlier detection efficiency and that it is an effective outlier detection solution.",

keywords = "Clustering, Noise, Outlier detection, Point-of-sales analysis",

author = "Fahed Yoseph and Markku Heikkil{\"a}",

note = "Full text requested 3.6.2020, embargo period 24 months/EH; International Conference on Machine Learning and Data Engineering (iCMLDE); Conference date: 02-12-2019 Through 04-12-2019",

year = "2019",

doi = "10.1109/iCMLDE49015.2019.00023",

language = "English",

isbn = "978-1-7281-0404-1",

pages = "65--71",

editor = "{Kyu Rhee}, Phill and Kuo-Yuan Hwa and Tun-Wen Pai and Daniel Howard and {Rezaul Bashar}, Md",

booktitle = "2019 International Conference on Machine Learning and Data Engineering (iCMLDE)",

publisher = "IEEE",

}

Yoseph, F & Heikkilä, M 2019, A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database. in P Kyu Rhee, K-Y Hwa, T-W Pai, D Howard & M Rezaul Bashar (eds), 2019 International Conference on Machine Learning and Data Engineering (iCMLDE). IEEE, pp. 65–71, International Conference on Machine Learning and Data Engineering (iCMLDE), 02/12/19. https://doi.org/10.1109/iCMLDE49015.2019.00023

A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database. / Yoseph, Fahed; Heikkilä, Markku.
2019 International Conference on Machine Learning and Data Engineering (iCMLDE). ed. / Phill Kyu Rhee; Kuo-Yuan Hwa; Tun-Wen Pai; Daniel Howard; Md Rezaul Bashar. IEEE, 2019. p. 65–71.

TY - GEN

T1 - A Clustering Approach for Outliers Detection in a Big Point-of-Sales Database

AU - Yoseph, Fahed

AU - Heikkilä, Markku

N1 - Full text requested 3.6.2020, embargo period 24 months/EH

PY - 2019

Y1 - 2019

AB - Finding outliers, rare events in a collection of patterns, has become an emerging issue in the area of machine learning concerned with detecting and eventually removing anomalous objects in data. A key challenge in outlier/anomaly detection is that it is not a well-formulated problem. Outliers are defined as extreme values that deviate from the overall patterns in data; they may indicate experimental errors, variability in measurement, or novelty. Detecting outliers in large databases can lead to the discovery of hidden knowledge. In addition, identifying and removing outliers often helps to ensure that the observations represent the problem correctly. Although there are several techniques for detecting outliers/anomalies in a given database, no single technique has proven to be the standard universal choice. Depending on the nature of the target application, different implementations require different outlier detection methods. Clustering is a powerful method in the field of machine learning that defines outliers in terms of their distance to the cluster centers. In this study, we propose a clustering-based approach to identifying outliers in a retail point-of-sales dataset. To select the best clustering algorithm for the purpose, two algorithms are applied: K-means for hard (crisp) clustering and Fuzzy C-means (FCM) for soft clustering. The experimental results show that the K-means algorithm outperforms the FCM algorithm in terms of outlier detection efficiency and that it is an effective outlier detection solution.

KW - Clustering

KW - Noise

KW - Outlier detection

KW - Point-of-sales analysis

U2 - 10.1109/iCMLDE49015.2019.00023

DO - 10.1109/iCMLDE49015.2019.00023

M3 - Conference contribution

SN - 978-1-7281-0404-1

SP - 65

EP - 71

BT - 2019 International Conference on Machine Learning and Data Engineering (iCMLDE)

A2 - Kyu Rhee, Phill

A2 - Hwa, Kuo-Yuan

A2 - Pai, Tun-Wen

A2 - Howard, Daniel

A2 - Rezaul Bashar, Md

PB - IEEE

T2 - International Conference on Machine Learning and Data Engineering (iCMLDE)

Y2 - 2 December 2019 through 4 December 2019

ER -
