Predictive modelling for low-quality data
Ustilkina, Mariia (2020)
All rights reserved. This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:amk-202005118366
Abstract
The main objective of this paper is to examine solutions that may help in dealing with noisy and insufficient data in predictive modelling, and to apply these findings in a regression analysis of the project dataset in order to improve performance.
The project is implemented in Python with the aid of the Scikit-Learn machine learning library. Various outlier detection techniques are reviewed and tested, as well as dimensionality reduction and oversampling with SMOTE adapted for regression problems. Predictive analysis is performed with ensemble models, specifically Random Forest and Gradient Boosting Machine.
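As a rough illustration of the workflow described above, the following minimal sketch filters suspected outliers with Local Outlier Factor and Elliptic Envelope and then cross-validates the two ensemble regressors on the remaining samples. The synthetic dataset, the 5% contamination rate, the injected noise, and all variable names are assumptions made for the example; they are not taken from the project.

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import LocalOutlierFactor

# Synthetic stand-in for the project dataset (assumption: the real data is not used here).
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Corrupt a few rows so that the detectors have something to find.
rng = np.random.default_rng(0)
noisy_idx = rng.choice(len(X), size=15, replace=False)
X[noisy_idx] += rng.normal(0.0, 8.0, size=(15, X.shape[1]))

# Both detectors label inliers as 1 and outliers as -1.
detectors = {
    "lof": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "elliptic": EllipticEnvelope(contamination=0.05, random_state=0),
}
models = {
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

kept = {}     # number of samples retained by each detector
results = {}  # mean cross-validated R^2 per (detector, model) pair
for det_name, detector in detectors.items():
    inlier_mask = detector.fit_predict(X) == 1
    X_clean, y_clean = X[inlier_mask], y[inlier_mask]
    kept[det_name] = int(inlier_mask.sum())
    for model_name, model in models.items():
        scores = cross_val_score(model, X_clean, y_clean, cv=5, scoring="r2")
        results[(det_name, model_name)] = float(scores.mean())

for key, score in results.items():
    print(key, round(score, 3))
```

Filtering before cross-validation keeps the sketch short; in a stricter evaluation the detector would be fitted on the training folds only.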
Both models demonstrate good results under conditions of data noise and scarcity and avoid overfitting; however, their performance does not differ significantly. Outlier detection techniques, especially Local Outlier Factor and Elliptic Envelope, as well as SMOTE oversampling for margin values, prove beneficial for the chosen task.
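The idea of oversampling margin values for regression can be sketched as follows. This is a simplified, SMOTE-like interpolation and not necessarily the variant used in the project: the function name `oversample_margins`, the 10%/90% quantile thresholds, and the linear interpolation of the target are all assumptions made for illustration.

```python
import numpy as np

def oversample_margins(X, y, low_q=0.1, high_q=0.9, k=5, n_new=50, seed=0):
    """Create synthetic samples in the low and high tails of the target.

    Hypothetical helper: each synthetic point lies on the segment between a
    tail sample and one of its k nearest neighbours from the same tail, and
    the target is interpolated with the same factor.
    """
    rng = np.random.default_rng(seed)
    lo, hi = np.quantile(y, [low_q, high_q])
    tails = [np.flatnonzero(y <= lo), np.flatnonzero(y >= hi)]
    X_new, y_new = [], []
    for tail in tails:
        if len(tail) < 2:
            continue  # not enough samples to interpolate between
        for _ in range(n_new // 2):
            i = rng.choice(tail)
            others = tail[tail != i]
            dist = np.linalg.norm(X[others] - X[i], axis=1)
            neighbours = others[np.argsort(dist)[:k]]
            j = rng.choice(neighbours)
            t = rng.random()  # interpolation factor in [0, 1)
            X_new.append(X[i] + t * (X[j] - X[i]))
            y_new.append(y[i] + t * (y[j] - y[i]))
    if not X_new:
        return X.copy(), y.copy()
    return np.vstack([X, np.array(X_new)]), np.concatenate([y, np.array(y_new)])

# Small demonstration on synthetic data (assumption: not the project dataset).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
X_aug, y_aug = oversample_margins(X_demo, y_demo, n_new=40)
print(X_demo.shape, "->", X_aug.shape)
```

Interpolating only within a single tail keeps every synthetic target inside the margin it was generated for, so the augmented set gains density exactly where the original data is sparse.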