Machine Learning Method Comparison in Agricultural Data Analysis
Nevavuori, Petteri (2017)
Management and Information Technology (Pori)
Faculty of Business and Built Environment
This publication is copyrighted. You may download, display and print it for your own personal use. Commercial use is prohibited.
Date of acceptance
2017-06-07
The permanent address of the publication is
https://urn.fi/URN:NBN:fi:tty-201705261546
Abstract
The aim of this master's thesis was to compare machine learning methods in clustering and regression tasks using data collected from Finnish dairy farms by Mtech Digital Solutions Oy. Clustering techniques find similarities between the items of a dataset by examining the data itself. Regression techniques, in turn, are used to build predictive models for the dataset. A common theme across all machine learning methods is that they apply statistical and mathematical algorithms to data that is too complex to handle manually.
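As a rough illustration of the distinction between the two task types, the sketch below groups a handful of made-up values with a tiny one-dimensional k-means and fits a straight line with closed-form least squares. The function names, the toy numbers and the deterministic initialisation are assumptions made for this example only; they are not taken from the thesis or its data.

```python
import statistics


def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means; returns the cluster centers in ascending order.

    Hypothetical helper for illustration, not the thesis implementation.
    """
    ordered = sorted(values)
    # Assumed deterministic start: pick k evenly spread points of the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            # Assign each value to its nearest current center.
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [statistics.fmean(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


def ols_fit(xs, ys):
    """Closed-form simple linear regression; returns (slope, intercept)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Clustering: no target variable, only structure in the data itself.
toy_values = [20.0, 21.0, 22.0, 40.0, 41.0, 42.0]
print(kmeans_1d(toy_values, k=2))

# Regression: learn a mapping from an input to a known target.
print(ols_fit([1, 2, 3, 4], [3, 5, 7, 9]))
```

Clustering receives only the values and returns group centers, whereas regression needs input and target pairs and returns a model that can be used for prediction.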
The data has been collected over a period spanning tens of years and has been used by agricultural experts to provide on-site insights and counselling to farmers across Finland. Recent advances in machine learning have, however, sparked the interest of the thesis's commissioning employer in data-driven modelling and information acquisition practices that yield standardized and invariant conclusions about the health and progression of farms. Two datasets were formed: one for the clustering task and one for the regression task. The clustering dataset contained information about dairy farms' production and cattle-related health treatment records. The regression dataset encompassed all the metrics about farms as businesses.
In total, eight machine learning methods were compared: four clustering methods and four regression methods. The clustering methods were Hierarchical Clustering, k-Means, Self-Organizing Maps and BIRCH, and the regression methods were Ordinary Least Squares, Decision Tree Regression, Multilayer Perceptron and XGBoost. For clustering, the conclusion was that k-Means performed best, although the methods performed roughly equally, with BIRCH being the only exception. For regression, the conclusion was that XGBoost delivered the best results by performing well score-wise and providing the needed information about the most important features.
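The abstract does not say which score was used to rank the regression methods. As a minimal sketch of what a score-wise comparison can look like, the example below assumes the coefficient of determination (R²) and ranks two hypothetical sets of predictions against the same targets. The names `model_a` and `model_b` and all numbers are invented for illustration.

```python
import statistics


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = statistics.fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def rank_models(y_true, predictions):
    """Return (name, score) pairs sorted from best to worst R².

    `predictions` maps a model name to its list of predicted values.
    """
    scores = {name: r2_score(y_true, pred) for name, pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up held-out targets and predictions from two hypothetical models.
targets = [1.0, 2.0, 3.0, 4.0]
candidate_predictions = {
    "model_a": [1.0, 2.0, 3.0, 4.0],  # perfect predictions
    "model_b": [2.5, 2.5, 2.5, 2.5],  # always predicts the mean
}
for name, score in rank_models(targets, candidate_predictions):
    print(name, score)
```

A ranking like this covers only the score-wise half of the comparison. The abstract also weighs whether a method exposes the most important features, which a single score does not capture.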