Evaluation of Big Data Platforms for Industrial Process Data
Loading...
Journal Title
Journal ISSN
Volume Title
Perustieteiden korkeakoulu |
Master's thesis
Unless otherwise stated, all rights belong to the author. You may download, display and print this publication for Your own personal use. Commercial use is prohibited.
Author
Date
2017-08-28
Department
Major/Subject
Cloud Computing and Services
Mcode
SCI3081
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
53+6
Series
Abstract
When the number of IoT devices, as well as human activities on the Internet, has increased fast in recent years, data generated has also witnessed an exponential growth in volume. Therefore, various frameworks and software such as Cassandra, Hive, and Spark have been developed to store and explore this massive amount of data. In particular, the waves of Big Data have also reached the industrial businesses. As the number of sensors installed in machines and mills significantly increases, log data is generated from these devices in higher frequencies and enormously complex calculations are applied to this data. The thesis is aimed at evaluating how effectively the current Big Data frameworks and tools manipulate industrial Big Data, especially process data. After surveying several techniques and potential frameworks and tools, the thesis focuses on building a prototype of a data pipeline. The prototype must satisfy a set of use cases. The data pipeline contains several components including Spark, Impala, and Sqoop. Also, it uses Parquet as the file format and stores the Parquet files in S3. Several experiments were also conducted in AWS, to validate the requirements in the use cases. The workload used for these tests was around 690 GBs of Parquet files. This amount of data includes one million channels, divided into one thousand groups, and the data sampling rate was one data point per second. The results of the experiments show that the performance of current big data frameworks may fulfill the performance requirements and the features in the use cases and industrial businesses in general.Description
Supervisor
Heljanko, KeijoThesis advisor
Juhola, OlliKeywords
big data, hadoop, spark SQL, performance