Evaluation of Big Data Platforms for Industrial Process Data

School of Science | Master's thesis
Date
2017-08-28
Major/Subject
Cloud Computing and Services
Mcode
SCI3081
Degree programme
Master's Programme in ICT Innovation
Language
en
Pages
53+6
Abstract
As the number of IoT devices and the amount of human activity on the Internet have grown rapidly in recent years, the volume of generated data has grown exponentially. Frameworks and software such as Cassandra, Hive, and Spark have therefore been developed to store and explore this massive amount of data. Big Data has also reached industrial businesses: as the number of sensors installed in machines and mills increases significantly, these devices generate log data at higher frequencies, and enormously complex calculations are applied to it. This thesis evaluates how effectively current Big Data frameworks and tools handle industrial Big Data, especially process data. After surveying several techniques and candidate frameworks and tools, the thesis focuses on building a prototype data pipeline that must satisfy a set of use cases. The pipeline comprises several components, including Spark, Impala, and Sqoop; it uses Parquet as the file format and stores the Parquet files in S3. Several experiments were conducted in AWS to validate the requirements in the use cases. The test workload was around 690 GB of Parquet files, covering one million channels divided into one thousand groups, with a data sampling rate of one data point per second. The results of the experiments show that current Big Data frameworks may fulfill the performance requirements and provide the features called for in the use cases and in industrial businesses in general.
Supervisor
Heljanko, Keijo
Thesis advisor
Juhola, Olli
Keywords
big data, Hadoop, Spark SQL, performance