Assignment – 02: Network Performance Modelling and Analysis CS-5615
Critical evaluation of a performance study
Performance Modeling of Big Data
This paper was written by Mr. Krushnaraj Kamtekar under the guidance of Prof. Raj Jain, both of Washington University in St. Louis, USA. It was submitted as a project report for the Computer Systems Analysis course in Spring 2015.
In the early days of computing, data were stored permanently on secondary storage devices, directly in a set of files managed by a file management system. Such a system mostly allowed access to one file at a time, the data in these flat files had no relationships with one another, and separate applications were needed to manage the data in these files. Organizing data this way overcame the problems of the manual system, but file-based data processing had problems of its own: data redundancy, lack of security, limited data sharing, time-consuming data analysis, the need for complex programs to obtain the required results, difficult administration, and limited functionality.
Database management systems (DBMS) were introduced to overcome the problems of file management systems. A DBMS provides a way to create, retrieve, update and manage data, and acts as an interface between the user and the database. The database itself is still stored as a set of files, but the stored data are accessed through the DBMS using Structured Query Language (SQL). A DBMS provides improved data sharing, improved data access and security, better data integration, and data consistency.
A recent statistic shows that 90% of the data in the world today has been created in the last two years. The world is becoming more connected through an ever-increasing number of electronic devices, social media, Internet-based business and media platforms, and services. Big data comes in a wide variety of forms: structured, unstructured and semi-structured. In earlier days, structured data such as spreadsheets and databases were the only data sources considered by most applications. Nowadays, due to the evolution of technology, data is generated in many formats such as emails, photos, videos, monitoring-device output, PDFs and audio.
Traditional database management applications cannot deal with such datasets because of their vast volume, rapid growth and variety; datasets of this kind are called big data. The technologies and tools needed to deal with them are called big data technologies.
The three main characteristics of big data are usually expressed as V’s: (i) Volume: the size of the data, which is usually larger than that of traditional datasets. (ii) Velocity: the speed at which information is generated in the system from multiple sources, mostly with the expectation that it is processed in real time. (iii) Variety: the wide range of data sources being processed (structured, unstructured and semi-structured), each of varying quality.
Beyond these three, various individuals and organizations have suggested further big data characteristics in the form of V’s. Veracity: biases, noise and abnormality in the data, caused by the variety of sources and their complexity. Variability: variation in the data, which leads to wide variation in quality. Value: the ultimate challenge of big data is to deliver value. Visualisation: the ability of the information to initiate decision making. Variety is also used in a second sense, to describe how balanced the information is across different platforms and languages.
Currently, many big data technologies are available to deal with the characteristics mentioned above. Most big data tools support distributed processing; some are based on stream processing and others on batch processing. Big data technologies are an important and popular topic in current trends. A proper performance study of big data can give developers, researchers and other technical people a better understanding of big data technologies and tools.
The performance study discusses the importance of big data in the current age and explains four main characteristics of big data: Volume, Variety, Velocity and Value. It discusses traditional database systems, their limitations, and why a traditional database management system is not capable of dealing with high-volume, complex (structured, unstructured and semi-structured) and rapidly changing datasets. It further explains that big data contains valuable patterns and hidden knowledge, which are used to identify business trends, support decision making, detect intruders, reduce production and maintenance costs, and predict consumers’ expectations.
It explains the steps involved in processing a big dataset: data collection, data storage, data cleaning, analysis, pattern recognition, and knowledge discovery and insight finding. It also describes the problems faced while extracting valuable patterns and hidden knowledge from big data, and system-level limitations and concerns such as speed, storage, synchronization, concurrency and optimization.
It describes how modern (cloud) architectures and open-source technologies and tools make big data processing easier, and how recent advances in hardware, memory and networking have facilitated it. The study analyses the performance modeling of several big data techniques and tools used for data processing.
It discusses Apache Hadoop and how it is used to process large datasets in a distributed manner, and how the MapReduce programming paradigm helps to process big data on parallel, distributed clusters. It further explains how the Map function splits a job into several parallel tasks, each allocated to a node that works on a subset of the data, and how the Reduce function processes the grouped data in parallel and combines the results into the output.
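The map and reduce roles described above can be illustrated with a small word-count sketch. This is a single-process illustration only, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and the chunks stand in for the data subsets that a real cluster would assign to separate nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: turn one chunk of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle: group all values by key across every mapped chunk."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk is mapped independently; on a cluster these calls
    # would run in parallel on different nodes.
    mapped = [map_phase(chunk) for chunk in chunks]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling word_count(["big data big", "data tools"]) counts "big" twice, "data" twice and "tools" once. The explicit shuffle step is included because it is the stage that groups the map output by key before the reduce step runs.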
Overall, the paper identifies some of the problems associated with big data processing systems and tries to model them in order to find better solutions. Its main approach is to identify bottlenecks in big data technologies and tools so that data processing performance can be improved significantly. Such performance modeling can help to improve the performance of big data systems through proper system design and tuning, efficient application design, and improved scalability.
Evaluation of the performance study / literature review discussed
The paper discusses some techniques commonly used in performance modeling. The performance of a big data application may suffer if there are defects in the underlying architecture, so the paper proposes models to evaluate the behavior of application performance under different situations. Performance and efficiency issues mostly arise from bottlenecks in system resources. The most important step in performance modeling is therefore to identify the bottlenecks created by the underlying architecture of the big data system and to propose the most suitable solutions to improve its performance.
The paper presents the following five performance models for improving the performance of big data processing.
MapReduce tasks: a numerical model for concurrent MapReduce tasks based on Map setup time and early shuffle characteristics. The improvement is achieved by tuning Apache Hadoop configuration parameters.
Data transfers: a performance model for the GridFTP protocol that provides faster data transfer together with a fairness factor, so that other data transfer activities are treated fairly.
In-memory datasets: swizzling pointers, which remove the overhead of the traditional buffer pool.
Optimizing data in the cloud: Cumulon, which simplifies the development and deployment of matrix-based statistical analysis programs.
Interpretation of big data in the cloud: HiTune, a highly scalable and lightweight performance analyzer.
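The idea behind swizzling pointers in the list above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation evaluated in the paper: the Page and BufferPool classes, the follow_child helper and the dictionary used as a stand-in for disk are all hypothetical. The point is that the first access resolves a page identifier through the buffer pool table, after which the reference is replaced by a direct in-memory pointer and no further table lookup is needed.

```python
class Page:
    def __init__(self, page_id, payload, child_ref):
        self.page_id = page_id
        self.payload = payload
        # child_ref holds either a page id (unswizzled)
        # or a Page object (swizzled).
        self.child_ref = child_ref


class BufferPool:
    def __init__(self, disk):
        self.disk = disk    # hypothetical disk: page_id -> (payload, child_id)
        self.frames = {}    # page_id -> Page currently held in memory
        self.lookups = 0    # number of table lookups performed

    def fetch(self, page_id):
        """Resolve a page id through the buffer pool table."""
        self.lookups += 1
        if page_id not in self.frames:
            payload, child_id = self.disk[page_id]
            self.frames[page_id] = Page(page_id, payload, child_id)
        return self.frames[page_id]


def follow_child(pool, page):
    """Return the child page, swizzling the reference on first use."""
    ref = page.child_ref
    if isinstance(ref, Page):
        return ref              # swizzled: direct pointer, no table lookup
    child = pool.fetch(ref)     # unswizzled: go through the buffer pool
    page.child_ref = child      # swizzle: replace the id with the pointer
    return child
```

With a two-page "disk" such as {1: ("root", 2), 2: ("leaf", None)}, fetching page 1 and following its child twice costs two table lookups in total: the second traversal uses the swizzled pointer directly. A real system must also unswizzle pointers when a page is evicted, which this sketch omits.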
For this performance modeling, the authors used execution time, data transfer rate, protocol throughput, query throughput, speed and resource utilization as performance metrics. The system parameters were dataset size, system capacity, hardware type, system and application (Hadoop) configuration, buffer pooling and traffic rate. In the performance evaluation stages, Word Count, K-Means and Sort were used as the system workloads.
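To make two of these metrics concrete, the sketch below derives a speedup ratio from two execution times and a throughput from a transfer size and its duration. The function names are hypothetical, and any numbers passed to them are illustrative inputs rather than measurements from the paper.

```python
def speedup(baseline_seconds, tuned_seconds):
    """Ratio of baseline execution time to tuned execution time.

    A value above 1.0 means the tuned configuration ran faster.
    """
    if baseline_seconds <= 0 or tuned_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / tuned_seconds


def throughput_mb_per_s(bytes_transferred, seconds):
    """Average transfer rate in megabytes (10**6 bytes) per second."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    return bytes_transferred / 1_000_000 / seconds
```

For example, speedup(120.0, 80.0) returns 1.5, and throughput_mb_per_s(500_000_000, 10) returns 50.0. Reporting such ratios alongside raw execution times makes comparisons such as Hadoop versus tuned Hadoop easier to read across workloads of different sizes.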
Completeness of the material covered
The paper covers the importance of data in the modern world and how it facilitates decision making, the problems of traditional database management systems and why a DBMS is inadequate for big data, and modern big data technologies and tools and how they are used to process big datasets. Most importantly, it explains the performance issues and bottlenecks in the underlying architectures of big data systems and proposes techniques to improve performance.
Appropriateness and completeness of graphs/figures
Figure 1 [1] explains the basic flow of MapReduce graphically. It clearly shows how the map function maps jobs onto nodes, each with a subset of the data, and how the reduce function processes and merges the data from each node and produces the final output. However, it does not show the detailed stages of a MapReduce task such as splitting, shuffling and sorting.
Figure 2 [1] gives a simple explanation of how a user request is served by the system. However, the figure does not include metrics or measurement details for the three possible outcomes: speed, time, success rate and resource usage when the request is served correctly; reliability, probability and error rate when it is served incorrectly; and availability, downtime and failure rate when the service fails to be performed.
Figure 3 [1] has a legend, and both the X-axis and Y-axis are informatively labeled and properly scaled. It compares the execution time of Hadoop and tuned Hadoop for the Word Count, K-Means and Sort workloads.
Figure 4 [1] also has a legend, and both the X-axis and Y-axis are informatively labeled and properly scaled. It compares the protocol throughput of GridFTP and background TCP in a distributed data processing environment.
Figure 5 [1] also has a legend, and both the X-axis and Y-axis are informatively labeled and properly scaled. It compares the performance of main memory and swizzling pointers in terms of throughput and working dataset.
Other relevant issues.
The performance study focuses only on Apache Hadoop and MapReduce: it identifies performance issues and bottlenecks in that system and proposes several ways to improve performance. However, many other technologies are now available as open-source tools and can perform far better than Hadoop, for example Spark, Flink, Samza and Tez.
Only Word Count, K-Means and Sort were used as workloads for the evaluation, although other standard workloads are available in well-known benchmark suites.
For the evaluation, the authors did not use any standard dataset for evaluating the performance of big data technologies.
Hadoop configuration is used as one of the parameters of this performance study, but the best configuration mostly depends on the architecture of the hardware. A set of configuration values that works well on some hardware may not give the promised performance on other hardware.
What could have been done differently.
The authors could have used a standard, publicly available dataset for the performance study of big data technologies and tools. They could also have compared performance with other well-known big data technologies such as Spark, Flink and Samza.
They could have used standard big data benchmark tools to evaluate the performance of the Hadoop and tuned Hadoop systems. For example, the HiBench benchmarks [2] provide workloads such as TeraSort, enhanced DFSIO, Nutch Indexing, PageRank, Scan, Join, Aggregate and Bayesian classification.
[1] Kamtekar, K., Jain, R.: Performance Modeling of Big Data. Paper submitted for the performance analysis of computer systems course, Spring 2015.
[2] Samadi, Y., Zbakh, M., Tadonki, C.: Comparative study between Hadoop and Spark based on HiBench benchmarks. 2016.