PERFORMANCE ANALYSIS OF TWO BIG DATA TECHNOLOGIES ON A CLOUD DISTRIBUTED ARCHITECTURE. RESULTS FOR NON-AGGREGATE QUERIES ON MEDIUM-SIZED DATA
DOI:
https://doi.org/10.1515/saeb-2016-0134
Keywords:
Big Data, cloud computing, performance benchmarks, Hadoop, Hive, PostgreSQL, Postgres XL, R
Abstract
Big Data systems manage and process huge volumes of data constantly generated by various technologies in a myriad of formats. Big Data advocates (and preachers) have claimed that Big Data technologies such as NoSQL, Hadoop and in-memory data stores perform better than classical relational/SQL database management systems. This paper compares the data processing performance of two systems, one from the SQL camp (PostgreSQL/Postgres XL) and one from the Big Data camp (Hadoop/Hive), on a distributed five-node cluster deployed in the cloud. Unlike the benchmarks in use (YCSB, TPC), this study relies on a series of R modules devised to generate random non-aggregate queries on different subschemas (of increasing data size) of the TPC-H database. The overall performance of the two systems was compared. Subsequently, a number of models were developed to relate performance to the system and to various query parameters, such as the number of attributes in the SELECT and WHERE clauses, the number of joins and the number of rows processed.
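To make the idea of randomly generated non-aggregate queries more concrete, the following is a minimal sketch only. It is not the authors' R code: it is written in Python for illustration, and the reduced three-table slice of TPC-H, the fixed join path, the `random_query` function and the simple `column > constant` predicates are all assumptions introduced here. It shows how the number of joins and the number of SELECT and WHERE attributes could be varied and recorded as predictors for a later performance model.

```python
import random

# Assumed, heavily reduced slice of the TPC-H schema: table -> some of its columns.
SCHEMA = {
    "customer": ["c_custkey", "c_name", "c_nationkey", "c_acctbal"],
    "orders": ["o_orderkey", "o_custkey", "o_totalprice", "o_orderdate"],
    "lineitem": ["l_orderkey", "l_quantity", "l_extendedprice", "l_shipdate"],
}

# Assumed join path customer -> orders -> lineitem, as (left, right, condition).
JOIN_PATH = [
    ("customer", "orders", "c_custkey = o_custkey"),
    ("orders", "lineitem", "o_orderkey = l_orderkey"),
]

# Columns that can safely take a numeric comparison in the WHERE clause.
NUMERIC = {"c_acctbal", "o_totalprice", "l_quantity", "l_extendedprice"}


def random_query(rng, n_joins, n_select, n_where):
    """Build one random non-aggregate SELECT over the first n_joins + 1 tables."""
    if not 0 <= n_joins <= len(JOIN_PATH):
        raise ValueError("n_joins out of range")
    tables = ["customer"] + [right for _, right, _ in JOIN_PATH[:n_joins]]
    columns = [c for t in tables for c in SCHEMA[t]]
    select_cols = rng.sample(columns, min(n_select, len(columns)))
    numeric_cols = [c for c in columns if c in NUMERIC]
    where_cols = rng.sample(numeric_cols, min(n_where, len(numeric_cols)))

    sql = "SELECT " + ", ".join(select_cols) + " FROM " + tables[0]
    for _, right, cond in JOIN_PATH[:n_joins]:
        sql += f" JOIN {right} ON {cond}"
    predicates = [f"{c} > {rng.randint(1, 1000)}" for c in where_cols]
    if predicates:
        sql += " WHERE " + " AND ".join(predicates)

    # Query parameters kept next to the SQL text, to serve as model predictors.
    params = {
        "joins": n_joins,
        "select_attrs": len(select_cols),
        "where_attrs": len(where_cols),
    }
    return sql, params


rng = random.Random(42)  # fixed seed so a generated workload can be reproduced
sql, params = random_query(rng, n_joins=2, n_select=3, n_where=2)
print(sql)
print(params)
```

In a setup of this kind, each generated statement would be executed on both systems and its running time stored together with `params`, giving one observation per query and system for the subsequent modelling step.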
JEL Codes - M15
License
Copyright (c) 2016 SCIENTIFIC ANNALS OF ECONOMICS AND BUSINESS
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
All accepted papers are published on an Open Access basis.
The Open Access License is based on the Creative Commons license.
The non-commercial use of the article will be governed by the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License as currently displayed on https://creativecommons.org/licenses/by-nc-nd/4.0
Under the Creative Commons Attribution-NonCommercial-NoDerivatives license, the author(s) and users are free to share (copy, distribute and transmit the contribution) under the following conditions:
1. they must attribute the contribution in the manner specified by the author or licensor,
2. they may not use this contribution for commercial purposes,
3. they may not alter, transform, or build upon this work.