PERFORMANCE ANALYSIS OF TWO BIG DATA TECHNOLOGIES ON A CLOUD DISTRIBUTED ARCHITECTURE. RESULTS FOR NON-AGGREGATE QUERIES ON MEDIUM-SIZED DATA

Authors

  • Marin FOTACHE
  • Ionuț HRUBARU

DOI:

https://doi.org/10.1515/saeb-2016-0134

Keywords:

Big Data, cloud computing, performance benchmarks, Hadoop, Hive, PostgreSQL, Postgres XL, R

Abstract

Big Data systems manage and process huge volumes of data that are constantly generated by various technologies in a myriad of formats. Big Data advocates (and preachers) have claimed that Big Data technologies such as NoSQL, Hadoop and in-memory data stores perform better than classical relational/SQL database management systems. This paper compares the data processing performance of two systems, one from the SQL camp (PostgreSQL/Postgres XL) and one from the Big Data camp (Hadoop/Hive), on a distributed five-node cluster deployed in the cloud. Departing from the benchmarks in use (YCSB, TPC), a series of R modules was devised to generate random non-aggregate queries on different subschemas (of increasing data size) of the TPC-H database. The overall performance of the two systems was compared first. Subsequently, a number of models were developed to relate performance to the system and to various query parameters, such as the number of attributes in the SELECT and WHERE clauses, the number of joins, and the number of processed rows.

JEL Codes - M15

Published

2017-01-03

How to Cite

FOTACHE, M., & HRUBARU, I. (2017). PERFORMANCE ANALYSIS OF TWO BIG DATA TECHNOLOGIES ON A CLOUD DISTRIBUTED ARCHITECTURE. RESULTS FOR NON-AGGREGATE QUERIES ON MEDIUM-SIZED DATA. Scientific Annals of Economics and Business, 63(SI), 21–50. https://doi.org/10.1515/saeb-2016-0134