Beeswax: A Multi-Query optimization tool for Big Data
Presented By
Dr. Mohamed Khafagy
Egyptian Big Data Research Group
Beeswax Group
What is Big Data?
A massive volume of both structured
and unstructured data that is too large
to process with traditional database
and software techniques
Type of Data
• Relational Data (Tables/Transaction/Legacy Data)
• Text Data (Web)
• Semi-structured Data (XML)
• Graph Data
• Social Network, Semantic Web (RDF), …
• Streaming Data
How much data?
• Google processes 20 PB a day (2008)
• Wayback Machine has 3 PB + 100 TB/month (3/2009)
• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
• eBay has 6.5 PB of user data + 50 TB/day (5/2009)
• CERN’s Large Hadron Collider (LHC) generates 15 PB a year
What to do with these data?
• Aggregation and Statistics
• Data warehouse and OLAP
• Indexing, Searching, and Querying
• Keyword based search
• Pattern matching (XML/RDF)
• Knowledge discovery
• Data Mining
• Statistical Modeling
• Machine Learning
HUGE MARKET OPPORTUNITY
Big Data Analytics is a game changer in every industry and is a huge market opportunity
eCommerce Services
Faceted Search
Web analytics
SEO analytics
Online Advertising
Ad serving
Profiling
Targeting
Social Networks
Trend analysis
Fraud detection
Automatic trading
Risk analysis
Finance
Customer attrition prevention
Network monitoring
Targeting
Prepaid account mgmt
Telco
Smart metering
Smart grids
Wind parks
Mining
Solar Panels
Energy
Oil and Gas
Many More
M2M
Sensors
Genetics
Intelligence
Weather
Many Applications
All Industries
Education
Production
Mining
Example of Big Data Projects
• Climate
• Predict the amount of rainfall in the Nile basin countries
• Bank
• Fraud detection and Money laundering
• Transportation
• Predicting Traffic Jam
• Education
• Help the student to choose the appropriate specialization
• Social Network / Opinion mining
Big Data Features
Time Cost
Money
Big Data Analysis
Beeswax and HIVE
Why Beeswax
1 • Support running advanced SQL query
2 • Enhance the performance of JOIN queries
3 • Reuse Intermediate Results in Multi-session
4 • Support Data Sharing for
• Multi-User
• Multi-Query
• Multi-Session
Beeswax Evaluation
Star Schema Benchmark (SSB), based on the TPC-H benchmark, to measure performance
Tables: Customer, Supplier, Date, Part, Line Order
Billion tuples / ~2 TB
Beeswax Modules
• HOME
• iHOME
• QRMapper
• MOTH
• JOMR
• JOUM
Reuse Intermediate Results in Multi-session
Enhance the Performance of JOIN Queries
Exploit Data Sharing
Support Advanced Query
Beeswax Architecture
[Architecture diagram. Components: Input Query(s); SQL Query Parser; ReWrite Query; Execute Optimized Query(s); Query(s) Results; modules JOUM, JOMR, QRMapper, MOTH, HOME, iHOME; Beeswax Metadata and Index]
Beeswax Modules
iHOME
HOME
QR
Mapper MOTH
JOMR JOUM
First Module
HiveQL Optimization in Multi-Session Environment (HOME)
HOME Main Goals
•Reuse intermediate results in multi-session by
storing previous results
HOME System
HOME
Previous Results
HOME Cases
1 • Exists Query
2 • Subset of Results
3 • ORDER BY Clause
4 • GROUP BY Clause
5 • HAVING Clause
6 • WHERE Clause
7 • JOIN Clause
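The cases above amount to comparing the clauses of an incoming query with those of a previously stored one. A minimal sketch of such a case indicator follows. The `QueryParts` structure, the `indicate_case` function, and the rule that only a single clause may differ are assumptions made for illustration; the slides do not show how Beeswax implements this step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryParts:
    """Clauses extracted from a HiveQL query (hypothetical structure)."""
    tables: frozenset
    columns: frozenset
    where: frozenset = frozenset()
    join: frozenset = frozenset()
    group_by: tuple = ()
    having: frozenset = frozenset()
    order_by: tuple = ()


# Case numbers follow the HOME Cases list.
CLAUSE_CASES = {"order_by": 3, "group_by": 4, "having": 5, "where": 6, "join": 7}


def indicate_case(new, stored):
    """Return the HOME case number for `new` against `stored`, or None."""
    if new.tables != stored.tables:
        return None  # different tables: nothing to reuse in this sketch
    if new == stored:
        return 1  # Exists Query
    fields = ("columns",) + tuple(CLAUSE_CASES)
    differing = [f for f in fields if getattr(new, f) != getattr(stored, f)]
    if len(differing) != 1:
        return None  # assumption: only single-clause differences are handled
    field = differing[0]
    if field == "columns":
        # Subset of Results: the new query asks for fewer columns.
        return 2 if new.columns < stored.columns else None
    return CLAUSE_CASES[field]


stored = QueryParts(frozenset({"lineorder"}),
                    frozenset({"lo_orderkey", "lo_revenue"}))
narrower = QueryParts(frozenset({"lineorder"}), frozenset({"lo_revenue"}))
print(indicate_case(narrower, stored))  # case 2, Subset of Results
```

A real indicator would also have to decide whether the stored results are sufficient to answer the new query, which this sketch does not attempt.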
[Diagram: Query Parser and JOIN Query Extractor. Extracted parts: Table Names, Column Names, WHERE Conditions, JOIN Conditions, GROUP BY Columns, HAVING Conditions, ORDER BY Columns]
[Diagram: HOME Cases Indicator. Checks: Existed Query; Subset of Columns; WHERE Clauses (Existed-Condition, Sub-Condition, New-Condition); GROUP BY Clauses; HAVING Clauses; ORDER BY Clauses. Outcomes: Case 1 to Case 8]
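HOME System Architecture
[Flow diagram: the HOME Query Optimizer checks whether the input HiveQL query is in the HOME Metafile. Yes: the case is indicated, the Query Rewriter rewrites the query, and the optimized query is executed on previous results held in the HOME Repository storage. No: the input HiveQL is executed, its results are stored, and the query is saved in the HOME Metafile. Both paths end with the output results.]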
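The lookup-then-execute flow of the HOME architecture can be sketched as a small cache around query execution. This is only an illustration under stated assumptions: `HomeOptimizer`, its in-memory `_metafile` dictionary, and the `execute` callable standing in for Hive are hypothetical names, and only the "Exists Query" case is covered (no rewriting for the other cases).

```python
class HomeOptimizer:
    """Toy stand-in for the HOME flow: reuse stored results when possible."""

    def __init__(self, execute):
        self._execute = execute  # callable that runs a query (stand-in for Hive)
        self._metafile = {}      # normalized query text -> stored results

    @staticmethod
    def _normalize(query):
        # Simplification: lowercasing also changes string literals, which a
        # real system must not do.
        return " ".join(query.lower().split()).rstrip("; ")

    def run(self, query):
        """Return (results, reused) for the given query text."""
        key = self._normalize(query)
        if key in self._metafile:
            # Query found in the metafile: answer from previous results.
            return self._metafile[key], True
        # Not found: execute the input query, then store query and results.
        results = self._execute(query)
        self._metafile[key] = results
        return results, False


executed = []


def fake_hive(query):
    # Records each execution so reuse can be observed.
    executed.append(query)
    return [("row",)]


optimizer = HomeOptimizer(fake_hive)
optimizer.run("SELECT * FROM lineorder;")
_, reused = optimizer.run("select *  from LINEORDER")
print(reused, len(executed))
```

Keying on normalized text keeps the sketch short; matching on parsed clauses would be needed to recognize the other HOME cases.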
HOME Evaluation
Case Number   Case Name                           Improvement w.r.t. Hive
1             Exists Query                        100%
2             Subset of Results                   32%
3             ORDER BY Clause                     70%
4             GROUP BY Clause                     89%
5             HAVING Clause                       89%
6             WHERE Clause, Add Condition         38%
6             WHERE Clause, Same Condition        36%
6             WHERE Clause, Different Condition   27%
7             JOIN Clause, Add Condition          81%
7             JOIN Clause, Same Condition         89%
7             JOIN Clause, Different Condition    88%
Second Module
Indexing HiveQL Optimization for JOIN over Multi-Session Environment (iHOME)
HOME Drawback
Wasted storage space
Data repetition
• iHOME overcomes this drawback by storing an index of previous results instead of storing the results again
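The idea of keeping an index of previous results rather than a second copy of the rows can be sketched as follows. The slides do not describe the index structure, so this assumes the simplest possible one, a list of row positions in the base table; `build_index` and `fetch` are hypothetical helper names.

```python
def build_index(table, predicate):
    """Record the positions of matching rows instead of copying the rows."""
    return [pos for pos, row in enumerate(table) if predicate(row)]


def fetch(table, index):
    """Rebuild a previous result from the base table and its stored index."""
    return [table[pos] for pos in index]


lineorder = [
    {"lo_orderkey": 1, "lo_revenue": 100},
    {"lo_orderkey": 2, "lo_revenue": 900},
    {"lo_orderkey": 3, "lo_revenue": 950},
]

# Only the positions are stored, so the result rows are not duplicated.
index = build_index(lineorder, lambda row: row["lo_revenue"] > 500)
print(index)
print(fetch(lineorder, index))
```

The trade-off is an extra lookup into the base data when a stored result is reused, in exchange for not holding the rows twice.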
iHOME System
iHOME: Index of Previous Results
HOME: Previous Results
iHOME Evaluation
Case Number   Case Name                           Improvement w.r.t. Hive
1             Exists Query                        89%
2             Subset of Columns                   87%
3             ORDER BY Clause                     86%
4             GROUP BY Clause                     88%
5             HAVING Clause                       80%
7             JOIN Clause, Add Condition          81%
7             JOIN Clause, Same Condition         82%
7             JOIN Clause, Different Condition    88%
Third Module
Why QRMapper (cont.)
Existing SQL-on-Hadoop translators
Do not support advanced/complex SQL queries
UNION, MINUS, INTERSECT, …
Are inefficient compared to hand-optimized MapReduce programs
Auto-generated jobs for queries
Long jobs: execution time overheads
Unnecessary jobs: waste cluster resources
Non-optimized jobs: affect HPC performance
QRMapper Translator Objectives
1 • Improve performance of HiveQL
2 • Make Hive compatible with advanced queries
QRMapper Translator Cases
Case 1: UNION Query
Case 2: MINUS Query
Case 3: INTERSECT Query
Case 4: Sub Query
HAVING clause
WHERE clauses
EXISTS – NOT EXISTS
ALL – ANY
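To illustrate what translating a set operation into a simpler query can look like, the sketch below rewrites INTERSECT and MINUS into join-based HiveQL text. It assumes a target Hive version without native INTERSECT or MINUS; the function names and the table and column names are hypothetical, and these are textbook rewrites rather than the rules QRMapper actually applies. Note that join-based rewrites treat NULL values differently from true set operations.

```python
def _join_condition(columns):
    # Equality on every selected column, e.g. "a.id = b.id AND a.x = b.x".
    return " AND ".join(f"a.{c} = b.{c}" for c in columns)


def _select_list(columns):
    return ", ".join(f"a.{c}" for c in columns)


def rewrite_intersect(left, right, columns):
    """Rows of `left` that also appear in `right`, via LEFT SEMI JOIN."""
    return (f"SELECT DISTINCT {_select_list(columns)} "
            f"FROM {left} a LEFT SEMI JOIN {right} b "
            f"ON ({_join_condition(columns)})")


def rewrite_minus(left, right, columns):
    """Rows of `left` with no match in `right`, via an outer-join anti-join."""
    return (f"SELECT DISTINCT {_select_list(columns)} "
            f"FROM {left} a LEFT OUTER JOIN {right} b "
            f"ON ({_join_condition(columns)}) "
            f"WHERE b.{columns[0]} IS NULL")


print(rewrite_intersect("t1", "t2", ["id"]))
print(rewrite_minus("t1", "t2", ["id"]))
```

Generating query text keeps the example self-contained; a translator that emits MapReduce jobs directly would work on a parsed plan instead.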
QRMapper Translator High-level Design
[Diagram components: Complex SQL Query; Parser; QRMapper Translator; Simple Query; Query Results]
QRMapper Evaluation
QRMapper compared with existing SQL-on-Hadoop translators (Hive, YSmart, S2mart, QMapper)
SQL Clauses                                QRMapper  Hive  YSmart  S2mart  QMapper
SELECT, LOAD, INSERT from query            √         √     √       √       √
Expressions in WHERE and HAVING            √         √     √       √       √
GROUP BY, ORDER BY, SORT BY                √         √     √       √       √
Sub-queries in FROM clause                 √         X     X       X       √
Sub-queries in WHERE clause (IN, NOT IN)   √         X     X       X       X
(EXISTS)/(NOT EXISTS)                      √         X     X       X       √
Sub-queries in HAVING clause               √         X     X       X       X
Intra Query Relationships                  X         X     √       √       X
GROUP BY, ORDER BY                         √         √     X       X       X
UNION                                      √         X     √       √       X
LEFT, RIGHT and FULL INNER/OUTER JOIN      √         X     √       √       √
Fourth Module
JOIN Order In Map-Reduce (JOMR)
JOMR Main Goals
1 • Optimize Multi-JOIN query in Map-Reduce Jobs
2 • Change JOIN order of tables in multi-JOIN query
• Reduce the cost of intermediate results
JOMR Case Study (cont.)
• Return to Search Strategy
Find the best execution plan
[Join graph with nodes N, O, L, C]
JOMR Case Study (cont.)
Query Rewriting
Original query JOIN order: customer, orders, lineitem, nation
Optimized query JOIN order: customer, nation, orders, lineitem
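One way to arrive at a reordering like the one above is a greedy search that always joins the smallest table connected to what has already been joined, which tends to keep intermediate results small. The sketch below is a stand-in heuristic, not JOMR's actual search strategy or cost model, which the slides do not spell out. The `greedy_join_order` function is hypothetical, and the row counts are TPC-H scale factor 1 table sizes used purely for illustration.

```python
def greedy_join_order(start, sizes, edges):
    """Order tables by repeatedly adding the smallest joinable neighbour."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    order = [start]
    joined = {start}
    while len(order) < len(sizes):
        # Tables that share a join condition with something already joined.
        candidates = {t for j in joined
                      for t in neighbours.get(j, ()) if t not in joined}
        if not candidates:
            raise ValueError("join graph is not connected")
        # Smallest table first; the name breaks ties deterministically.
        nxt = min(candidates, key=lambda t: (sizes[t], t))
        order.append(nxt)
        joined.add(nxt)
    return order


# Illustrative row counts (TPC-H at scale factor 1).
sizes = {"nation": 25, "customer": 150_000,
         "orders": 1_500_000, "lineitem": 6_001_215}
# Join conditions between the four tables of the case study.
edges = [("customer", "nation"), ("customer", "orders"),
         ("orders", "lineitem")]

print(greedy_join_order("customer", sizes, edges))
```

With these inputs the heuristic yields customer, nation, orders, lineitem, the same order as the optimized query on the slide, although that agreement does not show that JOMR uses this heuristic.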
JOMR Evaluation
[Chart: average runtime (sec) against number of tables (2 to 6), original query (Org.) vs. JOMR]
Fifth Module
JOIN Once Use Many (JOUM)
JOUM Main Goals
1 • Generate optimized pipeline for Hive JOIN query execution
2 • Optimize Hive temporary storage size
JOUM JOIN Pipeline
[Diagram components: L and O inputs; MAP and Reduce tasks; Materialized Table Builder; materialized table; Load into Hive; Filling-In UDF; Query Result; No Shuffle]
JOUM Indexed Materialized Table
[Diagram components: L and O inputs; Indexing; MAP and Reduce tasks; Materialized Indexed Table Builder; indexed materialized table; Load into Hive; Filling-In UDF; Query Result; No Shuffle]
JOUM Experimental Results
Evaluation of Temporary Storage
JOUM temporary storage required for running TPC-H Query #1, by number of records (in millions)
Redundant: 34.1%
Joined Indexed: 16.4%
JOUM Experimental Results
Evaluation of Execution Time
JOUM execution time for TPC-H Query #1, by number of records
Redundant: 58%
Joined Indexed: 71%
Sixth Module
Multi-Query Optimizer using Tuple Size and Histogram (MOTH)
Big Data Multi-Query Optimization I/O
[Diagram: Input Multi-Query (Q1 to Q7); Materialization and Grouping Techniques; Non-Concurrent and Concurrent Execution; Output: Optimized Multi-Query Plan]
Exploiting Sharing
Fine-grained vs. coarse-grained
Regular grain size Irregular grain size
Reuse-based opportunity techniques in Big Data multi-query optimization
High-level abstraction of the MOTH system