
Beeswax: A Multi-Query optimization tool for Big Data

Academic year: 2023


Presented By

Dr. Mohamed Khafagy

(2)

Egyptian Big Data Research Group

Beeswax Group

(3)

What is Big Data?

A massive volume of both structured and unstructured data that is too large to process with traditional database and software techniques

(4)

Types of Data

• Relational Data (Tables/Transaction/Legacy Data)

• Text Data (Web)

• Semi-structured Data (XML)

• Graph Data

  • Social Network, Semantic Web (RDF), …

• Streaming Data

(5)

How much data?

• Google processes 20 PB a day (2008)

• Wayback Machine has 3 PB + 100 TB/month (3/2009)

• Facebook has 2.5 PB of user data + 15 TB/day (4/2009)

• eBay has 6.5 PB of user data + 50 TB/day (5/2009)

• CERN’s Large Hadron Collider (LHC) generates 15 PB a year

(6)

What to do with these data?

• Aggregation and Statistics

• Data warehouse and OLAP

• Indexing, Searching, and Querying

• Keyword based search

• Pattern matching (XML/RDF)

• Knowledge discovery

• Data Mining

• Statistical Modeling

• Machine Learning

(7)
(8)

HUGE MARKET OPPORTUNITY

Big Data Analytics is a game changer in every industry and is a huge market opportunity

Many Applications, All Industries

eCommerce Services
• Faceted Search
• Web analytics
• SEO analytics
• Online Advertising

Social Networks
• Ad serving
• Profiling
• Targeting

Finance
• Trend analysis
• Fraud detection
• Automatic trading
• Risk analysis

Telco
• Customer attrition prevention
• Network monitoring
• Targeting
• Prepaid account management

Energy, Oil and Gas
• Smart metering
• Smart grids
• Wind parks
• Mining
• Solar panels

Many More
• M2M
• Sensors
• Genetics
• Intelligence
• Weather
• Education
• Production
• Mining

(9)

Example of Big Data Projects

• Climate

  • Predicting the amount of rainfall in the Nile basin countries

• Bank

  • Fraud detection and money laundering

• Transportation

  • Predicting traffic jams

• Education

  • Helping students choose an appropriate specialization

• Social Network / Opinion mining

(10)

Big Data Features

(11)

Time Cost

Money

Big Data Analysis

(12)
(13)
(14)

Beeswax and HIVE

(15)

Why Beeswax

1 • Support running advanced SQL queries

2 • Enhance the performance of JOIN queries

3 • Reuse intermediate results in multi-session

4 • Support data sharing for

  • Multi-User

  • Multi-Query

  • Multi-Session

(16)

Beeswax Evaluation

• Star Schema Benchmark (SSB)

• Based on the TPC-H benchmark to measure performance

• Billion tuples / ~2 TB

• Tables: Customer, Supplier, Date, Part, Line Order

(17)

Beeswax Modules

• HOME, iHOME: Reuse Intermediate Results in Multi-session

• JOMR, JOUM: Enhance the Performance of JOIN Queries

• MOTH: Exploit Data Sharing

• QRMapper: Support Advanced Query

(18)

Beeswax Architecture

[Figure: Beeswax architecture. Components shown: Input Query(s), SQL Query Parser, ReWrite Query, Execute Optimized Query(s), Query(s) Results; the modules JOUM, JOMR, QRMapper, MOTH, HOME, and iHOME; and the Beeswax Metadata and Index.]

(19)


(20)

Beeswax Modules: First Module

HiveQL Optimization in Multi-Session Environment (HOME)

(21)

HOME Main Goals

• Reuse intermediate results in multi-session by storing previous results

[Figure: HOME system. HOME stores previous results.]

(22)

HOME Cases

1 • Exists Query

2 • Subset of Results

3 • ORDER BY Clause

4 • GROUP BY Clause

5 • HAVING Clause

6 • WHERE Clause

7 • JOIN Clause

(23)

HOME System Architecture

[Figure: HOME system architecture. The Query Parser and the JOIN Query Extractor take the input HiveQL and extract table names, column names, WHERE conditions, JOIN conditions, GROUP BY columns, HAVING conditions, and ORDER BY columns. The HOME Query Optimizer checks whether the query is in the HOME Metafile. If yes, the HOME Cases Indicator selects the indicated case (Case 1 to Case 8: existing query, subset of columns, WHERE clauses with an existing condition, a sub-condition, or a new condition, GROUP BY clauses, HAVING clauses, ORDER BY clauses), the Query Rewriter rewrites the query, and the optimized query is executed on previous results held in the HOME Repository storage. If no, the query is saved in the HOME Metafile, the input HiveQL is executed, and its results are stored. Output: results.]
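The reuse path for an already-seen query (Case 1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the HOME implementation: `ResultStore`, `normalize`, the JSON metafile layout, and `fake_hive` are hypothetical names, and a temporary directory stands in for storage shared between sessions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def normalize(query: str) -> str:
    """Collapse whitespace and drop a trailing semicolon (a simplification:
    whitespace inside string literals is collapsed too)."""
    return " ".join(query.strip().rstrip(";").split())


class ResultStore:
    """Hypothetical stand-in for the HOME metafile and repository storage."""

    def __init__(self, directory: Path):
        self.directory = directory
        self.metafile = directory / "metafile.json"
        if not self.metafile.exists():
            self.metafile.write_text("{}")

    def run(self, query, execute):
        """Return (rows, reused). `execute` is any callable that runs the query."""
        meta = json.loads(self.metafile.read_text())
        key = normalize(query)
        if key in meta:  # query is in the metafile: reuse the stored results
            rows = json.loads((self.directory / meta[key]).read_text())
            return rows, True
        rows = execute(query)  # not seen before: run it and store the results
        name = hashlib.sha1(key.encode("utf-8")).hexdigest() + ".json"
        (self.directory / name).write_text(json.dumps(rows))
        meta[key] = name
        self.metafile.write_text(json.dumps(meta))
        return rows, False


# Two "sessions" share one directory; fake_hive stands in for the real engine.
engine_calls = []


def fake_hive(query):
    engine_calls.append(query)
    return [["AFRICA", 3], ["ASIA", 5]]


with tempfile.TemporaryDirectory() as tmp:
    query = "SELECT c_region, COUNT(*) FROM customer GROUP BY c_region"
    first = ResultStore(Path(tmp)).run(query, fake_hive)
    second = ResultStore(Path(tmp)).run(query + " ;", fake_hive)
```

The second session finds the query in the metafile and never calls the engine. The other cases (subset of columns, added conditions, and so on) would need a rewriter that runs a cheaper query over the stored rows, which this sketch leaves out.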

(24)

HOME Evaluation

Case Number   Case Name                            Improvement wrt Hive
1             Exists Query                         100%
2             Subset of Results                    32%
3             ORDER BY Clause                      70%
4             GROUP BY Clause                      89%
5             HAVING Clause                        89%
6             WHERE Clause, Add Condition          38%
6             WHERE Clause, Same Condition         36%
6             WHERE Clause, Different Condition    27%
7             JOIN Clause, Add Condition           81%
7             JOIN Clause, Same Condition          89%
7             JOIN Clause, Different Condition     88%

(25)


(26)

Beeswax Modules: Second Module

Indexing HiveQL Optimization for JOIN over Multi-Session Environment (iHOME)

(27)

HOME Drawback

• Wasted storage space

• Data repetition

• iHOME overcomes these drawbacks by storing an index of previous results instead of storing the results again

[Figure: iHOME system. HOME stores previous results; iHOME stores an index of previous results.]
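The difference between storing results and storing an index of results can be illustrated with a small in-memory sketch. The slides do not describe how iHOME builds its index, so everything here is an assumption: `IndexedResults` is a hypothetical class that remembers only the positions of matching rows in the base table, and the `lo_*` column names are illustrative.

```python
class IndexedResults:
    """Hypothetical index store: remembers row positions, not row copies."""

    def __init__(self, table):
        self.table = table  # base table, stored once
        self.index = {}     # query key -> positions of matching rows
        self.scans = 0      # number of full scans performed so far

    def select(self, key, predicate):
        if key not in self.index:
            # First time this query is seen: scan once and keep positions only.
            self.scans += 1
            self.index[key] = [
                i for i, row in enumerate(self.table) if predicate(row)
            ]
        # Later requests read the matching rows through the index.
        return [self.table[i] for i in self.index[key]]


lineorder = [
    {"lo_orderkey": 1, "lo_quantity": 10},
    {"lo_orderkey": 2, "lo_quantity": 40},
    {"lo_orderkey": 3, "lo_quantity": 35},
]
store = IndexedResults(lineorder)
big = store.select("lo_quantity > 30", lambda r: r["lo_quantity"] > 30)
again = store.select("lo_quantity > 30", lambda r: r["lo_quantity"] > 30)
```

Only two integers are kept for the repeated query rather than two copied rows, which is the storage saving the slide refers to; the cost is one extra lookup into the base table per row.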

(28)

iHOME Evaluation

Case Number   Case Name                            Improvement wrt Hive
1             Exists Query                         89%
2             Subset of Columns                    87%
3             ORDER BY Clause                      86%
4             GROUP BY Clause                      88%
5             HAVING Clause                        80%
7             JOIN Clause, Add Condition           81%
7             JOIN Clause, Same Condition          82%
7             JOIN Clause, Different Condition     88%

(29)

Beeswax Modules

Third Module: QRMapper

(30)

Why QRMapper (cont.)

Existing SQL-on-Hadoop translators:

• Do not support advanced/complex SQL queries

  • UNION, MINUS, INTERSECT, ….

• Are inefficient compared to hand-optimized MapReduce programs

  • Auto-generated jobs for queries

  • Long jobs: execution time overheads

  • Unnecessary jobs: waste cluster resources

  • Non-optimized jobs: affect HPC performance

(31)

QRMapper Translator Objectives

1 • Improve performance of HiveQL

2 • Support compatibility of Hive for advanced queries

(32)

QRMapper Translator Cases

• Case 1: UNION Query

• Case 2: MINUS Query

• Case 3: INTERSECT Query

• Case 4: Sub Query

  • HAVING clause

  • WHERE clauses

    • EXISTS, NOT EXISTS

    • ALL, ANY

(33)

QRMapper Translator High-level Design

[Figure: complex SQL queries pass through the Parser and the QRMapper Translator, which produces simple queries; query results are returned for each input query.]

(34)

QRMapper Evaluation

SQL Clauses | QRMapper | Existing SQL-on-Hadoop Translators (Hive, YSmart, S2mart, QMapper)

SELECT, LOAD, INSERT from query

Expressions in WHERE and HAVING √ √ √ √

GROUP BY, ORDER BY, SORT BY

Sub-queries in FROM clause X X X

Sub-queries in WHERE clause (IN, NOT IN) X X X X

(EXISTS)/(NOT EXISTS) X X X

Sub-queries in HAVING clause X X X X

Intra-query relationships X X X

GROUP BY, ORDER BY X X X

UNION X X

LEFT, RIGHT and FULL INNER/OUTER JOIN X
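One common way to express set operations in a dialect that lacks them is to rewrite them as joins. The sketch below shows that general idea for INTERSECT and MINUS; it is not taken from QRMapper, whose translation rules the slides do not spell out. The function names and table names are hypothetical, and the generated SQL uses `LEFT SEMI JOIN` and `LEFT OUTER JOIN`, both of which HiveQL supports.

```python
def _parts(columns):
    """Build the projected column list and the equality join condition."""
    cols = ", ".join(f"l.{c}" for c in columns)
    cond = " AND ".join(f"l.{c} = r.{c}" for c in columns)
    return cols, cond


def rewrite_intersect(left, right, columns):
    """left INTERSECT right, expressed as a semi join on all columns."""
    cols, cond = _parts(columns)
    return (
        f"SELECT DISTINCT {cols} FROM {left} l "
        f"LEFT SEMI JOIN {right} r ON ({cond})"
    )


def rewrite_minus(left, right, columns):
    """left MINUS right, expressed as an outer join keeping unmatched rows."""
    cols, cond = _parts(columns)
    return (
        f"SELECT DISTINCT {cols} FROM {left} l "
        f"LEFT OUTER JOIN {right} r ON ({cond}) "
        f"WHERE r.{columns[0]} IS NULL"
    )


intersect_sql = rewrite_intersect("customer_2022", "customer_2023", ["c_custkey"])
minus_sql = rewrite_minus("customer_2022", "customer_2023", ["c_custkey", "c_nation"])
```

A caveat on this rewrite: standard set operators treat NULLs as equal, while the join conditions above do not, so rows containing NULL in the compared columns would be handled differently.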

(35)


(36)

Beeswax Modules: Fourth Module

JOIN Order in Map-Reduce (JOMR)

(37)

JOMR Main Goals

1 • Optimize multi-JOIN queries in Map-Reduce jobs

2 • Change the JOIN order of tables in a multi-JOIN query

  • Save intermediate result costs

(38)

JOMR Case Study (cont..)

Return to Search Strategy

• Find the best execution plan

[Figure: join graph with nodes N, O, L, and C.]

(39)

JOMR Case Study (cont..)

Query Rewriting

• Original query join order: customer, orders, lineitem, nation

• Optimized query join order: customer, nation, orders, lineitem
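To see why reordering can pay off, the following sketch scores every left-deep join order by the total size of its intermediate results and picks the cheapest. The search strategy and cost model used by JOMR are not given in the slides, so both are assumptions here: the table sizes and selectivities are illustrative round numbers loosely modeled on TPC-H, and exhaustive search is only practical for a handful of tables.

```python
from fractions import Fraction
from itertools import permutations

# Illustrative row counts (assumed, not measured).
SIZES = {
    "customer": 150_000,
    "orders": 1_500_000,
    "lineitem": 6_000_000,
    "nation": 25,
}

# Assumed selectivity of each join predicate; unlisted pairs are cross products.
SELECTIVITY = {
    frozenset({"customer", "orders"}): Fraction(1, 150_000),
    frozenset({"orders", "lineitem"}): Fraction(1, 1_500_000),
    frozenset({"customer", "nation"}): Fraction(1, 25),
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes for a left-deep plan."""
    joined = [order[0]]
    size = Fraction(SIZES[order[0]])
    total = Fraction(0)
    for table in order[1:]:
        size *= SIZES[table]
        for other in joined:
            size *= SELECTIVITY.get(frozenset({table, other}), 1)
        joined.append(table)
        total += size  # every intermediate result is written and read again
    return total


def best_order(tables):
    """Exhaustive search; the first cheapest permutation wins ties."""
    return min(permutations(tables), key=plan_cost)


original = ("customer", "orders", "lineitem", "nation")
optimized = best_order(original)
```

Under these assumed statistics, joining the small `nation` table early keeps the first intermediate result small, and the search arrives at the same order as the optimized query on the slide.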

(40)

JOMR Evaluation

[Figure: average runtime (sec) versus number of tables (2 to 6), comparing the original query (Org.) with JOMR.]

(41)


(42)

Beeswax Modules: Fifth Module

JOIN Once Use Many (JOUM)

(43)

JOUM Main Goals

1 • Generate an optimized pipeline for Hive JOIN query execution

2 • Optimize Hive temporary storage size

(44)

JOUM JOIN Pipeline

[Figure: JOUM JOIN pipeline in two variants. Materialized table: the Materialized Table Builder joins tables L and O using MAP and Reduce stages, the materialized table is loaded into Hive, and a Filling-In UDF produces the query result with no shuffle. Indexed materialized table: the Materialized Indexed Table Builder adds an indexing step, the indexed materialized table is loaded into Hive, and a Filling-In UDF produces the query result with no shuffle.]
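The "join once, use many" idea can be sketched with plain Python data structures: the join is computed a single time, optionally indexed, and several queries then read the joined rows without joining again. This is an in-memory illustration under assumptions, not the JOUM pipeline; the Filling-In UDF and the no-shuffle execution shown in the figure are not modeled, and the function and column names are hypothetical.

```python
from collections import defaultdict


def build_materialized_join(lineitems, orders):
    """Join lineitem rows to their order once and keep the joined rows."""
    orders_by_key = {o["orderkey"]: o for o in orders}
    return [
        {**orders_by_key[item["orderkey"]], **item}
        for item in lineitems
        if item["orderkey"] in orders_by_key
    ]


def build_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for pos, row in enumerate(rows):
        index[row[column]].append(pos)
    return index


orders = [
    {"orderkey": 1, "custkey": 10, "status": "O"},
    {"orderkey": 2, "custkey": 11, "status": "F"},
]
lineitems = [
    {"orderkey": 1, "price": 100},
    {"orderkey": 1, "price": 50},
    {"orderkey": 2, "price": 70},
    {"orderkey": 3, "price": 5},  # no matching order, dropped by the join
]

joined = build_materialized_join(lineitems, orders)  # the join happens once
by_customer = build_index(joined, "custkey")

# Query A: revenue per order status, read straight from the joined rows.
revenue = defaultdict(int)
for row in joined:
    revenue[row["status"]] += row["price"]

# Query B: prices for one customer, found through the index.
prices_for_11 = [joined[i]["price"] for i in by_customer[11]]
```

Both queries reuse the same joined rows, so the join cost is paid once however many queries follow.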

(45)

JOUM Experimental Results

Evaluation of Temporary Storage

[Figure: JOUM temporary storage required for running TPC-H Query #1, by number of records (in millions). Redundant: 34.1%. Joined Indexed: 16.4%.]

(46)

JOUM Experimental Results

Evaluation of Execution Time

[Figure: JOUM execution time for TPC-H Query #1, by number of records. Redundant: 58%. Joined Indexed: 71%.]

(47)


(48)

Beeswax Modules: Sixth Module

Multi-Query Optimizer using Tuple Size and Histogram (MOTH)

(49)

Big Data Multi-Query Optimization I/O

[Figure: Big Data multi-query optimization. Input: a multi-query workload (Q1 to Q7). Processing: exploiting sharing through materialization and grouping techniques, with non-concurrent and concurrent execution. Output: an optimized multi-query plan.]
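A very small example of exploiting sharing is a shared scan: queries that read the same table are grouped so that the table is scanned once for the whole group. The sketch below shows only this grouping idea under assumptions; the tuple-size and histogram statistics that give MOTH its name are not modeled, and the query format (a dict with `name`, `table`, and `where`) is hypothetical.

```python
from collections import defaultdict


def group_by_input(queries):
    """Group queries by the table they read."""
    groups = defaultdict(list)
    for query in queries:
        groups[query["table"]].append(query)
    return groups


def run_shared(queries, tables):
    """Scan each table once and evaluate every query of its group per row."""
    results = {query["name"]: [] for query in queries}
    scans = 0
    for table, group in group_by_input(queries).items():
        scans += 1  # one pass over the table serves the whole group
        for row in tables[table]:
            for query in group:
                if query["where"](row):
                    results[query["name"]].append(row)
    return results, scans


tables = {
    "lineorder": [{"qty": 5}, {"qty": 50}],
    "customer": [{"region": "ASIA"}, {"region": "AFRICA"}],
}
queries = [
    {"name": "Q1", "table": "lineorder", "where": lambda r: r["qty"] > 10},
    {"name": "Q2", "table": "lineorder", "where": lambda r: r["qty"] < 10},
    {"name": "Q3", "table": "customer", "where": lambda r: r["region"] == "ASIA"},
]
results, scans = run_shared(queries, tables)
```

Three queries are answered with two scans instead of three; with larger workloads over a few big tables the saving grows accordingly.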

(50)

Fine-grained vs. coarse-grained

[Figure: regular grain size versus irregular grain size.]

(51)

Reuse-based techniques in Big Data multi-query optimization

(52)

High-level abstraction of the MOTH system

(53)

Beeswax Modules

• HOME, iHOME: Reuse Intermediate Results in Multi-session

• JOMR, JOUM: Enhance the Performance of JOIN Queries

• MOTH: Exploit Data Sharing

• QRMapper: Support Advanced Query

(54)

Beeswax Extensibility

• The complete Beeswax tool can be extended to other Big Data analysis systems

• Platform-independent tool

• Deployed on the cloud

• Big Data Analysis-as-a-Service (BDAaS)

Time Cost

Flink

(55)

Thank You
