ApacheKylin在大数据系统中应用课件_第1页
ApacheKylin在大数据系统中应用课件_第2页
ApacheKylin在大数据系统中应用课件_第3页
ApacheKylin在大数据系统中应用课件_第4页
ApacheKylin在大数据系统中应用课件_第5页
已阅读5页,还剩31页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

Apache

KylinOLAP

on

HadoopApacheKylinOLAPonHadoop1http://kylin.ioAgenda

What’s

Apache

Kylin?Tech

HighlightsPerformanceRoadmapQ&Ahttp://kylin.ioAgendaWhat’sA2Extreme

OLAP

Engine

for

Big

DataKylin

is

an

open

source

Distributed

Analytics

Engine

from

eBay

thatprovides

SQL

interface

and

multi-dimensional

analysis

(OLAP)

onHadoop

supporting

extremely

large

datasetsWhat’s

Kylinkylin

/

ˈkiːˈlɪn

/

麒麟--n.

(in

Chinese

art)

a

mythical

animal

of

composite

form•

Open

Sourced

on

Oct

1st,

2014•

Be

accepted

as

Apache

Incubator

Project

on

Nov

25th,

2014ExtremeOLAPEngineforBigDa3

Big

Data

Era

More

and

more

data

becoming

available

on

Hadoop

Limitations

in

existing

Business

Intelligence

(BI)

Tools

Limited

support

for

HadoopData

size

growing

exponentiallyHigh

latency

of

interactive

queriesScale-Up

architecture

Challenges

to

adopt

Hadoop

as

interactive

analysis

system

Majority

of

analyst

groups

are

SQL

savvyNo

mature

SQL

interface

on

HadoopOLAP

capability

on

Hadoop

ecosystem

not

ready

yetBigDataEraLimitedsupport45

Why

notBuild

an

engine

from

scratch?5 Whynot5

Extreme

Scale

OLAP

EngineKylin

is

designed

to

query

10+

billions

of

rows

on

Hadoop

ANSI

SQL

Interface

on

HadoopKylin

offers

ANSI

SQL

on

Hadoop

and

supports

most

ANSI

SQL

query

functions

Seamless

Integration

with

BI

ToolsKylin

currently

offers

integration

capability

with

BI

Tools

like

Tableau.

Interactive

Query

CapabilityUsers

can

interact

with

Hive

tables

at

sub-second

latency

MOLAP

CubeDefine

a

data

model

from

Hive

tables

and

pre-build

in

Kylin

Scale

Out

ArchitectureQuery

server

cluster

supports

thousands

concurrent

users

and

provide

high

availabilityFeatures

HighlightsExtremeScaleOLAPEngineKyli6

Compression

and

Encoding

SupportIncremental

Refresh

of

CubesApproximate

Query

Capability

for

distinct

count

(HyperLogLog)Leverage

HBase

Coprocessor

for

query

latencyJob

Management

and

MonitoringEasy

Web

interface

to

manage,

build,

monitor

and

query

cubesSecurity

capability

to

set

ACL

at

Cube/Project

LevelSupport

LDAP

IntegrationFeatures

Highlights…CompressionandEncodingSupp7Cube

DesignerCubeDesigner8Job

ManagementJobManagement9Query

and

VisualizationQueryandVisualization10Tableau

IntegrationTableauIntegration11CaseCubeSizeRawRecordsUserSessionAnalysis26TB28+billionrowsClassifiedTrafficAnalysis21TB20+billionrowsGeoXBehaviorAnalysis560GB1.2+billionrows

eBay

90%

query

<

5

seconds

Baidu

Baidu

Map

internal

analysis

Many

other

Proof

of

Concepts

Bloomberg

Law,

British

GAS,

JD,

Microsoft,

StubHub,

Tableau

…Who

are

using

KylinCaseCubeSizeRawRecordsUserSess12http://kylin.ioAgenda

What’s

Apache

Kylin?Tech

HighlightsPerformanceRoadmapQ&Ahttp://kylin.ioAgendaWhat’sA13OLAPCubeKylin

Architecture

Overview15SQL-Based

Tool

(BI

Tools:

Tableau…)

JDBC/ODBC

SQL

Online

AnalysisData

Flow

Offline

Data

Flow

Clients/Users

interactive

with

Kylin

via

SQL

OLAP

Cube

is

transparent

to

users

Mid

Latency

-

MinutesHadoop

Hive

Star

Schema

DataLow

Latency

-Seconds

Data

Cube

(HBase)

Key

Value

Data3rd

Party

App(Web

App,

Mobile…)

REST

API

SQL

REST

Server

Query

Engine

Routing

MetadataCube

Build

Engine

(MapReduce…)OLAPCubeKylinArchitectureOve14Cube:

…Fact

Table:

…Dimensions:

…Measures:

…Storage(HBase):

…DimDimDimFact

SourceStar

SchemaColumn

FamilyRow

Key

row

A

row

B

row

CColumn

Val

1

Val

2

Val

3

TargetHBase

Storage

MappingCube

MetadataData

Modeling

End

UserCube

ModelerAdminCube:…FactTable:…DimDimDim 15time,

itemtime,

item,

locationtime,

item,

location,

suppliertimeitemlocationsuppliertime,

locationTime,

supplieritem,

locationitem,

supplierlocation,

suppliertime,

item,

suppliertime,

location,

supplieritem,

location,

supplier1-D

cuboids2-D

cuboids3-D

cuboids4-D(base)

cuboid•Base

vs.

aggregate

cells;

ancestor

vs.

descendant

cells;

parent

vs.

child

cells1.2.3.4.5.(9/15,

milk,

Urbana,

Dairy_land)

-

<time,

item,

location,

supplier>(9/15,

milk,

Urbana,

*)

-

<time,

item,

location>(*,

milk,

Urbana,

*)

-

<item,

location>(*,

milk,

Chicago,

*)

-

<item,

location>(*,

milk,

*,

*)

-

<item>••OLAP

Cube

Balance

between

Space

and

Time

Cuboid

=

one

combination

of

dimensions

Cube

=

all

combination

of

dimensions

(all

cuboids)

0-D(apex)

cuboidtime,itemtime,item,location16Cube

Build

Job

FlowCubeBuildJobFlow17How

To

Store

Cube?

HBase

SchemaHowToStoreCube?–HBaseSch18

Dynamic

data

management

framework.Formerly

known

as

Optiq,

Calcite

is

an

Apache

incubator

project,

used

byApache

Drill

and

Apache

Hive,

among

others.How

to

Query

Cube?Query

Engine

CalciteDynamicdatamanagementframe19•••••Metadata

SPI

Provide

table

schema

from

Kylin

metadataOptimize

Rule

Translate

the

logic

operator

into

Kylin

operatorRelational

Operator

Find

right

cube

Translate

SQL

into

storage

engine

API

call

Generate

physical

execute

plan

by

linq4j

java

implementationResult

Enumerator

Translate

storage

engine

result

into

java

implementation

result.SQL

Function

Add

HyperLogLog

for

distinct

count

Implement

date

time

related

functions

(i.e.

Quarter)How

to

Query

Cube?Kylin

Extensions

on

Calcite•MetadataSPIHowtoQueryCube20Query

Engine

Kylin

Explain

PlanSELECT

test_cal_dt.week_beg_dt,test_category.category_name,

test_category.lvl2_name,

test_category.lvl3_name,test_kylin_fact.lstg_format_name,

test_sites.site_name,

SUM(test_kylin_fact.price)

AS

GMV,

COUNT(*)

AS

TRANS_CNTFROM

test_kylin_factLEFT

JOIN

test_cal_dt

ON

test_kylin_fact.cal_dt

=

test_cal_dt.cal_dtLEFT

JOIN

test_category

ON

test_kylin_fact.leaf_categ_id

=

test_category.leaf_categ_id

AND

test_kylin_fact.lstg_site_id

=test_category.site_idLEFT

JOIN

test_sites

ON

test_kylin_fact.lstg_site_id

=

test_sites.site_idWHERE

test_kylin_fact.seller_id

=

123456OR

test_kylin_fact.lstg_format_name

=

’New'GROUP

BY

test_cal_dt.week_beg_dt,

test_category.category_name,

test_category.lvl2_name,

test_category.lvl3_name,test_kylin_fact.lstg_format_name,test_sites.site_nameOLAPToEnumerableConverterOLAPProjectRel(WEEK_BEG_DT=[$0],

category_name=[$1],

CATEG_LVL2_NAME=[$2],CATEG_LVL3_NAME=[$3],LSTG_FORMAT_NAME=[$4],

SITE_NAME=[$5],

GMV=[CASE(=($7,

0),

null,

$6)],

TRANS_CNT=[$8])OLAPAggregateRel(group=[{0,

1,

2,

3,

4,

5}],

agg#0=[$SUM0($6)],

agg#1=[COUNT($6)],

TRANS_CNT=[COUNT()])

OLAPProjectRel(WEEK_BEG_DT=[$13],

category_name=[$21],

CATEG_LVL2_NAME=[$15],

CATEG_LVL3_NAME=[$14],LSTG_FORMAT_NAME=[$5],

SITE_NAME=[$23],

PRICE=[$0])

OLAPFilterRel(condition=[OR(=($3,

123456),

=($5,

’New'))])OLAPJoinRel(condition=[=($2,

$25)],

joinType=[left])OLAPJoinRel(condition=[AND(=($6,

$22),

=($2,

$17))],

joinType=[left])OLAPJoinRel(condition=[=($4,$12)],

joinType=[left])OLAPTableScan(table=[[DEFAULT,

TEST_KYLIN_FACT]],

fields=[[0,1,

2,

3,

4,

5,

6,

7,

8,

9,

10,

11]])OLAPTableScan(table=[[DEFAULT,

TEST_CAL_DT]],

fields=[[0,1]])OLAPTableScan(table=[[DEFAULT,

test_category]],

fields=[[0,1,

2,

3,

4,

5,

6,

7,

8]])OLAPTableScan(table=[[DEFAULT,

TEST_SITES]],

fields=[[0,1,

2]])QueryEngine–KylinExplainP21

Plugin-able

storage

engine

Common

iterator

interface

for

storage

engineIsolate

query

engine

from

underline

storage

Translate

cube

query

into

HBase

table

scan

Columns,

Groups

Cuboid

IDFilters

->

Scan

Range

(Row

Key)Aggregations

->

Measure

Columns

(Row

Values)

Scan

HBase

table

and

translate

HBase

result

into

cube

result

HBase

Result

(key

+

value)

->

Cube

Result

(dimensions

+

measures)How

to

Query

Cube?Storage

EnginePlugin-ablestorageengineCo22

Curse

of

dimensionality:

N

dimension

cube

has

2N

cuboid

Full

Cube

vs.

Partial

Cube

Hugh

data

volume

Dictionary

EncodingIncremental

BuildingHow

to

Optimize

Cube?Cube

OptimizationCurseofdimensionality:Ndi23

Full

Cube

Pre-aggregate

all

dimension

combinations“Curse

of

dimensionality”:

N

dimension

cube

has

2N

cuboid.

Partial

Cube

To

avoid

dimension

explosion,

we

divide

the

dimensions

intodifferent

aggregation

groups

2N+M+L

2N

+

2M

+

2L

For

cube

with

30

dimensions,

if

we

divide

these

dimensions

into

3group,

the

cuboid

number

will

reduce

from

1

Billion

to

3

Thousands

230

210

+

210

+

210

Tradeoff

between

online

aggregation

and

offline

pre-aggregationHow

to

Optimize

Cube?Full

Cube

vs.

Partial

CubeFullCubePre-aggregatealld24How

to

Optimize

Cube?Partial

CubeHowtoOptimizeCube?PartialC25

Data

cube

has

lost

of

duplicated

dimension

valuesDictionary

maps

dimension

values

into

IDs

that

will

reduce

the

memory

and

storagefootprint.Dictionary

is

based

on

TrieHow

to

Optimize

Cube?Dictionary

EncodingDatacubehaslostofduplica26How

to

Optimize

Cube?Incremental

BuildHowtoOptimizeCube?Increment27CubeInvertedIndexStorageformatPre-aggregatedcuboidsSharding,columnarstorage,withinvertedindexonrowblocksQuerymethodCuboidscanningMassiveparallelprocessingStrengthPre-aggregatehugehistoricdatatosmallsummariesSwiftresponsetoreal-timedataWeaknessTaketimetobuildSlowatscanninglargedatavolumeStreaming,

ongoing

effort

Cube

is

great,

but…

Sometimes

we

want

to

drill

down

to

row

level

informationCube

takes

time

to

build,

how

about

real-time

analysis?

Streaming

with

inverted

indexCubeInvertedIndexStorageformat28streamingKarfkahourly/dailybatchminutes

batch

Inverted

IndexReal-time

StoreKylin

0.8,

Lambda

Architecture

SQL

Query

Hybrid

Storage

Interface

CubeHistoric

StorestreamingKarfkahourly/dailybat29http://kylin.ioAgenda

What’s

Apache

Kylin?Tech

HighlightsPerformanceRoadmapQ&Ahttp://kylin.ioAgendaWhat’sA30Kylin

vs.

Hive#QueryTypeReturn

DatasetQueryOn

Kylin

(s)QueryOn

Hive

(s)Comments1High

LevelAggregation40.129157.4371,217

times23Analysis

QueryDrill

Down

toDetail22,669325,0291.61512.058109.206113.12368

times9

times4Drill

Down

toDetail524,78022.426383.21278

times5Data

Dump972,00249.054N/A100

50

0200150SQL

#1SQL

#2SQL

#3HiveKylinHighLevelAggregatio

nAnalysis

QueryDrillDownto

DetailLow

LevelAggregatio

nTransactio

n

LevelBased

on

12+B

records

caseKylinvs.Hive#QueryTypeReturn31Performance

--

ConcurrencyLinear

scale

out

with

more

nodesPerformance--ConcurrencyLine32Performance

-

Query

Latency90%

queries

<5sGreen

Line:

90%tile

queriesGray

Line:

95%tile

queriesPerformance-QueryLatency90%33http://kylin.ioAgenda

What’s

Apache

Kylin

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

最新文档

评论

0/150

提交评论