Apache Kylin: OLAP on Hadoop
http://kylin.io

Agenda
• What's Apache Kylin?
• Tech Highlights
• Performance
• Roadmap
• Q&A

What's Kylin: Extreme OLAP Engine for Big Data
Kylin is an open source Distributed Analytics Engine from eBay that provides a SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supporting extremely large datasets.

kylin / ˈkiːˈlɪn / 麒麟: n. (in Chinese art) a mythical animal of composite form

• Open sourced on Oct 1st, 2014
• Accepted as an Apache Incubator project on Nov 25th, 2014

Big Data Era
• More and more data is becoming available on Hadoop

Limitations in existing Business Intelligence (BI) tools
• Limited support for Hadoop
• Data size growing exponentially
• High latency of interactive queries
• Scale-up architecture

Challenges in adopting Hadoop as an interactive analysis system
• The majority of analyst groups are SQL savvy
• No mature SQL interface on Hadoop
• OLAP capability in the Hadoop ecosystem is not ready yet

Why not build an engine from scratch?

Features Highlights
• Extreme Scale OLAP Engine: Kylin is designed to query 10+ billion rows on Hadoop.
• ANSI SQL Interface on Hadoop: Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions.
• Seamless Integration with BI Tools: Kylin currently offers integration with BI tools like Tableau.
• Interactive Query Capability: Users can interact with Hive tables at sub-second latency.
• MOLAP Cube: Define a data model from Hive tables and pre-build it in Kylin.
• Scale Out Architecture: The query server cluster supports thousands of concurrent users and provides high availability.

Features Highlights…
• Compression and encoding support
• Incremental refresh of cubes
• Approximate query capability for distinct count (HyperLogLog)
• Leverage HBase coprocessor for query latency
• Job management and monitoring
• Easy web interface to manage, build, monitor and query cubes
• Security capability to set ACL at cube/project level
• Support for LDAP integration

Cube Designer

Job Management

Query and Visualization

Tableau Integration

Who is using Kylin

  Case                          Cube Size   Raw Records
  User Session Analysis         26 TB       28+ billion rows
  Classified Traffic Analysis   21 TB       20+ billion rows
  GeoX Behavior Analysis        560 GB      1.2+ billion rows

• eBay: 90% of queries < 5 seconds
• Baidu: Baidu Map internal analysis
• Many other proofs of concept: Bloomberg Law, British Gas, JD, Microsoft, StubHub, Tableau …

Kylin Architecture Overview
Diagram components:
• SQL-based tools (BI tools: Tableau…), connecting via JDBC/ODBC
• 3rd-party apps (web app, mobile…), connecting via REST API
• REST Server, Query Engine, Routing, Metadata
• Cube Build Engine (MapReduce…)
• Hadoop Hive: star schema data; mid latency (minutes)
• OLAP cube / data cube (HBase): key-value data; low latency (seconds)
• Online analysis data flow and offline data flow
Notes:
• Clients/users interact with Kylin via SQL
• The OLAP cube is transparent to users

Data Modeling
• Source: star schema (a fact table with dimension tables)
• Cube metadata: Cube: …; Fact Table: …; Dimensions: …; Measures: …; Storage (HBase): …
• Target: HBase storage, with row keys (row A, row B, row C) and a column family holding column values (Val 1, Val 2, Val 3)
• Mapping: from the source star schema to the target HBase storage
• Roles: End User, Cube Modeler, Admin

OLAP Cube: Balance between Space and Time
• Cuboid = one combination of dimensions
• Cube = all combinations of dimensions (all cuboids)
• Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells

Cuboid lattice for the dimensions time, item, location, supplier:
  0-D (apex) cuboid
  1-D cuboids: time; item; location; supplier
  2-D cuboids: (time, item); (time, location); (time, supplier); (item, location); (item, supplier); (location, supplier)
  3-D cuboids: (time, item, location); (time, item, supplier); (time, location, supplier); (item, location, supplier)
  4-D (base) cuboid: (time, item, location, supplier)

Example cells:
  1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier>
  2. (9/15, milk, Urbana, *) - <time, item, location>
  3. (*, milk, Urbana, *) - <item, location>
  4. (*, milk, Chicago, *) - <item, location>
  5. (*, milk, *, *) - <item>

Cube Build Job Flow

How to Store Cube? HBase Schema
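
To make the cuboid lattice from the OLAP cube slide concrete, here is a minimal Python sketch that enumerates every cuboid for the four example dimensions (time, item, location, supplier). The function name `enumerate_cuboids` is hypothetical and is not part of Kylin; the sketch only illustrates that N dimensions yield 2^N cuboids.

```python
from itertools import combinations


def enumerate_cuboids(dimensions):
    """Return a dict mapping k to the list of all k-dimension cuboids (k = 0..N)."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        # Every k-element subset of the dimensions is one k-D cuboid.
        lattice[k] = list(combinations(dimensions, k))
    return lattice


dims = ["time", "item", "location", "supplier"]
lattice = enumerate_cuboids(dims)
total = sum(len(cuboids) for cuboids in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D: {len(cuboids)} cuboid(s)")
print("total cuboids:", total)  # 2**4 = 16
```

The 0-D entry is the apex cuboid (no dimensions) and the 4-D entry is the base cuboid (all dimensions).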

How to Query Cube? Query Engine: Calcite
• Dynamic data management framework.
• Formerly known as Optiq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others.

How to Query Cube? Kylin Extensions on Calcite
• Metadata SPI: provide table schema from Kylin metadata
• Optimize Rule: translate the logical operator into a Kylin operator
• Relational Operator:
  - Find the right cube
  - Translate SQL into storage engine API calls
  - Generate the physical execution plan using the linq4j Java implementation
• Result Enumerator: translate the storage engine result into the Java implementation result
• SQL Function:
  - Add HyperLogLog for distinct count
  - Implement date/time related functions (e.g. Quarter)
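
The slides list HyperLogLog for approximate distinct count but do not show how it works. The following is a simplified, self-contained Python sketch of the general HyperLogLog technique, not Kylin's implementation. The class name, the SHA-1 based hash, and the default precision `p=12` are all assumptions made for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog sketch (illustrative only, not Kylin's code)."""

    def __init__(self, p=12):
        self.p = p                      # number of index bits
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant, valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")      # 64-bit hash
        idx = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)      # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


hll = HyperLogLog()
for i in range(10000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")   # duplicates do not change the estimate
print("approximate distinct count:", hll.count())
```

The sketch keeps only 4,096 small registers regardless of how many values are added, which is the space-for-accuracy trade that makes approximate distinct count attractive for pre-aggregated measures.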

Query Engine: Kylin Explain Plan

    SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
           test_kylin_fact.lstg_format_name, test_sites.site_name,
           SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT
    FROM test_kylin_fact
    LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt
    LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id
         AND test_kylin_fact.lstg_site_id = test_category.site_id
    LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id
    WHERE test_kylin_fact.seller_id = 123456 OR test_kylin_fact.lstg_format_name = 'New'
    GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name,
             test_kylin_fact.lstg_format_name, test_sites.site_name

    OLAPToEnumerableConverter
      OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8])
        OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$SUM0($6)], agg#1=[COUNT($6)], TRANS_CNT=[COUNT()])
          OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0])
            OLAPFilterRel(condition=[OR(=($3, 123456), =($5, 'New'))])
              OLAPJoinRel(condition=[=($2, $25)], joinType=[left])
                OLAPJoinRel(condition=[AND(=($6, $22), =($2, $17))], joinType=[left])
                  OLAPJoinRel(condition=[=($4, $12)], joinType=[left])
                    OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
                    OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]])
                  OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]])
                OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

How to Query Cube? Storage Engine
• Plugin-able storage engine
  - Common iterator interface for storage engines
  - Isolates the query engine from the underlying storage
• Translate a cube query into an HBase table scan
  - Columns, Groups -> Cuboid ID
  - Filters -> Scan Range (Row Key)
  - Aggregations -> Measure Columns (Row Values)
• Scan the HBase table and translate the HBase result into a cube result
  - HBase Result (key + value) -> Cube Result (dimensions + measures)
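
The translation from a cube query to a storage scan can be sketched as follows. This is a hedged illustration, not Kylin's storage engine: it assumes (the slides do not say so) that a cuboid ID is a bitmask with one bit per cube dimension. The dimension list, the `ScanPlan` structure, the equality-only filters, and the function names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical cube dimensions, in row-key order.
DIMENSIONS = ["time", "item", "location", "supplier"]


def cuboid_id(columns):
    """Build a bitmask with one bit per dimension (first dimension = highest bit)."""
    mask = 0
    n = len(DIMENSIONS)
    for col in columns:
        mask |= 1 << (n - 1 - DIMENSIONS.index(col))
    return mask


@dataclass
class ScanPlan:
    cuboid: int              # which cuboid to scan
    row_key_filters: dict    # dimension -> value, used to narrow the row-key range
    measure_columns: list    # which pre-aggregated measures to read


def translate(group_by, filters, aggregations):
    # Columns used for grouping or filtering determine the cuboid.
    used = set(group_by) | set(filters)
    return ScanPlan(cuboid_id(used), dict(filters), list(aggregations))


plan = translate(
    group_by=["item", "location"],
    filters={"time": "9/15"},
    aggregations=["SUM(price)"],
)
print(format(plan.cuboid, "04b"), plan.row_key_filters, plan.measure_columns)
```

In this sketch a query grouping by item and location with a filter on time selects the (time, item, location) cuboid, so no supplier-level rows need to be scanned.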

How to Optimize Cube? Cube Optimization
• Curse of dimensionality: an N-dimension cube has 2^N cuboids
  - Full Cube vs. Partial Cube
• Huge data volume
  - Dictionary Encoding
  - Incremental Building

How to Optimize Cube? Full Cube vs. Partial Cube
• Full Cube
  - Pre-aggregate all dimension combinations
  - "Curse of dimensionality": an N-dimension cube has 2^N cuboids
• Partial Cube
  - To avoid dimension explosion, we divide the dimensions into different aggregation groups: 2^(N+M+L) -> 2^N + 2^M + 2^L
  - For a cube with 30 dimensions, if we divide these dimensions into 3 groups, the number of cuboids drops from about 1 billion to about 3 thousand: 2^30 -> 2^10 + 2^10 + 2^10
  - Tradeoff between online aggregation and offline pre-aggregation

How to Optimize Cube? Partial Cube
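
The cuboid counts quoted for full and partial cubes can be checked with a few lines of Python. The function names are hypothetical. The partial-cube count follows the slide's formula (a plain sum of 2^size per aggregation group) and ignores any cuboids shared between groups.

```python
def full_cube_cuboids(n_dims):
    """Number of cuboids when every dimension combination is pre-aggregated."""
    return 2 ** n_dims


def partial_cube_cuboids(group_sizes):
    """Number of cuboids when dimensions are split into aggregation groups."""
    return sum(2 ** size for size in group_sizes)


full = full_cube_cuboids(30)
partial = partial_cube_cuboids([10, 10, 10])
print(full)     # 1073741824 (about 1 billion)
print(partial)  # 3072 (about 3 thousand)
```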

How to Optimize Cube? Dictionary Encoding
• A data cube has lots of duplicated dimension values
• A dictionary maps dimension values to IDs, which reduces the memory and storage footprint
• The dictionary is based on a trie
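
As a rough illustration of dictionary encoding, the Python sketch below maps repeated dimension values to small integer IDs. The slides say Kylin's dictionary is trie-based. This sketch deliberately swaps in a plain sorted list and a hash map instead, so it shows the idea of the encoding rather than Kylin's data structure. The function name and the sample values are hypothetical.

```python
def build_dictionary(values):
    """Assign dense integer IDs to the distinct values, in sorted order."""
    distinct = sorted(set(values))
    value_to_id = {value: i for i, value in enumerate(distinct)}
    return value_to_id, distinct   # `distinct` doubles as the ID -> value table


# A dimension column with many repeated values.
column = ["Urbana", "Chicago", "Urbana", "Urbana", "Chicago"]

value_to_id, id_to_value = build_dictionary(column)
encoded = [value_to_id[v] for v in column]     # store small ints instead of strings
decoded = [id_to_value[i] for i in encoded]    # lossless round trip

print(encoded)
print(decoded == column)
```

Because IDs are assigned in sorted order, comparing two IDs gives the same result as comparing the original values, which is convenient when the IDs are used inside row keys.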

How to Optimize Cube? Incremental Build

Streaming with Inverted Index
• Cube is great, but…
  - Sometimes we want to drill down to row-level information
  - Cube takes time to build; how about real-time analysis?
• Streaming, ongoing effort

                   Cube                                                   Inverted Index
  Storage format   Pre-aggregated cuboids                                 Sharding, columnar storage, with inverted index on row blocks
  Query method     Cuboid scanning                                        Massive parallel processing
  Strength         Pre-aggregates huge historic data to small summaries   Swift response to real-time data
  Weakness         Takes time to build                                    Slow at scanning large data volumes

Kylin 0.8, Lambda Architecture
Diagram components:
• Streaming source: Kafka
• Hourly/daily batch: Cube (Historic Store)
• Minutes batch: Inverted Index (Real-time Store)
• SQL Query
• Hybrid Storage Interface

Kylin vs. Hive

  #  Query Type               Return Dataset   Query on Kylin (s)   Query on Hive (s)   Comments
  1  High Level Aggregation   4                0.129                157.437             1,217 times
  2  Analysis Query           22,669           1.615                109.206             68 times
  3  Drill Down to Detail     325,029          12.058               113.123             9 times
  4  Drill Down to Detail     524,780          22.426               383.212             78 times
  5  Data Dump                972,002          49.054               N/A

Chart: Hive vs. Kylin for SQL #1, SQL #2 and SQL #3 (axis from 0 to 200).
Query types shown: High Level Aggregation, Analysis Query, Drill Down to Detail, Low Level Aggregation, Transaction Level.
Based on 12+B records case.

Performance: Concurrency
• Linear scale out with more nodes

Performance: Query Latency
• 90% of queries < 5s
• Green line: 90th percentile of queries
• Gray line: 95th percentile of queries