Special Forum: Big Data Courseware

Uploaded: 2022-11-05. Format: PPT. Pages: 87. Size: 10.93 MB.


Big Data vs Smart Model: Beauty and the Beast

Prof. Yike Guo
Department of Computing
Imperial College London

Model: Mathematical Representation of a Simplified Physical World

Modelling is an essential and inseparable part of all scientific activity. A scientific model seeks to represent empirical objects, phenomena, and physical processes in a logical and objective way.

To understand an object in the world (called the target T), we build a model M: a simplified mathematical representation of it.

A model is the result of abstraction from the observations made, and it is used to give predictions.

[Diagram labels: Human, Sensor, Machine]

No Model Is Perfect:

• Inherent Uncertainty: These targets consist of a set of continuous phenomena (in both time and space), and they typically produce rich signals. Because of the continuity of the target in both time and space, the signals are in principle infinite. But observations (e.g. sensor readings) are made at discrete points in time and space, so they are incomplete and approximate, which brings the "uncertainty".

• Overfitting and Underfitting: When learning a model from observations, such as learning a nonlinear regression model, we need to choose parameters such as k. Because the information from observations is partial, it is hard to make a perfect choice of k. Such imperfection causes the problem of model error: underfitting (small k) and overfitting (large k).

• Simplification: From observations, we project from a multi-dimensional world to a simplified model with significantly reduced dimensionality, to focus on the features or properties we are interested in.

Nonlinear regression: k-order polynomial
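The effect of the polynomial order k on model error can be sketched in a few lines. This is only an illustration under stated assumptions, not code from the talk: the sample points, the helper names (`fit_polynomial`, `mean_squared_error`, `errors_by_order`) and the choice of solving the normal equations directly are all made up for the example.

```python
def fit_polynomial(xs, ys, k):
    """Least-squares fit of a k-order polynomial; returns coefficients c[0..k]."""
    n = k + 1
    # Normal equations A c = b with A[i][j] = sum(x^(i+j)) and b[i] = sum(y * x^i).
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - tail) / a[i][i]
    return coeffs


def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))


def mean_squared_error(coeffs, xs, ys):
    return sum((predict(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical observations: a cubic target plus a small alternating disturbance.
XS = [-1.0 + 0.2 * i for i in range(11)]
YS = [x ** 3 - x + 0.05 * (-1) ** i for i, x in enumerate(XS)]
# Held-out points between the observations, taken from the undisturbed target.
HELD_OUT_XS = [-0.9 + 0.2 * i for i in range(10)]
HELD_OUT_YS = [x ** 3 - x for x in HELD_OUT_XS]


def errors_by_order(orders=(1, 3, 7)):
    """Map each order k to (training error, held-out error)."""
    result = {}
    for k in orders:
        coeffs = fit_polynomial(XS, YS, k)
        result[k] = (
            mean_squared_error(coeffs, XS, YS),
            mean_squared_error(coeffs, HELD_OUT_XS, HELD_OUT_YS),
        )
    return result


if __name__ == "__main__":
    for k, (train, held_out) in errors_by_order().items():
        print(f"k={k}: training error={train:.5f}, held-out error={held_out:.5f}")
```

Because each higher-order fit contains the lower-order ones as special cases, the training error cannot increase as k grows; it is the held-out error that indicates whether a given k underfits or overfits.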

So, Why Model?

George Box (statistician): "All models are wrong, but some are useful."

Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us.

Peter Norvig (Google): "All models are wrong, and increasingly you can succeed without them."

Chris Anderson (Wired): There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot. (The Data Deluge Makes the Scientific Method Obsolete)

[Timeline labels on the slide: 1980, 2008, 2012]

The Google Argument

At the petabyte scale, information is not a matter of simple three- and four-dimensional taxonomy and order but of dimensionally agnostic statistics. It calls for an entirely different approach, one that requires us to lose the tether of data as something that can be visualized in its totality. It forces us to view data mathematically first and establish a context for it later.

For instance, Google conquered the advertising world with nothing more than applied mathematics. It didn't pretend to know anything about the culture and conventions of advertising: it just assumed that better data, with better analytical tools, would win the day. And Google was right.

Google's founding philosophy is that we don't know why this page is better than that one: if the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

Model-Free Sensor Informatics: Query Driven

[Diagram: a sensor network feeds a database table of raw data with columns time, id and temp, e.g. readings at 10am for sensor ids 1, 2, .., 7, with temp values such as 29.]

User workflow:
1. Extract all readings into a file
2. Run MATLAB/R/other data processing tools
3. Write the output to a file or back to the database
4. Write data processing tools to process/aggregate the output (maybe using a DB)
5. Decide what new data to acquire
6. Repeat

Model-free sensing treats the sensory system as a database, and sensing as querying to fetch data from the physical world. One of the leading vendors [Crossbow] is bundling a query processor with their devices.
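The query-driven workflow above can be sketched with an in-memory SQLite database. This is a minimal sketch, assuming a hypothetical `raw_data` table with the columns named on the slide (time, id, temp); the function name, the `summary` table and the readings in the usage note are invented for illustration and do not come from any real sensor deployment.

```python
import sqlite3


def run_query_driven_workflow(readings):
    """Extract raw readings, aggregate them offline, and write the result back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_data (time TEXT, id INTEGER, temp REAL)")
    conn.executemany("INSERT INTO raw_data VALUES (?, ?, ?)", readings)

    # Step 1: extract all readings (no model decides which ones are needed).
    rows = conn.execute("SELECT time, id, temp FROM raw_data").fetchall()

    # Step 2: process them outside the database (here: mean temperature per time).
    totals = {}
    for time, _sensor_id, temp in rows:
        count, total = totals.get(time, (0, 0.0))
        totals[time] = (count + 1, total + temp)
    summary = {time: total / count for time, (count, total) in totals.items()}

    # Step 3: write the output back to the database.
    conn.execute("CREATE TABLE summary (time TEXT PRIMARY KEY, mean_temp REAL)")
    conn.executemany("INSERT INTO summary VALUES (?, ?)", summary.items())
    conn.commit()

    # Step 4: read the aggregated output back for the user to inspect.
    result = dict(conn.execute("SELECT time, mean_temp FROM summary ORDER BY time"))
    conn.close()
    return result
```

For example, calling `run_query_driven_workflow([("10am", 1, 27.0), ("10am", 2, 28.0), ("10am", 7, 29.0)])` pulls every stored reading before computing a single mean, which is the inefficiency the model-free approach is later criticised for.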

Wikisensing: A Model-Free Sensor Informatics System Based on Big Data Architecture

Model-Free Sensing Is Super Inefficient

• Data misrepresentation without a model
• Latent information is missing without a model
• High demand for computation/storage without a model
• Requires too much interoperability between sensors and analytics

Bayesian: Data Is Not the Enemy of Models, Rather a Great Supporter!

Bayesian probability is a formalism that allows us to reason about beliefs in models under conditions of uncertainty, based on the observations (data).

If we have observed that a particular event has happened, such as Britain coming 10th in the medal table at the 2004 Olympics, then there is no uncertainty about it.

However, suppose a is the statement "Britain sweeps the boards at the 2012 London Olympics, winning more than … Gold Medals!" made before 28th July.

Since this is a statement about a future event, nobody can state with any certainty whether or not it is true. Different people may have different beliefs in the statement depending on their specific knowledge of factors that might affect its likelihood.

The beliefs in the model were changing daily based on the performance data available each day. On the … of August, most people's belief in this model should be almost 80%.

Thus, in general, a person's subjective belief in a statement a will depend on some body of knowledge K. We write this as P(a|K). Henry's belief in a is different from Marcel's because they are using different K's. However, even if they were using the same K they might still have different beliefs in a.

The expression P(a|K) thus represents a belief measure. Sometimes, for simplicity, when K remains constant we just write P(a), but you must be aware that this is a simplification.

Model and Data Interaction: Bayesian Inference

• Bayes rule: interaction between data and model
• Learning as a sequence of interactions

$P(M \mid Y) = \frac{p(Y \mid M)\,P(M)}{p(Y)}$

Big Data Meets Smart Models: A Bayesian Approach towards Sensor Informatics

• We need a model: the representation of our knowledge so far
• Data is the observations, which may revise our belief in the models we have
• Analysis is assessing our belief and updating our models to make them believable
• Sensing is acquiring the data needed to update (enrich) models

Models are learned from data (observations) by scientists (theoretical abstraction) or by machine (machine learning):
• Models are hypotheses (when making a new observation)
• Models are knowledge (when established as belief)

Sensor Informatics:
• Sensing management: managing the "neediness", i.e. when and where to sense
• Sensing analytics: managing model updating, i.e. how to enrich models with observations
• Reasoning: decision making based on the integration of trusted models

$P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$
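Bayes' rule as a repeated interaction between data and model can be sketched for a small, discrete set of candidate models. The two coin models and their probabilities below are hypothetical, chosen only to make the arithmetic easy to follow; `bayes_update` and `sequential_update` are names introduced for this sketch.

```python
def bayes_update(prior, likelihoods):
    """Return the posterior P(M|D) given a prior P(M) and likelihoods P(D|M)."""
    unnormalised = {m: prior[m] * likelihoods[m] for m in prior}
    evidence = sum(unnormalised.values())  # P(D), the normalising constant
    return {m: p / evidence for m, p in unnormalised.items()}


def sequential_update(prior, observations, likelihood_fn):
    """Apply Bayes' rule once per observation; each posterior becomes the next prior."""
    belief = dict(prior)
    for d in observations:
        belief = bayes_update(belief, {m: likelihood_fn(d, m) for m in belief})
    return belief


# Hypothetical models: probability of heads under each candidate model.
HEADS_PROBABILITY = {"fair": 0.5, "biased": 0.8}


def coin_likelihood(observation, model):
    p_heads = HEADS_PROBABILITY[model]
    return p_heads if observation == "H" else 1.0 - p_heads


if __name__ == "__main__":
    prior = {"fair": 0.5, "biased": 0.5}
    print(sequential_update(prior, ["H", "H", "T", "H"], coin_likelihood))
```

Each call to `bayes_update` is one "interaction": the data reweights the models, and the result is carried forward as the belief held before the next observation.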

A Surprising Event: When an Observation Does Not Fit a Known Model

When the posterior and the prior, P(M|D) and P(M), differ greatly, that is a surprise!

How great must the difference be? Set a surprise threshold α.

• Kullback-Leibler divergence between posterior and prior
• Other methods: significance level, Chebyshev's Theorem, …

From the model, we get C(A, B) (e.g. a multivariate Gaussian distribution):

• A: 100mm, B: 50mm -> model consistent
• A: 100mm, B: 500mm -> surprise!
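A surprise test based on the Kullback-Leibler divergence can be sketched as follows. The direction of the divergence (posterior relative to prior), the discrete distributions and the threshold value in the usage note are assumptions made for illustration; the slide only names the divergence and the threshold α.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same keys.

    Assumes q[k] > 0 wherever p[k] > 0.
    """
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


def is_surprising(prior, posterior, alpha):
    """Flag a surprise when the posterior has moved further than alpha from the prior."""
    return kl_divergence(posterior, prior) > alpha
```

With a prior of `{"m1": 0.5, "m2": 0.5}` and a threshold of 0.1, a posterior of `{"m1": 0.55, "m2": 0.45}` stays below the threshold, while `{"m1": 0.99, "m2": 0.01}` exceeds it and would be treated as a surprise.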

Camera example: Image -> Analog Signal -> Digital Data -> Compressed Data -> Information

Why sense so much data and then throw it away? Why not sense the information directly?

Using Compressive Sensing Technology to Optimize Observations

• Compressive sensing: take advantage of sparseness to solve for under-determined signals with just a small number of measurements.
• Unobserved behavior (behavior not captured by the current model) is typically sparse.
• Reconstruction methods: L1-min, Bayesian CS.
• Sensing data is enough when we can recover the needed information through compressive sensing.

Ψ: matrix built from the model
Φ: placement matrix
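The L1-min reconstruction named above is commonly written as the following optimisation problem. This is the standard textbook formulation, stated here as an assumption about what the slide intends: $y$ (the measurement vector), $x$ (the full signal) and $s$ (its sparse coefficients) are symbols introduced for this sketch, while $\Phi$ and $\Psi$ are the placement and model matrices defined on the slide.

```latex
\hat{s} \;=\; \arg\min_{s} \; \lVert s \rVert_{1}
\quad \text{subject to} \quad y = \Phi \Psi s,
\qquad \hat{x} = \Psi \hat{s}
```

The system $y = \Phi \Psi s$ has fewer equations than unknowns, so it is under-determined; minimising the L1 norm selects a sparse solution among the many that fit the measurements.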

How to Update Model: Parameter Estimation

[Figure: nodal solution from a numerical simulation showing average temperature (TEMP), STEP=360, SUB=1, TIME=1800, RSYS=0; values range from SMN=131.03 to SMX=646.41; plot dated DEC 2011, 21:15:23.]

Estimating the parameters to maximize the likelihood of the data given the model.
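Maximum-likelihood parameter estimation can be illustrated with the simplest possible case, a single Gaussian. This is a sketch under that assumption only; the talk does not say which model family is fitted, and the function names and the sample values in the usage note are made up.

```python
import math


def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of the data under a Gaussian with mean mu and std sigma."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - squared_error / (2 * sigma ** 2)


def fit_gaussian_mle(data):
    """Closed-form maximum-likelihood estimates of the Gaussian mean and std."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return mu, sigma
```

For the readings `[2.0, 4.0, 4.0, 6.0]`, `fit_gaussian_mle` returns a mean of 4.0 and a standard deviation of √2; evaluating `gaussian_log_likelihood` at any other parameter values gives a lower score, which is what "maximize the likelihood" means here.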

Model: An Example in Digital City

Modelling City Life via Causality

• C(eA, eB) is used to predict the current value at a location (A) when the value at another location (B) is given.
• Location: physical or logical locations with causality (through sensory cortex) (city areas, …)
• Relationship: topology (geo topology between … and …, diffusion structure)
• Event: events, which are the dynamics of an observable signal f(E) (e.g. heavy rainfall)

Ontologies are adopted to represent locations L, relationships R*, events E and signals S.

Diffusion: an event e1 ∈ n1 causes another event e2 ∈ n2 when the two nodes n1, n2 are linked.

Digital City Model: looking into the details

System T = (L, R, E)
Model M(T) = (G, ∅, B)

Training for causality ∅: use a Bayesian network to represent the conditional independencies between cause and target variables:
1. Gaussian Mixture Models (GMMs), estimated via expectation maximization (EM)
2. Gaussian Process with Bayesian Inference

When the surprise > surprise threshold:

• Diversity detected -> identify the incorrect causality C(el, ep), which is sparse -> compressive sensing approach
• New observation -> measurement that could revise the model space to maximize the likelihood of the observations
• Focusing on diversity

[Diagram labels: Placement, Model Updating, Model Driven Sensing, Surprise]

The dynamics of model update: Surprise -> Sensing -> Model Updating

• The goal of sensing: capturing surprise
• The goal of analysis: revising the model

A model cannot overfit or underfit: when there is diversity, it can be updated -> consistent with the universe (target)

Model Update: It's a Bayesian

$P(M, \Theta \mid D) = \frac{P(D \mid M, \Theta)\,P(M, \Theta)}{P(D)}$

T: target, M: model, ϴ: top-down parameter

* When ϴ is fixed: $P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$
-> The variance between posterior and prior is the "surprise"
-> Bottom-up attention model update (data assimilation): combining observations of the current state of a system with the results from a model (the forecast) to produce an analysis. The model is then advanced in time and its result becomes the forecast in the next analysis cycle.

* When ϴ is updated: $P(M, \Theta) = P(M \mid \Theta)\,P(\Theta)$
-> Top-down attention (alertness) model update
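The forecast/analysis cycle of data assimilation described above can be sketched with a scalar, Kalman-style update. This is one common way to realise the cycle and is named here as a swapped-in technique, not as the method used in the talk; the function names, the trivial persistence model and all variance values are assumptions for illustration.

```python
def analysis_step(forecast, forecast_var, observation, obs_var):
    """Blend forecast and observation, weighting each by the other's variance."""
    gain = forecast_var / (forecast_var + obs_var)
    analysis = forecast + gain * (observation - forecast)
    analysis_var = (1.0 - gain) * forecast_var
    return analysis, analysis_var


def forecast_step(state, state_var, growth=1.0, model_noise_var=0.1):
    """Advance the (hypothetical) linear model one step in time."""
    return growth * state, growth ** 2 * state_var + model_noise_var


def assimilate(initial, initial_var, observations, obs_var):
    """Run the cycle: analysis from forecast + observation, then a new forecast."""
    state, var = initial, initial_var
    analyses = []
    for obs in observations:
        state, var = analysis_step(state, var, obs, obs_var)
        analyses.append(state)
        state, var = forecast_step(state, var)
    return analyses
```

With a forecast of 10.0 and an observation of 12.0 of equal variance, `analysis_step(10.0, 1.0, 12.0, 1.0)` returns the midpoint 11.0 with the variance halved to 0.5; a more trusted observation (smaller `obs_var`) pulls the analysis closer to the data.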

Adaptive Observation: Sensing and Numerical Modelling

[Diagram labels: CityGML, Ontology, GIS, Geometry, mesh]

Building an Initial Model and Making Prediction by Simulations

Setting boundary conditions, numerical schemas, model parameters, etc.

Simulation: 24 Building Case (Fine Mesh, 600000 Nodes): … Processors

Simulation: Moving Vehicles and Scalar Dispersions in Street Canyons

Using Sensors to Verify the Prediction Results of the Model

Sensing: acquiring data to get the posterior model, in order to validate the model (if consistent) or update it:

$P(M \mid D) = \frac{P(D \mid M)\,P(M)}{P(D)}$

[Diagram labels: Data, sensing, Model, validate, update]

New WikiSensing: Elastic Sensing Environment for Large Scale Sensor Informatics

• Elastic sensing theory based on Bayesian inference
• Big Data architecture for large scale sensory data management
• Ontology for background knowledge management
• Model driven adaptive observation support
• Digital City and digital life applications

The Architecture of the New WikiSensing System

Ontology Used to Organise the Complex Knowledge Management

• Using ontology to represent the targets, signals, sensing methods, measurements, etc.
• Ontology to support flexible resolution
• Upper ontology for unified operation
• OntoSensor

Conclusion

• Big data offers a great opportunity for building smart models
• Big data provides a new methodology for model research
• New informatics comes from the coupled integration of the data and the model worlds
• Bayesian theory provides a natural foundation for such integration
• Sensor Informatics is a good example of such a paradigm
• A new uniform framework of sensor informatics can be developed based on Bayesian theory, where the dynamics of data and model capture the essence of building a sensory system
• We are developing the WikiSensing system to realise this paradigm

Thank you

Understanding Big Data

Haixun Wang

Data Explosion

• MB = $10^6$ bytes: a typical book in text format
• GB = $10^9$ bytes: a one hour video is about 1GB; data produced by a biology experiment in one day
• TB = $10^{12}$ bytes: astronomy data in one night; the US Library of Congress has 1000 TB of data; the search log of Bing per day (2009)

The Arecibo Telescope

• World's largest radio telescope
• Diameter: 305 m (1,000 ft)
• Area: … acres
• Location: Arecibo, Puerto Rico
• The P-ALFA surveys: 800 Terabytes in … years

Software Driven Telescope

• From a few, large, expensive, directional dishes to many, small, cheap, omni-directional antennae
• A large number of high-speed input streams (2Gbps per antenna, 25,000 antennae in an area 340 … in diameter)

Challenge 1: It's the data, stupid!

• Data size
• Data complexity

[Diagram labels: Key/value store, Column store, Document store, Graph Systems]

Big data drives tomorrow's economy.

• The value of big data lies in its degree of connectedness.
• Existing systems cannot handle the rich connectedness in big data.

RDBMS and Rich Relationships

• Performance of multi-way joins is very poor in RDBMS
• Managing data of rich connectedness requires multi-way joins in RDBMS

Trinity

• A general purpose, distributed, in-memory graph system
• Online graph query processing
• Offline graph analytics

Trinity Performance Highlight

• Online query processing:
– visiting 2.2 million users (…-hop neighborhood) on Facebook: 100ms
– foundation for graph-based services, e.g., entity search
• Offline graph analytics:
– one iteration on a … billion node graph: 60sec
– foundation for analytics, e.g., social analytics

People Search Demo

Multi-way Join vs. Graph Traversal

[Diagram: in an RDBMS, Company, Incident and Problem tables are linked through ID columns (ID1, ID2, …; ID3, ID4, …) and queried with multi-way joins; in Trinity, the same Company, Incident and Problem entities are reached by graph traversal.]
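The contrast between a multi-way join and a graph traversal can be sketched on a toy Company -> Incident -> Problem dataset. The identifiers, link tables and function names below are hypothetical, and the code does not use or imitate the Trinity API; it only shows the two access patterns side by side.

```python
# Relational view: two link tables, as in the RDBMS side of the diagram.
COMPANY_INCIDENT = [("c1", "i1"), ("c1", "i2"), ("c2", "i3")]
INCIDENT_PROBLEM = [("i1", "p1"), ("i2", "p1"), ("i2", "p2"), ("i3", "p3")]


def problems_by_join(company):
    """Nested-loop join Company -> Incident -> Problem: rescans the link tables."""
    return sorted({
        problem
        for comp, incident in COMPANY_INCIDENT
        if comp == company
        for incident_id, problem in INCIDENT_PROBLEM
        if incident_id == incident
    })


def build_graph(*edge_lists):
    """Graph view: each node stores direct references to its neighbours."""
    graph = {}
    for edges in edge_lists:
        for source, target in edges:
            graph.setdefault(source, []).append(target)
    return graph


def problems_by_traversal(graph, company, hops=2):
    """Follow adjacency lists for a fixed number of hops from the start node."""
    frontier = {company}
    for _ in range(hops):
        frontier = {nbr for node in frontier for nbr in graph.get(node, [])}
    return sorted(frontier)


if __name__ == "__main__":
    graph = build_graph(COMPANY_INCIDENT, INCIDENT_PROBLEM)
    print(problems_by_join("c1"), problems_by_traversal(graph, "c1"))
```

Both functions return the same problems for a company; the difference is that the join touches every row of each link table per hop, whereas the traversal only touches the neighbours of the nodes it has already reached.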

Challenge 2: Interpretation of Big Data

• IBM Watson:
– Runs on 2,880 cores, … terabytes of RAM, and 80kW of power
• A human brain:
– Runs on a tuna fish sandwich and a glass of water

[Diagram "the Eternal Quest", with axis labels "understanding the question" and "answering the question"; levels: simple calculation, domain specific language, inferencing & reasoning, unconstrained natural language; systems shown: calculator, SQL, Google/Bing?, Wolfram Alpha, Watson, SIRI, Human (Turing Test)]

Turning the Web into a Database

What you see when you look at a homepage …

Haixun Wang
Microsoft Research Asia
Email: haixunw microsoft com
Tel: +86-10-58963289
Tel: +1-914-902-0749

I joined Microsoft Research Asia in 2009. I was with IBM Watson Research Center from 2000 to 2009. I received the B.S. and M.S. degrees in Computer Science from Shanghai Jiao Tong University in 1994 and 1996, and the Ph.D. degree in Computer Science from University of California, Los Angeles in June, 2000.

What a machine sees when it looks at a homepage …

• a JPEG image (a jpeg file)
• text in a big, bold font
• 4 lines of text
• another dozen lines of text with two embedded URLs

Semantic Web?

• Number 1 trend in 2008 (Richard MacManus)
• The infrastructure to power the Semantic Web is already here. (Tim Berners-Lee)
• Unstructured information will give way to structured information, paving the road to intelligent computing. (Alex Iskold)

More data beats better algorithms

Banko and Brill 2001

[Chart: mean translation quality (1 = incomprehensible, … = perfect) for English-Spanish translation of Microsoft technical texts, 2001 to 2007. Series: Systran, MSRMT, Logos. Annotations: "Improve algorithms, scale system, and add data!"; "Rule-based system with expensive customizations for Microsoft"; "Off-the-shelf rule-based system".]

From Rick Rashid's talk: It's a data driven world, get over it!

Probase

• isA (concept, entities)
• isPropertyOf (attributes)
• Co-occurrence (isCEOof, LocatedIn, etc)
• Concepts ("Spanish Artists")
• Entities ("Pablo Picasso")

Explicit vs. Latent Knowledge

• Abstract representations (such as clusters from latent analysis) that lack linguistic counterparts are hard to learn or validate and tend to lose information.
• Human language has evolved over millennia to have words for the important concepts; let's use them.

Halevy, Norvig, Pereira, "The Unreasonable Effectiveness of Data", IEEE Intelligent Systems, 2009.

What is interpretation?

Add Common Sense to Computing

[Example labels: Pablo Picasso; … Oct 1881; Spanish]

Which is "kiki" and which is "bouba"?

[Diagram labels: sound, shape, zigzaggedness]

[Example labels: China, India: country; with Brazil: emerging market]

[Example labels: body, taste, smell: wine]

[Example labels: IT company; "The engineer is eating an apple"; fruit]

Multiple Concepts

Obama's real-estate policy
• president, politician
• investment, property, asset, plan, document

Multiple Concepts

apple, adobe
• apple: software company, brand, fruit, juice
• adobe: brand, software company, material
• software company, software manufacturer, brand
• juice, material
• brand, company, fruit, …

Multiple Concepts

Obama's real-estate policy
• president, politician
• investment, property, asset, plan, document
• thing, issue, term, example

Example (from B. Dolan): Who assassinated Abraham Lincoln?

The far reaching implications: Scientific Method

Scientific Method

What really counts is understanding, or a mastery of some common vocabulary

How can big data help?

• A much more rapid cycle of hypothesis generation and testing
• General access to knowledge of science
• Autonomous experimentation, with an 'active learning' model

Technological Singularity

If machines could even slightly surpass human intellect, they could improve their own designs in ways unforeseen by their designers, and thus recursively augment themselves into far greater intelligences.

Thanks

Big Data Platforms and Internet Application Services

Agenda

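• Current problems and challenges
• Solutions from companies in China and abroad
• Tencent's approach in the big data field

Agenda, Part 1: Current problems and challenges

Big data challenge (1): massive data storage technology?

1. Data is evolving from the PB scale to the ZB scale; how can storage and computing costs be reduced? (Data volume: 46 PB; number of machines: 5,600)
2. The rapid growth of industrial-grade services poses new challenges for the timeliness and reliability of big data computing

Big data challenge (2): data is hard to put to use

Big data challenge (3): precise recommendation is hard

1. Enterprise information overload (across the whole Internet)
2. Low recommendation accuracy
3. How to evaluate recommendation results effectively
4. How to effectively collect data on users' active behavior

Agenda, Part 2: Solutions from companies in China and abroad

Case from China: the Alipay big data platform (Alipay's Hadoop-related application services)

Hadoop open-source products: HBase, Mahout, Hive/Pig

Dolphin technology components: Fur Seal, Octopus, Starfish, Swordfish, Blue Whale, …

• Massive computing: a Hadoop-based massive storage and computing cluster that also provides one-stop management of computing and storage resources
• Distributed data mining: distributed data mining based on Mahout
• Data distribution center: provides batch data extraction and loading, plus near-real-time message and log distribution (using client-side pull)
• Real-time search over massive data: based on the integration of HBase and Solr; provides real-time query and full-text search over hundreds of billions of records
• Stream computing framework: a stream computing framework similar to M/R that lets applications be built quickly and provides online data processing services
• Massive data query: based on Hive and Pig; provides a web-based visual query service over massive data

• Online news: Google News reports that recommendations increase articles viewed by 38% (Das et al. 2007).
• Movies: Netflix reports that over 60% of their rentals originate from recommendations (Thompson 2008).
• Amazon, which sells music, books, and movies: 35% of sales are reported to originate from recommendations (Lamere & Green 2008).
• Video: YouTub…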
