
网络搜索引擎原理 (Principles of Web Search Engines)
陈光, 信息与通信工程学院 (School of Information and Communication Engineering)
Similarity & Clustering

What is clustering? (Ch. 16)
- Clustering: the process of grouping a set of objects into classes of similar objects.
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning.
  - Unsupervised learning = learning from raw data, as opposed to supervised data, where a classification of examples is given.
- A common and important task that finds many applications in IR and other places.

A data set with clear cluster structure (Ch. 16)
- How would you design an algorithm for finding the three clusters in this case?

Applications of clustering in Web Search (Sec. 16.1)
- Whole corpus analysis/navigation
  - Better user interface: search without typing.
- For improving recall in search applications
  - Better search results (like pseudo relevance feedback).
- For better navigation of search results
  - Effective "user recall" will be higher.
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search.

Sample of Yahoo!
- The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering.
- [Figure: a fragment of the hierarchy with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, missions, ...]

Sample of Google News
- Google News: automatic clustering gives an effective news-presentation metaphor.

Sample of Scatter/Gather (Sec. 16.1)
- Scatter/Gather: Cutting, Karger, and Pedersen.

Sample of Visualization
- For visualizing a document collection and its themes.
- Wise et al., "Visualizing the non-visual", PNNL.
- ThemeScapes, Cartia: mountain height = cluster size.

For improving search recall (Sec. 16.1)
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori.
  - When a query matches a doc D, also return other docs in the cluster containing D.
- Hope if we do this: the query "car" will also return docs containing "automobile", because clustering grouped together docs containing "car" with those containing "automobile".

For better navigation of search results (Sec. 16.1)
- For grouping search results thematically.

Issues for clustering (Sec. 16.2)
- Representation for clustering
  - Document representation: vector space? Normalization?
  - Centroids aren't length-normalized.
  - Need a notion of similarity/distance.
- How many clusters?
  - Fixed a priori?
  - Completely data-driven?
  - Avoid "trivial" clusters - too large or too small.
  - If a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

Notion of similarity/distance
- Ideal: semantic similarity. Practical: term-statistical similarity.
- We will use cosine similarity.
- Docs as vectors.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
- We will mostly speak of Euclidean distance, but real implementations use cosine similarity.
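As a concrete illustration of the two notions above, the sketch below computes both cosine similarity and Euclidean distance between two term-weight vectors with NumPy; the vectors and their weights are made-up illustrative values, not data from the slides.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between the two document vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance between the two document vectors.
    return float(np.linalg.norm(a - b))

# Toy tf-idf-style vectors over a 4-term vocabulary (illustrative values only).
d1 = np.array([0.5, 0.8, 0.0, 0.3])
d2 = np.array([0.4, 0.9, 0.1, 0.0])
print(cosine_similarity(d1, d2), euclidean_distance(d1, d2))
```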

Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning and refine it iteratively.
  - K-means clustering.
  - (Model-based clustering.)
- Hierarchical algorithms
  - Bottom-up, agglomerative.
  - (Top-down, divisive.)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster.
  - More common and easier to do.
- Soft clustering: a document can belong to more than one cluster.
  - Makes more sense for applications like creating browsable hierarchies.
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes.
  - You can only do that with a soft clustering approach.

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters.
- Given: a set of documents and the number K.
- Find: a partition into K clusters that optimizes the chosen partitioning criterion.
  - The globally optimal partition is intractable to find for many objective functions (it would require exhaustively enumerating all partitions).
  - Effective heuristic methods: the K-means and K-medoids algorithms.

K-Means (Sec. 16.4)
- Assumes documents are real-valued vectors.
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:
  μ(c) = (1/|c|) Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities.)

K-Means Algorithm (Sec. 16.4)
- Select K random docs {s1, s2, ..., sK} as seeds.
- Until the clustering converges (or another stopping criterion is met):
  - For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.
  - (Next, update the seeds to the centroid of each cluster.)
  - For each cluster cj: sj = μ(cj).
A small implementation sketch follows below.
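A minimal sketch of the algorithm above in NumPy, assuming the documents are already dense real-valued vectors; the random seeding and Euclidean distance follow the slide, not any particular library's K-means implementation.

```python
import numpy as np

def kmeans(docs: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """docs: (n, m) array of document vectors. Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=k, replace=False)].copy()  # K random docs as seeds
    assign = None
    for _ in range(iters):
        # Assign each doc to the cluster whose centroid is closest (Euclidean distance).
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # doc partition unchanged -> converged
        assign = new_assign
        # Recompute each centroid as the mean of its members (keep the old one if a cluster is empty).
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = docs[assign == j].mean(axis=0)
    return assign, centroids
```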

K-Means Example (K=2) (Sec. 16.4)
- [Figure: starting from two seed points marked x, the algorithm alternates "Reassign clusters" and "Compute centroids" several times until it has converged.]

Termination conditions (Sec. 16.4)
- Several possibilities, e.g.:
  - A fixed number of iterations.
  - Doc partition unchanged.
  - Centroid positions don't change.

Convergence (Sec. 16.4)
- Why should the K-means algorithm ever reach a fixed point, i.e., a state in which clusters don't change?
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  - EM is known to converge.
  - The number of iterations could be large, but in practice it usually isn't.

Convergence of K-Means (Sec. 16.4)
- Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
  - Gk = Σi (di − ck)²  (sum over all di in cluster k; note the lower-case k)
  - G = Σk Gk
- Reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
- Recomputation monotonically decreases each Gk, since (with mk the number of members in cluster k) Σi (di − a)² reaches its minimum for:
  - Σi −2(di − a) = 0
  - Σi di = Σi a = mk · a
  - a = (1/mk) Σi di = ck
- K-means typically converges quickly.

Time Complexity (Sec. 16.4)
- Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
- Reassigning clusters: O(KN) distance computations, i.e., O(KNM).
- Computing centroids: each doc gets added once to some centroid: O(NM).
- Assume these two steps are each done once per iteration, for I iterations: O(IKNM).

Seed Choice (Sec. 16.4)
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
  - Select good seeds using a heuristic (e.g., a doc least similar to any existing mean); a sketch of this heuristic follows below.
  - Try out multiple starting points.
  - Initialize with the results of another method.
- Example showing sensitivity to seeds: in the example figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
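A minimal sketch of the furthest-first seeding heuristic mentioned above: each new seed is the document least similar to (here, furthest in Euclidean distance from) all seeds chosen so far. The function name and the choice to start from one random document are assumptions for illustration.

```python
import numpy as np

def furthest_first_seeds(docs: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Return the indices of k seed documents chosen by the furthest-first heuristic."""
    rng = np.random.default_rng(seed)
    seeds = [int(rng.integers(len(docs)))]          # start from one random document
    for _ in range(k - 1):
        # Distance of every doc to its nearest already-chosen seed.
        d = np.min(np.linalg.norm(docs[:, None, :] - docs[seeds][None, :, :], axis=2), axis=1)
        seeds.append(int(d.argmax()))               # pick the doc least similar to any existing seed
    return np.array(seeds)
```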

K-means issues, variations, etc. (Sec. 16.4)
- Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve the speed of convergence of K-means.
- Assumes clusters are spherical in vector space.
  - Sensitive to coordinate changes, weighting, etc.
- Disjoint and exhaustive.
  - Doesn't have a notion of "outliers" by default, but outlier filtering can be added.

How Many Clusters?
- Number of clusters K is given.
  - Partition n docs into a predetermined number of clusters.
- Finding the "right" number of clusters is part of the problem.
  - Given docs, partition them into an "appropriate" number of subsets.
  - E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.

K not specified in advance
- Say, the results of a query.
- Solve an optimization problem: penalize having lots of clusters.
  - Application dependent, e.g., a compressed summary of a search-results list.
- Tradeoff between having more clusters (better focus within each cluster) and having too many clusters.

K not specified in advance (continued)
- Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid.
- Define the Total Benefit to be the sum of the individual doc Benefits.

Penalize lots of clusters
- For each cluster, we have a Cost C.
- Thus for a clustering with K clusters, the Total Cost is KC.
- Define the Value of a clustering to be Total Benefit − Total Cost.
- Find the clustering of highest Value, over all choices of K.
  - Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.
A small sketch of this Value criterion follows below.
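A minimal sketch of the Benefit/Cost criterion just described: for each candidate K it runs a clustering (here the `kmeans` sketch from earlier is assumed to be available), sums the cosine similarity of every doc to its own centroid, subtracts K times a per-cluster cost, and keeps the K of highest Value. The cost constant and candidate list are illustrative assumptions.

```python
import numpy as np

def clustering_value(docs: np.ndarray, assign: np.ndarray, centroids: np.ndarray,
                     cost_per_cluster: float) -> float:
    # Total Benefit: sum over docs of cosine similarity to their own centroid.
    c = centroids[assign]
    benefit = np.sum(np.sum(docs * c, axis=1) /
                     (np.linalg.norm(docs, axis=1) * np.linalg.norm(c, axis=1) + 1e-12))
    return float(benefit - cost_per_cluster * len(centroids))   # Value = Total Benefit - K*C

def best_k(docs: np.ndarray, k_candidates, cost_per_cluster: float = 0.5) -> int:
    scored = []
    for k in k_candidates:
        assign, centroids = kmeans(docs, k)          # kmeans sketch defined earlier
        scored.append((clustering_value(docs, assign, centroids, cost_per_cluster), k))
    return max(scored)[1]                            # the K with the highest Value
```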

Clustering (cont'd)
- Two important paradigms: bottom-up agglomerative clustering and top-down partitioning.
- Visualisation techniques: embedding of the corpus in a low-dimensional space.
- Characterising the entities:
  - Internally: vector space model, probabilistic models.
  - Externally: a measure of similarity/dissimilarity between pairs.

Clustering: Parameters
- Similarity measure ρ(d1, d2), e.g., cosine similarity.
- Distance measure δ(d1, d2), e.g., Euclidean distance.
- Number 'k' of clusters.

Clustering: Formal specification
- Partitioning approaches: bottom-up clustering, top-down clustering.
- Geometric embedding approaches: self-organization map (SOM), multidimensional scaling, latent semantic indexing.
- Generative models and probabilistic approaches: a single topic per document, or documents corresponding to mixtures of multiple topics.

Partitioning Approaches
- Two ways to get partitions: bottom-up clustering and top-down clustering.

- Partition the document collection into k clusters {D1, D2, ..., Dk}.
- Choices of criterion:
  - Minimize intra-cluster distance: Σi Σ_{d1,d2 ∈ Di} δ(d1, d2)
  - Maximize intra-cluster semblance: Σi Σ_{d1,d2 ∈ Di} ρ(d1, d2)
- If cluster representations Di are available:
  - Minimize Σi Σ_{d ∈ Di} δ(d, Di)
  - Maximize Σi Σ_{d ∈ Di} ρ(d, Di)
- Soft clustering:
  - d is assigned to Di with 'confidence' z_{d,i}.
  - Find z_{d,i} so as to minimize Σ z_{d,i} δ(d, Di), or maximize Σ z_{d,i} ρ(d, Di).

Bottom-up clustering (HAC = hierarchical agglomerative clustering)
- For each group Γ, keep track of the best Δ (the most similar group to merge with).
- Use the above information to plot the hierarchical merging process (DENDROGRAM).
- To get the desired number of clusters, cut across any level of the dendrogram.

Dendrogram
- A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

Similarity measure
- Typically s(Γ, Δ) decreases with an increasing number of merges.
- Self-similarity: the average pairwise similarity between documents in Γ,
  s(Γ) = (1 / C(|Γ|, 2)) Σ_{{d1,d2} ⊆ Γ, d1 ≠ d2} s(d1, d2)  (average over unordered pairs)
  where s(d1, d2) is an inter-document similarity measure (say, the cosine of TFIDF vectors).
- Other criteria: maximum/minimum pairwise similarity between documents in the clusters.

Computation (group-average similarity via group profiles)
- Un-normalized group profile: p̂(Γ) = Σ_{d ∈ Γ} d̂  (the sum of the unit document vectors in Γ).
- Can show:
  s(Γ) = ( ⟨p̂(Γ), p̂(Γ)⟩ − |Γ| ) / ( |Γ| (|Γ| − 1) )
  s(Γ ∪ Δ) = ( ⟨p̂(Γ ∪ Δ), p̂(Γ ∪ Δ)⟩ − (|Γ| + |Δ|) ) / ( (|Γ| + |Δ|)(|Γ| + |Δ| − 1) )
  ⟨p̂(Γ ∪ Δ), p̂(Γ ∪ Δ)⟩ = ⟨p̂(Γ), p̂(Γ)⟩ + ⟨p̂(Δ), p̂(Δ)⟩ + 2 ⟨p̂(Γ), p̂(Δ)⟩
- So the self-similarity of a merged group can be maintained incrementally from the profiles of the groups being merged.
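A minimal sketch of the identities above, assuming unit-normalized document vectors: the group-average self-similarity of a merged group is computed from the profile vector rather than by summing over all pairs. The toy vectors are illustrative assumptions.

```python
import numpy as np

def self_similarity(vectors: np.ndarray) -> float:
    """Average pairwise cosine similarity of unit vectors, computed via the group profile."""
    p = vectors.sum(axis=0)                      # un-normalized group profile p(Γ)
    n = len(vectors)
    return float((np.dot(p, p) - n) / (n * (n - 1)))

# Toy unit-normalized document vectors (illustrative only).
gamma = np.array([[1.0, 0.0], [0.8, 0.6]])
delta = np.array([[0.6, 0.8], [0.0, 1.0]])
merged = np.vstack([gamma, delta])
print(self_similarity(merged))                   # s(Γ ∪ Δ) via the profile identity
```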

Top-down (move-to-nearest) clustering
- Bottom-up clustering requires quadratic time and space, so switch to top-down or move-to-nearest methods.
- Needs an internal representation for documents as well as for clusters.
- Partition documents into 'k' clusters; 2 variants:
  - 'Hard' (0/1) assignment of documents to clusters.
  - 'Soft': documents belong to clusters with fractional scores.
- Termination:
  - when the assignment of documents to clusters ceases to change (much), OR
  - when cluster centroids move negligibly over successive iterations.

Hard k-Means
- Choose k arbitrary 'centroids'; then repeat:
  - Assign each document to its nearest centroid.
  - Recompute centroids.

Soft k-Means
- Don't break close ties between document assignments to clusters.
- Don't make a document contribute only to the single cluster that wins narrowly: the contribution from document d for updating cluster centroid m_c is related to the current similarity between m_c and d.
- Update (with learning rate η):
  Δm_c = η · ( exp(−|d − m_c|²) / Σ_γ exp(−|d − m_γ|²) ) · (d − m_c)
  m_c ← m_c + Δm_c
A sketch of this soft update follows below.
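A minimal sketch of the soft update rule above in NumPy; the learning-rate value and the per-document loop are illustrative assumptions, not part of the slide.

```python
import numpy as np

def soft_kmeans_step(docs: np.ndarray, centroids: np.ndarray, eta: float = 0.1) -> np.ndarray:
    """One pass of the soft k-means update: every document nudges every centroid."""
    m = centroids.copy()
    for d in docs:
        sq = np.sum((d - m) ** 2, axis=1)            # |d - m_c|^2 for every cluster c
        w = np.exp(-sq) / np.sum(np.exp(-sq))        # soft responsibility of each cluster for d
        m += eta * w[:, None] * (d - m)              # Δm_c = η · w_c · (d - m_c)
    return m
```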

Seeding 'k' clusters (two-phase approach)
- Phase I:
  - Randomly sample O(√(kn)) documents.
  - Run the bottom-up group-average clustering algorithm to reduce them to k groups or clusters: O(kn log n) time.
- Phase II: iterate assign-to-nearest:
  - Move each document to its nearest cluster.
  - Recompute cluster centroids.
  - Total time taken: O(kn).
- The two phases together: O(kn log n).

Visualisation techniques
- Goal: embedding of the corpus in a low-dimensional space.
- Hierarchical Agglomerative Clustering (HAC) lends itself easily to visualisation.
- Self-Organization Map (SOM): a close cousin of k-means.
- Multidimensional Scaling (MDS): minimize the distortion of inter-point distances in the low-dimensional embedding, as compared to the dissimilarity given in the input data.
- Latent Semantic Indexing (LSI): linear transformations to reduce the number of dimensions.

Self-Organization Map (SOM)
- Like soft k-means:
  - Determine an association between clusters and documents.
  - Associate a representative vector m_c with each cluster and iteratively refine it.
- Unlike k-means:
  - Embed the clusters in a low-dimensional space right from the beginning.
  - A large number of clusters can be initialized, even if many are eventually to remain devoid of documents.
- Background (from Wikipedia): "A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space."
- Each cluster can be a slot in a square/hexagonal grid.
- The grid structure defines the neighborhood N(c) for each cluster c.
- It also involves a proximity function h(c, γ) between clusters γ and c.

SOM: Update Rule
- Like a neural network: a data item d activates a neuron (the closest cluster) c_d, as well as the neighborhood neurons N(c_d).
- E.g., a Gaussian neighborhood function:
  h(c, γ) = exp( −||m_c − m_γ||² / (2σ²(t)) )
- The update rule for node γ under the influence of d is:
  m_γ(t+1) = m_γ(t) + η(t) · h(γ, c_d) · (d − m_γ(t))
  where η(t) is the learning-rate parameter.
A sketch of this update follows below.
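A minimal sketch of the SOM update above. It follows the slide's formula in computing the neighborhood from the distance between representative vectors; many SOM implementations use the clusters' grid coordinates instead, and the learning-rate and σ values are illustrative assumptions.

```python
import numpy as np

def som_update(weights: np.ndarray, d: np.ndarray, eta: float, sigma: float) -> np.ndarray:
    """One SOM step. weights: (k, m) representative vectors m_γ; d: one data item."""
    winner = np.argmin(np.linalg.norm(weights - d, axis=1))      # c_d: closest cluster to d
    m2 = np.sum((weights - weights[winner]) ** 2, axis=1)        # ||m_γ − m_{c_d}||²
    h = np.exp(-m2 / (2.0 * sigma ** 2))                         # Gaussian neighborhood h(γ, c_d)
    return weights + eta * h[:, None] * (d - weights)            # m_γ ← m_γ + η·h·(d − m_γ)
```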

SOM: Example I
- [Figure caption] A SOM computed from over a million documents taken from 80 Usenet news groups. Light areas have a high density of documents. The region shown is near the groups pc.chips and pc.video, and closer inspection shows a number of URLs in this region that are about PC video cards.

SOM: Example II
- [Figure caption] Another example of SOM at work: the sites listed in the Open Directory Project have been organized within a map of Antarctica, at antarcti.ca/ (a). Clicking on a region maintains context (inset) and zooms in on more specific topics (b). Documents are located at the cluster to which they are most similar.

Multidimensional Scaling (MDS)
- Goal: a "distance preserving" low-dimensional embedding of documents.
- Symmetric inter-document distances d_ij:
  - given a priori, or computed from an internal representation.
- Coarse-grained user feedback:
  - "documents i and j are quite dissimilar", or "document i is more similar to j than to k".
  - The user provides similarity between documents i and j; with increasing feedback, prior distances are overridden.
- Goal of MDS: to represent documents as points in a low-dimensional space (often 2D to 3D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
- Objective: minimize the stress of the embedding,
  stress = ( Σ_{i,j} (d̂_ij − d_ij)² ) / ( Σ_{i,j} d_ij² )
  where d̂_ij is the distance between points i and j in the embedding and d_ij is the input (target) distance.
A sketch of a stress-reducing relaxation step follows below.
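A minimal sketch of the stress objective above plus one relaxation sweep that moves each coordinate a small step in the direction that locally decreases stress; the finite-difference gradient and the step sizes are illustrative assumptions chosen for clarity rather than efficiency.

```python
import numpy as np

def stress(points: np.ndarray, target: np.ndarray) -> float:
    """points: (n, k) embedding; target: (n, n) matrix of input distances d_ij."""
    emb = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)   # embedded distances
    return float(np.sum((emb - target) ** 2) / np.sum(target ** 2))

def relax_once(points: np.ndarray, target: np.ndarray, step: float = 0.01) -> np.ndarray:
    """Move each coordinate slightly downhill in stress (finite-difference gradient)."""
    out = points.copy()
    eps = 1e-4
    for i in range(points.shape[0]):
        for c in range(points.shape[1]):
            trial = out.copy()
            trial[i, c] += eps
            grad = (stress(trial, target) - stress(out, target)) / eps
            out[i, c] -= step * grad
    return out
```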

MDS: issues
- Stress is not easy to optimize.
- Iterative hill climbing:
  - Points (documents) are assigned random coordinates by an external heuristic.
  - Points are moved by a small distance in the direction of locally decreasing stress.
- For n documents: each point takes O(n) time to be moved, so O(n²) time per relaxation in total.

FastMap [Faloutsos '95]
- No internal representation of the documents is available.
- Goal: find a projection from an 'n'-dimensional space to a space with a smaller number 'k' of dimensions.
- Iterative projection of documents along lines of maximum spread.
- Each 1D projection preserves distance information.

FastMap - Best line
- Pivots for a line: two points (a and b) that determine it.
- Avoid exhaustive checking by picking pivots that are far apart.
- The first coordinate x1 of a point x on the "best line" (a, b) is:
  x1 = ( d²_{a,x} + d²_{a,b} − d²_{b,x} ) / ( 2 d_{a,b} )

FastMap - Iterative projection
- For i = 1 to k:
  - Find the next (i-th) "best" line; a "best" line is one which gives the maximum variance of the point set in the direction of the line.
  - Project the points on the line.
  - Project the points on the "hyperplane" orthogonal to the above line.

- Project recursively, down to a 1-D space.
- Time: O(nk).
- Product: (x1, ..., xk) for each point x in the original data set.

FastMap - Hyperplane Projection
- d²_{x',y'} = d²_{x,y} − (x1 − y1)²
- Purpose: to correct the inter-point distances d_{x',y'} between the projected points (x', y') by taking into account the components (x1, y1) already accounted for by the first pivot line.
A sketch of one projection step follows below.
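A minimal sketch of one FastMap projection step using the two formulas above; it works directly on a full distance matrix, and picking the pair that realizes the largest distance as pivots is a simplification of the far-apart heuristic, assumed here for brevity.

```python
import numpy as np

def fastmap_step(dist: np.ndarray):
    """One FastMap projection. dist: (n, n) symmetric distance matrix with n >= 2 distinct points.
    Returns (coords, residual_dist): the 1-D coordinates and the corrected distances."""
    a, b = np.unravel_index(np.argmax(dist), dist.shape)      # pivots: a far-apart pair
    dab = dist[a, b]
    # x1 = (d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2) / (2 d_{a,b})
    x = (dist[a] ** 2 + dab ** 2 - dist[b] ** 2) / (2.0 * dab)
    # d_{x',y'}^2 = d_{x,y}^2 - (x1 - y1)^2 on the hyperplane orthogonal to the pivot line
    resid_sq = dist ** 2 - (x[:, None] - x[None, :]) ** 2
    return x, np.sqrt(np.clip(resid_sq, 0.0, None))
```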

FastMap Example
- [Figure: the response to the query "Tony Bennett".]

Vector Space Model: Pros
- Automatic selection of index terms.
- Partial matching of queries and documents (dealing with the case where no document contains all search terms).
- Ranking according to similarity score (dealing with large result sets).
- Term-weighting schemes (improve retrieval performance).
- Various extensions:
  - Document clustering.
  - Relevance feedback (modifying the query vector).
- Geometric foundation.

Problems with Lexical Semantics
- Polysemy: words often have a multitude of meanings and different usages. The Vector Space Model cannot distinguish between different senses of the same word, i.e., ambiguity.
- Synonymy: different terms may have an identical or similar meaning. The Vector Space Model cannot express associations between words.

Issues in the VSM
- Assumes independence between terms, but:
  - some terms are more likely to occur together (synonyms, related words, spelling errors, etc.);
  - depending on context, terms may have different meanings.
- The term-document matrix is very high-dimensional: does every document / every term really have that many important features?

Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the term-document matrix (typical rank 100-300).
- General idea:
  - Map documents (and terms) to a low-dimensional representation.
  - Design the mapping such that the low-dimensional space reflects semantic associations (a latent semantic space).
  - Compute document similarity based on the inner product in this latent semantic space.

Latent Semantic Indexing (LSI): the SVD
- Perform a Singular Value Decomposition of the t × d term-document matrix W:
  W(t×d) = T(t×r) · S(r×r) · Dᵀ(r×d)
  - S is a diagonal matrix giving the relative importance of each dimension (the singular values).
  - Dᵀ is the representation of the documents in the r dimensions.
  - T is the matrix used to map new documents into the r dimensions.

Low-rank Approximation
- Truncate the factorization to the top k dimensions:
  W'(t×d) ≈ T(t×k) · S(k×k) · Dᵀ(k×d)
- From the original term-document matrix A_r we compute its approximation A_k. In A_k each row still corresponds to a term and each column to a document; the difference is that the documents now live in a new space whose dimensionality k << r.

LSI Term matrix T
- Each row of T is a term's vector in the LSI space.
- In the original matrix, term vectors are d-dimensional; in T they are much smaller.
- The dimensions correspond to groups of terms that tend to "co-occur" with this term in the same documents: synonyms, contextually-related words, variant endings.

Singular Values
- S gives an ordering to the dimensions.
- The values drop off very quickly; the singular values in the tail represent "noise".
- Cutting off the low-value dimensions can reduce that noise and improve performance.

Document matrix D
- Dᵀ is the representation of the documents in the LSI space.
- It has the same dimensionality as the T vectors.
- It can be used to compute the similarity between a query and a document.

Improved Retrieval with LSI
- Retrieval process with LSI: the query is mapped/projected ("folded in") into the LSI document space by multiplying the document/query vector by S⁻¹Tᵀ (using TᵀT = DᵀD = I).
- The collection's document vectors are the columns of Dᵀ; query-document similarity is then computed with the dot product.
- The performance gains come from:
  - removal of noise;
  - no need to stem terms (variants will co-occur);
  - no need for a stop list.
- There is no improvement in speed or space, though.
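A minimal sketch of the truncated SVD and the query folding-in described above, using NumPy's SVD; the toy term-document matrix, the query, and the value of k are illustrative assumptions rather than the example from the slides.

```python
import numpy as np

# Toy term-document matrix W (terms x docs), illustrative counts only.
W = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

T, s, Dt = np.linalg.svd(W, full_matrices=False)    # W = T · diag(s) · Dt
k = 2
Tk, Sk, Dtk = T[:, :k], np.diag(s[:k]), Dt[:k, :]   # rank-k truncation

def fold_in(q: np.ndarray) -> np.ndarray:
    """Map a query (term-space vector) into the k-dimensional LSI document space."""
    return np.linalg.inv(Sk) @ Tk.T @ q              # q_k = S^{-1} T^T q

q = np.array([1.0, 0.0, 1.0, 0.0])                   # query containing terms 1 and 3
qk = fold_in(q)
docs_k = Dtk                                         # each column is a doc in LSI space
scores = (qk @ docs_k) / (np.linalg.norm(qk) * np.linalg.norm(docs_k, axis=0))
print(scores)                                        # cosine similarity of the query to each doc
```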

Sample of LSI
- [Worked-example slides: a small term-document matrix C, its SVD factors T_r, S_r, D_r, the rank-2 truncation S_2, D_2, and a map of the documents into a 2-dimensional latent semantic space. The matrices and the illustration are not reproduced here.]

Empirical evidence
- Experiments on TREC 1/2/3 (Dumais):
  - Precision at or above the median TREC precision.
  - Top scorer on almost 20% of TREC topics.
  - Slightly better on average than straight vector spaces.
- Effect of dimensionality:

  Dimensions | Precision
  ---------- | ---------
  250        | 0.367
  300        | 0.371
  346        | 0.374

LSI has many other applications
- In many settings we have a feature-object matrix.
- The matrix is high-dimensional and highly redundant, so a low-rank approximation can be used.
- E.g., in text retrieval the terms are the features and the docs are the objects (Latent Semantic Indexing).
- E.g., opinions and users: incomplete data (such as users' opinions) can be recovered in the low-dimensional space.
- A powerful, general analytical technique.

IR based on Language Models (LM)
- [Figure: an information need leads to a query; each document d1 ... dn in the collection has its own model Md1 ... Mdn, and the query is viewed as generated from a document model: P(Q | Md).]
- The usual approach to search: the user guesses the words an author would have used when writing a relevant document, and forms a query from them.
- The LM approach directly exploits that idea!

Formal Language (Model)
- A traditional generative model: it generates strings.
  - Finite state machines or regular grammars, etc.
- Example: I wish / I wish I wish / I wish I wish I wish / I wish I wish I wish I wish / ... - i.e., (I wish)*.

Stochastic Language Models
- A model's probability of generating strings in the language (commonly all strings over an alphabet ∑).
- Model M: the 0.2 | a 0.1 | man 0.01 | woman 0.01 | said 0.03 | likes 0.02 | ...
- For the string "the man likes the woman", multiply the word probabilities:
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

Stochastic Language Models
- A model gives the probability of generating any string. Compare two models on the string "the class pleaseth yon maiden":

  word     | P(word | M1) | P(word | M2)
  -------- | ------------ | ------------
  the      | 0.2          | 0.2
  class    | 0.01         | 0.0001
  sayst    | 0.0001       | 0.03
  pleaseth | 0.0001       | 0.02
  yon      | 0.0001       | 0.1
  maiden   | 0.0005       | 0.01
  woman    | 0.01         | 0.0001

- P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005 and P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01, so P(s | M2) > P(s | M1).

Language

Models用来生成文本的统计模型–

Probability

distribution

over

strings

in

a

given

languageMP

( |

M

) =

P

(|

M

)P

(|

M,)P

(|

M,)P

(|

M,)82Stochastic

Language

Models|

) P

(

|)=

P

(

) P

(

|

) P

(Unigram

Language

ModelsP

( )

P

(

) P

(

) P

(

)P

()Bigram

(generally,

n-gram)

Language

ModelsP

( )

P

(

| )

P

(

|

) P

(

|

)Easy.Effective!83Using

Language

Models

in

IR84每篇文档对应一个model按P(d

|

q)对文档排序P(d

|

q)

=

P(q

|

d)

x

P(d)

/

P(q)P(q)

is

the

same

for

all

documents,

so

ignoreP(d)

[the

prior]

is

often

treated

as

the

same

for

all

dBut

wecould

use

criteria

like

authority,

length,

genreP(q

|

d)

is

the

probability

of

q

given

d’s

modelVery

general

formalappro

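To make the query-likelihood ranking described in the last section concrete, here is a minimal sketch that builds a maximum-likelihood unigram model for each document and ranks by P(q | d) with a uniform prior P(d). The tiny corpus and whitespace tokenization are illustrative assumptions, and no smoothing is applied, so any unseen query term zeroes out a document's score.

```python
from collections import Counter

docs = {
    "d1": "xerox will ship new printers this week",
    "d2": "the new printer ships with toner",
    "d3": "stock of xerox rises on printer news",
}

def unigram_model(text: str) -> dict:
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}   # maximum-likelihood P(w | Md)

def query_likelihood(query: str, model: dict) -> float:
    score = 1.0
    for w in query.split():
        score *= model.get(w, 0.0)                     # unseen term -> probability 0 (no smoothing)
    return score

models = {name: unigram_model(text) for name, text in docs.items()}
query = "xerox printer"
ranking = sorted(docs, key=lambda d: query_likelihood(query, models[d]), reverse=True)
print(ranking)   # documents ordered by P(q | d)
```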