
网络搜索引擎原理 (Principles of Web Search Engines)
陈光, 信息与通信工程学院 (School of Information and Communication Engineering)
Similarity & Clustering

What is clustering? (Ch. 16)
- Clustering: the process of grouping a set of objects into classes of similar objects.
  - Documents within a cluster should be similar.
  - Documents from different clusters should be dissimilar.
- The commonest form of unsupervised learning.
  - Unsupervised learning = learning from raw data, as opposed to supervised data, where a classification of examples is given.
- A common and important task that finds many applications in IR and other places.

A data set with clear cluster structure (Ch. 16)
- How would you design an algorithm for finding the three clusters in this case?

Applications of clustering in Web Search (Sec. 16.1)
- Whole corpus analysis/navigation
  - Better user interface: search without typing.
- For improving recall in search applications
  - Better search results (like pseudo relevance feedback).
- For better navigation of search results
  - Effective "user recall" will be higher.
- For speeding up vector space retrieval
  - Cluster-based retrieval gives faster search.

Sample of Yahoo!
- The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering.
- [Figure: a fragment of the hierarchy with top-level categories such as agriculture, biology, physics, CS, and space, and subcategories such as dairy, crops, agronomy, forestry, botany, cell, evolution, magnetism, relativity, AI, HCI, courses, craft, missions, ...]

Sample of Google News
- Google News: automatic clustering gives an effective news-presentation metaphor.

Sample of Scatter/Gather (Sec. 16.1)
- Scatter/Gather: Cutting, Karger, and Pedersen.

Sample of Visualization
- For visualizing a document collection and its themes.
- Wise et al., "Visualizing the non-visual", PNNL.
- ThemeScapes, Cartia: mountain height = cluster size.

For improving search recall (Sec. 16.1)
- Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs.
- Therefore, to improve search recall:
  - Cluster docs in the corpus a priori.
  - When a query matches a doc D, also return other docs in the cluster containing D.
- Hope if we do this: the query "car" will also return docs containing "automobile", because clustering grouped together docs containing "car" with those containing "automobile".

For better navigation of search results (Sec. 16.1)
- For grouping search results thematically.

Issues for clustering (Sec. 16.2)
- Representation for clustering
  - Document representation: vector space? Normalization?
  - Centroids aren't length-normalized.
  - Need a notion of similarity/distance.
- How many clusters?
  - Fixed a priori?
  - Completely data-driven?
  - Avoid "trivial" clusters - too large or too small.
  - If a cluster's too large, then for navigation purposes you've wasted an extra user click without whittling down the set of documents much.

Notion of similarity/distance
- Ideal: semantic similarity. Practical: term-statistical similarity.
- We will use cosine similarity.
- Docs as vectors.
- For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
- We will mostly speak of Euclidean distance, but real implementations use cosine similarity.
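As a concrete illustration of the two notions above, the sketch below computes both cosine similarity and Euclidean distance between two term-weight vectors with NumPy; the vectors and their weights are made-up illustrative values, not data from the slides.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between the two document vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    # Straight-line distance between the two document vectors.
    return float(np.linalg.norm(a - b))

# Toy tf-idf-style vectors over a 4-term vocabulary (illustrative values only).
d1 = np.array([0.5, 0.8, 0.0, 0.3])
d2 = np.array([0.4, 0.9, 0.1, 0.0])
print(cosine_similarity(d1, d2), euclidean_distance(d1, d2))
```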

Clustering Algorithms
- Flat algorithms
  - Usually start with a random (partial) partitioning and refine it iteratively.
  - K-means clustering.
  - (Model-based clustering.)
- Hierarchical algorithms
  - Bottom-up, agglomerative.
  - (Top-down, divisive.)

Hard vs. soft clustering
- Hard clustering: each document belongs to exactly one cluster.
  - More common and easier to do.
- Soft clustering: a document can belong to more than one cluster.
  - Makes more sense for applications like creating browsable hierarchies.
  - You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes.
  - You can only do that with a soft clustering approach.

Partitioning Algorithms
- Partitioning method: construct a partition of n documents into a set of K clusters.
- Given: a set of documents and the number K.
- Find: a partition into K clusters that optimizes the chosen partitioning criterion.
  - The globally optimal partition is intractable to find for many objective functions (it would require exhaustively enumerating all partitions).
  - Effective heuristic methods: the K-means and K-medoids algorithms.

K-Means (Sec. 16.4)
- Assumes documents are real-valued vectors.
- Clusters are based on centroids (aka the center of gravity, or mean) of the points in a cluster c:
  μ(c) = (1/|c|) Σ_{x ∈ c} x
- Reassignment of instances to clusters is based on distance to the current cluster centroids.
  - (Or one can equivalently phrase it in terms of similarities.)

K-Means Algorithm (Sec. 16.4)
- Select K random docs {s1, s2, ..., sK} as seeds.
- Until the clustering converges (or another stopping criterion is met):
  - For each doc di: assign di to the cluster cj such that dist(di, sj) is minimal.
  - (Next, update the seeds to the centroid of each cluster.)
  - For each cluster cj: sj = μ(cj).
A small implementation sketch follows below.
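A minimal sketch of the algorithm above in NumPy, assuming the documents are already dense real-valued vectors; the random seeding and Euclidean distance follow the slide, not any particular library's K-means implementation.

```python
import numpy as np

def kmeans(docs: np.ndarray, k: int, iters: int = 100, seed: int = 0):
    """docs: (n, m) array of document vectors. Returns (assignments, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = docs[rng.choice(len(docs), size=k, replace=False)].copy()  # K random docs as seeds
    assign = None
    for _ in range(iters):
        # Assign each doc to the cluster whose centroid is closest (Euclidean distance).
        dists = np.linalg.norm(docs[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break  # doc partition unchanged -> converged
        assign = new_assign
        # Recompute each centroid as the mean of its members (keep the old one if a cluster is empty).
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = docs[assign == j].mean(axis=0)
    return assign, centroids
```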

K-Means Example (K=2) (Sec. 16.4)
- [Figure: starting from two seed points marked x, the algorithm alternates "Reassign clusters" and "Compute centroids" several times until it has converged.]

Termination conditions (Sec. 16.4)
- Several possibilities, e.g.:
  - A fixed number of iterations.
  - Doc partition unchanged.
  - Centroid positions don't change.

Convergence (Sec. 16.4)
- Why should the K-means algorithm ever reach a fixed point, i.e., a state in which clusters don't change?
- K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm.
  - EM is known to converge.
  - The number of iterations could be large, but in practice it usually isn't.

Convergence of K-Means (Sec. 16.4)
- Define the goodness measure of cluster k as the sum of squared distances from the cluster centroid:
  - Gk = Σi (di − ck)²  (sum over all di in cluster k; note the lower-case k)
  - G = Σk Gk
- Reassignment monotonically decreases G, since each vector is assigned to the closest centroid.
- Recomputation monotonically decreases each Gk, since (with mk the number of members in cluster k) Σi (di − a)² reaches its minimum for:
  - Σi −2(di − a) = 0
  - Σi di = Σi a = mk · a
  - a = (1/mk) Σi di = ck
- K-means typically converges quickly.

Time Complexity (Sec. 16.4)
- Computing the distance between two docs is O(M), where M is the dimensionality of the vectors.
- Reassigning clusters: O(KN) distance computations, i.e., O(KNM).
- Computing centroids: each doc gets added once to some centroid: O(NM).
- Assume these two steps are each done once per iteration, for I iterations: O(IKNM).

Seed Choice (Sec. 16.4)
- Results can vary based on random seed selection.
- Some seeds can result in a poor convergence rate, or in convergence to sub-optimal clusterings.
  - Select good seeds using a heuristic (e.g., a doc least similar to any existing mean); a sketch of this heuristic follows below.
  - Try out multiple starting points.
  - Initialize with the results of another method.
- Example showing sensitivity to seeds: in the example figure, if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
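A minimal sketch of the furthest-first seeding heuristic mentioned above: each new seed is the document least similar to (here, furthest in Euclidean distance from) all seeds chosen so far. The function name and the choice to start from one random document are assumptions for illustration.

```python
import numpy as np

def furthest_first_seeds(docs: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Return the indices of k seed documents chosen by the furthest-first heuristic."""
    rng = np.random.default_rng(seed)
    seeds = [int(rng.integers(len(docs)))]          # start from one random document
    for _ in range(k - 1):
        # Distance of every doc to its nearest already-chosen seed.
        d = np.min(np.linalg.norm(docs[:, None, :] - docs[seeds][None, :, :], axis=2), axis=1)
        seeds.append(int(d.argmax()))               # pick the doc least similar to any existing seed
    return np.array(seeds)
```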

K-means issues, variations, etc. (Sec. 16.4)
- Recomputing the centroid after every assignment (rather than after all points are re-assigned) can improve the speed of convergence of K-means.
- Assumes clusters are spherical in vector space.
  - Sensitive to coordinate changes, weighting, etc.
- Disjoint and exhaustive.
  - Doesn't have a notion of "outliers" by default, but outlier filtering can be added.

How Many Clusters?
- Number of clusters K is given.
  - Partition n docs into a predetermined number of clusters.
- Finding the "right" number of clusters is part of the problem.
  - Given docs, partition them into an "appropriate" number of subsets.
  - E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.

K not specified in advance
- Say, the results of a query.
- Solve an optimization problem: penalize having lots of clusters.
  - Application dependent, e.g., a compressed summary of a search-results list.
- Tradeoff between having more clusters (better focus within each cluster) and having too many clusters.

K not specified in advance (continued)
- Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid.
- Define the Total Benefit to be the sum of the individual doc Benefits.

Penalize lots of clusters
- For each cluster, we have a Cost C.
- Thus for a clustering with K clusters, the Total Cost is KC.
- Define the Value of a clustering to be Total Benefit − Total Cost.
- Find the clustering of highest Value, over all choices of K.
  - Total Benefit increases with increasing K, but we can stop when it doesn't increase by "much"; the Cost term enforces this.
A small sketch of this Value criterion follows below.
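A minimal sketch of the Benefit/Cost criterion just described: for each candidate K it runs a clustering (here the `kmeans` sketch from earlier is assumed to be available), sums the cosine similarity of every doc to its own centroid, subtracts K times a per-cluster cost, and keeps the K of highest Value. The cost constant and candidate list are illustrative assumptions.

```python
import numpy as np

def clustering_value(docs: np.ndarray, assign: np.ndarray, centroids: np.ndarray,
                     cost_per_cluster: float) -> float:
    # Total Benefit: sum over docs of cosine similarity to their own centroid.
    c = centroids[assign]
    benefit = np.sum(np.sum(docs * c, axis=1) /
                     (np.linalg.norm(docs, axis=1) * np.linalg.norm(c, axis=1) + 1e-12))
    return float(benefit - cost_per_cluster * len(centroids))   # Value = Total Benefit - K*C

def best_k(docs: np.ndarray, k_candidates, cost_per_cluster: float = 0.5) -> int:
    scored = []
    for k in k_candidates:
        assign, centroids = kmeans(docs, k)          # kmeans sketch defined earlier
        scored.append((clustering_value(docs, assign, centroids, cost_per_cluster), k))
    return max(scored)[1]                            # the K with the highest Value
```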

Clustering (cont'd)
- Two important paradigms: bottom-up agglomerative clustering and top-down partitioning.
- Visualisation techniques: embedding of the corpus in a low-dimensional space.
- Characterising the entities:
  - Internally: vector space model, probabilistic models.
  - Externally: a measure of similarity/dissimilarity between pairs.

Clustering: Parameters
- Similarity measure ρ(d1, d2), e.g., cosine similarity.
- Distance measure δ(d1, d2), e.g., Euclidean distance.
- Number 'k' of clusters.

Clustering: Formal specification
- Partitioning approaches: bottom-up clustering, top-down clustering.
- Geometric embedding approaches: self-organization map (SOM), multidimensional scaling, latent semantic indexing.
- Generative models and probabilistic approaches: a single topic per document, or documents corresponding to mixtures of multiple topics.

Partitioning Approaches
- Two ways to get partitions: bottom-up clustering and top-down clustering.

- Partition the document collection into k clusters {D1, D2, ..., Dk}.
- Choices of criterion:
  - Minimize intra-cluster distance: Σi Σ_{d1,d2 ∈ Di} δ(d1, d2)
  - Maximize intra-cluster semblance: Σi Σ_{d1,d2 ∈ Di} ρ(d1, d2)
- If cluster representations Di are available:
  - Minimize Σi Σ_{d ∈ Di} δ(d, Di)
  - Maximize Σi Σ_{d ∈ Di} ρ(d, Di)
- Soft clustering:
  - d is assigned to Di with 'confidence' z_{d,i}.
  - Find z_{d,i} so as to minimize Σ z_{d,i} δ(d, Di), or maximize Σ z_{d,i} ρ(d, Di).

Bottom-up clustering (HAC = hierarchical agglomerative clustering)
- For each group Γ, keep track of the best Δ (the most similar group to merge with).
- Use the above information to plot the hierarchical merging process (DENDROGRAM).
- To get the desired number of clusters, cut across any level of the dendrogram.

Dendrogram
- A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

Similarity measure
- Typically s(Γ, Δ) decreases with an increasing number of merges.
- Self-similarity: the average pairwise similarity between documents in Γ,
  s(Γ) = (1 / C(|Γ|, 2)) Σ_{{d1,d2} ⊆ Γ, d1 ≠ d2} s(d1, d2)  (average over unordered pairs)
  where s(d1, d2) is an inter-document similarity measure (say, the cosine of TFIDF vectors).
- Other criteria: maximum/minimum pairwise similarity between documents in the clusters.

Computation (group-average similarity via group profiles)
- Un-normalized group profile: p̂(Γ) = Σ_{d ∈ Γ} d̂  (the sum of the unit document vectors in Γ).
- Can show:
  s(Γ) = ( ⟨p̂(Γ), p̂(Γ)⟩ − |Γ| ) / ( |Γ| (|Γ| − 1) )
  s(Γ ∪ Δ) = ( ⟨p̂(Γ ∪ Δ), p̂(Γ ∪ Δ)⟩ − (|Γ| + |Δ|) ) / ( (|Γ| + |Δ|)(|Γ| + |Δ| − 1) )
  ⟨p̂(Γ ∪ Δ), p̂(Γ ∪ Δ)⟩ = ⟨p̂(Γ), p̂(Γ)⟩ + ⟨p̂(Δ), p̂(Δ)⟩ + 2 ⟨p̂(Γ), p̂(Δ)⟩
- So the self-similarity of a merged group can be maintained incrementally from the profiles of the groups being merged.
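A minimal sketch of the identities above, assuming unit-normalized document vectors: the group-average self-similarity of a merged group is computed from the profile vector rather than by summing over all pairs. The toy vectors are illustrative assumptions.

```python
import numpy as np

def self_similarity(vectors: np.ndarray) -> float:
    """Average pairwise cosine similarity of unit vectors, computed via the group profile."""
    p = vectors.sum(axis=0)                      # un-normalized group profile p(Γ)
    n = len(vectors)
    return float((np.dot(p, p) - n) / (n * (n - 1)))

# Toy unit-normalized document vectors (illustrative only).
gamma = np.array([[1.0, 0.0], [0.8, 0.6]])
delta = np.array([[0.6, 0.8], [0.0, 1.0]])
merged = np.vstack([gamma, delta])
print(self_similarity(merged))                   # s(Γ ∪ Δ) via the profile identity
```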

Top-down (move-to-nearest) clustering
- Bottom-up clustering requires quadratic time and space, so switch to top-down or move-to-nearest methods.
- Needs an internal representation for documents as well as for clusters.
- Partition documents into 'k' clusters; 2 variants:
  - 'Hard' (0/1) assignment of documents to clusters.
  - 'Soft': documents belong to clusters with fractional scores.
- Termination:
  - when the assignment of documents to clusters ceases to change (much), OR
  - when cluster centroids move negligibly over successive iterations.

Hard k-Means
- Choose k arbitrary 'centroids'; then repeat:
  - Assign each document to its nearest centroid.
  - Recompute centroids.

Soft k-Means
- Don't break close ties between document assignments to clusters.
- Don't make a document contribute only to the single cluster that wins narrowly: the contribution from document d for updating cluster centroid m_c is related to the current similarity between m_c and d.
- Update (with learning rate η):
  Δm_c = η · ( exp(−|d − m_c|²) / Σ_γ exp(−|d − m_γ|²) ) · (d − m_c)
  m_c ← m_c + Δm_c
A sketch of this soft update follows below.
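A minimal sketch of the soft update rule above in NumPy; the learning-rate value and the per-document loop are illustrative assumptions, not part of the slide.

```python
import numpy as np

def soft_kmeans_step(docs: np.ndarray, centroids: np.ndarray, eta: float = 0.1) -> np.ndarray:
    """One pass of the soft k-means update: every document nudges every centroid."""
    m = centroids.copy()
    for d in docs:
        sq = np.sum((d - m) ** 2, axis=1)            # |d - m_c|^2 for every cluster c
        w = np.exp(-sq) / np.sum(np.exp(-sq))        # soft responsibility of each cluster for d
        m += eta * w[:, None] * (d - m)              # Δm_c = η · w_c · (d - m_c)
    return m
```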

Seeding 'k' clusters (two-phase approach)
- Phase I:
  - Randomly sample O(√(kn)) documents.
  - Run the bottom-up group-average clustering algorithm to reduce them to k groups or clusters: O(kn log n) time.
- Phase II: iterate assign-to-nearest:
  - Move each document to its nearest cluster.
  - Recompute cluster centroids.
  - Total time taken: O(kn).
- The two phases together: O(kn log n).

Visualisation techniques
- Goal: embedding of the corpus in a low-dimensional space.
- Hierarchical Agglomerative Clustering (HAC) lends itself easily to visualisation.
- Self-Organization Map (SOM): a close cousin of k-means.
- Multidimensional Scaling (MDS): minimize the distortion of inter-point distances in the low-dimensional embedding, as compared to the dissimilarity given in the input data.
- Latent Semantic Indexing (LSI): linear transformations to reduce the number of dimensions.

Self-Organization Map (SOM)
- Like soft k-means:
  - Determine an association between clusters and documents.
  - Associate a representative vector m_c with each cluster and iteratively refine it.
- Unlike k-means:
  - Embed the clusters in a low-dimensional space right from the beginning.
  - A large number of clusters can be initialized, even if many are eventually to remain devoid of documents.
- Background (from Wikipedia): "A self-organizing map (SOM) or self-organizing feature map (SOFM) is a type of artificial neural network (ANN) that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps are different from other artificial neural networks in the sense that they use a neighborhood function to preserve the topological properties of the input space."
- Each cluster can be a slot in a square/hexagonal grid.
- The grid structure defines the neighborhood N(c) for each cluster c.
- It also involves a proximity function h(c, γ) between clusters γ and c.

SOM: Update Rule
- Like a neural network: a data item d activates a neuron (the closest cluster) c_d, as well as the neighborhood neurons N(c_d).
- E.g., a Gaussian neighborhood function:
  h(c, γ) = exp( −||m_c − m_γ||² / (2σ²(t)) )
- The update rule for node γ under the influence of d is:
  m_γ(t+1) = m_γ(t) + η(t) · h(γ, c_d) · (d − m_γ(t))
  where η(t) is the learning-rate parameter.
A sketch of this update follows below.
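A minimal sketch of the SOM update above. It follows the slide's formula in computing the neighborhood from the distance between representative vectors; many SOM implementations use the clusters' grid coordinates instead, and the learning-rate and σ values are illustrative assumptions.

```python
import numpy as np

def som_update(weights: np.ndarray, d: np.ndarray, eta: float, sigma: float) -> np.ndarray:
    """One SOM step. weights: (k, m) representative vectors m_γ; d: one data item."""
    winner = np.argmin(np.linalg.norm(weights - d, axis=1))      # c_d: closest cluster to d
    m2 = np.sum((weights - weights[winner]) ** 2, axis=1)        # ||m_γ − m_{c_d}||²
    h = np.exp(-m2 / (2.0 * sigma ** 2))                         # Gaussian neighborhood h(γ, c_d)
    return weights + eta * h[:, None] * (d - weights)            # m_γ ← m_γ + η·h·(d − m_γ)
```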

SOM: Example I
- [Figure caption] A SOM computed from over a million documents taken from 80 Usenet news groups. Light areas have a high density of documents. The region shown is near the groups pc.chips and pc.video, and closer inspection shows a number of URLs in this region that are about PC video cards.

SOM: Example II
- [Figure caption] Another example of SOM at work: the sites listed in the Open Directory Project have been organized within a map of Antarctica, at antarcti.ca/ (a). Clicking on a region maintains context (inset) and zooms in on more specific topics (b). Documents are located at the cluster to which they are most similar.

Multidimensional Scaling (MDS)
- Goal: a "distance preserving" low-dimensional embedding of documents.
- Symmetric inter-document distances d_ij:
  - given a priori, or computed from an internal representation.
- Coarse-grained user feedback:
  - "documents i and j are quite dissimilar", or "document i is more similar to j than to k".
  - The user provides similarity between documents i and j; with increasing feedback, prior distances are overridden.
- Goal of MDS: to represent documents as points in a low-dimensional space (often 2D to 3D) such that the Euclidean distance between any pair of points is as close as possible to the distance between them specified by the input.
- Objective: minimize the stress of the embedding,
  stress = ( Σ_{i,j} (d̂_ij − d_ij)² ) / ( Σ_{i,j} d_ij² )
  where d̂_ij is the distance between points i and j in the embedding and d_ij is the input (target) distance.
A sketch of a stress-reducing relaxation step follows below.
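A minimal sketch of the stress objective above plus one relaxation sweep that moves each coordinate a small step in the direction that locally decreases stress; the finite-difference gradient and the step sizes are illustrative assumptions chosen for clarity rather than efficiency.

```python
import numpy as np

def stress(points: np.ndarray, target: np.ndarray) -> float:
    """points: (n, k) embedding; target: (n, n) matrix of input distances d_ij."""
    emb = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)   # embedded distances
    return float(np.sum((emb - target) ** 2) / np.sum(target ** 2))

def relax_once(points: np.ndarray, target: np.ndarray, step: float = 0.01) -> np.ndarray:
    """Move each coordinate slightly downhill in stress (finite-difference gradient)."""
    out = points.copy()
    eps = 1e-4
    for i in range(points.shape[0]):
        for c in range(points.shape[1]):
            trial = out.copy()
            trial[i, c] += eps
            grad = (stress(trial, target) - stress(out, target)) / eps
            out[i, c] -= step * grad
    return out
```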

MDS: issues
- Stress is not easy to optimize.
- Iterative hill climbing:
  - Points (documents) are assigned random coordinates by an external heuristic.
  - Points are moved by a small distance in the direction of locally decreasing stress.
- For n documents: each point takes O(n) time to be moved, so O(n²) time per relaxation in total.

FastMap [Faloutsos '95]
- No internal representation of the documents is available.
- Goal: find a projection from an 'n'-dimensional space to a space with a smaller number 'k' of dimensions.
- Iterative projection of documents along lines of maximum spread.
- Each 1D projection preserves distance information.

FastMap - Best line
- Pivots for a line: two points (a and b) that determine it.
- Avoid exhaustive checking by picking pivots that are far apart.
- The first coordinate x1 of a point x on the "best line" (a, b) is:
  x1 = ( d²_{a,x} + d²_{a,b} − d²_{b,x} ) / ( 2 d_{a,b} )

FastMap - Iterative projection
- For i = 1 to k:
  - Find the next (i-th) "best" line; a "best" line is one which gives the maximum variance of the point set in the direction of the line.
  - Project the points on the line.
  - Project the points on the "hyperplane" orthogonal to the above line.

- Project recursively, down to a 1-D space.
- Time: O(nk).
- Product: (x1, ..., xk) for each point x in the original data set.

FastMap - Hyperplane Projection
- d²_{x',y'} = d²_{x,y} − (x1 − y1)²
- Purpose: to correct the inter-point distances d_{x',y'} between the projected points (x', y') by taking into account the components (x1, y1) already accounted for by the first pivot line.
A sketch of one projection step follows below.
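A minimal sketch of one FastMap projection step using the two formulas above; it works directly on a full distance matrix, and picking the pair that realizes the largest distance as pivots is a simplification of the far-apart heuristic, assumed here for brevity.

```python
import numpy as np

def fastmap_step(dist: np.ndarray):
    """One FastMap projection. dist: (n, n) symmetric distance matrix with n >= 2 distinct points.
    Returns (coords, residual_dist): the 1-D coordinates and the corrected distances."""
    a, b = np.unravel_index(np.argmax(dist), dist.shape)      # pivots: a far-apart pair
    dab = dist[a, b]
    # x1 = (d_{a,x}^2 + d_{a,b}^2 - d_{b,x}^2) / (2 d_{a,b})
    x = (dist[a] ** 2 + dab ** 2 - dist[b] ** 2) / (2.0 * dab)
    # d_{x',y'}^2 = d_{x,y}^2 - (x1 - y1)^2 on the hyperplane orthogonal to the pivot line
    resid_sq = dist ** 2 - (x[:, None] - x[None, :]) ** 2
    return x, np.sqrt(np.clip(resid_sq, 0.0, None))
```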

FastMap Example
- [Figure: the response to the query "Tony Bennett".]

Vector Space Model: Pros
- Automatic selection of index terms.
- Partial matching of queries and documents (dealing with the case where no document contains all search terms).
- Ranking according to similarity score (dealing with large result sets).
- Term-weighting schemes (improve retrieval performance).
- Various extensions:
  - Document clustering.
  - Relevance feedback (modifying the query vector).
- Geometric foundation.

Problems with Lexical Semantics
- Polysemy: words often have a multitude of meanings and different usages. The Vector Space Model cannot distinguish between different senses of the same word, i.e., ambiguity.
- Synonymy: different terms may have an identical or similar meaning. The Vector Space Model cannot express associations between words.

Issues in the VSM
- Assumes independence between terms, but:
  - some terms are more likely to occur together (synonyms, related words, spelling errors, etc.);
  - depending on context, terms may have different meanings.
- The term-document matrix is very high-dimensional: does every document / every term really have that many important features?

Latent Semantic Indexing (LSI)
- Perform a low-rank approximation of the term-document matrix (typical rank 100-300).
- General idea:
  - Map documents (and terms) to a low-dimensional representation.
  - Design the mapping such that the low-dimensional space reflects semantic associations (a latent semantic space).
  - Compute document similarity based on the inner product in this latent semantic space.

Latent Semantic Indexing (LSI): the SVD
- Perform a Singular Value Decomposition of the t × d term-document matrix W:
  W(t×d) = T(t×r) · S(r×r) · Dᵀ(r×d)
  - S is a diagonal matrix giving the relative importance of each dimension (the singular values).
  - Dᵀ is the representation of the documents in the r dimensions.
  - T is the matrix used to map new documents into the r dimensions.

Low-rank Approximation
- Truncate the factorization to the top k dimensions:
  W'(t×d) ≈ T(t×k) · S(k×k) · Dᵀ(k×d)
- From the original term-document matrix A_r we compute its approximation A_k. In A_k each row still corresponds to a term and each column to a document; the difference is that the documents now live in a new space whose dimensionality k << r.

LSI Term matrix T
- Each row of T is a term's vector in the LSI space.
- In the original matrix, term vectors are d-dimensional; in T they are much smaller.
- The dimensions correspond to groups of terms that tend to "co-occur" with this term in the same documents: synonyms, contextually-related words, variant endings.

Singular Values
- S gives an ordering to the dimensions.
- The values drop off very quickly; the singular values in the tail represent "noise".
- Cutting off the low-value dimensions can reduce that noise and improve performance.

Document matrix D
- Dᵀ is the representation of the documents in the LSI space.
- It has the same dimensionality as the T vectors.
- It can be used to compute the similarity between a query and a document.

Improved Retrieval with LSI
- Retrieval process with LSI: the query is mapped/projected ("folded in") into the LSI document space by multiplying the document/query vector by S⁻¹Tᵀ (using TᵀT = DᵀD = I).
- The collection's document vectors are the columns of Dᵀ; query-document similarity is then computed with the dot product.
- The performance gains come from:
  - removal of noise;
  - no need to stem terms (variants will co-occur);
  - no need for a stop list.
- There is no improvement in speed or space, though.
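A minimal sketch of the truncated SVD and the query folding-in described above, using NumPy's SVD; the toy term-document matrix, the query, and the value of k are illustrative assumptions rather than the example from the slides.

```python
import numpy as np

# Toy term-document matrix W (terms x docs), illustrative counts only.
W = np.array([[1.0, 1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])

T, s, Dt = np.linalg.svd(W, full_matrices=False)    # W = T · diag(s) · Dt
k = 2
Tk, Sk, Dtk = T[:, :k], np.diag(s[:k]), Dt[:k, :]   # rank-k truncation

def fold_in(q: np.ndarray) -> np.ndarray:
    """Map a query (term-space vector) into the k-dimensional LSI document space."""
    return np.linalg.inv(Sk) @ Tk.T @ q              # q_k = S^{-1} T^T q

q = np.array([1.0, 0.0, 1.0, 0.0])                   # query containing terms 1 and 3
qk = fold_in(q)
docs_k = Dtk                                         # each column is a doc in LSI space
scores = (qk @ docs_k) / (np.linalg.norm(qk) * np.linalg.norm(docs_k, axis=0))
print(scores)                                        # cosine similarity of the query to each doc
```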

Sample of LSI
- [Worked-example slides: a small term-document matrix C, its SVD factors T_r, S_r, D_r, the rank-2 truncation S_2, D_2, and a map of the documents into a 2-dimensional latent semantic space. The matrices and the illustration are not reproduced here.]

Empirical evidence
- Experiments on TREC 1/2/3 (Dumais):
  - Precision at or above the median TREC precision.
  - Top scorer on almost 20% of TREC topics.
  - Slightly better on average than straight vector spaces.
- Effect of dimensionality:

  Dimensions | Precision
  ---------- | ---------
  250        | 0.367
  300        | 0.371
  346        | 0.374

LSI has many other applications
- In many settings we have a feature-object matrix.
- The matrix is high-dimensional and highly redundant, so a low-rank approximation can be used.
- E.g., in text retrieval the terms are the features and the docs are the objects (Latent Semantic Indexing).
- E.g., opinions and users: incomplete data (such as users' opinions) can be recovered in the low-dimensional space.
- A powerful, general analytical technique.

IR based on Language Models (LM)
- [Figure: an information need leads to a query; each document d1 ... dn in the collection has its own model Md1 ... Mdn, and the query is viewed as generated from a document model: P(Q | Md).]
- The usual approach to search: the user guesses the words an author would have used when writing a relevant document, and forms a query from them.
- The LM approach directly exploits that idea!

Formal Language (Model)
- A traditional generative model: it generates strings.
  - Finite state machines or regular grammars, etc.
- Example: I wish / I wish I wish / I wish I wish I wish / I wish I wish I wish I wish / ... - i.e., (I wish)*.

Stochastic Language Models
- A model's probability of generating strings in the language (commonly all strings over an alphabet ∑).
- Model M: the 0.2 | a 0.1 | man 0.01 | woman 0.01 | said 0.03 | likes 0.02 | ...
- For the string "the man likes the woman", multiply the word probabilities:
  P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

Stochastic Language Models
- A model gives the probability of generating any string. Compare two models on the string "the class pleaseth yon maiden":

  word     | P(word | M1) | P(word | M2)
  -------- | ------------ | ------------
  the      | 0.2          | 0.2
  class    | 0.01         | 0.0001
  sayst    | 0.0001       | 0.03
  pleaseth | 0.0001       | 0.02
  yon      | 0.0001       | 0.1
  maiden   | 0.0005       | 0.01
  woman    | 0.01         | 0.0001

- P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005 and P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01, so P(s | M2) > P(s | M1).

Language

Models用来生成文本的统计模型–

Probability

distribution

over

strings

in

a

given

languageMP

( |

M

) =

P

(|

M

)P

(|

M,)P

(|

M,)P

(|

M,)82Stochastic

Language

Models|

) P

(

|)=

P

(

) P

(

|

) P

(Unigram

Language

ModelsP

( )

P

(

) P

(

) P

(

)P

()Bigram

(generally,

n-gram)

Language

ModelsP

( )

P

(

| )

P

(

|

) P

(

|

)Easy.Effective!83Using

Language

Models

in

IR84每篇文档对应一个model按P(d

|

q)对文档排序P(d

|

q)

=

P(q

|

d)

x

P(d)

/

P(q)P(q)

is

the

same

for

all

documents,

so

ignoreP(d)

[the

prior]

is

often

treated

as

the

same

for

all

dBut

wecould

use

criteria

like

authority,

length,

genreP(q

|

d)

is

the

probability

of

q

given

d’s

modelVery

general

formalappro

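To make the query-likelihood ranking described in the last section concrete, here is a minimal sketch that builds a maximum-likelihood unigram model for each document and ranks by P(q | d) with a uniform prior P(d). The tiny corpus and whitespace tokenization are illustrative assumptions, and no smoothing is applied, so any unseen query term zeroes out a document's score.

```python
from collections import Counter

docs = {
    "d1": "xerox will ship new printers this week",
    "d2": "the new printer ships with toner",
    "d3": "stock of xerox rises on printer news",
}

def unigram_model(text: str) -> dict:
    words = text.split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}   # maximum-likelihood P(w | Md)

def query_likelihood(query: str, model: dict) -> float:
    score = 1.0
    for w in query.split():
        score *= model.get(w, 0.0)                     # unseen term -> probability 0 (no smoothing)
    return score

models = {name: unigram_model(text) for name, text in docs.items()}
query = "xerox printer"
ranking = sorted(docs, key=lambda d: query_likelihood(query, models[d]), reverse=True)
print(ranking)   # documents ordered by P(q | d)
```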