
Algorithms for Nearest Neighbor Search
Piotr Indyk, MIT

Nearest Neighbor Search
- Given: a set P of n points in R^d
- Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P

Outline of this talk
- Variants
- Motivation
- Main memory algorithms: quadtrees, k-d-trees, Locality-Sensitive Hashing
- Secondary storage algorithms: R-tree (and its variants), VA-file

Variants of nearest neighbor
- Near neighbor (range search): find one/all points in P within distance r from q
- Spatial join: given two sets P, Q, find all pairs p in P, q in Q such that p is within distance r from q
- Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Motivation

Depends on the value of d:
- low d: graphics, vision, GIS, etc.
- high d:
  - similarity search in databases (text, images)
  - finding pairs of similar objects (e.g., copyright violation detection)
  - useful subroutine for clustering

Algorithms

- Main memory (Computational Geometry):
  - linear scan
  - tree-based: quadtree, k-d-tree
  - hashing-based: Locality-Sensitive Hashing
- Secondary storage (Databases):
  - R-tree (and numerous variants)
  - Vector Approximation File (VA-file)

Quadtree
Simplest spatial structure on Earth!

Quadtree ctd.
- Split the space into 2^d equal subsquares
- Repeat until done:
  - only one pixel left
  - only one point left
  - only a few points left
- Variants: split only one dimension at a time (k-d-trees, in a moment)

Range search
Near neighbor (range search):
- put the root on the stack
- repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine the point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack

Near neighbor ctd. (figure)

Nearest neighbor
- Start range search with r = infinity
- Whenever a point is found, update r
- Only investigate nodes with respect to the current r
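The two traversals above can be sketched as follows. This is a minimal 2-d quadtree; the `QuadtreeNode` class, its leaf capacity, and the minimum cell size are illustrative assumptions, not from the slides.

```python
import math

class QuadtreeNode:
    """A 2-d quadtree cell: either a leaf holding points, or 4 subsquares."""
    LEAF_CAPACITY = 1

    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size  # lower-left corner + side
        self.points = []      # points stored here while the node is a leaf
        self.children = None  # the 4 equal subsquares after a split

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.LEAF_CAPACITY and self.size > 1e-9:
                self._split()
        else:
            self._child_for(p).insert(p)

    def _split(self):
        h = self.size / 2
        self.children = [QuadtreeNode(self.x0 + dx * h, self.y0 + dy * h, h)
                         for dx in (0, 1) for dy in (0, 1)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        h = self.size / 2
        dx = 1 if p[0] >= self.x0 + h else 0
        dy = 1 if p[1] >= self.y0 + h else 0
        return self.children[2 * dx + dy]

    def min_dist(self, q):
        # distance from q to the closest point of this square (0 if q is inside)
        dx = max(self.x0 - q[0], 0, q[0] - (self.x0 + self.size))
        dy = max(self.y0 - q[1], 0, q[1] - (self.y0 + self.size))
        return math.hypot(dx, dy)

def range_search(root, q, r):
    """Stack-based traversal: report all points within distance r of q."""
    found, stack = [], [root]
    while stack:
        t = stack.pop()
        if t.children is None:                 # leaf: examine its point(s)
            found.extend(p for p in t.points if math.dist(p, q) <= r)
        else:
            for c in t.children:               # push only intersecting cells
                if c.min_dist(q) <= r:
                    stack.append(c)
    return found

def nearest_neighbor(root, q):
    """Range search with shrinking r: update r whenever a point is found."""
    best, r = None, math.inf
    stack = [root]
    while stack:
        t = stack.pop()
        if t.min_dist(q) > r:
            continue                           # prune against the current r
        if t.children is None:
            for p in t.points:
                d = math.dist(p, q)
                if d < r:
                    best, r = p, d
        else:
            stack.extend(t.children)
    return best
```

`min_dist` is the pruning test: a cell is pushed only if the ball of radius r around q can intersect it, which is exactly the condition on the slide.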

Quadtree ctd.
- Simple data structure
- Versatile, easy to implement
- So why doesn't this talk end here?
  - Empty spaces: if the points form a sparse cloud, it takes a while to reach them
  - Space exponential in the dimension
  - Time exponential in the dimension, e.g., points on the hypercube

Space issues: example (figure)

K-d-trees [Bentley'75]
Main ideas:
- only one-dimensional splits
- instead of splitting in the middle, choose the split "carefully" (many variations)
- near(est) neighbor queries: as for quadtrees
Advantages:
- no (or fewer) empty spaces
- only linear space
Exponential query time is still possible

Exponential query time
- What does it mean exactly?
- Unless we do something really stupid, the query time is at most dn
- Therefore, the actual query time is Min[dn, exponential(d)]
- This is still quite bad, though, when the dimension is around 20-30
- Unfortunately, it seems inevitable (both in theory and in practice)

Approximate nearest neighbor
- Can do it using (augmented) k-d trees, by interrupting the search earlier [Arya et al'94]
- Still exponential time (in the worst case)
- Try a different approach:
  - for exact queries, we can use binary search trees or hashing
  - can we adapt hashing to nearest neighbor search?

Locality-Sensitive Hashing [Indyk-Motwani'98]
Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have:
- Pr[h(p)=h(q)] is "high" if p is "close" to q
- Pr[h(p)=h(q)] is "low" if p is "far" from q

Do such functions exist?
- Consider the hypercube, i.e., points from {0,1}^d
- Hamming distance D(p,q) = number of positions on which p and q differ
- Define a hash function h by choosing a set I of k random coordinates, and setting h(p) = projection of p on I

Example
- Take d=10, p=0101110010
- k=2, I={2,5}
- Then h(p)=11

h's are locality-sensitive
- Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k
- We can vary the probability by changing k (figure: Pr vs. distance for k=1 and k=2)
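The projection hash and the collision-probability formula can be checked empirically. A minimal sketch, with two assumptions: the slide's I={2,5} is read as 1-based positions, and the coordinates are sampled independently with replacement, which is what makes Pr[h(p)=h(q)] exactly (1 - D(p,q)/d)^k.

```python
import random

def hamming(p, q):
    """D(p,q) = number of positions on which p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def make_hash(d, k, rng):
    """Choose k random coordinates I; h(p) is the projection of p on I."""
    I = rng.choices(range(d), k=k)   # with replacement, matching (1 - D/d)^k
    return lambda p: "".join(p[i] for i in I)

# The slide's example, reading I={2,5} as 1-based positions:
p = "0101110010"
h_example = lambda s: "".join(s[i] for i in (1, 4))  # 0-based indices of 2, 5
assert h_example(p) == "11"

# Empirical check of Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k:
q = "0101110011"                  # D(p,q) = 1, so for k=2 expect 0.9^2 = 0.81
rng = random.Random(0)
trials, hits = 10000, 0
for _ in range(trials):
    h = make_hash(10, 2, rng)     # fresh random h each trial
    hits += h(p) == h(q)
collision_rate = hits / trials    # should be close to 0.81
```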

How can we use LSH?
- Choose several hash functions h1..hl
- Initialize a hash array for each hi
- Store each point p in the bucket hi(p) of the i-th hash array, i=1..l
- In order to answer a query q:
  - for each i=1..l, retrieve the points in bucket hi(q)
  - return the closest point found
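The multi-table scheme above can be sketched as follows, for points on the hypercube. The `LSHIndex` class and its parameter names are illustrative; the hash functions are the coordinate projections from the previous slides.

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class LSHIndex:
    """l hash arrays; each stores every point under a k-coordinate projection."""

    def __init__(self, d, k, l, rng):
        # one random projection h_i per table, i = 1..l
        self.funcs = [tuple(rng.choices(range(d), k=k)) for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _h(self, i, p):
        return "".join(p[j] for j in self.funcs[i])

    def insert(self, p):
        # store p in the bucket h_i(p) of the i-th hash array
        for i, table in enumerate(self.tables):
            table[self._h(i, p)].append(p)

    def query(self, q):
        """Retrieve the buckets h_i(q) and return the closest point found."""
        candidates = {p for i, t in enumerate(self.tables)
                      for p in t[self._h(i, q)]}
        return min(candidates, key=lambda p: hamming(p, q), default=None)
```

Note that the query inspects only the points colliding with q in some table, never the whole data set; the next slide explains why that is enough with high probability.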

What does this algorithm do?
- By proper choice of the parameters k and l, we can make, for any p, the probability that hi(p)=hi(q) for some i look like a step function of the distance (figure: collision probability vs. distance)
- Can control:
  - the position of the slope
  - how steep it is

The LSH algorithm
- Therefore, we can solve (approximately) the near neighbor problem with a given parameter r
- Worst-case analysis guarantees dn^(1/(1+ε)) query time
- Practical evaluation indicates much better behavior [GIM'99, HGI'00, Buh'00, BT'00]
- Drawbacks:
  - works best for the Hamming distance (although it can be generalized to Euclidean space)
  - requires the radius r to be fixed in advance

Secondary storage
- Seek time is comparable to the time needed to transfer hundreds of KBs
- Grouping the data is crucial
- A different approach is required:
  - in main memory, any reduction in the number of inspected points was good
  - on disk, this is not the case!

Disk-based algorithms
- R-tree [Guttman'84]
  - departing point for many variations
  - over 600 citations! (according to CiteSeer)
  - "optimistic" approach: try to answer queries in logarithmic time
- Vector Approximation File [WSB'98]
  - "pessimistic" approach: if we need to scan the whole data set, we had better do it fast
- LSH works on disk as well

R-tree
"Bottom-up" approach (the k-d-tree was "top-down"):
- Start with a set of points/rectangles
- Partition the set into groups of small cardinality
- For each group, find the minimum rectangle containing the objects from this group
- Repeat

R-tree ctd. (figure)
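One round of this bottom-up construction can be sketched as below. Grouping by a simple sort is a stand-in for the real packing heuristics; rectangles are (xmin, ymin, xmax, ymax) tuples, and a point is a degenerate rectangle.

```python
def mbr(rects):
    """Minimum rectangle containing a group of (xmin, ymin, xmax, ymax) rects."""
    xs_min, ys_min, xs_max, ys_max = zip(*rects)
    return (min(xs_min), min(ys_min), max(xs_max), max(ys_max))

def build_level(rects, group_size):
    """One bottom-up round: partition into groups of small cardinality,
    then replace each group by its minimum bounding rectangle."""
    rects = sorted(rects)                 # naive ordering; real R-trees use
    groups = [rects[i:i + group_size]     # smarter packing heuristics
              for i in range(0, len(rects), group_size)]
    return [mbr(g) for g in groups]

def build_rtree_levels(rects, group_size):
    """Repeat until a single root rectangle remains."""
    levels = [list(rects)]
    while len(levels[-1]) > 1:
        levels.append(build_level(levels[-1], group_size))
    return levels
```

Each level's rectangles bound the level below, so a query can prune whole groups whose bounding rectangle misses the query ball, exactly as in the tree searches earlier in the talk.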

R-tree ctd.
- Advantages:
  - supports near(est) neighbor search (similar to before)
  - works for points and rectangles
  - avoids empty spaces
- Many variants: X-tree, SS-tree, SR-tree, etc.
- Works well for low dimensions; not so great for high dimensions

VA-file [Weber, Schek, Blott'98]
Approach:
- In high-dimensional spaces, all tree-based indexing structures examine a large fraction of the leaves
- If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether
- 1 seek = transfer of a few hundred KB

VA-file ctd.
- Natural question: how to speed up the linear scan?
- Answer: use approximation
- Use only i bits per dimension (and speed up the scan by a factor of 32/i)
- Identify all points which could be returned as an answer
- Verify those points using the original data set
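The filter-and-verify logic can be sketched as follows. This is only the logical structure: coordinates are assumed to lie in [0, 1] and are quantized to `bits`-bit cell indices, whereas a real VA-file packs those indices into a compact approximation file (the 32/i speed-up refers to 32-bit coordinates).

```python
import math

def quantize(p, bits):
    """Approximate each coordinate in [0, 1] with a `bits`-bit cell index."""
    cells = 2 ** bits
    return tuple(min(int(x * cells), cells - 1) for x in p)

def va_scan(points, q, r, bits):
    """Filter with the approximation, then verify against the original data."""
    cells = 2 ** bits
    width = 1.0 / cells                     # side length of one grid cell
    qa = quantize(q, bits)
    candidates = []
    for p in points:                        # fast scan of the compact file
        pa = quantize(p, bits)
        # lower bound on |p - q| computable from the cell indices alone
        lb = math.hypot(*(max(abs(a - b) - 1, 0) * width
                          for a, b in zip(pa, qa)))
        if lb <= r:                         # p could be an answer
            candidates.append(p)
    # verify the surviving candidates using the original data set
    return [p for p in candidates if math.dist(p, q) <= r]
```

The first pass touches only the quantized representation, so it streams quickly off disk; only the (hopefully few) candidates force a lookup of the full-precision vectors.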

Time to sum up
- The "curse of dimensionality" is indeed a curse
- In main memory, we can perform sublinear-time search using trees or hashing
- In secondary storage, linear scan is p… [text ends here]
