
Algorithms for Nearest Neighbor Search
Piotr Indyk, MIT

Nearest Neighbor Search
- Given: a set P of n points in R^d
- Goal: a data structure which, given a query point q, finds the nearest neighbor p of q in P

Outline of this talk
- Variants
- Motivation
- Main memory algorithms: quadtrees, k-d-trees, Locality-Sensitive Hashing
- Secondary storage algorithms: R-tree (and its variants), VA-file

Variants of nearest neighbor
- Near neighbor (range search): find one/all points in P within distance r from q
- Spatial join: given two sets P, Q, find all pairs p in P, q in Q such that p is within distance r from q
- Approximate near neighbor: find one/all points p' in P whose distance to q is at most (1+ε) times the distance from q to its nearest neighbor

Motivation

Depends on the value of d:
- low d: graphics, vision, GIS, etc.
- high d:
  - similarity search in databases (text, images)
  - finding pairs of similar objects (e.g., copyright violation detection)
  - useful subroutine for clustering

Algorithms

- Main memory (Computational Geometry):
  - linear scan
  - tree-based: quadtree, k-d-tree
  - hashing-based: Locality-Sensitive Hashing
- Secondary storage (Databases):
  - R-tree (and numerous variants)
  - Vector Approximation File (VA-file)

Quadtree
Simplest spatial structure on Earth!

Quadtree ctd.
- Split the space into 2^d equal subsquares
- Repeat until done:
  - only one pixel left
  - only one point left
  - only a few points left
- Variants: split only one dimension at a time (k-d-trees, in a moment)

Range search
Near neighbor (range search):
- put the root on the stack
- repeat:
  - pop the next node T from the stack
  - for each child C of T:
    - if C is a leaf, examine the point(s) in C
    - if C intersects with the ball of radius r around q, add C to the stack

Near neighbor ctd. (figure)

Nearest neighbor
- Start range search with r = infinity
- Whenever a point is found, update r
- Only investigate nodes with respect to the current r
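The two traversals above can be sketched as follows. This is a minimal 2-d quadtree; the `QuadtreeNode` class, its leaf capacity, and the minimum cell size are illustrative assumptions, not from the slides.

```python
import math

class QuadtreeNode:
    """A 2-d quadtree cell: either a leaf holding points, or 4 subsquares."""
    LEAF_CAPACITY = 1

    def __init__(self, x0, y0, size):
        self.x0, self.y0, self.size = x0, y0, size  # lower-left corner + side
        self.points = []      # points stored here while the node is a leaf
        self.children = None  # the 4 equal subsquares after a split

    def insert(self, p):
        if self.children is None:
            self.points.append(p)
            if len(self.points) > self.LEAF_CAPACITY and self.size > 1e-9:
                self._split()
        else:
            self._child_for(p).insert(p)

    def _split(self):
        h = self.size / 2
        self.children = [QuadtreeNode(self.x0 + dx * h, self.y0 + dy * h, h)
                         for dx in (0, 1) for dy in (0, 1)]
        pts, self.points = self.points, []
        for p in pts:
            self._child_for(p).insert(p)

    def _child_for(self, p):
        h = self.size / 2
        dx = 1 if p[0] >= self.x0 + h else 0
        dy = 1 if p[1] >= self.y0 + h else 0
        return self.children[2 * dx + dy]

    def min_dist(self, q):
        # distance from q to the closest point of this square (0 if q is inside)
        dx = max(self.x0 - q[0], 0, q[0] - (self.x0 + self.size))
        dy = max(self.y0 - q[1], 0, q[1] - (self.y0 + self.size))
        return math.hypot(dx, dy)

def range_search(root, q, r):
    """Stack-based traversal: report all points within distance r of q."""
    found, stack = [], [root]
    while stack:
        t = stack.pop()
        if t.children is None:                 # leaf: examine its point(s)
            found.extend(p for p in t.points if math.dist(p, q) <= r)
        else:
            for c in t.children:               # push only intersecting cells
                if c.min_dist(q) <= r:
                    stack.append(c)
    return found

def nearest_neighbor(root, q):
    """Range search with shrinking r: update r whenever a point is found."""
    best, r = None, math.inf
    stack = [root]
    while stack:
        t = stack.pop()
        if t.min_dist(q) > r:
            continue                           # prune against the current r
        if t.children is None:
            for p in t.points:
                d = math.dist(p, q)
                if d < r:
                    best, r = p, d
        else:
            stack.extend(t.children)
    return best
```

`min_dist` is the pruning test: a cell is pushed only if the ball of radius r around q can intersect it, which is exactly the condition on the slide.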

Quadtree ctd.
- Simple data structure
- Versatile, easy to implement
- So why doesn't this talk end here?
  - Empty spaces: if the points form a sparse cloud, it takes a while to reach them
  - Space exponential in the dimension
  - Time exponential in the dimension, e.g., points on the hypercube

Space issues: example (figure)

K-d-trees [Bentley'75]
Main ideas:
- only one-dimensional splits
- instead of splitting in the middle, choose the split "carefully" (many variations)
- near(est) neighbor queries: as for quadtrees
Advantages:
- no (or fewer) empty spaces
- only linear space
Exponential query time is still possible

Exponential query time
- What does it mean exactly?
- Unless we do something really stupid, the query time is at most dn
- Therefore, the actual query time is Min[dn, exponential(d)]
- This is still quite bad, though, when the dimension is around 20-30
- Unfortunately, it seems inevitable (both in theory and in practice)

Approximate nearest neighbor
- Can do it using (augmented) k-d trees, by interrupting the search earlier [Arya et al'94]
- Still exponential time (in the worst case)
- Try a different approach:
  - for exact queries, we can use binary search trees or hashing
  - can we adapt hashing to nearest neighbor search?

Locality-Sensitive Hashing [Indyk-Motwani'98]
Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have:
- Pr[h(p)=h(q)] is "high" if p is "close" to q
- Pr[h(p)=h(q)] is "low" if p is "far" from q

Do such functions exist?
- Consider the hypercube, i.e., points from {0,1}^d
- Hamming distance D(p,q) = number of positions on which p and q differ
- Define a hash function h by choosing a set I of k random coordinates, and setting h(p) = projection of p on I

Example
- Take d=10, p=0101110010
- k=2, I={2,5}
- Then h(p)=11

h's are locality-sensitive
- Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k
- We can vary the probability by changing k (figure: Pr vs. distance for k=1 and k=2)
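The projection hash and the collision-probability formula can be checked empirically. A minimal sketch, with two assumptions: the slide's I={2,5} is read as 1-based positions, and the coordinates are sampled independently with replacement, which is what makes Pr[h(p)=h(q)] exactly (1 - D(p,q)/d)^k.

```python
import random

def hamming(p, q):
    """D(p,q) = number of positions on which p and q differ."""
    return sum(a != b for a, b in zip(p, q))

def make_hash(d, k, rng):
    """Choose k random coordinates I; h(p) is the projection of p on I."""
    I = rng.choices(range(d), k=k)   # with replacement, matching (1 - D/d)^k
    return lambda p: "".join(p[i] for i in I)

# The slide's example, reading I={2,5} as 1-based positions:
p = "0101110010"
h_example = lambda s: "".join(s[i] for i in (1, 4))  # 0-based indices of 2, 5
assert h_example(p) == "11"

# Empirical check of Pr[h(p)=h(q)] = (1 - D(p,q)/d)^k:
q = "0101110011"                  # D(p,q) = 1, so for k=2 expect 0.9^2 = 0.81
rng = random.Random(0)
trials, hits = 10000, 0
for _ in range(trials):
    h = make_hash(10, 2, rng)     # fresh random h each trial
    hits += h(p) == h(q)
collision_rate = hits / trials    # should be close to 0.81
```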

How can we use LSH?
- Choose several hash functions h1..hl
- Initialize a hash array for each hi
- Store each point p in the bucket hi(p) of the i-th hash array, i=1..l
- In order to answer a query q:
  - for each i=1..l, retrieve the points in bucket hi(q)
  - return the closest point found
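The multi-table scheme above can be sketched as follows, for points on the hypercube. The `LSHIndex` class and its parameter names are illustrative; the hash functions are the coordinate projections from the previous slides.

```python
import random
from collections import defaultdict

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

class LSHIndex:
    """l hash arrays; each stores every point under a k-coordinate projection."""

    def __init__(self, d, k, l, rng):
        # one random projection h_i per table, i = 1..l
        self.funcs = [tuple(rng.choices(range(d), k=k)) for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _h(self, i, p):
        return "".join(p[j] for j in self.funcs[i])

    def insert(self, p):
        # store p in the bucket h_i(p) of the i-th hash array
        for i, table in enumerate(self.tables):
            table[self._h(i, p)].append(p)

    def query(self, q):
        """Retrieve the buckets h_i(q) and return the closest point found."""
        candidates = {p for i, t in enumerate(self.tables)
                      for p in t[self._h(i, q)]}
        return min(candidates, key=lambda p: hamming(p, q), default=None)
```

Note that the query inspects only the points colliding with q in some table, never the whole data set; the next slide explains why that is enough with high probability.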

What does this algorithm do?
- By proper choice of the parameters k and l, we can make, for any p, the probability that hi(p)=hi(q) for some i look like a step function of the distance (figure: collision probability vs. distance)
- Can control:
  - the position of the slope
  - how steep it is

The LSH algorithm
- Therefore, we can solve (approximately) the near neighbor problem with a given parameter r
- Worst-case analysis guarantees dn^(1/(1+ε)) query time
- Practical evaluation indicates much better behavior [GIM'99, HGI'00, Buh'00, BT'00]
- Drawbacks:
  - works best for the Hamming distance (although it can be generalized to Euclidean space)
  - requires the radius r to be fixed in advance

Secondary storage
- Seek time is comparable to the time needed to transfer hundreds of KBs
- Grouping the data is crucial
- A different approach is required:
  - in main memory, any reduction in the number of inspected points was good
  - on disk, this is not the case!

Disk-based algorithms
- R-tree [Guttman'84]
  - departing point for many variations
  - over 600 citations! (according to CiteSeer)
  - "optimistic" approach: try to answer queries in logarithmic time
- Vector Approximation File [WSB'98]
  - "pessimistic" approach: if we need to scan the whole data set, we had better do it fast
- LSH works on disk as well

R-tree
"Bottom-up" approach (the k-d-tree was "top-down"):
- Start with a set of points/rectangles
- Partition the set into groups of small cardinality
- For each group, find the minimum rectangle containing the objects from this group
- Repeat

R-tree ctd. (figure)
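One round of this bottom-up construction can be sketched as below. Grouping by a simple sort is a stand-in for the real packing heuristics; rectangles are (xmin, ymin, xmax, ymax) tuples, and a point is a degenerate rectangle.

```python
def mbr(rects):
    """Minimum rectangle containing a group of (xmin, ymin, xmax, ymax) rects."""
    xs_min, ys_min, xs_max, ys_max = zip(*rects)
    return (min(xs_min), min(ys_min), max(xs_max), max(ys_max))

def build_level(rects, group_size):
    """One bottom-up round: partition into groups of small cardinality,
    then replace each group by its minimum bounding rectangle."""
    rects = sorted(rects)                 # naive ordering; real R-trees use
    groups = [rects[i:i + group_size]     # smarter packing heuristics
              for i in range(0, len(rects), group_size)]
    return [mbr(g) for g in groups]

def build_rtree_levels(rects, group_size):
    """Repeat until a single root rectangle remains."""
    levels = [list(rects)]
    while len(levels[-1]) > 1:
        levels.append(build_level(levels[-1], group_size))
    return levels
```

Each level's rectangles bound the level below, so a query can prune whole groups whose bounding rectangle misses the query ball, exactly as in the tree searches earlier in the talk.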

R-tree ctd.
- Advantages:
  - supports near(est) neighbor search (similar to before)
  - works for points and rectangles
  - avoids empty spaces
- Many variants: X-tree, SS-tree, SR-tree, etc.
- Works well for low dimensions; not so great for high dimensions

VA-file [Weber, Schek, Blott'98]
Approach:
- In high-dimensional spaces, all tree-based indexing structures examine a large fraction of the leaves
- If we need to visit so many nodes anyway, it is better to scan the whole data set and avoid performing seeks altogether
- 1 seek = transfer of a few hundred KB

VA-file ctd.
- Natural question: how to speed up the linear scan?
- Answer: use approximation
- Use only i bits per dimension (and speed up the scan by a factor of 32/i)
- Identify all points which could be returned as an answer
- Verify those points using the original data set
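The filter-and-verify logic can be sketched as follows. This is only the logical structure: coordinates are assumed to lie in [0, 1] and are quantized to `bits`-bit cell indices, whereas a real VA-file packs those indices into a compact approximation file (the 32/i speed-up refers to 32-bit coordinates).

```python
import math

def quantize(p, bits):
    """Approximate each coordinate in [0, 1] with a `bits`-bit cell index."""
    cells = 2 ** bits
    return tuple(min(int(x * cells), cells - 1) for x in p)

def va_scan(points, q, r, bits):
    """Filter with the approximation, then verify against the original data."""
    cells = 2 ** bits
    width = 1.0 / cells                     # side length of one grid cell
    qa = quantize(q, bits)
    candidates = []
    for p in points:                        # fast scan of the compact file
        pa = quantize(p, bits)
        # lower bound on |p - q| computable from the cell indices alone
        lb = math.hypot(*(max(abs(a - b) - 1, 0) * width
                          for a, b in zip(pa, qa)))
        if lb <= r:                         # p could be an answer
            candidates.append(p)
    # verify the surviving candidates using the original data set
    return [p for p in candidates if math.dist(p, q) <= r]
```

The first pass touches only the quantized representation, so it streams quickly off disk; only the (hopefully few) candidates force a lookup of the full-precision vectors.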

Time to sum up
- The "curse of dimensionality" is indeed a curse
- In main memory, we can perform sublinear-time search using trees or hashing
- In secondary storage, linear scan is p… [text ends here]
