版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Algorithms
for
Nearest
NeighborSearchPiotr
IndykMITNearest
Neighbor
SearchGiven:
a
set
P
of
n
points
in
Rd
Goal:
a
data
structure,
which
given
a
quepoint
q,
finds
the
nearest
neighbor
p
ofin
PpqOutline
of
this
talkVariantsMotivationMain
memory
algorithms:quadtreeskd-treesLocality
Sensitive
HashingSecondary
storage
algorithms:R-tree
(and
its
variants)VA-fileVariants
of
nearest
neighbor
Near
neighbor
(range
search):
find
one/alpoints
in
P
within
distance
r
from
q
Spatial
join:
given
two
sets
P,Q,
find
allpairs
p
in
P,
q
in
Q,
such
that
p
is
withindistance
r
from
q
Approximate
near
neighbor:
find
one/allpoints
p’
in
P,
whose
distance
to
q
is
atmost
(1+e)
times
the
distance
from
q
to
itsnearest
neighborMotivationDepends
on
the
value
of
d:low
d:
graphics,
vision,
GIS,
etchigh
d:similarity
search
in
databases
(text,
imagesfinding
pairs
of
similar
objects
(e.g.,
copyrviolation
detection)useful
subroutine
for
clusteringAlgorithmsMain
memory
(Computational
Geometry)linear
scantree-based:quadtreekd-treehashing-based:
Locality-Sensitive
HashingSecondary
storage
(Databases)R-tree
(and
numerous
variants)Vector
Approximation
File
(VA-file)QuadtreeSimplest
spatial
structure
on
Earth
!Quadtree
ctd.Split
the
space
into
2d
equal
subsquaresRepeat
until
done:only
one
pixel
leftonly
one
point
leftonly
a
few
points
leftVariants:split
only
one
dimension
at
a
timek-d-trees
(in
a
moment)Range
searchNear
neighbor
(range
search):put
the
root
on
the
stackrepeatpop
the
next
node
T
from
the
stackfor
each
child
C
of
T:if
C
is
a
leaf,
examine
point(s)
in
Cif
C
intersects
with
the
ball
of
radius
r
around
q,
add
Cthe
stackNear
neighbor
ctdNearest
neighborStart
range
search
with
r
=Whenever
a
point
is
found,
update
r
Only
investigate
nodes
with
respect
tocurrent
rQuadtree
ctd.Simple
data
structureVersatile,
easy
to
implementSo
why
doesn’t
this
talk
end
here
?Empty
spaces:
if
the
points
form
sparse
cloudit
takes
a
while
to
reach
themSpace
exponential
in
dimensionTime
exponential
in
dimension,
e.g.,
points
othe
hypercubeSpace
issues:
exampleK-d-trees
[Bentley’75]Main
ideas:only
one-dimensional
splitsinstead
of
splitting
in
the
middle,
choose
thsplit
“carefully”
(many
variations)near(est)
neighbor
queries:
as
for
quadtreesAdvantages:no
(or
less)
empty
spacesonly
linear
spaceExponential
query
time
still
possibleExponential
query
timeWhat
does
it
mean
exactly
?Unless
we
do
something
really
stupid,
query
time
ismost
dnTherefore,
the
actual
query
time
isMin[
dn,
exponential(d)
]
This
is
still
quite
bad
though,
when
the
dimensiois
around
20-30
Unfortunately,
it
seems
inevitable
(both
in
theoand
practice)Approximate
nearest
neighbor
Can
do
it
using
(augmented)
k-d
trees,
byinterrupting
search
earlier
[Arya
et
al’Still
exponential
time
(in
the
worst
caseTry
a
different
approach:for
exact
queries,
we
can
use
binary
searchtrees
or
hashingcan
we
adapt
hashing
to
nearest
neighborsearch
?Locality-Sensitive
Hashing[Indyk-Motwani’98]
Hash
functions
are
locality-sensitive,
ia
random
hash
random
function
h,
for
anypair
of
points
p,q
we
have:Pr[h(p)=h(q)]
is
“high”
if
p
is
“close”
tqPr[h(p)=h(q)]
is
“low”
if
p
is”far”
fromqDo
such
functions
exist
?Consider
the
hypercube,
i.e.,pointsfrom{0,1}dHamming
distance
D(p,q)=
#
positions
onwhich
p
and
q
differDefine
hash
function
h
by
choosing
a
set
Iof
k
random
coordinates,
and
settingh(p)
=projection
of
p
onIExampleTake–
d=10,
p=0101110010–
k=2,
I={2,5}Then
h(p)=11h’s
are
locality-sensitivePr[h(p)=h(q)]=(1-D(p,q)/d)kWe
can
vary
the
probability
by
changing
kk=1k=2distancedistancePrPrHow
can
we
use
LSH
?Choose
several
h1..hlInitialize
a
hash
array
for
each
hiStore
each
point
p
in
the
bucket
hi(p)
of
ti-th
hash
array,
i=1...lIn
order
to
answer
query
qfor
each
i=1..l,
retrieve
points
in
a
bucket
hreturn
the
closest
point
foundWhat
does
this
algorithm
do
?
By
proper
choice
of
parameters
k
and
l,
we
canmake,
for
any
p,
the
probability
thathi(p)=hi(q)
for
some
ilook
like
this:Can
control:Position
of
the
slopeHow
steep
it
isdistanceThe
LSH
algorithm
Therefore,
we
can
solve
(approximately)
the
nearneighbor
problem
with
given
parameter
rWorst-case
analysis
guarantees
dn1/(1+e)
query
ti
Practical
evaluation
indicates
much
better
beha[GIM’99,HGI’00,Buh’00,BT’00]Drawbacks:
works
best
for
Hamming
distance
(although
can
be
generalizeto
Euclidean
space)requires
radius
r
to
be
fixed
in
advanceSecondary
storage
Seek
time
same
as
time
needed
to
transferhundreds
of
KBsGrouping
the
data
is
crucialDifferent
approach
required:in
main
memory,
any
reduction
in
the
numberof
inspected
points
was
goodon
disk,
this
is
not
the
case
!Disk-based
algorithmsR-tree
[Guttman’84]departing
point
for
many
variationsover
600
citations
!
(according
to
CiteSeer)“optimistic”
approach:
try
to
answer
queries
inlogarithmic
timeVector
Approximation
File
[WSB’98]“pessimistic”
approach:
if
we
need
to
scan
the
whdata
set,
we
better
do
it
fastLSH
works
also
on
diskR-tree
“Bottom-up”
approach
(k-d-tree
was“top-down”)
:Start
with
a
set
of
points/rectanglesPartition
the
set
into
groups
of
small
cardinFor
each
group,
find
minimum
rectanglecontaining
objects
from
this
groupRepeatR-tree
ctd.R-tree
ctd.Advantages:Supports
near(est)
neighbor
search
(similarbefore)Works
for
points
and
rectanglesAvoids
empty
spacesMany
variants:
X-tree,
SS-tree,
SR-tree
etcWorks
well
for
low
dimensionsNot
so
great
for
high
dimensionsVA-file
[Weber,
Schek,Blott’98]Approach:In
high-dimensional
spaces,
all
tree-basedindexing
structures
examine
large
fraction
ofleavesIf
we
need
to
visit
so
many
nodes
anyway,
it
isbetter
to
scan
the
whole
data
set
and
avoidperforming
seeks
altogether1
seek
=
transfer
of
few
hundred
KBVA-file
ctd.
Natural
question:
how
to
speed-up
linearscan
?Answer:
use
approximationUse
only
i
bits
per
dimension
(and
speed-up
thscan
by
a
factor
of
32/i)Identify
all
points
which
could
be
returned
aan
answerVerify
the
points
using
original
data
setTime
to
sum
up“Curse
of
dimensionality”
is
indeed
a
curse
In
main
memory,
we
can
perform
sublinear-timesearch
using
trees
or
hashing
In
secondary
storage,
linear
scan
is
p
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 2024五人入股成立教育科技有限公司合作协议书3篇
- 2025年南昌从业资格证考试答案货运
- 2025年吉林货运驾驶员从业资格题库
- 2025年郴州货运资格证考试真题
- 2024年版:高清影视制作与后期服务合同
- 2025年江西货运从业资格证考试一共多少题
- 2025年海西货运从业资格证怎么考
- 2024年煤炭货场运营许可合同
- 2024年度互联网+教育平台委托经营授权书3篇
- 2024年版权许可使用合同(电子书)
- 肯定句改双重否定句练习(加答案)
- 2024年高等学校英语应用能力考试B级真题
- 2024-2025学年新教材高中化学 第3章 不同聚集状态的物质与性质 第2节 第2课时 共价晶体 分子晶体 晶体结构的复杂性教案 鲁科版选择性必修2
- 初中主题班会人际交往
- 气候可行性论证技术规范 第10部分:油田开发工程
- 五年级道德与法治上册说课稿《古代科技 耀我中华(第一课时) 》部编版
- 单位工程质量竣工验收记录1
- Unit 6 教学教学设计 2024-2025学年人教版七年级英语上册
- Visio商业图表制作分析智慧树知到期末考试答案章节答案2024年上海商学院
- 竞争性谈判工作人员签到表及竞争性谈判方案
- 医疗投诉与纠纷处理管理制度
评论
0/150
提交评论