版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
Lecture4:IndexConstruction·
Lastlecture
:·
Wildcards
○
○○
○
○
○·
Spellcorrection·
Soundex
$m
mace
→madden
mo
□〉
among
amortize
·
This
time
:
on
abandon
among
·
Indexconstruction·
Dictionarydatastructures
a-hu
h
y-m
n-z·
Tolerant
retrieval
0
0
0PlanIndexconstruction·
Howdo
weconstructanindex?·
Whatstrategiescan
weuse
withlimited
mainmemory?Hardwarebasics·
Manydesigndecisionsininformationretrieval
a
basedonthe
characteristicsofhardware·
WebeginbyreviewinghardwarebasicsHardwarebasics·
Access
to
data
in
memory
ismuchfaster
than
acces
to
data
on
disk.·
Diskseeks
:
Nodataistransferredfromdisk
while
diskheadisbeingpositioned.·
Therefore
:Transferringonelargechunkofdataf
diskto
memoryisfasterthantransferring
manysmall
chunks.·
Disk
I/O
is
block-based
:Reading
and
writing
of
enblocks
(as
opposed
to
smaller
chunks).·
Blocksizes
:8KBto256KB.Hardwarebasics·ServersusedinIR
systemsnowtypicallyhaveseve
GBofmainmemory,sometimestensofGB.·
Available
disk
space
is
several
(2–
3)orders
of
magnitudelarger.·
Faulttoleranceisveryexpensive
:It
’
smuchcheto
use
many
regular
machines
rather
than
one
faulttolerant
machine.Hardwareassumptionsforthislectur·
symbol
statisticvalue·s
average
seektime5ms=5x10-3
s·
b
transfer
time
per
byte0.02
μs=2x10-s·
processor
’
sclock
rate
109
s-1·
p
low-level
operation
0.01μs=10-8
s(e.g.,compare&swap
a
word)·sizeofmainmemoryseveralGB·
size
of
disk
space1TBorRCV1:Our
collectionfor
thislectur·Shakespeare
’
scollected
worksdefinitelyaren
’enough
for
demonstrating
many
of
the
points
in
this
course.·
Thecollection
we
’ll
use
isn
’t
reallylargeeno
either,
butit
’
spubliclyavailableandisatleasmore
plausible
example.·
As
an
example
for
applying
scalable
indexconstructionalgorithms,
we
willusetheReuters
RCV1
collection.·
This
is
one
year
of
Reuters
newswire
(part
of1995
and
1996)AReutersRCV1document·symbolstatisticvalue·Ndocuments800,000·Lavg.#tokensperdoc200·Mterms(=wordtypes)400,000·avg.#bytespertoken6(incl.spaces/punct.)·avg.#bytespertoken4.5(withoutspaces/punct.)·avg.#bytesperterm7.5ReutersRCV1
statisticsRecallIIR1indexconstructionDoc2SoletitbewithCaesar.ThenobleBrutushathtoldyouCaesarwasambitiousDoc
1IdidenactJulius
Caesar
Iwaskilled
i"theCapitol
;Brutuskilledme.·
Documentsareparsedtoextract
wordsandthese
aresaved
withtheDocumentID.Wefocusonthissortstep.
Wehave
100M
itemsto
sort.·After
all
documentshavebeen
parsed,theinvertedfileissorted
by
terms.Key
step·In-memoryindexconstructiondoesnotscale·Can’tstuffentirecollectionintomemory,sort,theback·Howcanweconstructanindexforverylargecollections?·Takingintoaccountthehardwareconstraintswejlearnedabout...·Memory,disk,speed,etc.ScalingindexconstructionSort-basedindexconstruction·Aswebuildtheindex,weparsedocsoneatatime.·Whilebuildingtheindex,wecannoteasilyexploitcompressiontricks(youcan,butmuchmorecomplex)·Thefinalpostingsforanytermarepleteuntiltheend·At12bytespernon-positionalpostingsentry(term,docdemandsalotofspaceforlargecollections.·T=100,000,000inthecaseofRCV1·So…wecandothisinmemoryin2009,buttypicalcollecaremuchlarger.E.g.,theNewYorkTimesprovidesanindeof>150yearsofnewswire·Thus:Weneedtostoreintermediateresultsondisk.Sort
using
disk
as
“memory
”
?·Canweusethesame
index
constructionalgorithm
forlargercollections,
but
by
usingdiskinsteado
memory?·
No:Sorting
T=100,000,000records
on
disk
is
tooslow
–toomanydiskseeks.·
Weneedanexternal
sortingalgorithm.·
Parse
and
build
postings
entries
one
doc
at
a
time·
Now
sort
postings
entries
by
term
(then
by
docwithineachterm)·
Doing
this
with
random
disk
seeks
would
be
too
slo–mustsortT=
100MrecordsIfeverycomparisontook2
disk
seeks,
andN
items
could
be
sortedwithN
log2N
comparisons,how
longwouldthistake?BottleneckBSBI:Blocked
sort-based
Indexing
(Sorting
withfewerdiskseeks)·12-byte
(4+4+4)records
(term,
doc,
freq).·
Thesearegeneratedasweparsedocs.·
Mustnowsort
100Msuch12-byterecordsbyterm.·
DefineaBlock~
10Msuchrecords·
Caneasily
fitacoupleinto
memory.·
Will
have10such
blockstostart
with.·
Basicideaofalgorithm
:·
Accumulate
postingsforeach
block,sort,
writetodi·
Then
mergethe
blocksintoonelongsortedorder.·Can
do
binary
merges,
with
a
merge
tree
of
log210=4layer·During
each
layer,read
into
memory
runs
in
blocks
of10M
merge,
write
back.Howtomergethesortedruns?DiskRunsbeing
merged.2
Mergedrun.1
234431Howtomergethesortedruns?·Butitis
moreefficienttodoa
multi-way
merge,
whereyo
are
readingfromall
blockssimultaneously·Providingyou
readdecent-sizedchunksofeach
blockin
tmemory
and
then
write
out
a
decent-sized
output
chunk,thenyou
’re
not
killed
bydiskseeksRemainingproblemwithsort-basedalgorithm·Ourassumptionwas:wecankeepthedictionaryinmemory.·Weneedthedictionary(whichgrowsdynamically)ordertoimplementatermtotermIDmapping.·Actually,wecouldworkwithterm,docIDpostingsinsteadoftermID,docIDpostings...·...butthenintermediatefileseverylarge.wouldendupwithascalable,butveryslowindexconstructionmethod.)SPIMI:Single-passin-memoryindexing·Keyidea1:Generateseparatedictionariesforeablock–noneedtomaintainterm-termIDmappingacrossblocks.·Keyidea2:Don’tsort.Accumulatepostingsinpostingslistsastheyoccur.·Withthesetwoideaswecangenerateacompleteinvertedindexforeachblock.·Theseseparateindexescanthenbemergedintoonebigindex.·
Merging
of
blocks
is
analogous
to
BSBI.SPIMI-Invert·CompressionmakesSPIMIevenmoreefficient.·
Compression
of
terms·
Compressionof
postings·
SeenextlectureSPIMI:CompressionDistributedindexing·
For
web-scale
indexing
(don
’t
trythisat
home!)must
useadistributedcomputingcluster·
Individual
machinesarefault-prone·
Can
unpredictablyslowdownorfail·
Howdoweexploitsuchapoolofmachines?Websearch
engine
datacenter
s·
Web
search
datacenter
s
(Google,Bing,Baidu)
mainlycontaincommoditymachines.·
Datacenter
saredistributedaroundthe
world.·
Estimate
:Google~
1
millionservers,3
millionprocessors/cores
(Gartner2007)Massivedatacenter
s·
Ifin
a
non-fault-tolerantsystemwith
1000nodes
eachnodehas99.9%uptime,whatistheuptimeofthesystem?·
Answer:63%·
Exercise
:Calculate
the
number
of
servers
failin
minuteforaninstallationof1
millionservers.Distributedindexing·
Maintaina
master
machinedirectingtheindexingjob
–considered
“
safe
”.·
Break
upindexingintosetsof
(parallel)tasks.·
Mastermachineassignseachtasktoanidlemachinfrom
a
pool.Paralleltasks·
We
willusetwosetsofparalleltasks·
Parser
s·
Inverter
s·
Breaktheinputdocumentcollectionintosplits·
Each
split
is
a
subset
of
documents
(correspondinblocks
in
BSBI/SPIMI)·
Masterassignsasplittoanidleparser
machine·
Parserreadsadocumentatatimeandemits
(term,doc)
pairs·
Parser
writes
pairsintojpartitions·
Eachpartitionisforarangeofterms
’
first
let·
(e.g.,
a-f,
g-p,
q-z)
–
here
j=3.·
Now
to
complete
the
index
in
versionParser
sInverter
s·
Aninvertercollectsall
(term,doc)
pairs
(=
post
foroneterm-partition.·
Sortsand
writesto
postingslistsParser
一
a-f
g-pq-zParserParser
一
a-f
g-p
q-zDataflowMaster
assignInverter
g-pInverter
a-f一
a-f
g-p
q-zInverter
q-zReduce
phaseSegment
filesMapphasePostingsassignsplitsMapReduce·TheindexconstructionalgorithmwejustdescribaninstanceofMapReduce.·MapReduce(DeanandGhemawat2004)isarobustandconceptuallysimpleframeworkfordistributedcomputing…·…withouthavingtowritecodeforthedistributpart.·TheydescribetheGoogleindexingsystem(ca.200asconsistingofanumberofphases,eachimplementedinMapReduce.·Index
construction
was
just
one
phase.·
Anotherphase
:transformingaterm-partitioned
indexintoadocument-partitionedindex.·
Term-partitioned:one
machinehandlesasubrangeofterms·
Document-partitioned:one
machinehandlesasubrange
ofdocuments·
As
we
’ll
discuss
in
the
web
part
of
the
course,mosearch
engines
use
a
document-partitioned
index…
betterload
balancing,etc.MapReduceSchemafor
indexconstructioninMapReduce·
Schemaofmapandreducefunctions·map:input→list(k,
v)
reduce
:
(k,
list(v))→
output·Instantiation
oftheschemaforindexconstruction
·map:collection→list(termID,docID)·reduce
:
(<termID1,list(docID)>,<termID2,list(docID
(postingslist1,
postingslist2,…)·Map:·d1:Ccame,Cc’ed.·d2:Cdied.→·<C,d1>,<came,d1>,<C,d1>,<c’ed,d1>,<C,d2>,<died,d2>·Reduce:·(<C,(d1,d2,d1)>,<died,(d2)>,<came,(d1)>,<c’→(<C,(d1:2,d2:1)>,<died,(d2:1)>,<came,(d1:<c’ed,(d1:1)>)Exampleforindexconstruction37Dynamic
indexing·Uptonow,wehaveassumedthatcollectionsare
static.·
Theyrarely
are
:·
Documents
come
in
over
time
and
need
to
be
inserted.·
Documentsaredeletedand
modified.·
This
meansthatthedictionaryandpostingslists
havetobemodified
:·
Postings
updatesfortermsalreadyindictionary·
Newtermsaddedtodictionary·
Maintain
“big
”
main
index·
Newdocsgointo“
small”auxiliaryindex·
Search
across
both,merge
results·
Deletions·
Invalidation
bit-vector
fordeleteddocs·
Filterdocsoutputon
asearch
result
by
thisinvalida
bit-vector·
Periodically,re-indexintoone
mainindexSimplest
approachIssueswithmainandauxiliaryindex·Problemoffrequent
merges
–
youtouchstuffalot·Poor
performance
during
merge·
Actually:·Mergingoftheauxiliaryindexintothe
mainindexisefficientifkeepaseparatefileforeachpostingslist.·
Merge
is
the
same
as
a
simple
append.·Butthen
we
wouldneedalotoffiles
–inefficientfor
OS.·Assumptionforthe
restofthelecture
:
Theindexisone
b
file.·
In
reality:
Useaschemesomewhereinbetween
(e.g.,spl
verylarge
postingslists,
collect
postingslists
ofleng
one
file
etc.)Logarithmicmerge·
Maintainaseriesofindexes,eachtwiceaslargethe
previous
one·
At
any
time,some
of
these
powers
of2are
instantiated·
Keepsmallest
(Z0)inmemory·
Larger
ones
(I0,I1,…)on
disk·
IfZ0
getstoo
big
(>n),
writetodisk
asI0·
or
merge
with
I0
(if
I0
already
exists)as
Z1
·
Either
write
mergeZ1
todiskasI1
(ifnoI1)
·
OrmergewithI1
toformZ2Logarithmicmerge·
Auxiliaryand
mainindex
:indexconstructiontim
O(T2)as
eachpostingistouchedineachmerge.·
Logarithmicmerge
:EachpostingismergedO(logTtimes,so
complexity
is
O(T
log
T)·Sologarithmicmergeismuchmoreefficient
for
indexconstruction·
But
query
processing
now
requires
the
merging
ofO(logT)indexes·
Whereas
it
is
O(1)if
you
just
have
a
main
and
auxiliarindex·
Collection-widestatisticsare
hardto
maintain·
E
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 江西省宜春市丰城市第九中学2024-2025学年高二上学期期中考试(日新班)政治试卷(含解析)
- 《反传销知识讲座的》课件
- 2024年度二手农机销售代理与合作合同
- 通史版2025届高考历史统考一轮复习第16讲课题1改革开放新篇章-从计划经济到市抄济和社会生活的变迁学案含解析
- 2024年度环保技术与设备采购合同(含治理、监测、节能减排)
- 2024别墅房地产销售代理合同
- 2024年度农业产品代销合同
- 2024年度医院配电室紧急照明系统合同
- 2024年度技术研发与技术服务合同终止协议
- 2024北京夫妻共同财产清算合同
- 徕卡v lux4中文说明书大约工作时间和可拍摄图像数量
- 格力2匹柜机检测报告KFR-50LW(50530)FNhAk-B1(性能)
- 分级护理制度考试题及答案
- 高考作文模拟写作:“德”与“得”导写及范文
- 意向性和と思う课件 高考日语复习
- 江苏专转本《大学语文》考纲
- 替代燃料汽车
- 山东省消防安全管理体系
- 放射科专科护理模拟习题(含参考答案)
- 康复工程详解演示文稿
- 五线谱乐谱稿纸
评论
0/150
提交评论