
Lecture 4: Index Construction

Plan
- Last lecture:
  - Dictionary data structures (Figure: a B-tree over the dictionary with root ranges a-hu, hy-m, n-z)
  - Tolerant retrieval: wildcards, spell correction, Soundex
  (Figure: dictionary excerpt with entries such as mace … madden and abandon … among … amortize.)
- This time: Index construction
  - How do we construct an index?
  - What strategies can we use with limited main memory?

Hardware basics

- Many design decisions in information retrieval are based on the characteristics of hardware.
- We begin by reviewing hardware basics.
- Access to data in memory is much faster than access to data on disk.
- Disk seeks: no data is transferred from disk while the disk head is being positioned.
- Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
- Disk I/O is block-based: reading and writing are done in whole blocks (as opposed to smaller chunks).
- Block sizes: 8 KB to 256 KB.
- Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
- Available disk space is several (2-3) orders of magnitude larger.
- Fault tolerance is very expensive: it is much cheaper to use many regular machines rather than one fault-tolerant machine.

Hardware assumptions for this lecture

symbol | statistic                                        | value
s      | average seek time                                | 5 ms = 5 x 10^-3 s
b      | transfer time per byte                           | 0.02 μs = 2 x 10^-8 s
       | processor's clock rate                           | 10^9 s^-1
p      | low-level operation (e.g., compare & swap a word)| 0.01 μs = 10^-8 s
       | size of main memory                              | several GB
       | size of disk space                               | 1 TB or more

RCV1: Our collection for this lecture
- Shakespeare's collected works definitely aren't enough for demonstrating many of the points in this course.
- The collection we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
- This is one year of Reuters newswire (part of 1995 and 1996).

(Figure: A Reuters RCV1 document.)

Reuters RCV1 statistics
symbol | statistic              | value
N      | documents              | 800,000
L      | avg. # tokens per doc  | 200
M      | terms (= word types)   | 400,000
       | avg. # bytes per token | 6 (incl. spaces/punct.)
       | avg. # bytes per token | 4.5 (without spaces/punct.)
       | avg. # bytes per term  | 7.5
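Using the hardware and RCV1 numbers above, a quick back-of-the-envelope comparison of one large sequential read versus per-record seeks (a sketch; the 10^8 posting count and 12-byte entries are the figures this lecture uses later):

```python
# Back-of-the-envelope I/O costs from the lecture's hardware assumptions.
SEEK = 5e-3          # s, average seek time (table value s)
PER_BYTE = 0.02e-6   # s, transfer time per byte (table value b)

postings = 100_000_000   # RCV1 non-positional postings (used later in the lecture)
entry_bytes = 12         # (term, doc, freq) at 4 + 4 + 4 bytes

total_bytes = postings * entry_bytes
sequential = SEEK + total_bytes * PER_BYTE                    # one seek, then stream
seek_per_entry = postings * (SEEK + entry_bytes * PER_BYTE)   # one seek per entry

print(f"sequential read : {sequential:.0f} s")                 # ~24 s
print(f"one seek/entry  : {seek_per_entry / 86400:.1f} days")  # ~5.8 days
```

This is why the lecture insists on transferring large chunks: seek time, not transfer time, dominates when access is scattered.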

Recall IIR 1 index construction

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

- Documents are parsed to extract words, and these are saved with the Document ID.

Key step
- After all documents have been parsed, the inverted file is sorted by terms.
- We focus on this sort step. We have 100M items to sort.

Scaling index construction
- In-memory index construction does not scale.
- Can't stuff the entire collection into memory, sort, then write back.
- How can we construct an index for very large collections?
- Taking into account the hardware constraints we just learned about: memory, disk, speed, etc.

Sort-based index construction
- As we build the index, we parse docs one at a time.
- While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
- The final postings for any term are incomplete until the end.
- At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
- T = 100,000,000 in the case of RCV1.
- So… we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire.
- Thus: we need to store intermediate results on disk.

Sort using disk as "memory"?
- Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
- No: sorting T = 100,000,000 records on disk is too slow: too many disk seeks.
- We need an external sorting algorithm.
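Before going external, the parse-then-sort inversion recalled above can be demonstrated in memory on the two example documents (a sketch; tokenization is simplified to lowercased whitespace splitting):

```python
# Toy in-memory, sort-based inversion of the two example documents.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Parse: emit (term, docID) pairs, one doc at a time.
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.lower().split()]

# Key step: sort by term (then docID); collapse duplicates into postings lists.
postings = {}
for term, doc_id in sorted(pairs):
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:
        plist.append(doc_id)

print(postings["brutus"])  # [1, 2]
print(postings["caesar"])  # [1, 2]
```

This is exactly the step that stops fitting in memory once the collection is large.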

- Parse and build postings entries one doc at a time.
- Now sort postings entries by term (then by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort T = 100M records.
- If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take? (With N = 10^8 and 5 ms seeks: N log2 N ≈ 2.7 x 10^9 comparisons at 10 ms of seek time each, i.e. about 2.7 x 10^7 s, close to a year.)
- Bottleneck: disk seeks.

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
- 12-byte (4+4+4) records (term, doc, freq).
- These are generated as we parse docs.
- Must now sort 100M such 12-byte records by term.
- Define a block as ~10M such records.
- Can easily fit a couple into memory.
- Will have 10 such blocks to start with.
- Basic idea of the algorithm:
  - Accumulate postings for each block, sort, write to disk.
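The block phase just described (accumulate records, sort each full block in memory, write it out as a sorted run) can be sketched as follows. The names, 3-record test sizes, and use of pickle are illustrative assumptions; a real implementation writes fixed-width 12-byte binary records:

```python
# Sketch of BSBI's block phase: sorted runs are written to disk one block at a time.
import os
import pickle
import tempfile

BLOCK_SIZE = 10_000_000  # postings per block (~10M in the lecture)

def _flush(block, out_dir):
    """Sort one full block in memory and write it to disk as a sorted run."""
    block.sort()
    fd, path = tempfile.mkstemp(suffix=".run", dir=out_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(block, f)
    return path

def write_blocks(pair_stream, block_size=BLOCK_SIZE, out_dir="."):
    """Accumulate (termID, docID) pairs; each time a block fills, sort and flush it."""
    block, paths = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) >= block_size:
            paths.append(_flush(block, out_dir))
            block = []
    if block:
        paths.append(_flush(block, out_dir))
    return paths  # sorted runs, later merged into one long sorted order
```

Each run is internally sorted, so the later merge step only ever reads the runs sequentially.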

- Then merge the blocks into one long sorted order.
- Can do binary merges, with a merge tree of ⌈log2 10⌉ = 4 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.

How to merge the sorted runs?
(Figure: sorted runs on disk being merged into a single merged run.)
- But it is more efficient to do a multi-way merge, where you are reading from all blocks simultaneously.
- Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you're not killed by disk seeks.

Remaining problem with sort-based algorithm
- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings...
- ... but then the intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)

SPIMI: Single-pass in-memory indexing
- Key idea 1: Generate separate dictionaries for each block: no need to maintain the term-termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.
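The two SPIMI key ideas (a fresh per-block dictionary, and unsorted accumulation of postings as they occur) can be sketched as follows. Names and the `max_postings` cutoff are illustrative assumptions; a real implementation also writes the block to disk:

```python
# Sketch of SPIMI-Invert for one block (simplified; no compression, no disk write).
def spimi_invert(token_stream, max_postings):
    """Accumulate postings directly in a per-block dictionary, in arrival order."""
    dictionary = {}  # term -> postings list; no term->termID mapping is needed
    n = 0
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)
        n += 1
        if n >= max_postings:  # memory exhausted: this block is finished
            break
    # Terms are sorted only once, at block-write time, so the later block merge
    # can proceed sequentially.
    return sorted(dictionary.items())

block = spimi_invert(iter([("caesar", 1), ("brutus", 1), ("caesar", 2)]), 10)
print(block)  # [('brutus', [1]), ('caesar', [1, 2])]
```

Compared with BSBI, no global (term, docID) record list is built and sorted; postings go straight into their lists.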

- Merging of blocks is analogous to BSBI.

SPIMI-Invert
(Figure: the SPIMI-Invert pseudocode.)

- Compression makes SPIMI even more efficient.
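For both BSBI and SPIMI, the final step is a multi-way merge over sorted runs. A minimal in-memory sketch using Python's `heapq.merge`, which streams lazily and so mirrors the chunked reading described earlier (the tiny runs here are illustrative):

```python
# Multi-way merge of sorted runs, reading from all of them simultaneously.
import heapq

def multiway_merge(runs):
    """Merge sorted runs of (term, docID) pairs into one inverted index."""
    index = {}
    for term, doc_id in heapq.merge(*runs):
        index.setdefault(term, []).append(doc_id)
    return index

run1 = [("brutus", 1), ("caesar", 1)]
run2 = [("caesar", 2), ("calpurnia", 2)]
print(multiway_merge([run1, run2]))
# {'brutus': [1], 'caesar': [1, 2], 'calpurnia': [2]}
```

With on-disk runs, `runs` would be buffered file readers rather than lists; the merge logic is unchanged.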

- Compression of terms
- Compression of postings
- See next lecture.

Distributed indexing

- For web-scale indexing (don't try this at home!), we must use a distributed computing cluster.
- Individual machines are fault-prone.
- They can unpredictably slow down or fail.
- How do we exploit such a pool of machines?

Web search engine data centers
- Web search data centers (Google, Bing, Baidu) mainly contain commodity machines.
- Data centers are distributed around the world.
- Estimate: Google ~1 million servers, 3 million processors/cores (Gartner 2007).

Massive data centers

- If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system (all nodes up, assuming independent failures)?
- Answer: 0.999^1000 ≈ 37%; equivalently, the system is down about 63% of the time.
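The calculation behind that answer, under the stated independence assumption:

```python
# Uptime of a 1000-node non-fault-tolerant system: it is up only while ALL nodes
# are up, so with independent failures the uptimes multiply.
nodes, node_uptime = 1000, 0.999
system_uptime = node_uptime ** nodes
print(f"{system_uptime:.1%}")  # 36.8%, i.e. down roughly 63% of the time
```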

- Exercise: Calculate the number of servers failing per minute for an installation of 1 million servers.

Distributed indexing

- Maintain a master machine directing the indexing job, considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.

Parallel tasks

- We will use two sets of parallel tasks: parsers and inverters.
- Break the input document collection into splits.
- Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).

Parsers
- The master assigns a split to an idle parser machine.
- A parser reads a document at a time and emits (term, doc) pairs.
- The parser writes the pairs into j partitions.
- Each partition is for a range of terms' first letters (e.g., a-f, g-p, q-z); here j = 3.
- Now to complete the index inversion...

Inverters
- An inverter collects all (term, doc) pairs (= postings) for one term-partition.
- It sorts them and writes to postings lists.

(Figure: Data flow. The master assigns splits to parsers in the map phase; parsers write (term, doc) pairs into segment files partitioned by term range (a-f, g-p, q-z); in the reduce phase, one inverter per partition collects its segment files and writes postings.)

MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing...
- ... without having to write code for the distributed part.
- They describe the Google indexing system (ca. 200…) as consisting of a number of phases, each implemented in MapReduce.
- Index construction was just one phase.
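The parser/inverter pipeline just described can be caricatured as a single-process map and reduce (a toy sketch; the whole point of MapReduce is that the framework runs these functions distributed, and does the grouping between them):

```python
# Toy, single-process instantiation of map and reduce for index construction.
from collections import defaultdict

def map_phase(collection):
    """map: input -> list(k, v); the parsers' job: emit (term, docID) pairs."""
    return [(term, doc_id) for doc_id, text in collection
            for term in text.split()]

def reduce_phase(pairs):
    """reduce: (k, list(v)) -> output; the inverters' job: build postings lists."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:      # the framework normally performs this grouping
        grouped[term].append(doc_id)
    return {term: sorted(docs) for term, docs in grouped.items()}

collection = [(1, "C came C died"), (2, "C died")]
print(reduce_phase(map_phase(collection)))
# {'C': [1, 1, 2], 'came': [1], 'died': [1, 2]}
```

Note the repeated docID for 'C' in doc 1: grouping keeps one entry per occurrence, which is what lets the real reducer also compute term frequencies.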

- Another phase: transforming a term-partitioned index into a document-partitioned index.
- Term-partitioned: one machine handles a subrange of terms.
- Document-partitioned: one machine handles a subrange of documents.
- As we'll discuss in the web part of the course, most search engines use a document-partitioned index: better load balancing, etc.

Schema for index construction in MapReduce

- Schema of the map and reduce functions:
  - map: input → list(k, v)
  - reduce: (k, list(v)) → output
- Instantiation of the schema for index construction:
  - map: collection → list(termID, docID)
  - reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list 1, postings list 2, …)

Example for index construction
- Map:
  - d1: C came, C c'ed. d2: C died. →
  - <C,d1>, <came,d1>, <C,d1>, <c'ed,d1>, <C,d2>, <died,d2>
- Reduce:
  - (<C,(d1,d2,d1)>, <died,(d2)>, <came,(d1)>, <c'ed,(d1)>) → (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c'ed,(d1:1)>)

Dynamic indexing
- Up to now, we have assumed that collections are static.

- They rarely are:
  - Documents come in over time and need to be inserted.
  - Documents are deleted and modified.
- This means that the dictionary and postings lists have to be modified:
  - Postings updates for terms already in the dictionary.
  - New terms added to the dictionary.

Simplest approach
- Maintain a "big" main index.
- New docs go into a "small" auxiliary index.
- Search across both, merge results.
- Deletions:
  - Invalidation bit-vector for deleted docs.
  - Filter docs out of a search result by this invalidation bit-vector.
- Periodically, re-index into one main index.

Issues with main and auxiliary index
- Problem of frequent merges: you touch stuff a lot.
- Poor performance during a merge.
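The simplest approach above (main index + auxiliary index + invalidation bit-vector) can be sketched in a few lines. A Python set stands in for the bit-vector here; all names are illustrative:

```python
# Sketch of query processing over a main index plus an auxiliary index,
# filtering deleted documents via an invalidation structure.
def search(term, main_index, aux_index, deleted):
    """main_index/aux_index: term -> docID lists; deleted: set of invalidated docIDs."""
    hits = main_index.get(term, []) + aux_index.get(term, [])  # search both, merge
    return sorted(d for d in set(hits) if d not in deleted)    # then filter deletions

main = {"caesar": [1, 2]}
aux = {"caesar": [7], "brutus": [7]}  # newly added doc 7 lives only in the aux index
print(search("caesar", main, aux, deleted={2}))  # [1, 7]
```

When the auxiliary index grows too large, it is merged into the main index and emptied, which is exactly the frequent-merge cost discussed above.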

- Actually:
  - Merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
  - The merge is then the same as a simple append.
  - But then we would need a lot of files: inefficient for the OS.
- Assumption for the rest of the lecture: the index is one big file.
- In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).

Logarithmic merge

- Maintain a series of indexes, each twice as large as the previous one.
- At any time, some of these powers of 2 are instantiated.
- Keep the smallest (Z0) in memory.
- Larger ones (I0, I1, …) on disk.
- If Z0 gets too big (> n), write it to disk as I0,
- or merge it with I0 (if I0 already exists) as Z1.
- Either write the merged Z1 to disk as I1 (if there is no I1),
- or merge it with I1 to form Z2, and so on.

- Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
- Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T).
- So logarithmic merge is much more efficient for index construction.
- But query processing now requires merging O(log T) indexes,
- whereas it is O(1) if you just have a main and an auxiliary index.
- Collection-wide statistics are hard to maintain.
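The cascading Z0/I0/I1 scheme described above can be sketched as follows. Plain sorted lists stand in for indexes, and names and the capacity `n` are illustrative assumptions; a real implementation merges on-disk index files:

```python
# Sketch of logarithmic merging: Z0 in memory; on overflow, cascade merges
# through on-disk levels I0, I1, ... until a free level is found.
def add_posting(Z0, disk, posting, n=4):
    """Z0: in-memory list; disk: level -> sorted list or None; n: Z0 capacity."""
    Z0.append(posting)
    if len(Z0) <= n:
        return
    run, level = sorted(Z0), 0          # write Z0 out as I0 ...
    Z0.clear()
    while disk.get(level) is not None:  # ... or merge with I_level to form Z_{level+1}
        run = sorted(run + disk[level])
        disk[level] = None
        level += 1
    disk[level] = run                   # first empty level receives the merged run

Z0, disk = [], {}
for p in range(10):
    add_posting(Z0, disk, p, n=2)
print(Z0, disk)  # [9] {0: [6, 7, 8], 1: [0, 1, 2, 3, 4, 5]}
```

Each posting moves up one level per merge, so it participates in O(log T) merges in total, which is where the O(T log T) bound comes from.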
