
Lecture 4: Index Construction

Plan
- Last lecture:
  - Dictionary data structures (Figure: a B-tree over the dictionary with root ranges a-hu, hy-m, n-z)
  - Tolerant retrieval: wildcards, spell correction, Soundex
  (Figure: dictionary excerpt with entries such as mace … madden and abandon … among … amortize.)
- This time: Index construction
  - How do we construct an index?
  - What strategies can we use with limited main memory?

Hardware basics

- Many design decisions in information retrieval are based on the characteristics of hardware.
- We begin by reviewing hardware basics.
- Access to data in memory is much faster than access to data on disk.
- Disk seeks: no data is transferred from disk while the disk head is being positioned.
- Therefore: transferring one large chunk of data from disk to memory is faster than transferring many small chunks.
- Disk I/O is block-based: reading and writing are done in whole blocks (as opposed to smaller chunks).
- Block sizes: 8 KB to 256 KB.
- Servers used in IR systems now typically have several GB of main memory, sometimes tens of GB.
- Available disk space is several (2-3) orders of magnitude larger.
- Fault tolerance is very expensive: it is much cheaper to use many regular machines rather than one fault-tolerant machine.

Hardware assumptions for this lecture

symbol | statistic                                        | value
s      | average seek time                                | 5 ms = 5 x 10^-3 s
b      | transfer time per byte                           | 0.02 μs = 2 x 10^-8 s
       | processor's clock rate                           | 10^9 s^-1
p      | low-level operation (e.g., compare & swap a word)| 0.01 μs = 10^-8 s
       | size of main memory                              | several GB
       | size of disk space                               | 1 TB or more

RCV1: Our collection for this lecture
- Shakespeare's collected works definitely aren't enough for demonstrating many of the points in this course.
- The collection we'll use isn't really large enough either, but it's publicly available and is at least a more plausible example.
- As an example for applying scalable index construction algorithms, we will use the Reuters RCV1 collection.
- This is one year of Reuters newswire (part of 1995 and 1996).

(Figure: A Reuters RCV1 document.)

Reuters RCV1 statistics
symbol | statistic              | value
N      | documents              | 800,000
L      | avg. # tokens per doc  | 200
M      | terms (= word types)   | 400,000
       | avg. # bytes per token | 6 (incl. spaces/punct.)
       | avg. # bytes per token | 4.5 (without spaces/punct.)
       | avg. # bytes per term  | 7.5
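Using the hardware and RCV1 numbers above, a quick back-of-the-envelope comparison of one large sequential read versus per-record seeks (a sketch; the 10^8 posting count and 12-byte entries are the figures this lecture uses later):

```python
# Back-of-the-envelope I/O costs from the lecture's hardware assumptions.
SEEK = 5e-3          # s, average seek time (table value s)
PER_BYTE = 0.02e-6   # s, transfer time per byte (table value b)

postings = 100_000_000   # RCV1 non-positional postings (used later in the lecture)
entry_bytes = 12         # (term, doc, freq) at 4 + 4 + 4 bytes

total_bytes = postings * entry_bytes
sequential = SEEK + total_bytes * PER_BYTE                    # one seek, then stream
seek_per_entry = postings * (SEEK + entry_bytes * PER_BYTE)   # one seek per entry

print(f"sequential read : {sequential:.0f} s")                 # ~24 s
print(f"one seek/entry  : {seek_per_entry / 86400:.1f} days")  # ~5.8 days
```

This is why the lecture insists on transferring large chunks: seek time, not transfer time, dominates when access is scattered.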

Recall IIR 1 index construction

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

- Documents are parsed to extract words, and these are saved with the Document ID.

Key step
- After all documents have been parsed, the inverted file is sorted by terms.
- We focus on this sort step. We have 100M items to sort.

Scaling index construction
- In-memory index construction does not scale.
- Can't stuff the entire collection into memory, sort, then write back.
- How can we construct an index for very large collections?
- Taking into account the hardware constraints we just learned about: memory, disk, speed, etc.

Sort-based index construction
- As we build the index, we parse docs one at a time.
- While building the index, we cannot easily exploit compression tricks (you can, but it is much more complex).
- The final postings for any term are incomplete until the end.
- At 12 bytes per non-positional postings entry (term, doc, freq), this demands a lot of space for large collections.
- T = 100,000,000 in the case of RCV1.
- So… we can do this in memory in 2009, but typical collections are much larger. E.g., the New York Times provides an index of >150 years of newswire.
- Thus: we need to store intermediate results on disk.

Sort using disk as "memory"?
- Can we use the same index construction algorithm for larger collections, but by using disk instead of memory?
- No: sorting T = 100,000,000 records on disk is too slow: too many disk seeks.
- We need an external sorting algorithm.
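Before going external, the parse-then-sort inversion recalled above can be demonstrated in memory on the two example documents (a sketch; tokenization is simplified to lowercased whitespace splitting):

```python
# Toy in-memory, sort-based inversion of the two example documents.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Parse: emit (term, docID) pairs, one doc at a time.
pairs = [(term, doc_id) for doc_id, text in docs.items()
         for term in text.lower().split()]

# Key step: sort by term (then docID); collapse duplicates into postings lists.
postings = {}
for term, doc_id in sorted(pairs):
    plist = postings.setdefault(term, [])
    if not plist or plist[-1] != doc_id:
        plist.append(doc_id)

print(postings["brutus"])  # [1, 2]
print(postings["caesar"])  # [1, 2]
```

This is exactly the step that stops fitting in memory once the collection is large.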

- Parse and build postings entries one doc at a time.
- Now sort postings entries by term (then by doc within each term).
- Doing this with random disk seeks would be too slow: we must sort T = 100M records.
- If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take? (With N = 10^8 and 5 ms seeks: N log2 N ≈ 2.7 x 10^9 comparisons at 10 ms of seek time each, i.e. about 2.7 x 10^7 s, close to a year.)
- Bottleneck: disk seeks.

BSBI: Blocked sort-based Indexing (sorting with fewer disk seeks)
- 12-byte (4+4+4) records (term, doc, freq).
- These are generated as we parse docs.
- Must now sort 100M such 12-byte records by term.
- Define a block as ~10M such records.
- Can easily fit a couple into memory.
- Will have 10 such blocks to start with.
- Basic idea of the algorithm:
  - Accumulate postings for each block, sort, write to disk.
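The block phase just described (accumulate records, sort each full block in memory, write it out as a sorted run) can be sketched as follows. The names, 3-record test sizes, and use of pickle are illustrative assumptions; a real implementation writes fixed-width 12-byte binary records:

```python
# Sketch of BSBI's block phase: sorted runs are written to disk one block at a time.
import os
import pickle
import tempfile

BLOCK_SIZE = 10_000_000  # postings per block (~10M in the lecture)

def _flush(block, out_dir):
    """Sort one full block in memory and write it to disk as a sorted run."""
    block.sort()
    fd, path = tempfile.mkstemp(suffix=".run", dir=out_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(block, f)
    return path

def write_blocks(pair_stream, block_size=BLOCK_SIZE, out_dir="."):
    """Accumulate (termID, docID) pairs; each time a block fills, sort and flush it."""
    block, paths = [], []
    for pair in pair_stream:
        block.append(pair)
        if len(block) >= block_size:
            paths.append(_flush(block, out_dir))
            block = []
    if block:
        paths.append(_flush(block, out_dir))
    return paths  # sorted runs, later merged into one long sorted order
```

Each run is internally sorted, so the later merge step only ever reads the runs sequentially.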

- Then merge the blocks into one long sorted order.
- Can do binary merges, with a merge tree of ⌈log2 10⌉ = 4 layers.
- During each layer, read runs into memory in blocks of 10M, merge, write back.

How to merge the sorted runs?
(Figure: sorted runs on disk being merged into a single merged run.)
- But it is more efficient to do a multi-way merge, where you are reading from all blocks simultaneously.
- Provided you read decent-sized chunks of each block into memory and then write out a decent-sized output chunk, you're not killed by disk seeks.

Remaining problem with sort-based algorithm
- Our assumption was: we can keep the dictionary in memory.
- We need the dictionary (which grows dynamically) in order to implement a term-to-termID mapping.
- Actually, we could work with (term, docID) postings instead of (termID, docID) postings...
- ... but then the intermediate files become very large. (We would end up with a scalable, but very slow, index construction method.)

SPIMI: Single-pass in-memory indexing
- Key idea 1: Generate separate dictionaries for each block: no need to maintain the term-termID mapping across blocks.
- Key idea 2: Don't sort. Accumulate postings in postings lists as they occur.
- With these two ideas we can generate a complete inverted index for each block.
- These separate indexes can then be merged into one big index.
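The two SPIMI key ideas (a fresh per-block dictionary, and unsorted accumulation of postings as they occur) can be sketched as follows. Names and the `max_postings` cutoff are illustrative assumptions; a real implementation also writes the block to disk:

```python
# Sketch of SPIMI-Invert for one block (simplified; no compression, no disk write).
def spimi_invert(token_stream, max_postings):
    """Accumulate postings directly in a per-block dictionary, in arrival order."""
    dictionary = {}  # term -> postings list; no term->termID mapping is needed
    n = 0
    for term, doc_id in token_stream:
        dictionary.setdefault(term, []).append(doc_id)
        n += 1
        if n >= max_postings:  # memory exhausted: this block is finished
            break
    # Terms are sorted only once, at block-write time, so the later block merge
    # can proceed sequentially.
    return sorted(dictionary.items())

block = spimi_invert(iter([("caesar", 1), ("brutus", 1), ("caesar", 2)]), 10)
print(block)  # [('brutus', [1]), ('caesar', [1, 2])]
```

Compared with BSBI, no global (term, docID) record list is built and sorted; postings go straight into their lists.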

- Merging of blocks is analogous to BSBI.

SPIMI-Invert
(Figure: the SPIMI-Invert pseudocode.)

- Compression makes SPIMI even more efficient.
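For both BSBI and SPIMI, the final step is a multi-way merge over sorted runs. A minimal in-memory sketch using Python's `heapq.merge`, which streams lazily and so mirrors the chunked reading described earlier (the tiny runs here are illustrative):

```python
# Multi-way merge of sorted runs, reading from all of them simultaneously.
import heapq

def multiway_merge(runs):
    """Merge sorted runs of (term, docID) pairs into one inverted index."""
    index = {}
    for term, doc_id in heapq.merge(*runs):
        index.setdefault(term, []).append(doc_id)
    return index

run1 = [("brutus", 1), ("caesar", 1)]
run2 = [("caesar", 2), ("calpurnia", 2)]
print(multiway_merge([run1, run2]))
# {'brutus': [1], 'caesar': [1, 2], 'calpurnia': [2]}
```

With on-disk runs, `runs` would be buffered file readers rather than lists; the merge logic is unchanged.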

- Compression of terms
- Compression of postings
- See next lecture.

Distributed indexing

- For web-scale indexing (don't try this at home!), we must use a distributed computing cluster.
- Individual machines are fault-prone.
- They can unpredictably slow down or fail.
- How do we exploit such a pool of machines?

Web search engine data centers
- Web search data centers (Google, Bing, Baidu) mainly contain commodity machines.
- Data centers are distributed around the world.
- Estimate: Google ~1 million servers, 3 million processors/cores (Gartner 2007).

Massive data centers

- If, in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system (all nodes up, assuming independent failures)?
- Answer: 0.999^1000 ≈ 37%; equivalently, the system is down about 63% of the time.
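The calculation behind that answer, under the stated independence assumption:

```python
# Uptime of a 1000-node non-fault-tolerant system: it is up only while ALL nodes
# are up, so with independent failures the uptimes multiply.
nodes, node_uptime = 1000, 0.999
system_uptime = node_uptime ** nodes
print(f"{system_uptime:.1%}")  # 36.8%, i.e. down roughly 63% of the time
```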

- Exercise: Calculate the number of servers failing per minute for an installation of 1 million servers.

Distributed indexing

- Maintain a master machine directing the indexing job, considered "safe".
- Break up indexing into sets of (parallel) tasks.
- The master machine assigns each task to an idle machine from a pool.

Parallel tasks

- We will use two sets of parallel tasks: parsers and inverters.
- Break the input document collection into splits.
- Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI).

Parsers
- The master assigns a split to an idle parser machine.
- A parser reads a document at a time and emits (term, doc) pairs.
- The parser writes the pairs into j partitions.
- Each partition is for a range of terms' first letters (e.g., a-f, g-p, q-z); here j = 3.
- Now to complete the index inversion...

Inverters
- An inverter collects all (term, doc) pairs (= postings) for one term-partition.
- It sorts them and writes to postings lists.

(Figure: Data flow. The master assigns splits to parsers in the map phase; parsers write (term, doc) pairs into segment files partitioned by term range (a-f, g-p, q-z); in the reduce phase, one inverter per partition collects its segment files and writes postings.)

MapReduce
- The index construction algorithm we just described is an instance of MapReduce.
- MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing...
- ... without having to write code for the distributed part.
- They describe the Google indexing system (ca. 200…) as consisting of a number of phases, each implemented in MapReduce.
- Index construction was just one phase.
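The parser/inverter pipeline just described can be caricatured as a single-process map and reduce (a toy sketch; the whole point of MapReduce is that the framework runs these functions distributed, and does the grouping between them):

```python
# Toy, single-process instantiation of map and reduce for index construction.
from collections import defaultdict

def map_phase(collection):
    """map: input -> list(k, v); the parsers' job: emit (term, docID) pairs."""
    return [(term, doc_id) for doc_id, text in collection
            for term in text.split()]

def reduce_phase(pairs):
    """reduce: (k, list(v)) -> output; the inverters' job: build postings lists."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:      # the framework normally performs this grouping
        grouped[term].append(doc_id)
    return {term: sorted(docs) for term, docs in grouped.items()}

collection = [(1, "C came C died"), (2, "C died")]
print(reduce_phase(map_phase(collection)))
# {'C': [1, 1, 2], 'came': [1], 'died': [1, 2]}
```

Note the repeated docID for 'C' in doc 1: grouping keeps one entry per occurrence, which is what lets the real reducer also compute term frequencies.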

- Another phase: transforming a term-partitioned index into a document-partitioned index.
- Term-partitioned: one machine handles a subrange of terms.
- Document-partitioned: one machine handles a subrange of documents.
- As we'll discuss in the web part of the course, most search engines use a document-partitioned index: better load balancing, etc.

Schema for index construction in MapReduce

- Schema of the map and reduce functions:
  - map: input → list(k, v)
  - reduce: (k, list(v)) → output
- Instantiation of the schema for index construction:
  - map: collection → list(termID, docID)
  - reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list 1, postings list 2, …)

Example for index construction
- Map:
  - d1: C came, C c'ed. d2: C died. →
  - <C,d1>, <came,d1>, <C,d1>, <c'ed,d1>, <C,d2>, <died,d2>
- Reduce:
  - (<C,(d1,d2,d1)>, <died,(d2)>, <came,(d1)>, <c'ed,(d1)>) → (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c'ed,(d1:1)>)

Dynamic indexing
- Up to now, we have assumed that collections are static.

- They rarely are:
  - Documents come in over time and need to be inserted.
  - Documents are deleted and modified.
- This means that the dictionary and postings lists have to be modified:
  - Postings updates for terms already in the dictionary.
  - New terms added to the dictionary.

Simplest approach
- Maintain a "big" main index.
- New docs go into a "small" auxiliary index.
- Search across both, merge results.
- Deletions:
  - Invalidation bit-vector for deleted docs.
  - Filter docs out of a search result by this invalidation bit-vector.
- Periodically, re-index into one main index.

Issues with main and auxiliary index
- Problem of frequent merges: you touch stuff a lot.
- Poor performance during a merge.
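The simplest approach above (main index + auxiliary index + invalidation bit-vector) can be sketched in a few lines. A Python set stands in for the bit-vector here; all names are illustrative:

```python
# Sketch of query processing over a main index plus an auxiliary index,
# filtering deleted documents via an invalidation structure.
def search(term, main_index, aux_index, deleted):
    """main_index/aux_index: term -> docID lists; deleted: set of invalidated docIDs."""
    hits = main_index.get(term, []) + aux_index.get(term, [])  # search both, merge
    return sorted(d for d in set(hits) if d not in deleted)    # then filter deletions

main = {"caesar": [1, 2]}
aux = {"caesar": [7], "brutus": [7]}  # newly added doc 7 lives only in the aux index
print(search("caesar", main, aux, deleted={2}))  # [1, 7]
```

When the auxiliary index grows too large, it is merged into the main index and emptied, which is exactly the frequent-merge cost discussed above.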

- Actually:
  - Merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
  - The merge is then the same as a simple append.
  - But then we would need a lot of files: inefficient for the OS.
- Assumption for the rest of the lecture: the index is one big file.
- In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.).

Logarithmic merge

- Maintain a series of indexes, each twice as large as the previous one.
- At any time, some of these powers of 2 are instantiated.
- Keep the smallest (Z0) in memory.
- Larger ones (I0, I1, …) on disk.
- If Z0 gets too big (> n), write it to disk as I0,
- or merge it with I0 (if I0 already exists) as Z1.
- Either write the merged Z1 to disk as I1 (if there is no I1),
- or merge it with I1 to form Z2, and so on.

- Auxiliary and main index: index construction time is O(T^2), as each posting is touched in each merge.
- Logarithmic merge: each posting is merged O(log T) times, so complexity is O(T log T).
- So logarithmic merge is much more efficient for index construction.
- But query processing now requires merging O(log T) indexes,
- whereas it is O(1) if you just have a main and an auxiliary index.
- Collection-wide statistics are hard to maintain.
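The cascading Z0/I0/I1 scheme described above can be sketched as follows. Plain sorted lists stand in for indexes, and names and the capacity `n` are illustrative assumptions; a real implementation merges on-disk index files:

```python
# Sketch of logarithmic merging: Z0 in memory; on overflow, cascade merges
# through on-disk levels I0, I1, ... until a free level is found.
def add_posting(Z0, disk, posting, n=4):
    """Z0: in-memory list; disk: level -> sorted list or None; n: Z0 capacity."""
    Z0.append(posting)
    if len(Z0) <= n:
        return
    run, level = sorted(Z0), 0          # write Z0 out as I0 ...
    Z0.clear()
    while disk.get(level) is not None:  # ... or merge with I_level to form Z_{level+1}
        run = sorted(run + disk[level])
        disk[level] = None
        level += 1
    disk[level] = run                   # first empty level receives the merged run

Z0, disk = [], {}
for p in range(10):
    add_posting(Z0, disk, p, n=2)
print(Z0, disk)  # [9] {0: [6, 7, 8], 1: [0, 1, 2, 3, 4, 5]}
```

Each posting moves up one level per merge, so it participates in O(log T) merges in total, which is where the O(T log T) bound comes from.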
