IndexCompressionTennesseeTechUniversity索引压缩课件_第1页
IndexCompressionTennesseeTechUniversity索引压缩课件_第2页
IndexCompressionTennesseeTechUniversity索引压缩课件_第3页
IndexCompressionTennesseeTechUniversity索引压缩课件_第4页
IndexCompressionTennesseeTechUniversity索引压缩课件_第5页
已阅读5页,还剩46页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

1、MotivationUncompressed indexes are largeIt might be useful for some modern devices to support information retrieval techniques that would not be able to do with uncompressed indexes第1页,共51页。Motivation (cont.)Disk I/O is slow第2页,共51页。Types of CompressionLossyCompression that involves the removal of d

2、ata.LoselessCompression that involves no removal of data.第3页,共51页。OverviewA lossy compression schemeStatic Index PruningLoseless compressionElias Codesn-s encodingGolomb encodingVariable Byte Encoding (vByte)Fixed Binary CodewordsCPSS-Tree第4页,共51页。Static Index PruningGoal is to reduce the size of th

3、e index without reducing the precision such that a human cant tell the difference between a pruned index and non-pruned indexFocuses on the top k or top resultsAssumes there is a scoring functionAssumes the function is based off of some table A such that A(t,d) 0 if t is within d and A(t,d) = 0 othe

4、rwise第5页,共51页。Static Index Pruning (cont.)Two approachesDefined as Uniform pruning.The removal of “all posting entries whose corresponding table values are bounded above by some fixed cutoff threshold”Could have a terms entire posting list prunedDefined as Term based pruningAn approach that attempts

5、 to guarantee that every term will have at least some entries remaining in the index第6页,共51页。Static Index Pruning (cont.)Scoring functions are fuzzyOnly need to find some scoring function S such that S is within a factor of epsilon of SCarmel et al proved this mathematically for both uniform and ter

6、m-based methods第7页,共51页。Static Index Pruning (cont.)第8页,共51页。Static Index Pruning resultsFound that the idealized top k pruning algorithm did not work very wellThe smallest value in the posting list was almost always above their threshold so little pruning was doneModified the algorithm to apply a s

7、hiftSubtracted the smallest value from all positive scores with the listGreatly increased the pruning第9页,共51页。Static Index Pruning results (cont.)第10页,共51页。Static Index Pruning results (cont.)第11页,共51页。Static Index Pruning results (cont.)第12页,共51页。OverviewLoseless Compression第13页,共51页。Elias CodesNon

8、-parameterized bitwise method of coding integersGamma CodesRepresent a positive integer k with stored as a unary code. This is followed by the binary representation of the number without the most significant bitNot efficient for numbers larger than 15第14页,共51页。Elias Codes (cont.)Delta CodesRepresent

9、 a positive integer k with stored as a gamma code. This is followed by the binary representation of the number without the most significant bitNot efficient for small values第15页,共51页。n-s codingParameterized, bitwise encodingUses a block of n bits followed by s stop bits.Also contains a parameter b w

10、hich refers to the base of the number. Meaning, the numbers represented in the blocks of n size cannot be greater than or equal to b.第16页,共51页。n-s coding exampleLet n=3, s=2, and the base be 6.Valid data blocks are 000, 001, 010, 011, 100, and 101.101 100 001 11 would have the value of 5416第17页,共51页

11、。n-s coding (cont.)2 used n-s codes with prefix omission and run-length encodingEx.第18页,共51页。n-s coding (cont.)Run-length encoding is the process of replacing non-initial elements of a sequence with differences between adjacent elements. E.g.第19页,共51页。n-s coding results第20页,共51页。Golomb codingBetter

12、compression and faster retrieval than Elias codesIs parameterizedThis is usually stored separate using some other compression scheme第21页,共51页。vByte codingA very simple bytewise compression schemeUses 7 bits to code the data portion and the most significant bit is reserved as a flag bit.第22页,共51页。Sch

13、oler et. al.Defined an inverted list to be the following:Where the list is Example inverted list for term “Matthew”:Uses different coding schemes per partE.g. Golomb for freq, Gamma for doc, and vByte for offset第23页,共51页。Scholer et al. (cont.)One optimization is to require encoding to be byte aligne

14、d so that decompression can be fasterAnother optimization when referring to Boolean or ranked queries is to ignore the offsets and only take into account flag bits within the offset.Referred to as scanning第24页,共51页。Scholer et al. (cont.)Third optimization is called signature blocks.An eight bit bloc

15、k that stores the flag bits of up to eight blocks that follow.For example: 11100101Represents 5 integers that are stored in the eight blocksRequires more space but allows the data blocks to use all 8 bits instead of 7.第25页,共51页。Scholer et al. results第26页,共51页。Scholer et al. results (cont.)第27页,共51页。

16、Scholer et al. results (cont.)第28页,共51页。Fixed Binary CodesOften times the inverted list will be stored as a series of difference gaps between documents like so,This reduces the amount of bits required to represent a document IDs on average第29页,共51页。Fixed Binary Codes (cont.)Take for example the foll

17、owing list of d-gaps:If a binary code was used to encode this list, 6 bits would be used on each codeword when that would be unnecessary 第30页,共51页。Fixed Binary Codes (cont.)Instead encode as spans:where the notation would indicate that w-bit binary codes are to be used to code each of the next s val

18、ues.Similar to the approach of Anh and Moffat第31页,共51页。Anh and MoffatUses a selector then data representation for encodingA selector can be thought of as the unary portion of gamma codesData representation would be the binary portion of gamma codesThe selector uses a table of values where each case

19、is determined on the w-value and is relative to the previous case.第32页,共51页。Anh and Moffat (cont.)第33页,共51页。Anh and Moffat (cont.)Using this list and assuming s1= 1, s2= 2, and s3= 4From the table on the previous slide we get the followingWith each selector as 4 bits (2 bits for w 3, 2 bits to choos

20、e s1-s3) it takes 16 bits plus the summation of all of the w x s pairs. So, 57 bits are used to encode this list. It would take 60 bits for gamma code.第34页,共51页。Anh and Moffat (cont.)第35页,共51页。Anh and Moffat (cont.)The use of parsing is involved to discover segments.A graph is used in combination wi

21、th shortest path labelingEach node is a d-gap and the width to code itEach outgoing edge is a different way in which selector might be used to cover some subsequent gaps.第36页,共51页。Anh and Moffat (cont.)A multiplier is used since every list can be different but the values for s1, s2, and s3 are fixed

22、.For example, if m=2 and s1= 1, s2= 2, and s3= 4, or 1-2-4, then they would be equal to 2-4-8.An escape sequence can also be used on lists that have gaps that span larger than s3 would allow.This is the addition of an extra 4 bits stating that up to 15m gaps can be placed under one selector第37页,共51页

23、。Anh and Moffat results (cont.)第38页,共51页。Anh and Moffat results (cont.)第39页,共51页。Anh and Moffat第40页,共51页。Speeding up decodingNeed to exploit the cache and reduce both cache misses and TLB missesUse CSS-trees or CPSS-treesCSS-trees are cache-sensitive search trees that are a variation on m-ary trees.

24、By making each node contiguous this reduces the need for child pointersThis allows for each node to fit into a cache line (32/64 bit)第41页,共51页。CSS-Tree vs m-ary Tree第42页,共51页。CPSS-treesCache/Page sensitive search trees main purpose is to reduce number cache/TLB misses during random searchesAccomplis

25、hed by making each node, except the root, 4 KB in size and contains several CSS-TreesThe CSS-Trees are the same size as a cache line and contain the postingsEither 32 or 64 bit第43页,共51页。CPSS-trees results第44页,共51页。CPSS-trees results (cont.)第45页,共51页。Compressed CPSS-trees results第46页,共51页。Compressed

26、CPSS-tree results第47页,共51页。Questions?Questions?第48页,共51页。References1 David Carmel, Doron Cohen, Ronald Fagin, Eitan Farchi, Michael Herscovici, Yoelle S. Maarek, Aya Soffer. Static Index Pruning for Information Retrieval Systems. SIGIR 01: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pgs 43-50, 2001.2 Gordon Linoff, Craig Stanfill. Compression of Indexes with Full Positional Inform

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论