




版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领
文档简介
1、Finding the Frequent Items in Streams of Data,By Graham Cormode and Marios Hadjieleftheriou Presented by Ankur Agrawal (09305002),Many data generation processes produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole. Examples Se
2、quence of queries posed to an Internet search engine. Collection of transactions across all branches of a supermarket chain. Data can arrive at enormous rates - hundreds of gigabytes per day or higher.,Data Streams,Finding the Frequent Items in Streams of Data,2,Storing and indexing such large amoun
3、t of data is costly. Important to process the data as it happens. Provide up to the minute analysis and statistics. Algorithms take only a single pass over their input. Compute various functions using resources that are sublinear in the size of input.,Streaming Model,Finding the Frequent Items in St
4、reams of Data,3,Problems range from simple to very complex. Given a stream of transactions, finding the mean and standard deviation of the bill totals. Requires only few sufficient statistics to be stored. Determining whether a search query has already appeared in the stream of queries. Requires a l
5、arge amount of information to be stored. Algorithms must Respond quickly to new information. Use very less amount of resources in comparison to total quantity of data.,Finding the Frequent Items in Streams of Data,4,Given a stream of items, the problem is simply to find those items which occur most
6、frequently. Formalized as finding all items whose frequency exceeds a specified fraction of the total number of items. Variations arise when the items are given weights, and further when these weights can also be negative.,Frequent Items Problem,Finding the Frequent Items in Streams of Data,5,The pr
7、oblem is important both in itself and as a subroutine in more advanced computations. For example, It can help in routing decisions, for in-network caching etc (if items represent packets on the Internet). Can help in finding popular terms if items represent queries made to an Internet search engine.
8、 Mining frequent itemsets inherently builds on this problem as a basic building block. Algorithms for the problem have been applied by large corporations: AT&T and Google.,Motivation,Finding the Frequent Items in Streams of Data,6,Given a stream S of n items t1tn, the exact -frequent items comprise
9、the set i | fin, where fi is the frequency of item i. Solving the exact frequent items problem requires (n) space. Approximate version is defined based on tolerance for error parameterized by .,Variations,Finding the Frequent Items in Streams of Data,7,Given a stream S of n items, the -approximate f
10、requent items problem is to return a set of items F so that for all items i F, fi(-)n, and there is no iF such that fin.,-approximate Frequent Items Problem,Finding the Frequent Items in Streams of Data,8,Given a stream S of n items, the frequency estimation problem is to process a stream so that, g
11、iven any i, an fi* is returned satisfying fi fi* fi+n.,Frequency Estimation Problem,Finding the Frequent Items in Streams of Data,9,Two main classes of algorithms: Counter-based Algorithms Sketch Algorithms Other Solutions: Quantiles : based on various notions of randomly sampling items from the inp
12、ut, and of summarizing the distribution of items. Less effective and have attracted less interest.,Solutions,Finding the Frequent Items in Streams of Data,10,Track a subset of items from the input and monitor their counts. Decide for each new arrival whether to store or not. Decide what count to ass
13、ociate with it. Cannot handle negative weights.,Counter-based Algorithms,Finding the Frequent Items in Streams of Data,11,Problem posed by J. S. Moore in Journal of Algorithms, in 1981.,Majority Problem,Finding the Frequent Items in Streams of Data,12,Start with a counter set to zero. For each item:
14、 If counter is zero, store the item, set counter to 1. Else, if item is same as item stored, increment counter. Else, decrement counter. If there is a majority item, it is the item stored. Proof outline: Each decrement pairs up two different items and cancels them out. Since majority occurs N/2 time
15、s, not all of its occurrences can be canceled out.,Finding the Frequent Items in Streams of Data,13,Majority Algorithm,First proposed by Misra and Gries in 1982. Finds all items in a sequence whose frequency exceeds a 1/k fraction of the total count. Stores k-1 (item, counter) pairs. A generalizatio
16、n of the Majority algorithm.,The Frequent Algorithm,Finding the Frequent Items in Streams of Data,14,For each new item: Increment the counter if the item is already stored. If k items stored, then store new item with counter set to 1. Otherwise decrement all the counters. If any counter equals 0, th
17、en delete the corresponding item.,Finding the Frequent Items in Streams of Data,15,The Frequent Algorithm (contd.),Example,6,1,+1,+1,+1,3,1,2,0,2,Finding the Frequent Items in Streams of Data,16,Example (contd.),-1,-1,+1,3,1,2,1,2,Finding the Frequent Items in Streams of Data,17,5,0,-1,-1,6,1,Pseudo
18、 Code,Finding the Frequent Items in Streams of Data,18,Time cost involves: O(1) dictionary operations per update. Cost of decrementing counts. Can be performed in O(1) time. Also solves the frequency estimation problem if executed with k=1/.,Analysis,Finding the Frequent Items in Streams of Data,19,
19、Proposed by Manku and Motwani in 2002. Stores tuples comprising An item A lower bound on its count A delta () value which records the difference between the upper bound and the lower bound.,LossyCounting Algorithm,Finding the Frequent Items in Streams of Data,20,For processing i th item: Increment t
20、he lower bound if it is already stored. Else create a new tuple for the item with lower bound set to 1 and set to i/k. Periodically delete tuples with upper bound less than i/k. Uses O(k log(n/k) space in worst case. Also solves frequency estimation problem with k=1/.,Finding the Frequent Items in S
21、treams of Data,21,LossyCounting Algorithm (contd.),Pseudo Code,Finding the Frequent Items in Streams of Data,22,* The CACM 2009 version of the paper has erroneously shown if block (line 8) outside for loop.,Introduced by Metwally et al. in 2005. Store k (item, count) pairs. Initialize by first k dis
22、tinct items and their exact counts. If new item is not already stored, replace the item with least count and set the counter to 1 more than the least count. Items which are stored by the algorithm early in the stream and are not removed have very accurate estimated counts.,SpaceSaving Algorithm,Find
23、ing the Frequent Items in Streams of Data,23,Pseudo Code,Finding the Frequent Items in Streams of Data,24,The space required is O(k). The time cost involves cost of: the dictionary operation of finding if an item is stored. finding and maintaining the item with minimum count. Simple heap implementat
24、ions can track the smallest count item in O(log k) time per update. As in other counter based algorithms, it also solves frequency estimation problem with k=1/.,Finding the Frequent Items in Streams of Data,25,Analysis,The term “sketch” denotes a data structure. a linear projection of the input freq
25、uency vector. Sketch is the product of frequency vector with a matrix. Updates with negative values can easily be accommodated. Allows us to model the removal of items. Primarily solve the frequency estimation problem. Algorithms are randomized. Also take a failure probability so that the probabilit
26、y of failure is at most .,Sketch Algorithms,Finding the Frequent Items in Streams of Data,26,Introduced by Charikar et al. in 2002. Each update only affects a small subset of the sketch. The sketch consists of a two-dimensional array C with d rows of w counters Each. Uses two hash functions for each
27、 row: hj which maps input items onto w. gj which maps input items onto 1, +1. Each input item i Add gj(i) to entry C j, hj(i) in row j, for 1 j d.,CountSketch Algorithm,Finding the Frequent Items in Streams of Data,27,Finding the Frequent Items in Streams of Data,28,Sketch Structure,Pseudo Code,Find
28、ing the Frequent Items in Streams of Data,29,For any row j, the value gj(i)*Cj,hj(i) is an unbiased estimator for fi. The estimate fi* is the median of these estimates over the d rows.,Setting d=log(4/) and w=O(1/2) ensures that fi has error at most n with probability of at least 1. Requires the has
29、h functions to be chosen randomly from a family of “fourwise independent” hash functions. The total space used is . The time per update is in worst case.,Finding the Frequent Items in Streams of Data,30,Analysis,Introduced by Cormode and Muthukrishnan in 2005. Similarly uses a sketch consisting of 2
30、-D array of d rows of w counters each. Uses d hash functions hj, one for each row. Each update is mapped onto d entries in the array, each of which is incremented. Now frequency estimate for any item is fi* = min 1jd C j, hj(i),CountMin Sketch Algorithm,Finding the Frequent Items in Streams of Data,
31、31,Pseudo Code,Finding the Frequent Items in Streams of Data,32,The estimate for an item in each row overestimates by less than n/w. Can be shown using Markov inequality. Setting d=log(1/) and w=O(1/) ensures that fi has error at most n with probability of at least 1. Similar to CountSketch with gj(
32、i) always equal to 1. The total space used is . The time per update is in worst case.,Finding the Frequent Items in Streams of Data,33,Analysis,Sketches estimate the frequency of a single item. How to find frequent items without explicitly storing sketch for all items? Divide-and-conquer approach li
33、mits search cost. Impose a binary tree over the domain Keep a sketch of each level of the tree Descend if a node is heavy, else stop Correctness: all ancestors of a frequent item are also frequent. Assumes that negative updates are possible but no negative frequencies are allowed. Updates take hashi
34、ng operations, and O(1) counter updates for each hash.,Finding Frequent Items Using a Hierarchy,Finding the Frequent Items in Streams of Data,34,Randomly divides the input into buckets. Expect at most one frequent item in each bucket. Within each bucket, the items are divided into subgroups so that
35、the “weight” of each group indicates the identity of the frequent item. Repeating this for all bit positions reveals the full identity of the item.,Finding Frequent Items Using Group Testing,Finding the Frequent Items in Streams of Data,35,Experiments performed using 10 million packets of HTTP traff
36、ic, representing 24 hours of traffic from a backbone router in a major network. The frequency threshold varied from 0.0001 to 0.01 and the error guarantee set to . Compare the efficiency of the algorithms with respect to: Update throughput, measured in number of updates per millisecond. Space consumed, measured in bytes. Recall, measured in the total number of true heavy hitters reported over the number of true heavy hitters given by an exact algorithm. Precision, measured in total number of true heavy hitters reported over the total numbe
温馨提示
- 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
- 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
- 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
- 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
- 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
- 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
- 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。
最新文档
- 碎石场安全协议书
- 体验式零售创新店企业制定与实施新质生产力项目商业计划书
- 被派遣人员协议书
- 老年游免责协议书
- 历史文化名城夜游行业深度调研及发展项目商业计划书
- 动漫周边商品购物中心行业深度调研及发展项目商业计划书
- 高精度数控机床定制服务企业制定与实施新质生产力项目商业计划书
- 作家与艺术家职业中断保险行业跨境出海项目商业计划书
- 湖南省名校联考联合体2022-2023学年高一下学期入学考试物理(原卷版)
- 人教版语文六年级下册20 真理诞生于一百个问号之后阅读练习卷
- 班组工程量结算书
- 最新易制毒化学品管理制度大全
- 机载直流用电设备电源特性要求及试验方法
- 养老院老人入(出)院流程图
- 健康照护教材课件汇总完整版ppt全套课件最全教学教程整本书电子教案全书教案课件合集
- 最新-临时救助申请审核审批表模板
- 《有效沟通》PPT课件-(2)
- 家庭室内装饰装修工程验收单
- 青春红绿灯教学设计中小学心理健康心理游戏脚本
- 《城镇土地使用税纳税申报表》
- 三年级数学下册口算脱式竖式练习题
评论
0/150
提交评论