A Generic High Throughput Architecture for Stream Processing

Christos Rousopoulos, Ektoras Karandeinos, Grigorios Chrysos, Apostolos Dollas and Dionisios N. Pnevmatikatos
School of Electrical and Computer Engineering, Technical University of Crete, Chania, Greece 73100

Abstract
Stream join is a fundamental and computationally expensive data mining operation for relating information from different data streams. This paper presents two FPGA-based architectures that accelerate stream join processing. The proposed hardware-based systems were implemented on a multi-FPGA hybrid system with high memory bandwidth. The experimental evaluation shows that our proposed systems can outperform a software-based solution that runs on a high-end, 48-core multiprocessor platform by at least one order of magnitude. In addition, the proposed solutions outperform any other previously proposed hardware-based or software-based solution for stream join processing. Finally, our proposed hardware-based architectures can be used as generic templates to map stream processing algorithms onto reconfigurable logic, taking into consideration real-world challenges and restrictions.

I. INTRODUCTION

Data Stream Mining is the process of extracting knowledge structures from continuous, rapid data streams. The stream join is a fundamental operation in many applications for relating information from different data streams [1]. The join operation usually takes place over a specific time-based window due to the unbounded size of the data streams [2], and it is computationally expensive [3].

The performance of the stream join operator is considered a key challenge for data mining systems. Due to the high computational cost of stream join, various software and hardware implementations using multi-core CPUs, field programmable gate arrays (FPGAs) or massively parallel processor arrays (MPPAs) have been proposed in order to achieve high throughput and low latency.

This work presents the mapping of a stream join operator algorithm, i.e. ScaleJoin [4], onto reconfigurable logic. The proposed solutions outperform any other state-of-the-art stream join implementation. The contribution of this work is:
- We propose two different FPGA-based architectures for the stream join operator that improve performance by taking advantage of the fine-grained and the coarse-grained parallelism that reconfigurable hardware can offer.
- Our implementations exploit the high memory bandwidth available on the target hybrid platform and also achieve high resource utilization of its available FPGA devices.
- The implemented systems achieve at least one order of magnitude better throughput in comparison to the fastest multi-threaded stream join solutions.
Finally, both proposed architectures are scalable and generic. These architectures can be used for many problems that operate on streaming data.

The rest of the paper is organized as follows: Section II presents the related software-based and hardware-based solutions for the stream join operation. Section III describes the stream join problem and the ScaleJoin algorithm. Section IV presents the two proposed reconfigurable architectures, Section V evaluates their performance, and Section VI draws the conclusions of this work.

II. RELATED WORK

One of the most important software-based implementations for the stream join problem is the Handshake algorithm [5]. This algorithm enables the parallel comparison of input streams by flowing them in opposite directions and using multiple threads. A newer version of this algorithm [6] stores incoming tuples into a node-local window, where they can be processed by parallel processing units. Gedik et al. [7] introduced the parallel CellJoin, an implementation of a window-based stream join algorithm on the Cell processor. The SplitJoin algorithm [8] is based on two characteristics that offer a high degree of parallelization: the parallel execution of the "storage" and "processing" algorithm steps, and the removal of the need for coordination and dependencies among join cores.

On the other hand, many hardware-based systems have been proposed for stream processing. To begin with, Halstead et al. [3] introduced an FPGA-based implementation that uses a hash-join engine, achieving impressive performance speedup compared to the corresponding software-based implementations. Moreover, Oge et al. [10] presented a hardware solution for the Handshake algorithm that can process an input of 1.2 million tuples per second for small window sizes. Next, Kritikakis et al. [11] presented an FPGA-based implementation of the ScaleJoin algorithm that offers high throughput due to its large number of processing units and its high level of parallelization across the overall comparisons; this work achieves up to four times higher throughput than the original software-based solution. Lastly, Oge et al. [12] proposed a scalable and order-agnostic hardware-based design for sliding-window aggregation.

This paper analyzes and presents two new hardware-based architectures for the ScaleJoin algorithm. The proposed architectures offer important architectural innovations compared to the previous related works. The main difference between the work in [3] and our proposed solutions is that [3] focuses on relational databases and not on stream processing algorithms. Our proposed implementations outperform all other software- and hardware-based implementations. In more detail, as we present in our results, the difference from [10], [11] and [12] is that the current work increases the number of tuples per second that can be processed in hardware compared to the performance of those systems.

Fig. 1. Top Level Architecture

III. STREAM JOIN OPERATOR AND SCALEJOIN ALGORITHM

This section focuses on the semantics of stream join processing and analyzes the basic algorithmic steps of the ScaleJoin algorithm [4]. The complexity of the implemented algorithm is also analyzed.
The input streams consist of flowing tuples, which need to be processed in real time. Each tuple is modeled as two components (v, t), where v is a value (or a set of values) and t is the timestamp that defines its order in the stream sequence. During the join process between two streams, i.e. R and S, all the new tuples from each stream need to be compared with the tuples from the other input stream. In more detail, a tuple with timestamp ts has to be "combined" with all the tuples from the other input stream whose timestamps lie in the interval [ts - W, ts + W], where W is the processing window size in time units. Whenever the correlation between two tuples is "successful", a new output tuple is created, which combines all the attributes of both tuples and sets the output merge timestamp to the maximum timestamp of the input data.
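As a concrete illustration of these semantics, the following minimal Python sketch processes one newly arrived tuple against the opposite stream; the equality-style predicate matches() and all helper names are our own illustrative assumptions, since the paper does not fix a particular join condition.

from collections import namedtuple

# A tuple is modeled as (v, t): a value (or set of values) and a timestamp.
StreamTuple = namedtuple("StreamTuple", ["v", "t"])

def matches(a, b):
    # Illustrative join predicate; the actual condition is application-defined.
    return a.v == b.v

def join_new_tuple(new_tuple, opposite_window, W):
    # Compare one newly arrived tuple against every tuple of the opposite
    # stream whose timestamp lies inside [t - W, t + W].
    results = []
    for old in opposite_window:
        if abs(old.t - new_tuple.t) <= W and matches(new_tuple, old):
            # The output tuple combines the attributes of both inputs and
            # takes the maximum of the two timestamps as its own timestamp.
            results.append(StreamTuple((new_tuple.v, old.v), max(new_tuple.t, old.t)))
    return results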
ScaleJoin [4] is a recent algorithm that offers the most performance-efficient processing for the stream join operation. The algorithm offers three important characteristics compared to previously proposed algorithms: (a) it can process tuples of streaming data regardless of the number of incoming or outgoing streams; (b) it guarantees deterministic processing with scalable, high-throughput parallelism without the need for any centralized coordination or locks; and (c) it can distribute the comparisons that take place among the tuples of the incoming streams to a number of parallel, independent Processing Units (PUs).
Stream join processing is computationally very intensive, which leads to the need for new systems capable of parallelizing the problem and increasing throughput. Each tuple from a stream has to be compared with all tuples from the opposite stream. Given that the tuple rate is T tuples per second for both processing streams and the processing window size is W seconds, the system needs to maintain W·T tuples in total. Hence, T tuples have to be compared with W·T tuples every second. Thus, the total number of operations for the tuples of each stream is about W·T², and the aggregate cost for both streams is about 2·W·T² comparisons per second.
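For a rough sense of scale (the numbers below are chosen purely for illustration and do not appear in the paper), consider W = 10 s and T = 1000 tuples per second per stream:

W · T  = 10 s × 1000 tuples/s = 10,000 stored tuples per stream
W · T² = 10 × 1000² = 10^7 comparisons per second per stream
2 · W · T² = 2 × 10^7 comparisons per second in total

The quadratic dependence on the tuple rate is what makes massive hardware parallelism attractive for this operator.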
IV. HARDWARE ARCHITECTURES FOR STREAM JOIN OPERATOR

This section presents two reconfigurable logic architectures that target the acceleration of the stream join operation. They both offer high main-memory access bandwidth, using different memory access patterns. Furthermore, both architectures achieve a high degree of parallelization on an FPGA by applying different pipeline approaches.

A. First Hardware System Architecture

This section presents a top-down analysis of our first proposed FPGA-based architecture. This architecture, the ScaleJoin module, was implemented on a hybrid platform which combines FPGAs with industry-standard processors.

Fig. 1 presents the top-level module of our implementation, which is mapped onto each one of the four available FPGAs. It consists of two Multiple Stream Processing Units (MSPUs), where each one is responsible for making the comparisons between the new tuples of one stream and all the tuples of the opposite input stream. Each MSPU module consists of seven Stream Processing Units (SPUs), shown in Fig. 2. The number of SPUs is based on the 8 memory channels available for each MSPU module. To be more specific, one memory channel is used for reading the "new" tuples and storing the results to the shared memory, while the other seven are used for reading and streaming the "old" tuples that belong to the processing time window. We chose to assign more channels to the "old" tuples, as this data transfer comprises the bigger percentage of the computational cost. This way, we achieve a higher level of parallelism, and the processing load is distributed across the SPUs.

The main process is divided into two steps. First, the "new" tuples of each stream are loaded into the corresponding MSPU module. Second, the "old" tuples are streamed into the parallel processing SPUs. Concurrently, the results are forwarded through a network of multiplexers and stored in the shared memory. After all the results are stored, the control unit brings in the next "new" tuples and the processing continues. This process takes place repeatedly.
Moving down in the hierarchy of our architecture, we describe the Stream Processing Unit (SPU). It consists of N Processing Units (PUs), which are connected in a pipeline, as shown in Fig. 3. In every clock cycle, each PU compares a newly arrived tuple from one stream with a tuple of the opposite stream inside the processing window. As the "older" tuples are streamed through the pipeline, all the necessary comparisons take place, and the process ends when all the comparisons have been made. Last, at the innermost level of our system lies the Processing Unit (PU). This module is fed with tuples from the two input streams, i.e. S and R; it compares their attributes and determines whether the comparison is successful.

Fig. 2. Multiple Stream Processing Unit (MSPU)

Fig. 3. Stream Processing Unit (SPU)
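The pipelined SPU described above can be modelled behaviourally with the short Python sketch below; it is a software analogy of the hardware, not HDL, and the function and parameter names are ours. Each of the N PUs holds one preloaded "new" tuple, one "old" tuple enters the pipeline per clock cycle and shifts one PU per cycle, and every successful comparison is emitted immediately, which in hardware is the job of the network of multiplexers.

def spu_pipeline(new_tuples, old_tuples, predicate):
    # Behavioural model of one SPU: a pipeline with one PU per "new" tuple.
    n = len(new_tuples)            # number of PUs in the pipeline
    stages = [None] * n            # the "old" tuple currently held at each PU
    results = []
    feed = iter(old_tuples)
    # Keep clocking until every "old" tuple has passed through every PU.
    for _ in range(len(old_tuples) + n):
        # Shift step: each "old" tuple advances one PU; the next one enters stage 0.
        stages = [next(feed, None)] + stages[:-1]
        # Compare step: every PU checks its preloaded "new" tuple against the
        # "old" tuple passing through it in this clock cycle.
        for new, old in zip(new_tuples, stages):
            if old is not None and predicate(new, old):
                results.append((new, old))   # forwarded towards shared memory
    return results

The seven SPUs of an MSPU run this loop in parallel on disjoint portions of the "old" tuples, which is why seven of the eight memory channels are dedicated to streaming them.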
B. Second Hardware System Architecture

The previous architecture of the ScaleJoin algorithm exploited the enhanced parallelization and the high memory bandwidth, outperforming any other software- and hardware-based implementation. On the other hand, its resource utilization was quite low; thus we came up with a new architecture, where we managed to increase the parallelization level of the final system.

The limiting factor in the previously proposed solution was the routing logic required for the network of multiplexers, which has to pass the results from each PU to the SPU element. This network of multiplexers is vital because, on every clock cycle, more than one match can occur simultaneously. The new architecture eliminates this network by adding extra storage structures, i.e. FIFOs. These FIFOs keep the necessary information about the first valid match and the number of matches that occurred during the tuple's movement through the pipelined architecture. This information is kept and tagged along with the tuple so that it can be used at a later stage of the execution phase. Furthermore, a different memory accessing scheme is used for better utilization of the available bandwidth, thus allowing for an even higher level of parallelization.

1) FPGA Architecture Overview: This section presents the proposed reconfigurable architecture. The system consists of sixteen fully configurable and independent parallel Stream Processor Engines (SPEs). These engines are used to capitalize on the available high memory bandwidth. Each SPE module, as shown in Fig. 4, consists of a Stream Processing Unit (SPU) and two memory management units, which load tuples into the SPU and store the results to the main memory, all sharing one main memory channel.

During the execution step, each SPE works in two phases. In the first phase, the SPE loads a portion of the newly generated tuples of the input stream, i.e. the "new" tuples. In the second phase, it streams all the "old" tuples of the stream and compares them with the "new" tuples. This cycle is repeated until all "new" tuples have been compared with the "old" tuples.

Fig. 4. Stream Processor Engine (SPE)
2) SPU Architecture: As mentioned above, the main processing element of each SPE is the SPU (Fig. 5). The heart of each SPU is a systolic array of N Processing Units (PUs). During the first phase of the execution, N "new" tuples are loaded from the memory into the PUs. Next, the "old" stream tuples are loaded from RAM. At each clock cycle, they are streamed through the whole array up to the last PU. Each PU compares its preloaded "new" tuple with the streamed "old" one. The PU internally keeps track of the number of times that this particular "old" tuple has entered the array, as well as the number of times that the comparison was successful. On each clock cycle, these statistics are updated and streamed to the next PU. At the other end of the array, special logic is used to detect whether successful comparisons exist. For each successful comparison, a new merged output tuple is created and forwarded to the memory storing unit. Furthermore, this logic compares the number of matches scored with the number of times the tuple has entered the array pipeline. In case the number of matches is greater than the number of times entered, the tuple and its associated statistics are kept in a FIFO buffer so that it re-enters the pipeline at a later stage; eventually all of its matching results are forwarded to the corresponding memory storing unit.

Fig. 5. Stream Processing Unit (SPU)
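One plausible behavioural reading of this exit logic is sketched below in Python; it is an interpretation for illustration only, not the actual RTL, and the recirculation policy shown here is an assumption based on the description above (one merged result reported per pass, with the tuple re-entering through the FIFO while unreported matches remain).

from collections import deque

def spe_pass(new_tuples, old, reported, predicate):
    # One pass of a single "old" tuple through the array of preloaded "new"
    # tuples: count its matches and find the first not-yet-reported one.
    match_count, first_unreported = 0, None
    for i, new in enumerate(new_tuples):
        if predicate(new, old):
            match_count += 1
            if first_unreported is None and i not in reported:
                first_unreported = i
    return match_count, first_unreported

def spe_run(new_tuples, old_tuples, predicate):
    results = []
    pending = deque((old, set()) for old in old_tuples)   # (tuple, reported PU indices)
    while pending:
        old, reported = pending.popleft()
        match_count, first = spe_pass(new_tuples, old, reported, predicate)
        if first is not None:
            results.append((new_tuples[first], old))      # merged output tuple
            reported.add(first)
        if match_count > len(reported):                   # unreported matches remain
            pending.append((old, reported))               # re-enter via the FIFO
    return results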
3) Load and Store Memory Units: These two units are responsible for loading both the "new" and the "old" tuples of the two input streams from the main memory. They are also used to store the newly generated tuples of successful matches. Both units share the same main memory system port and operate mutually exclusively.

4) System Analysis: The presented system is mapped onto four FPGA chips, so that it takes full advantage of the high memory bandwidth offered by the hybrid platform. The system hosts 64 SPEs on its 4 available FPGAs, resulting in 8192 PUs that operate in parallel on each clock cycle.
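Two back-of-the-envelope figures follow directly from the numbers above and from the 150 MHz clock reported in Section V; they are theoretical upper bounds that ignore the load phases and any memory stalls:

8192 PUs / 64 SPEs = 128 PUs per SPE
8192 PUs × 150 MHz ≈ 1.23 × 10^12 tuple comparisons per second (peak, across all four FPGAs)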
V. SYSTEM EVALUATION

This section presents the performance results for both proposed systems and compares them with the results from previous works. Both hardware systems were mapped and tested on the Micron-Convey HC-2ex hybrid platform [13]. This platform is configured with two six-core Intel Xeon E5-2640 processors running at 2.5 GHz with 128 GB of DDR3 memory, paired with four Xilinx Virtex-6 LX760 FPGAs connected to 64 GB of SG-DIMM memory. The critical resource for both proposed implementations is the number of occupied slices. In more detail, the first architecture utilizes up to 51% of the available slices, while the second design uses up to 82% of them. Both proposed designs operate at the system's standard 150 MHz clock frequency.

A. Performance Evaluation

This section describes the performance evaluation of the proposed reconfigurable systems. The performance of the proposed systems was compared against the performance of the official multi-threaded software-based system [4] and the fastest FPGA-based system [11]. The software-based implementation ran on a system configured with 48 AMD Opteron cores running at 2.6 GHz over 4 sockets, with 64 GB of RAM. On the other hand, the previous hardware implementation [11] runs on the same platform as the proposed architectures.

The benchmark used to evaluate our systems is the same one used by the original ScaleJoin algorithm and many other state-of-the-art algorithms, thus ensuring the comparison accuracy of the proposed systems. Both the S and R streams use randomly generated tuples uniformly distributed in the interval 0-10000. Also, these kinds of systems are time-window based; thus, we evaluated all systems on three predefined window sizes.

Fig. 6. Tuples per second for the software-based reference system and the 3 different FPGA systems
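For reference, a synthetic input of this kind can be reproduced with the small Python sketch below; only the value range 0-10000 and the uniform distribution are stated in the text, so the tuple-rate parameter, the timestamp spacing and the function name are our own assumptions.

import random

def generate_stream(num_tuples, tuples_per_second, seed=None):
    # Values drawn uniformly from [0, 10000]; timestamps spaced according to
    # the target tuple rate, which is the quantity swept in the evaluation.
    rng = random.Random(seed)
    interval = 1.0 / tuples_per_second
    return [(rng.uniform(0, 10000), i * interval) for i in range(num_tuples)]

# Example: one second's worth of the R and S streams at 1000 tuples/s each.
R = generate_stream(1000, 1000, seed=1)
S = generate_stream(1000, 1000, seed=2)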
The performance comparison of the systems used two metrics:
- Tuple rate (T/s), which is the maximum throughput of tuples that each system can handle every second over a predefined window size.
- Comparisons executed per second.