翻译以.原文和在同一文件中前一致性或延迟性一种基于状态机_第1页
翻译以.原文和在同一文件中前一致性或延迟性一种基于状态机_第2页
翻译以.原文和在同一文件中前一致性或延迟性一种基于状态机_第3页
翻译以.原文和在同一文件中前一致性或延迟性一种基于状态机_第4页
翻译以.原文和在同一文件中前一致性或延迟性一种基于状态机_第5页
已阅读5页,还剩42页未读 继续免费阅读

下载本文档

版权说明:本文档由用户提供并上传,收益归属内容提供方,若内容存在侵权,请进行举报或认领

文档简介

方法作者:XuWang,Hailong摘CAPPACELC宣称在分布式副本系统中存在一些性能评量的折RSMRSM-d,我们建立了概率模型去量化一致性和延迟性。一致性和延迟性都受d的影响,而d表示协关键词:一致性,写冲突,延迟性, 引[4]进一步解释了CAP,并提出了一致性与可用性,一致性与延迟之间的两种权衡。系统总是可写的并保持最终一致性,如Dynamo[5],Cassandra[6]。然而其他系统如Chubby[7],Zookeeper[8]必须有段很短的时间来从错误中进行恢复来保证各副本管弱一致性已经在一些商业系统中运用并且被许多最求高可用性和低延迟的开发者bing搜索业务,21.8%4.3%;由于延时从400ms增加到900ms,谷歌在搜索领域的业务量下降了25%;100ms的延迟增加使亚马逊的销售额下降了一个百分点[13]。因此在系统设计中延迟是一项很重要的考虑RSMs[14]Quorum[15]系统是分布式系统和数据库领域中比较流行的两种容WR两种副本集合。W集合用于写操作,R用于读操作。在随机时间通过RSMsquorum系统提供RSMsquorumRSMs的写RSMs一致性和延迟之间,基于不一致性的定量结果做出一个合理理性的权衡。束[19,20]和限制副本的写分离[21,22]RSMs的写不在这篇论文中,提出了一个定量的概率分析方法来计算RSMs的不一致性和延一点,调查和分析了三种代表性的RSM实现,分别为非一致全序广播[23,24],分调查和分析[27-29]。然后,我们提出了一种通用系统模型RSM-d,d的取值范围为[1-n](n为副本数。D表示提交一个写操作之前协调者必须收到的ACKs数。模型表明(1)在d∈[1,[n/2]+1]时一致性随着d的增加而增强(2)当d大于[n/2]+1d的增加而单调递增。第四,将写不一致性的量化结RSM-d的延迟联系起来,RSM-d的基础上,我们为写冲突构建了一个概率模型,写冲突是造成不一致性都受到一个共同因素的影响,这个共同因素就是d。 背RSMs的背景,RSMs致力于容错在分布式系统中得到ChandraToueg[30]已经证明全序广播和分布一致性问题是可解决的,在这次研究中,我们也遵循这些假设。这就意味着我们讨论的RSMs有2f+1个副本,能f个崩溃错误,并且最终能够侦测到崩溃的节点。然而,实际的错误探测器在一定时间内必须输出是否崩溃的结果,而不是长时间等待一个最终结论。[31]将在[23-29]调查中,RSMs通过非一致全序广播,分布式一致和一致全序广播实现。如图2.1所示,这三种具有代表性的RSM实现时这样实现的:一个独一的序列号并且将其提交给所有的副本。当写操作消息抵达副本,副本会记录下这个写操作,并且根据序列号进行提交,然后回复协调者。只要协调者受到第一个回复,它就会向客户端发送响应信息。任意时刻如果协调者被认RSMs的分布式一致,一旦协调者收到一个写操作命令,首先会1[n/2]+1n RSM-d系统模型,这个模型RSM模型的抽象表示。因RSMs,通过调查研究它们来获取尽管已经在第二部分介绍过了全序广播算法,也仅是全序广播综合调查结果5外两种是基于全序逻辑时钟[16]的历史交流和目的地协议实现。RSM和分布式一致的解决方案都是目的地协议。Paxos用于对单个写操作排序的协商,multi-paxos针3.1RSM-dPaxos算法进行全序广播。原因是:1)固定的定序器算法在所有的全序广播里的延迟是最少的。2)Paxos在3)在动态定序器中的标示符,基于特权的算法,在通信Paxos算法相结合建立一个统一的系统RSM-d。这是切实可行的,因为全序广播和分布式一致是基本相等的。如图2dd属于[1;n]d副本也包括了协调器,也就是一果d=1,RSM-d降低为不均一全序广播,协调器可以直接进入提交阶段。如果d=[n=2]+1,RSM-dPaxosd=n,RSM-d变成为一致的全序广播。我们不只是涉及到了这三个代表性的RSMs,而且将深入探讨其他的在过去被忽视的RSM-dRSMRSMs的假设。RSMs假设同时崩溃的效检测器将开始选择新的协调器,并且只在历史记录的基础上更改错误。在RSM-d有多个协调器,如Multi-RingPaxos,他与Paxos相协调。 写冲突概RSM-d,如ACK数𝑑≥[𝑛/2]+1,所有的写计算被提交到他们普通序和障修复后,f𝑓[𝑛/2]有故障出现,副本仍将保持一致性。然后如果𝑑[𝑛/21,ACK消息的图4.1显示了当至少d节点接收到了“PROPOSE产生故障时,写日志丢失的方案。假设𝑃𝑤是一个写计算所有日志丢失的可能。如d(b)之下,𝑃𝑤=𝑃𝑐。4.1因为历史记录只宝库了日志历史,所以𝑃𝑤也是在历史记录中的一个写计算丢𝑃𝑤= 如今我们不考虑条件(c,但(a)和(b)仍保留。这意味着我们允许提交的)𝑃𝑤𝑛𝑢=∑𝑃𝑐𝑘1−𝑃𝑐𝑛−[𝑃𝑤𝑐=𝑃𝑤𝑃𝑤𝑛𝑢= ∑ 𝑃𝑐𝑘1−𝑃𝑐𝑛− 现在我们进一步的不考虑(b),并只考虑(a。因此我们应该考虑到不同步的度作用是随机数据下的𝑃𝑒𝑤𝑑。为了便捷,我们用以下公式表示:𝑄𝑥=𝑃𝑐𝑥∑𝑃𝑐𝑘1−𝑃𝑐d<x<n。明显的可以看出方程式(2)可以被写成𝑃𝑤𝑐𝑄𝑑。当𝐷𝑄𝑚时,写冲突的条件几率也是如此。因此我们可以通过对每一个可能的𝐷=𝑚条件几率之𝑃𝑤𝑐=∑𝑃𝑒𝑤𝑚𝑄

=∑𝑃𝑒𝑤𝑚𝑃𝑐𝑚∑𝑃𝑐𝑘1−𝑃𝑐 如上述的,在不同条件下协调器选择和故障恢复,等式(1(2)和(3)都代𝑃𝑤𝑐= =∑𝑃𝑒𝑤𝑚𝑄4.3图4.3显示了写重复的场景。当一个不准确的故障检测器对协调者存在错误怀疑,至少存在一个调解节点。然而,如果dn2Pno表示没有相同节点的回复确认消息的两个节点集合的概率。除去两个协调者本身,其他两个随机(d1)节点集可能不重叠,且都是不同的节点。因此,Pno是两个随机(d1)节点集数,没有重叠和排除两个协调者除以所有两个随机节点集(𝑛− 𝑛−𝑑−

=𝑑− 𝑛− 𝑑−

𝑑−𝑛−𝑑−图4.4所示,主要工作如下:在时距Te4.4分布函数分别为𝑓和𝐹ℎ𝑏2ℎ𝑏1是它们的两个样本,分别代表两个连续心跳信息的延迟信息。“副本在定时器超时前没有接收到新的心跳信息”意味着ℎ𝑏2+𝑒𝑏1+𝑜𝑏2+𝑒𝑏1+𝑜,副本仍然认为它崩溃了。两个节点之间的这种错误怀疑的概率由Pfs表示:𝑃𝑓8=𝑃𝑟ℎ𝑏2+𝑒>ℎ𝑏1+=𝑃𝑟ℎ𝑏2−ℎ𝑏1>𝑜−=𝑃𝑓8

𝑜

𝑒Var(T)T的方差。如果我们不知道T的分布,上述不等式可用𝑝𝑓8的上限Pgfs中,对协调者错误怀疑的概率可以通过加和n-f-1活跃副本的概率,乘以至少有一个活跃副本错误怀疑协调者的n-f-1活跃副本来计算:

𝑛−

=∑(

) 1− 1−1−Pgfs的流传式故障检测器(由另一个符号Pgs-gfs表示)有一点不同: 𝑛− 𝑛−1−𝑔8−𝑔𝑓8=∑

) 1− Pwd表示写重复概率。写重复发生如果(1)协调者正常工作但(2)其他活认信息不重叠,即,Pwd等于这三个事件的概率的乘积:𝑃𝑤=1−𝑃𝑐 但难以准确计算Pno,因为它取决于特定协调者选定的持续时间,故障恢复算法,[]𝑃𝑤𝑐=𝑃𝑤+𝑃𝑤=∑𝑃𝑒𝑤+𝑄𝑚+1−𝑃𝑐 虽然𝑃𝑤𝑐的最终的结果给出了,我们应该在不同的情况下采用它。例如, 时正如图5.1所示,我们已经在即将使用的表5.2𝐿𝐴𝑖=𝑝𝑖+𝑜𝑔𝑖+𝑎𝑖𝑝=𝑎=𝑝𝑐=𝑝𝑓=0。若不考虑这种序。例如,𝐿𝐴1≤𝐿𝐴2≤⋯≤𝐿𝐴𝑛−1。𝐿𝑤k小的𝐿𝐴𝑖𝑝𝑜𝑔𝑎𝑐“COMMIT”消息从协调者到副本i所需时𝑐𝑜𝑚𝑓𝐿𝑤𝑑=𝐿𝐴−1+𝑚𝑖𝑛 当d2𝐿𝑤1=𝑜𝑔𝑐𝑜𝑜𝑑𝑖𝑛𝑎𝑜𝑟+𝑚𝑖𝑛 𝑜𝑔𝑐𝑜𝑜𝑑𝑖𝑛𝑎𝑜𝑟+min(𝑐𝑖)+𝑐𝑜𝑚𝑖+𝑓5.1接下来我们关注信息的推迟而忽略登陆时间𝑜𝑔𝑖(或者可以认为是时间非常短。因𝑝𝑖和𝑜𝑔𝑖都是随机变量T的样本(T的定义在第IV部分,遵循概率密度函遵循概率密度函 𝐿𝐴𝑖=𝑝𝑖+𝑎

𝑓𝑥 −𝑥𝑑𝑥(f(t)IV部分)和累计分布函数G(t)。𝐿𝐴𝑘的pdf和cdf分别由ℎ𝑘 和𝐻𝑘t1≤k≤n−1表示。直观上来说,对于足够小的正实数𝜀,𝐿𝐴𝑘∈[t,t+ε],只要存在一个𝐿𝐴𝑖∈[tt+ε],使得𝐿𝐴𝑖取值(k-1)t,对于其他𝐿𝐴𝑖取值𝑛1𝑘𝜀𝑃𝑟(𝐿𝐴𝑘∈[,+𝑛−= ) 𝐿𝐴ϵ[t, 𝑛−

𝐾−1𝑃𝑟𝐿𝐴 +ε 𝑛−2=𝑛−1𝑝(𝐿𝐴∈[t,t+ε]) ) 𝐿𝐴 𝐿𝐴 +ε 𝑘− 𝑛−=𝑛−1 )𝑝𝐿𝐴 𝐾−1𝑝𝐿𝐴 +ε𝑘− 𝑛−=𝑛−1

) 𝑘−1𝑝𝐿𝐴 +ε𝑘− 𝑛−=𝑛−1 ) 𝑘−11− + 𝑘− =𝑙𝑖𝑚𝑝𝑟(𝐿𝐴𝑘𝜖[t,t+ε])=𝑛−1 𝑛− 𝑘−11− 𝑘−通过方程(7),我们可以猜想𝐿𝑤𝑑1𝐿𝑤𝑑𝑑≥1:𝐸𝐿𝑤𝑑+1−𝐿𝑤 = +min𝐿𝐶𝑖)−𝐸(𝐿𝐴−1+min𝐿𝐶𝑖= −𝐿𝐴−1以上两个minLCi被抵消,因为提交阶段没有变化,并且𝐿𝐴0=0f(t)t0时,f(t)=0t0时,g(t)=0,G(t)=0,G+∞=1。如果d=1,那么𝐸𝐿𝑤2−𝐿𝑤 =𝐸𝐿𝐴 = 𝑛−1 1− 𝑛−2= −𝑑1− 0= 1− 𝑛−1𝑑−

1− 𝑛−1= 1− −

𝑔𝑥𝑑𝑥00若假设 存在,那么𝐸 =∫+∞ 𝑑是一个常量。在这种情况下很容0以至于:𝐸𝐿𝑤2−𝐿𝑤 如果𝑑≥2

𝑔𝑥𝑑𝑥𝑛−1=0+∞1− 𝐸𝐿𝑤𝑑+1−𝐿𝑤 = )−𝐸𝐿𝐴=∫ 𝑛−1 (𝑛−2) )−1(1− 𝑑−

𝑑− 𝑛−

𝑛− −1 ( ) (1− 𝑑= ) − − 𝑛−

𝑑−

𝑑−

=𝑑−

) 0

(1−

𝑑−

𝑔𝑥类似的,如果𝐸

𝑔𝑥 =𝐸

𝑑+1−

=(𝑛−1)∫𝑑−

−11− 𝑛− 𝑑≥让∆𝐿𝐴𝑑=(𝑛−1

− (1−

𝑑 𝐸𝐿𝑤𝑑+1−𝐿𝑤 =∆𝐿𝐴 𝑑≥ 显然∆𝐿𝐴𝑑>0。所以方程(8)f(x)n的数 一致性和延迟之间的长期运行的RSMs,系统的一致性象征着有序提交的写操作比率,这能通过1-PwcE(Lw(d))进行估算。在这部分,B是系统总效益(比如,收入或用户体验,BcBl是一致性和延迟对系统效益的影响值。举例来说,B能看做网上商店的总收入,强一致性的事务性购买会使收入增加,而高延迟将导致用户减少和收入减少。ξ表示其他因素对总效益的𝐵𝑑=𝐵𝑐+𝐵+𝛿=𝛼(1−𝑝𝑤𝑐𝑑)+𝛽+𝛾−𝜃𝐸𝐿𝑤 +=𝛼+𝛽+𝛾+𝛿−𝛼𝑃𝑤𝑐𝑑+𝜃𝐸𝐿𝑤=𝜂−𝛼𝑃𝑤𝑐𝑑+𝜃𝐸𝐿𝑤=𝜂−𝜃𝐸𝐿𝑤 −𝛼𝑃𝑤𝑐𝑑+𝜃𝐸𝐿𝑤 −𝐸𝐿𝑤=𝜂−𝜃𝐸𝐿𝑤 −𝑣其中𝜂=𝛼+𝛽+𝛾+𝛿并且𝑣𝑑=𝛼𝑃𝑤𝑐𝑑+𝜃𝐸𝐿𝑤𝑑−𝐸𝐿𝑤 尽管𝛼,𝜃,𝜂和𝐸𝐿𝑤1d的取值,对于一V(d)是上述等式的可变部分。V(d)取最小值时B(d)取最大值。𝐵𝑑=𝜂−𝜃𝐸𝐿𝑤 −𝑣𝑑≤𝜂−𝜃𝐸𝐿𝑤 −𝑣𝑑′=𝐵𝑑′=的d的值。B(1)=𝜂−𝛼𝑃𝑤𝑐1+𝜃𝐸𝐿𝑤 𝐵([2]+1)=𝜂−𝜃𝐸𝐿𝑤[2]+𝐵𝑛=𝜂−𝜃𝐸𝐿𝑤通过上述方程组,我们能得到𝛼,𝜃,𝜂d取值是一定的,而 实的失败率,RNSanti-entropygossip-style故障检测会使系统更复杂和更不确定。在本节中,我们关注的是验证我们的结果,从而减少RSM-d实验的假设变量数量的最低值,扩大应用范围。因此不考虑anti-entropy和gossip-style的故障检测。为𝑁𝑤和𝑁𝑤。因此我们估计𝑃𝑤𝑐=𝑁𝑤𝑁𝑤/10000000。与此同时,我们计算延迟的平均值来估计𝐿𝑤d的期望。s0.05(20ms,0.1(10ms)0.2(5ms被分别设置到(1000ms,2000ms)和(500ms,1000ms。在以上输入参数下RMSE=0.0009%,stddev.=0.0052%.dn21,我们观察到𝑁𝑤和𝑁𝑤为了证明方程(8均布fx)循=0:01;0:2;0:05;:1;:2,该分布已传入相应信息。我们执行一个空写(没有操作)保证日志记录和写入提交的时间可以忽略。我们对每个𝑛∈,9和𝑑∈,𝑛进行测试,得到延迟时间的平均值并预测方程(8)给出的延迟期望值。在所有情况下将𝐸𝑤𝑑 −𝐸𝑤1 的观测均0.013ms和t.d0.019ms,符合们对延迟的预期。dλ=0:01,Pc=2%,Te=1000ms,To=2000ms。7.1所示,xn(2<n<9)和相应的应答d(1<d<[n=2]),y轴表示的是写入冲突的概率。为了强调不同的差异,我们用log𝑝𝑤𝑐来代替𝑝𝑤𝑐。我们每步下降至少一个数量级。注意到当d>[n/2]时,写入冲突的概率𝑝𝑤𝑐为0,所以并d方程8揭示了写延迟与d的关系。在实验中,我们想要更直观的展示出写延迟df(x)被设置为λ=0.01。同时,还执行一个空写入扰最终的结果,因为𝐿𝑤1包括的所有值将被删除。∈[和相应数量的ACKs数𝑑∈[1,𝑛/2],y轴表示代替𝐿𝑤1的延迟𝐿𝑤𝑑𝐿𝑤𝑑d的增大而增大。在这里我们关注𝑑∈[2,𝑛/2],实际上,在𝑑∈[1,𝑛]之间的任意值𝐿𝑤𝑑dd值的降低将导致n值的增大。原因是协调器会存在更多的ACKs去选择最快的d。7.2d一致v.s.延正如第六节所定义的一致性等于1−𝑝𝑤𝑐,现在我们结合以上两个实验的结果,7.3所示,x轴表示代替𝐿𝑤1的延迟𝐿𝑤𝑑,y轴表示1𝑝𝑤𝑐d3n3,5,7和其相应∈[𝑛/2]n和𝑑1𝑛/2],它表明,更强的一致性必然导致糟糕的7.3V.S.RSMs系统作为其销售系统。虽然难的。但是,如果我们不想知道系统的精确值的好处𝐵𝑑,我们只需要通过统计方法知道一致性效益比𝛼和延迟效益比𝜃,最后确定𝑑′的最大收益值𝐵𝑑′。价格是V。一方面,如果每个正常的(一致)购买可以带来20%的价格,和每一个异常(不一致)购买带来双重损失的价格(如赔偿订单冲突)。产生效益的为20%𝑆𝑉1−𝑃𝑤𝑐−2𝑆𝑉1−1−𝑃𝑤𝑐,因此收益率的一致性为𝛼=2.2𝑆𝑉。另一方面,用户相1001%。也就是说,收益的比例延迟𝜃为0.01%𝑆𝑉。因此,我们有𝛼=2.2𝑆𝑉和𝜃=0.01%𝑆𝑉。7.4第六节表明,整体利益𝐵𝑑的最大值为其变量𝑉𝑑=𝑆𝑉2.2𝑃𝑤𝑐d0.01%𝐸𝐿𝑤𝑑−𝐿𝑤 有最小值。在𝑑∈[1,𝑛]+1]和𝑑∈[2,9]2经计算出𝑉𝑑11𝑉𝑑部分描述了变量整体利益𝐵𝑑。例如,n=3,d=2,𝑉𝑑值为0.125𝑆𝑉45d=2时𝑉𝑑有最小值,𝐵𝑑Paxos实现。因 相关较具有代表性的实现是非一致全序广播,PaxosRSMs已经RSMs一致性已经被大量研究,但依旧在最近吸引着人们的目光。根据CAP[3]和RSMs中,我们的工作通过考虑崩溃错误时写冲突发生的概率来量化不一致 讨Quorum系Quorum系统定义了两类副本集合,一类是W,另一类是读R。它们能被配ONE(W=1),QUORUM(W=[n/2]+1ALL(W=n)或者是任意值。这和序,而RSMs为全序排列。所以一些在RSMs系统中引起写冲突的情况不一定适用于Quorum系统。在现今的实际quorum系统中,一般把第二种写看做一个新的写操RSMs在对同步提交写操作和副本历史记录状态时怀疑协调者后会进行协调者10现的变体,我们首先提出了一种叫做RSM-d的通用模型来统一描述大多数运用参考.SummaryoftheAmazonEC2andAmazonRDSServiceDisruptionintheUSEastRegion.April2011J.Dean.Designs,lessons,andadvicefrombuildinglargedistributedsystems.KeynotefromLADIS2009.E.A.Brewer.TowardsRobustDistributedSystems.InPODC,pages7C7,D.J.Abadi.Consistencytradeoffsinmoderndistributeddatabasesystemdesign:CAPisonlypartofthestory.IEEEComputer,45(2):37C42,2012.G.DeCandia,D.Hastorun,M.Jampani,G.Kakulapati,A.Lakshman,A.Pilchin,S.Sivasubramanian,P.Vosshall,andW.Vogels.Dynamo:Amazonshighlyavailablekey-valuestore.InSOSP2007.TheApacheCassandraProject./[7]M.Burrows,”TheChubbylockserviceforloosely-coupleddistributedsystems,”inOSDI’06:Proceedingsofthe7thsymposiumonOperatingsystemsdesignandimplementation,2006,pp.P.Hunt,M.Konar,F.P.Junqueira,andB.Reed,ZooKeeper:Wait-freecoordinationforInternet-scalesystems,inUSENIXATC10:Proceedingsofthe2010USENIXAnnualTechnicalConference.USENIXAssociation,2010.W.Vogels.Eventuallyconsistent.Commun.ACM,52:40C44,JanuaryK.Birman.ReliableDistributedSystemsTechnologies,WebServicesandApplications.Springer,2005.E.SchurmanandJ.Brutlag.Performancerelatedchangesandtheiruserimpact.PresentedatVelocityWebPerformanceandOperationsConference,June2009.VelocityandtheBottomLine. /site/glinden/Home/StanfordDataMining.2006-11-29.ppt.29F.Schneider.Implementingfault-tolerantservicesusingthestatemachineapproach:atutorial.ACMComputingSurveys,22(4),1990.Merideth,MichaelG.,andMichaelK.Reiter.Selectedresultsfromthelatestdecadeofquorumsystemsresearch.Replication,SpringerBerlinHeidelberg,2010,185-206.Lamport,L.Time,clocks,andtheorderingofeventsinadistributedsystem.ACMCommunications,21(7),pp.558-565,1978.D.Malkhi,M.Reiter,A.Wool,andR.Wright.Probabilisticquorum.systems.InformationandCommunication,(170):184C206,2001.PeterBailis,ShivaramVenkataraman,JosephM.Hellerstein,MichaelFranklin,IonStoica.ProbabilisticallyBoundedStalenessforPracticalPartialQuorums.InVLDB2012S.S.Chawathe,H.Garcia-Molina,andJ.Widom.FlexibleConstraintManagementforAutonomousDistributedDatabases.IEEEDataEng.Bull.,17(2):23C27,1994.A.GuptaandS.Tiwari.Distributedconstraintmanagementforcollaborativeengineeringdatabases.InCIKM,pages655C664,1993.H.YuandA.Vahdat.DesignandEvaluationofaContinuousConsistencyModelforReplicatedServices.InOSDI,pages305C318,2000.S.Shah,K.Ramamritham,andP.J.Shenoy.ResilientandCoherencePreservingDisseminationofDynamicDataUsingCooperatingPeers.IEEETrans.Knowl.DataEng.,16(7):799C812,2004XavierDefago,AndreSchiper,andPeterUrban.Totalorderbroadcastandmulticastalgorithms:Taxonomyandsurvey.ACMComput.Surv.36,4(December2004)XavierDefago.Comparativeperformanceanalysisoforderingstrategiesinatomicbroadcastalgorithms.IEICETrans.onInformationandSystems.December2003L.Lamport.PaxosMadeSimple.ACMSIGACTNews,32(4):18C25,DecemberP.Marandi,M.Primi,N.Schiper,andF.Pedone.RingPaxos:Ahigh-throughputatomicbroadcastprotocol.InternationalConferenceonDependableSystemsandNetworks(DSN),2010,pp.527-536.L.Lamport.GeneralizedConsensusandPaxos.TechnicalReportMSRTR-2005-33,MicrosoftResearch,2005,B.KemmeandG.Alonso.Databasereplication:ataleofresearchcommunities.PVLDB,ParisaJaliliMarandiMarcoPrimiFernandoPedone.Multi-RingPaxos.InDSNChandra,T.,Toueg,S.:UnreliableFailureDetectorsforReliableDistributedSystems.JournaloftheACM43(2),225C267(1996).W.Chen.OntheQualityofServiceofFailureDetectors.IEEETransactionsonComputer.May2002.DiogoBecker,FlavioJunqueira,andMarcoSerafini.LeaderElectionforReplicatedServicesUsingApplicationScores.InMiddleware2011.B.Lampson.TheABCDsofPaxos.InProceedingsofthe20thACMSymposiumonPrinciplesofDistributedComputing(PODC01),page13.ACMPress,2001.IditKeidarandSergioRajsbaum.Onthecostoffault-tolerantconsensuswhentherearenofaults-atutorial.TechnicalReportMIT-LCS-TR-821,LaboratoryforComputerScience,MassachusettsInstituteTechnology,Cambridge,MA,02139,May2001.alsopublishedinSIGACTNews32(2)(June2001).A.Demers,D.Greene,C.Hauser,W.Irish,J.Larson,S.Shenker,H.Sturgis,D.Swinehart,andD.Terry.Epidemicalgorithmsforreplicateddatabasemaintenance.InPODC1987.MarinBertier,OlivierMarin,PierreSens,ImplementationandPerformanceEvaluationofanAdaptableFailureDetector,Proceedingsofthe2002InternationalConferenceonDependableSystemsandNetworks,p.354-363,June23-26,2002.M.RaynalandF.Tronel.GroupMembershipFailureDetection:ASimpleProtocolandItsProbabilisticAnalysis.DistributedSystemsEng.J.,vol.6,no.3,pp.95-102,R.vanRenesse,Y.Minsky,andM.Hayden.Agossip-stylefailuredetectionservice.InProceedingsofInternationalConferenceandDistributedSystemsPlatformsandOpenDistributedProcessing(IFIP),1998.J.Gray,P.Helland,P.ONeil,andD.Shasha.Thedangersofreplicationandasolution.InSIGMOD1996.B.F.Cooper,R.Ramakrishnan,U.Srivastava,A.Silberstein,P.Bohannon,H.-A.Jacobsen,N.Puz,D.Weaver,andR.Yerneni.PNUTS:Yahoo!sHostedDataServingPlatform.PVLDB,1:1277C1288,August2008.W.Lloyd,M.J.Freedmand,M.Kaminsky,andD.G.Andersen.Dontsettleforeventual:Scalablecausalconsistencyforwide-areastoragewithCOPS.InSOSP2011.ConsistencyorLatency?AQuantitativeAnalysisofReplicationSystemsBasedonReplicatedStateXuWang,HailongSun,TingDeng,JinpengSchoolofComputerScienceandEngineeringBeihangUniversityBeijing,ChinaEmail:{wangxu,sunhl,dengting,huaijp}—ExistingtheorieslikeCAPandPACELChaveclaimedthattherearetradeoffsbetweensomepairsofper-formancemeasuresindistributedreplicationsystems,suchasconsistencyandlatency.However,currentsystemstakeaveryvagueviewonhowtobalancethosetradeoffs,e.g.eventualconsistency.Inthiswork,weareconcernedwithprovidingaquantitativeanalysisonconsistencyandlatencyforwidely-usedreplicatedstatemachines(RSMs).BasedonourpresentedgenericRSMmodelcalledRSM-d,probabilisticmodelsarebuilttoquantifyconsistencyandlatency.Weshowthatbothareaffectedbyd,whichisthenumberofACKsreceivedbythecoordinatorbeforecommittingawriterequest.Andwefurtherdefineapayoffmodelthroughcombiningtheconsistencyandlatencymodels.Finally,withMonteCarlobasedsimulation,wevalidateourpresentedmodelsandshowtheeffectivenessofoursolutionsintermsofhowtoobtainanoptimaltradeoffbetweenconsistencyandlatency.—consistency;writeconflict;latency;replicatedPracticaldistributedsystemsinCloudandBigDatasolu-tionsoftenfacemanychallengessuchashighscalabilityandavailability,whicharebuiltonmassivecommoditycomput-ers,disks,networkdevicesandundercomplexmanagementtasks[1,2].Toservemoreusersandprovidehighenoughavailability,servicesanddataaretypicallyreplicatedacrossmultiplevirtualmachines(VMs),physicalmachinesandevengeographically-distributedclusters.TheCAPtheorem[3]showstheimpossibilityofobtainingallthreeofconsistency,availabilityandpartitiontolerancesimultaneously.AndPACELC[4]furtherinterpretsCAP,whichclaimstwotradeoffsincludingconsistency&availabilityandconsistency&latency.Ontheonehand,peoplemakeatradeoffbetweenconsistencyandavailability.Inordertoprovideextremelyhighavailability,manysystemsarealwayswritablewitheventualconsistency,suchasDynamo[5]andCassandra[6];whileothers,likeChubby[7]andZookeeper[8],mustexperienceashortperiodofunavailabilitytorecoverfromfailuresforguaranteeingstrongconsistencyofdifferentreplicas.Ontheotherhand,peoplefocusonthetradeoffbetweenconsistencyandlatency.Forexample,eventuallyconsistentreads,nomatterhowfasttheyare,cannotboundthestalenessasthenewestversionswillbeeventually[9];andstrongconsistencyleadstohigherlatencyfor

duetothewriteuniformity[10].Althoughweakconsistencyhasbeenusedinsomecommercialsystemsandisacceptablebymanypractitionersforhigheravailabilityandlowerlatency,itisnecessarytoboundtheinconsistencyandtoknowhowinconsistencytheweakconsistencyis.Sincelatencyimpactstheend-userexperienceofanIn-ternetapplication,ithas eanimportantsystemmetricformodernserviceproviders[11].Somestatisticalresultspre-sentedin[12]showthatend-usersareverysensitivetosystemlatency.Forinstance,atMicrosoftBing,a2-secondslowdownreducesqueries/userby1.8%andrevenue/userby4.3%;alatencyincreasefrom400msto900mswithGooglesearchresultsina25%dropoffinpagesearches;andhasa1%dropinsaleswitha100mslatencyincrease[13].Thereforelatencyisanimportantfactorinsystemdesign.Inthiswork,wefocusonthetradeoffbetweenconsistencyandlatencyindistributedreplicationprotocols.Twopopularfault-toleranceapproachesindistributedsys-temsanddatabasesarereplicatedstatemachines(RSMs)[14]andquorumsystems[15].RSMsdescribedesirablereplica-tionsemanticstomakeoperationscommittedintotalorder.QuorumsystemsdefinesetsofreplicasWandR,whereWisusedforwriteandRforread.Andtheycommitwritesbyvectorclocks[16]withsemanticreconciliation[5]incausalorder.Forquorum-likesystems,severalworks[17,18]boundreadstalenessfromtheaspectsofdataversions,staletimeandstalenessprobabilitytodescribetheinconsistency,andthendiscussthetradeoffbetweenconsistencyandlatency.However,littlehasbeendoneonquantifyinginconsistencyandlatencyforRSMs,whiletheyaredesignedtoproviderelativelystrongerconsistencythanquorumsystems.ForreadoperationsofRSMs,wecanborrowanalysistechniquesfromthequantitativereadstalenessinquorumsystems.ButforwriteoperationsofRSMs,weneedanewanalyticalandquantitativemethodtomeasurethedegreeofwriteinconsistency.ThenwecanrationallymakeatradeoffbetweenconsistencyandlatencyforRSMsbasedonthequantitativeresultofwriteAlthoughtherearesomeexistingalternativemodelstodescribewriteinconsistencyintheabsenceofcrashfailures,e.g.,writeconsistencyconstraint[19,20]andlimitedwritedivergencyofreplicas[21,22],wewanttoknowthedegreeofwriteinconsistencyforRSMswhenencountering978-1-4799-0181-4/13/$31.00©2013failures,ratherthanhowtosatisfyconstraintsorboundthemaximumdeviationofadataitem.ThereforewequantifywriteinconsistencyforRSMsbytheprobabilityofwriteconflictwhichrepresentstheprobabilityofawritecommittedinanabnormalorder.Inthispaper,wepresentaquantitativeprobabilisticanaly-sisapproachtomeasuringtheconsistencyandlatencyofRSM-s,andfurtherprovideasolutionformakingtradeoffbetweenconsistencyandlatencytoachievethemaximalsystembenefit.First,wehavesurveyedandanalyzedthreerepresentativeRSMimplementationsincludingnon-uniformtotalorderbroadcast[23,24],distributedconsensus(Paxos)[25,26]anduniformtotalorderbroadcast[23,24],aswellastheirvariants[27-29].ThenweproposeagenericsystemmodelRSM-d,whered∈[1,(nisthenumberofreplicas)representsthenumberofthatmustbereceivedbeforecommittingawrite.RSM-dcandescribethethreerepresentativeoneswhered=1,[n/2]+1,nrespectively.Second,wemeasurethewriteinconsistencyofRSM-dbasedonaprobabilisticmodel.Thewriteincon-sistencyisquantifiedbytheprobabilityofwriteconflictwhichrepresentstheprobabilityofanywritecommittedinanabnormalorder.Ourprobabilisticmodelshowsthat(1)consistencyincreaseswhendrisesifd∈[1,[n/2]+1]andstrongconsistencyisalwaysguaranteedifd∈[[n/2]+1,n].Third,weevaluatethelatencyofRSM-dthroughitsexpectationwhichshowsthatthelatencystrictlymono-tonicallyincreaseswithrespecttod.Fourth,combiningthequantitativeresultsofwriteinconsistencyandlatencyofRSM-d,weprovideasolutionfortradeoffbetweenconsistencyandlatencytoachievethemaximalsystembenefit.Finally,throughMonteCarlobasedeventdrivensimulations,wevalidateoursolutionfortradeoffbetweenconsistencyandlatencyfromtheaspectofoverallsystembenefit.WemakethecontributionsasWepresentagenericsystemmodelRSM-dforRSMstogiveanunifieddescriptionofmajorreplicationprotocolsusingRSMs;OnthebasisofRSM-d,webuildaprobabilisticmodelforwriteconflictthatisoneofthekeyfactorstocauseinconsistency,andaprobabilisticmodelforcharacterizinglatencyaswell.Wefindoutthatbothconsistencyandlatencyareaffectedbythecommonfactord.Wefurtherdefineapayoffmodeltomaketradeoffbetweenconsistencyandlatencyforachievingthemaximumsystembenefitthroughcombiningthewriteconflictmodelandlatencymodelfromtheperspectiveofaserviceprovider;WithMonteCarlobasedeventdrivensimulations,wevalidateourquantitativeresultsandshowtheeffectivenessofourpresentedsolutionsintermsofhowtoobtainanoptimaltradeoffbetweenconsistencyandlatency.Theremainderofthispaperisstructuredasfollows.SectionIIintroducesthebackground.SectionIIIpresentsoursystemmodel.SectionIVdescribeshowtoquantifytheprobabilityofwriteconflict.Thecalculationoflatency

presentedinSectionV.SectionVIshowsthetradeoffofconsistencyandlatency.SectionVIIprovidesourexperimentalresults.RelatedworkanddiscussionarepresentedinSectionInthissection,wepresentthebackgroundofRSMsarewidelyusedindistributedsystemswiththetargettotoleratefailures.Failuremodesfallintonon-Byzantine(fail-stop,crashandmessageloss)andByzantine(signedmessagesornot).Withthecomplexitythatatleast3f+1nodesfortoleratingfByzantinefailuresandthelowoccurrenceprobabilityofByzantinefailures,commonRSMsonlyconsidernon-Byzantineones.Andmessagelossfailurescanbeeasilyeliminatedbynetworkprotocols,sopractitionersusuallysaythatRSMscantoleratecrashfailures.ChandraandToueg[30]haveprovedthatthetotalorderbroadcastandthedistributedconsensusproblemsaresolvableandequivalenttoeachotherunderWfailuredetectorsandatmost[n/2]simultaneouscrashfailures.Inthiswork,wealsofollowtheseassumptions.ItmeansthatwediscussRSMswhichtolerateuptofcrashfailureswith2f+1replicasandcaneventuallydetectthecrashofnodes.However,practicalfailuredetectorsmustoutputaresult(crashornot)inaboundedtimeotherthanalong-waitingforeventualconclusions.[31]classifiesthequalityofserviceofpracticalfailuredetectorsintotwotypes:speed,i.e.,howfastafailuredetectordetectsacrash;andaccuracy,i.e.,howwellitavoidsfalsesuspicion.ThereforepracticalfailuredetectorsofRSMshaveaprobabilitytofalselysuspectonanormalAssurveyedin[23-29],RSMsaretypicallyimplementedbynon-uniformtotalorderbroadcast,distributedconsensusanduniformtotalorderbroadcast.AsshowninFigure1,thethreerepresentativeRSMimplementationsworkasfollows:Innon-uniformtotalorderbroadcast,onceacoordinatorreceivesawrite,itassignsthewriteauniquesequencenumberandthendirectlycommitsittoallreplicas.Whenthewritearrivesatareplica,thereplicawillrecordthewrite,andcommititaccordingtoitssequencenumberandthenrespondtothecoordinator.Oncethecoordinatorhasreceivedthefirstresponse,itwillreplytotheclient.Notethatanytimeacoordinatorissuspectedascrashedbyfailuredetectors,thenewcoordinatorwillbeelected[32]andrecoversfailuresbasedonthehistoricalcommitrecords.FordistributedconsensusbasedRSMs,onceacoordi-natorreceivesawrite,atfirstitpropagatesthewritewiththegivensequencenumbertoallreplicas.Afterloggingthewrite,everyreplicawillsendanacknowledgement(ACK)messagetothecoordinator.Untilatleast[n/2]+1ACKs(includingthecoordinatoritself)arereceivedbythecoordinator,itwillcommitthewrite.Whenthecoordinatorissuspectedascrashedbyfailuredetectors,thenewcoordinatorwillbeelectedandrecoverfailuresbasedonthehistoricallogrecords(notthehistoricalcommitrecords).Uniformtotalorderbroadcastisverysimilartodistribut-edconsensus,butthecoordinatormustwaitforallnACKstobereturned.Fig.1.ThreeRepresentativeRSMAssumingwithoutlossofgeneralitythecoordinatorsendsitselfanACKmessage,wecandiscoverthatanobviousdif-ferenceamongthethreerepresentativeRSMimplementationsisthenumberofACKs(denotedbyd)thatthecoordinatorwaitsforbeforecommittingawrite.Andthevaluesofdfornon-uniformtotalorderbroadcast,distributedconsensusanduniformtotalorderbroadcastare1,[n/2]+1andn,respectively.Thuswecaneasilyconcludethatthelatenciesofthemfromlowtohigharenon-uniformtotalorderbroadcast,distributedconsensusanduniformtotalorderbroadcast.Intermsofconsistency,distributedconsensus(Paxos)hasshoweditssafety(strongconsistency)in[33].Thatis,RSMsarealwaysconsistentif[n/2]+1≤d≤n.However,fornon-uniformtotalorderbroadcast,ifacommittingwriteislostinhistoricalcommitrecordsduetocrashfailures(i.e.,allcommittednodesincludingthecoordinatorcrash)orfailuredetectiononthecoordinatorhappens,thesequencenumberheldbythewriteorusedbythefalselysuspectedcoordinatorwillbereusedtokeepsystemliveness.Thesewillleadtowriteconflictandthenresultinwriteinconsistency.RSM-d:AGENERICRSMInthissection,wewillpresentoursystemmodelRSM-d,whichisthe ionofRSMimplementations.SinceourdesiredsystemmodelshouldcoverthethreerepresentativeRSMs,wehavesurveyedthemandtrytoderivethesystemAlthoughwepresent(non-uniformanduniform)totalorderbroadcastalgorithmsinsectionII,theyareonlyoneoffiveclasses.[23]providesacomprehensivesurveyfortotalorderbroadcast,consideringallfiveclassesoforderingmechanismsandbothnon-uniformanduniformalgorithms.Thefirstthreeorderingmechanisms:fixedsequencer(showninsectionII),movingsequencerandprivilege-basedarebuiltbasedonafixedsequenceroramovingtoken;Theothertwo:communicationhistoryanddestinationagreementareimplementedbasedonatotalorderlogicalclock[16].AndthesolutionofdistributedconsensusforRSMisthesamewiththedestinationagreement,wherePaxosisusedtomakeagreementontheorderforawrite,andmulti-paxos[25]forallwrites.Essentiallyweadda’sequencer’whichcancomparetheordersofanytwowritestocausalorder,andthengaintotalorder.Basedontheanalysisabove,weareconcernedwith

Fig.2.RSM-dtotalorderbroadcastwiththefixedsequencerandthePaxosalgorithm.Thereasonsare:(1)thefixedsequenceralgorithmhasthelowestlatencyamongalltotalorderbroadcastones(onpoint-to-pointnetworks)[24];(2)Paxosisprovedtobeoptimalingeneralconsensusalgorithms[34];(3)boththetokeninthemovingsequencerandprivilege-basedalgorithms,andthelogicalclockinthecommunicationhistoryanddesti-nationagreementalgorithms,canbeseenas”anotherfixedThenwecombinetotalorderbroadcastwiththefixedsequencerandthePaxosalgorithmtoanunifiedsystemmodelRSM-d,whichisfeasiblesincetotalorderbroadcastanddistributedconsensusareessentiallyequivalenttoeachother[30].AsdepictedinFigure2,RSM-dincludestheagreementphaseandthecommitphase.Intheagreementphase,anywritemustbeb

温馨提示

  • 1. 本站所有资源如无特殊说明,都需要本地电脑安装OFFICE2007和PDF阅读器。图纸软件为CAD,CAXA,PROE,UG,SolidWorks等.压缩文件请下载最新的WinRAR软件解压。
  • 2. 本站的文档不包含任何第三方提供的附件图纸等,如果需要附件,请联系上传者。文件的所有权益归上传用户所有。
  • 3. 本站RAR压缩包中若带图纸,网页内容里面会有图纸预览,若没有图纸预览就没有图纸。
  • 4. 未经权益所有人同意不得将文件中的内容挪作商业或盈利用途。
  • 5. 人人文库网仅提供信息存储空间,仅对用户上传内容的表现方式做保护处理,对用户上传分享的文档内容本身不做任何修改或编辑,并不能对任何下载内容负责。
  • 6. 下载文件中如有侵权或不适当内容,请与我们联系,我们立即纠正。
  • 7. 本站不保证下载资源的准确性、安全性和完整性, 同时也不承担用户因使用这些下载资源对自己和他人造成任何形式的伤害或损失。

评论

0/150

提交评论