Fault-Tolerant Computing
Motivation, Background, and Tools

Introduction and Motivation, Sep. 2006

About This Presentation

Edition   Released    Revised   Revised
First     Sep. 2006

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. Behrooz Parhami

Design, Implementation, Operation, and Human Mishaps

The Curse of Complexity

Computer engineering is the art and science of translating user requirements we do not fully understand; into hardware and software we cannot precisely analyze; to operate in environments we cannot accurately predict; all in such a way that the society at large is given no reason to suspect the extent of our ignorance. [1]

[1] Adapted from definition of structural engineering: Ralph Caplan, By Design: Why There Are No Locks on the Bathroom Doors in the Hotel Louis XIV and Other Object Lessons, Fairchild Books, 2004, p. 229

Microsoft Windows NT (1992): 4M lines of code
Microsoft Windows XP (2002): 40M lines of code
Intel Pentium processor (1993): 4M transistors
Intel Pentium 4 processor (2001): 40M transistors
Intel Itanium 2 processor (2002): 500M transistors

Defining Failure

Failure is an unacceptable difference between expected and observed performance. [1]

[1] Definition used by the Tech. Council on Forensic Engineering of the Amer. Society of Civil Engineers

A structure (building or bridge) need not collapse catastrophically to be deemed a failure

Reasons of typical Web site failures
  Hardware problems: 15%
  Software problems: 34%
  Operator error: 51%

[Figure: diagram relating Specification and Implementation, with a "?" label]

Design Flaws: “To Engineer is Human” [1]

[1] Title of book by Henry Petroski

Complex systems almost certainly contain multiple design flaws
Redundancy in the form of safety factor is routinely used in buildings and bridges
One catastrophic bridge collapse every 30 years or so
See the following amazing video clip (Tacoma Narrows Bridge):
http://www.enm.bris.ac.uk/research/nonlinear/tacoma/tacnarr.mpg
Example of a more subtle flaw: Disney Concert Hall in Los Angeles reflected light into a nearby building, causing discomfort for tenants due to blinding light and high temperature

Design Flaws in Computer Systems

Hardware example: Intel Pentium processor, 1994
  For certain operands, the FDIV instruction yielded a wrong quotient
  Amply documented and reasons well-known (overzealous optimization)

Software example: Patriot missile guidance, 1991
  Missed intercepting a Scud missile in 1st Gulf War, causing 28 deaths
  Clock reading multiplied by 24-bit representation of 1/10 s (unit of time) caused an error of about 0.0001%; normally, this would cancel out in relative time calculations, but owing to ad hoc updates to some (not all) calls to a routine, calculated time was off by 0.34 s (over 100 hours), during which time a Scud missile travels more than 1/2 km

User interface example: Therac 25 machine, mid 1980s [1]
  Serious burns and some deaths due to overdose in radiation therapy
  Operator entered “x” (for x-ray), realized error, corrected by entering “e” (for low-power electron beam) before activating the machine; activation was so quick that software had not yet processed the override

[1] Accounts of the reasons vary

Learning Curve: “Normal Accidents” [1]

[1] Title of book by Charles Perrow (Ex. p. 125)

Example: Risk of piloting a plane
  1903   First powered flight
  1908   First fatal accident
  1910   Fatalities = 32 (2000 pilots worldwide)
  1918   US Air Mail Service founded
         Pilot life expectancy = 4 years
         31 of the first 40 pilots died in service
  1922   One forced landing for every 20 hours of flight
  Today  Commercial airline pilots pay normal life insurance rates

Mishaps, Accidents, and Catastrophes

Mishap: misfortune; unfortunate accident
Accident: unexpected (no-fault) happening causing loss or injury
Catastrophe: final, momentous event of drastic action; utter failure

At one time (following the initial years of highly unreliable hardware), computer mishaps were predominantly the results of human error
Now, most mishaps are due to complexity (unanticipated interactions)

Forum on Risks to the Public in Computers and Related Systems
http://catless.ncl.ac.uk/risks (Peter G. Neumann, moderator)

Example from

On August 17, 2006, a class-two incident occurred at the Swedish atomic reactor Forsmark. A short-circuit in the electricity network caused a problem inside the reactor and it needed to be shut down immediately, using emergency backup electricity. However, in two of the four generators, which run on AC, the AC/DC converters died. The generators disconnected, leaving the reactor in an unsafe state and the operators unaware of the current state of the system for approximately 20 minutes. A meltdown, such as the one in Chernobyl, could have occurred.

Coincidence of problems in multiple protection levels seems to be a recurring theme in many modern-day mishaps: emergency systems had not been tested with the grid electricity being off

Layers of Safeguards

With multiple layers of safeguards, a system failure occurs only if warning symptoms and compensating actions are missed at each layer, which is quite unlikely
Is it really?
The computer engineering literature is full of examples of mishaps when two or more layers of protection failed at the same time
Multiple layers increase the reliability significantly only if the “holes” in the representation above are fairly randomly distributed, so that the probability of their being aligned is negligible
Dec. 1986: ARPANET had 7 dedicated lines between NY and Boston; a backhoe accidentally cut all 7 (they went through the same conduit)

A Problem to Think About

In a passenger plane, the failure rate of the cabin pressurizing system is $10^{-5}$/hr (loss of cabin pressure occurs once per $10^{5}$ hours of flight)
Failure rate of the oxygen-mask deployment system is also $10^{-5}$/hr
Assuming failure independence, both systems fail at a rate of $10^{-10}$/hr
Fatality probability for a 10-hour flight is about $10^{-10} \times 10 = 10^{-9}$ ($10^{-9}$ or less is generally deemed acceptable)
Probability of death in a car accident is 1/6000 per year ($10^{-7}$/hr)

Alternate reasoning
  Probability of cabin pressure system failure in 10-hour flight is $10^{-4}$
  Probability of oxygen masks failing to deploy in 10-hour flight is $10^{-4}$
  Probability of both systems failing in 10-hour flight is $10^{-8}$
Why is this result different from that of our earlier analysis ($10^{-9}$)?
Which one is correct?

Cabin Pressure and Oxygen Masks

When we multiply the two per-hour failure rates and then take the flight duration into account, we are assuming that only the failure of the two systems within the same hour is catastrophic
This produces an optimistic reliability estimate ($1 - 10^{-9}$)

When we multiply the two flight-long failure rates, we are assuming that the failure of these systems would be catastrophic at any time
This produces a pessimistic reliability estimate ($1 - 10^{-8}$)

[Figure: two timelines over hours 0 to 10 of the flight, each marking “Masks fail” and “Pressure is lost”]

Causes of Human Errors in Computer Systems

1. Personal factors (35%): Lack of skill, lack of interest or motivation, fatigue, poor memory, age or disability
2. System design (20%): Insufficient time for reaction, tedium, lack of incentive for accuracy, inconsistent requirements or formats
3. Written instructions (10%): Hard to understand, incomplete or inaccurate, not up to date, poorly organized
4. Training (10%): Insufficient, not customized to needs, not up to date
5. Human-computer interface (10%): Poor display quality, fonts used, need to remember long codes, ergonomic factors
6. Accuracy requirements (10%): Too much expected of operator
7. Environment (5%): Lighting, temperature, humidity, noise

Because “the interface is the system” (according to a popular saying), items 2, 5, and 6 (40%) could be categorized under user interface

Properties of a Good User Interface

1. Simplicity: Easy to use, clean and unencumbered look
2. Design for error: Makes errors easy to prevent, detect, and reverse; asks for confirmation of critical actions
3. Visibility of system state: Lets user know what is happening inside the system from looking at the interface
4. Use of familiar language: Uses terms that are known to the user (there may be different classes of users, each with its own vocabulary)
5. Minimal reliance on human memory: Shows critical info on screen; uses selection from a set of options whenever possible
6. Frequent feedback: Messages indicate consequences of actions
7. Good error messages: Descriptive, rather than cryptic
8. Consistency: Similar/different actions produce similar/different results and are encoded with similar/different colors and shapes

Operational Errors in Computer Systems

Hardware examples
  Permanent incapacitation due to shock, overheating, voltage spike
  Intermittent failure due to overload, timing irregularities, crosstalk
  Transient signal deviation due to alpha particles, external interference

Software examples (these can also be classified as design errors)
  Counter or buffer overflow
  Out-of-range, unreasonable, or unanticipated input
  Unsatisfied loop termination condition

Dec. 2004: “Comair runs a 15-year old scheduling software package from SBS International. The software has a hard limit of 32,000 schedule changes per month. With all of the bad weather last week, Comair apparently hit this limit and then was unable to assign pilots to planes.” It appears that they were using a 16-bit integer format to hold the count.

June 1996: Explosion of the Ariane 5 rocket 37 s into its maiden flight was due to a silly software error. For an excellent exposition of the cause, see:
p.lancs.ac.uk/computing/users/dixa/teaching/CSC221/ariane.pdf

About the Name of This Course

Fault-tolerant computing: a discipline that began in the late 1960s
1st Fault-Tolerant Computing Symposium (FTCS) was held in 1971
In the early 1980s, the name “dependable computing” was proposed for the field, to account for the fact that tolerating faults is but one appr