### **MHRD Scheme on**

## Global Initiative of Academic Networks (GIAN)

**GIAN Course**

# Reliability Engineering and Fault Tolerant Computing

9<sup>th</sup> to 13<sup>th</sup> January 2017

Dept of Computer Science and Engineering, Indian Institute of Technology, Patna

#### **Overview**

Electronics and computing are rapidly transforming society by making people critically dependent on machines whose decisions influence individual and societal quality of life, health and financial well-being. Consequently, it is becoming imperative that the electronic hardware underlying the ubiquitous computing, communication, sensing, actuation and control platforms be entirely error free and perform correctly as per the design intent. However, it is virtually impossible to manufacture electronic components that never fail in aggressively scaled nanometer IC technologies at staggering complexities (multi-billion transistors and growing). There is therefore a need for computing systems, in both hardware and software, to build in a level of resilience and fault tolerance appropriate for their application. Developing improved reliability and fault tolerance methodologies is a continuing and critical challenge for the microelectronics and computer industry. This course aims at a thorough understanding of the motivations, strengths, limitations, costs and effectiveness of the key hardware fault tolerance approaches that have emerged over the past few decades.
While the first part of the course will review the basics of reliability and fault tolerance schemes, including their quantitative evaluation, to facilitate a full understanding of the advanced material to follow, the latter third will focus on discussions of a range of actual fault tolerant architectures that have been deployed in applications as diverse as space shuttle flight control, engine control on commercial aircraft, control of nuclear reactors, on-line electronic stock trading exchanges, and mainframe enterprise systems.

The course will be presented by Professor Adit Singh, an IEEE Fellow and leading expert on electronic system test, reliability and fault tolerance, who has worked in this field for nearly 40 years. At various times, he has served as a consultant to most of the major semiconductor companies, and he holds international patents in the test field that have been licensed to industry. He has also served as Chair (2007-11) of the IEEE Test Technology Technical Council. Importantly, Dr. Singh has taught dozens of very popular short courses and tutorials on system reliability and fault tolerance worldwide, at conferences and in-house for industry.

#### **Objectives**

The primary objectives of the course are to:

- Provide a strong fundamental understanding of reliability and fault tolerance: the various kinds of threats to electronic systems; defects, faults, and errors; stochastic modeling of failure/hazard rates and operational system lifetimes; redundancy for fault tolerance; TMR and other static (masking) redundancy approaches; active and hybrid redundancy; time redundancy against intermittent errors; checkpointing, rollback and recovery; examples of redundancy in real systems; reliability modeling of redundant systems.
- Discuss in depth some real fault tolerant architectures: the evolution of the Reliability, Availability and Serviceability (RAS) features in IBM mainframes; the Non-Stop architecture from Tandem/Compaq/HP running the NASDAQ electronic stock exchange; the 5-way redundant flight control system in the NASA space shuttles; quad redundant engine controllers on GE aircraft engines, etc.

#### Course organization

The course consists of 5 modules, each approximately two hours long as appropriate for the topic, for a total of 10 hours of lectures.

1. Introduction: Understanding faults - the threat to computer system reliability
2. Basic redundancy and fault tolerant approaches
3. Fault tolerance on chip for yield enhancement and power management
4. Fault Tolerant Architectures I: Commercial systems
5. Fault Tolerant Architectures II: Aerospace and life critical systems

Each module is described in more detail below.

#### **Course Details**

#### Module 1: Introduction: Understanding faults - the threat to computer system reliability

Introduction and motivation; understanding the various kinds of threats to electronic systems; defects, faults, and errors; intermittent and permanent faults; single event upsets; latent defects and early life failure; component defective parts per million (DPPM) and failure-in-time (FIT) rates and statistics; stochastic modeling of failure/hazard rates and system operational lifetimes.

#### Module 2: Reliability Modeling and fault tolerant approaches

Redundancy: the key to fault tolerance; TMR, NMR and other static (masking) redundancy approaches; error correcting codes; active and hybrid redundancy; time redundancy against intermittent errors; checkpointing, rollback and recovery; software redundancy approaches against intermittent faults; N-version programming; design diversity against design errors; reliability modeling and estimation in redundant systems.
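As a small illustration of the reliability modeling topics listed for Modules 1 and 2 (it is not taken from the course material), the sketch below compares a single module with a TMR arrangement. It assumes a constant failure rate (exponential lifetimes), independent module failures and a perfect voter; the function names and the 1000 FIT figure are hypothetical choices made for the example.

```python
import math


def fit_to_lambda(fit):
    """Convert a FIT rate (failures per 10^9 device-hours) to failures per hour."""
    return fit / 1e9


def simplex_reliability(lam, t_hours):
    """Reliability of one module with constant failure rate lam: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t_hours)


def tmr_reliability(lam, t_hours):
    """Reliability of triple modular redundancy with a perfect voter.

    The system works if at least 2 of 3 independent modules work:
    R_TMR = R^3 + 3 * R^2 * (1 - R) = 3R^2 - 2R^3.
    """
    r = simplex_reliability(lam, t_hours)
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    lam = fit_to_lambda(1000)  # hypothetical module failure rate of 1000 FIT
    for years in (1, 10, 100):
        t = years * 8760  # hours in a (non-leap) year
        print(years, simplex_reliability(lam, t), tmr_reliability(lam, t))
```

Under these assumptions TMR improves reliability only while the single-module reliability is above 0.5; for mission times beyond that crossover the redundant system is less reliable than a single module, which is one reason quantitative evaluation matters when choosing a redundancy scheme.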
#### Module 3: Fault tolerance on-chip for yield enhancement and power management

How manufacturing defects limit IC chip area; on-chip fault (defect) tolerance for yield enhancement; IC yield modeling; the Trilogy attempt at wafer scale integration; defect tolerance in memories; fault tolerant interconnections and networks on chips (NOCs); error detection and recovery in power saving better-than-worst-case (BTWC) designs; the 32-bit ARM BTWC microprocessor in 65nm technology; Intel's experiments with BTWC design.

#### Module 4: Fault Tolerant Architectures I: Commercial systems

In-depth discussion of real commercial fault tolerant architectures: the evolution of the Reliability, Availability and Serviceability (RAS) features in IBM mainframes, from the IBM 370 to modern Z series Enterprise systems; the hardware and software evolution of the "Non-Stop" architecture running the NASDAQ electronic stock exchange, from the Tandem design of the 1980s to the Compaq systems in the 1990s and the most recent HP system upgrades.

#### Module 5: Fault Tolerant Architectures II: Aerospace and other life critical systems

The evolution of aerospace flight control computers; designing safety critical systems to stringent reliability specifications; the hardware architecture and design diversity of the space shuttle computers; architecture of the quad redundant engine controllers of GE aircraft engines; redundant control for nuclear reactors. (In this final module, some time will also be reserved for questions and discussion.)

#### Who can attend

- Faculty members from reputed academic institutions.
- Research scholars and postgraduate students from reputed academic institutions.

#### How will I register?
#### Step 1: One Time Registration

In order to register for any course under GIAN, candidates must first register at the GIAN Portal of IIT Kharagpur using the following steps:

1. Create login and password at http://www.gian.iitkgp.ac.in/GREGN/index
2. Login and complete the Registration Form.
3. Select the course to be attended.
4. Confirm your application and payment information.
5. Pay Rs. 500/- (non-refundable) through the online payment gateway.

Download and print the "pdf file" of your enrolment application form for your personal records, and send a copy of the same to the course coordinator.

#### Step 2: Institute Registration

Institute registration is an **offline process**. Contact the course coordinators.

#### **Registration Fees**

- Participants from abroad: US \$100
- Industry / Research Organizations: Rs. 2500/-
- Faculty members from Academic Institutions: Rs. 2500/-
- Research Scholars / Students: Rs. 1000/-

The above fees include all instructional materials, computer use for tutorials, and 24-hour free Internet access. Accommodation will be provided to participants, if available, on a payment basis.

#### **Faculty**

Dr. Adit D. Singh received an undergraduate degree from the Indian Institute of Technology (IIT) Kanpur (1976), and the M.S. (1978) and Ph.D. (1982) from Virginia Tech, all in Electrical Engineering. Since September 2002, he has served as James B. Davis Distinguished Professor of Electrical and Computer Engineering at Auburn University, where he directs the VLSI Design and Test Laboratory. Before joining Auburn in 1991, he was Associate, and earlier Assistant, Professor of Electrical and Computer Engineering at the University of Massachusetts in Amherst, and a full-time Instructor at Virginia Tech (1978-82). He has also held visiting positions during sabbaticals at major universities, most recently in 2012 serving as "Guest Professor" at the University of Freiburg, Germany.
His research program has received extensive support from the US National Science Foundation and private industry, and also from international agencies such as the Max Planck Society of Germany, the Fulbright Foundation, the Ministry of Science and Technology in India, and the National Science Council of Taiwan.

Dr. Singh's technical interests span all aspects of VLSI technology, in particular integrated circuit test, reliability and fault tolerance. He is particularly recognized for his pioneering contributions to statistical methods in test and adaptive testing. He has published over two hundred research papers, served as a consultant to many of the largest semiconductor companies around the world, and holds international patents that have been licensed to industry. He has held leadership roles as General Chair/Co-Chair/Program Chair for dozens of international VLSI design and test conferences, including co-founding the annual India-based International Conference on VLSI Design (with Professor Vishwani Agrawal) in 1990-91. Most recently he was Program Chair of the 2014 International Conference on VLSI Design, Co-Chair of the 2014-16 Workshop on Reliability Aware Design, and is the Program Chair for the 2015 Asian Test Symposium. He currently also serves on the editorial boards of IEEE Design and Test Magazine and the Journal of Electronic Testing: Theory and Applications (JETTA), and on the Steering and Program Committees of many of the major IEEE international test and design automation conferences.

Dr. Singh is also a very popular lecturer. In addition to the dozens of talks and seminars he has presented around the world on his research, he is regularly invited by conferences and industry to conduct short courses on cutting edge technical topics in his specialty.
Over the years, he has conducted almost 100 such courses, ranging from half a day to three days in length, in over a dozen different countries, and in-house for many major companies (IBM, Texas Instruments, AMD, National Semiconductor, NXP, Advantest, etc.).

Dr. Singh has received numerous research and teaching awards. He was elected Fellow of IEEE in 2002 for "contributions to defect based testing and test optimization in VLSI circuits". He is a Golden Core member of the IEEE Computer Society. He served two elected terms (2007-11) as Chair of the IEEE Test Technology Technical Council (TTTC), and on the Board of Governors of the IEEE Council on Electronic Design Automation (CEDA) (2011-15).

#### **Course Coordinators / Host Faculties**

Contact Details:

Dr. Jimson Mathew and Dr. Arijit Mondal

Department of Computer Science and Engineering, Indian Institute of Technology Patna, India

Phone: +91 612 3108347

Email: jimson@iitp.ac.in / arijit@iitp.ac.in

For more information, visit: http://iitp.ac.in/