Michael R. Lyu (AT&T Bell Laboratories)

Chapter 1: Introduction

Reliable software is crucial for a wide range of applications, including communication systems, computer networks, and financial systems. In this chapter, we discuss the need for reliable software and explore some of the challenges associated with creating it. We also look at how AT&T Bell Laboratories has addressed these challenges through its approach to software development.

1.1 The Need for Reliable Software

Reliable software is essential for applications that perform time-critical operations or require high availability. In communication systems, software must deliver data packets completely and accurately. In computer networks, it must maintain secure connections between devices. In financial systems, it must handle large volumes of data efficiently and securely.

The challenge of creating reliable software lies in identifying and addressing potential failures before they occur. This often requires extensive testing and analysis to ensure that the software can handle unexpected events and recover from errors gracefully. Additionally, reliable software must be designed with fault tolerance in mind, ensuring that it can continue to function under extreme conditions without significant downtime or loss of data.

To meet these requirements, AT&T Bell Laboratories has adopted a rigorous approach to software development that includes multiple layers of testing and validation. This includes unit testing, integration testing, system testing, and acceptance testing to ensure that each component of the software works as expected. Additionally, AT&T Bell Laboratories uses techniques such as modeling and simulation to identify potential issues early in the development process, allowing developers to make informed decisions about design choices and test coverage.

Overall, creating reliable software is no small feat. It requires a deep understanding of the underlying technologies and systems and the ability to anticipate and mitigate potential failure scenarios. AT&T Bell Laboratories’ approach to software development demonstrates the importance of investing in reliable software and the benefits of adopting a proactive, test-driven development culture.

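Nearly fifty years have passed since the first electronic digital computer was invented [Burk46a], and human life can no longer do without computers. The computer revolution has created the fastest technological advancement the world has ever seen. Today, computer hardware and software permeate our modern society. The newest cameras, VCRs, and automobiles cannot be controlled and operated without computers.

Computers are embedded in wristwatches, telephones, home appliances, buildings, aircraft, and many other objects. Science and technology continually demand high-performance hardware and high-quality software for improvements and breakthroughs. We can look at virtually any industry (automotive, avionics, oil, telecommunications, banking, semiconductor, pharmaceutical): all of them depend heavily, or at least to some degree, on the capabilities of computers.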

Over the past decade, computer systems have evolved to a more advanced level of complexity and sophistication. This trend is expected to continue in the future, with examples of highly complex hardware/software systems found in projects undertaken by various governmental agencies and private industries.

NASA, the Department of Defense, the Federal Aviation Administration, and the telecommunications industry are just a few of the organizations involved in such projects. For instance, NASA's Space Shuttle flies with about 500,000 lines of software code on board, while the International Space Station Alpha is estimated to have millions of lines of software for its navigation, communication, and experimentation. In the telecommunications industry, phone carriers rely on hundreds of software systems for operations that involve hundreds of millions of lines of source code. The avionics industry, meanwhile, has almost all new payload instruments equipped with their own microprocessor systems and extensive embedded software.

The Federal Aviation Administration's Advanced Automation System, the new-generation air traffic control system, also relies on extensive hardware and complicated software. Even our personal computers cannot function without operating systems such as Windows, which range from 1 to 5 million lines of code, and many shrink-wrapped software packages of similar size are in daily use on these computers in various applications.

The demand for complex hardware/software systems has increased more rapidly than the ability to design, implement, test, and maintain them. When the requirements for and dependencies on computers increase, the possibility of crises from computer failures also increases. The impact of these failures ranges from inconvenience (e.g., malfunctions of home appliances), economic damage (e.g., interruptions of banking systems), to loss of life (e.g., failures of flight systems or medical software). Needless to say, the reliability of computer systems has become a major concern for our society.

The computer revolution has been uneven: software bears an ever larger share of the burden, yet its progress has been slower than that of hardware. It is the integrating potential of software that allows designers to contemplate more ambitious systems with a broader, multidisciplinary scope, and the growing use of software components is largely responsible for the high overall complexity of many system designs. However, compared with the rapid advancement of hardware technology, software technology has failed to keep pace on all measures, including quality, productivity, cost, and performance.

When we entered the last decade of the 20th century, computer software had already become the major source of reported outages in many systems [Gray90a]. Consequently, recent literature is replete with horror stories of projects gone awry, generally resulting from problems traced to software.

Software failures have become a significant issue in several major projects. In the NASA Voyager project, delays due to late software deliveries and limited capability in the Deep Space Network threatened the Uranus encounter. Similarly, several Space Shuttle missions were delayed due to hardware/software interaction problems, affecting both the operational capabilities and schedules of these critical spacecraft.

In one DoD project, software issues caused delays in the first flight of the AFTI/F-16 jet fighter. The delay lasted over a year, with none of the advanced modes originally planned being used. Critical software failures have also affected numerous civil and scientific applications. For example, the ozone hole over Antarctica might have received earlier attention had it not been for a data analysis program that suppressed anomalous data, labeling it as "out of range."

Moreover, software glitches in an automated baggage-handling system forced Denver International Airport to sit empty for more than a year after flights were supposed to fill its gates and runways. These incidents underscore the importance of reliable software, especially in critical infrastructure and transportation systems, where even minor glitches can have significant implications for safety, efficiency, and budgets.

Unfortunately, software can also kill people. The massive Therac-25 radiation therapy machine had a perfect safety record until software errors in its sophisticated control systems claimed several patients' lives in 1985 and 1986 [Lee92a]. On October 26, 1992, the Computer Aided Dispatch system of the London Ambulance Service broke down right after its installation, paralyzing the capability of the world's largest ambulance service to handle the 5000 daily requests to carry patients in emergency situations [SWTR93a]. In the aviation industry, although the real causes of several airliner crashes in the past few years remain mysteries, experts have pointed out that software control could be the chief suspect in some of these incidents because of its inappropriate response to pilots' desperate inquiries during abnormal flight conditions.

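Software failures also have serious negative impacts in the business world. On January 15, 1990, a software fault in a switching system caused massive disruption of a major long-distance carrier's network, and a subsequent series of local telephone outages in the summer of 1991 was likewise traced to software problems [Lee92a]. These critical incidents caused enormous revenue losses for the companies that rely on telecommunications carriers for their business operations.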

Many software systems and packages are distributed and installed in identical or similar copies, all of which are vulnerable to the same software failures. This is why even the most powerful software companies, such as Microsoft, are fearful of "killer bugs," which can easily wipe out all the profits of a successful product if a recall is required on the tens of millions of copies sold. For this reason, many software companies see a major share of project development costs go to the design, implementation, and assurance of reliable software, and they recognize a tremendous need for systematic approaches using software reliability engineering techniques. Clearly, developing the required techniques for software reliability engineering is a major challenge to computer engineers, software engineers, and engineers of various disciplines, now and for decades to come.

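Software reliability engineering concepts are the principles and practices used to ensure that software systems operate stably and effectively under a variety of conditions. They aim to improve the availability, maintainability, and testability of software, and thereby the reliability of the software system as a whole. Some of the key concepts are:

1. Fault tolerance: the ability of a software system to recognize errors and take appropriate action to recover from or correct them, so that the system keeps functioning. It includes techniques for error detection, error reporting, error correction, and error handling.

2. Recoverability: the ability of a software system to return quickly to normal operation after a failure. It usually involves measures such as data backup, redoing operations, and restoring service.

3. Redundancy: the use of additional components or resources to improve the reliability of a software system. For example, redundant storage, redundant network connections, and redundant power supplies can prevent the failure of a single component from affecting the stability of the whole system.

4. Robustness: the ability of a software system to maintain its performance and reliability under abnormal conditions. It can be achieved by designing robust algorithms, choosing sound data structures, and implementing effective exception handling.

5. Portability: the adaptability of software to different hardware and operating system platforms. It involves the design, coding, and testing of the software to ensure that it works correctly in different environments.

6. Maintainability: the ease with which a software system can be modified and understood. It can be improved by following standards, writing clear code comments, and using version control systems.

7. Security: the ability to protect software against unauthorized access, tampering, and attack. It can be improved by measures such as encryption, authentication, and audit logging.

8. Scalability: the ability of a software system to accommodate growing demand and load. It can be improved by designing a modular architecture, optimizing resource allocation, and implementing load balancing.

9. Customizability: the ability of software to be configured and adjusted to user needs. It can be achieved by providing flexible configuration options, supporting custom behavior, and adopting configuration-driven development.

10. Standardization: the use of uniform standards and protocols in designing and implementing software systems. Following the relevant industry standards and best practices improves the interoperability and compatibility of software.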

Software reliability engineering is centered on a very important software attribute: reliability. Software reliability is defined as the probability of failure-free software operation for a specified period of time in a specified environment [ANSI91a]. It is one of many dimensions of software quality, which also include functionality, usability, performance, serviceability, and other factors that affect customer satisfaction.

Among these attributes, reliability is generally accepted as the key factor in software quality because it quantifies software failures, the unwelcome events that can render software useless or even harmful to the entire system. Software failures can even be fatal. As such, reliability is seen as a fundamental determinant of customer satisfaction.

In line with this, ISO 9000-3 specifies field failures as the basic requirement for quality metrics, stating that at a minimum, some metrics should represent reported field failures and/or defects from the customer's perspective. The supplier of software products is expected to collect and act on quantitative measures of the product's quality. This emphasis on reliability underscores the importance of ensuring that software meets the needs of customers and remains functional throughout its lifecycle. It reflects a broader recognition of software quality as encompassing not only functionality but also reliability, which is often considered a core metric in assessing overall software quality.

Example 1.1 illustrates the impact of high-severity failures on customer satisfaction. Figure 1.1 (not shown here) presents the results of a survey of nine large software projects, conducted to identify the factors that contribute to customer satisfaction. The projects are telecommunications systems responsible for day-to-day operations in the U.S. local telephone business, with an average size of 1 million lines of source code. The survey asked telephone customers to assign each system a quality score between 0 and 100.

In the meantime, Trouble Reports (failure reports in the field) were collected from these projects. Figure 1.1 shows an overall quality score from this survey plotted against the number of high-severity Trouble Reports received.

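From Figure 1.1 we observe a high negative correlation (-0.86) between the overall quality score and the number of high-severity failures for each project. This example shows that, in the telecommunications industry, the number of critical software failures is promptly reflected in a negative customer perception of overall software quality. This quality indicator is also generally applicable to many other industries.

Software reliability is also one of the system dependability concepts discussed in detail in Chapter 2. Example 1.2 demonstrates the impact of software reliability on system reliability.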

A military distributed processing system has an MTTF requirement of 100 hours and an availability requirement of 0.99. As illustrated in Figure 1.2, the system comprises three subsystems, SYS1, SYS2, and SYS3, together with a local area network (LAN) and a 10 kW power generator (GEN). For the system to work, all of the components must work, except that SYS2 tolerates the failure of one of its modules, as described below.

In the initial stages of testing, hardware reliability parameters are estimated according to the MIL-HDBK-217 standard, as shown in Figure 1.2 for each component block. Two numbers appear above each component block: the upper number is the predicted MTTF for that component and the lower number is its MTTR, both in hours. For example, SYS1 has a predicted MTTF of 280 hours and an MTTR of 0.53 hours, whereas each SYS2 module and SYS3 has a predicted MTTF of 387 hours and an MTTR of 0.50 hours. SYS2 is configured as a triple modular redundant system, as indicated by the dotted-line block: it remains operational as long as two or more modules are operational. This fault-tolerant capability raises its MTTF to 5.01 x 10^4 hours and reduces its MTTR to 0.25 hours.

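To predict system reliability, all components in the system must be considered. If software is assumed not to fail (a mistake that system reliability engineers often make!), the result is a system MTTF of 125.9 hours, an MTTR of 0.62 hours, and a system availability of 0.995. The system would appear to meet its original requirements.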

But software does fail. The SYS2 and SYS3 software each contain 300,000 lines of source code. According to the prediction model described in Chapter 3 (Section 3.8.3) and in [RADC87a], the predicted initial failure rate for both the SYS2 software and the SYS3 software is 2.52 failures per execution hour. Note that the three instances of the SYS2 software are identical copies and are therefore not fault tolerant.

Even without accounting for possible SYS1 software failures, the system MTTF drops to 11.9 CPU minutes, compared with the 125.9 hours predicted from hardware alone.

Furthermore, if we assume that the MTTR remains 0.62 hours (it should be higher, since reinitializing the software normally takes longer), the system availability drops to 0.24, far below both the earlier prediction of 0.995 and the requirement of 0.99.

Note that CPU time is close to calendar time in this distributed system, so these figures can be compared directly with the calendar-time requirements.
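The figures in Example 1.2 can be reproduced with a short calculation. The sketch below is not part of the original example: it assumes that all components fail independently at constant rates and form a series system (so their failure rates add), and that steady-state availability is MTTF / (MTTF + MTTR). The function and variable names are hypothetical.

```python
# Sketch of the Example 1.2 arithmetic under stated assumptions:
# constant failure rates, independent components in series,
# availability = MTTF / (MTTF + MTTR).

HW_MTTF_HOURS = 125.9   # hardware-only system MTTF quoted in the text
MTTR_HOURS = 0.62       # system MTTR quoted in the text
# Predicted software failure rates (failures per execution hour) from the text.
SW_FAILURE_RATES = {"SYS2": 2.52, "SYS3": 2.52}


def series_mttf(failure_rates):
    """MTTF of a series system: reciprocal of the summed failure rates."""
    return 1.0 / sum(failure_rates)


def availability(mttf, mttr):
    """Steady-state availability from MTTF and MTTR (same time units)."""
    return mttf / (mttf + mttr)


hw_rate = 1.0 / HW_MTTF_HOURS
hw_availability = availability(HW_MTTF_HOURS, MTTR_HOURS)

system_mttf_hours = series_mttf([hw_rate, *SW_FAILURE_RATES.values()])
system_availability = availability(system_mttf_hours, MTTR_HOURS)

print(f"hardware-only availability: {hw_availability:.3f}")
print(f"system MTTF with software:  {system_mttf_hours * 60:.1f} minutes")
print(f"system availability:        {system_availability:.2f}")
```

Because the two software failure rates (5.04 failures per hour combined) dominate the hardware rate (about 0.008 per hour), the system MTTF is determined almost entirely by the software.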

Note: The system presented in Example 1.2 is a real-world example, and its reliability parameters were estimated following actual practice based on military handbooks [Lyu89a]. This example is not an extreme case. In fact, many existing large systems face the same situation: software reliability is the bottleneck of system reliability, and the maturity of software always lags behind that of hardware. Accurately modeling software reliability and predicting its trend have become essential, since this effort provides critical information for decision making and reliability engineering in most projects.

Reliability engineering is practiced daily across many engineering disciplines. Civil engineers use it to build bridges, and computer hardware engineers use it to design chips and computers. Using a similar concept, we define Software Reliability Engineering (SRE) as the quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability [IEEE95a].

SRE includes two key components:

(1) software reliability measurement, which includes estimation and prediction with the help of software reliability models established in the literature;

(2) the attributes and metrics of product design, development process, system architecture, software operational environment, and their implications on reliability.

1.3 Book Overview

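Mature engineering fields classify and organize proven solutions in handbooks so that most engineers can consistently handle complicated but routine designs. Just as every engineer has a handbook, a handbook for software engineering practice would be very useful. For a long time no such handbook existed for software, and as a result project after project failed, repeating the same mistakes year after year. This is mainly because software development is still an art: although we understand a good part of that art, it is far from being a practiced skill. The software crisis identified 25 years ago remains a well-known problem today [Gibb94a].

The reliability component of software engineering has now evolved into a practical engineering discipline. It is time to begin codifying our knowledge of SRE and making it available as a resource; this is the main purpose of this handbook. The handbook provides information on the key methods and methodologies used in SRE, covering state-of-the-art techniques and state-of-practice approaches. The book is divided into three parts and 17 chapters. Each chapter is written by SRE experts, including researchers and practitioners. The chapters cover the theory, design, methodology, modeling, evaluation, experience, and assessment of SRE techniques.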

Part I of the book, composed of five chapters, establishes the technical foundations for software reliability modeling techniques. It covers system-level dependability and reliability concepts, software reliability prediction and estimation models, model evaluation and recalibration techniques, and operational profile techniques. In Part I, Chapter 1 provides an introduction to the book, outlining its framework and highlighting the main contents of each subsequent chapter. It introduces basic ideas, terminologies, and techniques in SRE (Software Reliability Engineering).

Chapter 2 offers a general overview of the system dependability concept and demonstrates that classical reliability theory can be extended to include both hardware and software perspectives. The chapter also highlights how this extension can lead to a deeper understanding and more accurate predictions regarding system reliability.


Chapter 3 reviews the main software reliability models that have appeared in recent literature from both historical and application perspectives. Each model is presented with its motivation, model assumptions, data requirements, model form, estimation procedure, and general comments about its usage.

In Chapter 4, a systematic framework for conducting model evaluation of several competing reliability models using advanced statistical criteria is introduced. Recalibration techniques that can greatly improve model performance are also introduced.

Chapter 5 details an essential technique for SRE: the operational profile. The operational profile shows how to increase productivity and reliability and speed development by allocating project resources to functions based on how a system will be used.

Part II is comprised of six chapters, each detailing the SRE practices and experiences from leading organizations including AT&T, Jet Propulsion Laboratory, Tandem, IBM, NASA, Northern Telecom, and several international entities. The chapters provide a comprehensive look at various SRE procedures implemented for specific requirements in various environments. The authors of each chapter describe their practical procedures, their experiences, and lessons learned. Specifically:

(1) Chapter 6 describes the current best practice in SRE, adopted by over 70 projects within AT&T. This practice allows you to analyze, manage, and improve the reliability of software products, to balance customer needs in terms of cost, schedule, and quality, and to minimize the risks of releasing problematic software.

(2) Chapter 7 describes the experience of applying software reliability models to several large-scale projects at JPL. It discusses the SRE procedures, data collection efforts, modeling approaches, data analysis methods, reliability measurement results, lessons learned, and future directions. A practical scheme for improving measurement accuracy by linear combination models is also presented.

(3) Chapter 8 introduces measurement-based analysis techniques that directly measure software reliability by monitoring and recording failure occurrences in a running system under various user workloads. Experiences with the Tandem GUARDIAN, IBM MVS, and VAX VMS operating systems are explored.

(4) Chapter 9 presents a defect classification scheme that extracts semantic information from software defects in order to provide a measurement on the software development process. The chapter explains the framework, its implementation steps, and its advantages, and shows the successful application and deployment of the scheme in multiple projects at IBM.

(5) Chapter 10 examines software reliability trend analysis, which helps project managers control the progress of development activities and determine the efficiency of test programs. Results from a number of studies, including switching systems and avionics applications, are reported.

(6) Chapter 11 provides an overview of the process of collecting and analyzing software reliability field data. It discusses the underlying principles and uses case studies to illustrate the approach. The field data analysis covers projects from IBM, Hitachi, and Northern Telecom, as well as the Space Shuttle flight software.

Together, these chapters offer insights into the strategies these organizations use to maintain and improve software reliability, and they show how SRE practices can be tailored to specific requirements and environments.

Part III comprises six chapters that address specific topics that have been instrumental in advancing research in the software reliability field: software metrics, testing schemes, fault-tolerant software, fault-tree analysis, simulation, and neural networks. Each chapter explains the techniques in detail and establishes their relationship to, and impact on, software reliability. For example, one chapter explains how software metrics can be used to detect and prevent faults early in the development process, and another discusses the use of fault-tree analysis in complex systems to identify potential vulnerabilities. The chapters also suggest potential research topics and directions for further exploration in the field. Specifically:

(1) Chapter 12 describes how software metrics can be incorporated into reliability assessment, emphasizing the relationship between software complexity and reliability. A thorough analysis of the functional complexity and operational complexity of a program during development and maintenance is essential to the successful implementation of reliable software.

(2) Chapter 13 examines the relationship between software testing and reliability. It discusses the effect of testing activities on software reliability, and it uses program structure metrics and code coverage data to estimate software reliability and the associated risk.

(3) Chapter 14 explores the use of software fault tolerance to improve software reliability. It addresses the architecture, design, implementation, modeling, failure behavior, and cost of fault-tolerant systems.

(4) Chapter 15 introduces the fault tree technique for the reliability analysis of software systems. The technique helps to assess the impact of software failures on a system, to combine off-line and on-line tests to prevent or detect software failures, and to compare design alternatives for fault tolerance with respect to both reliability and safety.

(5) Chapter 16 demonstrates how simulation techniques can be applied to a typical software reliability engineering process, overcoming many of the simplifying assumptions often made in reliability modeling. It highlights the power, flexibility, and potential benefits of simulation, and it presents methods for representing the artifacts, activities, and events of the reliability process.

(6) Chapter 17 elaborates on the use of neural network technology in SRE applications: as a general reliability growth model for improved predictive accuracy, and as a classifier to identify fault-prone software modules.

1.4 Basic Definitions

This section explains the basic terms and concepts used in software reliability engineering, so that readers without prior knowledge of the subject can follow the material presented in the book.

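In the definition of software reliability, we note three major components: failure, time, and operational environment. We now define these terms and other related SRE terminology. We begin with the notion of a software system and its expected service.

Software systems. A software system is an interacting set of software subsystems embedded in a computing environment that provides inputs to the software system and accepts service (outputs) from the software. A software subsystem is itself composed of other subsystems, and so on, down to the smallest meaningful elements (e.g., modules or files).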

Service: The expected service of a software system is the sequence of outputs that agrees with the initial specification from which the software implementation was derived, or with what the system users perceive the correct values to be.

Now, let us consider a scenario in which a software system named "Program" delivers its expected service to an environment or a user named "user."

Failure: A failure occurs when the user perceives that the program ceases to deliver the expected service. There are various levels of severity for this failure based on the impact it has on the system's services. These levels are usually categorized as: catastrophic, major, and minor, depending on their effects on the system's operation. The definition of these severity levels varies from system to system.

An outage is a special case of failure, which is defined as a period of time during which the service to a customer is lost or degraded (called "outage duration"). Generally speaking, outages can stem from hardware or software malfunctions, human error, and environmental variables including lightning, power failures, fires, and so on. A failure resulting in the total loss of system functionality is referred to as a "system outage."

In the telecommunications industry, a system outage for telephone switching systems is defined as an outage longer than 3 seconds for failures that result in the loss of stable calls, or longer than 30 seconds for failures that do not result in the loss of stable calls [Bell90a].
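The two thresholds in this definition amount to a simple decision rule. The following is a minimal sketch of that rule only; the function name, parameter names, and constants are hypothetical and are not taken from [Bell90a].

```python
# Sketch of the switching-system outage rule quoted above:
# more than 3 s if stable calls are lost, more than 30 s otherwise.

STABLE_CALL_LOSS_THRESHOLD_S = 3.0      # failures that drop stable calls
NO_STABLE_CALL_LOSS_THRESHOLD_S = 30.0  # failures that do not


def is_system_outage(duration_seconds: float, stable_calls_lost: bool) -> bool:
    """Return True if a service interruption counts as a system outage."""
    threshold = (
        STABLE_CALL_LOSS_THRESHOLD_S
        if stable_calls_lost
        else NO_STABLE_CALL_LOSS_THRESHOLD_S
    )
    # The definition says "longer than", so the comparison is strict.
    return duration_seconds > threshold


print(is_system_outage(5.0, stable_calls_lost=True))
print(is_system_outage(5.0, stable_calls_lost=False))
```

A 5-second interruption is therefore a system outage only if stable calls were lost, which shows why the severity of a failure has to be recorded along with its duration.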

Faults. A fault is uncovered when either a failure of the program occurs or an internal error, such as an incorrect state, is detected within the program. The cause of the failure or internal error is referred to as a fault. It is also known as a "bug," and in most cases it can be identified and removed. However, in some cases it remains a hypothesis that cannot be adequately verified (e.g., timing faults in distributed systems). In summary, a software failure is an incorrect result with respect to the specification or an unexpected software behavior perceived by the user at the boundary of the software system, while a software fault is the identified or hypothesized cause of the software failure.

Defects. When the distinction between "fault" and "failure" is not critical, defects can be used as a generic term to refer to either a fault (cause) or a failure (effect). Chapter 9 provides a complete and practical classification of software defects from various perspectives.

Errors. The term "error" has two different meanings:

(1) A discrepancy between a computed, observed, or measured value or condition and the true, specified, or theoretically correct value or condition. Errors occur when some part of the computer software produces an undesired state. Examples include exceptional conditions raised by the activation of existing software faults, and incorrect computer status due to an unexpected external interference. This term is especially useful in fault-tolerant computing to describe an intermediate stage in between faults and failures.

(2) A human action that results in the inclusion of a fault in software. Examples include omission or misinterpretation of user requirements in a software specification, and incorrect translation or omission of a requirement in the design specification. However, this is not a preferred usage, and the term "mistake" is used instead to avoid confusion.

Time is a fundamental aspect of our world that we all experience. The concept of time is defined with respect to time, which means that it can be measured using different bases like program runs or people's experiences. In this article, we will discuss three types of time: execution time, calendar time, and clock time.

Firstly, the execution time is the CPU time that is actually spent by the computer in executing the software. This can include various factors such as the time taken for the software to load and start executing, the time taken to execute the software commands, and the time required to perform any error checking or validation.

Secondly, the calendar time is the time that people normally experience in terms of years, months, weeks, days, etc. This includes both the time spent on the task itself (e.g., writing code) and any waiting times before starting or finishing the task.

Finally, the clock time is the elapsed time from start to end of computer execution in running the software. This can include various factors such as the time taken to launch the software, the time taken to run the software commands, and the time taken for any interruptions or other events that occurred during the execution of the software.

In measuring clock time, periods during which the computer is shut down are not counted, because no software execution takes place during them.
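The difference between execution time and clock time can be seen in a small measurement. The following is a minimal sketch, assuming Python's standard `time` module is an acceptable stand-in for the measurement facilities of a real test environment; `measure` and `sample_workload` are hypothetical names introduced only for illustration.

```python
import time

def measure(workload):
    """Return (execution_time, clock_time) in seconds for one call of workload."""
    cpu_start = time.process_time()    # CPU time consumed by this process
    wall_start = time.perf_counter()   # elapsed time from start to end of the run
    workload()
    execution_time = time.process_time() - cpu_start
    clock_time = time.perf_counter() - wall_start
    return execution_time, clock_time

def sample_workload():
    # Hypothetical workload: some computation followed by idle waiting.
    sum(i * i for i in range(200_000))
    time.sleep(0.05)  # waiting adds to clock time but uses almost no CPU time

execution_time, clock_time = measure(sample_workload)
print(f"execution time: {execution_time:.3f} s, clock time: {clock_time:.3f} s")
```

On a typical run the clock time exceeds the execution time by roughly the idle period, which is why the two scales must not be mixed when failure data are recorded.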

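Execution time is generally regarded as a more appropriate choice than calendar time for software reliability measurement and modeling. However, reliability measures must ultimately be related back to calendar time so that they are easy for people to interpret, particularly when managers, engineers, and users want to compare different systems. Conversion between time scales is therefore needed. A description of such conversion techniques can be found in [Musa87a]. If execution time is not readily available, approximations such as clock time, weighted clock time, staff working time, or units natural to the application, such as transactions or test cases executed, may be used.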

Failure functions are fundamental to the analysis of failures over time. They can be expressed in several ways, including the cumulative failure function, the failure intensity function, the failure rate function, and the mean time to failure function. The cumulative failure function (also known as the mean value function) denotes the average cumulative failures associated with each point of time, that is, the expected total number of failures accumulated up to that time. The failure intensity function represents the rate of change of the cumulative failure function. The failure rate function (or hazard rate function) denotes the instantaneous failure rate at a given time t, given that the system has not failed up to that time.

The mean time to failure (MTTF) function represents the expected time at which the next failure will be observed. MTTF is also referred to as MTBF, mean time between failures.

These measures are closely related and can be translated into one another. Appenzeller provides the mathematics of these functions in detail.
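As a compact reminder of how these measures translate into one another, the standard relationships can be written as follows. The symbols are assumptions introduced here for illustration: \(\mu(t)\) is the cumulative failure (mean value) function, \(\lambda(t)\) the failure intensity function, \(z(t)\) the failure rate (hazard rate) function, \(f(t)\) and \(F(t)\) the density and distribution functions of the failure time, and \(R(t)\) the reliability function.

\[ \lambda(t) = \frac{d\mu(t)}{dt}, \qquad R(t) = 1 - F(t), \qquad z(t) = \frac{f(t)}{R(t)} = -\frac{d}{dt}\ln R(t) \]

\[ R(t) = \exp\left(-\int_0^t z(x)\,dx\right), \qquad \text{MTTF} = \int_0^\infty R(t)\,dt \]

For a constant failure rate \(z(t) = \lambda\), these reduce to \(R(t) = e^{-\lambda t}\) and \(\text{MTTF} = 1/\lambda\).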

Mean Time to Repair and Availability. Mean time to repair (MTTR) represents the expected time until a system is repaired after a failure is observed. Combined with the mean time to failure (MTTF), it determines the availability of a system. Availability is the probability that a system is available when needed, and it is typically measured by:

\[ \text{Availability} = \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}} \]

Chapter 2, Section 2.4.4, provides a theoretical model for calculating availability, whereas Chapter 11, Section 11.8, offers practical examples of this measure.
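The availability measure is usually computed as the ratio of MTTF to the sum of MTTF and MTTR. A minimal sketch of that arithmetic follows; the function name `availability` and the sample figures are hypothetical and are not taken from the examples in Chapters 2 or 11.

```python
def availability(mttf, mttr):
    """Steady-state availability: the fraction of time the system is up.

    mttf and mttr must be expressed in the same time unit (e.g., hours).
    """
    if mttf < 0 or mttr < 0 or mttf + mttr == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf / (mttf + mttr)

# Hypothetical system: fails on average every 100 hours, repaired in 1 hour.
example = availability(mttf=100.0, mttr=1.0)
print(f"availability = {example:.4f}")  # 100 / 101, printed as 0.9901
```

Note that availability improves either by lengthening the MTTF (fewer failures) or by shortening the MTTR (faster recovery).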

Operational Profile. An operational profile is defined as a set of operations that the software can execute along with their probabilities of occurring. An operation is a group of runs which typically involves similar processing. A sample operational profile is illustrated in Figure 1.3.

It is important to note that without loss of generality, the operations can be located on the x-axis in order of the probabilities of their occurrence. This allows for the calculation of average running time and other metrics related to the system's performance.
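An operational profile can be represented directly as a mapping from operations to occurrence probabilities. The sketch below is illustrative only: the operation names, the probabilities, and the helper functions are all hypothetical, and they are not the profile shown in Figure 1.3.

```python
import random

# Hypothetical operations and their probabilities of occurrence.
operational_profile = {
    "query_balance": 0.55,
    "transfer_funds": 0.25,
    "update_profile": 0.15,
    "close_account": 0.05,
}

def validate_profile(profile, tol=1e-9):
    """Check that the probabilities are non-negative and sum to 1."""
    if any(p < 0 for p in profile.values()):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(profile.values()) - 1.0) > tol:
        raise ValueError("probabilities must sum to 1")

def ordered_by_probability(profile):
    """Operations in descending order of probability (the x-axis ordering)."""
    return sorted(profile.items(), key=lambda item: item[1], reverse=True)

def select_operations(profile, n, seed=None):
    """Draw n operations at random according to the profile."""
    rng = random.Random(seed)
    names = list(profile)
    weights = [profile[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

validate_profile(operational_profile)
for name, probability in ordered_by_probability(operational_profile):
    print(f"{name:15s} {probability:.2f}")
test_runs = select_operations(operational_profile, n=10, seed=1)
```

Drawing test runs in proportion to the profile is one way to make testing exercise the software roughly as users would.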

The operational profile is a critical component of software reliability and performance analysis. It describes the possible software operations and their probabilities of occurrence, and therefore indicates how each operation affects system stability and functionality.

The structure of an operational profile is complex, involving various stages and processes, such as input state or system state partitioning, operation identification and classification, probabilistic calculation, and reliability estimate generation. These components are interrelated and interact with each other to determine the overall system behavior under different operational conditions.

One approach to determining the number of possible software operations is through grouping or partitioning. This involves breaking down the input and/or system states into domains, where each domain represents a subset of states that are more likely to result in specific software actions. By focusing on these domains, we can better understand the dynamics of the system and identify areas where potential issues may arise.

When an operational profile is not available or cannot be obtained through direct testing, we can use code coverage data generated during reliability growth testing to estimate system reliability. This data captures the execution paths of the program code, providing insights into the likelihood and severity of software failures. By analyzing the code coverage data, we can infer the probability of occurrence for each possible operation and make decisions based on these estimates.

Overall, the operational profile is essential for understanding and managing software system reliability. It provides valuable information about the likelihood of different software operations occurring and their impact on the overall system performance. By leveraging reliable and accurate data, we can optimize software design, develop effective fault tolerance mechanisms, and enhance system reliability and performance.

The collection of failure data is crucial for assessing the reliability of software systems. Two types of failure data can be collected: failure-count (or failures per time period) data and time-between-failures data. Failure-count data track the number of failures detected per unit of time. Typical failure-count data are shown in Table 1.1.

Time-between-failures Data, on the other hand, tracks the time interval between failure occurrences. This type of data provides insights into how frequently a system fails over a specified period, helping to identify potential causes for recurrent failures.

To effectively collect this data, it is essential to define clear metrics and protocols for failure detection and reporting. This includes identifying appropriate triggers or conditions for failure events, defining criteria for categorizing failures into distinct categories (e.g., transient, transient with warning, permanent), and developing methods for recording and storing these data efficiently.

Furthermore, ensuring the validity and accuracy of the collected data requires regular monitoring and verification processes. This could involve using automated tools to detect and flag anomalies or errors in the data collection process, as well as conducting manual reviews of recorded failures to ensure consistency and completeness.

In summary, collecting failure data through Failure-Count Data and Time-between-failures Data is crucial for assessing the reliability of software systems. By defining clear metrics, establishing protocols for data collection and storage, and implementing regular monitoring and verification processes, organizations can gain valuable insights into system performance and identify areas for improvement.

Table 1.1: Failure-Count Data

| Time (hours) | Failures in the Period | Cumulative Failures |
|--------------|------------------------|---------------------|
| 8 | 4 | 4 |
| 16 | 4 | 8 |
| 24 | 3 | 11 |
| 32 | 5 | 16 |
| 40 | 3 | 19 |
| 48 | 2 | 21 |
| 56 | 1 | 22 |
| 64 | 1 | 23 |
| 72 | 1 | 24 |

Time-Between-Failures (or Inter-Failure Times) Data. This type of data tracks the intervals between consecutive failures. Typical time-between-failures data can be seen in Table 1.2.

Table 1.2: Time-between-failures Data

| Failure Number | Failure Interval (hours) | Failure Time (hours) |
|----------------|--------------------------|----------------------|
| 1 | 0.5 | 0.5 |
| 2 | 1.2 | 1.7 |
| 3 | 2.8 | 4.5 |
| 4 | 2.7 | 7.2 |
| 5 | 2.8 | 10.0 |
| 6 | 3.0 | 13.0 |
| 7 | 1.8 | 14.8 |
| 8 | 0.9 | 15.7 |
| 9 | 1.4 | 17.1 |
| 10 | 3.5 | 20.6 |
| 11 | 3.4 | 24.0 |
| 12 | 1.2 | 25.2 |
| 13 | 0.9 | 26.1 |
| 14 | 1.7 | 27.8 |
| 15 | 1.4 | 29.2 |
| 16 | 2.7 | 31.9 |
| 17 | 3.2 | 35.1 |
| 18 | 2.5 | 37.6 |
| 19 | 2.0 | 39.6 |
| 20 | 4.5 | 44.1 |
| 21 | 3.5 | 47.6 |
| 22 | 5.2 | 52.8 |
| 23 | 7.2 | 60.0 |
| 24 | 10.7 | 70.7 |

Many reliability modeling programs can estimate model parameters from either failure-count data or time-between-failures data, since statistical modeling techniques can be applied to both. If a program accepts only one type of failure data, the other type may need to be transformed.

Transformations Between Data Types. If the expected input is failure-count data, it may be obtained by transforming time-between-failures data to cumulative failure times and then simply counting the number of failures whose cumulative times occur within a specified time period. If the expected input is time-between-failures data, the failure-count data can be converted by either randomly or uniformly allocating the failures within the specified time intervals, and then calculating the time periods between adjacent failures. Some software reliability tools surveyed in Appendix A (e.g., "SMERFS" and "CASRE") incorporate the capability to do these data transformations.
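Both transformations can be sketched in a few lines. The example below is a minimal illustration and not the procedure used by SMERFS or CASRE; the function names are hypothetical. It uses the failure intervals of Table 1.2, assumes that a failure falling exactly on a period boundary belongs to the period that ends there, and for the reverse direction assumes a uniform allocation that places failures at the midpoints of equal sub-intervals.

```python
import math
from itertools import accumulate

# Failure intervals (hours) from Table 1.2.
INTERVALS = [0.5, 1.2, 2.8, 2.7, 2.8, 3.0, 1.8, 0.9, 1.4, 3.5, 3.4, 1.2,
             0.9, 1.7, 1.4, 2.7, 3.2, 2.5, 2.0, 4.5, 3.5, 5.2, 7.2, 10.7]

def intervals_to_counts(intervals, period, decimals=1):
    """Count the failures whose cumulative time falls in each period (start, end]."""
    # Rounding removes floating-point drift, so a failure at exactly
    # 24.0 hours is not pushed into the following period.
    times = [round(t, decimals) for t in accumulate(intervals)]
    counts = [0] * math.ceil(times[-1] / period)
    for t in times:
        counts[max(math.ceil(t / period) - 1, 0)] += 1
    return counts

def counts_to_intervals(counts, period):
    """Allocate each period's failures uniformly, then difference adjacent times."""
    times = []
    for k, n in enumerate(counts):
        start = k * period
        # Assumption: n failures sit at the midpoints of n equal sub-intervals.
        times.extend(start + (i + 0.5) * period / n for i in range(n))
    return [t - prev for prev, t in zip([0.0] + times, times)]

counts = intervals_to_counts(INTERVALS, period=8)
print(counts)  # [4, 4, 3, 5, 3, 2, 1, 1, 1]
approx_intervals = counts_to_intervals(counts, period=8)
```

With 8-hour periods the counts reproduce the "failures in the period" values of Table 1.1. The reverse transformation recovers only approximate intervals, because the exact failure times within each period are lost when failures are counted.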

Software reliability measurement includes two types of activities: reliability estimation and reliability prediction.

Estimation. This activity determines current software reliability by applying statistical inference techniques to failure data obtained during system testing or during system operation. This is a measure of the achieved reliability from the past until the present point. The main purpose of this activity is to assess the current reliability and to determine, in retrospect, whether a reliability model is a good fit.

Prediction. This activity determines future software reliability based on available software metrics and measures. Depending on the software development stage, prediction involves different techniques:

(1) When failure data are available (e.g., software is in system testing or operation stage), the estimation techniques can be used to parameterize and verify software reliability models that can perform future reliability prediction.

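(2) When failure data are not available (e.g., software is in the design or coding stage), the metrics obtained from the software development process and the characteristics of the resulting product can be used to predict the reliability of the software upon testing or delivery. The first definition is also referred to as "reliability prediction" and the second as "early prediction." When there is no ambiguity in the text, only the word "prediction" is used.

Most current software reliability models fall into the estimation category and are used in that way to perform reliability prediction. Nevertheless, a few early prediction models have been proposed and described in the literature. A survey of existing estimation models and some early prediction models can be found in Chapter 3. Chapter 12 provides some product complexity metrics that can be used for early prediction.

A software reliability model specifies the general form of the dependence of the failure process on the principal factors that affect it: fault introduction, fault removal, and the operational environment. Figure 1.4 shows the basic ideas of software reliability modeling (figure not shown here).

Figure 1.4: Basic ideas of software reliability modeling

In Figure 1.4, the failure rate of a software system generally decreases as software faults are discovered and removed. At any particular time (for example, the point marked "present time"), the history of the failure rate of the software can be observed. Software reliability modeling forecasts the failure rate curve from statistical evidence. The purpose of this measure is twofold: (1) to predict the extra testing time needed to achieve a specified objective, and (2) to predict the expected reliability of the software when testing is finished.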
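The two purposes of modeling can be illustrated with a deliberately simple curve fit. The sketch below is not one of the reliability models surveyed in Chapter 3: it assumes, purely for illustration, that failure intensity decays exponentially with time, fits that curve to the failure counts of Table 1.1 by least squares on the logarithm of the observed intensity, and then extrapolates. The function names and the target of 0.05 failures per hour are hypothetical.

```python
import math

# Failure-count data from Table 1.1: failures observed in each 8-hour period.
PERIOD = 8.0
FAILURES = [4, 4, 3, 5, 3, 2, 1, 1, 1]

def fit_exponential_intensity(failures, period):
    """Least-squares fit of ln(intensity) = ln(lam0) - theta * t.

    Assumes every period has at least one failure (log of zero is undefined).
    """
    ts = [(k + 0.5) * period for k in range(len(failures))]  # period midpoints
    ys = [math.log(n / period) for n in failures]            # log of failures/hour
    t_mean = sum(ts) / len(ts)
    y_mean = sum(ys) / len(ys)
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
             / sum((t - t_mean) ** 2 for t in ts))
    lam0 = math.exp(y_mean - slope * t_mean)
    return lam0, -slope

def intensity(t, lam0, theta):
    """Fitted failure intensity (failures per hour) at time t."""
    return lam0 * math.exp(-theta * t)

def time_to_reach(target, lam0, theta):
    """Time at which the fitted intensity drops to the target value."""
    return math.log(lam0 / target) / theta

lam0, theta = fit_exponential_intensity(FAILURES, PERIOD)
present = PERIOD * len(FAILURES)                        # 72 hours of testing so far
extra = time_to_reach(0.05, lam0, theta) - present      # purpose (1)
at_release = intensity(present + extra, lam0, theta)    # purpose (2)
print(f"extra test time: {extra:.1f} h, intensity then: {at_release:.3f} per hour")
```

The fit over the observed history corresponds to estimation, and the extrapolation beyond the present time corresponds to prediction. A real analysis would select and validate a model rather than assume a curve shape.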

Software reliability, like that of hardware, is a stochastic process. It involves the likelihood of software failures occurring, which is represented by probability distribution functions. However, there are significant differences between software and hardware reliability.

Unlike hardware, software does not wear out, burn out, or deteriorate over time, meaning its reliability does not decrease with time for these reasons. Moreover, software generally enjoys reliability growth during testing and operation, because software faults can be detected and removed when failures occur.

On the other hand, software may face a decline in reliability due to sudden changes in its operational usage or incorrect modifications. Additionally, software undergoes continual modifications throughout its life cycle, which complicates the calculation of reliability. Variable failure rates necessitate considering these factors when assessing software reliability and designing reliable systems.

Unlike hardware faults, which are mostly physical in nature, software faults are design faults, which are harder to visualize, classify, detect, and correct. As a result, software reliability is much more difficult to measure and analyze than hardware reliability.

Traditional reliability theory relied on the analysis of stationary processes because it considered only physical faults. With the increasing complexity of systems and the introduction of design faults in software, that theory becomes unsuitable for non-stationary phenomena such as reliability growth or reliability decrease. This makes software reliability a challenging problem that requires several methods of attack.

One technical area related to software reliability is the development of new models that better capture non-stationary behaviors. Another area is the use of advanced testing techniques that are capable of identifying both physical and design defects. Additionally, there has been increased focus on developing algorithms that can automate the process of fault detection and rectification. Finally, research has also focused on developing tools that can help system designers understand the impact of their design choices on reliability.

Achieving highly reliable software from the customer's perspective is a demanding job for all software engineers and reliability engineers. Adopting a notation similar to that of [LaPr85a, Avi86a] for system dependability, four technical methods are applicable for you to achieve reliable software systems:

(1) Fault avoidance: To prevent, by construction, fault occurrences;

(2) Fault removal: To detect, by verification and validation, the existence of faults and eliminate them;

(3) Fault tolerance: To provide, by redundancy, service complying with the specification in spite of faults having occurred or occurring;

(4) Fault/failure forecasting: To estimate, by evaluation, the presence of faults and the occurrence and consequences of failures. This has been the main focus of software reliability modeling.

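These technical areas are discussed in the following sections. You can also refer to Chapter 2 (Section 2.2) for a complete list of concepts related to reliability and dependability.

1.5.1 Fault Avoidance

Interactive refinement of user requirements for the software system, engineering of the software specification process, use of good software design methods, enforcement of structured programming discipline, and encouragement of clear code are general approaches to preventing faults in software. These guidelines have been, and will continue to be, the fundamental techniques for preventing software faults from being created.

Recently, formal methods have been attempted in the software development community to address the software quality problem. In these approaches, requirement specifications are developed and maintained using mathematically tractable languages and tools. Current studies in this area focus on language issues and environmental support, and include at least the following goals: (1) executable specifications for systematic and precise evaluation, (2) mechanisms for software verification and validation, (3) development procedures that follow incremental refinement for step-by-step verification, and (4) mathematical verification of every work item, whether a specification or a test case, for correctness and appropriateness.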

Another technique for fault avoidance in software development is software reuse. Success in this area depends largely on the ability to build prototypes and to evaluate reusable synthesis techniques. This is why object-oriented paradigms and techniques are receiving much attention nowadays, largely because of their inherent properties that promote software reuse.

Fault removal, on the other hand, involves finding and correcting errors in a software application. While this process may involve various methods such as debugging, testing, and error analysis, it is crucial to ensure that errors do not occur in the first place by implementing proper coding practices and testing protocols. In addition, developers must be vigilant about identifying new potential sources of bugs or errors that could arise from changes to the codebase or system. By continuously reviewing and updating the code base, developers can minimize the occurrence of errors and enhance the overall stability and reliability of the software application.

In summary, while software reuse is an effective technique for fault avoidance, it is equally critical to implement fault removal measures to ensure reliable and stable software applications. By combining these approaches, developers can effectively tackle software development challenges and improve the quality and effectiveness of their work.

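When formal methods are in full swing, formal design proofs may help achieve mathematical proof of correctness for programs. Likewise, fault-monitoring assertions could be employed through executable specifications, and test cases could be generated automatically to achieve efficient software verification. Before this happens, however, practitioners will have to rely mostly on software testing techniques to remove existing faults. Microsoft, for example, assigns each software developer as many test cases as possible and employs a "buddy" system that binds each software developer to a tester for their daily work [Cusu95a]. A key issue for reliability engineers, then, is how to derive testing quality measures (e.g., test coverage factors) and establish their relationship to reliability.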

Another practical fault removal plan that is widely practiced in industries is formal inspection, which involves a rigorous process aimed at identifying, correcting, and verifying errors. This thorough approach to problem-solving is typically carried out by groups of peers with vested interests in the quality of work products during the pretest phases of the life cycle. The effectiveness of this method has been claimed by numerous companies, demonstrating its success in enhancing reliability and productivity.

Fault tolerance is the survival attribute of computing systems or software, which enables them to deliver continuous service to their users in the presence of faults. It involves all the techniques necessary to enable a system to tolerate software faults remaining in the system after its development. These software faults may or may not manifest themselves during system operations. However, when they do, the software fault tolerance techniques should provide the necessary mechanisms to the software system to prevent system failure from occurring.

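In a single-version software environment, the techniques for partially tolerating software design faults include monitoring techniques, atomicity of actions, decision verification, and exception handling. To recover fully from activated design faults, multiple versions of software developed via design diversity are introduced, in which functionally equivalent yet independently developed software versions are applied in the system to provide the ultimate tolerance of software design faults. The main approaches include the recovery blocks technique, the N-version programming technique, and the N self-checking programming technique. These approaches have found wide application in the aerospace industry, the nuclear power industry, the health care industry, the telecommunications industry, and the ground transportation industry.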
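The control flow of two of these approaches can be sketched briefly. The code below is a simplified illustration, not a production design: `recovery_block` and `n_version_vote` are hypothetical helper names, the integer square root versions are toy examples with one deliberately seeded fault, and the state checkpointing and rollback that a real recovery block performs before trying an alternate is omitted.

```python
from collections import Counter

def recovery_block(alternates, acceptance_test, *args):
    """Try each alternate in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue  # an alternate that raises is treated as having failed
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def n_version_vote(versions, *args):
    """Run every version and return the result agreed on by a strict majority.

    Results must be hashable so that they can be counted.
    """
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a version that crashes casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes > len(versions) // 2:
            return value
    raise RuntimeError("no majority among versions")

# Hypothetical, independently written versions of an integer square root.
def isqrt_a(n):
    return int(n ** 0.5)

def isqrt_b(n):
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r

def isqrt_faulty(n):
    return n // 2  # seeded design fault

def isqrt_acceptable(result, n):
    return result * result <= n < (result + 1) * (result + 1)

print(n_version_vote([isqrt_a, isqrt_b, isqrt_faulty], 10))               # 3
print(recovery_block([isqrt_faulty, isqrt_b], isqrt_acceptable, 10))      # 3
```

In both cases the faulty version is masked only because the other versions do not share its fault, which is the reason design diversity calls for independently developed versions.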

This handbook provides an in-depth exploration of the field of fault/failure forecasting. It encompasses the formulation of the fault/failure relationship, understanding the operational environment, establishment of reliability models, collection of failure data, application of reliability models through tools, selection of appropriate models, analysis and interpretation of results, and guidance for management decisions. The concepts and techniques outlined in [Musa87a] have provided an excellent foundation for this area. Other reference texts include [Xie91a], [Neuf93a]. Beyond these resources, the July 1992 issue of IEEE Software (which also features articles on software design), November 1993 issue of IEEE Transactions on Software Engineering, and December 1994 issue of IEEE Transactions on Reliability are dedicated to this aspect of SRE. This comprehensive treatise aims to provide a solid grounding for all interested parties involved in the realm of fault/failure forecasting.

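Because of the intrinsic complexity of modern software systems, software reliability engineers must apply a combination of the above methods to deliver reliable software systems. These four areas are also the main themes of the current state of the art across software engineering disciplines. In addition to focusing on the fault/failure forecasting area, this book attempts to address the other three technical areas as well. However, instead of including all possible software engineering techniques, this book examines and emphasizes mature as well as emerging techniques that can be quantitatively related to software reliability.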

The book is divided into various chapters, each of which focuses on a particular topic related to software fault prediction and management. Chapters 1 to 5 lay the foundation for understanding the technical concepts that are essential in predicting faults and failures in software systems. These chapters provide a detailed explanation of the underlying principles and algorithms that help in identifying potential vulnerabilities and risks in software systems.

Chapters 6 through 11 delve deeper into practical project experiences. They offer insights into how various techniques have been employed successfully in real-world scenarios, highlighting the challenges and solutions encountered in these projects. This section is particularly valuable for developers who seek guidance on implementing fault prediction and removal strategies in their work.

Furthermore, Chapters 16 and 17 introduce two emerging technologies that are rapidly gaining popularity in the field of software reliability engineering. These topics explore innovative approaches and methodologies that can be used to enhance system resilience and reduce the likelihood of fault occurrences.

In addition to technical aspects, Chapters 9 and 12 focus on the importance of preventing faults before they occur. These chapters provide practical advice and best practices that can be used to identify and mitigate potential issues before they become critical problems.

Finally, Chapters 14 and 15 delve into the realm of fault tolerance techniques. These chapters cover the theoretical aspects as well as the practical applications of various models that help in ensuring that software systems can continue to function even in the face of unexpected failures or disruptions.

For readers looking for a more comprehensive treatment of software fault tolerance, it is recommended to refer to the book volume [Lyu95a]. This reference book provides a detailed overview of the various techniques and models used in software fault tolerance, including their advantages and disadvantages, and how they can be implemented effectively.

The handbook covers a wide range of topics, summarized in Table 1.4. This section provides guidelines for using the book, tailored to various interests, including the four technical areas we've discussed and some special topics that readers may want to delve deeper into. For example, if you are interested in software reliability modeling theory (Topic 1), you're encouraged to read Chapters 1, 2, 3, 4, 9, 10, 12, 14, and 16. Notably, Topics 1 and 2 in Table 1.5 are mutually exclusive, as are Topics 3 and 4, Topics 5 and 6. It's important to understand that the classification of the book's chapters into various topics is merely for your reading convenience and can be subjective.

[Table: reading guide. Columns are Chapters 1 through 17, grouped under Technical Foundations, Practices and Experiences, and Emerging Techniques. Rows are Areas 1 to 4 and Topics 1 to 8, as named in the note below. An X marks each chapter that covers the area or topic; the positions of the individual X marks are not recoverable here.]

Note: Area 1 - Fault Avoidance, Area 2 - Fault Removal, Area 3 - Fault Tolerance, Area 4 - Fault/Failure Forecasting, Topic 1 - Modeling Theory, Topic 2 - Modeling Experience, Topic 3 - Metrics, Topic 4 - Measurement, Topic 5 - Process Issues, Topic 6 - Product Issues, Topic 7 - Reliability Data, Topic 8 - Analysis Techniques

1.6 Summary

The growing trend of software criticality and the unbearable consequences of software failures force us to plead urgently for software reliability engineering. This book codifies our knowledge in SRE and puts together a comprehensive and organized repository for our daily practice in software reliability. The structure of the book and key contents of each chapter are described. The definitions of major terms in SRE are provided, and fundamental concepts in software reliability modeling and measurement are discussed. Finally, related technical areas in software engineering and a reading guide are provided for your convenience.

Problems

(1) Some hardware faults are not physical faults and have similar nature to software faults. What are they?

(2) What are the main differences between software failures and hardware failures?

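(3) Give examples of software faults and software failures.

(4) Some people argue that the modeling technique for software reliability is similar to that for hardware reliability, while others disagree. List the commonalities and differences between them.

(5) Give some examples of definitions of failure severity, one qualitative and the other quantitative.

(6) What is the mapping between faults and failures? Is it one-to-one (one fault leading to one failure), one-to-many, many-to-one, or many-to-many? Discuss the mapping under different conditions, and discuss which mapping is most desirable, why, and how it can be achieved.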

(7) "Ultra-reliability" has been employed to denote highly reliable systems. This could be expressed, for instance, as R(10 hours) = 0.9999999. This implies that the probability of the system failing during a 10-hour operation is 10^-7, that is, 1 in 10 million (0.00001 percent). Some have proposed making it a reliability requirement for software. Discuss the implications of such a reliability requirement and its practicality.

(8) What are the difficulties and issues involved in the data collection of failure-count data and time-between-failures data?

(9) Regarding the failure data collection process, consider the following situations:

(a) How do you adjust the failure times for an evolving program, i.e., a software program which changes over time through various releases?

(b) How do you handle multiple sites or versions of the software?

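(10) Show how to convert the time-between-failures data in Table 1.2 to the failure-count data in Table 1.1. Then, assuming that the failures are randomly distributed, convert the data in Table 1.1 to time-between-failures data. Compare your results with Table 1.2.

(11) For the data in Table 1.1 and Table 1.2:

(a) Calculate the failure intensity at the end of each time period (for Table 1.1) or each failure interval (for Table 1.2).

(b) Plot the failure intensity against the time axis.

(c) Try to fit a curve to the plots manually.

(d) What are your estimates of the failure rate in the next time period after observing the data in Table 1.1, and of the time to the next failure after observing the data in Table 1.2?

(e) What is the relationship between these two estimates? Verify this relationship.

(12) Compare the MTTR (mean time to repair) of hardware and software and analyze the differences.

(13) Refer to Example 1.2 and Figure 1.2:

(a) What is the failure rate of each component in Figure 1.2? What is the reliability function of each component?

(b) What is assumed in calculating the MTTF of SYS2 in the triplicated configuration? If the reliability function of SYS2 is R928(t), what is the reliability function of SYS2 in the triplicated configuration? How is its MTTF calculated? How is its MTTR calculated?

(c) How is the MTTF of the whole system calculated? Verify that the system MTTF is 125.9 hours when software failures are not considered, and 11.9 minutes when software failures are considered.

(d) How is the system MTTR calculated? Verify that it is 0.62 hour.

(e) Does the triplication of SYS2 software help in improving its software MTTF? Why? If not, what techniques could be employed to improve the software MTTF?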

The answer to the first question is that the triplication of SYS2 software does not necessarily improve its software MTTF. Identical copies of the same program contain the same design faults, so an input that activates a fault in one copy is likely to cause all three copies to fail, while the added redundancy increases the overall complexity and cost of the system. A more effective approach is to improve the software itself through fault avoidance and fault removal, or to apply design diversity (e.g., recovery blocks or N-version programming) so that the redundant versions do not share the same faults.

(14) (a) What is the difference between reliability estimation and reliability prediction? Draw a figure to show their differences.

Reliability estimation and reliability prediction are different activities in software reliability measurement. Estimation assesses the reliability achieved from the past up to the present point by applying statistical inference to observed failure data, while prediction determines future reliability from available software metrics and measures. The differences are summarized in the following table:

| Aspect | Reliability estimation | Reliability prediction |
| -------- | -------------------- | --------------------- |
| Question answered | What reliability has been achieved up to the present? | What reliability can be expected in the future? |
| Basis | Failure data obtained during system testing or operation | Failure data when available; otherwise metrics from the development process and the resulting product |

(15) Section 1.3.2 lists several criteria to evaluate software reliability models. Can you think of a good way to quantify each criterion?

To quantify each criterion in the evaluation of software reliability models, a scoring system can be developed. The criteria can be weighted according to their importance and then scored out of 10 points based on how well they meet the requirements of the model. For example, one criterion for evaluating software reliability models could be “accuracy” where accuracy is scored based on how well the model predicts system failures. Other criteria could include “cost-effectiveness,” “robustness,” and “ease of use.”