Decision-Theoretic Planning for Anticipating and Troubleshooting Faults

Paul O'Rorke

Problems of anticipating and troubleshooting faults arise during the design and manufacture of complex devices and systems, in preparing to deploy them, and in operations after deployment. Prior to deployment, it is often useful to anticipate faults in the components of a system, to determine their consequences, to assess the associated risks, and to propose and decide upon actions that reduce those risks. This collection of problems is called "Failure Modes and Effects Analysis" (FMEA). When malfunctions occur after deployment, it is important to assess the situation and to recommend actions that will help determine the causes and minimize negative impacts. This is called "Failure Detection, Isolation, and Recovery" (FDIR). Decision-theoretic concepts such as likelihood, value, and expected utility are frequently useful in anticipating and troubleshooting faults. These concepts break down barriers and unify topics that previous work has treated as separate areas. For example, FMEA and FDIR have much in common, although they have been pursued as independent topics in previous research and development. In FMEA, anticipating potential risks involves assessing the likelihoods of faults and the costs associated with their effects. Recommendations of actions to reduce the risks ought to take into account the costs of the actions and the likelihood that they will succeed. In FDIR, prior probabilities of faults are useful in determining the most likely explanations of abnormal behavior, and the main goal is to generate plans for gathering more information about the fault and for fixing or working around it. The costs of information-gathering probes and tests ought to be taken into account, and the value of the information they provide ought to be weighed in terms of savings in repair or recovery costs. So, even within FDIR, problems such as diagnosis and repair planning that have been viewed as separable have much in common.
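
The trade-off between the cost of a test and the savings it yields in repair can be illustrated with a small calculation. The sketch below is not taken from the paper; it is a minimal illustration under stated assumptions: exactly one component is faulty, replacing the faulty component fixes the system, and a hypothetical perfect test identifies the faulty component with certainty. The component names, probabilities, costs, and function names are all invented for the example.

```python
def expected_cost_without_test(priors, repair_costs):
    """Expected cost of replacing components one at a time until the fault is fixed.

    Assumes a single fault. Components are tried in order of prior
    probability per unit cost (an assumed ordering rule for this sketch).
    """
    order = sorted(priors, key=lambda c: priors[c] / repair_costs[c], reverse=True)
    expected = 0.0
    p_still_broken = 1.0  # probability the fault has not been fixed yet
    for component in order:
        # We pay for this replacement only if earlier ones did not fix the fault.
        expected += p_still_broken * repair_costs[component]
        p_still_broken -= priors[component]
    return expected


def expected_cost_with_perfect_test(priors, repair_costs):
    """Expected repair cost when a perfect test reveals the faulty component first."""
    return sum(priors[c] * repair_costs[c] for c in priors)


def value_of_information(priors, repair_costs):
    """Expected savings in repair cost from running the perfect test."""
    return (expected_cost_without_test(priors, repair_costs)
            - expected_cost_with_perfect_test(priors, repair_costs))


# Hypothetical numbers, chosen only for illustration.
priors = {"pump": 0.5, "valve": 0.3, "sensor": 0.2}
repair_costs = {"pump": 100.0, "valve": 40.0, "sensor": 10.0}

voi = value_of_information(priors, repair_costs)
test_cost = 15.0  # assumed cost of running the test
print(f"value of information: {voi:.2f}")
print("run the test" if voi > test_cost else "skip the test")
```

With these assumed numbers the test saves about 28 units of expected repair cost, so it is worth running whenever it costs less than that. A real test is rarely perfect, in which case the expected cost after the test would have to be computed from posterior probabilities for each possible test outcome.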

