Space Systems Engineering: Risk Module Risk Module: Risk Management, Fault Trees and Failure Mode Effects Analysis Space Systems Engineering, version 1.0

Space Systems Engineering: Risk Module

Risk Module: Risk Management, Fault Trees and

Failure Mode Effects Analysis

Space Systems Engineering, version 1.0

Module Purpose: Risk

To understand risk, risk management, fault tree analysis and failure mode effects analysis in the context of project development

Acknowledge that risks are inevitable and recognize that through systematic management and analytic techniques they can be reduced

Review three techniques that are used to discover, assess, rank and mitigate risk - risk management, fault tree analysis and failure mode effects analysis

What are Risks and Risk Management?

Risks are potential events that have negative impacts on safety or project technical performance, cost or schedule

Risks are an inevitable fact of life – risks can be reduced but never eliminated

Risk Management comprises purposeful thought to the sources, magnitude, and mitigation of risk, and actions directed toward its balanced reduction

The same tools and perspectives that are used to discover, manage and reduce risks can be used to discover, manage and increase project opportunities - opportunity management

What is Risk Management?

Risk management is a continuous and iterative decision-making technique designed to improve the probability of success. It is a proactive approach that:

• Seeks or identifies risks
• Assesses the likelihood and impact of these risks
• Develops mitigation options for all identified risks
• Identifies the most significant risks and chooses which mitigation options to implement
• Tracks progress to confirm that cumulative project risk is indeed declining
• Communicates and documents the project risk status
• Repeats this process throughout the project life

Risk Management Considers the Entire Development and Operations Life of a Project

Risk Type: Example

• Technical Performance Risk: Failure to meet a spacecraft technical requirement or specification during verification
• Cost Risk: Failure to stay within a cost cap for the project
• Programmatic Risk: Failure to secure long-term political support
• Schedule Risk: Failure to meet a critical launch window
• Liability Risk: Spacecraft deorbits prematurely causing damage over the debris footprint
• Regulatory Risk: Failure to secure proper approvals for launch of nuclear materials
• Operational Risk: Failure of spacecraft during mission
• Safety Risk: Hazardous material release while fueling during ground operations
• Supportability Risk: Failure to resupply sufficient material to support human presence as planned

Every NASA Space Flight Project Begins with a Plan for Risk Management

This plan reflects the project’s risk management philosophy:
• Priority (criticality to long-term strategic plans)
• National significance
• Mission lifetime (primary baseline mission)
• Estimated project life cycle cost
• Launch constraints
• In-flight maintenance feasibility
• Alternative research opportunities or re-flight opportunities

The risk management philosophy is reflected in a number of ways:
• Whether single point failures are allowed
• Whether the system is monitored continuously during operations
• How much slack is in the development schedule
• How technical resource margins (e.g., mass, power, MIPS) are allocated throughout the development

Other Factors to Consider in Assessing Risk (but not limited to)…

Complexity of management and technical interfaces

Design and test margins

Mission criticality

Availability and allocation of resources such as mass, power, volume, data volume, data rates, and computing resources

Scheduling and manpower limitations

Ability to adjust to cost and funding profile constraints

Mission operations

Data handling, i.e., acquisition, archiving, distribution and analysis

Launch system characteristics

Available facilities

Risk Identification

Risks are identified by the development team, peer reviews, lessons from past projects and expert review

Lessons from past projects are captured via ‘trigger questions’, or questions that challenge a development strategy or design solution

The project risk status and top ten risk list are reviewed periodically - usually monthly - and at the project milestone reviews

Example Risk Trigger Questions

Have requirements been implemented such that a small change in requirements has the potential to cause large cost, performance or schedule system ramifications?

Do designs or requirements push the current state-of-the-art?

Has the concept for operating, maintaining, decommissioning or disposal of the system been adequately defined to ensure the identification of all requirements?

Has an independent cost estimate (ICE) been performed?

Is the schedule adequate to handle the level of requirements or objectives changes that are occurring or are likely to occur?

Have the necessary facilities for environmental test been identified and availability problems been resolved?

More Considerations for Risk Discovery

While each space project has its unique risks, a list of the underlying sources of risks would include the following:

Technical complexity - many design constraints or many dependent operational sequences having to occur in the right sequence and at the right time

Organizational complexity - many independent organizations having to perform with limited coordination

Inadequate margins or reserves

Inadequate implementation plans

Unrealistic schedules

Total and year-by-year budgets mismatched to the actual implementation risks

Over-optimistic designs pressured by mission expectations

Limited engineering analysis and understanding due to inadequate engineering tools and models

Limited understanding of the mission’s space environments

Inadequately trained or inexperienced project personnel

Inadequate processes or inadequate adherence to proven processes

Pause and Learn Opportunity

Engage the class in identifying risks for a familiar project.
• What kinds of risks are identified?
• What is the basis for their search for risks?

After the class has thought for a while, the instructor could present some trigger questions which may help discover new risks and show the value of the trigger questions.

Cartoon: Dilbert Identifies Risks

© United Features Syndicate, Inc.

The Benefits of Preparing for the Unexpected

Mars Spirit Rover Flash Memory Problem

“The thing that strikes me most about all this is how critical it was to have that INIT_CRIPPLED command in the system. It’s not the kind of command that you’d ever expect to use under normal conditions on Mars. But back during the earliest days of the project Glenn realized that someday we might need the flexibility to deal with a broken flash file system, and he put INIT_CRIPPLED in the system and left it there. And when the anomaly hit, it saved the mission.”

–From “Roving Mars” by Steve Squyres, Hyperion 2005

Be prepared for the low probability event with a huge consequence.

Background: On January 21, 2004 (Sol 18), Spirit abruptly ceased communicating with mission control. The next day the rover radioed a 7.8 bit/s beep, confirming that it had received a transmission from Earth but indicating that the spacecraft believed it was in a fault mode.

After Identification Risks are Assessed

Risks are assessed by characterizing the probability that a project will experience an undesired event and the consequences, impact or severity of the undesired event, were it to occur

Risks can be compared on iso-curves consisting of a likelihood measure and a consequence measure

Because the assessment of the likelihood and consequence of a risk is subjective and carries significant uncertainty, risk is characterized either qualitatively (low, medium or high) or semi-quantitatively (risks are captured on a 5x5 matrix)

[Figure: iso-curves dividing the plot into Low Risk, Medium Risk and High Risk regions; the axes are Severity of Consequence and Likelihood (Probability), the latter running from 0.0 to 1.0]
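One common way to draw such iso-curves, assumed here rather than stated on the slide, is to treat risk as the product of likelihood and consequence, so that every point on a curve has the same product. The sketch below uses a hypothetical helper, `iso_risk_likelihood`, and a consequence normalised to the range 0 to 1; both are illustrative choices only.

```python
def iso_risk_likelihood(risk_level, consequence):
    """Likelihood that places a risk on the curve likelihood * consequence = risk_level.

    Assumption: both axes are normalised to (0, 1]. The result is capped at 1.0
    because a probability cannot exceed certainty.
    """
    if not 0.0 < consequence <= 1.0:
        raise ValueError("consequence must be in (0, 1]")
    return min(1.0, risk_level / consequence)


# Points along one hypothetical iso-curve with risk level 0.2:
# a smaller consequence needs a larger likelihood to carry the same risk.
curve = [(c, iso_risk_likelihood(0.2, c)) for c in (0.25, 0.5, 1.0)]
```

Two risks that land on the same curve are treated as equally significant under this assumption, even though one is a likely nuisance and the other an unlikely catastrophe.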

An Example of Some Semi-Quantitative Definitions to Enable a Project to Compare and Rank Risks

Impact of Consequences (by class: Technical, Schedule, Cost)

Class I, Catastrophic (Scale 5)
• Technical: A condition that may cause death or permanently disabling injury, facility destruction on the ground, or loss of crew, major systems, or vehicle during the mission
• Schedule: launch window to be missed
• Cost: cost overrun > 50 % of planned cost

Class II, Critical (Scale 4)
• Technical: A condition that may cause severe injury or occupational illness, or major property damage to facilities, systems, equipment, or flight hardware
• Schedule: schedule slippage causing launch date to be missed
• Cost: cost overrun 15 % to 50 % of planned cost

Class III, Moderate (Scale 3)
• Technical: A condition that may cause minor injury or occupational illness, or minor property damage to facilities, systems, equipment, or flight hardware
• Schedule: internal schedule slip that does not impact launch date
• Cost: cost overrun 2 % to 15 % of planned cost

Class IV, Negligible (Scale 2)
• Technical: A condition that could cause the need for minor first aid treatment but would not adversely affect personal safety or health; damage to facilities, equipment, or flight hardware more than normal wear and tear level
• Schedule: internal schedule slip that does not impact internal development milestones
• Cost: cost overrun < 2 % of planned cost

Probability of Occurrence

Scale: Measure
• 5: Near certain to occur (80-100%).
• 4: Highly likely to occur (60-80%).
• 3: Likely to occur (40-60%).
• 2: Unlikely to occur (20-40%).
• 1: Not likely; Improbable (0-20%).
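The scale definitions above can be turned into small lookup functions so that every risk is scored the same way. The following is a minimal sketch, not a project standard: the function names are hypothetical, and because the slide's bands share their end points (20% appears in two bands, 15% in two classes), the code assumes a shared boundary value belongs to the higher band, except 50%, which the slide places in Class II.

```python
from bisect import bisect_right

# Upper edges of the probability bands for scale values 1 to 4; anything above is 5.
_PROBABILITY_EDGES = [0.2, 0.4, 0.6, 0.8]


def likelihood_scale(probability):
    """Map a probability in [0, 1] to the 1-5 Probability of Occurrence scale."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Assumption: a value exactly on a band edge (e.g. 0.2) goes to the higher band.
    return bisect_right(_PROBABILITY_EDGES, probability) + 1


def cost_consequence_scale(overrun_percent):
    """Map a cost overrun (percent of planned cost) to the consequence scale."""
    if overrun_percent > 50:
        return 5  # Class I, Catastrophic
    if overrun_percent >= 15:
        return 4  # Class II, Critical
    if overrun_percent >= 2:
        return 3  # Class III, Moderate
    return 2      # Class IV, Negligible
```

For example, a risk judged 50% likely with a projected 20% overrun would score likelihood 3 and cost consequence 4 under these assumptions.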

A 5x5 Risk Matrix Provides a Quick Visual Comparison of All Project Risks

High risk: mission success jeopardized, immediate action required

Medium risk: review regularly, contingent action if it does not improve

Low risk: watch and review periodically
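A cell of the 5x5 matrix can be reduced to a single score and a High, Medium or Low label. The sketch below assumes the score is likelihood times consequence and uses placeholder thresholds (`medium_at=6`, `high_at=15`); the slide does not give the actual cell colouring, so a real project would substitute its own boundaries. The risk titles are invented for illustration.

```python
def criticality(likelihood, consequence, medium_at=6, high_at=15):
    """Return (score, level) for 1-5 likelihood and consequence scale values."""
    for value in (likelihood, consequence):
        if value not in range(1, 6):
            raise ValueError("scale values must be integers from 1 to 5")
    score = likelihood * consequence
    if score >= high_at:      # hypothetical threshold
        level = "High"
    elif score >= medium_at:  # hypothetical threshold
        level = "Medium"
    else:
        level = "Low"
    return score, level


# Hypothetical risks: (title, likelihood, consequence)
risks = [
    ("Late avionics delivery", 4, 3),
    ("Test facility conflict", 2, 2),
    ("Launch window missed", 3, 5),
]

# Rank the list so the highest-scoring risk is reviewed first.
ranked = sorted(risks, key=lambda r: criticality(r[1], r[2])[0], reverse=True)
```

Sorting by score gives a simple top-risk ordering; ties would still need engineering judgement to break.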

Top Risks and their Trends are Periodically Reviewed for the SOFIA Project

[Figure: SOFIA Risk Matrix, a 5x5 grid of Likelihood (1-5) against Consequences (1-5) with Low, Med and High regions]

Legend
• Approach: M - Mitigate, W - Watch, A - Accept, R - Research
• Criticality (L x C): High, Med, Low
• Trend: Decreasing (Improving), Increasing (Worsening), Unchanged, New Since Last Period

Rank | Risk ID | Approach | Risk Title
1 | DFRC-34 | R | Landing Gear Door System Failure
2 | DFRC-12 | M | Sched Integration problems structure vs. avionics
3 | DFRC-07 | W | Cost growth for engine components
4 | DFRC-24 | A | Quality Control Resources insufficient
5 | DFRC-01 | W | Avionics software behind schedule
6 | DFRC-11 | R | Payload Capacity & Volume Trade-offs design issues
7 | DFRC-04 | R | Limited Flight Envelope, due to technical issues
8 | DFRC-02 | R | More flight testing may be required for Soft V&V

Top Risks and their Trends are Periodically Reviewed for the Constellation SE&I

[Figure: 5x5 risk matrix of LIKELIHOOD (1-5) against CONSEQUENCE (1-5), with the ranked risks plotted by number]

SE&I Top Risk List

Rank | Owning Team | Scores | Trend | Title
1 | FP_SIG | 55544 | N | 1677 - Ares I/Orion Ascent Aeroacoustic Environments
2 | FP_SIG | 44554 | N | 1676 - Structural loads on CEV and LSAM during TLI
3 | SE&I-PRIMO | 22202 | | 1122 - Requirements Maturation
4 | SE&I-AT&A | 40403 | | 1135 - Program Visibility for Closing the Architecture
5 | SE&I_SOA | 44435 | N | 1603 - (SRR) Abort Site Sea State Limits Launch Availability
6 | CSI_SIG | 33334 | | 1125 - Software Development and Assurance
7 | SE&I_SOA | 40004 | | 1195 - CxP Lifecycle cost
8 | SE&I_PTI_HR | 33003 | | 1046 - Tailoring of Human-Rating requirements

Scores: five digits per risk covering LIKE (likelihood) and the SAFE, PERF, SCHED and COST consequences, in the digit order that appears in this transcript.

Legend: trend symbols mark Decreasing (Improving), Increasing (Worsening) and Unchanged; flags mark Top Directorate Risk (TDR), Top Program Risk (TPR) and Top Project Risk (TProjR).

The Status of the Most Significant Risks and Their Mitigation Options are Reviewed Periodically

Title of risk

Description or Root cause

Possible categorizations
• System or subsystem
• Cause category (technology, programmatic, cost, schedule, etc.)
• Resources affected (budget, schedule slack, technical margins, etc.)

Owner

Assessment of Implementation risk or Mission risk
• Likelihood - estimate of the probability of the risk event
• Consequences - estimate of the performance, cost, safety and schedule effects

Mitigation
• Description, including costs of mitigation options
• Mitigation option leverage or reduction in the assessed risk
• Current mitigation activities
• Current trends in risk significance - likelihood and impact

Significant milestones
• Opening and closing of the window of occurrence
• Decision points for mitigation implementation effectiveness
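The review items above map naturally onto a structured risk record. The sketch below is one possible layout, assuming Python dataclasses; every class name, field name and sample value is hypothetical and chosen only to mirror the headings on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class MitigationOption:
    description: str
    cost: float            # cost of implementing the option (units are project-specific)
    risk_reduction: str    # leverage: expected reduction in the assessed risk


@dataclass
class RiskRecord:
    title: str
    root_cause: str
    owner: str
    system: str                   # system or subsystem
    cause_category: str           # technology, programmatic, cost, schedule, ...
    resources_affected: list      # budget, schedule slack, technical margins, ...
    likelihood: int               # 1-5 scale value
    consequences: dict            # keys: performance, cost, safety, schedule
    mitigation_options: list = field(default_factory=list)
    trend: str = "Unchanged"
    milestones: list = field(default_factory=list)

    def worst_consequence(self):
        """Highest consequence score across all affected areas."""
        return max(self.consequences.values())


# Hypothetical example record
record = RiskRecord(
    title="Avionics software behind schedule",
    root_cause="Late requirements changes",
    owner="Software lead",
    system="Avionics",
    cause_category="schedule",
    resources_affected=["schedule slack"],
    likelihood=3,
    consequences={"performance": 2, "cost": 3, "safety": 1, "schedule": 4},
)
record.mitigation_options.append(
    MitigationOption("Add a second test bench", cost=150.0, risk_reduction="likelihood 3 to 2")
)
```

Keeping the window of occurrence and decision points in `milestones` lets the periodic review check whether a mitigation decision is still timely.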

Part 2 of Risk Module: Fault Tree Analysis, Event Tree Analysis

Fault Tree Analysis Supports Design Decisions and Failure Investigations

Fault Tree Analysis (FTA) uses a top-down symbolic logic model and estimates of the failure probabilities of ‘initiators’ to estimate the probability that the pre-determined, undesirable ‘top’ event occurs

An initiator is a credible undesirable event that is a contributing cause to top event failure

‘Cut sets’ are groups of initiators that, when they all occur together, cause top event failure

‘Path sets’ are groups of initiators such that, if none of them occurs, the top event does not occur

FTA is both a design and a diagnostic tool
• As a design tool, FTA is used to compare alternative design solutions and the resulting TOP event probability
• As a diagnostic tool, FTA is used to investigate scenarios that may have led to the TOP event failure, leading to an estimate of the most likely cut sets
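To make the cut set idea concrete, the sketch below computes a top event probability from a list of cut sets. It assumes the initiators are statistically independent, and the initiator names and probabilities are invented. The exact value is found by enumerating every combination of initiator states, which is only practical for small trees; the sum over cut sets is the usual rare-event approximation.

```python
from itertools import product
from math import prod


def top_event_probability(cut_sets, p):
    """Exact P(top event), enumerating all initiator states (independence assumed)."""
    names = sorted(p)
    total = 0.0
    for states in product([False, True], repeat=len(names)):
        occurred = {n for n, s in zip(names, states) if s}
        # The top event occurs if every initiator of at least one cut set occurred.
        if any(cs <= occurred for cs in cut_sets):
            weight = 1.0
            for n, s in zip(names, states):
                weight *= p[n] if s else 1.0 - p[n]
            total += weight
    return total


def rare_event_approximation(cut_sets, p):
    """Sum of cut set probabilities; an upper bound that is close when all p are small."""
    return sum(prod(p[n] for n in cs) for cs in cut_sets)


# Hypothetical initiators: the top event needs A and B together, or C alone.
p = {"A": 0.1, "B": 0.2, "C": 0.05}
cut_sets = [frozenset({"A", "B"}), frozenset({"C"})]

p_top = top_event_probability(cut_sets, p)        # about 0.069
p_bound = rare_event_approximation(cut_sets, p)   # about 0.07
```

As a diagnostic aid, comparing the individual cut set probabilities (0.02 for A and B, 0.05 for C) points to C as the more likely explanation if the top event has occurred.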

Fault Tree Analysis

Fault tree analysis is a graphical representation of the combination of faults that will result in the occurrence of some (undesired) top event. In the construction of a fault tree, successive subordinate failure events are identified and logically linked to the top event. The linked events form a tree structure connected by symbols called gates.
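The two basic gates can be evaluated numerically once the input events are assumed independent: an AND gate multiplies the input probabilities, and an OR gate is one minus the probability that none of the inputs occurs. The small tree below is a made-up illustration, not an example from the course material.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if all (independent) input events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if at least one (independent) input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical tree: power is lost if the battery fails,
# OR if both the primary AND the backup solar array fail.
p_battery = 0.01
p_primary_array = 0.05
p_backup_array = 0.10

p_both_arrays = and_gate([p_primary_array, p_backup_array])  # 0.005
p_loss_of_power = or_gate([p_battery, p_both_arrays])        # about 0.01495
```

The AND gate shows why redundancy helps: two arrays that each fail fairly often contribute less to the top event than the single battery does.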

Refer to NASA Reference Publication 1358: System Engineering “Toolbox” for Design-Oriented Engineers

Section 3.6: Fault Tree Analysis (Handout)

Particular points:
• And/Or Gates explanation
• Example Fault Tree (Fig 3-20)

Event Trees

Event trees can be viewed as a special case of fault trees, where the branches are all ORs weighted by their probabilities.

Event trees are generated both in the success and failure domains.

This technique explores system responses to an initiating “challenge” and enables assessment of the probability of an unfavorable or favorable outcome. The system challenge may be a failure or fault, an undesirable event, or a normal system operating command.

In constructing the event tree, one traces each path to eventual success or failure.

This technique is typically performed in phase C but may also be performed in phase B.

See NASA Reference Publication 1358: System Engineering “Toolbox” for Design-Oriented Engineers section 3.8 for additional discussion.
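Tracing each path of an event tree amounts to multiplying the branch probabilities along it. The sketch below handles the simplest shape, a chain of safeguards that are challenged one after another until one succeeds; the function name, the safeguards and their success probabilities are all hypothetical.

```python
def sequential_event_tree(safeguards):
    """Enumerate the paths of an event tree for safeguards tried in order.

    safeguards: list of (name, probability_of_success).
    Returns a list of (path, outcome, probability) tuples.
    """
    paths = []
    p_reach = 1.0   # probability of reaching the current branch point
    history = []
    for name, p_success in safeguards:
        # Success branch: the path ends here in the success domain.
        paths.append((history + [f"{name}: success"], "success", p_reach * p_success))
        # Failure branch: carry on to the next safeguard.
        p_reach *= 1.0 - p_success
        history = history + [f"{name}: failure"]
    # Every safeguard failed: the path ends in the failure domain.
    paths.append((history, "failure", p_reach))
    return paths


# Hypothetical challenge: the primary pump stops.
paths = sequential_event_tree([
    ("Backup pump starts", 0.9),
    ("Operator restores flow", 0.7),
])
p_failure = sum(p for _, outcome, p in paths if outcome == "failure")  # about 0.03
```

Because the paths are mutually exclusive and cover every case, their probabilities sum to one, which is a useful check on a hand-built tree.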

Will the Stage Make it from Hangman’s Hill to Placer Gulch?

Station | Probability of no horses
1, 2, 3 | 0.2
4 | 0.1

Placer Gulch event tree example from a Safety & Mission Assurance training course by Pat Clemons of Sverdrup.

Fault Tree Analysis of the Placer Gulch Stage

Part 3 of Risk Module: Failure Mode Effects Analysis

Failure Mode Effects Analysis

• Objective
  • To ensure all failure modes have been identified and evaluated

• Technique
  • Select a method to rank project failure modes
  • Identify failure modes including all single point failure modes
  • Analyze failure modes and their mission effect
  • Determine those failure modes that might benefit from corrective action, e.g.,
    – Alternative designs
    – Redundancy
    – Increased reliability
  • Determine which, if any, corrective actions will be implemented

Failure Mode Effects Analysis

FMEA is a design tool for identifying risk in the system or mission design, with the intent of mitigating those risks with design changes. The FMEA risk mitigation:

1. Recognizes and evaluates the potential failure of a system and its effects;

2. Identifies actions which could eliminate or reduce the chance of a potential failure occurring.

FMEA is initiated in Phase B (Preliminary Design) and used to support design decisions in Phase C (Final Design).

Failure Mode and Effects Analysis

Worksheet columns and the question each one answers:

• Item / Function: What are the functions or requirements?
• Potential Failure Mode: What can go wrong?
  – No Function
  – Partially Degraded Function
  – Intermittent Function
  – Unintended Function
• Potential Effects of Failure: What are the Effects?
• Sev: How bad is it?
• Class
• Potential Causes/Mechanism(s) of Failure: What are the Cause(s)?
• Occur: How often does it happen?
• Current Controls (Prevention/Detection): How can this be prevented and detected?
• Detec: How good is this method at detecting it?
• RPN (Risk Priority Number)
• Recommended Action(s): What can be done?
  – Design changes
  – Process changes
  – Special controls
  – Changes to standards, procedures, or guides
• Responsibility & Target Completion Date: Who is going to do it and when?
• Action Results (Actions Taken, Sev, Occ, Det, RPN): What did they do and what are the outcomes?
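The RPN column is conventionally the product of the severity, occurrence and detection ratings; the slide does not spell this out, so treat the formula and the 1 to 10 rating scale used below as assumptions. The failure modes and ratings are invented purely to show how a worksheet might be ranked.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    item: str
    failure_mode: str
    severity: int     # Sev: how bad is it? (assumed 1-10)
    occurrence: int   # Occ: how often does it happen? (assumed 1-10)
    detection: int    # Det: assumed 1-10, higher means harder to detect

    @property
    def rpn(self):
        """Risk Priority Number, assumed to be Sev x Occ x Det."""
        return self.severity * self.occurrence * self.detection


# Hypothetical worksheet rows
modes = [
    FailureMode("Battery", "No Function", 9, 2, 3),
    FailureMode("Star tracker", "Intermittent Function", 6, 4, 5),
    FailureMode("Heater", "Unintended Function", 4, 3, 2),
]

# Failure modes with the highest RPN are the first candidates for corrective action.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
```

Recomputing the RPN after an action is taken, using the revised Sev, Occ and Det values, shows whether the corrective action actually lowered the priority of that failure mode.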

Module Summary: Risk

Risk is inevitable, so risks can be reduced but not eliminated.

Risk management is a proactive systematic approach to assessing risks, generating alternatives and reducing cumulative project risk.

Fault Tree Analysis is both a design and a diagnostic tool that uses estimates of the failure probabilities of initiators to estimate the probability of the pre-determined, undesirable, ‘top’ event.

Failure Mode Effects Analysis is a design tool for identifying risk in the system design, with the intent of mitigating those risks with design changes.

Backup Slides for Risk Module

Uncertainties that Plague Projects

Mission Objectives
• Uncertainties: Will the baseline system satisfy the needs & objectives? Are they the best ones?
• Offsets: Thorough study; Analyses; Cost & schedule credibility

Technical Factors
• Uncertainties: Can baseline technology achieve the objectives? Can the specified technology be attained? Are all the requirements known?
• Offsets: Technology development plan; Paper studies; Design reviews; Establish performance margins; Engineering model test and prototyping; Test & evaluation

Internal Factors
• Uncertainties: Can the plan and strategy meet the objectives? Resources (manpower skills, time, facilities)
• Offsets: Program strategy; Budget allocations; Contingency planning

External Factors
• Uncertainties: Will outside influences jeopardize the project?
• Offsets: Contingency; Robust design

Project Risk Categories

Typical Technical Risk Sources
• Physical properties
• Material properties
• Radiation properties
• Testing/Modeling
• Integration/Interface
• Software Design
• Safety
• Requirement changes
• Fault detection
• Operating environment
• Proven/Unproven technology
• System complexity
• Unique/Special Resources
• COTS performance
• Embedded training

Typical Programmatic Risk Sources
• Material availability
• Personnel availability
• Personnel skills
• Safety
• Security
• Environmental impact
• Communication problems
• Labor strikes
• Requirement changes
• Stakeholder advocacy
• Contractor stability
• Funding continuity and profile
• Regulatory changes

Typical Supportability Risk Sources
• Reliability and maintainability
• Training
• Operations and support
• Manpower considerations
• Facility considerations
• Interoperability considerations
• System safety
• Technical data

Typical Cost Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to schedule risk
• Labor rates
• Estimating error

Typical Schedule Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to cost risk
• Degree of currency
• Number of critical path items
• Estimating error