Some notes for Software Engineering -- System Failures

by Herbert J. Bernstein © Copyright Herbert J. Bernstein, 2002

Failures

Software engineering exists as a discipline because much software is not delivered when expected or does not perform as expected. The first step in controlling these problems is to understand them. Once the modes of failure are understood, the deficiencies in existing software can be addressed. Unfortunately, doing so often requires creating more software, which may itself fail, both in the process of its creation and in the performance of what is produced.

What is a software failure?
- Failure to conform to specifications
  - wrong output for given input
  - extra output for given inputs or no input
  - no output when output expected
  - untimely output
  - incorrect state
  - mismatched interfaces to external systems
  - ergonomic failures
- Characteristics of failures
  - reproducible (predictable, reliable) failure
    - given inputs always produce same wrong/missing/extra outputs
  - irreproducible (transient, sporadic) failure
  - fatal vs. recoverable
  - corrupting
    - damaging to input
    - damaging to state
    - damaging to other systems
  - critical vs. non-critical subsystem failure
How does a software failure occur?
- incorrect code
- missing code
- extra code
- contractual disagreements -- mismatches between code/data structures in one procedure/module/subroutine and another
- incorrect or mis-sized data structures
- incorrect references to external processes or data structures
- code that does not meet time constraints
- code that does not meet space constraints
- human interfaces that don't match people's training or preferences
Why does a software failure occur?
- lack of management
- lack of clear specifications
- misunderstanding of specifications
- infeasible specification
- coding and transcription errors
- equipment failures
- operating system, compiler and library failures
- changes in external systems
- changes in internal subsystems
- unintended consequences of other changes
- careless or malicious human intervention
When does a failure occur?
- failure is most likely where there is change
- bathtub-shaped curve
  - infant mortality
  - stable operation
  - old age
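The bathtub-shaped curve above can be sketched numerically. The following is only an illustration, assuming the failure rate is modeled as the sum of a decreasing Weibull hazard (infant mortality), a constant rate (stable operation) and an increasing Weibull hazard (old age). The function names and all parameter values are hypothetical and are not taken from these notes.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative failure rate at time t > 0 (all parameters are assumptions)."""
    infant_mortality = weibull_hazard(t, shape=0.5, scale=10.0)  # shape < 1: rate falls with time
    stable_operation = 0.01                                      # constant background rate
    old_age = weibull_hazard(t, shape=5.0, scale=100.0)          # shape > 1: rate rises with time
    return infant_mortality + stable_operation + old_age


early = bathtub_hazard(0.1)
middle = bathtub_hazard(30.0)
late = bathtub_hazard(150.0)
print(early, middle, late)
```

With these assumed parameters the rate is high at first, lowest in the middle, and rises again late in life, which is the bathtub shape.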
Who is involved in causing a failure?
- untrained managers and unskilled specifiers
- programmers
- authorized users who intentionally or unintentionally challenge systems
- unauthorized users who intentionally or unintentionally challenge systems
- providers of unreliable support software and hardware
Consequences of failures
- reliability (probability that system will perform as specified)
  MTBF (mean time between failures)
- availability (probability that system will perform as specified in a given time period -- ratio of periods of performance to total time)
  MTTR (mean time to repair; must be small) and MTBF (mean time between failures; must be large)
- safety -- system causes damage to other systems
- security -- system is vulnerable to damage from other systems
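The availability ratio above can be worked through in a few lines. This is a minimal sketch, assuming MTBF is taken to mean the average uptime between failures, so that steady-state availability is MTBF / (MTBF + MTTR). The function name and the sample figures are illustrative only.

```python
def availability(mtbf_hours, mttr_hours):
    """Fraction of total time the system performs as specified.

    Assumes mtbf_hours is the mean uptime between failures and
    mttr_hours is the mean downtime per failure.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Halving the repair time raises availability without changing MTBF.
slow_repair = availability(mtbf_hours=500.0, mttr_hours=10.0)
fast_repair = availability(mtbf_hours=500.0, mttr_hours=5.0)
print(slow_repair, fast_repair)
```

The sketch shows why MTTR must be small and MTBF large: either change pushes the ratio toward 1.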
Discovering Failures
- assume failures will occur
- instrument code to check for failures
- system performance must be measured against standards
- logs
- checksums
- loopback tests (reproduce inputs from output)
- test suites (compare current outputs to expected outputs)
- debuggers, debug printouts
- fault-tree analysis
- human pattern recognition
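Two of the techniques listed above, checksums and loopback tests, can be sketched as follows. This is an illustration only: the record layout, the JSON encoding and the function names are assumptions, not part of these notes. It uses Python's standard hashlib and json modules.

```python
import hashlib
import json


def checksum(data: bytes) -> str:
    """SHA-256 digest of the data, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def encode(record: dict) -> bytes:
    """Hypothetical output routine: serialize a record to bytes."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> dict:
    """Inverse of encode, used only to close the loop."""
    return json.loads(payload.decode("utf-8"))


def loopback_ok(record: dict) -> bool:
    """Loopback test: reproduce the input from the output and compare."""
    return decode(encode(record)) == record


record = {"id": 7, "name": "pump-3", "readings": [1.5, 2.0]}
payload = encode(record)
stored_checksum = checksum(payload)

# Simulate corruption in storage or transit, then detect it.
corrupted = payload.replace(b"pump-3", b"pump-8")
print(loopback_ok(record))
print(checksum(corrupted) == stored_checksum)
```

The checksum detects that the stored bytes changed; the loopback test detects that the output routine lost or altered information.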
Fixing failures
- an art, not a science
- return to the specifications
- continue checking after finding the first error
- validate the complete system, not just the failing module
- expect problems resulting from the fix
- if possible, provide a way for the system to recover and continue
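The last point above, recovering and continuing, might look like the following sketch. It assumes a hypothetical batch of text readings in which one bad line should not abort the whole run; the names parse_reading and process_batch are illustrative.

```python
import logging

logger = logging.getLogger("batch")


def parse_reading(text):
    """Convert one line of text to a number; raises ValueError on bad input."""
    return float(text)


def process_batch(lines):
    """Parse every line, recording failures instead of stopping at the first one."""
    values = []
    failures = []
    for lineno, text in enumerate(lines, start=1):
        try:
            values.append(parse_reading(text))
        except ValueError as exc:
            # Log the failure so it can be found later, then continue.
            logger.warning("line %d skipped: %s", lineno, exc)
            failures.append((lineno, text))
    return values, failures


values, failures = process_batch(["1.5", "oops", "3"])
print(values, failures)
```

Returning the failures alongside the results keeps the recovery visible to the caller, so a skipped record is not silently lost.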
Avoiding Failures
- KISS (keep it simple, stupid)
- validate inputs
  - never trust a user
  - never trust another module
  - check all inputs, not just the first wrong one
- produce diagnostic logs
- avoid manual memory management
- avoid complex i/o
- limit context
- limit the damage a failure can cause
- use tools you understand
- have another person check your work
- let your system age before the final check and handoff
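The input-validation advice above, in particular checking all inputs rather than stopping at the first wrong one, can be sketched as a validator that collects every problem it finds. The order fields and the rules here are hypothetical, chosen only to illustrate the pattern.

```python
def validate_order(order):
    """Return a list of every problem found in the order; empty means valid."""
    errors = []

    quantity = order.get("quantity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(quantity, int) or isinstance(quantity, bool) or quantity <= 0:
        errors.append("quantity must be a positive integer")

    unit_price = order.get("unit_price")
    if (not isinstance(unit_price, (int, float)) or isinstance(unit_price, bool)
            or unit_price < 0):
        errors.append("unit_price must be a non-negative number")

    email = order.get("email")
    if not isinstance(email, str) or "@" not in email:
        errors.append("email must contain '@'")

    return errors


bad_order = {"quantity": 0, "unit_price": -1, "email": "nobody"}
good_order = {"quantity": 2, "unit_price": 9.5, "email": "a@example.com"}
print(validate_order(bad_order))
print(validate_order(good_order))
```

Reporting all three problems at once spares the user, or the calling module, a cycle of fixing one error only to be told about the next.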
Risk management
- Risk -- probability that the system will fail
- Hazard -- a combination of conditions which might cause failure
  called a threat in security
- Damage -- the consequences of a system failure
- Gambler's ruin -- unacceptable damage even from a low probability risk
- No system is perfect -- all we can do is minimize risk and risk-weighted damage
- Vulnerability
- Attacks
- Controls
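Risk-weighted damage and the gambler's ruin case above can be illustrated with a small calculation. This sketch assumes risk-weighted damage means probability of failure multiplied by the damage it would cause, and that "unacceptable damage" can be expressed as a ceiling on any single loss. The function names, thresholds and figures are all hypothetical.

```python
def risk_weighted_damage(probability, damage):
    """Expected damage: probability of failure times the cost of that failure."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * damage


def acceptable(probability, damage, max_expected, max_single):
    """A hazard is acceptable only if both the expected damage and the
    worst-case single loss are within limits (the gambler's ruin check)."""
    within_expected = risk_weighted_damage(probability, damage) <= max_expected
    survivable = damage <= max_single
    return within_expected and survivable


# Frequent but cheap failure: low expected damage, survivable.
minor = acceptable(0.5, 200.0, max_expected=5000.0, max_single=1e7)

# Rare but ruinous failure: expected damage looks small, yet one
# occurrence exceeds what the organization can absorb.
ruinous = acceptable(1e-6, 1e9, max_expected=5000.0, max_single=1e7)
print(minor, ruinous)
```

The second case passes an expected-damage test on its own, which is why the worst-case loss is checked separately.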