Some notes for Software Engineering --
Failures
by Herbert J. Bernstein
© Copyright Herbert J. Bernstein, 2002
Software engineering exists as a discipline because much software fails
to be delivered when expected or to perform as expected. The first
step to controlling these problems is to understand them. Once the
modes of failure are understood, the deficiencies in existing
software can be addressed. Unfortunately, doing so often involves the
creation of more software, which may also exhibit failures, both in
the process of creating the software and in the performance of the
software produced.
- What is a software failure?
- Failure to conform to specifications
- wrong output for given input
- extra output for given inputs or no input
- no output when output expected
- untimely output
- incorrect state
- mismatched interfaces to external systems
- ergonomic failures
- Characteristics of failures
- reproducible (predictable, reliable) failure
- given inputs always produce same wrong/missing/extra
outputs
- irreproducible (transient, sporadic) failure
- fatal vs. recoverable
- corrupting
- damaging to input
- damaging to state
- damaging to other systems
- critical vs. non-critical subsystem failure
- How does a software failure occur?
- incorrect code
- missing code
- extra code
- contractual disagreements -- mismatches between code/data structures in one procedure/module/subroutine and another
- incorrect or mis-sized data structures
- incorrect references to external processes or data structures
- code that does not meet time constraints
- code that does not meet space constraints
- human interfaces that don't match people's training or preferences
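As an illustration of a contractual disagreement between two modules, the following minimal sketch uses Python's standard `struct` module. The record layouts and the names `writer_pack` and `reader_unpack` are hypothetical, chosen only for this example. The writer and the reader disagree about field order, so the reader silently produces wrong values; when they disagree about size, the mismatch is at least detected.

```python
import struct

# Hypothetical record layouts: the writer and reader were coded
# against different versions of the same "specification".
WRITER_LAYOUT = "<HI"   # 2-byte id, then 4-byte count (6 bytes total)
READER_LAYOUT = "<IH"   # 4-byte count, then 2-byte id (also 6 bytes)
OVERSIZED_LAYOUT = "<II"  # 8 bytes: a mis-sized data structure


def writer_pack(record_id, count):
    """Serialize a record the way the writing module understands it."""
    return struct.pack(WRITER_LAYOUT, record_id, count)


def reader_unpack(data, layout=READER_LAYOUT):
    """Deserialize a record the way the reading module understands it."""
    return struct.unpack(layout, data)


data = writer_pack(7, 1000)

# Same size, different field order: no error is raised, but the
# values are wrong. This is a silent, corrupting failure.
silently_wrong = reader_unpack(data)
print("reader sees:", silently_wrong)

# Different size: the mismatch is caught, so the failure is detectable.
try:
    reader_unpack(data, OVERSIZED_LAYOUT)
except struct.error as exc:
    print("size mismatch detected:", exc)
```

The silent case is usually the more dangerous one, since nothing in the program signals that the two modules disagree.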
- Why does a software failure occur?
- lack of management
- lack of clear specifications
- misunderstanding of specifications
- infeasible specification
- coding and transcription errors
- equipment failures
- operating system, compiler and library failures
- changes in external systems
- changes in internal subsystems
- unintended consequences of other changes
- careless or malicious human intervention
- When does a failure occur?
- failure is most likely where there is change
- bathtub-shaped curve
- infant mortality
- stable operation
- old age
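The bathtub-shaped curve can be sketched numerically. The example below is one common way to model it, not necessarily the model intended in these notes: it adds a decreasing Weibull hazard (infant mortality), a constant hazard (stable operation) and an increasing Weibull hazard (old age). All shape, scale and rate values are illustrative assumptions.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1).

    shape < 1 gives a falling failure rate, shape > 1 a rising one.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Failure rate at time t > 0 as the sum of three phases.

    The parameters are made up for illustration only.
    """
    infant_mortality = weibull_hazard(t, shape=0.5, scale=100.0)
    stable_operation = 0.001
    old_age = weibull_hazard(t, shape=5.0, scale=1000.0)
    return infant_mortality + stable_operation + old_age


# Print the failure rate at a few points in the system's life.
for t in (1, 10, 100, 300, 1000, 1500):
    print(f"t={t:5d}  failure rate={bathtub_hazard(t):.5f}")
```

The printed rates start high, fall to a floor in mid-life, and rise again late in life, which is the shape the name refers to.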
- Who is involved in causing a failure?
- untrained managers and unskilled specifiers
- programmers
- authorized users who intentionally or unintentionally challenge
systems
- unauthorized users who intentionally or unintentionally challenge
systems
- providers of unreliable support software and hardware
- Consequences of failures
- reliability (probability that the system will perform as
specified), measured by MTBF (mean time between failures)
- availability (probability that the system will perform as specified
in a given time period -- ratio of periods of performance to total
time), which requires a small MTTR (mean time to repair) and a
large MTBF
- safety -- system causes damage to other systems
- security -- system is vulnerable to damage from other systems
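The relationship between availability, MTBF and MTTR noted above can be made concrete. The sketch below uses the usual steady-state formula, availability = MTBF / (MTBF + MTTR). The `reliability` function additionally assumes a constant failure rate (an exponential model), which is an assumption of this example and not a claim made in these notes.

```python
import math


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of total time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def reliability(t_hours, mtbf_hours):
    """Probability of running t_hours without failure.

    Assumes a constant failure rate of 1/MTBF (exponential model).
    """
    return math.exp(-t_hours / mtbf_hours)


# Same MTBF, different repair times: a smaller MTTR raises availability.
print(availability(mtbf_hours=1000.0, mttr_hours=10.0))
print(availability(mtbf_hours=1000.0, mttr_hours=1.0))

# Chance of surviving one day without failure under the exponential model.
print(reliability(t_hours=24.0, mtbf_hours=1000.0))
```

This shows why both numbers matter: a system that fails rarely but takes long to repair can be less available than one that fails more often and recovers quickly.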
- Discovering Failures
- assume failures will occur
- instrument code to check for failures
- system performance must be measured against standards
- logs
- checksums
- loopback tests (reproduce inputs from output)
- test suites (compare current outputs to expected outputs)
- debuggers, debug printouts
- fault-tree analysis
- human pattern recognition
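Two of the techniques listed above, checksums and loopback tests, are small enough to sketch. The example uses CRC-32 from Python's standard `zlib` module and a JSON round trip. The framing format (payload followed by a 4-byte checksum) and the function names are assumptions made for this illustration.

```python
import json
import zlib


def with_checksum(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify_checksum(frame: bytes) -> bytes:
    """Return the payload if its CRC-32 matches, else raise ValueError."""
    if len(frame) < 4:
        raise ValueError("frame too short to contain a checksum")
    payload, stored = frame[:-4], int.from_bytes(frame[-4:], "big")
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: data was corrupted")
    return payload


def loopback_ok(record) -> bool:
    """Loopback test: encode, decode, and compare with the original input."""
    return json.loads(json.dumps(record)) == record


frame = with_checksum(b"hello")
print(verify_checksum(frame))

# Flip one bit to simulate corruption in transit or storage.
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]
try:
    verify_checksum(corrupted)
except ValueError as exc:
    print("detected:", exc)

# A loopback test exposes lossy encodings: JSON turns integer keys
# into strings, so this record does not survive the round trip.
print(loopback_ok({"a": 1, "b": [1, 2]}))
print(loopback_ok({1: "x"}))
```

The loopback failure on the last line is the kind of result such a test exists to find: the encoder and decoder each work, but together they do not reproduce the input.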
- Fixing failures
- an art, not a science
- return to the specifications
- continue checking after finding the first error
- validate the complete system, not just the failing module
- expect problems resulting from the fix
- if possible, provide a way for the system to recover and continue
- Avoiding Failures
- KISS (keep it simple, stupid)
- validate inputs
- never trust a user
- never trust another module
- check all inputs, not just the first wrong one
- produce diagnostic logs
- avoid manual memory management
- avoid complex i/o
- limit context
- limit the damage a failure can cause
- use tools you understand
- have another person check your work
- let your system age before the final check and handoff
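Several of the points above (validate inputs, never trust another module, check all inputs and not just the first wrong one, produce diagnostic logs) can be combined in one small routine. This is a minimal sketch: the `order` record, its field names and the rules applied to them are hypothetical.

```python
import logging

logger = logging.getLogger("validation")


def validate_order(order: dict) -> list:
    """Return a list of every problem found; an empty list means valid.

    All fields are checked even after the first error, so the caller
    gets a complete report in one pass.
    """
    errors = []

    customer = order.get("customer")
    if not isinstance(customer, str) or not customer.strip():
        errors.append("customer must be a non-empty string")

    quantity = order.get("quantity")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (not isinstance(quantity, int) or isinstance(quantity, bool)
            or quantity <= 0):
        errors.append("quantity must be a positive integer")

    price = order.get("unit_price")
    if (not isinstance(price, (int, float)) or isinstance(price, bool)
            or price < 0):
        errors.append("unit_price must be a non-negative number")

    # Diagnostic log: record what was rejected and why.
    for message in errors:
        logger.warning("rejected order %r: %s", order, message)
    return errors


print(validate_order({"customer": "Ann", "quantity": 2, "unit_price": 9.5}))
print(validate_order({"customer": "", "quantity": -1, "unit_price": "free"}))
```

Returning the full list instead of raising on the first problem means the user can correct every bad field in a single attempt.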
- Risk management
- Risk -- probability that the system will fail
- Hazard -- a combination of conditions which might cause
failure (called a threat in security)
- Damage -- the consequences of a system failure
- Gambler's ruin -- unacceptable damage even from a low
probability risk
- No system is perfect -- all we can do is minimize risk
and risk-weighted damage
- Vulnerability
- Attacks
- Controls
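The ideas of risk-weighted damage and gambler's ruin from the list above can be sketched as a small calculation. Here risk-weighted damage is taken to be probability times damage (an expected loss), and a hazard is flagged as a gambler's ruin case when its damage alone exceeds what the organization can survive. The hazard names, probabilities, damage figures and the threshold are invented for illustration.

```python
def risk_weighted_damage(probability, damage):
    """Expected loss from one hazard: probability of failure times damage."""
    return probability * damage


def assess(hazards, ruin_threshold):
    """Rank hazards by expected loss and flag gambler's-ruin cases.

    hazards: list of (name, probability, damage) tuples.
    ruin_threshold: damage considered unacceptable at any probability.
    Returns a list of (name, expected_loss, is_ruinous), worst first.
    """
    report = [
        (name, risk_weighted_damage(p, damage), damage >= ruin_threshold)
        for name, p, damage in hazards
    ]
    return sorted(report, key=lambda row: row[1], reverse=True)


# Made-up hazards: (name, probability per year, damage in dollars).
hazards = [
    ("disk full", 0.10, 5_000.0),
    ("data center loss", 0.0001, 50_000_000.0),
    ("typo in report", 0.5, 100.0),
]

for name, expected_loss, is_ruinous in assess(hazards, 10_000_000.0):
    flag = "  <-- unacceptable even at low probability" if is_ruinous else ""
    print(f"{name:18s} expected loss {expected_loss:10.2f}{flag}")
```

Ranking by expected loss alone would treat a rare catastrophic hazard like any other line item, which is why the ruin check is kept separate from the ranking.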