Safety engineering is an
engineering discipline which assures that engineered systems provide acceptable
levels of safety.
It is strongly related to systems engineering, industrial engineering, and their subset, system safety engineering. Safety engineering assures that a life-critical system behaves as needed, even when components fail.
Overview
The primary goal of safety
engineering is to manage risk, eliminating or reducing it to acceptable levels.
Risk is the combination of the probability of a failure event and the severity of its consequences. For instance, a particular failure may result in fatalities, injuries, property damage, or nothing more than annoyance, and it may be a frequent, occasional, or rare occurrence. The acceptability of the failure depends on the combination of the two. Probability is often more difficult to predict than severity because of the many factors that can lead to a failure, such as mechanical failure, environmental effects, and operator error.
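As an illustration of how the two factors combine, the short Python sketch below builds a qualitative risk matrix; the severity and likelihood categories and the acceptability threshold are invented for illustration and are not taken from any particular standard.

```python
# Minimal qualitative risk matrix: risk combines severity and likelihood.
# Categories, weights, and the acceptability threshold are illustrative only.
SEVERITY = {"negligible": 1, "marginal": 2, "critical": 3, "catastrophic": 4}
LIKELIHOOD = {"improbable": 1, "remote": 2, "occasional": 3, "frequent": 4}

def risk_index(severity: str, likelihood: str) -> int:
    """Combine severity and likelihood into a single risk index."""
    return SEVERITY[severity] * LIKELIHOOD[likelihood]

def acceptable(severity: str, likelihood: str, threshold: int = 6) -> bool:
    """A failure is accepted only if its risk index is below the threshold."""
    return risk_index(severity, likelihood) < threshold

# A catastrophic but improbable event may be accepted (index 4) ...
print(acceptable("catastrophic", "improbable"))   # True
# ... while a critical and frequent one is not (index 12).
print(acceptable("critical", "frequent"))         # False
```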
Safety engineering attempts to
reduce the frequency of failures, and ensure that when failures do occur, the
consequences are not life-threatening. For example, bridges are designed to
carry loads well in excess of the heaviest truck likely to use them. This
reduces the likelihood of being overloaded. Most bridges are designed with
redundant load paths, so that if any one structural member fails, the structure
will remain standing. This reduces the severity if the bridge is overloaded.
Ideally, safety engineering starts
during the early design of a system. Safety engineers consider what undesirable
events can occur under what conditions, and project the related accident risk.
They may then propose safety requirements in specifications at the start of development, or require changes to existing CAD designs or in-service products, to make the system safer, either by eliminating hazards outright or by lowering the accident risk. Far too often, rather than actually influencing the design, safety engineers are assigned to prove that an existing, completed design is safe. If significant safety problems are discovered late in the development process, correcting them can be very expensive. Errors of this kind can waste large sums of money and, more importantly, cost human lives and cause environmental damage.
The exception to this conventional approach is the way some large government agencies approach safety engineering from a more proactive and proven process perspective, known as "system safety". The system safety philosophy is applied to complex and critical systems such as commercial airliners, complex weapon systems, spacecraft, rail and transportation systems, air traffic control systems, and other safety-critical industrial systems. Proven system safety methods and techniques prevent, eliminate, and control hazards and risks by influencing the design through a collaboration of key engineering disciplines and product teams. Software safety is a fast-growing field, since the functionality of modern systems is increasingly placed under the control of software. System safety and software safety, as subsets of systems engineering, influence safety-critical system designs by conducting several types of hazard analyses to identify hazards, to validate and verify the design against them, and, where needed, to specify additional design safety features and procedures that mitigate risk to acceptable levels before the system is certified.
Additionally, failure mitigation
can go beyond design recommendations, particularly in the area of maintenance.
There is an entire realm of safety and reliability engineering known as Reliability Centered Maintenance
(RCM), which is a discipline that is a direct result of analyzing potential
failures within a system and determining maintenance actions that can mitigate
the risk of failure. This methodology is used extensively on aircraft and
involves understanding the failure modes of the serviceable replaceable
assemblies in addition to the means to detect or predict an impending failure.
Every automobile owner is familiar with this concept when they take their car in to have the oil changed or the brakes checked. Even filling up one's car with fuel is a simple example of a failure mode (failure due to fuel exhaustion), a means of detection (the fuel gauge), and a maintenance action (filling the tank). (Using the car's odometer to also estimate remaining fuel illustrates the concept of "redundant sensors".)
For large-scale complex systems, hundreds if not thousands of maintenance actions can result from the failure analysis. These maintenance actions may be condition-based (e.g., a gauge reading or a leaking valve), based on hard limits (e.g., a component is known to fail after 100 hours of operation with 95% certainty), or may require inspection to determine the appropriate action (e.g., metal fatigue). The RCM concept then analyzes each individual maintenance item for its risk contribution to safety, mission, operational readiness, or cost of repair if the failure does occur. The sum total of all the maintenance actions is then bundled into maintenance intervals, so that maintenance does not occur around the clock but at regular intervals. This bundling process introduces further complexity, as it may stretch some maintenance cycles, thereby increasing risk, while shortening others, thereby potentially reducing risk; the end result is a comprehensive maintenance schedule, purpose-built to reduce operational risk and ensure acceptable levels of operational readiness and availability.
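As a rough sketch of the bundling step described above, the Python fragment below groups hypothetical maintenance actions into a few fixed inspection intervals; the task list and interval choices are invented, and each task is conservatively pulled into a shorter slot, whereas a real RCM trade-off may also stretch some cycles and re-assess the resulting risk.

```python
# Sketch: bundle maintenance actions into a few fixed inspection intervals.
# Task names and hour figures are hypothetical.
from collections import defaultdict

# (task, maximum allowed interval in operating hours)
tasks = [
    ("check hydraulic pump seal", 120),
    ("inspect fan blades for fatigue", 450),
    ("replace filter element", 300),
    ("lubricate actuator", 700),
]

scheduled_intervals = [100, 250, 500]   # available maintenance slots (hours)

bundles = defaultdict(list)
for name, max_interval in tasks:
    # Pick the largest scheduled slot that does not exceed the task's own
    # limit; this shortens (never stretches) the cycle for that task.
    slot = max((i for i in scheduled_intervals if i <= max_interval),
               default=scheduled_intervals[0])
    bundles[slot].append(name)

for slot in sorted(bundles):
    print(f"every {slot} h: {', '.join(bundles[slot])}")
```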
Analysis techniques
Analysis techniques can be split into two categories: qualitative and quantitative methods. Both approaches share the goal of finding causal dependencies between a system-level hazard and failures of individual components. Qualitative approaches focus on the question "What must go wrong so that a system hazard may occur?", while quantitative methods aim to provide estimates of the probabilities, rates, and/or severity of consequences.
Traditionally, safety analysis techniques have relied solely on the skill and expertise of the safety engineer. In the last decade, model-based approaches have become prominent. In contrast to traditional methods, model-based techniques try to derive the relationships between causes and consequences from some sort of model of the system.
Traditional methods for safety analysis
The two most common fault modeling
techniques are called failure mode and effects analysis
and fault tree analysis. These techniques are just
ways of finding problems and of making plans to cope with failures, as in probabilistic risk assessment. One of
the earliest complete studies using this technique on a commercial nuclear
plant was the WASH-1400
study, also known as the Reactor Safety Study or the Rasmussen Report.
Failure modes and effects analysis
Failure Mode and Effects Analysis
(FMEA) is a bottom-up, inductive analytical method which may be
performed at either the functional or piece-part level. For functional FMEA,
failure modes are identified for each function in a system or equipment item,
usually with the help of a functional block
diagram. For piece-part FMEA, failure modes are identified for each
piece-part component (such as a valve, connector, resistor, or diode). The
effects of the failure mode are described and assigned a probability based on the failure rate and failure mode ratio of the function or component. This quantification is difficult for software: a bug either exists or it does not, and the failure models used for hardware components do not apply. Temperature, age, and manufacturing variability affect a resistor, but they do not affect software.
Failure modes with identical
effects can be combined and summarized in a Failure Mode Effects Summary. When
combined with criticality analysis, FMEA is known as Failure Mode, Effects,
and Criticality Analysis or FMECA, pronounced "fuh-MEE-kuh".
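The quantitative bookkeeping of a piece-part FMEA can be sketched as below: each failure mode's probability is taken as the component failure rate multiplied by its failure mode ratio, and modes with identical effects are summed into a failure mode effects summary. The component data are hypothetical.

```python
# Sketch of piece-part FMEA bookkeeping; all figures are hypothetical.
from collections import defaultdict

# (component, failure mode, end effect, failure rate per hour, failure mode ratio)
fmea_rows = [
    ("valve V1",    "fails open",   "loss of pressure control", 2e-6, 0.6),
    ("valve V1",    "fails closed", "flow blocked",             2e-6, 0.4),
    ("resistor R3", "open circuit", "loss of sensor signal",    5e-8, 0.9),
    ("resistor R3", "short",        "loss of sensor signal",    5e-8, 0.1),
]

# Failure mode probability (per hour) = failure rate x failure mode ratio.
summary = defaultdict(float)
for component, mode, effect, rate, ratio in fmea_rows:
    summary[effect] += rate * ratio

# Failure mode effects summary: failure modes with identical effects combined.
for effect, rate in sorted(summary.items(), key=lambda kv: -kv[1]):
    print(f"{effect}: {rate:.2e} per hour")
```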
Fault tree analysis
Fault tree analysis (FTA) is a
top-down, deductive analytical method. In FTA, initiating
primary events such as component failures, human errors, and external events
are traced through Boolean logic gates to an undesired top event such as
an aircraft crash or nuclear reactor core melt. The intent is to identify ways
to make top events less probable, and verify that safety goals have been
achieved.
FTA may be qualitative or quantitative. When failure and
event probabilities are unknown, qualitative fault trees may be analyzed for
minimal cut sets. For example, if any minimal cut set contains a single base
event, then the top event may be caused by a single failure. Quantitative FTA
is used to compute top event probability, and usually requires computer
software such as CAFTA from the Electric Power Research Institute
or SAPHIRE
from the Idaho National Laboratory.
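Under the usual independence assumptions, the quantitative step can be sketched in a few lines of Python: basic event probabilities combine through AND gates by multiplication and through OR gates by the complement rule. The example tree and probabilities are invented; real analyses rely on dedicated tools such as CAFTA or SAPHIRE.

```python
# Minimal quantitative fault tree evaluation assuming independent basic events.
# The example tree and probabilities are hypothetical.
from functools import reduce

def AND(*p):
    """All inputs must fail: multiply the probabilities."""
    return reduce(lambda acc, x: acc * x, p, 1.0)

def OR(*p):
    """Any input failing suffices: 1 minus the product of survival probabilities."""
    return 1.0 - reduce(lambda acc, x: acc * (1.0 - x), p, 1.0)

# Basic events (probability of failure on demand).
pump_a_fails   = 1e-3
pump_b_fails   = 1e-3
power_fails    = 1e-4
operator_error = 5e-3

# Top event: loss of cooling = (both pumps fail) OR power fails OR operator error.
top = OR(AND(pump_a_fails, pump_b_fails), power_fails, operator_error)
print(f"P(top event) = {top:.2e}")  # dominated by the single-event cut sets
```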
Some industries use both fault
trees and event
trees. An event tree starts from an undesired initiator (loss of critical
supply, component failure etc.) and follows possible further system events
through to a series of final consequences. As each new event is considered, a
new node on the tree is added with a split of probabilities of taking either
branch. The probabilities of a range of "top events" arising from the
initial event can then be seen.
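An event tree can be evaluated in the same spirit: the frequency of each outcome is the initiating event frequency multiplied by the success or failure probabilities of the safety functions along that branch. The figures below are purely illustrative.

```python
# Sketch of a small event tree; all figures are hypothetical.
initiator_per_year = 1e-2      # e.g. loss of a critical supply
p_alarm_fails      = 1e-3      # first safety function fails on demand
p_backup_fails     = 1e-2      # second safety function fails on demand

# Each outcome frequency = initiator frequency x branch probabilities.
outcomes = {
    "alarm works, incident contained": initiator_per_year * (1 - p_alarm_fails),
    "alarm fails, backup contains it": initiator_per_year * p_alarm_fails * (1 - p_backup_fails),
    "alarm fails, backup fails":       initiator_per_year * p_alarm_fails * p_backup_fails,
}
for outcome, freq in outcomes.items():
    print(f"{outcome}: {freq:.2e} per year")
```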
Safety certification
Usually a failure in safety-certified systems is acceptable if, on average, less than one life per 10⁹ hours of continuous operation is lost to failure. Most Western nuclear reactors, medical equipment, and commercial aircraft are certified to this level. The cost versus the loss of lives has been considered appropriate at this level (by the FAA for aircraft systems under the Federal Aviation Regulations).[1][2][3]
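To put the figure in perspective, a back-of-the-envelope calculation, using assumed fleet size and utilization rather than any official data, shows what 10⁻⁹ per hour means in practice.

```python
# Rough illustration of a 1e-9 per hour catastrophic failure target.
# Fleet size and utilization are assumed figures, not from any source.
catastrophic_rate = 1e-9      # catastrophic failures per operating hour
fleet_size = 5000             # aircraft in the fleet (assumption)
hours_per_year = 3000         # operating hours per aircraft per year (assumption)

fleet_hours_per_year = fleet_size * hours_per_year          # 1.5e7 hours
expected_failures_per_year = catastrophic_rate * fleet_hours_per_year

print(f"{expected_failures_per_year:.3f} expected catastrophic failures per year")
# -> 0.015, i.e. roughly one such failure across the fleet every ~67 years
```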
Preventing failure
[Figure: a NASA graph showing the relationship between the survival of a crew of astronauts and the amount of redundant equipment in their spacecraft (the "MM", Mission Module).]
Once a failure mode is identified,
it can usually be mitigated by adding extra or redundant equipment to the
system. For example, nuclear reactors contain dangerous radiation, and nuclear reactions can generate so much heat that no substance could contain them. Therefore, reactors have emergency core cooling systems to keep the temperature down, shielding to contain the radiation, and engineered barriers (usually several, nested, surmounted by a containment building) to prevent accidental leakage. Safety-critical systems are commonly required to permit no single event or component failure to result in a catastrophic failure mode.
Most biological
organisms have a certain amount of redundancy: multiple organs, multiple limbs,
etc.
For any given failure, a fail-over
or redundancy can almost always be designed and incorporated into a system.
Safety and reliability
Safety is not reliability. If a
medical device fails, it should fail safely; other alternatives will be
available to the surgeon. If an aircraft fly-by-wire control system fails,
there is no backup. Electrical power grids are designed for both safety and
reliability; telephone systems are designed for reliability, which becomes a
safety issue when emergency (e.g. US "911") calls are placed.
Probabilistic risk assessment has
created a close relationship between safety and reliability. Component
reliability, generally defined in terms of component failure
rate, and external event probability are both used in quantitative safety
assessment methods such as FTA. Related probabilistic methods are used to
determine system Mean Time Between Failure (MTBF), system
availability, or probability of mission success or failure. Reliability
analysis has a broader scope than safety analysis, in that non-critical
failures are considered. On the other hand, higher failure rates are considered
acceptable for non-critical systems.
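A minimal sketch of two of the quantities mentioned above, with assumed numbers: steady-state availability computed from MTBF and mean time to repair (MTTR), and mission success probability under a constant failure rate (exponential) model.

```python
# Availability and mission success sketch; all figures are assumed.
import math

mtbf_hours = 50_000      # mean time between failures (assumption)
mttr_hours = 8           # mean time to repair (assumption)
mission_hours = 10       # mission length (assumption)

# Steady-state availability of a repairable item.
availability = mtbf_hours / (mtbf_hours + mttr_hours)

# Probability of completing the mission without failure,
# assuming a constant failure rate (exponential model).
failure_rate = 1.0 / mtbf_hours
p_mission_success = math.exp(-failure_rate * mission_hours)

print(f"availability       = {availability:.6f}")
print(f"P(mission success) = {p_mission_success:.6f}")
```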
Safety generally cannot be achieved through component reliability alone. Catastrophic failure probabilities of 10⁻⁹ per hour correspond to the failure rates of very simple components such as resistors or capacitors. A complex system containing hundreds or thousands of components might be able to achieve an MTBF of 10,000 to 100,000 hours, meaning it would fail at 10⁻⁴ or 10⁻⁵ per hour. If a system failure is catastrophic, usually the only practical way to achieve a 10⁻⁹ per hour failure rate is through redundancy. Two redundant systems with independent failure modes, each having an MTBF of 100,000 hours, could achieve a failure rate on the order of 10⁻¹⁰ per hour because of the multiplication rule for independent events.
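The multiplication rule behind that figure can be shown directly; the sketch below deliberately simplifies the exposure assumptions (fully independent channels whose failures are detected within the hour), so it reproduces only the order-of-magnitude argument made in the text.

```python
# Order-of-magnitude illustration of the multiplication rule for redundancy.
# Assumes two fully independent channels whose failures are detected promptly.
mtbf_hours = 100_000
per_hour_failure = 1.0 / mtbf_hours            # 1e-5 per hour per channel

# Probability that both independent channels fail in the same hour.
both_fail_per_hour = per_hour_failure ** 2     # 1e-10 per hour

print(f"single channel: {per_hour_failure:.0e} per hour")
print(f"redundant pair: {both_fail_per_hour:.0e} per hour")
```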
When adding equipment is
impractical (usually because of expense), then the least expensive form of
design is often "inherently fail-safe". That is, change the system
design so its failure modes are not catastrophic. Inherent fail-safes are
common in medical equipment, traffic and railway signals, communications
equipment, and safety equipment.
The typical approach is to arrange
the system so that ordinary single failures cause the mechanism to shut down in
a safe way (for nuclear power plants, this is termed a passively safe design, although more than
ordinary failures are covered). Alternatively, if the system contains a hazard
source such as a battery or rotor, then it may be possible to remove the hazard
from the system so that its failure modes cannot be catastrophic. The U.S.
Department of Defense Standard Practice for System Safety (MIL–STD–882) places
the highest priority on elimination of hazards through design selection.
One of the most common fail-safe systems is the overflow tube in baths and kitchen sinks. If the valve sticks open, the tank spills into the overflow rather than overflowing and causing damage. Another common example is the elevator: the cable supporting the car holds spring-loaded brakes open, so if the cable breaks, the brakes grab the rails and the cabin does not fall.
Some systems can never be made fail-safe, because continuous availability is needed. For example, loss of engine thrust in flight is dangerous. Redundancy, fault tolerance, or recovery procedures are used in these situations (e.g., multiple independently controlled and independently fuel-fed engines). This also makes the system less sensitive to reliability prediction errors or quality-induced uncertainty in the individual items. On the other hand, failure detection and correction, and the avoidance of common-cause failures, then become increasingly important to ensure system-level reliability.
Containing failure
It is common practice to plan for
the failure of safety systems through containment and isolation methods. The
use of isolating valves, also known as the block and bleed manifold, is very common
in isolating pumps, tanks, and control valves that may fail or need routine
maintenance. In addition, nearly all tanks containing oil or other hazardous
chemicals are required to have containment barriers set up around them to
contain 100% of the volume of the tank in the event of a catastrophic tank
failure. Similarly, in a long pipeline, there are remote-closing valves at
regular intervals so that a leak can be isolated. Fault isolation boundaries
are similarly designed into critical electronic systems or computer software.
The goal of all containment systems is to provide means of mitigating the
consequences of failure. Fault isolation may also refer to the extent to which detected failures can be isolated for successful recovery. The isolation level indicates the system indenture level at which the failure cause can be recovered (often by replacement of a line-replaceable unit).