System Design According to IEC/EN 62061

IEC/EN 62061, "Safety of machinery - Functional safety of safety related electrical, electronic and programmable electronic control systems," is the machinery specific implementation of IEC/EN 61508. It provides requirements that are applicable to the system level design of all types of machinery safety related electrical control systems and also for the design of non-complex subsystems or devices.

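The risk assessment results in a risk reduction strategy which, in turn, identifies the need for safety related control functions. These functions must be documented, and the documentation must include a functional requirements specification and a safety integrity requirements specification.
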

The functional requirements include details like frequency of operation, required response time, operating modes, duty cycles, operating environment, and fault reaction functions. The safety integrity requirements are expressed in levels called safety integrity levels (SIL). Depending on the complexity of the system, some or all of the elements in Table 11 must be considered to determine whether the system design meets the required SIL.

Element for SIL Consideration | Symbol
Probability of Dangerous Failure per Hour | PFHD
Hardware Fault Tolerance | No symbol
Safe Failure Fraction | SFF
Proof Test Interval | T1
Diagnostic Test Interval | T2
Susceptibility to Common Cause Failures | β
Diagnostic Coverage | DC

Table 11: Elements for SIL Consideration

For electronic systems, time is a significant contributor to failure, whereas for electromechanical devices the number of operations matters more. Therefore the failure rate of electronic systems is considered on an hourly basis. An analysis of the components must be undertaken to determine their probability of failure. For safety systems, what matters is not just the probability of failure but, more importantly, the probability of dangerous failure per hour, the PFHD. Once this is known, Table 12 can be used to determine which SIL is achieved.

SIL (Safety Integrity Level) | PFHD (Probability of Dangerous Failure per Hour)
3 | ≥ 10⁻⁸ … < 10⁻⁷
2 | ≥ 10⁻⁷ … < 10⁻⁶
1 | ≥ 10⁻⁶ … < 10⁻⁵

Table 12: Probabilities of Dangerous Failure for SILs
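
The lookup in Table 12 can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and values outside the ranges listed in the table return None because the table gives no band for them.

```python
from typing import Optional


def sil_from_pfhd(pfhd: float) -> Optional[int]:
    """Return the SIL band from Table 12 for a PFHD value, or None if outside the table."""
    if 1e-8 <= pfhd < 1e-7:
        return 3
    if 1e-7 <= pfhd < 1e-6:
        return 2
    if 1e-6 <= pfhd < 1e-5:
        return 1
    return None  # not covered by Table 12


# Hypothetical PFHD values, chosen only to exercise each band.
for value in (5e-8, 5e-7, 5e-6, 1e-4):
    print(f"{value:.1E} -> SIL {sil_from_pfhd(value)}")
```

Note that a PFHD in a given band is necessary but not sufficient: the architectural constraints described next can still limit the SIL that may be claimed.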

The safety system is divided into subsystems. The hardware safety integrity level that can be claimed for a subsystem is limited by the hardware fault tolerance and the safe failure fraction of that subsystem. Hardware fault tolerance is the ability of the subsystem to execute its function in the presence of faults. A fault tolerance of zero means that the function is not performed when a single fault occurs. A fault tolerance of one allows the subsystem to perform its function in the presence of a single fault. Safe failure fraction is the portion of the overall failure rate that does not result in a dangerous failure. The combination of these two elements is known as the architectural constraint, and the resulting limit is designated the SIL Claim Limit (SILCL). Table 13 shows the relationship of the architectural constraints to the SILCL.

Safe Failure Fraction (SFF) | Hardware Fault Tolerance 0 | Hardware Fault Tolerance 1 | Hardware Fault Tolerance 2
< 60% | Not allowed unless specific exceptions apply | SIL1 | SIL2
60% … < 90% | SIL1 | SIL2 | SIL3
90% … < 99% | SIL2 | SIL3 | SIL3
≥ 99% | SIL3 | SIL3 | SIL3

Table 13: Architectural Constraints on SIL

For example, an architecture that possesses single fault tolerance and has a safe failure fraction of 75% is limited to no higher than a SIL2 rating, regardless of the probability of dangerous failure.
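
As a rough sketch, Table 13 can be expressed as a small lookup. The function name is hypothetical, SFF is assumed to be given as a fraction between 0 and 1, and the "not allowed unless specific exceptions apply" cell is represented as None.

```python
from typing import Optional


def sil_claim_limit(sff: float, hft: int) -> Optional[int]:
    """Return the SILCL from Table 13 for a safe failure fraction and hardware fault tolerance."""
    if not 0.0 <= sff <= 1.0:
        raise ValueError("sff must be a fraction between 0 and 1")
    if hft not in (0, 1, 2):
        raise ValueError("Table 13 only covers hardware fault tolerance 0, 1 or 2")
    if sff < 0.60:
        row = (None, 1, 2)   # None: not allowed unless specific exceptions apply
    elif sff < 0.90:
        row = (1, 2, 3)
    elif sff < 0.99:
        row = (2, 3, 3)
    else:
        row = (3, 3, 3)
    return row[hft]


# The example from the text: single fault tolerance with an SFF of 75%.
print(sil_claim_limit(0.75, 1))  # 2
```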

To compute the probability of dangerous failure, each safety function must be broken down into function blocks, which are then realized as subsystems. The system design of many safety functions includes a sensing device connected to a logic device connected to an actuator. This creates a series arrangement of subsystems. If we can determine the probability of dangerous failure for each subsystem and know its SILCL, then the system probability of dangerous failure is easily calculated by adding the probabilities of dangerous failure of the subsystems. This concept is shown in Figure 149.


Figure 149: Example subsystem combination into system implementing a SIL 2 safety related electrical control function.

If, for example, we want to achieve SIL 2, each subsystem must have a SIL Claim Limit (SIL CL) of at least SIL 2, and the sum of the PFHD for the system must not exceed the limit allowed in Table 12.
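
The two checks described above (every SILCL at least the target SIL, and the summed PFHD below the Table 12 limit) can be sketched as follows. The class, the function and all subsystem figures are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Upper PFHD bound (exclusive) for each SIL, taken from Table 12.
PFHD_UPPER_LIMIT = {1: 1e-5, 2: 1e-6, 3: 1e-7}


@dataclass
class Subsystem:
    name: str
    pfhd: float  # probability of dangerous failure per hour
    silcl: int   # SIL claim limit from the architectural constraints


def check_series_system(subsystems: List[Subsystem], target_sil: int) -> Tuple[float, bool]:
    """Sum the subsystem PFHD values and check both conditions for the target SIL."""
    total_pfhd = sum(s.pfhd for s in subsystems)
    silcl_ok = all(s.silcl >= target_sil for s in subsystems)
    pfhd_ok = total_pfhd < PFHD_UPPER_LIMIT[target_sil]
    return total_pfhd, silcl_ok and pfhd_ok


# Hypothetical sensing, logic and actuator subsystems.
chain = [
    Subsystem("sensing", 1e-7, 2),
    Subsystem("logic", 5e-8, 3),
    Subsystem("actuator", 2e-7, 2),
]
total, ok = check_series_system(chain, target_sil=2)
print(f"{total:.2E}", ok)  # 3.50E-07 True
```

With these figures the same chain would not meet SIL 3: two subsystems have a SILCL of only 2, and the summed PFHD is above the SIL 3 limit.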

The term “subsystem” has a special meaning in IEC/EN 62061. It is the first level subdivision of a system into parts which, if they fail, would cause a failure of the safety function. Therefore, if two redundant switches are used in a system, neither individual switch is a subsystem. The subsystem would comprise both switches and the associated fault diagnostic function (if any).


Subsystem Design: IEC/EN 62061

If a system designer uses components ready “packaged” into subsystems according to IEC/EN 62061, life becomes much easier because the specific requirements for the design of subsystems do not apply. These requirements will, in general, be covered by the device (subsystem) manufacturer and are much more complex than those required for system level design.

IEC/EN 62061 requires that complex subsystems such as safety PLCs comply with IEC 61508. This means that, for devices using complex electronic or programmable components, the full rigor of IEC 61508 applies. This can be a very difficult and involved process. For example, the evaluation of the PFHD achieved by a complex subsystem can be a very complicated process using techniques such as Markov modeling, reliability block diagrams or fault tree analysis.

IEC/EN 62061 does give requirements for the design of lower complexity subsystems. Typically this would include relatively simple electrical components such as interlock switches and electromechanical safety monitoring relays. The requirements are not as involved as those in IEC 61508 but can still be very complicated.

IEC/EN 62061 supplies four subsystem logical architectures with accompanying formulae that can be used to evaluate the PFHD achieved by a low complexity subsystem. These architectures are purely logical representations and should not be thought of as physical architectures. The four subsystem logical architectures with accompanying formulae are shown in Figures 150 through 153.


Figure 150: Subsystem logical architecture A

λDssA = λDe1 + … + λDen

PFHDssA = λDssA × 1h

For the basic subsystem architecture shown in Figure 150, the dangerous failure rates of the elements are simply added together.

λ (lambda) designates the failure rate. The units of the failure rate are failures per hour. λD is the dangerous failure rate. λDssA is the dangerous failure rate of subsystem A; it is the sum of the dangerous failure rates of the individual elements e1, e2, e3, up to and including en. The dangerous failure rate is multiplied by 1 hour to create a unitless probability of dangerous failure.

Figure 151 shows a single fault tolerant system without a diagnostic function. When the architecture includes single fault tolerance, the potential for common cause failure exists and must be considered. The derivation of the common cause failure is briefly described later in this chapter.


Figure 151: Subsystem logical architecture B

λDssB = (1 − β)² × λDe1 × λDe2 × T1 + β × (λDe1 + λDe2) / 2

PFHDssB = λDssB × 1h


The formula for this architecture takes into account the parallel arrangement of the subsystem elements and adds the following two elements from Table 11:

β – the susceptibility to common cause failures (beta)

T1 – the proof test interval or lifetime, whichever is smaller. The proof test is designed to detect faults and degradation of the safety subsystem so that the subsystem can be restored to an operating condition.

As an example, assume the following values:

β = 0.10

λDe1 = 1 × 10⁻⁶ failures/hour

λDe2 = 1 × 10⁻⁶ failures/hour

T1 = 87600 hours (10 years)

The dangerous failure rate of the subsystem is 1.70956E-07 failures per hour, which falls in the SIL2 range of Table 12.
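
The worked example can be reproduced with a short Python sketch of the architecture B formula. The function and variable names are hypothetical; splitting the result into its two terms shows how much comes from independent failures of both elements and how much from common cause.

```python
def architecture_b_terms(beta, lam_e1, lam_e2, t1_hours):
    """Return the two terms of the architecture B formula (both in failures per hour)."""
    independent = (1 - beta) ** 2 * lam_e1 * lam_e2 * t1_hours
    common_cause = beta * (lam_e1 + lam_e2) / 2
    return independent, common_cause


# Values from the example above.
independent, common_cause = architecture_b_terms(0.10, 1e-6, 1e-6, 87600)
lambda_d_ss_b = independent + common_cause
pfhd_ss_b = lambda_d_ss_b * 1.0  # multiply by 1 h to obtain the PFHD

print(f"independent term:  {independent:.5E}")   # 7.09560E-08
print(f"common cause term: {common_cause:.5E}")  # 1.00000E-07
print(f"PFHDssB:           {pfhd_ss_b:.5E}")     # 1.70956E-07
```

With these inputs the common cause term is the larger of the two contributions.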


Effect of the Proof Test Interval

Let’s look at the effect the proof test interval has on the system. Assume the proof test is performed twice a year. This reduces T1 to 4380 hours, and the dangerous failure rate improves to 1.03548E-07 failures per hour. This is still only SIL2. If the proof test interval were reduced to one month (730 hours), the dangerous failure rate would improve to 1.0059E-07 failures per hour. This is still only SIL2. Additional improvement in failure rate, proof test interval, or common cause failure is needed to achieve a SIL3 rating. In addition, the designer must keep in mind that this subsystem must be combined with other subsystems to calculate the overall dangerous failure rate.

Effect of Common Cause Failure Analysis

Let’s look at the effect common cause failures have on the system. Suppose we take additional measures and our beta value improves to its best level of 1% (0.01), while the proof test interval remains at 10 years. The dangerous failure rate improves to 9.58568E-08 failures per hour. The subsystem now falls in the SIL3 range.
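
The sensitivity to T1 and beta can be checked with a small sweep over the architecture B formula. This is a sketch with hypothetical names; it uses the element failure rates from the example (1 × 10⁻⁶ failures/hour for both elements).

```python
def lambda_d_ss_b(beta, lam_e1, lam_e2, t1_hours):
    """Dangerous failure rate of an architecture B subsystem (failures per hour)."""
    return (1 - beta) ** 2 * lam_e1 * lam_e2 * t1_hours + beta * (lam_e1 + lam_e2) / 2


LAM = 1e-6  # dangerous failure rate of each element, as in the example

cases = [
    ("T1 = 10 years, beta = 0.10", 87600, 0.10),
    ("T1 = 6 months, beta = 0.10", 4380, 0.10),
    ("T1 = 1 month,  beta = 0.10", 730, 0.10),
    ("T1 = 10 years, beta = 0.01", 87600, 0.01),
]

results = {}
for label, t1, beta in cases:
    rate = lambda_d_ss_b(beta, LAM, LAM, t1)
    common_cause = beta * LAM  # equals beta * (LAM + LAM) / 2
    results[label] = rate
    print(f"{label}: {rate:.5E} (common cause term {common_cause:.1E})")
```

The sweep illustrates why shortening T1 alone cannot reach SIL3 here: with beta = 0.10 the common cause term by itself is 1 × 10⁻⁷ per hour, which already sits at the SIL3 boundary, so only a lower beta or lower element failure rates can bring the result under it.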

Figure 152 shows the functional representation of a zero fault tolerant system with a diagnostic function. Diagnostic coverage is used to decrease the probability of dangerous hardware failures. The diagnostic tests are performed automatically. Diagnostic coverage is the ratio of the rate of detected dangerous failures compared to the rate of all dangerous failures. The type or number of safe failures is not considered when calculating diagnostic coverage; it is only the percentage of detected dangerous failures.


Figure 152: Subsystem logical architecture C

λDssC = λDe1 × (1 − DC1) + … + λDen × (1 − DCn)

PFHDssC = λDssC × 1h


This formula includes the diagnostic coverage, DC, for each of the subsystem elements. The dangerous failure rate of each element is reduced by the diagnostic coverage of that element.
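
A minimal Python sketch of the architecture C calculation follows. The function name and the element values are hypothetical; setting every DC to zero reduces the calculation to the plain sum used for architecture A.

```python
def lambda_d_ss_c(elements):
    """Sum lambda_De * (1 - DC) over (lambda_De, DC) pairs; result in failures per hour."""
    return sum(lam_de * (1 - dc) for lam_de, dc in elements)


# Hypothetical elements: (dangerous failure rate per hour, diagnostic coverage).
elements = [(1e-6, 0.9), (2e-6, 0.6)]

with_diagnostics = lambda_d_ss_c(elements)
without_diagnostics = lambda_d_ss_c([(lam, 0.0) for lam, _ in elements])

print(f"with diagnostics:    {with_diagnostics:.2E}")     # 9.00E-07
print(f"without diagnostics: {without_diagnostics:.2E}")  # 3.00E-06
```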

The fourth example of a subsystem architecture is shown in Figure 153. This subsystem is single-fault tolerant and includes a diagnostic function. The potential for common cause failure must also be considered with single-fault tolerant systems.


Figure 153: Subsystem logical architecture D

If the subsystem elements are the same, the following formula is used:

λDssD = (1 − β)² × { λDe² × 2 × DC × T2/2 + λDe² × (1 − DC) × T1 } + β × λDe

PFHDssD = λDssD × 1h

If the subsystem elements are different, the following formula is used:

λDssD = (1 − β)² × { λDe1 × λDe2 × (DC1 + DC2) × T2/2 + λDe1 × λDe2 × (2 − DC1 − DC2) × T1/2 } + β × (λDe1 + λDe2) / 2

PFHDssD = λDssD × 1h

Notice that both formulas use one additional parameter, T2, the diagnostic test interval.

As an example, assume the following values for the example where the subsystem elements are different:

β = 0.10

λDe1 = 1 × 10⁻⁶ failures/hour

λDe2 = 2 × 10⁻⁶ failures/hour

T1 = 87600 hours (10 years)

T2 = 876 hours

DC1 = 0.8

DC2 = 0.6

PFHDssD = 1.93567E-07 dangerous failures per hour (SIL2 range)
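
Both architecture D formulas can be sketched in Python as below. The function names are hypothetical. The calculation uses T1/2 in the undetected-failure term, exactly as the formula for different elements is written above; the result is sensitive to that factor, since using T1 in place of T1/2 with the same inputs gives about 2.36E-07 instead.

```python
def lambda_d_ss_d(beta, lam_e1, lam_e2, dc1, dc2, t1_hours, t2_hours):
    """Architecture D with different elements (failures per hour)."""
    detected = lam_e1 * lam_e2 * (dc1 + dc2) * t2_hours / 2
    undetected = lam_e1 * lam_e2 * (2 - dc1 - dc2) * t1_hours / 2
    common_cause = beta * (lam_e1 + lam_e2) / 2
    return (1 - beta) ** 2 * (detected + undetected) + common_cause


def lambda_d_ss_d_identical(beta, lam_e, dc, t1_hours, t2_hours):
    """Architecture D with identical elements (failures per hour)."""
    detected = lam_e ** 2 * 2 * dc * t2_hours / 2
    undetected = lam_e ** 2 * (1 - dc) * t1_hours
    return (1 - beta) ** 2 * (detected + undetected) + beta * lam_e


# Values from the example above.
pfhd_ss_d = lambda_d_ss_d(0.10, 1e-6, 2e-6, 0.8, 0.6, 87600, 876) * 1.0  # times 1 h
print(f"{pfhd_ss_d:.5E}")  # 1.93567E-07

# Consistency check: with equal elements the two formulas agree.
general = lambda_d_ss_d(0.10, 1e-6, 1e-6, 0.9, 0.9, 87600, 876)
identical = lambda_d_ss_d_identical(0.10, 1e-6, 0.9, 87600, 876)
print(abs(general - identical) < 1e-18)  # True
```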


Transition Methodology for Categories

During the writing of IEC/EN 62061, it was realized that it would take considerable time for all the required data for systems and devices to become fully available. Two tables were therefore included to help with existing subsystem designs that are based on the original Categories concept and have been proven in use to be effective. They provide equivalency for PFHD and architectural constraints (hardware fault tolerance), and they facilitate a useful transition path to the functional safety standards. Tables 14 and 15 below are shown in a simpler form than in the standard. Studying them shows that the architecture of a Category-based system can be converted to a probability of dangerous failure that can be claimed for a subsystem.

Category | Hardware Fault Tolerance | Diagnostic Coverage | PFHD (Can Be Claimed for the Subsystem)
1 | 0 | 0% | See IEC 62061
2 | 0 | 60…90% | ≥ 10⁻⁶
3 | 1 | 60…90% | ≥ 2 × 10⁻⁷
4 | >1 | 60…90% | ≥ 3 × 10⁻⁸
4 | 1 | >90% | ≥ 3 × 10⁻⁸

Table 14: Category based PFHD claim

Also, for low complexity category-based subsystems, Table 7 from IEC/EN 62061 is available. Table 14 is a simplified version of Table 7 from the standard. Use this table when a category-based subsystem becomes part of the SRCS that must meet IEC/EN 62061. For simplicity, the safety system designer can claim a PFHD of 2 × 10⁻⁷ for a category 3 based system that has 60% diagnostic coverage. Alternatively, the safety system designer can perform a complete analysis to determine if a better PFHD can be claimed.

Category | Hardware Fault Tolerance | SFF | Max. SIL Claim Limit According to Architectural Constraints
1 | 0 | <60% | See IEC 62061
2 | 0 | 60…90% | SIL 1
3 | 1 | <60% | SIL 1
3 | 1 | 60…90% | SIL 2
4 | >1 | 60…90% | SIL 3
4 | 1 | >90% | SIL 3

Table 15: Category based architectural constraints

Table 15 can be used to determine the SIL Claim Limit of a category-based subsystem. The diagnostic coverage of the category-based system must be converted to safe failure fraction.

Knowing the PFHD and SILCL of a category-based system, the safety system designer can apply these values to one of the subsystems shown in Figure 149. If the category-based system is the complete SRCS, then the equivalent SIL and PFHD are determined by Tables 14 and 15. The safety system designer must also satisfy the requirements for common cause failures, systematic failures and proof test interval. The scoring system for common cause failures is slightly different for each standard. The concepts for systematic safety integrity are similar in both standards; neither standard uses a scoring system. The proof test interval may be considered the same as the mission time, or a shorter interval may be chosen.


IEC/EN 62061 Terminology Overview

Architectural Constraints

The safety integrity level that can be claimed for a system or subsystem is limited by the architectural characteristics. The two primary characteristics are hardware fault tolerance and safe failure fraction. Secondary characteristics include common-cause faults and fault exclusion.

When combining subsystems, the SIL achieved by the SRCS is constrained to be less than or equal to the lowest SIL Claim Limit of any of the subsystems involved in the safety related control function.


B10 and B10d

For electromechanical subsystems, the probability of failure should be estimated taking into account the number of operating cycles declared by the manufacturer, the load and the duty cycle. The probability of failure is expressed as the B10 value, which is the expected number of operating cycles by which 10% of the population will have failed. B10d is the expected number of operating cycles by which 10% of the population will have failed to danger.

Common Cause Failure (CCF)

CCF (common-cause failure) is when multiple faults resulting from a single cause produce a dangerous failure. Information on CCF will generally only be required by the subsystem designer, usually the manufacturer. It is used as part of the formulae given for estimation of the PFHD of a subsystem. It will not usually be required at the system design level.

Annex F of IEC/EN 62061 provides a simple approach for the estimation of CCF. Table 16 shows a summary of the scoring process.


No. | Measure Against CCF | Score
1 | Separation/Segregation | 25
2 | Diversity | 38
3 | Design/Application/Experience | 2
4 | Assessment/Analysis | 18
5 | Competence/Training | 4
6 | Environmental | 18

Table 16: Scoring Process Summary

Points are awarded for employing specific measures against CCF. The scores are added up to determine the common cause failure factor (β). The beta factor is used in the subsystem models to "adjust" the failure rate.

Overall Score | Common Cause Failure Factor (β)
<35 | 10% (0.1)
35…65 | 5% (0.05)
65…85 | 2% (0.02)
85…100 | 1% (0.01)

Table 17: Common-Cause Failure Factor
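
The mapping from overall score to beta in Table 17 can be sketched as below. The function name and the awarded points are hypothetical. Because the printed bands share their end points (65 and 85 each appear in two rows), the sketch assumes that a boundary score belongs to the higher band; the standard itself should be consulted for the exact rule.

```python
def beta_from_ccf_score(score: float) -> float:
    """Return the common cause failure factor for an overall CCF score (Table 17)."""
    if score < 0:
        raise ValueError("score must not be negative")
    if score < 35:
        return 0.10
    if score < 65:   # assumption: 35 falls in this band
        return 0.05
    if score < 85:   # assumption: 65 falls in this band
        return 0.02
    return 0.01      # assumption: 85 and above


# Hypothetical points awarded for the measures actually employed.
awarded = {
    "separation/segregation": 25,
    "diversity": 0,
    "design/application/experience": 2,
    "assessment/analysis": 18,
    "competence/training": 4,
    "environmental": 18,
}
overall = sum(awarded.values())
print(overall, beta_from_ccf_score(overall))  # 67 0.02
```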

Diagnostic Coverage (DC)

Automatic diagnostic tests are employed to decrease the probability of dangerous hardware failures. Being able to detect 100% of the dangerous hardware failures would be ideal, but is often very difficult to accomplish.

Diagnostic coverage is the ratio of the detected dangerous failures to all the dangerous failures.

DC = λDD / λDtotal

where λDD is the rate of detected dangerous failures and λDtotal is the rate of total dangerous failures.

The value of diagnostic coverage will lie between zero and one.


Hardware Fault Tolerance

Hardware fault tolerance represents the number of faults that can be sustained by a subsystem before it causes a dangerous failure. For example, a hardware fault tolerance of 1 means that 2 faults could cause a loss of the safety related control function but one fault would not.

Management of Functional Safety

The standard gives requirements for the control of management and technical activities that are necessary for the achievement of a safety related electrical control system.

Probability of Dangerous Failure (PFHD)

Part of the requirements needed to achieve any given SIL capability for a system or subsystem is data on PFHD (probability of a dangerous failure per hour) due to random hardware failure. Table 12 gives the probability ranges for each SIL.

This data will be provided by the manufacturer. Data for recent Rockwell Automation safety components and systems (e.g. GuardLogix, GuardPLC, SmartGuard, Kinetix with GuardMotion) is already available. Data for other Rockwell Automation safety components and systems will become available during 2007.

IEC/EN 62061 also makes it clear that reliability data handbooks can be used if and where applicable.


For low-complexity electromechanical devices, the failure mechanism is usually linked to the number and frequency of operations, rather than just time. Therefore, for these components, the data will be derived from some form of lifetime testing, e.g. B10 testing. Application-based information, such as the anticipated number of operations per year, is then required in order to convert the B10d or similar data to MTTFd (Mean Time To Dangerous Failure). This, in turn, is then converted to PFHD.

In general, the following can be assumed:

PFHD = 1/MTTFd (with MTTFd expressed in hours)

And for electromechanical devices:

MTTFd = B10d/(0.1 × mean number of operations per year) (result in years)
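
The conversion chain from B10d to PFHD can be sketched as follows. The function names and the B10d and operations figures are hypothetical. The sketch assumes 8760 hours per year (consistent with the 87600 hours used for 10 years earlier in this chapter) to convert MTTFd from years to hours before taking the reciprocal.

```python
HOURS_PER_YEAR = 8760  # assumption: matches 87600 hours = 10 years used in the text


def mttfd_years(b10d_cycles: float, operations_per_year: float) -> float:
    """MTTFd in years from B10d and the mean number of operations per year."""
    return b10d_cycles / (0.1 * operations_per_year)


def pfhd_from_mttfd(mttfd_in_years: float) -> float:
    """PFHD per hour, taking MTTFd in years and converting it to hours first."""
    return 1.0 / (mttfd_in_years * HOURS_PER_YEAR)


# Hypothetical device: B10d of 2,000,000 cycles, operated once per hour all year.
mttfd = mttfd_years(2_000_000, 8760)
pfhd = pfhd_from_mttfd(mttfd)
print(f"MTTFd = {mttfd:.1f} years, PFHD = {pfhd:.2E} per hour")
```

Skipping the unit conversion would overstate the PFHD by a factor of 8760, so it is worth making the units explicit in any calculation sheet.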


Proof Test Interval

The proof-test interval represents the time after which a subsystem must be either totally checked or replaced to ensure that it is in an "as new" condition. In practice, in the machinery sector, this is achieved by replacement. So the proof-test interval is usually the same as lifetime. ISO 13849-1:2006 refers to this as Mission Time. A proof test is a check that can detect faults and degradation in a SRCS so that the SRCS can be restored as close as practical to an as new condition. The proof test must detect 100% of all dangerous failures. Separate channels must be tested separately.

In contrast to diagnostic tests, which are automatic, proof tests are usually performed manually and off line. Being automatic, diagnostic testing is performed often as compared to proof testing which is done infrequently. For example, the circuits going to an interlock switch on a guard can be tested automatically for short- and open-circuit conditions with diagnostic (e.g., pulse) testing.

The proof-test interval must be declared by the manufacturer. Sometimes the manufacturer will provide a range of different proof-test intervals. The appropriate proof-test interval is determined by reviewing the formulae for the selected architecture. In general, the shorter the proof-test interval, the lower the failure rate.


Safe Failure Fraction (SFF)

The Safe Failure Fraction is similar to Diagnostic Coverage (DC) but also takes into account any inherent tendency to fail towards a safe state. For example, when a fuse blows, there is a failure but it is highly probable that the failure will be to an open circuit which, in most cases, would be a safe failure. SFF is (the sum of the rate of safe failures plus the rate of detected dangerous failures) divided by (the sum of the rate of safe failures plus the rate of detected and undetected dangerous failures). It is important to realize that the only types of failures to be considered are those which could have some effect on the safety function.

Most low-complexity mechanical devices such as E-stop buttons and interlock switches will (on their own) have an SFF of less than 60%. But most electronic devices used for safety have designed-in redundancy and monitoring. Therefore, an SFF of greater than 90% is common. The SFF value will normally be supplied by the manufacturer.


The Safe Failure Fraction (SFF) can be calculated using the following equation:

SFF = (ΣλS + ΣλDD) / (ΣλS + ΣλD)

where

λS = the rate of safe failure,
ΣλS + ΣλD = the overall failure rate,
λDD = the rate of detected dangerous failure,
λD = the rate of dangerous failure.
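
The difference between SFF and DC can be seen by computing both from the same failure rates. The sketch below uses hypothetical function names and hypothetical rates, and it splits the dangerous failure rate into detected and undetected parts.

```python
def safe_failure_fraction(lam_s: float, lam_dd: float, lam_du: float) -> float:
    """SFF = (safe + detected dangerous) / (safe + all dangerous)."""
    lam_d = lam_dd + lam_du  # total dangerous failure rate
    return (lam_s + lam_dd) / (lam_s + lam_d)


def diagnostic_coverage(lam_dd: float, lam_du: float) -> float:
    """DC = detected dangerous / all dangerous; safe failures play no part."""
    return lam_dd / (lam_dd + lam_du)


# Hypothetical rates in failures per hour: safe, detected dangerous, undetected dangerous.
lam_s, lam_dd, lam_du = 4e-7, 4e-7, 2e-7

sff = safe_failure_fraction(lam_s, lam_dd, lam_du)
dc = diagnostic_coverage(lam_dd, lam_du)
print(f"SFF = {sff:.2f}, DC = {dc:.2f}")  # SFF = 0.80, DC = 0.67
```

With these figures the SFF is higher than the DC because the safe failures count towards the SFF but are ignored in the DC.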

Systematic Failure

The standard has requirements for the control and avoidance of systematic failure. Systematic failures differ from random hardware failures which are failures occurring at a random time, typically resulting from degradation of parts of hardware. Typical types of possible systematic failure are software design errors, hardware design errors, requirement specification errors and operational procedures. Examples of steps necessary to avoid systematic failure include:


The standard provides additional and more detailed requirements needed to avoid systematic failures.