An integrated modelling framework for complex systems safety analysis

The ever‐increasing complexity of engineering systems has fuelled the need for novel and efficient computational tools able to enhance the accuracy of current modelling strategies for industrial systems. Indeed, traditional Fault and Event Tree techniques still monopolize the reliability analysis of complex systems despite their limitations, such as the inability to capture underlying dependencies between components or to include degradation processes and complex maintenance strategies into the analysis. However, the lack of alternative solutions able to tackle large‐scale modelling efficiently has contributed to the continued use of such methodologies, together with their robustness and familiarity well rooted in engineering practice. The current paper defines a novel modelling framework for safety system performance which retains the capabilities of both fault and event tree methods, but also overcomes their limitations. The ambition is to provide a technique for application to real‐world systems preserving a familiar user–model interface and grounding the novel approach in well‐known and established reliability techniques. In order to describe the methodology developed and demonstrate its validity, five case‐studies referring to a simplified industrial plant cooling system are analysed and discussed. Further discussion regarding the scalability of the proposed approach is provided, outlining the advantages of the current implementation and its computational cost.

terms of their applicability to modern systems. Such limitations are intrinsically linked to the simplifying assumptions of constant component failure rates and of their independence. Both these hypotheses affect the ability of the model to capture realistic system behaviour. In the case of constant failure rates, this results in the adoption of the exponential distribution for the representation of component failure times and may lead to the misrepresentation of the actual deterioration process to which systems are unavoidably subject. This may result in the overestimation of failure probability (e.g., in the case of decreasing failure rates) and hence overengineering, or in the underestimation of risks (e.g., in the case of increasing failure rates and ageing components). Similarly, disregarding the dependencies influencing the performance of multiple components or sub-systems may conceal failure mechanisms or their significance for the overall system's safety, preventing an adequate understanding of the system's behaviour.
Several research efforts have been focused on overcoming the limitations of fault and event trees, 1 leading to the development of dynamic FTs, 2,3 Boolean logic driven Markov processes 4 and Pandora temporal FTs, 5 relying on the use of dynamic gates for modelling sequence-dependent failure mechanisms and their computation through the use of Markov Models (MMs). 6 However, such solutions come with a wide range of shortcomings and remain mostly restricted to academic applications or to small-scale problems, due to the computational burden generally associated with this kind of technique. [7][8][9] The objective of the research presented in this paper is to address the need for novel approaches to safety analysis able to expand the capacity of existing techniques and complement traditional risk assessment methodologies. While FT/ET come with limitations, it is also true that they have played a crucial role in the successful design and operation of complex engineering systems for decades and remain the common language shared between designers, analysts and regulators. The aim is then to implement a modelling framework able to enhance the strengths of traditional techniques (e.g., modelling simplicity, relatively low computational costs, robustness) while overcoming their limitations. In the proposed methodology, this is achieved by identifying the sections of the model where the underlying traditional assumptions are unjustified and adopting numerical techniques better suited for their assessment. The approach implies the integration of several existing techniques, such as FTs, ETs, MMs, 10 Binary Decision Diagrams (BDDs) 11 and Petri Nets (PNs), 12 under the same framework. The novel technique can be applied to any system requiring FT/ET analysis, providing the analyst with a higher degree of modelling flexibility.
More detail regarding the structure and nature of the proposed modelling framework is provided in Section 2. The validity and efficiency of the proposed framework are then tested through its application to a simplified industrial system, whose description is provided in Section 3. Five case-studies are then considered, in order to compare the proposed solution with traditional techniques for several types of system dependencies. Finally, Section 4 is dedicated to discussing the computational feasibility of the proposed approach for large scale systems.

METHODOLOGY
Large-scale engineered systems consist of many concurrently operating elements, which, interacting as an ensemble, provide the designed system functionality. The malfunctioning of one or more of such elements can hinder the correct operation of systems, which are therefore vulnerable to faults due to their spatial distribution and sub-system interdependencies. 13 The resulting emergent and multiscale properties of large engineered systems have often earned them the classification of complex systems, 14 although there is no widely agreed upon definition of what makes an engineering system complex. 15 The use of the term complexity in this study is strictly intended from a reliability modelling point of view, and refers to those dynamic features of systems, for example, degradation processes, stochastic dependencies, feed-forward and feedback connections, that are not representable through traditional tools. Indeed, the most straightforward approach to model the failure of engineering systems is to interpret the latter as networks of interconnected and intercommunicating components. Under this assumption, the losses of functionality in the system can be mapped through so-called fault trees 16 : such graphical tools allow the main system functionality to be factorised in terms of sub-functionalities (e.g., through component operational states), defining the logical rules regulating the interaction between the nodes of the tree/network and hence the system's components. While this approach has been adopted successfully in the safety analysis of large-scale engineered systems, 17 it relies on the strong simplifying assumption of component independence, hence neglecting the dynamic features that are proper of real-world systems. Still, the behaviour of complex stochastic systems is determined by dynamic logical interactions between components (e.g., synchronization, sequentiality, concurrency and conflict), and so, therefore, is their reliability.

F I G U R E 1 Overview of the proposed simulation framework
The modelling framework proposed in this study aims at overcoming these limitations, allowing the accurate representation of the actual complex dynamic features of the system within the FT formalism. This is achieved through the integration into the latter of techniques suitable for the depiction of component interactions in time. [18][19][20] The key to such integration is the use of a common probabilistic formalism, able to translate the behaviour of the system and its components across different modelling levels: the component level, which can be characterised by complex, dynamic interactions that are quantified probabilistically through the use of PNs and MMs; and the system functionality level, where the information collected previously is re-introduced in terms of its probabilistic contribution to the overall system loss of functionality over the time interval of study, with respect to which it is a static value.
This results in a novel simulation framework that, rooted in well-established mathematical modelling formulations, extends the capabilities of current risk analysis methods, overcoming their most critical limitations. The ambition is to promote, through such tools, a significant shift towards more realistic system modelling, without renouncing the familiarity of techniques already deeply entrenched in engineering practice.
The proposed solution relies on the integration of traditional FT/ET with PNs and MMs, marrying computational feasibility with analysis accuracy. An overview of the simulation framework implemented is presented in Figure 1.
The input required from the analyst consists of the system setting (in the form of conventional FTs and ETs as well as component failure modes) and, where necessary, of PN/MM models representing complex components' maintenance strategies and dependencies. The developed algorithms identify and isolate independent modules within individual or combined (in the case of dependencies between sub-systems) fault trees containing Dependency Groups (DGs). These refer to closed sets of components whose failures are mutually dependent on each other. The isolated complex structures are then implemented (when not pre-defined by the user) and analysed in the form of individual PNs or MMs, generating a high level of detail in the simulation. Finally, the results obtained are re-integrated within the original FT structure and subsequently within the ET, completing the analysis. The following sections provide a detailed description of the computational steps entailed by the methodology.

Input structure
The first step towards the generation of models for the proposed methodology is the submission of:
• system event tree structure: this provides crucial information on the nature of the interaction between sub-systems, capturing how their failure or working states impact the safety of the overall system of interest.
• sub-systems fault tree structures: these represent the causes of failure patterns resulting from the malfunctioning of the sub-system's components.
• components failure mode information: a complete list of the system's components and their associated failure/repair/inspection time distributions.
• PNs/MMs models: these provide user-defined PN or MM structures modelling complex features of the system (such as complex maintenance strategies or degradation processes), offering a further degree of flexibility to the overall methodology.
Figure 2 gives an overview of the input structure and its implementation in the analysis.
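To make the input structure concrete, a minimal sketch of how the four input categories might be organised in code is given below; the field names and layout are illustrative assumptions, not the framework's actual interface:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class FailureMode:
    component: str
    failure_dist: tuple    # e.g. ("exponential", 1e-5) or ("weibull", 1.5, 2000.0)
    repair_dist: tuple     # e.g. ("lognormal", 3.0, 0.5)
    inspection_interval: Optional[float] = None  # hours; None if failure is revealed

@dataclass
class SystemInput:
    event_tree: dict                     # branch structure of the system ET
    fault_trees: dict                    # one FT per sub-system, keyed by name
    failure_modes: list = field(default_factory=list)
    user_models: dict = field(default_factory=dict)  # optional PN/MM definitions

# Hypothetical instance mirroring the four input categories listed above
inp = SystemInput(
    event_tree={"initiator": "PrimaryCooling",
                "branches": ["Detection", "Secondary", "Fan"]},
    fault_trees={"PrimaryCooling": {"top": "OR", "inputs": ["HX1", "AND(P1,P2)"]}},
    failure_modes=[FailureMode("HX1", ("exponential", 1e-5), ("exponential", 0.1))],
)
print(len(inp.failure_modes))
```

A structure of this kind lets the framework decide, per component, whether traditional models suffice or an auxiliary PN/MM must be generated.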

Components reliability metrics computation
The first step of the procedure involves the calculation of the reliability metrics, that is, failure frequency and unavailability, associated with each system component. Where conventional assumptions hold, such as constant failure and repair rates, component independence and simplistic maintenance strategies, the unavailability and failure frequency of the component are calculated according to traditional failure models. 16 Conversely, when failure and repair times are represented by distributions other than the exponential, the proposed approach relies on the use of PNs. These are utilised to compute the component's unavailability and failure intensity by simulating the stochastic occurrence of alternating failure and repair events. For instance, let the failure time of a component T1 be represented by a Weibull distribution, hence having a non-constant failure rate. In addition, assume the repair time associated with T1 to be lognormally distributed. The resulting PN model portraying the failure mode of T1 is shown in Figure 3. The network consists of two places: one associated with the correct operation of the component, the other indicating its failure state. The presence of a token in one or the other place denotes the current state of the component. The firing of the stochastic failure transition, occurring at the sampled failure time, causes the movement of the token from the working to the failure state, while the repair transition, occurring at the sampled repair time, simulates the completion of corrective maintenance and hence the restoration of component functionality. In the framework implementation, similar life-cycle PNs are automatically generated for any component in input characterised by non-traditional failure models, and then simulated to convergence. The component unavailability is then extracted from the simulation results, being calculated as the fraction of downtime over the total simulation time.
The failure intensity is estimated as the number of failures occurring per unit of simulated time. More complex maintenance strategies or degradation processes affecting the component can be captured through the same approach: such features may be described by user-defined models provided in input, in the form of PN or MM structures. The use of MMs in particular can offer advantages in the case of exponentially distributed failure and repair times, since the associated steady-state solution may reduce the computational demand of the calculation. Nevertheless, it is worth emphasising that the use of the more expensive simulation techniques (i.e., PNs) is restricted only to those components for which the traditional simplifying assumptions are unjustified and would thus result in potential inaccuracies, while computational ease is maintained for the remaining components.
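The alternating failure/repair simulation described above can be sketched directly, without an explicit PN data structure. The code below is a minimal stand-in for the framework's PN computation, with arbitrary distribution parameters of my own choosing; it estimates unavailability as the downtime fraction and failure intensity as failures per unit of simulated time:

```python
import random

def simulate_component(t_mission, shape, scale, mu, sigma, n_runs=2000, seed=0):
    """Monte Carlo of an alternating renewal process: Weibull failure
    times, lognormal repair times. Returns (unavailability, failure
    intensity per hour), averaged over n_runs simulated lifetimes."""
    rng = random.Random(seed)
    downtime = 0.0
    failures = 0
    for _ in range(n_runs):
        t = 0.0
        while t < t_mission:
            t += rng.weibullvariate(scale, shape)   # sampled time to failure
            if t >= t_mission:
                break                               # survived to mission end
            failures += 1
            ttr = rng.lognormvariate(mu, sigma)     # sampled time to repair
            downtime += min(ttr, t_mission - t)     # clip repair at mission end
            t += ttr
    total = n_runs * t_mission
    return downtime / total, failures / total

# Illustrative parameters: mission of one year, wear-out behaviour (shape > 1)
q, w = simulate_component(t_mission=8760.0, shape=1.5, scale=2000.0,
                          mu=3.0, sigma=0.5)
print(f"unavailability ~ {q:.4f}, failure intensity ~ {w:.6f} /h")
```

In the actual framework the same quantities would be read off the converged PN simulation; this sketch only illustrates the downtime-fraction and failure-count estimators.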

Dependency groups identification and computation
The unavailability and failure intensity of dependent components are estimated in relation to the DG to which they belong. This is achieved through the implementation and computation of PN or MM models representing the dependency relationship, resulting in the estimation of the joint failure probability and intensity associated with the components included in the group. For instance, with reference to Figure 4, let components 1 and 2 be two stochastically dependent pumps: these share the load equally but, when one fails, the remaining functioning component must supply the required flow alone. Under these conditions, the functioning pump experiences a higher load and is more likely to fail.
If both components are in working order, their failure times are represented by the distributions F1 and F2, respectively. However, when component 1 fails, the redundant component 2 experiences a greater load. This increases its failure probability, changing its failure time distribution from F2 to F′2 (see the transitions named accordingly in Figure 4). Symmetrically, if component 2 fails, the failure time distribution for component 1 changes from F1 to F′1. The relationship between the two components is then fully captured by the PN in Figure 4. The simulation of the latter is carried out using the Monte Carlo method 21 and results in the computation of the joint reliability metrics of interest for the two components: for example, the joint availability can be calculated by dividing the simultaneous working time of the two components (i.e., the time a token spends in the place associated with the joint working state) by the simulated time interval. Similarly, the failure intensity of the event in which component 1 is working and component 2 is failed can be calculated as the number of times a token entered the associated place over the total simulated system life.
If all the residence times in any state of the model are exponentially distributed, the failure mechanisms within the DG can instead be represented through the use of a MM, quantifying the joint probabilities of interest through the computation of its steady-state solution. This alternative approach may reduce the computational effort when compared to the equivalent PN, hence enhancing the efficiency of the calculation. Figure 5 shows the MM associated with the DG entailing components 1 and 2, where λ1 and λ2 refer to the rates associated with the failure of one component experienced when the second component is working or failed, respectively, while ν indicates the repair rate common to the two components.
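For the exponential case, the steady-state solution of the Markov model can be obtained by solving a small linear system rather than by simulation. The sketch below reconstructs a two-pump DG of this kind under assumptions of my own (state ordering, rate values, and independent repair of each failed component), computing the joint state probabilities directly:

```python
import numpy as np

# States: 0 = (W1,W2), 1 = (F1,W2), 2 = (W1,F2), 3 = (F1,F2)
lam1, lam2, nu = 1e-3, 3e-3, 5e-2   # /h; illustrative values only

# Generator matrix Q: Q[i, j] is the transition rate from state i to j,
# with lam2 > lam1 reflecting the higher load on the surviving pump.
Q = np.array([
    [-(2 * lam1),   lam1,          lam1,          0.0     ],
    [ nu,          -(nu + lam2),   0.0,           lam2    ],
    [ nu,           0.0,          -(nu + lam2),   lam2    ],
    [ 0.0,          nu,            nu,           -2 * nu  ],
])

# Steady state: pi @ Q = 0 with sum(pi) = 1.  Replace one (redundant)
# balance equation by the normalisation condition and solve.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(4)
b[-1] = 1.0
pi = np.linalg.solve(A, b)

print("P(both working) =", pi[0])
print("P(both failed)  =", pi[3])
```

The vector pi supplies the joint probability table (both working, one failed, both failed) that the methodology re-uses at the BDD level.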
This step of the methodology is comparable to the procedure discussed in Section 2.2 for non-conventional component failure models, with the only difference that the simulation, and subsequently its output, are here referring to a group of dependent components rather than to individual, independent ones.

FTs identification and conversion
The next phase of the procedure focuses on the identification of independent system FTs and their conversion into equivalent BDDs. If all sub-systems are independent, the FTs submitted in input remain untouched. Conversely, if any type of dependency exists between two or more sub-systems, the respective FTs are merged together. This operation is carried out by directing the top events of the individual dependent FTs into an AND gate, whose output becomes the top event of the new merged FT. For example, let 1 and 2 be the top events of two individual FTs representing the failure of a primary and a secondary cooling sub-system, respectively (see Figure 6 and the left-hand side of Figure 7). Assuming the two sub-systems to share a common component, the related FTs fail the assumption of independence, resulting in the need to compute the two models jointly and hence to merge the two FTs as shown in Figure 7. Overall, the result of this step is a set of independent FT models that can then be analysed separately without affecting the accuracy of the analysis.
Once the independent FTs are identified, their conversion to BDDs is carried out. A BDD is a directed acyclic graph consisting of terminal and non-terminal vertices connected by edges (also referred to as branches). 11,22 Each non-terminal vertex is associated with a basic event (e.g., the failure of an individual component in the current application) and is the origin of two branches: a 0 branch representing the non-occurrence of the basic event (e.g., the working state of the component) and a 1 branch representing the occurrence of the basic event (e.g., component failure). Terminal vertices, in which all paths through the BDD terminate, assume either a 0 value, associated with the working state of the system, or a 1 value, indicating instead the failure of the system.

F I G U R E 6 Event tree for the model in Section 3

F I G U R E 7 Merging procedure for dependent FTs

The Boolean function underlying a BDD structure can be factorised node by node through the use of the if-then-else (ite) structure. Any BDD node N can be represented as

N = ite(x, G1, G0)    (1)

where x, which labels the node, refers to the associated basic event, while G1 and G0 are further Boolean functions representing the children nodes lying on the 1 and 0 branches of N, respectively. The expression in Equation (1) translates as: if x fails then consider the Boolean function G1; else consider the function G0 which, lying on the 0 branch, implies the working state of the same component. 23 The structure of the BDD can then be expressed as the combination of the structures of its nodes.
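The ite factorisation in Equation (1) can be made concrete in a few lines. The sketch below is an illustrative reconstruction, not the paper's implementation; the event names and probability values are hypothetical. It builds a small BDD from ite nodes and evaluates the top-event probability by Shannon decomposition:

```python
from dataclasses import dataclass
from typing import Union

# A node is either a non-terminal ite triple or a terminal 0/1
Node = Union["Ite", int]

@dataclass(frozen=True)
class Ite:
    var: str      # basic event labelling the node
    g1: Node      # child on the 1 branch (event occurs)
    g0: Node      # child on the 0 branch (event does not occur)

def prob(node: Node, q: dict) -> float:
    """Top-event probability by recursive Shannon decomposition:
    P(N) = q_x * P(G1) + (1 - q_x) * P(G0)."""
    if isinstance(node, int):
        return float(node)    # terminal 1 -> system failed, 0 -> working
    return q[node.var] * prob(node.g1, q) + (1 - q[node.var]) * prob(node.g0, q)

# Hypothetical two-event system, A OR B: top = ite(A, 1, ite(B, 1, 0))
bdd = Ite("A", 1, Ite("B", 1, 0))
q = {"A": 0.1, "B": 0.2}
print(prob(bdd, q))   # q_A + (1 - q_A) * q_B
```

Note that, for independent events, this node-by-node evaluation is exact without any cut-set approximation, which is one of the classical advantages of the BDD form.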

F I G U R E 8 BDD structures for sub-systems
For instance, the structure of the BDD shown in Figure 8 can be expressed as a nested ite structure whose root node is labelled by the failure event HX2. Each path, that is, each chain of events interconnecting through the BDD structure and linking the root node of the graph with a terminal 1, describes a system cut set and hence represents a possible failure mechanism of the system.
One of the advantages associated with BDDs is the ease with which a tree can be converted to represent its complementary top event. This is represented by a dual binary decision diagram (DBDD), which can be easily obtained from the associated BDD by inverting the value of the terminal vertices while maintaining the structure unaltered. In terms of BDD structure, the result of the conversion procedure depends strongly on the variable ordering selected. In the current study, the conversion method developed by Rauzy 24 has been adopted applying the special ordering suggested by Sinnamon and Andrews. 25 According to this, FT gate events are considered in a top down ordering, with the only exception that at each gate, the input basic events are listed with the repeated events first (if the gate has more than one repeated event as an input, then the most repeated event is placed first).

Models integration
BDDs encode Boolean functions through the combination of a graphical structure, entailing edges, terminals and nodes, and the probabilistic information associated with the latter. The analysis of the graphical structure makes it possible to define the possible combinations of events (or component states) resulting in the overall system failure, while the manipulation of the numerical information associated with each node allows the likelihood of the failure to be quantified. In light of this, the graphical layout of the BDD depends exclusively on the corresponding FT structure and the variable ordering adopted for the conversion, and hence is not affected by the nature of the events involved or their probabilistic characterisation. However, in the proposed methodology, the integration of the auxiliary PN and MM models discussed in Sections 2.2 and 2.3 is still accomplished within the BDD definition, and more specifically by means of the numerical characterisation of the nodes according to the nature of the event depicted.
As discussed so far, the probabilistic information associated with each basic FT event, and therefore with each BDD node, can be gathered from five possible sources:
• traditional maintenance models: referring to independent components or events characterised by constant failure and repair rates;
• marginal MM models: this option is available to model the life cycle of independent components characterised by constant failure/repair rates as well as by complex maintenance strategies not captured by canonical maintenance models;
• marginal PN models: adopted to describe individual components characterised by non-exponentially distributed failure/repair times or by complex (e.g., user-defined) maintenance strategies;
• joint MM models: used to model the interaction between failure mechanisms and repair processes of two or more components characterised by constant failure/repair rates;
• joint PN models: as for the former, this solution can be adopted to model the interdependencies and dynamic features existing between components, even when these are characterised by non-constant rates.
The first three categories result in the estimation of marginal unavailability values for independent components, and can therefore be associated with the corresponding nodes as in traditional BDDs, regardless of their origin (i.e., whether from MM/PN output or from traditional models). By contrast, the computation of the joint PN and MM models results in the estimation of joint probabilities associated with each possible combination of outcomes of the dependent events included in a previously identified DG. Therefore, each BDD node belonging to a DG is associated with a joint probability table covering all possible combinations of the events entailed in the group, rather than with an individual probability value as in the other cases. Such tables are common to all nodes of the same DG and allow the output of the secondary PN and MM models to be integrated within the BDD structure even in the presence of component dependencies.
In light of this, the numerical characterisation of each BDD through the probabilistic output of PN and MM models (joint or marginal) represents the core of the multi-model integration on which the proposed methodology relies.

BDDs computation
The computational steps discussed so far provide a structure of independent BDDs as well as the reliability information associated with independent components and DGs. This information can now be processed in order to quantify each BDD, focusing on the prediction of three relevant reliability metrics: failure probability, failure frequency and component importance measures.
As discussed in the previous section, each path connecting a BDD root node with a terminal 1 represents a combination of component states leading to the failure of the system. In light of this, the failure probability of the system (i.e., the probability associated with the top event of the corresponding FT) can be calculated as the sum of the probability values associated with each such path. This can be expressed as

Q_sys = Σ_i P(path_i)

where P(path_i) indicates the probability associated with the i-th of the disjoint paths connecting the BDD root to a terminal 1.
In order to achieve an adequate understanding of a system's behaviour, it is desirable to measure the contribution of individual components to its overall failure probability. This can be estimated for a generic component i as the difference between the system failure probability assuming the component to have failed (i.e., Q_sys(q_i = 1)) and the system failure probability assuming the component to work correctly (i.e., Q_sys(q_i = 0)). This quantity is known as Birnbaum's measure of importance of the component, and can be expressed as

G_i = Q_sys(q_i = 1) − Q_sys(q_i = 0)

The third metric of interest in this study can finally be computed from the Birnbaum's measures of importance for the components as

w_sys = Σ_i G_i w_i

where w_i refers to the failure intensity of the i-th of the components of the system, and w_sys is the overall system failure intensity.
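As a worked illustration of these three metrics, the sketch below evaluates them for a hypothetical structure function; the component names, unavailabilities and failure intensities are invented for the example, and a real analysis would evaluate Q_sys from the BDD paths rather than from a closed-form expression:

```python
def q_sys(q):
    """Failure probability of a hypothetical system: component A in
    series with the parallel pair (B, C), i.e. the system fails if A
    fails or if both B and C fail."""
    return 1 - (1 - q["A"]) * (1 - q["B"] * q["C"])

def birnbaum(q, comp):
    """Birnbaum importance: Q_sys with the component failed minus
    Q_sys with the component working."""
    hi = dict(q); hi[comp] = 1.0
    lo = dict(q); lo[comp] = 0.0
    return q_sys(hi) - q_sys(lo)

q = {"A": 0.01, "B": 0.05, "C": 0.05}   # component unavailabilities (assumed)
w = {"A": 1e-4, "B": 5e-4, "C": 5e-4}   # component failure intensities (assumed)

G = {c: birnbaum(q, c) for c in q}
w_sys = sum(G[c] * w[c] for c in q)     # system failure intensity
print(q_sys(q), G, w_sys)
```

The series component A dominates the importance ranking here, as expected for a single-point-of-failure element.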
If the BDD refers to an independent sub-system, and hence matches one of the sub-system FTs in the input, the reliability information obtained is directly relevant to the independent sub-system itself. Conversely, if the BDD results from the creation of a merged FT due to the existence of dependencies between two or more sub-systems (as discussed in Section 2.4), the reliability information refers to the joint state of the two sub-systems. For instance, with reference to the example of Figure 7, the quantification of the BDD obtained from the conversion of the merged FT will result in the probability associated with the top event indicating the simultaneous failure of both sub-systems.
Finally, the occurrence of one or more DGs within a FT implies the presence of components dependencies in the resulting BDD. Currently available algorithms for the computation of BDDs do not allow for stochastic dependencies, relying instead on the assumption of components independence typical of FT analysis. In light of this, the simulation framework proposed required the development of novel solutions for the quantification of BDDs in the presence of dependencies. The approach adopted relies on the calculation of the probability of all BDD paths to failure using independent and joint probabilities, as well as their manipulation, as appropriate. This strategy has been implemented in a novel algorithm 26 and integrated in the simulation framework presented.

ET computation
The computation of the independent BDDs provides the information necessary to complete the safety analysis, through the calculation of the overall system ET. If the sub-systems defining the ET branches are all mutually independent, the computed BDDs match the input FT structures and hence refer to individual sub-systems: in this case, the computation of the ET is straightforward, utilising the initiating event failure intensity and the probability of failure of the remaining sub-systems. If instead two or more sub-systems included in the ET are linked by some sort of dependency, the computation of the ET requires the use of conditional, rather than marginal, probabilities. These can be easily gathered from the joint probabilities calculated from the merged BDD (as discussed in Sections 2.4 and 2.6), according to the marginalisation and conditioning rules. The first of these can be applied in order to extrapolate marginal values of probability from joint estimates, as

P(A) = Σ_b P(A, B = b)

where the marginal value P(A) is obtained by summing over the set of joint probability values P(A, B = b), obtained from the BDD analysis and associated with the occurrence of state A regardless of the state of the dependent component B. Similarly, the conditional probability of component A being in its failed state given the failure of B can be computed as

P(A_F | B_F) = P(A_F, B_F) / P(B_F)

F I G U R E 9 System model for demonstration of concepts
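The marginalisation and conditioning rules above can be illustrated on a small joint probability table. The numbers below are hypothetical; in practice the table would be produced by the merged-BDD computation:

```python
# Joint probabilities over the four combined states of dependent
# components A and B ('W' = working, 'F' = failed); illustrative values.
joint = {("W", "W"): 0.90, ("W", "F"): 0.04,
         ("F", "W"): 0.03, ("F", "F"): 0.03}

def marginal_a(state):
    """P(A = state), summing over the states of B."""
    return sum(p for (a, b), p in joint.items() if a == state)

def cond_a_given_b(a_state, b_state):
    """P(A = a_state | B = b_state) = P(A, B) / P(B)."""
    p_b = sum(p for (a, b), p in joint.items() if b == b_state)
    return joint[(a_state, b_state)] / p_b

print(marginal_a("F"))             # 0.03 + 0.03
print(cond_a_given_b("F", "F"))    # 0.03 / 0.07, noticeably above the marginal
```

The gap between the conditional and the marginal failure probability of A is exactly the effect that the ET computation would miss if the dependency were ignored.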

NUMERICAL APPLICATION
To demonstrate the methodology and explore the effects of different assumptions in the analysis, the proposed approach has been applied to five simple case-studies characterised by different degrees of complexity:
1. Case Study A: relies on conventional assumptions such as full independence with no component degradation.
2. Case Study B: investigates the inclusion of component degradation in the analysis while maintaining the assumption of independence.
3. Case Study C: focuses on dependencies resulting from shared basic events between two or more sub-systems. This is referred to as a 'hard' dependency.
4. Case Study D: investigates a category of dependencies referred to as 'soft'. These include dependencies triggered by secondary procedures or processes, which may be not strictly connected with the hardware function (e.g., maintenance, load, surrounding conditions, etc.).
5. Case Study E: considers the overlapping of the dependency types investigated in Cases C and D. This is referred to as a 'complex' dependency.
The following sections provide a detailed description of the analysis carried out for each case listed above. All case studies refer to the same system design, whose details are provided in the following section.

System model
A simplified power plant cooling system has been selected for testing the proposed methodology. The system design is shown in Figure 9 and embraces four sub-systems:
• Primary cooling system: this consists of a heat exchanger HX1 which is fed cooling water from a storage tank T1. The circulation is ensured by the operation of two identical pumps working in parallel, P1 and P2 (both normally operational). The failure of either HX1 or T1 prevents the correct functioning of the primary cooling system. Similarly, the simultaneous unavailability of the circulation pumps P1 and P2 leads to the overall failure of the sub-system. Conversely, the failure of only one of the two pumps does not affect the operation of the primary cooling system, since both pumps have the capability to provide the coolant required. The FT summarising the possible failure mechanisms of the primary cooling system is shown in Figure 10A.
• Secondary cooling system: this partially provides the required vessel cooling capability. Similarly to the primary cooling system, it comprises a heat exchanger HX2, a storage tank T2 and a pump P3 responsible for the circulation of cooling water. Additionally, the valve V1 opens to allow water to flow only when the primary system has failed and P3 is activated and working correctly. The failure of any of the mentioned components prevents the correct operation of the system, as shown by the FT in Figure 10B, resulting in the unavailability of secondary cooling.

F I G U R E 1 0 BDD structures for sub-systems
• Detection system: the temperature within the vessel is expected to rise in the case of failure of the primary cooling system. By design, such an increase is detected by dedicated sensors, S1 and S2, working in parallel and transmitting the readings to a programmable logic controller (PLC). When the temperature readings from S1 and S2 exceed a pre-established threshold, the controller activates both the secondary and fan cooling systems. The failure of the sub-system controller results in the unavailability of the entire sub-system, as shown by the FT of Figure 10C. On the contrary, the failure of only one of the two sensors does not affect the correct operation of the sub-system.
• Fan system: like the secondary cooling system, the fan has the capability to provide partial cooling of the vessel. When relay R1 energises and its contacts close, the associated circuit activates the electric motor operating the fan. The failure of any of these components results in the unavailability of the sub-system, as highlighted by the FT in Figure 10D.
Any combination of sub-system states has the potential to result in different consequences, as summarised by the ET in Figure 6. The failure of the primary cooling system can result in the total loss of cooling only in combination with the failure of the detection system (preventing the activation of secondary mitigation measures) or in the case of simultaneous loss of the secondary and fan cooling systems. Conversely, if only one of the latter two sub-systems operates correctly, there will be a partial loss of cooling. Finally, if all sub-systems but the primary cooling are available, no cooling loss is registered.

Case A
This is the simplest case considered, relying on the assumption of full independence and neglecting any degradation process to which the components may be subject. In light of this, the reliability metrics of the components can be calculated according to conventional failure models, as discussed in Section 2.2. For a component whose failure is immediately apparent and for which corrective action is in place (i.e., a revealed failure), the steady-state unavailability (Q) can be expressed as

Q = λ / (λ + ν),

where λ and ν represent the component's failure and repair rates, respectively. This is the case for all the components belonging to the primary system, since their failure would directly affect the performance of the system. On the other hand, the malfunctioning of components belonging to the secondary, detection and fan systems might not be detected on occurrence, due to the sub-systems not being continuously operational (i.e., unrevealed failures). Scheduled maintenance with regular inspections is assumed to be in place for such components, whose average unavailability can hence be calculated as

Q = λθ / 2,

where θ refers to the time interval between inspections. The failure intensity (w) can then be calculated for the generic component as

w = λ(1 − Q).

The components' reliability information and the relative metrics calculated according to the above equations are shown in Table 1. A regular time interval of 4380 h between inspections, corresponding to a 6-monthly maintenance schedule, has been assumed for all components except those belonging to the primary cooling system. According to the procedure discussed in Section 2, the next step in the methodology focuses on the identification and computation of DGs. However, this does not apply in the current case, since all components and sub-systems are assumed independent. Consequently, the FTs shown in Figures 10A-10D are mutually independent and can be converted to BDD structures as discussed in Section 2.4.
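The three conventional formulas above can be sketched as follows (a minimal sketch; the rates and interval used here are hypothetical, not the values of Table 1):

```python
# Conventional constant-rate component models.
# lam = failure rate, nu = repair rate, theta = inspection interval (hours).

def unavailability_revealed(lam, nu):
    """Steady-state unavailability for a revealed failure with corrective repair."""
    return lam / (lam + nu)

def unavailability_unrevealed(lam, theta):
    """Average unavailability for an unrevealed failure inspected every theta hours."""
    return lam * theta / 2.0

def failure_intensity(lam, q):
    """Unconditional failure intensity w = lam * (1 - Q)."""
    return lam * (1.0 - q)

# Illustrative values (hypothetical):
q_pump = unavailability_revealed(lam=1e-5, nu=0.1)       # revealed (primary system)
q_valve = unavailability_unrevealed(lam=1e-6, theta=4380.0)  # 6-monthly inspection
w_pump = failure_intensity(1e-5, q_pump)
```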
The resulting BDDs are shown in Figure 8, where dashed lines indicate 0 branches and continuous lines 1 branches. Once the structure of the BDDs has been generated, their numerical analysis is carried out as discussed in Section 2.6. The resulting failure probability values associated with each sub-system are shown in Table 2.
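For independent basic events, the numerical analysis of a BDD reduces to a Shannon decomposition over its nodes. A minimal sketch, assuming a simple tuple encoding of nodes (the structure and probabilities below are illustrative, not those of Figure 8 or Table 2):

```python
# A BDD node is (event, low, high); terminals are the integers 0 and 1.
# The 1-branch corresponds to the basic event occurring (component failed).

def bdd_probability(node, q):
    """Top-event probability via Shannon decomposition, assuming independence."""
    if node in (0, 1):
        return float(node)
    event, low, high = node
    return (1.0 - q[event]) * bdd_probability(low, q) + q[event] * bdd_probability(high, q)

# Example: a two-pump parallel branch fails only if both P1 and P2 fail
# (an AND gate), encoded as the BDD P1 -> (0, P2 -> (0, 1)).
# The failure probabilities are hypothetical.
bdd_and = ("P1", 0, ("P2", 0, 1))
q = {"P1": 1e-4, "P2": 1e-4}
top = bdd_probability(bdd_and, q)  # approximately 1e-08 for these values
```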
The sub-systems FTs were also analysed adopting the minimal cut set upper bound approximation, 16,27 in order to verify the algorithms implemented.
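The minimal cut set upper bound used for this verification can be sketched as follows (the cut set probabilities are hypothetical):

```python
from math import prod

def mcs_upper_bound(cut_set_probs):
    """Minimal cut set upper bound: Q_sys <= 1 - prod_i (1 - Q_i),
    with Q_i the probability of minimal cut set i."""
    return 1.0 - prod(1.0 - q for q in cut_set_probs)

# Example with two hypothetical minimal cut sets, e.g. {P1, P2} and a
# single-component cut set:
q_sys = mcs_upper_bound([1e-4 * 1e-4, 5e-5])
```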
As shown in Figure 6, the failure of the primary cooling system acts as the trigger event for the overall system failure. For this reason, the failure frequency associated with the primary system was computed, resulting in 1.1901 × 10⁻⁵ h⁻¹. From this and the sub-system failure probabilities previously calculated, it is finally possible to estimate the system ET outcome frequencies for the possible degrees of cooling loss. The results obtained are shown in Table 3.
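Under full independence, the ET quantification reduces to multiplying the initiating-event frequency by the probability of each branch taken. A sketch, with hypothetical sub-system probabilities (only the primary failure frequency above comes from the text):

```python
def outcome_frequency(w_initiator, branch_probs):
    """Frequency of an ET outcome: initiating-event frequency times the
    probability of each subsequent branch taken."""
    freq = w_initiator
    for p in branch_probs:
        freq *= p
    return freq

w_primary = 1.1901e-05                     # primary cooling failure frequency (h^-1)
q_det, q_sec, q_fan = 1e-3, 2e-3, 3e-3     # hypothetical sub-system failure probs

# Total loss via detection failure: primary fails AND detection fails.
w_total_det = outcome_frequency(w_primary, [q_det])
# Partial loss: detection works, secondary works, fan fails.
w_partial = outcome_frequency(w_primary, [1 - q_det, 1 - q_sec, q_fan])
```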

Case B
Assume now two of the components of the primary cooling system to be characterised by non-constant failure rates, with their failure times following a Weibull distribution. In addition, let the first of these also be characterised by a lognormal repair time distribution.
As described in Section 2.2, the proposed approach relies on the use of PNs to compute the reliability of components characterised by non-constant failure or repair rates. The lifecycle of the first component is then simulated by running to convergence the PN shown in Figure 3, computing the component unavailability as the ratio of the total simulated downtime to the overall simulated life. An identical PN structure has been adopted for the second component. Specifics of the distributions adopted in the current case-study and the resulting reliability metrics are provided in Table 4. Substituting such values into the analysis of the primary sub-system BDD (to which the components under study belong) updates the associated failure probability. Also in this case, the computation of the system ET is conventional, thanks to the assumption of full independence. Table 3 shows the results obtained by updating the ET calculation with the new probability and intensity values for the primary system failure in Equation (11).
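The lifecycle simulation performed by the PN can be sketched as a simple Monte Carlo alternation of Weibull failure times and lognormal repair times (the distribution parameters below are illustrative, not those of Table 4):

```python
import random

def simulate_unavailability(beta, eta, mu_log, sigma_log, horizon, seed=42):
    """Estimate unavailability as total downtime over total simulated life,
    alternating Weibull(eta, beta) failure times and lognormal repair times."""
    rng = random.Random(seed)
    t, downtime = 0.0, 0.0
    while t < horizon:
        t += rng.weibullvariate(eta, beta)       # time to failure (component up)
        if t >= horizon:
            break
        repair = rng.lognormvariate(mu_log, sigma_log)
        downtime += min(repair, horizon - t)     # time under repair (component down)
        t += repair
    return downtime / horizon

# Hypothetical parameters: wear-out (beta > 1), scale 5000 h, short repairs.
q = simulate_unavailability(beta=1.5, eta=5000.0, mu_log=2.0, sigma_log=0.5,
                            horizon=1e7)
```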

Case C
Consider now the situation where pump P2 is common to both the primary and secondary cooling systems. As such, P2 replaces P3 in the FT representing the failure of the secondary sub-system (see Figure 10B). To model the real setting of the system, it is now necessary to take into account the dependency triggered by the shared component P2 and linking the primary and secondary cooling system FTs. As discussed in Section 2.4, this is achieved by merging the dependent FTs, computing the resulting BDDs and extracting from the analysis the joint probability associated with all the possible combinations of states of the two sub-systems. In this case, only combinations of states including the failure of the primary cooling system are of relevance for the safety analysis, since the working state of the primary cooling would instead guarantee the correct operation of the overall system (see Figure 6). Hence, the probability estimates of interest are:
• P(PRIMARY fails, SECONDARY fails): the probability of simultaneous failure of the primary and secondary cooling;
• P(PRIMARY fails, SECONDARY works): the probability of failure of the primary cooling combined with correct operation of the secondary cooling.
The first of these is estimated from the analysis of the FT obtained by merging (ANDing) the two sub-system FTs, as shown in Figure 7. Similarly, the probability P(PRIMARY fails, SECONDARY works) can be calculated as the top event of a FT obtained by merging the primary cooling FT with the working-state (dual) model of the secondary cooling system FT initially provided. The resulting tree is shown in Figure 11.
Having defined the FTs of interest, they are converted to BDDs, giving the structures shown in Figure 12.
Since the dependency considered in this case is caused by the repetition of the component P2 in the two sub-systems, the merging of the FTs is sufficient to guarantee the validity of the independence assumptions, and the resulting BDDs can be computed using traditional algorithms.
The results of the BDD analysis are shown in Table 5: while the probabilities associated with the failure of the detection and fan sub-systems remain unchanged (the parameters for these systems are identical in cases A, B and C), joint values are provided for the primary and secondary cooling.
As discussed in Section 2.7, the system ET can now be computed taking into account the dependencies underlying the two systems and manipulating the relative joint probabilities. For instance, the frequency associated with a consequence requiring the correct operation of the secondary cooling involves the term

w(PRIMARY) · P(SECONDARY works | PRIMARY fails),

where P(SECONDARY works | PRIMARY fails) refers to the probability of the secondary system working given that the primary coolant system has failed, and w(PRIMARY) to the failure frequency of the primary coolant sub-system. The conditional probability can be calculated according to the conditioning procedure shown in Equation (7), resulting in

P(SECONDARY works | PRIMARY fails) = P(PRIMARY fails, SECONDARY works) / P(PRIMARY fails),

where P(PRIMARY fails) can in turn be estimated through marginalisation (see Equation 6) as

P(PRIMARY fails) = P(PRIMARY fails, SECONDARY fails) + P(PRIMARY fails, SECONDARY works),

using the joint probability values obtained from the BDD analysis. Similarly, the probability of the secondary cooling failing given the failure of the primary system can be obtained as

P(SECONDARY fails | PRIMARY fails) = P(PRIMARY fails, SECONDARY fails) / P(PRIMARY fails).

In light of this, the computation of the system ET can be completed, resulting in the loss frequency values shown in Table 3.
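The marginalisation and conditioning steps above can be sketched as follows (the joint values are hypothetical placeholders for those extracted from the BDD analysis):

```python
def marginal_primary(p_joint_fail, p_joint_work):
    """P(PRIMARY fails) = P(PRIMARY, SECONDARY fail) + P(PRIMARY fails, SECONDARY works)."""
    return p_joint_fail + p_joint_work

def conditional(p_joint, p_marginal):
    """P(secondary state | PRIMARY fails) = joint probability / marginal probability."""
    return p_joint / p_marginal

p_fail_fail = 4.0e-8   # hypothetical P(primary fails, secondary fails)
p_fail_work = 1.2e-4   # hypothetical P(primary fails, secondary works)

p_primary = marginal_primary(p_fail_fail, p_fail_work)
p_sec_works_given = conditional(p_fail_work, p_primary)
p_sec_fails_given = conditional(p_fail_fail, p_primary)

# The two conditional probabilities partition the primary-failure event.
assert abs(p_sec_works_given + p_sec_fails_given - 1.0) < 1e-12
```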

TABLE 6: Input of the MM in Figure 5 (parameters and values).

Case D
In the system, the circulation of cooling water within the primary sub-system is ensured by the operation of two identical pumps working in parallel, namely P1 and P2. Let us now account for the fact that the failure of one pump will require the second pump to deliver the full supply and therefore be subject to a higher load, increasing its probability of failure. Such a relationship between the two components is modelled by the PN shown in Figure 4 as well as by the MM shown in Figure 5. Assuming the repair and failure times of both pumps to be exponentially distributed, the steady-state solution can easily be calculated for the associated MM, gaining an advantage in terms of computational time. The two components are assumed to be subject to corrective maintenance with the same repair rate used in the previous cases, as well as the same failure rate when under design load. However, the failure rate of each component increases when the pump is subject to a higher load (i.e., when the other pump is out of order). Adopting the parameters shown in Table 6, the steady-state solution of the MM leads to the joint values shown in Table 7: as expected, the joint unavailability of P1 and P2 turns out to be one order of magnitude higher than the product of their unavailabilities calculated assuming independence.
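The steady-state solution of a load-sharing MM of this kind can be sketched as follows, assuming a four-state model (both pumps up, either pump down, both down) and illustrative rates rather than those of Table 6:

```python
import numpy as np

lam = 1e-4     # failure rate under design load (h^-1)
lam_h = 5e-4   # elevated failure rate of the surviving pump (h^-1)
nu = 0.1       # repair rate (h^-1)

# States: 0 = both up, 1 = P1 down, 2 = P2 down, 3 = both down.
# Generator matrix Q (each row sums to zero).
Q = np.array([
    [-(2 * lam),   lam,             lam,             0.0],
    [nu,          -(nu + lam_h),    0.0,             lam_h],
    [nu,           0.0,            -(nu + lam_h),    lam_h],
    [0.0,          nu,              nu,             -2 * nu],
])

# Solve pi @ Q = 0 with sum(pi) = 1, replacing one balance equation with the
# normalisation condition.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(4)
b[-1] = 1.0
pi = np.linalg.solve(A, b)

joint_unavailability = pi[3]   # probability that both pumps are down
```

With these rates the joint unavailability exceeds the independent product (λ/(λ+ν))², reproducing the qualitative effect described above.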
Carrying out the analysis of the BDD associated with the primary cooling using the joint values obtained, the failure probability of the sub-system results equal to 1.7460 × 10⁻⁴, more than twice the estimate of case A. The probability values associated with the failure of the remaining sub-systems remain instead unchanged (see Table 2). The analysis of the ET results in the loss frequencies provided in Table 3. The obtained frequencies show the impact of the P1-P2 dependency on the overall safety of the system: the values associated with both partial and total loss of cooling are higher than those computed in the case of independence.

Case E
Engineering systems can be characterised by complex dependency structures, not easily attributable to a single type as in the examples above. In order to test the proposed approach against such a case, case-studies C and D have been merged into one: let P2 be shared by the primary and secondary cooling systems, as well as being dependent on P1. The resulting BDD structures are then identical to those shown in Figure 12. However, as for case D, their computation requires the use of the joint probabilities associated with the failure and operation of both the primary and secondary cooling sub-systems (see Table 7). Case E therefore contemplates the stochastic dependency between P1 and P2, where P2 is a component of both the primary and secondary cooling sub-systems. The quantification of the failure probabilities of the latter allows the effect of such a system setting to be estimated in terms of system reliability. Indeed, as shown in Table 8, the probability of a simultaneous failure of both sub-systems is equal to 1.3405 × 10⁻⁴, over three times higher than the estimate obtained in case C (i.e., considering the stochastic dependency between the two pumps but assuming P3 to service the secondary cooling instead of P2). However, when compared to the system setting entailing only the stochastic dependency (case D), the gap between values widens considerably, with the joint probability of the primary and secondary systems failing being three orders of magnitude higher than its value assuming independence. Therefore, both the stochastic dependency between P1 and P2 and the dependency between the two cooling sub-systems (due to the shared use of P2) individually raise the likelihood of a simultaneous failure of the primary and secondary cooling, with the second giving the highest contribution to the increase. This is due to the decline of system redundancy triggered by the substitution of P3 (as in case D) with P2.

On the other hand, the probability associated with the failure of the primary sub-system combined with the correct operation of the secondary cooling decreases when considering the stochastic dependency between the two parallel pumps P1 and P2: the estimate indeed results to be only 20% of the value calculated in case C under the assumption of component independence. This suggests that, when considering the component dependency, the failure of the primary system is more likely to occur in combination with the failure of the secondary cooling rather than with its working state. This is reflected in the overall system loss frequencies: as shown in Table 3, the frequency values for the current case are the highest across all the system settings and loss types considered. This is true also for the occurrence of no cooling loss, for which, involving the working of the secondary system, a lower frequency value could be expected for case E in comparison with case D, due to the difference in the relative joint probabilities. However, the gap between the failure probabilities discussed above is mitigated, and its effect on the overall loss frequency reversed, by the increased frequency of primary cooling failure registered when considering the stochastic dependency between P1 and P2.

ACCURACY AND SCALABILITY
The strength of the proposed methodology lies in its capability of isolating aspects of the system behaviour which defy conventional FT/ET analysis assumptions, modelling them with more detailed and flexible strategies such as MMs and PNs. This is achieved by integrating the initial FT models and the MM/PN output within the BDD framework, thanks to the development of algorithms for the computation of the latter taking dependencies into account. Of course, while hopefully paving the path towards more realistic and accurate modelling, this kind of approach opens the door to new challenges associated with the nature of the newly available modelling capabilities and their computational cost, such as:
• The use of BDDs with dependencies, which comes at the cost of higher computational power. The magnitude of such an increase depends on the model structure. As a limited term of comparison, the computation of case A in the current study was carried out in 0.01 s, while the analysis of case E required 0.03 s. To mitigate the costs associated with real-world systems and ensure the scalability of the approach, the size of the model sections entailing dependencies can be reduced, hence minimising the use of the more expensive algorithms associated with their computation. This can be achieved through modularisation, currently under implementation, consisting in reducing the initial FT structure,28 and subsequently extracting independent sections of the tree circumscribing the dependent components.29,30 Such sub-models can then be analysed separately with the novel approach for the computation of BDDs in the presence of dependencies, and the results re-introduced into the initial FT model in the form of an independent surrogate event retaining reliability information equivalent to the section analysed. The FT so modified can then be analysed with traditional approaches, whose feasibility for large-scale systems has been widely proven.
• The computational burden of stochastic models and simulation techniques, such as PNs and MMs, increases significantly with model size or when rare events are involved. As for the case above, this can be mitigated by minimising model size. In light of this, an unconventional PN modelling strategy has been implemented in the proposed framework. This technique inherits the main aspects of traditional PNs but allows the joint state of the places of interest in the network to be recorded whenever a transition fires, directly producing joint probabilities and frequencies as output and hence avoiding the use of places representing the joint state under study. In the case of n dependent Boolean components, the associated PN model would require 2·n + 2^n places, while the alternative PN solution reduces this number to 2·n. With regard to the PN in Figure 4, this translates into the computation of only the red section of the model, with advantages in terms of efficiency and without affecting the accuracy of the analysis. Other possible strategies to enhance the efficiency of the simulations would entail the use of advanced Monte Carlo sampling techniques.31,32
• The proposed approach removes simplifying assumptions and limitations associated with traditional FT analysis, providing analysts with a higher degree of flexibility but also transferring to them the task of identifying modelling assumptions. Indeed, while the FT construction process would remain unvaried, the proposed framework allows the relationships between components to be specified further. This implies, first, the need to identify relevant dependencies or complex relationships between components, and then to assess their significance. The first of these two tasks should be addressed on the basis not only of meticulous knowledge of the system (as for FT analysis), but also of the nature of the interactions between components.
Indeed, the model implementation should be preceded by an analysis of component dependencies, which could be carried out systematically on the basis of the different sources of dependency, for example, causal dependencies (when the state of a component directly affects the probability of failure of another component), maintenance strategies, common environmental factors and so forth. Once possible dependency relationships between components have been identified, the further step is to estimate their relevance. Such a task is not trivial, since any direct numerical estimation of the relevance of a dependency would have to rely on the comparison of the joint and marginal probabilities of the components involved, hence implying the modelling of the dependency relationship itself. While the possible exclusion of the investigated dependency from the model due to lack of significance would decrease the computational burden of the analysis, it would not reduce the modelling load for the analyst, due to the necessity of estimating the joint probabilities. Further support for the decision-making process could be provided by importance measures, allowing, for example, the exclusion of complex feature modelling when it entails only low-importance components. Ultimately, the identification of the complex features to include in the modelling would be mostly dictated by expertise and knowledge of the system: while this may increase the complexity of the modelling process, it would also push towards a better understanding of the system as an interconnected network of components, forcing the analyst to consider aspects of the system which would be left implicitly hidden when adopting more traditional techniques.
• Aside from determining which complex features to consider in the analysis, the new modelling capability also requires explicitly deciding the degree of detail of the more sophisticated model sections (e.g., PNs and MMs).
This would partly depend on data availability, but mostly leads back to the long-standing issue of finding a balance between model accuracy and its costs. Also in this case, the strength of the model and its robustness could be evaluated through numerical strategies, such as sensitivity analysis, at the price of a further increase in the modelling burden. However, as for the previous point, it is worth highlighting that the need to explicitly identify these analysis details is in itself a significant achievement, even if not without challenges. Indeed, the new capabilities allow the implicit assumptions of traditional techniques to be reduced (if not eliminated), returning to the analyst a realistic picture of the complexity of the systems under study and providing the basis for a better understanding, and potentially representation, of complex systems.

CONCLUSIONS
The research presented proposes a novel methodology aiming to overcome the limitations of traditional fault and event tree techniques, whilst preserving their familiar modelling formulation as well as their computational efficiency. This is achieved through a surgical approach to modelling, relying on the identification of the minimal model subsets requiring a degree of simulation fidelity beyond the capability of traditional methodologies (e.g., dependencies, degradation processes, complex maintenance strategies) and their resolution through the use of PNs and MMs. On the one hand, the use of the latter ensures extreme modelling flexibility and promotes the shift towards more realistic modelling. On the other, restricting the use of these more expensive modelling solutions to only the sections of the system for which traditional simplifying assumptions (e.g., component independence, failure-rate constancy) are unjustified ensures the computational feasibility and scalability of the approach. The methodology developed is tested against five case-studies, covering a range of component dependency types and system settings which cannot be fully represented through the use of conventional fault and event trees. The results obtained are compared to those achieved with existing techniques, in order to verify the accuracy as well as the computational efficiency of the implemented algorithms.

ACKNOWLEDGEMENT
This work was supported by the Lloyd's Register Foundation, a charitable foundation in the U.K. helping to protect life and property by supporting engineering-related education, public engagement and the application of research. This work was funded by the Lloyd's Register Foundation, with the project Next Generation Prediction Methodologies and Tools for System Safety Analysis.

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.