How reliable are case formulations? A systematic literature review.

OBJECTIVES
This systematic literature review investigated the inter-rater and test-retest reliability of case formulations. We considered the reliability of case formulations across a range of theoretical modalities and the general quality of the primary research studies.


METHODS
A systematic search of five electronic databases was conducted in addition to reference list trawling to find studies that assessed the reliability of case formulation. This yielded 18 studies for review. A methodological quality assessment tool was developed to assess the quality of studies, which informed interpretation of the findings.


RESULTS
Results indicated inter-rater reliability mainly ranging from slight (.1-.4) to substantial (.81-1.0). Some studies highlighted that training and increased experience led to higher levels of agreement. In general, psychodynamic formulations appeared to generate somewhat higher levels of reliability than cognitive or behavioural formulations; however, these studies also included methods that may have served to inflate reliability, for example, pooling the scores of judges. Only one study investigated the test-retest reliability of case formulations, yielding support for the stability of formulations over a 3-month period.


CONCLUSIONS
Reliability of case formulations is varied across a range of theoretical modalities, but can be improved; however, further research is required to strengthen our conclusions.


PRACTITIONER POINTS
Clinical implications: The findings from the review evidence some support for case formulation being congruent with the scientist-practitioner approach. The reliability of case formulation is likely to be improved through training and clinical experience. Limitations: The broad inclusion criteria may have introduced heterogeneity into the sample, which may have affected the results. Studies reviewed were limited to peer-reviewed journal articles written in the English language, which may represent a source of publication and selection bias.

Due to various schools within the profession of psychology, formulations inevitably focus on different aspects of a case depending on the theoretical orientation of the clinician. For example, a cognitive therapist is likely to focus on cognitive mechanisms, whereas a psychodynamic therapist may focus more on unconscious processes. Furthermore, formulations can be developed at the problem level or the case level. The former focuses on a specific issue whereas the latter takes account of all of the client's difficulties.
It has been purported that formulation follows the scientist-practitioner approach (Tarrier & Calam, 2002) by utilizing an evidence base to understand a concept. More specifically, formulation uses 'psychological science to help solve human problems' (DCP, 2010, p. 3). For the cognitive model (Beck, 1976) in particular, formulation has been described as 'the heart of evidence-based practice' (Kuyken, Fothergill, Musa, & Chadwick, 2005, p. 1188). However, for a skill considered so pertinent to the role of a clinical psychologist, there is a paucity of empirical research and formulation should be open to scientific examination. One area of scientific investigation concerns reliability. Investigations into the reliability of formulation can be traced back to 1966, when Philip Seitz (1966) published 'the consensus problem in psychoanalytic research' (p. 206). This article detailed a 3-year research study involving a group of six psychoanalysts, which concluded that agreement was achieved in very few of the formulations. Seitz refers to one possible reason for this being the 'inadequacy of our interpretive methods' (p. 214), where participants demonstrated the tendency to develop complex inferences regarding the cases. Seitz also recognized that participants had the tendency to rely on intuitive impressions without critically checking these. The overall value of Seitz's work was in highlighting the 'consensus problem'; however, in the years following, a range of researchers sought to improve reliability in case formulations. One key researcher and the first to achieve this was Lester Luborsky (1977) using his core conflictual relationship theme (CCRT) method. Whilst the majority of formulation methods were developed within a psychodynamic framework, other methods such as cognitive-behavioural, behavioural and integrative have also been proposed (Eells, 2007).
Formulations must not be wholly subjective; it is therefore important to understand and establish reliability. Bieling and Kuyken's (2003) review of the literature on cognitive case formulation concluded that good levels of reliability have been obtained for descriptive aspects of a case, with reliability decreasing for the more inferential and theory-driven aspects. In addition, they briefly reviewed the psychodynamic literature, which showed promising results for reliability. Other reviews of case formulation literature have reported similar results (Aston, 2009; Mumma, 2011). Most research has focused on inter-rater reliability, that is, the rate of consistency between clinicians on aspects of a case. Test-retest reliability, that is, whether formulations remain stable over time, has received much less attention (Bieling & Kuyken, 2003), with some research evident in relation to the psychodynamic model (e.g., Barber, Luborsky, Crits-Christoph, & Diguer, 1995) and none known for the cognitive model. Whilst the Bieling and Kuyken (2003), Aston (2009) and Mumma (2011) reviews considered some of the available research in the area, they were not systematic reviews and were by no means exhaustive.

Rationale and aim
Whilst it could be argued that some theoretical modalities place more emphasis on the relation of formulation to evidence-based practice, it is a skill central to the work of all clinical psychologists. It is therefore important to develop a scientific foundation for formulation, which includes reviewing the reliability. Clinical psychologist Gillian Butler (2006) suggests that low reliability is inevitable due to there being no one correct way to formulate. She argues that clinicians presented with the same information may well develop alternative formulations, even if they are formulating from the same psychological model. Whilst this may be the case, the literature suggests a tension between formulation being viewed as a 'science' (with its trappings of measurability, reliability, etc.) and an 'art' (with its emphasis on an idiographic approach that is beyond the realms of scientific scrutiny). We therefore felt that, from a scientific perspective, a systematic literature review was necessary to draw conclusions from the available literature about constructs related to reliability. To date, there has been no systematic literature review on any aspect of formulation. The overall aim of this systematic review is to answer the following question: What is the reliability of case formulations? In attempting to answer this question, we focus on the reliabilities of various theoretical modalities, and comment on the overall quality of the primary research.

Method
Due to previous reviews of case formulation (e.g., Aston, 2009; Bieling & Kuyken, 2003), we were aware that the number of studies that examined the reliability of case formulation would be limited. Therefore, no restriction in date was applied other than the start of the searched databases.

Inclusion criteria
Studies were eligible if they:
- Examined the inter-rater or test-retest reliability of case formulation. This required reporting the results of a reliability measure.
- Outlined the theoretical model of the formulation method, as psychology is a profession based on a variety of theoretical modalities.
- Included adult, child formulations, or both.
- Investigated the reliability of case formulations developed by any mental health professional, including studies that utilized a combination of clinicians and students.
- Were peer-reviewed journal articles, to control for quality.
- Were written in the English language.

Exclusion criteria
Studies were excluded if they:
- Had formulators recruited entirely from a student population, which would reduce the ecological validity of this review.
- Consisted of a review of previous research with no new research being undertaken.
- Focused on the assessment and reliability of measures that may serve to influence the process of formulation, for example, pre-therapeutic assessment tools.

Search overview
Studies were accessed through a range of databases in addition to reference list trawling. This included reference list trawling of reviews of formulation research such as those of Barber and Crits-Christoph (1993) and Bieling and Kuyken (2003). Five databases were searched in April 2014: PsycINFO (1806); MEDLINE (1948); AMED (1985); CINAHL (1981); Web of Science (1900). These databases are similar to the ones utilized in a narrative review of the case formulation literature (Aston, 2009), and cover journal articles that relate to psychology.
The following search terms were used: formulation OR case formulation OR case conceptualization OR case conceptualization AND statistical reliability OR reliability OR inter-rater reliability OR inter-rater reliability OR test reliability OR test-retest reliability. These search terms yielded a total of 4,318 articles from all five databases. After applying the inclusion and exclusion criteria in addition to reference list trawling, 18 articles remained (see Figure 1 for the QUOROM diagram).

Data extraction
Specific data were extracted for each selected study. Table 1 details the data extracted.

Assessment of methodological quality
Several scales have been developed to assess methodological quality of studies and to standardize this process. However, due to the variability in research designs of the selected articles, none of the pre-existing tools could be applied to the current review. This follows from suggestions that authors can develop their own tool by adapting available tools (e.g., Parker, 2004). Therefore, our quality assessment tool was developed with reference to the Critical Appraisal Skills Programme (CASP, 2004) and the Newcastle-Ottawa Scale (Wells et al., 2010). The resultant tool comprised five questions, each with a rating out of three and a total sum of 15. Furthermore, a separate section incorporated information regarding additional sources of bias to provide further information related to quality. This information is tabulated in Table 2. For the current review, a hierarchy was decided upon for the reliability data. This was based on ecological validity and the real life experiences of formulators. Therefore, video recordings were assigned a score of 3, audio recordings were assigned a score of 2, and transcripts or written vignettes were assigned a score of 1. Furthermore, if reliability data were not outlined then studies were also assigned a score of 1. With regard to reliability measurement, studies were assigned a score of 3 if they used an appropriate statistical measure of reliability (that accounts for chance agreement) for all aspects of data analysis.
To obtain a score of 3, studies could also incorporate percentage agreement but would be required to report statistical measures also for the same focus of agreement or reliability. A score of 2 was assigned if studies used statistical measures for some aspects and percentage agreement for others, and a score of 1 was assigned if studies used percentage agreement only.
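The two scoring rules described above can be sketched in code. This is only an illustrative reading of the text, not the authors' actual tool: the function names, the material labels, and the convention that any measure other than "percent" counts as chance-corrected are all assumptions made for the example.

```python
# Illustrative sketch of two of the five quality items described in the text.
# All names here are hypothetical; only the 1-3 scores come from the review.

MATERIAL_SCORES = {"video": 3, "audio": 2, "transcript": 1, "vignette": 1}


def score_material(material):
    """Score the formulation material; material that is not outlined (None) scores 1."""
    if material is None:
        return 1
    return MATERIAL_SCORES[material]


def score_measurement(aspects):
    """Score the reliability measurement of a study.

    `aspects` maps each analysed aspect to the set of measures reported for it,
    e.g. {"overt problems": {"percent"}, "mechanisms": {"kappa", "percent"}}.
    Assumption: any measure other than "percent" accounts for chance agreement.
    """
    if not aspects:
        raise ValueError("at least one analysed aspect is required")
    chance_corrected = [bool(measures - {"percent"}) for measures in aspects.values()]
    if all(chance_corrected):
        return 3  # statistical measure for every aspect (percent may accompany it)
    if any(chance_corrected):
        return 2  # statistical measure for some aspects, percent only for others
    return 1      # percentage agreement only


example = {"overt problems": {"percent"}, "mechanisms": {"icc"}}
print(score_material("audio"), score_measurement(example))
```

Under this reading, a study that reports percentage agreement alongside a chance-corrected statistic for the same aspect still scores 3, which matches the rule stated in the preceding paragraph.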
Inter-rater reliability for quality assessment was assessed through the use of a second rater (LB) scoring over 20% of the studies. For this review, articles were not excluded through quality assessment. This was to avoid excluding potentially relevant studies. However, the quality assessment tool informed the interpretation of the findings. Quality rating was conducted by two reviewers (LF and LB) and inter-rater reliability between the two raters resulted in 83% agreement. Discrepancies were resolved through discussion. The 18 studies included in the review yielded total quality scores ranging from 7 to 14.

Results
All 18 studies tested the inter-rater reliability of formulations with one study (12) also investigating test-retest reliability. This was achieved by investigating the stability of formulations over a 3-month period in the absence of new client information.

Study location
The majority of studies were conducted in the USA; two studies were conducted in England (1 and 5) and one study was conducted in Israel (9).

Participant demographics
In total, the studies used data from 152 client participants and between 550 and 553 formulators/raters. The exact number is not available as Studies 10 and 11 provided participant numbers within a range. The majority of studies explicitly detailed the demographics of participants, including reference to amount of clinical experience and professional role. However, several studies (11, 14, 15 and 18) provided only minimal demographics, often referring to 'experienced clinicians' with no information as to how they defined experience.

Training
Some studies reported training that was offered to participants as part of their participation (1, 2, 3, 8, 9, 12, 15 and 16). Another study directly recruited from training courses (5). Some studies did not refer to training (4, 6, 7, 9, 10, 11, 14 and 18), but it is possible that participants had completed training as part of their professional role. A study where training was completed (12) demonstrated higher levels of agreement in comparison to a study that offered no training (7). However, for two studies where participants completed training as part of the research (2 and 3), intraclass correlation coefficients ranged between .07 and .70 for underlying mechanisms, and percentage agreement on a client's presenting problems ranged between 13% and 100%.

Sample
Some studies (6, 9, 10, 11, 12, 13 and 14) used clinicians to assess the reliability of case formulations. However, several studies used both clinicians and students (1, 2, 3, 4, 5, 16 and 17), although the students in study number 17 were recruited to assess the similarity of formulations as opposed to the formulators. Two studies (8 and 15) recruited both clinicians and research assistants.
Results from Studies 1 and 5 indicated that greater clinical experience was linked to increased agreement with the benchmark formulations. Study 1 found that the prequalified student participants were least likely to identify an important aspect of the benchmark formulation. However, it is of note that for some of the inferential aspects of the formulation, pre-qualification students actually demonstrated a higher rate of agreement in comparison to the accredited practitioners.

Formulation data
The majority of studies provided participants with one source of material from which to formulate clients' problems, including transcripts (2, 4, 8, 9, 10, 14 and 15), written vignettes (7 and 11), audio recordings (2, 3, 6 and 13), video recordings (1, 5, 13 and 16), or a combination (2 and 13). However, one study did not outline the full material used but referred to a written narrative for one of the two cases (12). Two studies (1 and 5) used multiple sources of information (video and assessment measures), which could provide more ecological validity with clinical practice. However, factors that may have served to decrease ecological validity were also present, including the use of a fictional vignette (1) and having an actor role play a client (5). In comparison to other studies, using more than one source of material did not appear to increase levels of agreement (1 and 5).
Some studies asked participants to combine their formulation items with those considered plausible but less relevant (6 and 9), which were then rated by separate participants. This could be a potential source of bias, with the alternative formulations being 'straw men' and therefore easier to rate, inflating reliability. One study combined formulations of a different theoretical modality, which may have inflated reliability through theoretical bias (12).
Several studies provided participants with standard categories to choose from, for example, lists of wishes and fears for the CCRT method (8, 15 and 18) and lists of underlying cognitive mechanisms (2 and 3). Although this could serve to increase reliability, the results from the current review do not necessarily support this and results ranged from slight to substantial for cognitive formulations (2 and 3) and moderate to substantial for psychodynamic formulations (8, 15 and 18).

Blinding
The majority of studies evidenced no use of blinding (1, 2, 3, 4, 5, 6, 8, 10, 14, 17 and 18) which may have introduced bias into the research. Several studies, however, did evidence some or full use of blinding (7, 9, 11, 12, 13, 15 and 16). For example, study number 11 used reliability judges that adhered to the same theory as the formulators but were blind to the identities of the client, therapist and treatment outcome. Study number 15 used matched and mismatched formulations based on three possible types, that is, mismatched for diagnosis but matched for gender. Similarity judges in this study were blind to the hypothesis and comparison type.

Reliability measurement
Although percentage agreement can be used as a measure of reliability, it has been described as flawed in many respects (Hayes & Krippendorff, 2007) as it does not take into consideration agreement based on chance (McHugh, 2012). To account for this, alternative measures of reliability were developed such as Cohen's kappa (Cohen, 1960) and the intraclass correlation coefficient (Shrout & Fleiss, 1979).
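The difference between percentage agreement and a chance-corrected measure can be illustrated with a short sketch. The ratings below are invented for illustration and do not come from any reviewed study; the function names are likewise hypothetical. The kappa calculation follows the standard definition, observed agreement minus expected chance agreement, divided by one minus expected chance agreement.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of items on which two raters give the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal categories."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    p_expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_expected == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_observed - p_expected) / (1 - p_expected)


# Invented data: two clinicians judge whether a schema is present in ten cases.
clinician_1 = ["yes"] * 8 + ["no"] * 2
clinician_2 = ["yes"] * 7 + ["no", "yes", "no"]

print(percent_agreement(clinician_1, clinician_2))
print(cohens_kappa(clinician_1, clinician_2))
```

In this invented example the clinicians agree on 8 of 10 cases, yet because both say "yes" most of the time, much of that agreement would be expected by chance and kappa is considerably lower than the raw percentage.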

Key findings
Results from the 18 studies demonstrated agreement and reliability ranging from virtually none to substantial (14). Six studies yielded slight to substantial reliability (2, 3, 4, 7, 9 and 10). However, fair to moderate reliability was found in two studies (16 and 18), moderate reliability was demonstrated in two studies (13 and 15), moderate to substantial reliability was shown in four studies (8, 11, 12 and 17), and substantial reliability was shown in one study (6). The full range of reliability across all studies included in the review (excluding 1 and 5 which used percentage agreement) is detailed in Table 3.

Cognitive formulations
The reliability of cognitive formulations ranged from virtually none to substantial. In general, there was higher agreement for the descriptive aspects of cognitive formulations, that is, overt difficulties (1, 2 and 3) and less agreement for the more theory-driven inferential aspects (1 and 5). Although not accounting for agreement based on chance, studies that employed purely percentage agreement for all aspects of the formulation demonstrated less than a third of items meeting the 70% threshold (1 and 5). With the limitations associated with percentage agreement, it is possible that agreement may be even less. Although these studies used both clinicians and students, these findings were maintained when levels of agreement were considered for the accredited clinicians only. Case level formulations appeared to produce fairly low levels of reliability (1 and 5), with problem/situation specific formulations yielding substantial reliability (4). However, when study number 2 attempted to increase the reliability of a case level formulation by replicating the study but providing clinicians with specific contexts in which to rate a client's schemas, reliability was not increased (3).

Behavioural formulations
As there was only one study that used a behavioural formulation (7), comparisons with the same theoretical modality were not possible. However, it is of note that this study demonstrated low percentage agreement (30-43%) in the identification of overt problems in addition to low coefficients (.13-.58).

Psychodynamic formulations
Formulations developed from a psychodynamic theoretical modality mainly yielded moderate to substantial reliability (8, 13, 15 and 18). However, study number 16 generated fair to moderate levels of reliability. Clinicians in study 14 largely demonstrated moderate to substantial levels of reliability for order, that is, whether items are ranked in a similar way. However, for agreement in relation to the magnitude of items, scores ranged between slight and substantial. It should be noted that scores in Studies 14 and 15 were pooled over four judges, which may have served to inflate reliability.

Integrative formulations
When correlations were pooled and averaged over several judges, results for integrative formulations demonstrated moderate to substantial reliability (10, 11, 12 and 17). However, when an average was taken for a single judge, reliability appeared to be in the slight/fair to substantial range (9 and 10), which demonstrates that pooling scores can serve to inflate reliability. There was only one study that assessed the test-retest reliability of formulations (12). Integrative formulations demonstrated good stability over a 3-month period through Pearson product-moment and percentage agreement ratings (85-97% for Case A and 90-96% for Case B; 12).
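Why pooling over judges tends to raise reliability can be illustrated with the Spearman-Brown formula, a standard psychometric result that relates the reliability of a single judge to the reliability of the mean of k judges. This is offered as a general illustration only: the text does not state that the reviewed studies used this formula, and the input value below is invented.

```python
def pooled_reliability(single_judge_r, k):
    """Spearman-Brown estimate of the reliability of the mean of k judges.

    single_judge_r: reliability of one judge (e.g. the average inter-judge correlation)
    k: number of judges whose scores are pooled
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * single_judge_r / (1 + (k - 1) * single_judge_r)


# Invented single-judge value, pooled over 1, 2 and 4 judges.
for judges in (1, 2, 4):
    print(judges, round(pooled_reliability(0.40, judges), 2))
```

With the invented single-judge value of .40, pooling over four judges gives an estimate of roughly .73, so the same underlying level of agreement can fall into a different descriptive band depending on whether single-judge or pooled figures are reported.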

How reliable are case formulations?
This review investigated the reliability of case formulations. Studies yielded mixed results, with reliability mainly ranging from slight to substantial. Reliability did not appear to increase when formulators were asked to identify discrete areas, for example, overt problems in behavioural formulations (7). However, results indicated moderate agreement for the identification of underlying cognitive mechanisms and overt problems in cognitive formulations (2 and 3). When comparing different theoretical modalities, psychodynamic formulations appear to generate higher levels of reliability; however, these studies utilized methods that may have inflated reliability, such as using standard categories (8, 15 and 18) and pooling the scores of judges (14 and 15). In general, results indicate that reliability in case formulation can be achieved across all modalities. However, it is difficult to draw clear conclusions due to the dearth of literature, the varying methodologies employed and the limitations associated with these.
One methodological limitation concerns the use of students (1, 2, 3, 4 and 5). It is difficult to ascertain the standard at which a student can formulate. In clinical psychology doctoral training programmes, case formulation features heavily and is an essential skill that all trainees are required to develop (BPS, 2011). There is much less of an emphasis on case formulation in undergraduate or masters level psychology courses. It is therefore questionable at what level a psychology student or graduate could formulate. This is particularly relevant when considering the more inferential and theory-driven aspects of a case that may require advanced training and clinical experience. Although some studies incorporated training into their research (1, 2, 3, 8, 9, 12, 15 and 16) or recruited from training programmes (5), it is questionable whether this can be comparable to the years of experience a clinician may have within a particular theoretical modality.
Another limitation concerns the use of transcripts (2, 8, 9, 10 and 14). It has been argued that using transcripts is likely to increase reliability due to the decreased chance of idiosyncrasies (Barber & Crits-Christoph, 1993). In general, these studies indicated a moderate level of reliability. However, as reliability ranged from slight to substantial it is difficult to draw clear conclusions regarding the impact that the use of transcripts has on inter-rater reliability. It could be argued that the use of transcripts, vignettes, and audio recordings decreases the ecological validity of case formulation research. In clinical practice, formulations are largely developed through engagement and exploration with the client and the use of such materials loses the collaboration that is often associated with formulation. Whilst this is unlikely to be overcome for research, it could be argued that using such materials prevents the formulator from noticing and interpreting non-verbal cues. Therefore, studies that incorporate video recordings (1, 5, 13 and 16) may be more ecologically valid than transcripts or vignettes.
The difficulties in forming conclusions are further compounded by the different measures of reliability employed by the studies. The variability and levels of agreement required by measures make comparisons difficult (Bland & Altman, 1990). Furthermore, the use of certain statistical methods has been criticised. For example, Rankin and Stokes (1998) suggest that the Pearson correlation coefficient is inappropriate because the measurement responds to the linear association as opposed to agreement. It is of note that this measure was used in the only known study of test-retest reliability of case formulation (12). In addition, there are several intraclass correlation coefficient equations available, which can result in different values being produced from the same data (Shrout & Fleiss, 1979). It has been suggested that Cohen's kappa is the most appropriate statistical measure of reliability for nominal data, weighted kappa is most appropriate for ordinal data and the one-way analysis of variance is the most appropriate measure for continuous data (Haas, 1991).
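The criticism of the Pearson coefficient can be made concrete with a small sketch. The ratings below are invented: the second rater scores every item exactly two points higher than the first, so the two sets of scores are perfectly linearly related although the raters never give the same score. The function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def exact_agreement(x, y):
    """Proportion of items given identical scores by both raters."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Invented ratings on a 1-7 scale; rater B is systematically two points higher.
rater_a = [1, 2, 3, 4, 5]
rater_b = [3, 4, 5, 6, 7]

print(pearson_r(rater_a, rater_b))
print(exact_agreement(rater_a, rater_b))
```

Here the correlation is at its maximum while exact agreement is zero, which is the situation Rankin and Stokes (1998) warn about: a high Pearson coefficient shows that scores rise and fall together, not that raters assign the same values.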

Implications for clinical practice
Although results from the current review highlight that reliability in case formulation can be achieved, from a scientific perspective, the wide range in levels of inter-rater reliability provides modest support for formulation being at 'the heart of evidence-based practice' (Kuyken et al., 2005, p. 1188). Results from the current review suggest that training may lead to higher levels of agreement and reliability, particularly with inferential and theory-driven aspects of a case. Therefore, in clinical practice, it is plausible to suggest that training and greater clinical experience may serve to increase reliability between clinicians. Although not linked to reliability, research suggests that an increase in training leads to higher quality case formulations (Kendjelic & Eells, 2007). Whilst several of the studies offered training to their participants, the amount of training varied and it is unlikely that a single workshop would be adequate to develop the skill of formulation.
A possible explanation for the limited inter-rater reliability for the inferential aspects of formulation concerns the cognitive shortcuts that therapists may take, such as availability and anchoring heuristics (Corrie & Lane, 2010; Kuyken, Padesky, & Dudley, 2009). It is of note that study number 16 requested that participants include supporting evidence for their formulations and generated fair to moderate reliability. The authors suggest that this may have kept inferences at a low level. Whilst inferential aspects of a formulation are important, providing evidence may limit the number of cognitive shortcuts that therapists take, potentially leading to higher reliability.
High levels of reliability do not necessarily imply validity and it is questionable whether validity could be scientifically evaluated, particularly with clients who are acquiescent with credible formulations. Butler (2006) suggests 'formulations, as hypotheses (are) a way of making theory-based guesses' (p. 9) and therefore may be very different but potentially equally valuable. However, this poses questions regarding the implications for treatment outcome if reliability between clinicians is low and different areas are being targeted through treatment. It should be noted that not everyone advocates the importance of case formulation reliability. For example, Wilson (1996) has likened the case formulation to clinical judgement, which he argues 'can be all too fallible' (p. 299). He has therefore placed more emphasis on treatment outcome, arguing that standardized manual-based treatments are no less effective than formulation-based individualized therapy (Wilson, 1996). Unfortunately, there is a paucity of research investigating the link between formulation and treatment outcome (BPS, 2011).
4. With regard to formulation data, it has been argued that the use of transcripts may lead to scientific bias with researchers selecting particular cases or small samples (Barber & Crits-Christoph, 1993). As explained previously, this may also lead to important non-verbal cues being missed. To provide more ecological validity, future research should use audiovisual material.
5. Developing formulations in teams is likely to increase reliability. Although formulations can be developed as part of a multi-disciplinary team (BPS, 2011; Johnstone & Dallos, 2006, 2014), such research may have limited applicability to clinical practice where clinicians work alone. Therefore, future research should compare the reliability of two or more independent formulators.
6. The BPS (2011) highlights best practice guidelines for the use of formulation, which includes grounding formulation in an appropriate level of assessment. This is likely to include information from multiple sources, such as assessment measures and clinical interview. In this way, information can be triangulated to provide a comprehensive understanding of the client. Although two studies (1 and 5) used more than one source of material, that is, video recordings and assessment measures, most studies used only one. Future research may benefit from examining differences in reliability when participants are provided with more than one source of client information.

Limitations
One potential limitation of the current review concerns the broad inclusion criteria, which may have introduced heterogeneity into the sample, potentially affecting the results. This is one possible reason why the range of reliability was so varied, from virtually none to substantial. As can be seen from the data extraction table (Table 1), there were a variety of disorders within the client samples and a range of professions in the formulator and rater samples. It is possible that narrower inclusion criteria may have affected the levels of agreement and reliability, and subsequently the conclusions that can be made. Furthermore, we only included peer-reviewed journal articles written in the English language, which may represent a source of selection and publication bias.

Conclusion
This review has shed light on the reliability of case formulation and demonstrated that it can be achieved through a range of psychological modalities. However, this review has also highlighted that, for a skill so pertinent to the profession of clinical psychology, this is a fairly under-researched area that requires further investigation.