Identifying Candidate Risk Factors for Prescription Drug Side Effects using Causal Contrast Set Mining

Big longitudinal observational databases present the opportunity to extract new knowledge in a cost-effective manner. Unfortunately, the ability of these databases to be used for causal inference is limited because the data are collected passively, which results in various forms of bias. In this paper we investigate a method that can overcome these limitations and determine causal contrast set rules efficiently from big data. In particular, we present a new methodology for identifying risk factors that increase a patient's likelihood of experiencing the known rare side effect of renal failure after ingesting aminosalicylates. The results show that the methodology was able to identify previously researched risk factors, such as being prescribed diuretics, and highlighted that patients with a higher than average risk of renal failure may be even more susceptible to experiencing it as a side effect after ingesting aminosalicylates.


Introduction
Longitudinal observational data potentially hold a wealth of information. However, we are currently limited in our ability to efficiently extract causal relationships from this form of data due to bias and confounding [1]. In randomised clinical trials confounding can be overcome by manipulating the variables and mixing the potential confounders equally between the group given the drug and the control group. Unfortunately, this is not possible for observational data as the data are passively observed. As a consequence, spurious results are common when analysing observational data due to the various forms of bias in the data. In the medical field the gold standard for causal discovery is the randomised clinical trial [2]. However, such trials are costly and sometimes unethical [3]. If medical longitudinal observational data could be successfully analysed and the results used to complement randomised trials for causal discovery, then this would address these issues. This would enable a greater understanding of various medical mechanisms and enhance current knowledge.
Bayesian causal discovery techniques that learn complete causal models have often been used to identify causal relationships in longitudinal observational data [4]. Due to scalability issues the recent focus has shifted towards constraint based methods [5]. Although the constraint based methods have performed well in some domains, they rely on numerous assumptions [6] that may not always hold true and may still be inefficient for data with high volume and high variety. A recent approach for identifying causal association rules used a two step method: firstly mining association rules and secondly implementing a cohort study to filter out those that are unlikely to be causal. This was accomplished by identifying controls that had the antecedent and matched specific attributes of the cases. The odds ratio was then used as the filter, as only the rules with a significant deviation between how often the consequence occurred for the cases and controls were kept [7]. In this paper we attempt a similar approach for identifying causal contrast sets but use logistic regression as a filter. Rather than using the odds ratio, we use the p-values of the logistic regression variables to indicate how significant having the antecedent is for the occurrence of the consequence. As the logistic regression can include covariates such as age and gender in the model, we can filter out contrast set rules that are caused by observed confounders.
In this paper we present a proof-of-concept candidate risk factor detection algorithm based on causal contrast set mining. Causal contrast set mining is a term we use to define the discovery of causal association rules that identify differences between various groups. The algorithm firstly identifies interesting rules consisting of sets of events that commonly precede a user specified event and then investigates how often these interesting rules occur in general. Rules that occur more often before the user specified event are then investigated via a logistic regression model. This reduces age/gender confounding and highlights the most interesting rules. We apply the methodology to a real-world dataset. The dataset we use is a UK general practice database containing complete medical and drug prescription records for millions of patients within the UK. Our focus is on identifying risk factors for patients experiencing prescription drug side effects for the drug family aminosalicylates (5-ASAs). These drugs are often given to treat inflammatory bowel disease but are known to cause renal failure with an incidence rate of 0.17 cases per 100 patients per year [8]. The purpose of this research is to investigate a new technique for mining contrast set causal relationships efficiently and evaluate its potential for identifying candidate risk factors of patients experiencing side effects to prescribed medication.

The Health Improvement Network
The Health Improvement Network (THIN) database (www.thin-uk.com) is a large longitudinal observational database containing medical records for millions of patients within the UK. There are over 600 general practices within the UK that are registered to the scheme, consisting of over 3.5 million active patients. For each patient within THIN, their demographics such as age, gender and location are known, as well as their complete medical and therapy record histories during the period of time they are registered at a participating practice. The suitability of this database for epidemiological study has been investigated and the results show it is reasonably representative of the general UK population [9]. It is worth highlighting that the database does have some potential issues, such as not containing over the counter medication, only containing data that patients have told their doctors about, and delays in the recording of medical events into the database. A common problem with the database is historical event dropping: when a patient moves general practices, it is common for the patient to have historical illnesses/events recorded shortly after registering. To prevent this biasing analyses, it is standard to exclude the first year of a patient's records after moving to a new general practice [10]. This preprocessing was implemented in this study. The READ code system is the coding system used within UK primary care to record medical events [11]. Each READ code corresponds to a medical event (e.g., a diagnosis, an administrative event, a laboratory result or a symptom). The READ codes consist of 5 alphanumeric digits and have a hierarchical tree structure based on the level of detail of the corresponding medical event being recorded. The level of a READ code corresponds to how many non-dot digits it contains. For example, the READ code 'A10..' is a level 3 READ code, whereas the READ code 'A....' is a level 1 READ code.
A level 2 READ code is the child of a level 1 READ code if the READ codes have the same first digit. This is generalised to a level n ∈ {2, 3, 4, 5} READ code being the child of the level n − 1 READ code if the first n − 1 digits of both READ codes are the same. The advantage of this hierarchical structure is that a child READ code represents a more specific version of its parent READ code's corresponding medical event. For example, the READ code 'A....' corresponds to the description 'Infection' and is the parent of the READ code 'A1...' corresponding to 'Tuberculosis', which is the parent of the READ code 'A11..' corresponding to 'Pulmonary tuberculosis'.
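The level and parent rules described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that the dots in a READ code are always trailing padding.

```python
from typing import Optional


def read_level(code: str) -> int:
    """Level of a READ code: the number of non-dot characters it contains."""
    if len(code) != 5:
        raise ValueError(f"READ codes have 5 characters, got {code!r}")
    return sum(1 for ch in code if ch != ".")


def read_parent(code: str) -> Optional[str]:
    """Parent of a level n code: its first n - 1 characters, padded with dots."""
    level = read_level(code)
    if level <= 1:
        return None  # level 1 codes are at the top of the hierarchy
    return code[: level - 1].ljust(5, ".")


def is_child_of(child: str, parent: str) -> bool:
    """True if `parent` is the direct parent of `child`."""
    return read_parent(child) == parent
```

With the example from the text, `read_parent("A11..")` gives `"A1..."` and `read_parent("A1...")` gives `"A...."`.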
Prescriptions are recorded into THIN using a drug code and each prescription also contains the drug's British National Formulary (BNF) code [12]. The BNF code groups drugs into similar families. Each prescription can be linked to up to three BNF codes.

Algorithms
Association rules mining. Association rules mining [13] is a method for discovering relations between variables in large databases. It was originally designed to identify relationships between items that are commonly purchased together (occur in the same shopping baskets). The relations are normally of the form {antecedent events} → {consequence}, meaning that if we find all of the antecedent events in a shopping basket, then we have a good chance of finding the consequence. An example of an association rule is {milk, butter} → {bread}, which means shoppers who buy milk and butter are also likely to buy bread.
The search space for identifying association rules can be extremely large with big datasets. Therefore it is common to restrict the search to only include rules containing sets of items that appear frequently in baskets. This is accomplished by specifying a minimum support threshold, and only items/itemsets whose support exceeds this threshold are considered. These are referred to as frequent itemsets.
Formally, let $I = \{i_1, i_2, \ldots, i_n\}$ be a set of n items and let a transaction $t \subseteq I$ be a set of items. We denote the database by $D = \{t_1, t_2, \ldots, t_m\}$, which is a set of m transactions. The support of an itemset X is the proportion of transactions within the database that contain X,

$$\mathrm{supp}(X) = \frac{|\{t \in D : X \subseteq t\}|}{m}.$$

An itemset X is said to be frequent if its support is greater than a given threshold, supp(X) > ω, where ω is called the minimum support. The confidence of an association rule X → Y is the fraction of baskets that contain both X and Y, supp(X ∪ Y), divided by the fraction of baskets containing X,

$$\mathrm{conf}(X \to Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)},$$

which is similar to the conditional probability of Y given X. In general, the association rules X → Y are identified such that the support and confidence of X → Y are greater than the minimum support and confidence thresholds.

There are various methods for identifying contrast set rules, including discovering emergent patterns by considering the ratio of two supports [14], using a suitable search technique combined with statistical hypothesis testing [15] or creatively using a classifier [16]. Emergent pattern discovery is suitable for simple problems that only require contrasting two groups. This is what we will do to identify candidate risk factors, as we just need to compare the patients that experienced the adverse drug reaction with those that did not.
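As an illustration of the support and confidence definitions, the following minimal Python sketch computes both on a toy set of shopping baskets. The baskets are invented for the example and the function names are hypothetical; no frequent itemset search is performed here.

```python
from typing import FrozenSet, Iterable, Sequence


def support(itemset: Iterable[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Proportion of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: Sequence[FrozenSet[str]]) -> float:
    """supp(X u Y) / supp(X) for the rule X -> Y."""
    x = frozenset(antecedent)
    supp_x = support(x, transactions)
    if supp_x == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support(x | frozenset(consequent), transactions) / supp_x


# Toy baskets, made up purely for illustration.
baskets = [
    frozenset({"milk", "butter", "bread"}),
    frozenset({"milk", "butter"}),
    frozenset({"milk", "bread"}),
    frozenset({"butter", "bread"}),
    frozenset({"milk", "butter", "bread", "eggs"}),
]

supp_mb = support({"milk", "butter"}, baskets)                 # 3 of 5 baskets
conf_rule = confidence({"milk", "butter"}, {"bread"}, baskets)  # 2 of those 3
```

With a minimum support of ω = 0.5, the itemset {milk, butter} would count as frequent in this toy database, since it appears in three of the five baskets.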
Logistic Regression. Logistic regression [17] is a method that expresses the log odds of belonging to a class as a linear combination of the features,

$$\ln \frac{P(Y = 0 \mid X)}{P(Y = 1 \mid X)} = w_0 + \sum_i w_i X_i.$$

The parameters $w_i$ are found using maximum likelihood. This is re-arranged to give the conditional probability of belonging to each class as

$$P(Y = 0 \mid X) = \frac{\exp(w_0 + \sum_i w_i X_i)}{1 + \exp(w_0 + \sum_i w_i X_i)}, \qquad P(Y = 1 \mid X) = \frac{1}{1 + \exp(w_0 + \sum_i w_i X_i)};$$

therefore, class 0 is chosen when $\exp(w_0 + \sum_i w_i X_i) > 1$ and class 1 is chosen otherwise. The estimate of the parameter $w_i$ and its standard error tell us how significant the i-th feature, $X_i$, is in determining the class. In this paper we use a significance level of 5%.
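The decision rule above can be made concrete with a small Python sketch. It assumes, as the rule "class 0 is chosen when the exponential exceeds 1" implies, that the linear combination models the log odds of class 0 relative to class 1. The weights are taken as given, so no maximum likelihood fitting is shown, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def linear_score(w0: float, weights: Sequence[float], features: Sequence[float]) -> float:
    """w0 + sum_i w_i * X_i, the modelled log odds of class 0 versus class 1."""
    return w0 + sum(w * x for w, x in zip(weights, features))


def class_probabilities(w0: float, weights: Sequence[float],
                        features: Sequence[float]) -> Tuple[float, float]:
    """Return (P(Y=0|X), P(Y=1|X)) under the log odds model."""
    z = linear_score(w0, weights, features)
    if z >= 0:
        p0 = 1.0 / (1.0 + math.exp(-z))  # equals exp(z) / (1 + exp(z))
    else:
        e = math.exp(z)                  # avoids overflow for very negative z
        p0 = e / (1.0 + e)
    return p0, 1.0 - p0


def predict(w0: float, weights: Sequence[float], features: Sequence[float]) -> int:
    """Choose class 0 when exp(score) > 1, i.e. when the score is positive."""
    return 0 if linear_score(w0, weights, features) > 0 else 1
```

For example, a score of exactly zero gives equal probabilities for both classes, and the rule then falls through to class 1.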

Methodology
The proposed candidate risk factor identification methodology consists of four steps. The first step is creating two different databases based on whether a patient who was prescribed a 5-ASA experienced renal failure or not. The second step is to identify frequent itemsets for the patients who experience renal failure after 5-ASAs and calculate whether these itemsets occur more often for these patients than for the patients prescribed 5-ASAs in general. This identifies any potential risk factors that are common (occur in more than 5% of the patients). The third step is to identify whether these potential risk factors are a significant influence on experiencing renal failure after a 5-ASA when accounting for age and gender confounding. The final step is presenting the frequent itemsets that occur more often than in general for the patients who experience renal failure after a 5-ASA, ordered by the p-value that indicates the significance of the itemset's presence in predicting the chance of renal failure after a 5-ASA.
Step 1: Partition Databases. Similar to market baskets, patients' medical baskets can be constructed based on the records they have in the THIN database, and frequent itemset mining can be applied to find frequent sets of medical events. Due to the number of possible itemsets being very large, frequent itemset mining is often restricted so that only interesting itemsets are discovered.
To generate association rules for the THIN database we consider the items to be all the medical events and all the drugs recorded within the THIN database. So the THIN items are I = {all the medical events and all the drugs} and a transaction is a set $t \subseteq I$. We then generated two databases from the THIN database: D1 contains the itemsets of patients who took 5-ASA but did not suffer from renal failure within a month, and D2 contains the itemsets of patients who took 5-ASA and suffered from renal failure within a month. Each transaction, $t^{D1}_i \in D1$ or $t^{D2}_i \in D2$, consists of all the items within the THIN database that are recorded for the i-th patient in the corresponding database.
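A minimal Python sketch of this partitioning step is given below. It is an illustration under several assumptions that are not taken from the study: records are (patient id, day, item) triples, the item labels "5-ASA" and "renal failure" are placeholders for the relevant drug and READ codes, and "within a month" is read as within 30 days after a 5-ASA prescription.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Tuple

ASA = "5-ASA"          # placeholder label for any 5-ASA prescription (assumption)
RF = "renal failure"   # placeholder label for a renal failure event (assumption)
WINDOW_DAYS = 30       # "within a month" interpreted as 30 days (assumption)

Record = Tuple[int, int, str]  # (patient id, day of record, item)
Baskets = Dict[int, FrozenSet[str]]


def partition(records: Iterable[Record]) -> Tuple[Baskets, Baskets]:
    """Split 5-ASA patients into D1 (no renal failure in the window) and D2."""
    by_patient = defaultdict(list)
    for pid, day, item in records:
        by_patient[pid].append((day, item))

    d1: Baskets = {}
    d2: Baskets = {}
    for pid, events in by_patient.items():
        asa_days = [day for day, item in events if item == ASA]
        if not asa_days:
            continue  # patients never prescribed a 5-ASA are in neither database
        rf_days = [day for day, item in events if item == RF]
        had_rf = any(0 <= r - a <= WINDOW_DAYS for a in asa_days for r in rf_days)
        # The basket holds every item recorded for the patient, as in the text.
        basket = frozenset(item for _, item in events)
        (d2 if had_rf else d1)[pid] = basket
    return d1, d2


# Toy records, invented for illustration only.
records = [
    (1, 0, "hypertension"), (1, 100, ASA), (1, 110, RF),
    (2, 5, ASA), (2, 400, RF),
    (3, 10, "diabetes"),
    (4, 20, "furosemide"), (4, 50, ASA),
]
d1, d2 = partition(records)
```

In this toy data patient 1 lands in D2, patients 2 and 4 land in D1 (patient 2's renal failure falls outside the window), and patient 3 is dropped because no 5-ASA was prescribed.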
Step 2: Calculating Support Ratio. In general the THIN data are sparse and the majority of items have a low support. However, to identify risk factors for renal failure after ingesting a 5-ASA we only need to investigate the itemsets that are frequent in the patients that took 5-ASA and suffered from renal failure (frequent itemsets in D2). Then we need to find which of these frequent itemsets from D2 have a higher support than within D1, as this indicates itemsets that are more common in the 5-ASA patients who experience renal failure compared to the 5-ASA patients who do not. Therefore, we apply frequent itemset mining to the database D2 with a minimum support of ω = 0.05, and for each frequent itemset we also calculate its support in D1. We then calculate the support ratio for each frequent itemset X from D2,

$$\mathrm{suppRatio}(X) = \frac{\mathrm{supp}_{D2}(X)}{\mathrm{supp}_{D1}(X)} = \frac{|\{t \in D2 : X \subseteq t\}| / m_2}{|\{t \in D1 : X \subseteq t\}| / m_1},$$

where $m_1$ and $m_2$ are the number of patients that took 5-ASA but did not suffer from renal failure and took 5-ASA and suffered from renal failure, respectively. The value ω = 0.05 was chosen as this means that any identified risk factors occur for at least 5% of the patients experiencing renal failure after 5-ASA. Therefore we are identifying common risk factors; however, this value can be adjusted.
After applying frequent itemset mining, we obtain a table containing the frequent itemsets of D2 and their support in both D1 and D2. The support ratio of each frequent itemset corresponds to the ratio of the two support values (support(X,ASA→RF) / support(X,ASA→ ¬RF)), see Table 1. The itemsets with a suppRatio greater than 1 are considered potential risk factors that will be further evaluated using logistic regression.
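The support ratio filter can be sketched as follows. This is a simplified illustration, not the implementation used in the study (which relied on the R package arules): candidate itemsets are passed in directly rather than mined, the toy databases are invented, and the function names are hypothetical. An itemset that never occurs in D1 is given an infinite ratio here, which is a choice made for the sketch.

```python
import math
from typing import Dict, FrozenSet, Iterable, Sequence


def support(itemset: FrozenSet[str], transactions: Sequence[FrozenSet[str]]) -> float:
    """Proportion of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def support_ratio(itemset: FrozenSet[str], d1: Sequence[FrozenSet[str]],
                  d2: Sequence[FrozenSet[str]]) -> float:
    """Support in D2 (renal failure) divided by support in D1 (no renal failure)."""
    s1 = support(itemset, d1)
    s2 = support(itemset, d2)
    return math.inf if s1 == 0 else s2 / s1


def potential_risk_factors(candidates: Iterable[FrozenSet[str]],
                           d1: Sequence[FrozenSet[str]],
                           d2: Sequence[FrozenSet[str]],
                           min_support: float = 0.05) -> Dict[FrozenSet[str], float]:
    """Keep itemsets that are frequent in D2 and have a support ratio above 1."""
    kept = {}
    for x in candidates:
        if support(x, d2) > min_support:
            ratio = support_ratio(x, d1, d2)
            if ratio > 1:
                kept[x] = ratio
    return kept


# Toy databases, invented for illustration only.
d2 = [frozenset(s) for s in ({"furosemide", "hypertension"}, {"furosemide"},
                             {"loperamide"}, {"hypertension"})]
d1 = [frozenset(s) for s in ({"furosemide"}, {"hypertension"}, set(), set(),
                             {"hypertension"}, {"eczema"}, {"eczema"}, {"eczema"})]
candidates = [frozenset({"furosemide"}), frozenset({"hypertension"}),
              frozenset({"eczema"})]
kept = potential_risk_factors(candidates, d1, d2)
```

In the toy data, {furosemide} has support 0.5 in D2 and 0.125 in D1, giving a support ratio of 4, while {eczema} is discarded because it is not frequent in D2.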
Step 3: Logistic Regression. We then applied logistic regression with the independent variables presence of potential risk factor, presence of 5-ASA, age and gender, and the dependent variable indicating renal failure. This identified whether the potential risk factors are in fact significant risk factors for experiencing renal failure after 5-ASAs when accounting for age/gender confounding.
To apply the logistic regression we needed to consider a set of cases (the patients with renal failure recorded in THIN) and a set of controls (the patients with no renal failure recorded in THIN). For each patient experiencing renal failure we selected 5 controls who did not. Increasing the number of controls per case is a technique that can increase the power of the analysis, and 5 controls per case were chosen as we have a large number of controls available but only a limited number of cases. For each case, the age used in the logistic regression is the age at which the case first suffered from renal failure. Each control was selected by picking a random non-renal failure patient and a random point in time while the patient was active in THIN, such that the age/gender distributions of the cases and controls were the same. Then, for each potential risk factor frequent itemset identified in step 2 (each X) we created the case/control data as displayed in Table 2, where the variable X is True if the patient's itemset up to their specified age contains X, the variable ASA is True if the patient was prescribed a 5-ASA before the specified age and RF is True if the patient has renal failure recorded in THIN and False otherwise. The logistic regression with RF as the dependent variable was then applied considering the independent variables: age, gender, X, and ASA. The interaction between the ASA variable and the X variable was also included.
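The construction of one case/control row for a given itemset X can be sketched in Python as below. The `Patient` structure, the field names and the "5-ASA" label are hypothetical, and the sketch reads "up to their specified age" as inclusive and "before the specified age" as strict, which is an interpretation of the wording. Fitting the logistic regression itself is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple, Union


@dataclass
class Patient:
    age: float            # age at first renal failure (case) or at the sampled time (control)
    gender: str
    renal_failure: bool   # True for cases, False for controls
    events: List[Tuple[float, str]]  # (age at event, item) pairs


def case_control_row(patient: Patient, itemset: Iterable[str],
                     asa_item: str = "5-ASA") -> Dict[str, Union[float, str, bool, int]]:
    """Build the variables age, gender, X, ASA, their interaction and RF."""
    # Items recorded up to (and including) the specified age.
    history = {item for age, item in patient.events if age <= patient.age}
    # A 5-ASA prescription strictly before the specified age.
    prior_asa = any(item == asa_item and age < patient.age
                    for age, item in patient.events)
    has_x = frozenset(itemset) <= history
    return {
        "age": patient.age,
        "gender": patient.gender,
        "X": has_x,
        "ASA": prior_asa,
        "X_ASA": int(has_x and prior_asa),  # interaction term between X and ASA
        "RF": patient.renal_failure,
    }


# Toy patients, invented for illustration only.
case = Patient(62.0, "F", True,
               [(55.0, "furosemide"), (61.5, "5-ASA"), (70.0, "hypertension")])
control = Patient(40.0, "M", False, [(45.0, "5-ASA")])

row_case = case_control_row(case, {"furosemide"})
row_control = case_control_row(control, {"furosemide"})
```

Events recorded after the specified age, such as the hypertension record at age 70 for the toy case, do not contribute to X or ASA.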
Step 4: Ranking. The p-value of the interaction between the frequent itemset and 5-ASA was calculated to evaluate whether the frequent itemset is a risk factor for experiencing renal failure after 5-ASA. The smaller the p-value is, the greater the confidence that the frequent itemset corresponds to a risk factor. The p-value of each frequent itemset is extracted and listed in the result table.
The results are returned ordered by the p-values in ascending order. The final output of the methodology is this ranked list of frequent itemsets as illustrated in Table 3.

Table 3. Example of the output of the methodology.
Itemset (X) | P-value(Age) | P-value(Gender) | P-value(ASA*Rules)

Software
We use SQL to manage the data and R [18] to perform the analysis. The package arules [19] was used to identify the frequent itemsets.

Results & Discussion
The top 30 antecedents that occur significantly more often for patients who experience renal failure after ingesting a 5-ASA, ordered by the logistic regression p-value, are presented in Table 4. The results suggest that some potential risk factors for experiencing renal failure after ingesting a 5-ASA are hypertension, diuretics, pain, arthritis, diabetes, influenza vaccination, anaemia, dehydration and antibiotics. The results identified some known risk factors. However, in general there is little information about the risk factors, making the evaluation difficult. This highlights the importance of a new methodology for discovering risk factors. In a previous study it was observed that diuretics and dehydration may be risk factors [20]. The diuretic drug furosemide was ranked second by the methodology and patients with a history of furosemide were 3.7 times more likely to experience renal failure after 5-ASAs. We found that those with a history of co-proxamol and furosemide were 4.89 times more likely to experience renal failure after 5-ASAs. The drug loperamide was also identified as a risk factor by the method. This drug is used to treat diarrhoea and may indicate that the patients who experienced renal failure after loperamide and 5-ASAs were dehydrated.
Hypertension is a general risk factor for developing renal failure. Interestingly, this research suggests that 5-ASAs increase the susceptibility of patients with hypertension to renal failure. Therefore 5-ASAs may need to be prescribed more carefully to patients who are already susceptible to renal failure. It is common for side effects to occur in patients that have a higher background risk of the event, so this is not unexpected.
Some painkillers and drugs used to treat hypertension are known to cause renal failure. The identification of pain and hypertension as risk factors may indicate an interaction between these drugs and the 5-ASAs that results in the side effect of renal failure. Therefore the methodology may highlight indirect risk factors. This highlights one limitation of the methodology: it is difficult to identify whether the medical event or the drugs used to treat the medical event are the risk factors. Additional work will be required to determine whether an identified potential risk factor is a direct or indirect risk factor.
It is worth highlighting that this methodology cannot definitively determine the risk factors of known adverse drug reactions (ADRs). Any results obtained need to be validated via formal epidemiological studies. However, this method can highlight the most likely risk factors and can be considered to be a filter. Therefore this methodology may lead to more efficient discovery of unknown risk factors by identifying which candidate risk factors should be investigated further. Effectively this methodology is an ADR risk factor filter.
In this paper we chose to use a minimum support of 0.05 as this ensured any identified risk factors occurred for more than 5% of the patients who experienced the side effect. This value may need to be adjusted based on the type of risk factors of interest or based on how common the side effect being investigated is.

Conclusions
In this paper we have presented a proof-of-concept of a novel methodology for identifying causal contrast set rules in big longitudinal observational data. The methodology was able to identify known risk factors for patients experiencing renal failure after ingesting a 5-ASA drug. However, this methodology cannot be considered to definitively identify risk factors. Rather, it acts as a filter for highlighting the most interesting candidate risk factors.
Potential areas of future work are developing a way to tune the minimum support used to identify the frequent itemsets and applying the methodology to a range of known prescription side effects to determine its robustness.