A Novel Semi-Supervised Algorithm for Rare Prescription Side Effect Discovery

Drugs are frequently prescribed to patients with the aim of improving each patient's medical state, but an unfortunate consequence of most prescription drugs is the occurrence of undesirable side effects. Side effects that occur in more than one in a thousand patients are likely to be signalled efficiently by current drug surveillance methods. However, these same methods may take decades before generating signals for rarer side effects, risking medical morbidity or mortality in patients prescribed the drug while the rare side effect is undiscovered. In this paper we propose a novel computational meta-analysis framework for signalling rare side effects that integrates existing methods, knowledge from the web, metric learning and semi-supervised clustering. The novel framework was able to signal many known rare and serious side effects for the selection of drugs investigated, such as tendon rupture when prescribed Ciprofloxacin or Levofloxacin, renal failure with Naproxen and depression associated with Rimonabant. Furthermore, for the majority of the drugs investigated it generated signals for rare side effects at a more stringent signalling threshold than existing methods, and it shows the potential to become a fundamental part of post-marketing surveillance to detect rare side effects.


I. INTRODUCTION
Negative side effects caused by prescribed medication currently present a huge burden for the healthcare service, both in causing patient morbidity or mortality and in costing large sums of money [1] [2] [3]. Investigations have shown that the rate of unwanted side effects has been increasing annually [4] [5].
Possible reasons for this are an increase in the number of annual prescriptions due to an aging population or an increase in polypharmacy, when numerous drugs are prescribed at the same time [6]. Although it is common for a patient to develop side effects due to prescribed medication, there is currently no efficient means of identifying all the side effects of a drug. When the side effect is detrimental to the patient's quality of life, it is often referred to as an Adverse Drug Event (ADE) and when the drug causing the ADE is known, it is termed an Adverse Drug Reaction (ADR). A study conducted in the UK between November 2001 and April 2002 indicated that 6.5% of admissions to hospital were due to ADRs, with a mortality rate for ADR patients of 2.3% [7]. Interestingly, it was found that over 70% of these ADRs were potentially avoidable. A more recent study in Brazil suggests ADRs may be the cause of an even higher proportion of hospital admissions for the elderly, as it showed that ADRs were the cause of hospitalisation for over 50% of elderly patients [8]. It also highlighted that a significant factor for developing an ADR was polypharmacy [8]. Some obvious ADRs can be discovered during the experimental stages of a drug's development, but the occurrence of an ADR can depend on a multitude of factors and it is impossible to investigate all the possible situations that may occur when the drug is taken. For example, testing for ADRs that result from polypharmacy would require clinical trials with millions of people to be able to investigate all the different drug combinations, and this is not possible. Due to the limitations of clinical trials, rare ADRs, including fatal ones, are in most circumstances not discovered before a drug is marketed [9] [10]. As a consequence, after a drug is approved and available to patients, possible ADRs are investigated during the whole lifetime of a drug by a process known as post-marketing drug surveillance.
Post-marketing surveillance (such as doctors being vigilant and noticing possible drug and illness associations) can identify common ADRs, and in general the more common the ADR is, the fewer patients need to be prescribed the drug before it is discovered. However, rare ADRs, or ADRs that occur for drugs that are rarely prescribed, may go unnoticed by medical practitioners and may cause morbidity or mortality in patients that could have been prevented with more efficient drug surveillance methods. For example, it took 23 years before there was sufficient evidence that the drug Tamoxifen, used to treat breast cancer, caused endometrial cancer in about 1 in 6000 patients [11] [12].
Current methods to discover rare ADRs often involve using a Spontaneous Reporting System (SRS) database that contains a collection of voluntary suspected drug and ADR reports, such as the database containing information from the UK yellow card scheme. The algorithms that signal ADRs by mining SRS databases calculate a measure of how disproportionally more often the medical event is reported with a specific drug of interest compared with any drug. The frequently implemented measures of disproportionality involve using standard epidemiology measures [13], estimating the information component using a neural network approach [14] or calculating a modified version of the relative risk by applying a Bayesian model [15]. SRS databases combine reports of possible ADRs from a large population, enabling the identification of possible ADR signals more efficiently, but they are known to suffer from under-reporting [16] and this causes a lag in the time it takes to confidently signal a potential ADR. The under-reporting may also prevent the detection of rare ADRs, as these ADRs may never be suspected and therefore never be reported to an SRS database.
A potential new way to detect the rare ADRs that cannot be identified by doctors or by current methods applied to SRS databases is to use The Health Improvement Network (THIN) database (www.thinuk.co.uk), an Electronic Healthcare Database (EHD) containing complete UK General Practice records for registered patients. The THIN database contains all medical events (such as illnesses, laboratory results, signs and symptoms or administrative events) that a doctor is informed of for a patient, as well as their complete prescription histories. Therefore, any rare ADRs that are serious enough to be reported to a doctor are more likely to be detected at an earlier point in time by applying a suitable data mining method on the THIN database rather than mining the SRS databases.
Existing methods developed for the EHDs are often disproportionality based methods (methods that contrast how often the event of interest occurs after the specified drug relative to how often the event of interest occurs after any drug) similar to the SRS methods [17] [18] [19], or association rule mining methods [20] [21] [22]. As EHDs do not contain links between drugs and suspected ADRs, these are often inferred by investigating medical events that occur within some time period around a drug. Many of these medical events are linked to the cause of taking the drug, and these 'therapeutically related' medical events present a major issue with the majority of the existing methods. It has been demonstrated that these existing methods are currently not suitable for signalling rare ADRs [23]. However, it is the rare ADRs that are unlikely to be detected by mining SRS databases, due to under-reporting, so developing an algorithm that can signal rare ADRs using the THIN database would be beneficial.
In this paper we develop a novel computational meta-analysis framework that integrates the existing methods (MUTARA, HUNT, OE ratio, see section III-A) and uses information obtained from the internet to efficiently and accurately identify rare ADRs that occur immediately or shortly after a drug is prescribed. The framework uses the dependency measures obtained from some of the existing electronic healthcare based methods and novel values of interest as attributes for each medical event that occurs within a 30 day period after the drug of interest is prescribed for any patient. After the attributes are generated for each medical event we label some of the medical events by extracting information from the internet informing us of the indicator events and known ADRs for the drug of interest. The unlabelled medical events then have labels assigned by applying metric learning and semi-supervised clustering.
Finally, using the labels, we develop a novel filter that removes medical events labelled as indicator events and then returns the remaining medical events ordered by how often they occurred unpredictably within 30 days after the drug being investigated, multiplied by weights based on their assigned labels.
The continuation of this paper is as follows: section II contains the background information on the THIN database and section III describes the problem formulation. This is followed by the description of the novel methodology we developed that is able to identify rare ADRs in section IV. Section V contains the results of comparing our novel algorithm with a selection of existing methods for the detection of known rare ADRs, and the discussion of these results is contained in section VI. The paper finishes with the conclusion and future work suggestions in section VII.

II. THIN DATABASE
The THIN database contains complete medical and prescription records for registered patients at participating general practices within the UK. The medical information is recorded into the THIN database by Read Codes that correspond to illnesses, so each Read Code is paired with the illness description. Each drug is recorded into the THIN database by a drugcode (sometimes called the multilexeid) that is paired to the generic name. The drugcode consists of nine digits; we do not make use of any structure in the drugcode, but it does specify the way the drug is ingested and the dosage. Each entry also includes the date that a Read Code or drugcode is recorded but does not contain the time. In this paper we used a subset of the THIN database containing records from 495 general practices. The subset contained approximately four million patients, over 358 million prescription entries and over 233 million medical event entries.
Patients can register at a new practice at any point over their lifetime, and it has been shown that statistical studies using the THIN database will be biased if records from the first year after registration for a patient are included in the study [24]. The reason for this is that newly registered patients will need to inform their new doctor of any chronic illnesses they have, but these illnesses will be recorded on the day they inform the doctor of them rather than the actual day they were discovered. To prevent bias in this study we do not include the first year of a patient's medical records since registration. We also ignore the last 30 days of prescription records for each patient to reduce potential under-reporting that may occur by including patients with less than 30 days of medical records after the first prescription of a drug.
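The two exclusion rules above can be sketched as follows. This is a minimal illustration only: the function names, the 365-day reading of "first year" and the inclusive boundaries are assumptions, not details given in the text.

```python
def usable_window(registration_age, last_record_age, burn_in_days=365, tail_days=30):
    """Return the (start, end) ages in days kept for analysis, or None if nothing is left.

    Assumption: the "first year" is taken as 365 days and both ends are inclusive.
    """
    start = registration_age + burn_in_days   # skip the first year after registration
    end = last_record_age - tail_days         # skip the last 30 days of records
    return (start, end) if start <= end else None


def keep_prescriptions(prescription_ages, registration_age, last_record_age):
    """Keep only the prescription ages that fall inside the usable window."""
    window = usable_window(registration_age, last_record_age)
    if window is None:
        return []
    start, end = window
    return [t for t in sorted(prescription_ages) if start <= t <= end]
```

For a hypothetical patient registered at 10000 days old with records up to 11000 days old, only prescriptions between 10365 and 10970 days old would be retained.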

III. PROBLEM FORMULATION
The focus of this paper is to develop a method for detecting rare ADRs that occur shortly after prescription, so we can restrict our attention to the medical events that occur around the time of the drug prescription. In this paper we consider the natural numbers $\mathbb{N}$ to include 0 and use $[a...b]$ to denote the interval of natural numbers from $a$ to $b$ ($\{n \in \mathbb{N} : a \le n \le b\}$).
If we let $M$ denote the set of Read Codes, $D$ denote the set of drugcodes and $\Omega$ denote the set of patients in the THIN database, then the sequence of Read Codes and the sequence of drugcodes for a patient can be represented as two discrete functions. The discrete function representing a patient's Read Code sequence is the mapping $f_M : \mathbb{N} \times \Omega \to \mathcal{P}(M)$; $f_M(t, \omega) = e_{\omega,t}$, where $t$ is the age in days of the patient $\omega$ and $e_{\omega,t}$ is the set of Read Codes that are recorded into the THIN database for patient $\omega$ when they are $t$ days old. Similarly, the discrete function representing a patient's drugcode sequence is the mapping $f_D : \mathbb{N} \times \Omega \to \mathcal{P}(D)$; $f_D(t, \omega) = d_{\omega,t}$, where $d_{\omega,t}$ is the set of drugcodes recorded in THIN for the patient $\omega$ when they are $t$ days old.
Using the example THIN database entries shown in Table I, we have M = {A123., B21., C1..., C11.., C12.., C121., D25.., F1..., F12..}. The sets of Read Codes that patient jj3 has recorded when he is 9999, 10000 and 10002 days old are $f_M(9999, \text{jj3})$, $f_M(10000, \text{jj3})$ and $f_M(10002, \text{jj3})$ respectively. The reason $f_M(10002, \text{jj3}) = \{\} = \emptyset$ is that patient jj3 does not have a Read Code recorded when he is 10002 days old. Similarly, the set of drugcodes that patient jj3 has recorded at a given age, such as 10000 days old, is given by $f_D$.

The set consisting of every age in days where patient $\omega$ has a drugcode recorded into the THIN database is $A_D(\omega) = \{t \in \mathbb{N} : f_D(t, \omega) \neq \emptyset\}$. The set of all drugs that are prescribed for patient $\omega$ is the finite union of the sets of drugcodes recorded daily for the patient while they are active in the THIN database, $\bigcup_{t \in A_D(\omega)} f_D(t, \omega)$. To determine the age at which patient $\omega$ is first prescribed the drug of interest $d^* \in D$, we first find the set of ages at which the patient was prescribed the drug, $\{t \in A_D(\omega) : d^* \in f_D(t, \omega)\}$, and then define a function $\alpha_1(\omega, d^*)$ that returns the minimum of this set, or $-1$ if the patient has never been prescribed the drug.

The set of the patient's ages at which the drug is prescribed for the first time in 13 months is determined by the function $\alpha(\omega, d^*)$. If a patient is not prescribed the drug then the function returns $-1$; otherwise it returns the set of ages in days at which the patient was prescribed the drug with a minimum of 386 days since previously being prescribed the drug.
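As a rough illustration of these definitions, a single patient's drugcode sequence can be held as a mapping from age in days to the set of drugcodes recorded on that day. The representation and the function names below are assumptions made for this sketch, and the example drugcodes are hypothetical.

```python
from typing import Dict, Set

# Hypothetical representation of f_D for one patient:
# age in days -> set of drugcodes recorded on that day.
DrugSequence = Dict[int, Set[str]]


def ages_with_any_drug(f_d: DrugSequence) -> Set[int]:
    """Every age in days at which the patient has at least one drugcode recorded."""
    return {t for t, codes in f_d.items() if codes}


def all_drugs(f_d: DrugSequence) -> Set[str]:
    """Finite union of the daily drugcode sets."""
    drugs: Set[str] = set()
    for t in ages_with_any_drug(f_d):
        drugs |= f_d[t]
    return drugs


def first_prescription_age(f_d: DrugSequence, drug: str) -> int:
    """Minimum age at which `drug` is prescribed, or -1 if it is never prescribed."""
    ages = [t for t, codes in f_d.items() if drug in codes]
    return min(ages) if ages else -1
```

Returning -1 for a patient who never receives the drug mirrors the convention described above and lets a later step separate prescribed from never-prescribed patients by sign.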
For the example entries in Table I, the set of ages in days at which patient jj3 is prescribed any drugs is obtained from the drugcode sequence as defined above. Patient aa2 is prescribed drug 912314611 with the first prescription occurring at 15001 days old ($\alpha_1(\text{aa2}, 912314611) = 15001$); the drug was then repeated monthly two times, after which the patient had a 10243 day break before being prescribed the drug for the final time. Therefore the set of ages in days at which patient aa2 is prescribed drug 912314611 for the first time in 13 months is $\alpha(\text{aa2}, 912314611) = \{15001, 25304\}$, as the instances when the drug was repeated after 30 days are not included.
In the continuation of this paper we will use $\alpha(\omega, d^*)_k$ to refer to the $k$th time patient $\omega$ is prescribed drug $d^*$ for the first time in 13 months, so $\alpha(\omega, d^*) = \{\alpha(\omega, d^*)_1, \ldots, \alpha(\omega, d^*)_n\}$, where $n$ is the total number of times patient $\omega$ is prescribed drug $d^*$ for the first time in 13 months ($n = |\alpha(\omega, d^*)|$). Following on with our example, $\alpha(\text{aa2}, 912314611)_1 = 15001$ and $\alpha(\text{aa2}, 912314611)_2 = 25304$.
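The "first time in 13 months" rule can be sketched as below, using the prescription ages from the aa2 example (15001, two monthly repeats, then a 10243 day break). The function name and the list-based return type are assumptions for illustration; the 386 day gap and the -1 convention come from the text.

```python
def first_in_13_months(prescription_ages, min_gap_days=386):
    """Ages at which the drug is prescribed with at least `min_gap_days`
    since the previous prescription of the same drug.

    Returns -1 if the patient was never prescribed the drug.
    """
    ages = sorted(set(prescription_ages))
    if not ages:
        return -1
    result = []
    previous = None
    for t in ages:
        # The very first prescription always counts; later ones need a long enough gap.
        if previous is None or t - previous >= min_gap_days:
            result.append(t)
        previous = t
    return result


alpha_aa2 = first_in_13_months([15001, 15031, 15061, 25304])
# The k-th element (1-indexed in the paper) is alpha_aa2[k - 1].
```

Here the repeats at 15031 and 15061 are dropped because each follows the previous prescription by only 30 days.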
This then enables us to define an interval of interest around each of the $n$ times the patient is prescribed a drug for the first time in 13 months. Equation (6) defines a time interval centred around the $k$th time drug $d^*$ is prescribed for the first time in 13 months for patient $\omega$, determined by the integers $t_1$ and $t_2$. For example, if we wanted to investigate the 30 days after a prescription we would use $t_1 = 1$ and $t_2 = 30$, and the time period around the first prescriptions in 13 months of drug 912314611 for patient aa2 would be $T(\text{aa2}, 912314611, 1, 30)$.

As the function $f_M(t, \omega)$ returns the set of Read Codes that are recorded into the THIN database for patient $\omega$ when they are $t$ days old, we can find the set of ages in days at which $\omega$ has any Read Code recorded, $A_M(\omega) = \{t \in \mathbb{N} : f_M(t, \omega) \neq \emptyset\}$, and the finite union over each $t \in A_M(\omega)$, $\bigcup_{t \in A_M(\omega)} f_M(t, \omega)$, is the set of all Read Codes that are recorded into the THIN database for the patient $\omega$. It follows that the union of $f_M(t, \omega)$ over the ages $t$ in the interval of interest is the set of all Read Codes that are recorded into the THIN database for the patient $\omega$ during the period of interest determined by $t_1$ and $t_2$ around the $k$th prescription of drug $d^*$.

The function $h(e_i, d^*, \omega, t_1, t_2)$ is one if the patient $\omega$ has the Read Code $e_i$ recorded within the time period of interest around the first prescription of drug $d^*$ and zero otherwise. To determine the number of times the drug $d^*$ is prescribed to patient $\omega$ for the first time in 13 months and the Read Code $e_i$ is recorded within the time period of interest, we define another function $\hat{h}(e_i, d^*, \omega, t_1, t_2)$. The total number of patients in the THIN database that have Read Code $e_i$ recorded within the time period of interest centred around the first prescription of drug $d^*$ is then $\sum_{\omega \in \Omega} h(e_i, d^*, \omega, t_1, t_2)$, and the total number of first time prescriptions of drug $d^*$ in 13 months where the Read Code $e_i$ is recorded within the time period centred around the prescription determined by $t_1$ and $t_2$ is

$$\sum_{\omega \in \Omega} \hat{h}(e_i, d^*, \omega, t_1, t_2). \tag{11}$$
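A small sketch of the two counting functions follows. It assumes that the interval of interest around a prescription at age $a$ is $[a + t_1 ... a + t_2]$, which matches the "30 days after" example but is not stated explicitly in the surviving text; the data layout and names are also illustrative only.

```python
def codes_in_interval(f_m, anchor_age, t1, t2):
    """Union of the Read Code sets recorded on days anchor_age+t1 .. anchor_age+t2.

    f_m maps age in days -> set of Read Codes for one patient (assumed layout).
    """
    codes = set()
    for t in range(anchor_age + t1, anchor_age + t2 + 1):
        codes |= f_m.get(t, set())
    return codes


def h(code, f_m, alpha, t1, t2):
    """1 if `code` is recorded in the interval around the first prescription, else 0."""
    if alpha == -1:          # patient never prescribed the drug
        return 0
    return int(code in codes_in_interval(f_m, alpha[0], t1, t2))


def h_hat(code, f_m, alpha, t1, t2):
    """Number of first-in-13-months prescriptions with `code` recorded in their interval."""
    if alpha == -1:
        return 0
    return sum(code in codes_in_interval(f_m, a, t1, t2) for a in alpha)


# Hypothetical Read Code sequence for patient aa2, whose alpha is {15001, 25304}.
f_m_aa2 = {15010: {"A123."}, 25310: {"A123.", "C121."}}
alpha_example = [15001, 25304]
```

Summing `h` over all patients gives the patient count, and summing `h_hat` gives the prescription count of equation (11).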

A. Existing Algorithms
The observed to expected (OE) ratio [17] calculates the information component (IC), which measures the disproportionality between how often a Read Code is recorded within some time period after the first prescription in 13 months of the drug of interest and how often it is recorded within the same time period after the first prescription in 13 months of any drug. It also adds a bias that lowers the IC value if the Read Code or drug is rare.
The function used to test if a set of medical events is empty is $\hat{H}$, and using this function we define the values used in the IC calculation, where $n_{d^* e_i}(0, 30)$ is the number of times that event $e_i$ occurs in the month after a first prescription in 13 months of drug $d^*$. A value of one half is added to both the numerator and the denominator in the IC calculation. The filters considered correspond to the IC value for the month prior to the prescription being higher than for the month after the prescription, or the IC value on the day of prescription being higher than for the month after. In this paper we will use $IC_\Delta(e_i, d^*)$ as described above as an attribute for each possible drug and Read Code combination, and use the Heaviside step function $H : \mathbb{R} \to \{0, 1\}$ to define the filter functions as two additional binary attributes.
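The effect of adding one half to the numerator and denominator can be illustrated with a short sketch. The base-2 logarithm and the function name are assumptions made here; the surviving text only states that a half is added to both terms of the observed to expected ratio.

```python
import math


def information_component(observed, expected):
    """Shrunk log ratio of observed to expected counts.

    Assumption: a base-2 logarithm is used; the +0.5 terms are from the text.
    """
    return math.log2((observed + 0.5) / (expected + 0.5))


# Same raw ratio (4 to 1) at two very different sample sizes.
ic_rare = information_component(2, 0.5)      # few observations
ic_common = information_component(200, 50)   # many observations
```

With only two observed events the value is pulled well below the raw log ratio of 2, while with 200 observed events it stays close to 2, which is the bias against rare Read Codes or drugs described in the text.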
The algorithm Mining Unexpected Temporary Association Rules given the Antecedent (MUTARA) [21] anchors each patient's time interval of interest at the patient's age in days for the second prescription if this is within a month of the first prescription, or at the patient's age in days for the first time the drug is prescribed if the second prescription is not within a month of the first.
MUTARA uses patients that have not been prescribed the drug (so $\alpha_1(\omega, d^*) = -1$) to estimate the background rate at which each Read Code is recorded into the THIN database. Putting this all together, MUTARA uses the time interval of interest around the first prescription of drug $d^*$ for patients prescribed the drug, or a time interval chosen at random for patients who have never been prescribed the drug.

For patients who are prescribed the drug $d^*$ a filter is applied to ignore any 'expected' Read Codes that are recorded within the time interval of interest after the first prescription of the drug. A Read Code is 'expected' for the patient during the time interval of interest after the drug is prescribed if the patient also had the Read Code recorded within $t_3 \in \mathbb{N}$ days prior to the drug being prescribed; this defines a second time interval prior to the prescription. We then find how many patients have each Read Code recorded 'unexpectedly' during the time interval of interest after the drug. In this count, the Heaviside step function applied to $\alpha_1(\omega, d^*)$ returns 1 if patient $\omega$ has been prescribed the drug $d^*$, as in that case $\alpha_1(\omega, d^*) > 0$, or returns 0 if patient $\omega$ has never been prescribed the drug $d^*$.

MUTARA then calculates the unexpected-leverage: the fraction of patients in the database who have the drug multiplied by the fraction of patients who have the Read Code recorded during their interval of interest is subtracted from the fraction of patients in the database who have the Read Code recorded within a month of the first time the drug is prescribed (or up to a month after the second prescription if the drug is repeated within a month).
HUNT calculates a value similar to the unexpected-leverage, known as the leverage, that does not include a filter to remove 'expected' Read Codes based on a patient's history. HUNT then ranks the Read Codes in descending order of the leverage to attain the leverage rank, and in descending order of the unexpected-leverage to attain the unexpected-leverage rank, and calculates the rank ratio between the leverage rank and the unexpected-leverage rank. Finally, HUNT returns the Read Codes in descending order of this rank ratio.
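The leverage calculation and the HUNT ordering can be sketched as follows. The exact form of the rank ratio is an assumption here (leverage rank divided by unexpected-leverage rank), as are the function names and the example scores; the leverage follows the verbal description given above.

```python
def leverage(n_drug_and_code, n_drug, n_code, n_total):
    """Joint fraction minus the product of the two marginal fractions."""
    return n_drug_and_code / n_total - (n_drug / n_total) * (n_code / n_total)


def rank_descending(scores):
    """Map each key to its rank (1 = largest score)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {key: i + 1 for i, key in enumerate(ordered)}


def hunt_order(lev, unexlev):
    """Order Read Codes by descending rank ratio.

    Assumption: rank ratio = leverage rank / unexpected-leverage rank.
    """
    lev_rank = rank_descending(lev)
    unex_rank = rank_descending(unexlev)
    ratio = {code: lev_rank[code] / unex_rank[code] for code in lev}
    return sorted(ratio, key=ratio.get, reverse=True)


# Hypothetical scores: the indication dominates leverage but is mostly 'expected'.
lev = {"indication": 0.05, "adr": 0.01, "noise": 0.002}
unexlev = {"indication": 0.001, "adr": 0.009, "noise": 0.002}
order = hunt_order(lev, unexlev)
```

Under this assumed ratio, a Read Code that moves up the list once 'expected' occurrences are filtered out is promoted, while one that owes its leverage to repeat events is demoted.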

B. Novel Algorithm Attributes
The first attribute of interest for detecting ADRs compares how many patients have the Read Code recorded in the month prior to the first prescription with how many patients have it recorded in the month after; this is the after and before ratio, $\text{ABratio}_{30}(e_i, d^*)$. As the THIN database does not record the time that a Read Code or drugcode is recorded, Read Codes that are recorded on the same day as the drugcode could correspond to a serious ADR that occurs immediately or to an indicator. This is why the day of prescription is not included in the ABratio and is instead used as another attribute.

In previous methods there has been a patient level filter that removes medical events from a patient's sequence that occur in the 30 days after the drug if the patient also had the medical event shortly prior to the drug [20]. The justification for this is that if a patient had an illness shortly before the drug, then having the illness repeated after the drug is unlikely to be an ADR. This inspires the third attribute, $\text{expect}(e_i, d^*)$, which is based on the number of patients who have the Read Code recorded in the month after and also have it recorded during the month prior to the first prescription.

The final attributes of interest make use of the Read Code tree structure. When a patient first has an illness it is likely that not much detail is known, so a low level Read Code will probably be recorded into the THIN database. Over time more information may be discovered about a patient's illness, possibly due to laboratory results, and this may then result in a more specific higher level Read Code being entered (a child of the original less detailed Read Code). As a consequence, it is common for higher level Read Codes related to the cause of taking the drug to only be recorded after the drug, so to help distinguish between these and ADRs we calculate the Read Code after and before ratio when only considering the first two or three elements of a Read Code. For example, considering the first three elements of the Read Code, the Read Code A11ab becomes A11 and the $\text{ABratio}_{lev3}$ is the number of patients that have a Read Code starting with A11 within the month after the first prescription of the drug divided by the number of patients that have a Read Code starting with A11 within the month prior to the first prescription of the drug. Similarly, the $\text{ABratio}_{lev2}$ for a Read Code is the number of patients who have a Read Code with the same first two elements within a month after the first prescription divided by the number of patients who have a Read Code with the same first two elements within a month prior to the prescription.
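The after and before ratios at different Read Code levels can be sketched as below. The data layout (patient identifier mapped to the set of Read Codes recorded in the month after or before the first prescription), the function names and the handling of a zero denominator are all assumptions made for this illustration.

```python
import math


def count_patients(codes_by_patient, code, level=None):
    """Number of patients with a matching Read Code.

    With `level` set, two codes match if their first `level` characters agree.
    """
    if level is None:
        return sum(code in codes for codes in codes_by_patient.values())
    prefix = code[:level]
    return sum(
        any(c.startswith(prefix) for c in codes)
        for codes in codes_by_patient.values()
    )


def ab_ratio(after, before, code, level=None):
    """Patients with the code in the month after divided by those in the month before."""
    n_after = count_patients(after, code, level)
    n_before = count_patients(before, code, level)
    # Assumption: the text does not say how a zero 'before' count is handled.
    return n_after / n_before if n_before else math.inf


# Hypothetical patients: p1 moves from A11cd to the sibling code A11ab.
after = {"p1": {"A11ab"}, "p2": {"A11cd"}, "p3": {"B21.."}}
before = {"p1": {"A11cd"}, "p2": set(), "p3": {"B21.."}}
```

In this toy data the full code A11ab never appears before the prescription, whereas at level three the A11 branch does, which is how the truncated ratios can expose codes that refine an earlier record.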
Fig. 2: The five steps of the DRESS algorithm. Step 1: generate Read Code attributes. Step 2: get labels for some Read Codes. Step 3: apply distance metric learning. Step 4: apply semi-supervised clustering. Step 5: filter and order Read Codes.

IV. DRESS ALGORITHM
The Detecting Rare Events Semi-Supervised (DRESS) algorithm comprises five steps (see Fig. 2).
The DRESS algorithm requires the user to input the drug of interest ($d^*$) and returns a ranked list of Read Codes in descending order of how likely they are to be ADRs. Tentative ADR signals can then be determined by the DRESS algorithm by considering the top $k$ ranked Read Codes to be signalled as ADRs. The value of $k$ determines the filtering threshold, and in this paper we consider the top 100 ranked Read Codes to be signalled by the algorithms.

A. Step 1
The first step of the DRESS algorithm is the generation of attributes for any Read Code that could be an ADR. This is accomplished by initially finding the set of all the Read Codes that are recorded within a month of the first prescription of $d^*$ for any patient; we denote this set of Read Codes by $G$.

B. Step 2
As we are applying a semi-supervised approach, we need labels for some of the Read Codes. We use three different labels: one label representing Read Codes that are ADRs (known ADR), another label representing Read Codes that cause the drug to be taken by the patients (indicator) and a final label representing Read Codes that are not linked to the drug but just occur by chance (noise). We chose three labels because the available information is sufficient to determine these labels for an adequate number of Read Codes.
We determined the Read Codes that are labelled as noise by using the hierarchical structure of the Read Code tree. We manually determined branches that are not related to immediately occurring ADRs.

The constrained K-means algorithm [27] is applied to the Read Codes on their corresponding transformed data points determined by the distance metric learning algorithm above. The constrained K-means algorithm is a semi-supervised algorithm that fixes the class of the labelled Read Codes and uses these labelled Read Codes to calculate the initial cluster centres. It then iteratively assigns the non-fixed Read Codes to the cluster with the closest mean, with the means being recalculated at each iteration until convergence.
When the DRESS algorithm is implemented, the set of data points input into the constrained K-means algorithm is the set $\{f(x_1), f(x_2), ..., f(x_n)\}$, the value of $K$ input is 3 and the initial seeds are $S_1 = \{x_i : x_i \text{ is labelled as a known ADR}\}$, $S_2 = \{x_i : x_i \text{ is labelled as an indicator}\}$ and $S_3 = \{x_i : x_i \text{ is labelled as noise}\}$.
Algorithm 2: The Constrained K-means algorithm developed in [27]
Input: Set of data points $X = \{x_1, x_2, ..., x_n\}$, $x_i \in \mathbb{R}^d$, number of clusters $K$, the set $S = \bigcup_{l=1}^{K} S_l$ of initial seeds.
Output: Disjoint $K$ partitioning $\{X_l\}_{l=1}^{K}$ of $X$ such that the KMeans objective function is optimised.
Initialise: $\mu_h^{(0)} \leftarrow \frac{1}{|S_h|} \sum_{x \in S_h} x$ for $h = 1, ..., K$; $t \leftarrow 0$.
repeat
  For $x \in S$, if $x \in S_h$ assign $x$ to the cluster $h$ (i.e., set $X_h^{(t+1)}$). For $x \notin S$, assign $x$ to the cluster $h^*$ (i.e., set $X_{h^*}^{(t+1)}$), for $h^* = \arg\min_h \|x - \mu_h^{(t)}\|^2$.
  $\mu_h^{(t+1)} \leftarrow \frac{1}{|X_h^{(t+1)}|} \sum_{x \in X_h^{(t+1)}} x$; $t \leftarrow t + 1$.
until convergence

Read Codes in the same cluster as the Read Codes that were originally labelled as known ADRs are referred to as being in the ADR cluster, Read Codes in the same cluster as the Read Codes that were originally labelled as indicators are referred to as being in the indicator cluster, and Read Codes in the same cluster as the Read Codes that were originally labelled as noise are referred to as being in the noise cluster.
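A compact sketch of the constrained K-means procedure described above is given below in plain Python. It is an illustration under stated assumptions rather than the implementation used in the paper: cluster identifiers are assumed to be 0 to K-1, every cluster is assumed to have at least one seed, and the example points are hypothetical two-dimensional values.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constrained_kmeans(points, seeds, max_iter=100):
    """points: list of tuples; seeds: dict point index -> fixed cluster id.

    Seeded points never change cluster; all other points go to the nearest centre.
    """
    k = max(seeds.values()) + 1
    # Initial centres are the means of the seeded points of each cluster.
    centres = [mean([points[i] for i, h in seeds.items() if h == c]) for c in range(k)]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            seeds[i] if i in seeds
            else min(range(k), key=lambda c: sq_dist(p, centres[c]))
            for i, p in enumerate(points)
        ]
        if new_assignment == assignment:   # converged: no point changed cluster
            break
        assignment = new_assignment
        centres = [
            mean([p for p, h in zip(points, assignment) if h == c]) for c in range(k)
        ]
    return assignment


pts = [(0, 0), (0.2, 0.1), (5, 5), (5.1, 4.9), (10, 0), (9.8, 0.3),
       (0.1, 0.3), (4.8, 5.2), (9.9, -0.2)]
labels = constrained_kmeans(pts, seeds={0: 0, 2: 1, 4: 2})
```

Because each cluster always keeps its seeds, no cluster can become empty, so the mean update is always defined.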

E. Step 5
The last step applies two additional filters and then uses the Read Code attributes and clustering to order the Read Codes by how likely they are to be ADRs. The first filter removes all the Read Codes ($e_i$) that are in the indicator cluster or for which $(1 - \text{expect}(e_i, d^*)) \times \text{ABratio}_{30}(e_i, d^*) < 1$. The second filter is one that we have developed for post processing with any algorithm that detects ADRs by mining the THIN database. This filter removes all the Read Codes that are irrelevant for ADR detection, such as Read Codes corresponding to administrative events. Finally, Read Codes are ordered in descending order of $(1 - \text{expect}(e_i, d^*)) \times \text{ABratio}_{30}(e_i, d^*) \times \frac{1}{\beta}$, where $\beta = 1$ for Read Codes in the ADR cluster and $\beta = 3$ for Read Codes in the noise cluster.
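The filtering and ordering of this final step can be sketched as follows. The dictionary layout, the field names and the example values are hypothetical; the score, the threshold of 1 and the weights for the ADR and noise clusters follow the description above.

```python
BETA = {"adr": 1, "noise": 3}   # weights from the text; indicator cluster is removed


def dress_rank(events, top_k=100):
    """events: dict Read Code -> dict with keys
    'expect', 'ab_ratio_30', 'cluster' and optional 'irrelevant' (assumed layout).

    Returns the top_k remaining Read Codes in descending order of score.
    """
    scored = []
    for code, e in events.items():
        base = (1 - e["expect"]) * e["ab_ratio_30"]
        # First filter: indicator cluster or too few unexpected occurrences.
        if e["cluster"] == "indicator" or base < 1:
            continue
        # Second filter: Read Codes irrelevant to ADR detection (e.g. administrative).
        if e.get("irrelevant", False):
            continue
        scored.append((base / BETA[e["cluster"]], code))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [code for _, code in scored[:top_k]]


example = {
    "tendon": {"expect": 0.1, "ab_ratio_30": 4.0, "cluster": "adr"},
    "cough": {"expect": 0.0, "ab_ratio_30": 6.0, "cluster": "noise"},
    "infection": {"expect": 0.2, "ab_ratio_30": 9.0, "cluster": "indicator"},
    "admin": {"expect": 0.0, "ab_ratio_30": 5.0, "cluster": "noise", "irrelevant": True},
    "rash": {"expect": 0.5, "ab_ratio_30": 1.5, "cluster": "adr"},
}
ranking = dress_rank(example)
```

In this toy input the noise-cluster event has the larger unweighted score, but the weight of one third moves it below the ADR-cluster event.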
In summary, the DRESS algorithm uses the semi-supervised clustering to filter Read Codes that have attributes that make them unlikely to be ADRs and then orders the remaining Read Codes by how often they occurred unexpectedly after the first prescription of the drug being investigated but also adds a weight so that Read Codes that have attributes similar to known ADRs are ranked higher.

V. RESULTS
We applied the DRESS algorithm to a range of drugs, some of which have been withdrawn from the market due to serious ADRs. Table II shows the rank that each data mining algorithm assigns for each known rare and serious drug and ADR pair. The table also states the cause for taking the drug (the indication), the year the drug was withdrawn and approximately how commonly the ADR occurs for patients prescribed the drug. In some cases the rare and serious ADR being investigated was listed as an ADR on the netdoctor website. To prevent any bias in the results for the DRESS algorithm, any labels for the Read Codes corresponding to the ADR being investigated were removed at the end of step 2, prior to the semi-supervised steps, so the Read Codes corresponding to the ADR being investigated were always unlabelled in the DRESS algorithm.
The DRESS algorithm had an average rank over the drug and ADR pairs of 143.75 and ranked the known ADR higher than all the other algorithms for 12 of the 16 drug and ADR pairs investigated. It was able to return a rank under 100 for 56.25% of the ADRs and all the ADRs had a rank below 500. In comparison the other methods had an average rank of 344.69, 791 and 2385.73 for the OE ratio, HUNT and MUTARA respectively. The OE ratio only returned a rank under 100 for 25% of the ADRs and only 6.25% of ADRs ranked by MUTARA and HUNT had a rank under 100.

VI. DISCUSSION
The existing methods for detecting ADRs using EHDs do not currently definitively detect ADRs, but rather they can be considered to generate tentative ADR signals for the top $k$ ranked medical events in their returned list. This effectively filters out all the medical events that are unlikely to be ADRs or the medical events without sufficient evidence (number of patients experiencing the event after the drug) of being an ADR. The medical events with a rank greater than $k$ are ignored and the medical events with a rank less than $k$ (the signalled medical events) are investigated further to confirm if they are true ADRs.
Therefore, for an unknown ADR to be detected by these existing methods it needs to be signalled by being ranked in the top k medical events returned and the closer that the rank is to the value 1, the more likely it will be investigated further, even when low values of k are used.The results of this paper show that the existing methods are not able to rank the known rare and serious ADRs highly, so they would be unlikely to signal these for further investigation, but the DRESS algorithm was able to rank over 50% of these rare ADRs within the top 100 and would most likely signal these for further investigation.This implies that the DRESS algorithm is more suitable than existing methods for detecting rare ADRs and has the potential to discover many unknown rare ADRs.
The DRESS algorithm was unable to rank 'Naproxen and Hepatitis' and 'Nifedipine and weight increase' higher than the OE ratio. One reason for this is that the DRESS algorithm does not perform as well for medical events that have a high background rate. Medical events that are common will have a greater number of patients experiencing the medical event in the month before the prescription, and if the ADR is rare only one or two extra patients will have the medical event in the month after the prescription, so the ABratio will be close to one. The OE ratio, on the other hand, performs better for medical events with a high background rate, as the bias in the IC calculation will have less impact.
It might be better to have two ADR clusters, one for the medical events that occur at a low background rate and another for the medical events that occur at a high background rate. However, this may cause issues if there is not a sufficient number of Read Codes corresponding to known side effects: if there are only a few Read Codes for the known side effects, there may only be one or two labelled Read Codes in each cluster, and this will be deleterious for the semi-supervised methods.
It may be argued that the DRESS algorithm can only be applied after many ADRs are known and that this may prevent it efficiently detecting rare ADRs. However, rare ADRs that occur in fewer than 1 in 1000 patients generally need three or more ADR cases before there is satisfactory evidence to confirm the ADR, and this requires thousands of patients having the drug. DRESS can be implemented after a few hundred patients have taken a drug, as the obvious side effects will be known, so the constraint of requiring known ADRs will not reduce the efficiency of DRESS. It is worth noting that the current implementation of DRESS will not be as effective if a drug is generally safe and does not have many side effects, but this is not common, and DRESS could be modified to use the positive effects of the drug as labelled medical events, as the positive effects and side effects have similar attribute values because both should increase after the drug is taken.
The DRESS algorithm still ranks medical events related to the cause of taking the drug highly, and removing these medical events would greatly improve the ability to detect rare ADRs with DRESS. The expect attribute has limitations, as it only indicates whether the patient had a repeat of a medical event and does not make use of the medical event relations within the THIN database to determine whether a medical event is expected based on related medical events. For example, if a patient has 'a cold' then they are likely to experience 'a cough' if the illness progresses, but the expect attribute only says 'a cough' is expected if the patient has had 'a cough' shortly before, and not if a predecessor medical event has occurred before.
To reduce the rank of medical events related to the cause of taking the drug, a new attribute could be used that applies sequential patterns to determine the expectedness of each medical event that occurs after the prescription.

VII. CONCLUSION
In this paper we have described a novel methodology to detect rare ADRs that incorporates some existing methods (Observed Expected ratio, MUTARA and HUNT), information retrieval from the web, metric learning and semi-supervised clustering. The results suggest this methodology is able to detect rare and serious ADRs for the range of drugs chosen in the investigation and has the potential to help detect many currently unknown ADRs. The method is not able to remove all medical events related to the cause of taking the drug, and future work should aim to prevent signals being generated for these medical events by adding an attribute to the clustering that determines the expectedness of each medical event based on sequential patterns mined from the whole database. Further work could also investigate different metric learning and semi-supervised clustering techniques.

Fig. 1: Example of a branch in the Read Codes tree.
$\alpha_1(\omega, d^*) = -1$) to estimate the background rate at which the Read Code is recorded into the THIN database. For each patient that has never been prescribed the drug $d^*$, a random time interval is chosen between the age at which the patient first has any Read Code recorded and the age at which a Read Code is last recorded, $\min(A_M(\omega))$ and $\max(A_M(\omega))$ respectively, where we defined $A_M(\omega)$ previously to be the set of ages at which patient $\omega$ has a Read Code recorded into the THIN database. If a patient has never had a Read Code recorded into the THIN database ($A_M(\omega) = \emptyset$) or only had Read Codes recorded for a short period of time ($\max(A_M(\omega)) - \min(A_M(\omega)) < |t_2 - t_1|$) then the patient is not used in this study, as they are not an active patient and may bias results. The time interval of length $|t_2 - t_1|$ is chosen uniformly within $[\min(A_M(\omega)), \max(A_M(\omega))]$.
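The selection of a random interval for a patient who has never been prescribed the drug can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_control_interval` is hypothetical, ages are taken to be whole days, and "chosen uniformly" is interpreted as drawing the interval start uniformly so that the whole interval fits inside the recorded history.

```python
import random
from typing import Optional, Sequence, Tuple


def sample_control_interval(
    record_ages: Sequence[int],
    t1: int,
    t2: int,
    rng: random.Random,
) -> Optional[Tuple[int, int]]:
    """Pick a random window of length |t2 - t1| inside a patient's history.

    record_ages plays the role of A_M(omega): the ages (in days) at which
    the patient has any Read Code recorded. Returns None for the patients
    the text excludes (no records, or a history shorter than the window).
    """
    length = abs(t2 - t1)
    if not record_ages:
        return None
    first, last = min(record_ages), max(record_ages)
    if last - first < length:
        return None
    # Assumption: the start is uniform over the days that keep the window
    # inside [first, last]; randint is inclusive at both ends.
    start = rng.randint(first, last - length)
    return start, start + length


rng = random.Random(0)
window = sample_control_interval([10000, 10400, 11000], t1=0, t2=30, rng=rng)
```

Passing the random generator in explicitly keeps the sampling reproducible across runs.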

Fig. 2: A summary of the five steps applied in the DRESS algorithm.
investigating the Read Code tree and found the set of irrelevant Read Codes, $M_{irrel}$. Examples of Read Codes in $M_{irrel}$ that cannot correspond to immediately occurring ADRs are those related to cancer, occupations or family history. Fig. 3 illustrates the Read Code tree; the Read Code branches considered to be noise are shaded, whereas the Read Code branches that are possible ADRs are unshaded. Using the set $M_{irrel}$, the DRESS algorithm labels any Read Code $e_i \in G \cap M_{irrel}$ as noise. To determine the labels of some of the Read Codes corresponding to known ADRs or indicators we mined data from the internet. The Read Codes labelled as indicators are found by first extracting the strings listed as indicators on the netdoctor website [25], then finding the set of Read Codes with descriptions containing any one of these strings, and finally validating these as indicators by ignoring Read Codes that do not have an $ABratio_{30} < 1$. The Read Codes labelled as known ADRs are found similarly, by first extracting the strings listed as side effects on the netdoctor website, then finding the set of Read Codes with descriptions containing any one of these strings, and validating these by ignoring any Read Codes that do not have an $ABratio_{30} \geq 1.5$. In addition to the Read Codes

two Read Codes with different labels. The mapping is $f : \mathbb{R}^9 \to \mathbb{R}^9$; $f(x_i) = x_i^T S_{\mu_t}$, where $S_{\mu_t}$ is the $9 \times 9$ learned distance metric matrix.

D. Step 4
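The labelling rules described above (noise from $M_{irrel}$, indicators with $ABratio_{30} < 1$, known ADRs with $ABratio_{30} \geq 1.5$) can be sketched as follows. This is an illustrative sketch only: the function name, the placeholder Read Codes, the term lists standing in for the strings mined from the website, and the order in which the rules are checked are all assumptions rather than details given in the text.

```python
from typing import Dict, Iterable, Set


def label_read_codes(
    descriptions: Dict[str, str],
    ab_ratio_30: Dict[str, float],
    irrelevant: Set[str],
    indicator_terms: Iterable[str],
    side_effect_terms: Iterable[str],
) -> Dict[str, str]:
    """Assign 'noise', 'indicator' or 'known_adr'; other codes stay unlabelled."""
    indicator_terms = [t.lower() for t in indicator_terms]
    side_effect_terms = [t.lower() for t in side_effect_terms]
    labels: Dict[str, str] = {}
    for code, text in descriptions.items():
        text = text.lower()
        ratio = ab_ratio_30.get(code)
        if code in irrelevant:
            # Read Codes in the irrelevant set are labelled as noise.
            labels[code] = "noise"
        elif ratio is not None and ratio < 1 and any(t in text for t in indicator_terms):
            labels[code] = "indicator"
        elif ratio is not None and ratio >= 1.5 and any(t in text for t in side_effect_terms):
            labels[code] = "known_adr"
    return labels


# Hypothetical inputs; the codes and ratios are placeholders, not THIN data.
descriptions = {
    "code_a": "Tendon rupture",
    "code_b": "Urinary tract infection",
    "code_c": "Family history of asthma",
    "code_d": "Headache",
}
ab_ratio_30 = {"code_a": 2.0, "code_b": 0.4, "code_c": 1.0, "code_d": 1.2}
labels = label_read_codes(
    descriptions,
    ab_ratio_30,
    irrelevant={"code_c"},
    indicator_terms=["infection"],
    side_effect_terms=["tendon rupture", "headache"],
)
```

In this example `code_d` matches a side effect string but stays unlabelled because its ratio is below 1.5, which mirrors the validation step in the text.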

TABLE I: Examples of both a medical table and a prescription table containing the patients' records. The column patID is a unique reference corresponding to a patient, age is the patient's age in days when the entry was recorded, and Read Code/drugcode is the reference to the specific medical event or prescription respectively.
$IC(t_1, t_2, e_i, d^*)$ value calculation creates a bias that causes the $IC(t_1, t_2, e_i, d^*)$ value to tend to zero when the Read Code or drug occurs rarely. The $IC_\Delta$ is a measure that compares the $IC(0, 30, e_i, d^*)$, corresponding to the month after the first prescription in 13 months, with the $IC(-810, -630, e_i, d^*)$, corresponding to the period between 27 and 21 months prior to the prescription,

$IC_\Delta(e_i, d^*) = \log_2(\ldots)$

When investigating possible ADRs for drug $d^*$ the authors use a filter to remove any Read Code $e_i$ if $IC(-30, -1, e_i, d^*) > IC(0, 30, e_i, d^*)$ or $IC(0, 0, e_i, d^*) > IC(0, 30, e_i, d^*)$, these correspond

[22] applies a case control approach that estimates the background rate at which the Read Code is recorded into the THIN database by finding out how many patients who have not been prescribed the drug of interest have the Read Code recorded during a random time interval. MUTARA aims to find Unexpected Temporal Association Rules (UTARs) by investigating how many patients have a specific Read Code unexpectedly recorded within a period of interest centred on the first prescription of the drug being studied. The algorithm Highlighting UTARs Negating TARs (HUNT) [22] was developed by the same authors as MUTARA and applies a similar method but is less prone to ranking therapeutic failure Read Codes (Read Codes linked to the cause of taking the drug) above Read Codes corresponding to ADRs. MUTARA and HUNT both investigate the Read Codes that occur during the month after the drug being studied is first prescribed to a patient, or the union of the months after the first and second prescriptions of the drug if it is repeated within a month of the first prescription. Previously we defined $\alpha(\omega, d^*)$ to be the ages in days at which patient $\omega$ is prescribed the drug $d^*$ and $\alpha_1(\omega, d^*)$ to be the age in days at which patient $\omega$ is first prescribed the drug $d^*$, so the set $\alpha(\omega, d^*) \setminus \alpha_1(\omega, d^*)$ contains the ages at which patient $\omega$ is prescribed drug $d^*$ except the age when they are first prescribed the drug. Using this we define a new function that returns the patient's age in days for the second time a drug is prescribed for a patient if

September 3, 2014 DRAFT
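The filter on IC values described above, which removes a Read Code when its IC in the month before the prescription or on the day of the prescription exceeds its IC in the month after, can be sketched as follows. This is a minimal sketch assuming the three IC values have already been computed; the function name, the dictionary keys and the numbers are hypothetical.

```python
from typing import Dict, List


def filter_by_ic(ic_values: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the Read Codes that pass the filter described in the text.

    ic_values maps each Read Code to its IC for three windows, keyed here
    (hypothetically) as 'before' = IC(-30, -1), 'day0' = IC(0, 0) and
    'after' = IC(0, 30).
    """
    kept = []
    for code, ic in ic_values.items():
        if ic["before"] > ic["after"] or ic["day0"] > ic["after"]:
            continue  # removed by the filter
        kept.append(code)
    return kept


# Invented IC values for three placeholder Read Codes.
ic_values = {
    "code_a": {"before": 0.1, "day0": 0.2, "after": 1.5},  # kept
    "code_b": {"before": 2.0, "day0": 0.5, "after": 1.0},  # higher before
    "code_c": {"before": 0.3, "day0": 3.0, "after": 1.0},  # higher on day 0
}
kept_codes = filter_by_ic(ic_values)
print(kept_codes)  # ['code_a']
```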

TABLE II: The ranks given by the different methods to the Read Codes of serious and rare ADRs, for a range of drug and known rare ADR pairs. If the algorithm did not return a rank for the Read Code corresponding to the ADR, this is represented by '-'.