Discovering Sequential Patterns in a UK General Practice Database

The wealth of computerised medical information becoming readily available presents the opportunity to examine patterns of illnesses, therapies and responses. These patterns may be able to predict illnesses that a patient is likely to develop, allowing preventative actions to be implemented. In this paper sequential rule mining is applied to a General Practice database to find rules involving a patient's age, gender and medical history. By incorporating these rules into current health-care, a patient can be highlighted as susceptible to a future illness based on past or current illnesses, gender and year of birth. This knowledge has the potential to greatly improve health-care and reduce health-care costs.


I. INTRODUCTION
A patient's medical state changes continuously over time; for instance, one day they may be 'healthy' and the next they may be 'suffering from a cold'. It is common for medical states to develop gradually over time and/or to depend on previous medical states. For example, before developing illness A, many patients may previously have had illnesses B and C. If these temporal associations between illnesses can be learned, they can be used to highlight patients who have had illnesses B and C as being susceptible to developing illness A. With this extra knowledge it may then be possible to act to reduce the chance of these susceptible patients developing illness A. Alternatively, if preventing the illness is not possible, susceptible patients may be monitored more frequently to help detect the illness early and improve prognosis.
Sequential pattern mining algorithms find temporal associations. An example of a sequential pattern rule in the context of retail sales is that 78% of customers buying an electric toothbrush buy toothbrush replacement heads after three months. A recent example of mining in a medical context is the application of the sequential pattern mining algorithm FreeSpan to the RSU Dr. Soetomo medical database to find sequential disease patterns [1]; however, age and gender were not included in the sequential rules and the author displayed only a selection of rules. Other existing work aiming to detect medical sequential patterns has tended to focus on time series data [2] [3] or specific illnesses, such as investigating patterns that predict the onset of thrombosis [4] and identifying traits leading to atherosclerosis in a database of approximately 1400 middle-aged men [5]. There is currently no existing work on detecting sequential patterns of illnesses in The Health Improvement Network (THIN) general practice database (www.thin-uk.com).
The aim of this paper is to discover sequential pattern rules occurring within the THIN database. By learning these rules it may be possible to make statements such as 'if a patient is born in 1973 and has event A then there is a 70% chance of them developing event B in the future'. If implemented in health-care monitoring, when event A occurs for any patient born in 1973, then a marker could be added to alert the doctor, potentially allowing preventative actions to be introduced. The advantage of using the THIN database is that the rules learned can be directly incorporated into general practice systems as they contain all the required information, that is, age, gender and medical history. This may help prevention/early detection of illnesses in many thousands of patients.
The outline of this paper is as follows. In Section II the THIN database and the SPADE algorithm are described. The results of applying SPADE to the THIN database are presented in Section III, followed by a discussion of the importance of the results. Section IV presents conclusions and suggests potential future work.

A. THIN Database
The THIN database contains medical records from participating general practices within the UK. The data is anonymously extracted directly from the general practice Vision clinical system [6]. THIN then implements validation steps, the results of which are added as extra fields within the tables. The database contains patient information including the year of birth, gender, date of registration and family history of each patient registered at the practice since participation. Any information recorded by the doctors when a patient visits (referred to as medical events) is recorded, including the date of the visit. Information regarding any medication prescribed, as well as the date of the prescription and the dosage, is also included in the database. In this paper a database containing records from 20 general practices was used. This subset of the THIN database contained approximately 350 thousand patients, over 25 million prescriptions and over 15 million medical events.
Each medical event is recorded in the database by a reference code known as a Read code. The Read codes used in the THIN database are an independent system designed specifically for primary care, but every ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) code (or an analogue) has a corresponding Read code.

B. Pre-processing Database
In this paper the THIN database is transformed into a 'transaction database', a type of database commonly used in retail where each entry corresponds to a customer's 'basket' (a collection of items purchased during a shopping trip) and the entries are ordered by the date of the shopping trip. This transformation is implemented by grouping medical events occurring on the same date for the same patient together into 'baskets' and ordering the 'baskets' of each patient by the date on which the events occurred. The first 'basket' for each patient contains the patient's year of birth and gender. Medical events with partial or missing transaction dates are excluded from the study as it is not possible to definitively determine their order within the patient's medical sequence. Fig. 1 shows an example of how the database is transformed for two patients.
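The transformation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the record layout (`patients` as a mapping from a patient id to year of birth and gender, `events` as tuples of patient id, date and event label), the item labels such as `yob=1973`, and the function name `build_sequences` are all hypothetical, and the example records are invented.

```python
from collections import defaultdict
from datetime import date


def build_sequences(patients, events):
    """Group each patient's events into dated 'baskets' and order them.

    patients: {patient_id: (year_of_birth, gender)}   (hypothetical layout)
    events:   iterable of (patient_id, event_date or None, event_label)
    """
    baskets = defaultdict(lambda: defaultdict(set))
    for pid, event_date, label in events:
        if event_date is None:
            # Events without a usable date cannot be ordered, so skip them.
            continue
        baskets[pid][event_date].add(label)

    sequences = {}
    for pid, (yob, gender) in patients.items():
        # The first basket holds the demographic items.
        first = frozenset({f"yob={yob}", f"gender={gender}"})
        ordered = [frozenset(baskets[pid][d]) for d in sorted(baskets[pid])]
        sequences[pid] = [first] + ordered
    return sequences


# Invented example records for two patients.
patients = {1: (1973, "female"), 2: (1944, "male")}
events = [
    (1, date(2005, 3, 1), "Backache"),
    (1, date(2005, 3, 1), "Tiredness"),
    (1, date(2004, 1, 10), "Cough"),
    (1, None, "Depression"),  # missing date: excluded
    (2, date(2006, 5, 5), "Essential hypertension"),
]
seqs = build_sequences(patients, events)
```

Using sets for baskets reflects that events within one basket are unordered, while the list of baskets preserves the date order within each patient.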

C. Sequential Pattern Mining
Sequential pattern mining methods find patterns in ordered sequences. The formal problem is defined as [7]: Given a set of sequential records (sequences) representing a sequential database $D$, a minimum support threshold (min_sup) and a set of the unique items $I = \{i_1, i_2, \ldots, i_k\}$, find the set of all frequent sequences $S$ in the given sequence database $D$ of items $I$ at the given min_sup.
The term event denotes a non-empty unordered collection of items, denoted $(i_1, i_2, \ldots, i_l)$, where each $i_j$ is an item. A sequence is a list of ordered events, denoted $\alpha = (\alpha_1 \to \alpha_2 \to \ldots \to \alpha_m)$, where each $\alpha_j$ is an event. The cardinality of a sequence is the number of items it contains and the term $k$-sequence denotes a sequence of cardinality $k$, $k = \sum_j |\alpha_j|$.
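As a small worked illustration of this notation (the item names are hypothetical and chosen only for illustration), consider a patient sequence with three events:

$$\alpha = \big( (1973, \text{female}) \to (\text{Backache}) \to (\text{Tiredness}, \text{Depression}) \big)$$

Here $\alpha_1$ contains two items, $\alpha_2$ one item and $\alpha_3$ two items, so

$$k = \sum_j |\alpha_j| = 2 + 1 + 2 = 5,$$

and $\alpha$ is a 5-sequence made up of $m = 3$ events.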
The support of sequence $\alpha$, denoted $\mathrm{sup}(\alpha)$, is the number of sequences in $D$ that contain $\alpha$, given by

$$\mathrm{sup}(\alpha) = |\{ s \in D : \alpha \preceq s \}|,$$

where $\alpha \preceq s$ means that $\alpha$ is contained in $s$. The confidence of rule $X \to Y$, denoted $\mathrm{conf}(X \to Y)$, is the fraction of sequences containing $X$ that also contain $Y$, as given by

$$\mathrm{conf}(X \to Y) = \frac{\mathrm{sup}(X \to Y)}{\mathrm{sup}(X)}.$$

D. SPADE
Sequential Pattern Discovery using Equivalence Classes (SPADE) [8] is a lattice-based method and an example of an early pruning algorithm. These algorithms do not require a support threshold, use position codes and use a vertical projection of the database [9]. The SPADE algorithm is suitable for the medical database due to its ability to find sequences with a high confidence even if the items in the sequence are rare. The reason for this is that SPADE does not require the user to input a minimum value for the support of an item, so rare items (with a low support) can be included in the rules. SPADE was applied to the pre-processed medical data with a confidence of 0.1 using the cspade function in the arulesSequences R package [10], [11]. The confidence value of 0.1 was chosen for efficiency and to prevent an excessive number of rules being mined.
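The support and confidence definitions given earlier can be illustrated by a brute-force count over a toy database. The sketch below is not SPADE and does not use cspade; it only shows what the two quantities measure. It assumes that a rule X → Y is scored as the sequence X followed by Y, and the function names (`contains`, `support`, `confidence`) and the toy data are hypothetical.

```python
def contains(sequence, pattern):
    """True if each pattern event is a subset of a distinct sequence event,
    with the matched events appearing in the same order."""
    pos = 0
    for event in pattern:
        # Advance to the next sequence event that includes this pattern event.
        while pos < len(sequence) and not (event <= sequence[pos]):
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True


def support(db, pattern):
    """Number of sequences in db that contain the pattern."""
    return sum(1 for seq in db if contains(seq, pattern))


def confidence(db, antecedent, consequent):
    """sup(X followed by Y) / sup(X); assumes X occurs at least once."""
    base = support(db, antecedent)
    if base == 0:
        raise ValueError("antecedent never occurs in the database")
    return support(db, antecedent + consequent) / base


# Toy database: each sequence is a list of events, each event a set of items.
db = [
    [{"A"}, {"B"}, {"C"}],
    [{"A"}, {"C"}],
    [{"B"}, {"A"}],
    [{"A", "B"}, {"C"}],
]

sup_a = support(db, [{"A"}])
conf_a_c = confidence(db, [{"A"}], [{"C"}])
```

In this toy database every sequence contains A, but only three of the four contain A followed later by C, so the rule A → C has confidence 3/4.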

III. RESULTS & DISCUSSION
A total of 97,883 sequential rules were found by SPADE, offering a variety of information. The rules contain information on differences between genders, how beneficial health advice is, how age is associated with an illness and which other illnesses may occur while an illness progresses; see Tables I-III. The sequential rules give insight into the number of people who remain ill or relapse. For example, it was common for the sequential rules to be of the form A, A → A or A, A, A → A; these correspond to patients having the illness again after previously having it two or three times respectively. The confidence of these sequential rules can be used to estimate the chance of a patient having a repeat illness. Table I shows that 34.3% of patients suffering from 'Depressive disorder NEC' have a repeat, but 52.7% of patients who have 'Depressive disorder NEC' twice will have it again, suggesting the probability of them developing 'chronic depression' increases each time they have a relapse. It may be possible to find attributes that increase a patient's chance of relapse by investigating any differences between patients that have repeats and those that do not.
The difference between the confidences of the sequential rules containing gender information and those not containing gender information indicates differences between males and females. If the sequential rule A, B, female → C has a greater confidence than A, B → C then this suggests females with a history of events A and B are more likely than males with a history of A and B to develop C. For example, the rule 'Essential hypertension, female' → 'Essential hypertension' has a greater confidence than 'Essential hypertension' → 'Essential hypertension', see Table III, suggesting females are more likely to have a repeat of Essential hypertension than males. In this way, sequential rules can be used to identify patients that need to be targeted for preventative action.
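The repeat rules of the form A → A and A, A → A discussed above can be related to simple per-patient counts. The sketch below assumes that each counted occurrence of illness A falls in a separate basket (a different date) and that the rule confidence is the support of the longer sequence divided by the support of the shorter one. The function name `repeat_confidence` and the counts are hypothetical and are not taken from the THIN data.

```python
def repeat_confidence(occurrence_counts, k):
    """Confidence of the rule 'A (k times) -> A'.

    occurrence_counts: number of separate baskets containing A, one per patient.
    Returns (#patients with at least k+1 occurrences) / (#patients with at least k).
    """
    base = sum(1 for n in occurrence_counts if n >= k)
    again = sum(1 for n in occurrence_counts if n >= k + 1)
    if base == 0:
        raise ValueError("no patient has the required number of occurrences")
    return again / base


# Invented counts for ten patients (0 means the patient never had A).
counts = [1, 1, 1, 2, 2, 3, 4, 0, 1, 3]

conf_first_repeat = repeat_confidence(counts, 1)   # rule A -> A
conf_second_repeat = repeat_confidence(counts, 2)  # rule A, A -> A
```

With these invented counts, nine patients have A at least once, five at least twice and three at least three times, so the two confidences are 5/9 and 3/5. An increase from the first to the second is the same kind of pattern as the one reported for 'Depressive disorder NEC' in Table I.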
As the database contains Read codes corresponding to health-care interventions, such as educating and informing patients about an illness, the rules containing these interventions can be analysed to determine if the intervention has been successful. In Table III, the rule 'Health education offered, Essential hypertension' → 'Essential hypertension' had a lower confidence than 'Essential hypertension' → 'Essential hypertension'. This suggests a patient may have a decreased chance of repeating the illness if given advice, implying some interventions help improve a patient's medical state.
By adding the patient's year of birth (yob) to the start of each patient sequence it was possible to find age-specific illnesses. The confidence for a rule of the form yob → B is the fraction of patients born in yob who had event B by the end of 2010. This gives some indication of the age at which people have event B, but with some limitations. For example, the two rules 1943 → 'Essential hypertension' and 1944 → 'Essential hypertension', Table III, with respective confidences 0.117 and 0.110, show that approximately 11% of people aged 66/67 in the UK have had 'Essential hypertension' at some point. Further, as no yob from 1945 to 2010 was contained in a sequential rule, a patient's chance of having 'Essential hypertension' before the age of 66 is less than 10%. The difference between the confidences for the different yob may help indicate the ages between which there is a rapid increase in patients developing the illness. For example, patients aged 67 were 6.4% more likely to have had 'Essential hypertension' than patients aged 66. It is strange that yob values of 1942 or lower were not found in any sequential rules; this may be a consequence of the THIN database only having data collected from 2003 onwards. Patients born before 1942 may have progressed from 'Essential hypertension' before the data was recorded. This also highlights a limitation of the algorithm, as it is not able to infer obvious yob rules such as 'yob less than 1944' → 'Essential hypertension'.
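The 6.4% figure quoted above is a relative difference between the two Table III confidences, which can be checked directly:

$$\frac{\mathrm{conf}(1943 \to \text{Essential hypertension})}{\mathrm{conf}(1944 \to \text{Essential hypertension})} = \frac{0.117}{0.110} \approx 1.064$$

In absolute terms the difference is $0.117 - 0.110 = 0.007$, that is, 0.7 percentage points.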
For some illnesses that are progressive, sequential pattern mining found events that lead to or cause them. For example, it was found that more than 10% of patients that have 'pure hypercholesterolaemia', 'Depression', 'Type 2 diabetes mellitus' or 'Chronic kidney disease stage 3' develop 'Essential hypertension'. Also, more than 10% of patients with 'Backache' or 'Tiredness' were later diagnosed with 'depression', see Table I. Because these rules are common, the knowledge is probably well known by doctors. However, the sequential rules provide additional quantitative information. Doctors know that patients with 'Type 2 diabetes mellitus' are at high risk of developing 'Essential hypertension', but the sequential rule confidence gives the actual proportion of patients with 'Type 2 diabetes mellitus' that develop 'Essential hypertension'.
For some consequents, the antecedents do not lead to or cause the consequent but are linked to it through the population subgroup with the highest prevalence of the consequent event. This was observed in Table II, where 'acute conjunctivitis' is the consequent. The yob observed in the antecedents suggests that the young are more susceptible to 'acute conjunctivitis'. Many other antecedents seem to be linked to young age rather than to 'acute conjunctivitis', such as 'Infantile eczema', 'Croup', 'Normal birth' and 'Enterobiasis -threadworm'. These rules seem most interesting as they are less obvious but may help indicate patients at risk of a future illness by finding illnesses that are common in their population subgroup.
The limitations of applying sequential pattern mining to all patients in the THIN general practice database are that most patient sequences are incomplete and that it is difficult to distinguish between real repeat infections and repeat appointments for the same one. If sequential pattern mining were applied to sequences of patients that have complete records from birth to death, then the set of rules obtained would be complete. But most patients used in this study are still living and many would only have entries for part of their life recorded. This may bias the results. For example, suppose many people develop 'event A' at age 40 and these people are also likely to develop 'event B' at age 50. If their sequences stop at age 42 they are still included in the support and confidence calculations, so the support and confidence of the rule 'event A' → 'event B' will be much lower than they should be. One way of solving this is to only use patients that are registered from birth and also recorded as dead, but this may limit numbers. On the other hand, this may actually help weigh sequential rules, as sequential rules that occur over a shorter time interval are less likely to be affected by only having a partial sequence than sequential rules that occur over years, and these shorter interval sequential rules are of greater interest to doctors due to their urgency.
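The effect of incomplete sequences on confidence can be illustrated with a small deterministic example. The sketch below uses invented numbers and a hypothetical record layout of (age at event A, age at event B or None, last age observed); it is only meant to show the direction of the bias, not its size in the THIN data.

```python
def rule_confidence(patients, truncated=True):
    """Confidence of 'event A' -> 'event B' from per-patient ages.

    patients: list of (age_a, age_b or None, last_observed_age).
    With truncated=True, events after last_observed_age are not seen.
    """
    has_a = 0
    has_a_then_b = 0
    for age_a, age_b, last_age in patients:
        if truncated and age_a > last_age:
            continue  # event A itself was never recorded
        has_a += 1
        b_follows = age_b is not None and age_b > age_a
        b_seen = (not truncated) or (age_b is not None and age_b <= last_age)
        if b_follows and b_seen:
            has_a_then_b += 1
    if has_a == 0:
        raise ValueError("event A is never observed")
    return has_a_then_b / has_a


# Ten invented patients who all have event A at age 40.
patients = (
    [(40, 50, 80)] * 3      # develop B at 50, fully observed
    + [(40, 50, 42)] * 3    # develop B at 50, but records stop at age 42
    + [(40, None, 80)] * 2  # never develop B, fully observed
    + [(40, None, 42)] * 2  # never develop B, records stop at age 42
)

full_conf = rule_confidence(patients, truncated=False)
observed_conf = rule_confidence(patients, truncated=True)
```

In this example six of the ten patients eventually develop event B, but only three of them are observed doing so, so the observed confidence (0.3) is half the confidence that complete records would give (0.6).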

IV. CONCLUSIONS
The results obtained in this paper indicate that new information about how medical events, age and gender are related over time can be learned by applying sequential rule mining algorithms to the THIN longitudinal health-care database using the proposed pre-processing methodology. Interestingly, the key result from this study is that these sequential rules present the possibility of determining the likelihood of re-infections. As the database contains additional information, including the demographics of the general practice, family relationships and history, BMI and activities that adversely affect health (e.g. smoking), these could be used to investigate attributes that may increase the chance of a repeat infection. This information offers the potential of preventing re-infection and therefore reducing the cost of health-care in the UK. A comparison between our results and existing results is not possible, as existing work either detects patterns in time series data or detects patterns for specific illnesses using medical attributes different from the medical events contained in the THIN database.
Future work needs to address the limitation of discretising the yob, so that rules such as 'patients born prior to 19xx' → 'event A' can be mined, and to develop a method of identifying whether repeat medical entries are repeat entries for the same infection or real reinfections. It is also of interest to determine the sequential rules obtained when only considering first occurrences of medical events or only using sequences of patients spanning a minimum number of years.
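One possible way to make rules of the form 'patients born prior to 19xx' → 'event A' minable is to add threshold items to the first basket of each patient sequence. The sketch below is purely illustrative and is not a method proposed or evaluated in this paper; the cutoff years, the item labels such as `yob<1945` and the function names are all hypothetical.

```python
def yob_threshold_items(yob, cutoffs):
    """Items of the form 'yob<C' for every cutoff C the patient was born before."""
    return frozenset(f"yob<{c}" for c in cutoffs if yob < c)


def augment_first_basket(sequence, yob, cutoffs):
    """Return a copy of the sequence whose first basket also holds threshold items."""
    first = frozenset(sequence[0]) | yob_threshold_items(yob, cutoffs)
    return [first] + list(sequence[1:])


# Invented example: a patient born in 1943 with one recorded event.
sequence = [frozenset({"yob=1943", "gender=male"}),
            frozenset({"Essential hypertension"})]
augmented = augment_first_basket(sequence, 1943, [1940, 1945, 1950])
```

A drawback of this kind of encoding is that the threshold items are nested (anyone born before 1945 is also born before 1950), so it would be expected to increase the number of redundant rules that need to be filtered afterwards.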