Data‐driven machine‐learning analysis of potential embolic sources in embolic stroke of undetermined source

Hierarchical clustering, a common ‘unsupervised’ machine‐learning algorithm, is advantageous for exploring potential underlying aetiology in particularly heterogeneous diseases. We investigated potential embolic sources in embolic stroke of undetermined source (ESUS) using a data‐driven machine‐learning method, and explored variation in stroke recurrence between clusters.


Introduction
Approximately 17% of all ischemic stroke patients have an embolic stroke of undetermined source (ESUS) (i.e. a stroke without an apparent cause despite recommended diagnostic work-up) [1]. Numerous underlying pathologies may serve as embolic sources in patients with ESUS such as atherosclerotic plaques in the carotids and the aortic arch, covert atrial fibrillation (AF), patent foramen ovale (PFO), left ventricular (LV) disease, atrial cardiopathy, cancer and cardiac valvular disease [1]. Recently, we showed that there is significant overlap of potential embolic sources (PESs) in patients with ESUS. In a cohort of consecutive ESUS patients, 65.5% had two or more PESs and 31.1% had three or more PESs, whereas on average, each patient had two PESs [2]. In this context, it is frequently difficult to identify the actual source of embolism in an ESUS patient, when several PESs coexist [3].
Clustering algorithms, a common class of 'unsupervised' machine-learning methods, can be used to identify groups (clusters) of similar individuals based on the combined values of their measured characteristics [4]. In hierarchical clustering, the results are easily reproducible, and cluster membership is fixed once clusters are assigned, so participants cannot be reclassified into a different cluster. This contrasts with standard regression methods, which belong to supervised learning and are used to identify associations between response and explanatory variables. Regression can be used in multiple testing to determine significant differences between groups, but those groups need to be specified a priori. Each test is independent of the other tests, which results in groups that are only relevant to the particular variable tested. Clustering, by contrast, takes into account all variables, providing a way to holistically represent the entirety of the data collected [5]. Therefore, this process is advantageous for exploring the potential underlying aetiology in particularly heterogeneous diseases such as ESUS.
In this context, we investigated the potential sources of embolism in ESUS patients using a data-driven, machine-learning analytical method and explored variation in rates of stroke recurrence between clusters.

Patient population
We analysed complete data from consecutive patients with ESUS recruited in three prospective stroke registries: ASTRAL (Acute Stroke Registry and Analysis of Lausanne), Athens Stroke Registry, and Larissa Stroke Registry [6][7][8]. A standard pro forma template was used to collect all clinical, demographic, biometric, biomarker and outcome data. The use of the registry data for research was approved by local institutional review boards, and the study is registered at ClinicalTrials.gov (NCT02766205). Full descriptive study procedures and methods were previously published [2].
The definition of ESUS was based on the Cryptogenic Stroke/ESUS International Working Group criteria: non-lacunar brain infarct in the absence of (i) extracranial atherosclerosis causing ≥50% luminal stenosis in arteries supplying the area of ischemia or (ii) major-risk cardioembolic source or (iii) any other specific cause of stroke (e.g. arteritis, dissection, migraine/vasospasm or drug misuse) [1].

Patient features included in the analysis and methodology of cluster generation
The clustering methods utilised all baseline features detailed in the Supplementary Table S1, which included demographics, lifestyle factors, clinical symptoms/signs during the qualifying ESUS, comorbidities, biometrics, biomarkers, vascular imaging, brain imaging, electrocardiogram and echocardiography.
To identify groups of patients with similar characteristics (i.e. clusters), we used a combined k-means and hierarchical agglomerative approach called hierarchical k-means clustering [9]. This approach speeds up a traditional k-means algorithm in both the training and query phases, which allows a much larger number of centroids to be used and in turn leads to much better learning [9]. In this process, we pick some k to be the branching factor, which defines the number of clusters at each level of the clustering hierarchy. We then cluster the set of points into k clusters using a standard k-means algorithm. Finally, we recursively cluster each subcluster until a small fixed number of points is reached. Therefore, using all the baseline data provided from ESUS patients, the algorithm could assign each individual to a unique cluster.
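The recursive procedure described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the R implementation used in the study; the function names, the toy coordinates, the `min_size` stopping threshold and the fixed random seed are all assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the non-empty groups of points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    groups = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        updated = [mean_point(g) if g else centroids[j]
                   for j, g in enumerate(groups)]
        if updated == centroids:  # assignments have stabilised
            break
        centroids = updated
    return [g for g in groups if g]

def hierarchical_kmeans(points, k=2, min_size=3):
    """Split into k groups, then recurse until groups are small (assumed rule)."""
    if len(points) <= min_size or len(points) < k:
        return {"points": points, "children": []}
    groups = kmeans(points, k)
    if len(groups) < 2:  # no further split possible
        return {"points": points, "children": []}
    children = [hierarchical_kmeans(g, k, min_size) for g in groups]
    return {"points": points, "children": children}

def leaves(node):
    """Collect the leaf groups of the hierarchy."""
    if not node["children"]:
        return [node["points"]]
    return [leaf for child in node["children"] for leaf in leaves(child)]

# Toy two-dimensional data (illustrative only, not patient data).
toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2),
       (5.0, 5.1), (5.2, 5.0), (5.1, 5.3), (5.3, 5.2)]
tree = hierarchical_kmeans(toy, k=2, min_size=3)
```

Every point ends up in exactly one leaf, which mirrors the statement that each individual is assigned to a unique cluster.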
To determine the optimum number of clusters, we used a combined approach based on 30 different clustering indices, including common methods such as 'elbow', 'average silhouette' and 'gap statistic'. The optimal number of clusters was the one selected most frequently across all 30 indices [10]. To visualize the clustering process, we generated a dendrogram (a tree diagram) to illustrate the arrangement of the clusters produced [11]. Each branching creates a unique participant cluster, with the size of the clusters determined by the height of the branches. Separately, we also conducted a principal components analysis and plotted the first two principal components on a coordinate plane to observe how ESUS patients separate by their assigned cluster. These principal components were derived using an orthogonal transformation (eigenvectors and eigenvalues) to reduce the dimensionality of the original data, which comprised all the clinical features collected on ESUS patients. Clustering analyses and data visualisation were conducted in the statistical software R (The R Foundation for Statistical Computing, Vienna, Austria) using the packages cluster, NbClust, factoextra, dendextend and ggplot2.
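As a rough illustration of the selection rule, the sketch below computes one of the named indices (average silhouette width) and then picks the number of clusters proposed most often across indices. It is written in plain Python rather than with the NbClust package used in the study; the function names, the toy points, the placeholder index names and the tie-breaking rule (smallest k wins) are assumptions.

```python
from collections import Counter
from math import dist

def average_silhouette(points, labels):
    """Mean silhouette width; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def most_frequent_k(votes):
    """votes maps an index name to the k it proposes; ties go to the smallest k."""
    counts = Counter(votes.values())
    top = max(counts.values())
    return min(k for k, c in counts.items() if c == top)

# Toy points: two tight pairs far apart (illustrative only).
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = average_silhouette(pts, [0, 0, 1, 1])  # sensible partition
bad = average_silhouette(pts, [0, 1, 0, 1])   # partition that mixes the pairs

# Placeholder votes, not the study's index output.
chosen = most_frequent_k({"index_a": 3, "index_b": 2, "index_c": 3})
```

A higher average silhouette indicates tighter, better separated clusters, so `good` exceeds `bad` here.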
Description of clusters: summary characteristics and prevalence of potential embolic sources
Descriptive characteristics of each cluster were provided, reporting number (%) and median [interquartile range (IQR)] for categorical and continuous variables, respectively. We further profiled each cluster by determining the prevalence of each PES within each cluster. Patients were also categorised by the number of PESs: zero to one PES, two PESs, or three or more PESs.
Potential embolic sources were categorised as follows: atrial cardiopathy, AF, arterial disease, LV disease, cardiac valvular disease, PFO, and cancer, as previously described in detail [2]. In particular, based on previously published associations with the risk of stroke, atrial cardiopathy was diagnosed if the echocardiogram reported left atrial dilatation or increased left atrial diameter (>38 mm for women and >40 mm for men) [12,13], or if supraventricular extrasystoles were present on the 12-lead electrocardiograms performed during hospitalization [14,15]. We diagnosed arterial disease if the imaging reports described any ipsilateral atherosclerotic carotid plaque causing luminal stenosis of <50% [16][17][18] or aortic arch atherosclerosis [19][20][21][22]. We did not review the images ourselves, and we did not include contralateral carotid atherosclerosis in this PES. LV disease was diagnosed if low LV ejection fraction (<35%), LV hypertrophy or left-sided heart failure was reported on the echocardiogram, or if LV hypertrophy was identified on the electrocardiogram (Sokolow index ≥ 35 mm) [23]. We diagnosed cardiac valvular disease if moderate-to-severe stenosis or regurgitation of the mitral or aortic valve was reported on the echocardiogram. AF was assessed during on-site patient visits at the outpatient clinic and/or by contact with the patient and/or the next of kin or the patient's primary physician; it was considered present if confirmed by an electrocardiogram performed for any reason including palpitations, irregular pulse on clinical examination, in-hospital surveillance or portable outpatient monitoring.
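The threshold-based definitions above translate directly into simple rules. The following sketch encodes the atrial cardiopathy and LV disease criteria and the PES-count categories as stated in the text; the function and argument names are hypothetical, and missing measurements are assumed to count as "criterion not met".

```python
def has_atrial_cardiopathy(sex, la_diameter_mm=None,
                           la_dilatation_reported=False, sve_on_ecg=False):
    """Left atrial dilatation, or diameter >38 mm (women) / >40 mm (men),
    or supraventricular extrasystoles on the 12-lead ECG."""
    threshold = 38 if sex == "female" else 40
    enlarged = la_diameter_mm is not None and la_diameter_mm > threshold
    return la_dilatation_reported or enlarged or sve_on_ecg

def has_lv_disease(lvef_percent=None, lv_hypertrophy=False,
                   left_heart_failure=False, sokolow_mm=None):
    """LVEF <35%, LV hypertrophy or left-sided heart failure on echo,
    or Sokolow index >=35 mm on the ECG."""
    low_ef = lvef_percent is not None and lvef_percent < 35
    ecg_lvh = sokolow_mm is not None and sokolow_mm >= 35
    return low_ef or lv_hypertrophy or left_heart_failure or ecg_lvh

def pes_count_category(pes_flags):
    """Map a dict of PES name -> bool to the three count categories."""
    n = sum(bool(v) for v in pes_flags.values())
    if n <= 1:
        return "0-1 PES"
    if n == 2:
        return "2 PESs"
    return "3 or more PESs"

# Hypothetical patient, for illustration only.
patient = {
    "atrial cardiopathy": has_atrial_cardiopathy("female", la_diameter_mm=39),
    "LV disease": has_lv_disease(lvef_percent=55, sokolow_mm=36),
    "PFO": False,
}
category = pes_count_category(patient)
```

Note that the diameter thresholds are strict inequalities, whereas the Sokolow criterion includes the boundary value of 35 mm.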

Clinical end-points during follow-up
We evaluated the risk of stroke recurrence over the 10 years of follow-up by cluster. Stroke during follow-up was ascertained by on-site patient visits at the outpatient clinic, or by contact with the patient's next of kin or the patient's primary physician. Where possible, the outcome had been adjudicated by reviewing the patient's medical notes and imaging results.

Statistical analysis
Comparisons across clusters were conducted using the non-parametric Kruskal-Wallis test for continuous variables and χ² tests for categorical variables [24,25]. Prior to the clustering analysis, data that were missing at random were imputed by multiple imputation using chained equations [26].
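For readers less familiar with these two tests, the sketch below computes their test statistics from first principles in plain Python. It is only an illustration: the study used R, the function names and toy numbers are assumptions, the Kruskal-Wallis statistic is shown without the correction for ties, and p values (which require the chi-square distribution) are left out.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a
    contingency table given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic using midranks, without tie correction."""
    pooled = sorted(v for g in groups for v in g)
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # average of ranks i+1 .. j
        i = j
    n = len(pooled)
    weighted = sum(sum(ranks[v] for v in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * weighted - 3 * (n + 1)

# Toy inputs (illustrative only, not study data).
stat, df = chi_square_statistic([[10, 20], [20, 10]])
h = kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

Under the null hypothesis both statistics are compared against a chi-square distribution, with `df` degrees of freedom for the contingency table and (number of groups - 1) for Kruskal-Wallis.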
To quantify the contribution of each PES to each cluster, we applied logistic regression to determine the association between each PES with the derived cluster. In this analysis, the PES was the exposure variable, and the cluster grouping was the outcome variable (coded as 1, belonging to the cluster; or coded as 0, belonging to other clusters). All models were adjusted for sex, age, dyslipidaemias, diabetes mellitus, smoking, coronary artery disease, and National Institutes of Health Stroke Scale (NIHSS) score at admission. The PESs in each cluster were then ranked by significance and by the effect size, with 95% confidence intervals (CIs) provided. In this way, we were able to profile each cluster and associate them to specific PESs.
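The one-versus-rest coding of cluster membership can be made concrete with a small sketch. For simplicity it computes an unadjusted odds ratio from a 2x2 table with a Woolf (log-based) 95% CI, whereas the models described above were multivariable logistic regressions adjusted for covariates; the function names and the toy data are assumptions.

```python
import math

def one_vs_rest(cluster_labels, target):
    """Code membership as 1 (in the target cluster) or 0 (any other cluster)."""
    return [1 if c == target else 0 for c in cluster_labels]

def odds_ratio_ci(exposed, outcome, z=1.96):
    """Unadjusted odds ratio with a Woolf confidence interval.
    Returns None when a cell is empty, because the odds ratio is then
    zero or infinite (for example, a PES present in every cluster member)."""
    pairs = list(zip(exposed, outcome))
    a = sum(1 for e, o in pairs if e and o)          # PES present, in cluster
    b = sum(1 for e, o in pairs if e and not o)      # PES present, other clusters
    c = sum(1 for e, o in pairs if not e and o)      # PES absent, in cluster
    d = sum(1 for e, o in pairs if not e and not o)  # PES absent, other clusters
    if 0 in (a, b, c, d):
        return None
    odds_ratio = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = odds_ratio * math.exp(-z * se_log)
    upper = odds_ratio * math.exp(z * se_log)
    return odds_ratio, lower, upper

# Toy data: 30 patients with the PES and 30 without (illustrative only).
pes_present = [1] * 30 + [0] * 30
clusters = [2] * 20 + [1] * 10 + [2] * 10 + [3] * 20
in_cluster_2 = one_vs_rest(clusters, target=2)
result = odds_ratio_ci(pes_present, in_cluster_2)
```

In practice the adjusted estimates would come from a regression routine that accepts covariates; the sketch only shows how the binary outcome is constructed and how an odds ratio relates to the underlying counts.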
Subsequently, we evaluated the 10-year follow-up of stroke recurrence by cluster. Incidence rates (per 1000 person-years) and 95% CIs were provided. To obtain estimates for the association between cluster groups and stroke recurrence, we performed Cox proportional hazards regression analysis, with informative censoring of the survival time when patients were lost to follow-up or died. The cluster with the lowest event rate for stroke recurrence was used as the reference group.
Furthermore, we quantified the dose-response relationship of having multiple PESs compared to a single or no PES using Cox proportional hazards. Similar to the logistic regression analyses, all hazard ratios were adjusted for sex, age, hypertension, dyslipidaemia, diabetes mellitus, smoking, coronary artery disease, and NIHSS at admission. All hazard models were assessed for proportional hazards using Schoenfeld residuals. P values <0.05 were considered statistically significant.

Results
A total of 800 ESUS patients (43.3% women) were included in the analysis. The median age of patients was 67 years (IQR = 54-77 years).

Visualization of the hierarchical clustering analysis
Across the 30 clustering indices, the optimal number of clusters was four (Fig. S1). The arrangement of the four clusters during the clustering process is illustrated in the dendrogram (Fig. S2).
The principal components analysis identified that 82% of all principal components were needed to explain 100% of the variation in the original ESUS data (Fig. S3a), which suggests that there is substantial heterogeneity between ESUS patients in clinical features, as a high number of principal components are needed to explain significant variation of the original data. By plotting the first two principal components, which explain only up to 16% of the variation in the original data, visual separation can be seen between clusters from the hierarchical clustering process (Fig. S3b). Cluster sizes were as follows: 44 patients (5.5%) in cluster 1, 149 patients (18.6%) in cluster 2, 430 patients (53.8%) in cluster 3, and 177 patients (22.1%) in cluster 4. There was overlap between cluster 1 and cluster 2. However, clusters 2, 3, and 4 all remained quite distinct, with a large degree of separation and very little overlap.

Characteristics of the cluster groups
The baseline characteristics of the patients in the four clusters are summarized in Table 1. There were significant differences between clusters in terms of gender, baseline age, NIHSS at admission, hypertension, diabetes mellitus, coronary artery disease, previous stroke and antithrombotic treatment at discharge.
The prevalence of each PES stratified by cluster is summarised in Table 2. There were significant differences between clusters in the prevalence of atrial fibrillation, atrial cardiopathy, arterial disease, LV disease, PFO and cancer. LV disease was most prevalent in cluster 1 (100%). PFO was most prevalent in cluster 2 (38.9%). Arterial disease was most prevalent in cluster 3 (57.7%). Atrial cardiopathy was most prevalent in cluster 4 (100%).

Association between cluster grouping and PES
Using multivariable logistic regression models, we determined the association between each PES and cluster membership. The adjusted odds ratios and 95% CIs for each cluster are presented in Table 3. LV disease was perfectly associated with cluster 1 membership. PFO was significantly associated with increased likelihood of cluster 2 membership (adjusted odds ratio = 2.69, 95% CI: 1.64-4.41). Arterial disease was significantly associated with increased likelihood of cluster 3 membership (adjusted odds ratio = 2.21, 95% CI: 1.43-3.13). Atrial cardiopathy was perfectly associated with cluster 4 membership.

Risk of stroke recurrence across clusters
The mean and median follow-up durations were 3.7 years (SD = 3.7) and 2.1 years (IQR = 0.8-5.8 years), respectively. Over 2922 person-years, there were 101 recurrent strokes, corresponding to an overall rate of 34.6 per 1000 person-years (95% CI: 28.4-42.0). The risk of stroke recurrence was not different across clusters in adjusted models (Table 4 and Fig. 1).
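The overall rate can be reproduced from the reported counts. The sketch below assumes a log-normal approximation for the confidence interval (standard error of the log rate equal to 1/sqrt(events)); the paper does not state which interval method was used, although this approximation reproduces the reported limits. The function name is hypothetical.

```python
import math

def incidence_rate_per_1000(events, person_years, z=1.96):
    """Incidence rate per 1000 person-years with an approximate 95% CI
    based on the standard error of the log rate (assumed method)."""
    rate = events / person_years * 1000
    se_log = 1 / math.sqrt(events)
    lower = rate * math.exp(-z * se_log)
    upper = rate * math.exp(z * se_log)
    return rate, lower, upper

# Counts as reported in the text: 101 recurrent strokes over 2922 person-years.
rate, lower, upper = incidence_rate_per_1000(101, 2922)
print(f"{rate:.1f} per 1000 person-years (95% CI: {lower:.1f}-{upper:.1f})")
# prints "34.6 per 1000 person-years (95% CI: 28.4-42.0)"
```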

Discussion
This data-driven machine-learning analysis of consecutive ESUS patients identified four clusters of patients based on their baseline characteristics. The largest cluster, which included more than half of the overall population, was associated with the presence of arterial disease. Two clusters of medium size including approximately 15% to 20% of the overall population were associated with atrial cardiopathy and PFO, respectively. A small cluster, which included only 5% of the overall population, was associated with LV disease. Atrial fibrillation was not associated with any cluster. The risk of stroke recurrence was similar across clusters.
During recent years, there has been emerging evidence supporting an important etiological association between ESUS and atherosclerotic plaques. A recent analysis of the NAVIGATE-ESUS (New Approach Rivaroxaban Inhibition of Factor Xa in a Global Trial versus ASA to Prevent Embolism in Embolic Stroke of Undetermined Source) trial [27] as well as several other studies [28][29][30][31][32][33][34][35][36][37][38][39][40] showed that the prevalence of carotid plaques is higher ipsilateral to the infarct than contralateral in patients with ESUS. In addition, the AF-ESUS (Prediction of Atrial Fibrillation after Embolic Stroke of Undetermined Source) study showed that new incident AF is less frequently detected in patients with ESUS and carotid plaques compared to those without [18]. Similarly, in young adults with cryptogenic stroke, carotid plaques were associated with the absence of PFO [41]. Both latter studies show that carotid plaques act as a competing stroke aetiology to other established stroke aetiologies, and hence, support their role as an underlying cause of ESUS.
Moreover, a recent analysis of consecutive emboli retrieved during mechanical thrombectomy showed that the emboli from patients with large artery atherosclerotic and cryptogenic strokes had a similar proportion of platelet-rich clots, which was significantly higher compared with thrombi from patients with cardioembolic stroke [42]. The results of the present study provide further arguments in support of an important association between ESUS and atherosclerotic plaques.
The concept of atrial cardiopathy has emerged during recent years as an important source of embolism in patients with ESUS [43]. There is a growing body of evidence indicating that thrombi may be formed in the diseased left atrium, even in the absence of atrial fibrillation. Atrial cardiopathy has been assessed in various ways using several indices including biomarkers [44][45][46], cardiac magnetic resonance imaging [47] and electrocardiographic indices [48][49][50]. The present analysis adds to the evidence that supports an important causative role of atrial cardiopathy in ESUS and indicates that atrial cardiopathy could be the cause of stroke in 15% to 20% of the overall ESUS population. The ongoing ARCADIA trial (AtRial Cardiopathy and Antithrombotic Drugs In Prevention After Cryptogenic Stroke) is investigating whether patients with ESUS and atrial cardiopathy respond better to apixaban compared to aspirin for secondary stroke prevention [51].
Although older randomized trials were neutral [52], several recent randomized trials showed that percutaneous PFO closure is associated with a large reduction of recurrent stroke rates in patients with ESUS [53], supporting an important etiological association between PFO and ESUS. The results of our analysis are in line with this, as we identified a cluster of patients (18% of the overall cohort) that is associated with PFO.
Several observational studies and randomized trials showed that AF can be detected in 30% of ESUS patients during follow-up, suggesting a strong causal association between AF and ESUS [54][55][56][57][58]. However, there has been emerging evidence questioning the strength of this association, especially for short-lasting episodes detected remotely after ESUS [59]. The rate of AF detection in ESUS patients was similar to that in other non-ESUS stroke patients [56], as well as in older patients without previous stroke [60]. In addition, ESUS patients are phenotypically different compared with stroke patients with AF, with the former being younger with milder strokes [57,58,61,62]. The present analysis adds to the aforementioned evidence that supports the argument that AF is not as strongly associated with ESUS as was initially believed.
The main strength of this study is its design: the data-driven hierarchical-clustering analysis allowed the categorization of patients into distinct clusters based on their baseline characteristics, without pre-specification of variables, and then the coupling of these clusters with PESs. This is a particularly advantageous method for datasets with a large degree of heterogeneity between individuals. The categorization of patients into clusters rather than PESs is advantageous and more informative, as there is large overlap of PESs in patients with ESUS. For example, a previous analysis in the same cohort showed that LV disease was present in 54.4% of the overall cohort [2]. However, the present analysis showed that the cluster associated with LV disease included only 5% of the overall cohort. This suggests that for the majority of patients with LV disease, this would represent an innocent bystander rather than the actual embolic source. On the other hand, our study is limited by the risk of registration bias within and between the participating registries and differences in the workup of patients during the in-hospital phase. Finally, the clustering algorithms are empirical methods, which also may be limited by the sample size of the data and the number of clinical features collected to determine cluster associations, as the analysis was not specifically powered to determine potential associations with future outcomes. Future research should explore whether these findings are consistent in a much larger sample of ESUS patients.
Conclusion
This data-driven, machine-learning, hierarchical-clustering analysis identified four clusters of ESUS patients that were associated with arterial disease, atrial cardiopathy, PFO and LV disease, respectively. AF was not associated with any cluster. The risk of stroke recurrence was not different across clusters.
Disclosure of conflicts of interest
G.N.: received through his institution a research grant for the Prediction of AF in ESUS (AF-ESUS) study (ClinicalTrials.gov Identifier: NCT02766205), which is an investigator-initiated study supported by Pfizer through the BMS/Pfizer European Thrombosis Investigator Initiated Research Program (ERISTA). Speaker fees/advisory boards/research support from Amgen, Bayer, BMS/Pfizer, Boehringer-Ingelheim, Elpen, Galenica, Sanofi, Winmedica. All fees are paid to his institution (University of Thessaly). K.P.: received travel support by Pfizer. G.S.: received research grant from Swiss Heart Foundation, congress travel support from Bayer and Shire, and served on scientific advisory boards for Amgen and Daiichi-Sankyo. All fees are paid to his institution (CHUV). E.K.: Speaker fees/honoraria for advisory boards from Pfizer, Amgen. P.M.: received within the last 2 years through his institution research grants from the Swiss Heart Foundation and BMS; speaker fees from Boehringer-Ingelheim, Medtronic and Amgen; consulting fees from Medtronic and Amgen, and honoraria from scientific advisory boards from Boehringer-Ingelheim, Pfizer and BMS. All this support goes to his institution (CHUV) and is used for stroke education and research. S.W.: Member of the Clinical Practice Research Datalink (CPRD) Independent Scientific Advisory Committee (ISAC), academic advisor to Quealth Ltd. and has received honorarium from AMGEN. All other authors: none.
This study was performed in the dataset of the Prediction of AF in ESUS (AF-ESUS) study (ClinicalTrials.gov Identifier: NCT02766205), which was an investigator-initiated study supported by Pfizer through the BMS/Pfizer European Thrombosis Investigator Initiated Research Program. The Swiss Cardiology Foundation supported data collection in the ASTRAL (Acute STroke Registry and Analysis of Lausanne) registry.

Data availability statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.

Supporting Information
Additional Supporting Information may be found in the online version of this article:
Figure S1. Determining the optimal number of clusters by frequency among 30 clustering indices.
Figure S2. To visualise the clustering process, we generated a dendrogram (tree diagram) to illustrate the arrangement of the clusters produced by the hierarchical k-means clustering analysis.
Figure S3. Principal components analysis of collected baseline clinical features from 800 ESUS patients.
Table S1. Baseline characteristics of patients.