Prognostic significance of nucleolar assessment in invasive breast cancer

Nucleolar morphometric features have a potential role in the assessment of the aggressiveness of many cancers. However, the role of nucleoli in invasive breast cancer (BC) is still unclear. The aims of this study were to investigate the optimal method for scoring nucleoli in BC and their prognostic significance, and to refine the grading of BC by incorporating nucleolar score.


Introduction
The histological grade of invasive breast cancer (BC), determined with the Nottingham grading system, is one of the strongest prognostic indices for guiding management. [1][2][3] Nuclear pleomorphism, tubule formation and mitotic count are used to determine histological grade by the use of light microscopy. However, the subjective nature of the grading system results in variation among pathologists, and hence variation in patient management decisions. 4 Among the three parameters of the Nottingham grading system, assessment of nuclear pleomorphism is considered to be the least reproducible component, owing to its subjective, ill-defined scoring criteria, and it therefore shows the lowest level of interobserver agreement. 5 The agreement in assessing nuclear pleomorphism in BC among expert breast pathologists ranges from 0.35 6 to 0.59 7 (kappa statistics).
Morphological changes in the nucleoli are considered to constitute one of the first histopathological characteristics of malignant transformation, along with abnormal mitotic figures, thickened and irregular nuclear membranes, and coarse chromatin. 8 Malignant cells show a variety of changes in nucleolar structure, including nucleolar composition, size, and number, and chromatin texture. Nucleolar size and number correlate with an increase in the grade of malignancy. 9 MacCarty et al. 10 reported that the nucleolus could show a much larger size relative to the size of the nucleus in malignant cells, regardless of the type or origin of the neoplasm. Moreover, nucleolar size plays a central, proportional role in the rate of cell proliferation, and its morphology is closely related to cancer growth. [11][12][13] The major roles of the nucleolus are the synthesis of rRNA, and processing and assembly of ribosomal subunits. [14][15][16] An increased number and size of nucleoli indicate a higher rate of ribosome biogenesis, representing a major metabolic requisite for cell growth and proliferation. 17,18 These changes in the nucleolus are linked to growth factors and oncoproteins that induce cell proliferation, such as epidermal growth factor, 19 insulin-like growth factor, 20 and c-Myc. 21 Prominent nucleoli have been shown to be associated with poor outcomes in various tumours, 8 including BC. 22 Quantitative nucleolar morphometry predicts metastasis and biochemical recurrence in prostate cancer. 23 Furthermore, nucleolar size assessed by the use of silver staining of argyrophilic nucleolar organiser regions was reported to be associated with poor clinical outcomes of BC in several studies. 8,24,25 However, nucleolar morphometric features are not widely assessed in BC, owing to a lack of reliable assays that can be routinely implemented in a clinical setting.
With the progressive deployment of digital platforms in reporting pathology laboratories, the assessment of subcellular details, including nucleoli, might become a feasible practice to improve BC grading performance and to capture useful prognostic values by the use of artificial intelligence techniques or visual eyeballing methods. Therefore, in this study, we sought to investigate the optimal method for nucleolar evaluation by using digitised images, in terms of reliability, reproducibility, and prognostic significance, in large cohorts of BC. Also, the added value of nucleolar score in refining the performance of the Nottingham grading system was assessed.

Nottingham cohort
This study was conducted on a large series (n = 1600) of primary operable BCs presenting to Nottingham City Hospital, Nottingham, UK between 1999 and 2006. This is a well-characterised cohort with long-term clinical follow-up (median, 138 months; range, 0-216 months) and detailed clinicopathological data, including patient age at diagnosis, primary tumour histological grade, tumour size, histological tumour type, the Nottingham Prognostic Index (NPI), molecular subtypes, and outcome data, including BC-specific survival (BCSS), defined as the time (in months) from the date of the primary surgery to the time of death from BC, and distant metastasis-free survival (DMFS), defined as the time (in months) from the primary surgery until the first event of distant metastasis.
For the purpose of this study, this cohort was divided into two groups: the first sequential 400 cases (25%) of the cohort were used as a training set, and the subsequent 1200 cases were used as a validation set. Table S1 summarises the clinicopathological parameters of both sets.
Freshly cut, 4-µm-thick, full-face sections for the Nottingham cohort were stained with haematoxylin and eosin (H&E), one section per case from the tumour block that showed the largest tumour burden, defined as >50% of the section area. H&E-stained slides were scanned into high-definition digital images through high-resolution (0.19 µm/pixel) scanning with a high-throughput scanner (Pannoramic 250 Flash III; 3D-Histech, Budapest, Hungary), and the slides were then viewed with CASEVIEWER (version 2.2.0.85; 3D-Histech) on a full-screen panel (size, 21 inches; resolution, 1366 × 768). These tumours were graded by use of the digital images, as previously described. 26

External validation cohort
In addition, digital H&E BC images from The Cancer Genome Atlas (TCGA) 27 (n = 743) were used as a further external validation set. These images were digitised at ×40 objective magnification and obtained from BC excision specimens from patients who have long-term clinical follow-up (10 years). All of the images were directly downloaded from the cBioPortal website and viewed on CASEVIEWER for nucleolar score. 28

Scoring of nucleoli
In this study, nucleoli were assessed by visual scoring of the digitised whole slide images according to the following four scoring methods:
• A modified Helpap method, 29 which was based on results related to prior work. 30 This method stratified nucleoli into three scores on the basis of their prominence. Nucleoli were assigned a score of 1 if no prominent nucleoli (i.e. inconspicuous) or single less prominent nucleoli that were difficult to see at ×20 magnification were observed. A score of 3 was assigned if prominent nucleoli, which were easily seen at ×10 magnification and were identified in at least 20% of the tumour, 31 or dysmorphic/multiple nucleoli were present. A score of 2 was assigned to those tumours with nucleoli not scored 1 or 3.
To enhance objectivity, nucleoli were also scored at ×40 magnification, and were stratified according to the number of prominent nucleoli (defined as ≥2.5 µm in size or easily seen at ×10 magnification) per defined number of field views (FVs) on the screen:
• Counting in 10 FVs (equivalent to an area of 1 mm²).
• Counting in five FVs (equivalent to an area of 0.5 mm²).
• Counting in one FV (equivalent to an area of 0.1 mm²).
Counting was carried out within the nucleolus hotspot, defined as an area of conspicuous nucleoli observed by scanning the images at ×10 magnification. Examples of nucleolar score in one FV are shown in Figure 1. The cases were scored on the basis of the absolute number of prominent nucleoli, and cut-offs were then applied to categorise the cases into three scores: 1, 2, and 3.
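As an illustration only, the counting-based methods can be sketched in Python as two steps: converting the number of FVs to the area examined, and mapping the absolute count of prominent nucleoli to a score of 1, 2 or 3. The cut-off values used in the example call are hypothetical placeholders (the cut-offs applied in the study were generated with X-TILE and are given in Table S2), the function names are illustrative, and treating each cut-off as an inclusive upper bound is an assumption.

```python
# Illustrative sketch only; the cut-offs below are hypothetical, not those of Table S2.
FV_AREA_MM2 = 0.1  # area of one field view at x40, as stated in the text


def area_examined_mm2(n_field_views: int) -> float:
    """Area covered by a given number of field views."""
    if n_field_views < 1:
        raise ValueError("at least one field view is required")
    return n_field_views * FV_AREA_MM2


def nucleolar_score(count: int, low_cutoff: int, high_cutoff: int) -> int:
    """Map an absolute count of prominent nucleoli to a score of 1, 2 or 3.

    Assumption: counts up to and including low_cutoff score 1, counts up to
    and including high_cutoff score 2, and anything higher scores 3.
    """
    if count < 0:
        raise ValueError("count cannot be negative")
    if not low_cutoff < high_cutoff:
        raise ValueError("low_cutoff must be below high_cutoff")
    if count <= low_cutoff:
        return 1
    if count <= high_cutoff:
        return 2
    return 3


if __name__ == "__main__":
    # Hypothetical cut-offs of 10 and 30 prominent nucleoli counted in five FVs.
    print(area_examined_mm2(5), nucleolar_score(18, low_cutoff=10, high_cutoff=30))
```

Keeping the cut-offs as parameters means that the same function can be reused for the 10-FV, five-FV and one-FV methods once the corresponding cut-offs are supplied.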
To check the reliability and reproducibility of the nucleolar evaluation methods, the training set was scored with all aforementioned methods by four observers; the first observer scored all of the cases, and the others scored 25% of the training set. The optimal method for nucleolar score evaluation was determined on the basis of the level of reproducibility in terms of interobserver concordance and association with outcome.
The first validation set (n = 1200 cases) was scored by two observers using the chosen method; the first observer scored all of the cases, and the other scored 50% of the cases to further assess the level of concordance of this method. Nucleolar score determined by the first observer (K.A.) was considered in the final statistical analysis. For external validation of the optimal method chosen, digital images of scanned H&E-stained full-face slides of TCGA were scored by one observer.
In a previous study, we graded the Nottingham cohort twice using whole slide images by one observer, 32 and data on the concordance of grade and pleomorphism were used in this study. Moreover, incorporation of nucleolar score in the Nottingham grading system was attempted to assess the performance of the Nottingham grade with the addition of this score to the three components and/or replacement of nuclear pleomorphism.
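The two ways of incorporating nucleolar score can be made concrete with a short Python sketch. It assumes that each component is scored 1, 2 or 3, and it applies the conventional Nottingham cut-offs (total 3-5, grade 1; 6-7, grade 2; 8-9, grade 3) to the conventional grade only. For the replacement and addition variants it returns the total score alone, because the grade cut-offs for those totals were derived with X-TILE (Table S4) and are not reproduced here. The function names are illustrative.

```python
def _check(*components: int) -> None:
    """Each grade component is expected to be scored 1, 2 or 3."""
    for value in components:
        if value not in (1, 2, 3):
            raise ValueError("component scores must be 1, 2 or 3")


def nottingham_grade(tubules: int, pleomorphism: int, mitoses: int) -> int:
    """Conventional Nottingham grade from the three component scores."""
    _check(tubules, pleomorphism, mitoses)
    total = tubules + pleomorphism + mitoses
    if total <= 5:
        return 1
    if total <= 7:
        return 2
    return 3


def candidate_totals(tubules: int, pleomorphism: int, mitoses: int,
                     nucleoli: int) -> dict:
    """Total scores for the conventional grade and the two modified variants.

    'replacement' swaps pleomorphism for nucleolar score (range 3-9);
    'addition' adds nucleolar score as a fourth component (range 4-12).
    No grade is assigned to the modified totals here, because their
    cut-offs are not given in the text.
    """
    _check(tubules, pleomorphism, mitoses, nucleoli)
    return {
        "conventional": tubules + pleomorphism + mitoses,
        "replacement": tubules + nucleoli + mitoses,
        "addition": tubules + pleomorphism + mitoses + nucleoli,
    }
```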

Statistical analysis
The optimal cut-off point of nucleolar count against BCSS was defined by the use of X-TILE bioinformatics software version 3.6.1 (School of Medicine, Yale University, New Haven, CT, USA). 33 Nucleoli were given three scores according to these cut-off points (Table S2). Also, new grade scores of the Nottingham grading system were obtained by the use of cut-off points of total scores against BCSS with X-TILE. IBM-SPSS statistical software 24.0 (SPSS, Chicago, IL, USA) was used for statistical analysis. The degree of interobserver agreement was assessed by use of the intraclass correlation coefficient (ICC) for continuous data. Fleiss' kappa statistic was used to assess the concordance between more than two observers for categorical variables. Associations between nucleolar count with different concordant and discordant cases were analysed with the Kruskal-Wallis test. Outcome analysis was assessed by the use of Kaplan-Meier curves and the log-rank test. Cox proportional hazards multivariate regression modelling was used for the multivariate analysis. For all tests, P-values of <0.05 (two-tailed) were considered to be statistically significant.
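For readers unfamiliar with Fleiss' kappa, the statistic can be written out in a few lines of Python. This is a generic sketch of the standard formula, not the SPSS procedure used in the study; the input layout (one row per case, one column per category, each cell holding the number of observers who chose that category) and the function name are assumptions made for the example.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings[i][j] is the number of observers who assigned case i to
    category j; every case must be rated by the same number of observers.
    """
    n_cases = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two observers are required")
    if any(sum(row) != n_raters for row in ratings):
        raise ValueError("every case needs the same number of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that fell into each category.
    p_category = [
        sum(row[j] for row in ratings) / (n_cases * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement for each case, then averaged over cases.
    p_case = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_observed = sum(p_case) / n_cases
    # Agreement expected by chance.
    p_expected = sum(p * p for p in p_category)
    return (p_observed - p_expected) / (1 - p_expected)
```

For four observers scoring nucleoli as 1, 2 or 3, each row would hold three counts summing to four; a value of 1 indicates complete agreement and values near 0 indicate agreement no better than chance.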

Results
This study included two cohorts of BC: (i) the Nottingham cohort (n = 1600 cases) split into a training set comprising 400 cases scored by four observers using four different scoring methods, and a validation set (n = 1200), which was scored by counting of nucleoli in five FVs by two observers; and (ii) an external validation cohort, i.e. TCGA cohort (n = 743), to assess the performance of the optimal scoring method.

Training set
The highest degree of interobserver agreement was observed for counting nucleoli in five FVs (ICC = 0.782), whereas the least concordance was seen with the modified Helpap method (Fleiss' kappa value = 0.417). The concordance rate of nucleolar score with the different methods between the four observers is summarised in Table S3.
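The ICC for continuous counts can be illustrated with a small Python function. The form of ICC used in the study is not stated, so the sketch below assumes a two-way random-effects, absolute-agreement, single-measure coefficient, often written ICC(2,1); other forms would give different values. The function name and input layout (one row per case, one column per observer) are illustrative.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measure.

    scores[i][j] is the count given to case i by observer j.
    """
    n = len(scores)          # number of cases
    k = len(scores[0])       # number of observers
    if n < 2 or k < 2:
        raise ValueError("need at least two cases and two observers")
    if any(len(row) != k for row in scores):
        raise ValueError("every case needs a count from every observer")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    case_means = [sum(row) / k for row in scores]
    observer_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares of the two-way layout without replication.
    ss_cases = k * sum((m - grand_mean) ** 2 for m in case_means)
    ss_observers = n * sum((m - grand_mean) ** 2 for m in observer_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_cases - ss_observers

    ms_cases = ss_cases / (n - 1)
    ms_observers = ss_observers / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_cases - ms_error) / (
        ms_cases + (k - 1) * ms_error + k * (ms_observers - ms_error) / n
    )
```

Because this form penalises systematic differences between observers as well as random disagreement, an observer who consistently counts more nucleoli than the others lowers the coefficient.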
The percentage of cases scored 3 with the modified Helpap method, and at 10 FVs, five FVs and one FV, were 17%, 16%, 17%, and 18%, respectively. Their associations with BCSS and DMFS are summarised in Table 1.
Higher nucleolar score showed a significant association with BCSS (P = 0.011, P = 0.013 and P = 0.024 for nucleolar score in 10 FVs, five FVs and one FV, respectively), whereas the modified Helpap    Figure S1).
The nucleoli were assessed in the validation set with the five FVs method, and the concordance rate between the two observers (ICC) was 0.981. Nucleolar score in the validation set showed that 534 cases (45%) had a score of 1, 40% of cases had a score of 2, and 15% of cases had a score of 3.

Association of nucleoli with other clinicopathological parameters
High nucleolar score was associated with parameters characteristic of aggressive tumour behaviour, including younger patient age, larger tumour size, higher tumour grade, stage 3, and poor NPI. Only 10% of cases with a nucleolar score of 1 were high-grade lesions, in comparison with 40% and 63% of those with nucleolar scores of 2 and 3, respectively (P < 0.0001). Also, a significant association was observed between nucleoli and the number of positive axillary lymph nodes, whereby 14% of cases with a nucleolar score of 3 showed more than three positive lymph nodes, as compared with 10% and 5% of cases with nucleolar scores of 2 and 1, respectively (P < 0.0001).
Nucleolar score 3 tumours were more associated with oestrogen receptor negativity, progesterone receptor negativity, and HER2 positivity (all P < 0.0001). Significant associations were found between nucleolar score and histological subtypes of BC, whereby 97% of nucleolar score 3 tumours were of no specific histological type (NST) (P < 0.0001). The associations between nucleolar scores and various clinicopathological parameters are summarised in Table 2.
To assess the value of using nucleolar score in cases associated with low-grade concordance (cases with borderline features for grade according to the Nottingham grading system), nucleolar score was applied to the subgroups of BC based on grade concordance. 32 Significant associations were observed between nucleolar score and grade-concordant and grade-discordant cases (P < 0.0001). Thirteen per cent of grade-concordant cases (G2/2) had a nucleolar score of 3, as compared with 29% of grade-discordant cases (G2/3). On the other hand, when we compared G2/2 and G2/3 cases with a nucleolar score of 2, we found that this percentage decreased from 50% to 44%. Also, as shown in the bar chart in Figure S2, G2/3 cases had higher nucleolar scores than G2/2 cases (Table 3).
Associations between nucleolar count in five FVs and concordant and discordant cases are shown in Figure S2.
In the whole cohort, significant associations were observed between high nucleolar score and shorter BCSS and DMFS (respectively: log-rank = 33.32, P < 0.0001; and log-rank = 33.72, P < 0.0001), as shown in Figure 3.
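The log-rank statistic quoted here compares observed and expected numbers of events across groups at each event time. As a simplified illustration, the Python sketch below implements the two-group version of the test; the study compared three nucleolar score groups with SPSS, so this is a reduced example of the technique rather than the analysis that was run, and the function name and argument layout are assumptions.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_* are follow-up times; events_* are 1 (event) or 0 (censored).
    Returns the chi-square statistic (1 degree of freedom) and its P-value.
    """
    subjects = [(t, bool(e), 0) for t, e in zip(times_a, events_a)]
    subjects += [(t, bool(e), 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in subjects if e})

    observed_minus_expected = 0.0
    variance = 0.0
    for t in event_times:
        at_risk_a = sum(1 for u, _, g in subjects if u >= t and g == 0)
        at_risk_b = sum(1 for u, _, g in subjects if u >= t and g == 1)
        deaths_a = sum(1 for u, e, g in subjects if u == t and e and g == 0)
        deaths = sum(1 for u, e, _ in subjects if u == t and e)
        n = at_risk_a + at_risk_b
        # Events in group A minus the number expected if the groups did not differ.
        observed_minus_expected += deaths_a - deaths * at_risk_a / n
        if n > 1:
            # Hypergeometric variance of the number of events in group A.
            variance += (deaths * (at_risk_a / n) * (at_risk_b / n)
                         * (n - deaths) / (n - 1))

    if variance == 0:
        raise ValueError("log-rank statistic is undefined for these data")
    chi2 = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

With three groups, as in the present analysis, the same observed-minus-expected quantities are collected for two of the groups and combined through their covariance matrix, and the statistic is referred to a chi-square distribution with 2 degrees of freedom.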
A multivariate Cox regression model adjusted for the standard prognostic clinicopathological parameters, including patient age, tumour size, and nodal stage, showed nucleolar score to be an independent predictor of survival (Table 4).
When the analysis was restricted to the most common tumour type, ductal NST, inclusion of nucleolar score with the other grade components in a Cox regression model showed that nucleolar score was an independent prognostic factor in survival prediction [hazard ratio (HR) 1.259, 95% confidence interval (CI) 1.03-1.53, P = 0.022], whereas pleomorphism was not significantly associated with outcome. Moreover, when pleomorphism was replaced with nucleolar score, the latter was an independent predictor and showed a more significant association than nuclear pleomorphism (HR 0.004, 95% CI 1.09-1.58, P = 0.004). Table 5 summarises the multivariate analysis results for the various models used.
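In a Cox model, the HR, its 95% CI and the P-value are linked through the log hazard ratio β and its standard error (SE): the interval is exp(β ± 1.96 SE), and the Wald statistic is z = β/SE. The Python sketch below assumes a Wald-type interval and a normal approximation (neither is stated in the text) and recovers an approximate P-value from a reported HR and CI. It also returns the geometric midpoint of a CI, which is where the point estimate of a Wald-type interval lies. The function names are illustrative.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_p_from_hr(hr: float, ci_low: float, ci_high: float) -> float:
    """Approximate two-sided P-value from a hazard ratio and its 95% CI."""
    beta = math.log(hr)
    # The CI spans 2 * 1.96 standard errors on the log scale.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    z = beta / se
    return math.erfc(abs(z) / math.sqrt(2))


def ci_geometric_midpoint(ci_low: float, ci_high: float) -> float:
    """Point estimate implied by a Wald-type CI (midpoint on the log scale)."""
    return math.sqrt(ci_low * ci_high)
```

Applied to the first model above (HR 1.259, 95% CI 1.03-1.53), this gives P ≈ 0.02, in line with the reported P = 0.022. For the interval 1.09-1.58, the geometric midpoint is approximately 1.31.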

Nucleoli and the Nottingham grading system
The incorporation of nucleolar score, as a replacement for nuclear pleomorphism, in the Nottingham grading system, especially in NST cases (Table 6), showed that nucleolar score had higher significant association with BCSS (log-rank = 102, P = 8.3 × 10⁻²³) than the normal grade (log-rank = 96, P = 1.6 × 10⁻²¹). In addition, when nucleolar score was added to the existing grade components, it still showed a higher significant association with survival than normal grade (log-rank = 104, P = 2.8 × 10⁻²³) (Figure 4). Groups with new grade scores are shown in Table S4.
A multivariate Cox regression model including tumour size, nodal stage and tumour grade after replacement of nuclear pleomorphism with nucleolar score, or addition of nucleolar score as an additional feature, showed that those grades were independently more predictive of survival (respectively: P = 8.2 × 10⁻¹³; and P = 6.1 × 10⁻¹³) than the normal grade (P = 1.2 × 10⁻¹³) (Table S5).

Validation of nucleolar assessment in an external (TCGA) cohort
Nucleolar score determined with the optimal method in the dataset of TCGA showed that 354 cases (48%) had score 1, whereas those with scores of 2 and 3 comprised 41% and 11% of cases, respectively. High nucleolar score was significantly associated with poor 10-year overall survival (OS) (log-rank = 12.81, P = 0.002).
Table 2. Association between nucleolar scores in five field views (0.5 mm²) and clinicopathological parameters in the Nottingham cohort (N = 1600); the nucleolar score cut-off was determined with X-TILE bioinformatics software. Columns: Parameter; nucleolar score 1, n (%); nucleolar score 2, n (%); nucleolar score 3, n (%); χ² (P-value).
Moreover, incorporation of nucleolar score in tumour grade determined with the Nottingham scoring method for TCGA cases, which were assessed by one observer (L.D.), showed that incorporation of nucleolar score as an additional component to the grade resulted in a stronger association with OS (log-rank = 13.5, P = 0.001) than conventional grade (log-rank = 11.9, P = 0.003), and an association was also observed after replacement of pleomorphism with nucleolar score in the grading system (log-rank = 10.9, P = 0.004) (Figure 5).
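When three groups are compared, the log-rank statistic is referred to a χ² distribution with 2 degrees of freedom, for which the upper-tail probability has the closed form exp(−x/2). Assuming that the log-rank statistics reported for the TCGA cohort were evaluated in this way (the degrees of freedom are not stated in the text), the P-values can be checked with a few lines of Python. The function name is illustrative.

```python
import math


def chi2_upper_tail_df2(statistic: float) -> float:
    """P(X > statistic) for a chi-square variable with 2 degrees of freedom."""
    if statistic < 0:
        raise ValueError("a chi-square statistic cannot be negative")
    return math.exp(-statistic / 2)


if __name__ == "__main__":
    # Log-rank statistics quoted for the TCGA cohort.
    for stat in (12.81, 13.5, 11.9, 10.9):
        print(stat, round(chi2_upper_tail_df2(stat), 3))
```

Under this assumption, the statistics 12.81, 13.5, 11.9 and 10.9 give P-values that round to 0.002, 0.001, 0.003 and 0.004, matching the values reported for the TCGA cohort.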

Discussion
The nucleolus is considered to be a mirror of cellular metabolic activity. Prominent nucleoli are associated with a high translational potential, and are considered to constitute an indicator of the cells' high demand for proteins (e.g. proto-oncogene proteins). Prominent nucleoli constitute an indication of cellular kinetics and cytobiochemical changes that occur in cancer cells. 31 In carcinoma, the association of nucleolar morphometric changes with poor patient prognosis is noteworthy, and there is increasing evidence from independent data that suggests an active role of the nucleolus in tumorigenesis. 15,34,35 However, there are currently no consensus histopathological guidelines with which to evaluate nucleoli in BC clinical practice. Previous studies assessing the prognostic significance of nucleoli in BC lacked objective criteria for scoring nucleoli, and did not identify optimal cut-off points for prognostic stratification. 31 In the current study, nucleoli were assessed with a variety of scoring methods in a large cohort of invasive BCs with long-term follow-up data, to determine the most reproducible assessment method while also providing prognostic value. The applicability of incorporating nucleolar score into the routine grading system was also assessed.
Assessment of nucleoli in the training set demonstrated that counting nucleoli in five FVs (0.5 mm²) on digitised whole slide images was the optimal method. This inference was based on the following. First, nucleolar score in five FVs had the highest concordance rate between the observers. Second, this method determines nucleoli within the whole slide without the requirement to count nucleoli in wider areas, which is time-consuming. Counting in 10 FVs requires >100 prominent nucleoli to stratify the tumour. It was also more objective than the modified Helpap method. 29 Finally, nucleolar score in five FVs showed a significant association with patient outcome.
In the five FVs method, we defined the nucleoli as ≥2.5 µm or easily seen at ×10 magnification to provide a distinction between what was observed at the level of research and what might be applied in clinical practice. We utilised an eyepiece reticle in which 2.5 µm was the breadth between two cross-hatches at ×40 magnification. It was apparent to the observers that 2.5 µm at ×40 magnification via the reticle correlated with what was easily seen at ×10 magnification, making the eyepiece reticle redundant. We think that, in daily clinical practice, the ×10 rule could be used, on the basis of our experimental observations at 2.5 µm.
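For computer-assisted measurement, the 2.5 µm threshold has to be expressed in pixels. The following is a minimal Python sketch, assuming that images are analysed at the native scanning resolution quoted in the methods (0.19 µm/pixel) without any downsampling; the function name is illustrative.

```python
SCAN_RESOLUTION_UM_PER_PIXEL = 0.19  # scanning resolution quoted in the methods


def microns_to_pixels(size_um: float,
                      um_per_pixel: float = SCAN_RESOLUTION_UM_PER_PIXEL) -> float:
    """Convert a physical length in micrometres to a length in pixels."""
    if um_per_pixel <= 0:
        raise ValueError("resolution must be positive")
    return size_um / um_per_pixel


if __name__ == "__main__":
    # Diameter, in pixels, of a nucleolus at the 2.5 µm prominence threshold.
    print(round(microns_to_pixels(2.5), 1))
```

At this resolution, a nucleolus of 2.5 µm spans roughly 13 pixels; images stored at a lower magnification level would need the resolution argument adjusted accordingly.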
In addition, we used the same method of counting mitotic figures in the routine practice of grading by choosing a field with a high enough degree of cellularity, and we relied on hotspots. We chose the field for counting after scanning the whole tumour at ×10 magnification to locate nucleolar hotspots. We used a fixed number of areas (hotspots) to count the prominent nucleoli in all cases. Nucleoli were given three scores according to cut-off points generated with X-TILE against BCSS.
The main disadvantage of the modified Helpap method is the lack of information on the number of FVs or the number of cells with prominent nucleoli to be scored, or even the size of nucleoli to be considered. Although we followed the method recommended by Donizy et al. 31 in melanoma, which used 20% of the tumour cells with prominent nucleoli, it was still more subjective than counting a defined number of FVs. This might also explain the discrepancies in high nucleolar scores between different methods, as the modified Helpap method required >20% of the tumour cells with prominent nucleoli (i.e. thousands of cells), whereas, in the other methods, fewer cells with prominent nucleoli were counted to give a high score.
A high nucleolar count showed significant associations with poor BCSS and shorter DMFS in both training and validation cohorts, and also OS in the external validation cohort. An association between prominent nucleoli and poor prognosis has also been reported in melanoma 31 and prostatic cancer. 23 On the basis of the association of nucleolar score with concordant and discordant cases, we can rely on nucleolar score if the tumour is of high or low grade. When we compared concordant cases with discordant cases, we found that the percentage of G2/3 cases with a nucleolar score of 3 was higher than that of G2/2 cases. The statistical significance here is related to these variations in percentages between groups, and not to the absolute percentage. We therefore concluded that nucleolar score could provide an additional feature with which to determine the grade of these borderline cases.
In our study, a high nucleolar score showed a higher association with NST than lobular and other special types, which include invasive cribriform, invasive mucinous and tubular carcinoma. NST is the most common type of BC, constituting 40-75% of all mammary invasive carcinomas, 36 and is associated with a poorer prognosis than other types. 37 We therefore incorporated nucleolar score in the tumour grade in NST cases only.
Despite the objective improvements that have been made to BC grading methods, any assessment of morphological characteristics inevitably retains a subjective element and is heavily dependent on the preanalytical parameters. Nuclear pleomorphism indicates the shape, chromatin distribution and size of the nucleolus. The lack of clear definitions of this grading criterion and the subjectivity of its assessment 5,38 resulting in poor reproducibility are strong motivations to replace it with other, more objective, components. Also, nuclear pleomorphism showed the least concordance among grade components between pathologists in several studies. 6,7,38,39 A multivariate Cox regression model incorporating nucleolar score in five FVs with the Nottingham grade components, including pleomorphism, 40 in NST cases showed that nucleolar score was an independent predictive factor for outcome. In addition, after removal of nuclear pleomorphism from the model's covariates, nucleolar score showed a more significant association than pleomorphism. Therefore, nucleolar scores could be more predictive of outcome than pleomorphism. These findings support the hypothesis that nucleoli could be incorporated into grade, following our approach of quantification and scoring, to provide a more objective measurement than that obtained with pleomorphism.
When we replaced nuclear pleomorphism with nucleolar score, we showed a higher significant association with patient outcome than original grade with the pleomorphism score. Also, when nucleolar score was added as an additional feature to the grade, it showed a highly significant association with BCSS. These findings support our hypothesis that nucleolar score could be a promising parameter to be assessed to enhance the grading system.
In conclusion, assessment of nucleoli in H&E-stained full-face BC sections in five FVs on whole slide images is a reproducible and practical method for predicting tumour behaviour and progression, and also provides more evidence for the reproducibility and reliability of the use of digitised images in clinical practice. The use of whole slide image technology not only opens up opportunities for computer-assisted determination of the nucleolar size (2.5 µm), but also improves standardisation and reproducibility of evaluation of other morphological features by allowing the integration of deep learning models. Application of this method in routine practice would aid in the risk stratification of BC to provide more individualised patient management.

Conflicts of interest
The authors declare that they have no conflicts of interest.

Author contributions
K. E. Elsharawy: scored all of the cases with different methods of assessment, and took the lead in writing the manuscript, data analysis, and interpretation. M. S. Toss: helped with double scoring, data interpretation, and reviewing the article. S. Raafat: helped with double scoring and reviewing the article. G. Ball and A. R. Green: contributed to data analysis and reviewing the article. M. A. Aleskandarany: contributed to study design and reviewing the article. L. W. Dalton: contributed to study design, double scoring, and reviewing the article. E. A. Rakha: conceived and planned the study, and contributed to data interpretation and reviewing the article.

Supporting Information
Additional Supporting Information may be found in the online version of this article:
Figure S1. Kaplan-Meier survival plot showing associations of the four different scoring methods for nucleoli with distant metastasis-free survival (DMFS) in the training set.
Figure S2. Bar charts showing associations of nucleoli counting in five field views (0.5 mm²) and (A) grade-concordant cases, (B) grade high-discordant cases and (C) grade low-discordant cases in the Nottingham cohort.
Table S1. Clinicopathological characteristics of the cases in both the training (n = 400) and validation (n = 1200) sets of the study cohort.
Table S2. Cut-off points of nucleoli count as generated by X-TILE bioinformatics software based on the association with breast cancer-specific survival (BCSS).
Table S3. Nucleolar scoring interobserver concordance results in the training set (n = 400).
Table S4. Incorporation of nucleolar scores into the Nottingham grading system. Cut-off points of grade scores were generated by X-TILE bioinformatics software based on the association with breast cancer-specific survival (BCSS) in no specific type (NST) cases.