How many testers are needed to assure the usability of medical devices?

Before releasing a product, manufacturers have to follow a regulatory framework and meet standards, producing reliable evidence that the device presents low levels of risk in use. There is, though, a gap between manufacturers' need to conduct usability testing while managing their costs and the authorities' requirement for representative evaluation data. A key issue here is the number of users that should complete this evaluation to provide confidence in a product's safety. This paper reviews the US FDA's indication that a sample composed of 15 participants per major group (or a minimum of 25 users) should be enough to identify 90-97% of the usability problems, and argues that a more nuanced approach to determining sample size (which would also fit well with the FDA's own concerns) would be beneficial. The paper shows that there is no a priori cohort size that can guarantee a reliable assessment, a point stressed by the FDA in the appendices to its guidance, but that manufacturers can terminate the assessment when appropriate by using a specific approach, illustrated in this paper through a case study, called the 'Grounded Procedure'.

Medical device manufacturers face a number of challenges in taking new products to market. After demonstrating their product's clinical effectiveness, the priority is to demonstrate its safety. The importance of medical device safety is illustrated by the fact that in the USA, in 2006 alone, unsafe medical devices were responsible for 2712 deaths [1].
One aspect of device safety that has attracted a lot of attention in recent years is usability, since there is a direct relationship in many industries between the usability of a product and how safe it is [2][3][4]. Recent research has shown a similar link for medical devices [5] and so healthcare regulators are increasingly turning to usability testing [6][7][8][9]. Regulators in Europe and the USA now stipulate a formal approach to development known as human factors engineering (also known as usability engineering or user-centered design) [10], where usability is integrated into the entire development cycle rather than being assessed just prior to the release of the product.
International standards (notably IEC 62366 and HE75) are the cornerstone of the regulatory framework and are intended to help manufacturers design and evaluate safe devices and to furnish appropriate supporting evidence. However, despite the regulatory bodies and the international standards, many devices are recalled each year on safety grounds. ExpertRECALL [11] reports that in the USA, in the third quarter of 2012 alone, 407 medical devices were recalled (a 70% increase over the same period in 2011), which resulted in more than 26.5 million units being withdrawn from the market. In the UK, between 2006 and 2010, there was an increase of 1220% in the number of safety problems reported (for a review, see [12]). Further, according to information reported on the website of the Federal Institute for Drugs and Medical Devices of Germany [13], the number of recalls in Germany increased from 721 in 2010 to 1075 in 2013.
FIGURE 1 shows the total number of alerts and recalls in only four countries from 2008 to 2011: Canada, Japan, the UK and the USA [14]. Heneghan et al. [12] also note a trend of increasingly serious problems being identified by post-market surveillance authorities and that of the 146 companies that recalled a device in 2011, 86 of these (58.9%) had to recall more than one device. These data show that device safety is a significant problem and, in this paper, we explore the question of usability testing, within that context.
We are not aware of a full taxonomy around the usability and safety of healthcare technology, nor does this paper attempt a rigorous classification. We note, however, that there are classes of devices where safety is primarily a question of clinical performance and is addressed by a clinical trial. Some implantable technologies might fit in this category. On the other hand, there are devices with a proven function where the potential for use errors is the major safety issue, and where methods from human-computer interaction - the province of the present paper - are fully appropriate. We can also consider other types of devices - for instance, hand-held technologies to support better breathing - where the clinical function and usability may contribute in a more balanced way to the overall safety of the device.
We contend, therefore, that there are two types of study or trial that may contribute to evaluating the safety of a medical device. Those seeking to design a clinical trial have recourse to statistical methods that will inform the number of users recruited to the trial. The contribution of this paper lies in putting usability questions on a similar footing, thus enabling developers to justify an appropriate number of contributors to their usability studies. Ironically, the two classes of methods use overlapping nomenclatures: the term p (or the p-value) in this paper follows the naming convention of the Human-Computer Interaction (HCI) literature and represents the discovery likelihood. It should not be mistaken for the statistical probability or statistical power that might be connoted in the context of clinical trials.
Usability rules are process standards and are, by nature, less prescriptive than design or performance standards since they do not have associated objective end points. This makes it difficult to know exactly what testing and design changes they require. The point is particularly pertinent given that recent research has suggested that many medical device manufacturers do not have expertise or knowledge about usability and human factors [7], a problem that is a particular issue for smaller companies and those that are new to healthcare [8]. The design decisions that result from such testing require interpretation of the data, and so a lack of expertise or knowledge may prevent manufacturers from getting the most out of their usability studies [15].
The US FDA has developed guidance to assist manufacturers in interpreting the standards [16]. One issue covered in the guidance is that of the minimum sample sizes that might be applied to usability testing. This is an important topic, as manufacturers understandably want to limit the amount of costly and time-consuming testing while still ensuring that they conduct enough testing to address all of the safety risks associated with the usability of their device, and that they can demonstrate this safety to regulators. The balance of risk and cost is at the heart of this paper.
It is important to note here that safety is not the only reason to conduct rigorous usability testing of a medical device. Taking a user-centered approach to development will also increase the likelihood of a device being used regularly, correctly and with satisfaction. This should involve many different types of usability testing at different stages of the device cycle and with different fidelities of prototype, in different environments and with different types of user. The results of this testing will inform the decision on what should happen next and therefore the results of the testing should be in a form that allows manufacturers to make the best possible decision on the next step. This may be to make design changes, conduct more testing, include different users or move to the next stage of development. As well as stipulating this iterative process of design and evaluation, the medical device standards IEC 62366 [17] and ANSI/AAMI HE75 [10] also require a final validation evaluation as well as a formal post-market surveillance procedure to monitor the use of the device [18].
This paper describes how a new approach to sample size calculation for usability testing can be applied at different stages of development and how the results can be used to aid development decisions. In particular, the objective of this paper is to describe a new method - the Grounded Procedure (GP) (see [19,20]) - that details a systematic process of evaluation and data management. We argue that the GP provides a reliable way to use a set of estimation models commonly used in the HCI field. Moreover, the GP allows manufacturers to continuously monitor their usability evaluations and use the emerging findings to determine how many more subjects are likely to be needed to identify a specific percentage of problems, helping them to manage the cost/benefit risks associated with usability evaluation.

FIGURE 1. The total number of medical device alerts and recalls per year in Japan, the USA, the UK and Canada from 2008 to 2011 [9].
Usability testing of medical devices: estimating the sample size

The question of how many users to include in usability testing has been well-debated in HCI [21][22][23][24]. The first studies in this area mostly focused on determining the cost-benefit of web interface analysis by estimating the return on investment (ROI) to justify the cost of usability assessment [25]. In line with this aim, researchers in the 1990s proposed a specific rule of thumb arising from the results of Virzi [24,26] and Nielsen [27][28][29][30]. This rule, known as the five-user assumption, proposes a one-size-fits-all solution in which five users are considered enough for reliable usability testing. The five-user assumption has, however, been strongly criticized in the literature, notably because the (ROI-based) estimation model behind it was too optimistic [31][32][33][34][35][36][37][38]. In the ROI model, the p-value was estimated as the proportion of problems identified by each user against the total number identified by the cohort. As many researchers have noted, this model failed to consider the complexity of the interaction assessment. Today, a sample of five users in HCI tends to be seen as a good starting point for a usability assessment of interaction tools, such as a website, rather than a suitable final sample size, and at least three other well-tested models have been developed to overcome the optimistic results of the ROI model and to avoid the blind use of a predetermined sample size. The first is the Good-Turing (GT) model [23,39], modified in line with the study by Hertzum and Jacobsen [40]. The GT model formula is expressed as follows:

p_adj = 1/2 [(p_est - 1/n)(1 - 1/n)] + 1/2 [p_est / (1 + E(N1)/N)]   (1)

In Eq. (1), p_est is the initial estimate calculated from the raw data of the cohort, E(N1) is the number of usability problems discovered only once in the evaluation across all users, N is the total number of problems identified and n is the number of test participants.
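The calculation just described can be sketched in Python. The discovery matrix, function names and the specific combined formulation (Lewis's normalization plus Good-Turing discounting, one common reading of Eq. (1)) are our assumptions for illustration, not the authors' code:

```python
# Sketch (illustrative, not the authors' implementation): estimating the
# discovery likelihood p from a binary problem-discovery matrix, with the
# Good-Turing (GT) style deflation described in the text.

def roi_p(matrix):
    """Raw ROI-style p: mean share of the cohort's unique problems per user."""
    n_problems = len(matrix[0])
    return sum(sum(row) / n_problems for row in matrix) / len(matrix)

def gt_adjusted_p(matrix):
    """Deflate the raw estimate using the number of problems found only once
    (E(N1)) relative to all problems found (N), as in Eq. (1)."""
    n = len(matrix)                      # participants
    N = len(matrix[0])                   # unique problems found by the cohort
    counts = [sum(col) for col in zip(*matrix)]
    E_N1 = sum(1 for c in counts if c == 1)
    p_est = roi_p(matrix)
    # one common formulation: Lewis's normalization + GT discounting
    return 0.5 * ((p_est - 1.0 / n) * (1 - 1.0 / n)) + \
           0.5 * (p_est / (1 + E_N1 / N))

# Hypothetical 4-participant x 5-problem discovery matrix (1 = found)
m = [[1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0],
     [1, 1, 1, 0, 1],
     [0, 1, 1, 1, 0]]
print(round(roi_p(m), 3), round(gt_adjusted_p(m), 3))
```

Note how the adjustment pulls the raw estimate down: problems seen only once are taken as evidence that further, as-yet-unseen problems remain.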
Second, the Monte Carlo method is a statistical simulation technique that has been used to simulate the impact of the subjects taking part in the evaluation in different orders (for a review, see [41]). Lewis [21,42] applied this in conjunction with the GT model procedure and showed that it delivers a conservative and reliable value of p.
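A minimal Monte Carlo sketch of this idea follows: random participant orderings are drawn and the cumulative share of problems discovered after each participant is averaged across orderings. The data, iteration count and function name are hypothetical:

```python
# Sketch: Monte Carlo resampling over participant orderings. For each random
# order, record the cumulative share of problems discovered after each
# participant, then average the curves across orderings.
import random

def mean_discovery_curve(matrix, iterations=2000, seed=0):
    rng = random.Random(seed)
    n, N = len(matrix), len(matrix[0])
    totals = [0.0] * n
    for _ in range(iterations):
        order = rng.sample(range(n), n)          # one random participant order
        found = set()
        for k, i in enumerate(order):
            found.update(j for j in range(N) if matrix[i][j])
            totals[k] += len(found) / N
    return [t / iterations for t in totals]

m = [[1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0],
     [1, 1, 1, 0, 1],
     [0, 1, 1, 1, 0]]
curve = mean_discovery_curve(m)
print([round(c, 2) for c in curve])  # last value is always 1.0 by construction
```

Averaging over orders removes the accident of who happened to be tested first, which is why Lewis found it delivers a more conservative p.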
Third, the Bootstrap Discovery Behavior (BDB) model, proposed by Borsci et al. [31,32], is another re-sampling method that adopts a bootstrapping approach [43,44]. The BDB model is expressed as follows:

D(L) = a(1 - e^(-pL)) + qM_t   (2)

In Eq. (2), M_t represents the total number of problems in a product. The value a is the maximum limit value of problems collected by 5000 possible bootstrap samples. The value p represents the normalized mean of the number of problems found by each subsample. The q variable expresses the hypothetical condition L = 0 (an analysis without evaluators). In other words, since D does not vanish when L = 0, D(0) represents the number of evident problems that can be effortlessly detected by any subject, and q the possibility of detecting a certain number of problems that have already been identified (or are evident to identify) but were not addressed by the designer, as expressed in Eq. (3):

q = D(0)/M_t   (3)

The value q represents the properties of the interface from the evaluation perspective, with its extreme value being the 'zero condition', where no problems are found. The BDB model (as expressed in Eq. (2)) enlarges the perspective of analysis by adding two new parameters not considered in Eq. (1): all the possible discovery behaviors of participants (a) and a rule for selecting the representative data (q).
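A much-simplified bootstrap sketch in the spirit of the BDB model can illustrate the resampling step (the published model [31,32] is considerably more elaborate; the data, sample count and function name here are our assumptions):

```python
# Simplified, hypothetical bootstrap sketch: resample the cohort with
# replacement and collect (i) the normalized mean discovery proportion and
# (ii) the maximum number of unique problems seen across resamples.
import random

def bootstrap_p(matrix, samples=5000, seed=1):
    rng = random.Random(seed)
    n, N = len(matrix), len(matrix[0])
    p_values, problem_counts = [], []
    for _ in range(samples):
        rows = [matrix[rng.randrange(n)] for _ in range(n)]  # with replacement
        found = {j for row in rows for j in range(N) if row[j]}
        if not found:
            continue  # the 'zero condition': no problems found at all
        # mean share of this resample's unique problems found per participant
        p_values.append(sum(sum(r) for r in rows) / (n * len(found)))
        problem_counts.append(len(found))
    a = max(problem_counts)            # upper limit of problems over resamples
    p = sum(p_values) / len(p_values)  # normalized mean discovery likelihood
    return p, a

m = [[1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0],
     [1, 1, 1, 0, 1],
     [0, 1, 1, 1, 0]]
p_hat, a_hat = bootstrap_p(m)
print(round(p_hat, 2), a_hat)
```

Because the resampling simulates many alternative cohorts, the resulting p is less tied to the particular participants who happened to be recruited.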
All of these models aim to calculate a specific index called p (or the p-value), which represents the average percentage of errors discovered by a user, that is, the discovery likelihood.
The estimation of the final number of users for an evaluation sample can be calculated by inserting the p-value into the following well-known error distribution formula [23,24,26,30,32]:

D = 1 - (1 - p)^N   (4)

In Eq. (4), D is the proportion of the total set of usability problems that a sample of N users is expected to discover. At the beginning of the usability evaluation, neither p (the discovery likelihood) nor the total number of usability problems is known although, clearly, given one, the other can be estimated. This leaves those seeking to evaluate a product's usability with the problem of whether the users involved in the test (N) have identified a sufficient number of problems to ensure that a given threshold percentage, D_th, has been met. This threshold will vary according to the type of product: for many consumer products, where the risks of usability errors are low, thresholds of 80% are common and appropriate; however, for medical devices, where the risks of usability errors are much greater, an appropriate threshold is likely to be 97%, and for some safety-critical tasks it will be 100% [16].
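Given an estimate of p, the error distribution formula can be inverted to give the smallest N that reaches a threshold D_th. The p-values below are illustrative only (0.31 is a figure often quoted in the HCI literature, not a value from this case study):

```python
# Sketch: projecting the discovered proportion for N users, and inverting
# Eq. (4) to find the smallest N reaching a threshold D_th.
import math

def proportion_found(p, N):
    """Expected proportion of problems discovered by N users (Eq. (4))."""
    return 1 - (1 - p) ** N

def users_needed(p, D_th):
    """Smallest N with 1 - (1 - p)^N >= D_th."""
    return math.ceil(math.log(1 - D_th) / math.log(1 - p))

print(users_needed(0.31, 0.80))  # consumer-product threshold
print(users_needed(0.31, 0.97))  # medical-device threshold
```

With p = 0.31 the 80% threshold is reached by 5 users (the classic five-user figure), while the 97% threshold appropriate for medical devices needs twice as many, which is why the threshold choice matters so much.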
The only way to check whether the evaluation of a medical device has reached the desired threshold is by estimating the p-value of the total sample and then calculating how this will change when a new user is added to the sample. Every time a new user is added, the overall p-value of the cohort may increase or decrease, depending on the new user's performance in identifying problems.
By applying the estimated p-value to the Error Distribution Formula (4), it is possible to construct a curve of discoverability (FIGURE 2) by examining when the discovery threshold (D_th) is reached by the sample. This allows the estimation of the minimum number of participants that represents the ability of a larger population of final users (in an ideal situation, all the possible users of a device) to identify all of the interaction issues with the device under the same evaluation conditions (i.e., undertaking the same tasks with the same goals, performed in the same conditions).
Finally, the p-value is an index that signifies the representativeness of the sample compared with the entire population of users and, as stated above, it can only be determined by a dynamic process of collecting information about the sample's discovery likelihood and updating the discoverability calculations as users are added to the sample.
In response to demands from manufacturers for more clarity on questions related to the sample that should be used in usability testing, the FDA has included information on this topic in their recent guidance document. This contains useful advice on the critical importance of sampling strategies, stating that: 'The most important aspect of sampling may be the extent to which the test participants correspond to the actual end users of the device' [16]; and also deals with the question of sample sizes. The guidance acknowledges the limitations of all estimation models in terms of their reliance on assumptions of fixed and known probabilities of identifying device problems and points out that these assumptions do not reflect the real world.
Given these observations, the guidance does not promote the use of the estimation models for calculating the p-value of a sample of users. Instead, it suggests that validation testing should include 15 users from each major user group, basing this figure on empirical research by Faulkner [34], or that specific products/devices (such as infusion pumps) should be tested with a minimum sample of 25 users [45]. However, the guidance recognizes the limitations, or potential inappropriateness, of using such fixed sample sizes and goes on to state that: 'it may be advisable to test the maximum number of participants that budgets and schedules allow'.
This caution in relation to specifying general sample sizes is well-placed. As Borsci and colleagues [19] have suggested about the study by Faulkner, which has been used as support for such minimum sample sizes: 'It is difficult to determine how much weight should be given to the outcomes of this study since the primary data is not available for detailed analysis by other researchers. Further, the study did not make any connection between the average discovery likelihood and the likely percentage of discovered problems'.
It is important to stress that, despite the FDA guidance highlighting the limitations of the estimation models, it ultimately (though cautiously) proposes that practitioners adopt one of two starting points for validation testing (i.e., to test the device with at least 15 users from each major user group, or a minimum sample of 25 subjects). Though the sample sizes are higher, suggesting any kind of minimum (even with the reservations contained in the appendices to the FDA's guidance) essentially leaves practitioners with a variation of the established, and much criticized, five-user assumption. In fact, by proposing a minimum number of users as a starting point without discussing a set of indexes for checking the cohort's discovery likelihood, the FDA may inadvertently be reinforcing the same sort of sample size solution that has created misunderstanding in the HCI field.
Yet, it is clear from all current research on the use of estimation models in usability evaluation that there is no fixed sample size that can guarantee beforehand the reliability of the evaluation. In light of this, five, 15 or 50 participants may be far too few to identify all the problems with a device [22,36]. In fact, the number of users needed for a test strictly depends on the participants' performance in identifying problems, and a number of issues may affect this. For example, the variability of the users' answers and reactions during the interaction analysis is unpredictable, and the practitioner may receive different answers (i.e., problem elicitations) from different users in the same context and evaluation conditions; in light of this, the selection criteria for participants are a core issue when seeking to secure a reliable set of data. Moreover, devices differ and will have varying levels of complexity that may affect problem identification. Finally, the number and types of problems identified by participants may vary substantially. The p-value could help practitioners to analyze the participants' performance during the interaction analysis. In particular, the p-value of the cohort will represent the discovery likelihood of the participants, selected by specific criteria and involved in a test under specific conditions (tasks and scenarios).
Our key contention is that without controlling the p-value of the sample (i.e., the discovery likelihood), practitioners will report the number of problems identified by a sample but will be unable to discuss the effectiveness of the sample in discovering usability issues. Relying on indications about a minimum number of users for an assessment, without any tools to control the discovery likelihood of the sample, may therefore lead to unsuccessful pre-market submissions to the FDA or, more seriously, to products reaching the market that pose unacceptable risks. The converse risk is that manufacturers may waste valuable time and resources by including too many users who show low levels of performance in problem identification, ultimately risking the commercial success (and safety) of their device. In addition, there is the risk that fixating on a specific, predetermined sample size may lead manufacturers to wait until this number of users has completed the evaluation before analyzing the usability data, rather than viewing usability evaluation as a continuous process requiring ongoing attention and analysis. This may be particularly true for formative usability evaluations.
The estimation models introduced earlier in this paper have been extensively used in HCI studies as tools for checking sample behavior in discovering problems, thus reducing the risk of obtaining outcomes with a low level of reliability. We would suggest that these models are not the ultimate solution and that, in the future, more inclusive and well-balanced algorithms could be identified by researchers to estimate the p-value. Nevertheless, the current alternatives seem to comprise the following options:
• Practitioners can test a (minimum sized or larger) sample of well-selected users, analyze the findings and report the outcomes without having any information about the estimated percentage of problems identified by the users; or
• Practitioners can test an initial sample of well-selected users, analyze the estimated p-value to take informed decisions about how to proceed with the evaluation (e.g., increase the sample, change the selection criteria, etc.), analyze the findings as the evaluation proceeds and report the outcomes when a certain proportion of problems has been identified by the cohort.
By following the second option, practitioners could start with 15 (or 25) users, as the FDA has suggested, and, after the analysis of the p-value, could decide whether or not to add to the cohort.
To explain the value of the estimation models, we will briefly discuss two scenarios using the following example: a practitioner arranges a validation test with a sample of 50 participants and the cohort identifies 20 usability problems. It could be that, in a best-case scenario, the 20 issues represent a high percentage of the total usability problems (i.e., a high p-value, such as 0.40-0.50; see FIGURE 2) associated with the product. However, in a worst-case scenario, the 20 issues identified by the sample may represent only a small proportion of the discoverable problems (i.e., a low p-value). It is clear that if the practitioner does not check the p-value of the sample, s/he cannot discriminate between these scenarios. However, if the practitioner adopts the estimation models, s/he will be in a position to take different decisions on the basis of the p-value identified. For instance, in the best-case scenario, the practitioner could decide to stop the testing and report the list of problems, while in the worst-case scenario, the practitioner would have to add more users to the sample and revise the procedure and the selection criteria before restarting a new evaluation test.
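The gap between these two scenarios is easy to quantify with Eq. (4). The specific p-values below are hypothetical, chosen only to mark out the two extremes:

```python
# Illustrative check of the two scenarios: 50 participants reporting 20
# problems can hide very different discovery levels depending on p.
def proportion_found(p, N):
    """Expected proportion of problems discovered by N users (Eq. (4))."""
    return 1 - (1 - p) ** N

high_p, low_p = 0.40, 0.02  # hypothetical best-case / worst-case p-values
print(round(proportion_found(high_p, 50), 2))  # best case
print(round(proportion_found(low_p, 50), 2))   # worst case
```

With p = 0.40 the cohort has almost certainly seen everything discoverable, whereas with p = 0.02 roughly a third of the problems are still unfound even after 50 users; the raw count of 20 problems alone cannot distinguish the two.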
The preceding discussion suggests that there is a clear need for guidance and methods to assist manufacturers in, first, deciding on appropriate sample sizes for usability testing and, second, interpreting the results of this testing. Given the tension between the cost and need for an appropriate level of assessment to be built into the design process, manufacturers will have a series of issues in mind during the assessment, such as whether:
• major design problems can be fixed early and cheaply (through in-house or expert testing);
• a device is of sufficient quality to be tested with real users and/or in the context of use;
• they can be confident that a device is safe and that validation testing can be undertaken;
• a product is ready for release and all appropriate evidence exists to support this judgment;
• there is a need to include more users in the sample or whether the evaluation can be concluded.
In the following section, we will explain, by means of an example, the application of the GP, which proposes the use of multiple estimation models in a single process to help practitioners monitor the usability assessment and use the emerging findings to take informed decisions about the evaluation.
The GP's three steps

We propose that practitioners could start, in line with the FDA indication, with a sample of 15 users per major group, assume a specific standard range for the p-value (e.g., 0.40-0.50 if the aim is to reach 90-97% of the problems; see FIGURE 2) and use this value as a comparator against which the behavior of the real population of subjects can be assessed [19,20]. In light of this, practitioners, by estimating the p-value using the models, have to compare the p-value of their actual tested sample with the standard to make the following two main judgments, leading to the associated decisions and actions:
• If the sample fits the standard p-value: report the results to the client and determine whether the product should be redesigned or released.
• If the sample does not fit the standard p-value: add more users to the sample and re-test the p-value until the predetermined percentage of problems (D_th) is reached.
Manufacturers, by applying the GP, aim to obtain reliable evidence to decide whether to extend their evaluation by adding users, or whether they can stop the evaluation because they have sufficient information. To support this aim, the GP consists of three main steps [19]:
• Monitoring the interaction problems (step 1): a table of problems is constructed to analyze the number of discovered problems, the number of users that have identified each problem (i.e., the weight) and the average p-value of the sample;
• Refining the p-value (step 2): a range of models are applied and the number of users likely to be required is then reviewed in light of the emerging p-value;
• Taking a decision based on the sample behavior (step 3): the p-value is used to apply the Error Distribution Formula and a decision is taken on the basis of the available budget and evaluation aim.
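The loop across these steps can be sketched as follows. This is our minimal reading of the procedure, not the authors' implementation; the discovery matrix, the raw ROI-style p (a fuller application would also refine p with the GT and BDB models), and the min_users floor are all illustrative assumptions:

```python
# Sketch of the GP monitoring loop: after each new participant, re-estimate
# p and the projected discovery proportion (Eq. (4)), and stop once the
# threshold D_th is reached.
def users_to_threshold(rows, D_th=0.97, min_users=4):
    """rows: per-participant 0/1 discovery vectors, in test order.
    Returns (k, p, projected) for the first cohort size k >= min_users whose
    projected discovery proportion reaches D_th, or None if never reached."""
    for k in range(min_users, len(rows) + 1):
        cohort = rows[:k]
        N = len(cohort[0])
        found = [j for j in range(N) if any(r[j] for r in cohort)]
        if not found:
            continue
        # raw ROI-style p: mean share of the cohort's unique problems per user
        p = sum(sum(r[j] for j in found) for r in cohort) / (k * len(found))
        projected = 1 - (1 - p) ** k
        if projected >= D_th:
            return k, p, projected
    return None

# Hypothetical 4-participant x 5-problem discovery matrix (1 = found)
m = [[1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0],
     [1, 1, 1, 0, 1],
     [0, 1, 1, 1, 0]]
k, p, projected = users_to_threshold(m)
print(k, round(p, 2), round(projected, 3))
```

The min_users floor reflects the advice in the text to start from a fixed cohort (such as the FDA's 15 per group) and only then let the monitored p-value decide whether to continue.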
Each of these steps is now discussed using an exemplar evaluation case.

Description of the evaluation case
An evaluation of a new model of blood pressure monitor (BPM) was conducted from September to October 2011 by the team of the MATCH programme (funded by EPSRC grants EP/F063822/1 and EP/G012393/1) [20]. The team tested six male and six female subjects (mean age: 29.2 years), each of whom had more than 11 months of experience of using different kinds of BPM. A think-aloud protocol [46,47] was applied, in which each user was asked to verbalize the problems that they experienced during the use of the device. During the think-aloud sessions, which were recorded by a digital video camera, the participants completed three tasks: preparing the device for use; measuring blood pressure and recording the result; and switching off the BPM.
In this paper, we are not interested in describing the users' interaction with the device, but in discussing the value of the GP for assessing devices and using the results to make appropriate decisions. Since the MATCH team did not use the GP during the assessment, we will discuss the results in terms of the problems identified by the evaluation cohort (section 'The discovery behavior of the evaluation case's sample'), as well as the additional decisions that would have been possible by applying the GP (section 'Applying the GP to the case').

The discovery behavior of the evaluation case's sample
The participants identified a total of 12 unique problems across the three tasks. For each one of these problems, we coded the users' behavior as 0 when a user did not identify a specific problem and 1 when they did. TABLE 1 presents a summary of the results.
Manufacturers can use the weight of the problems as an indicator of the homogeneity or heterogeneity of the sample's behavior in discovering problems. This indicator reveals the extent to which participants agree that a problem is visible (i.e., evident) during the interaction, and it is calculated for each unique problem as the number of participants that identify that unique problem divided by the total number of participants. When undertaking interaction evaluation, a sample can usually be considered heterogeneous when more than 50% of the unique problems are identified by only one participant [19,20,48]. For instance, a sample of 10 users that identified a set of 10 unique problems would be considered heterogeneous when six or more of the 10 identified issues were identified only once during the evaluation. For medical devices, a more restrictive limit may be imposed in order to increase the safety of the device; for example, a sample might be considered heterogeneous when more than 50% of the unique problems are discovered by less than half of the participants. To continue with the previous example, this would mean six or more of the 10 problems being identified by fewer than five users in the sample. The sample of our evaluation case was homogeneous, as only 2 problems out of the 12 were identified by fewer than six users (TABLE 1).

TABLE 1. The specific problems identified by each participant (S1-S12) during the analysis of the three tasks (Tasks 1-3; table data not reproduced here). The individual p-value represents the number of problems discovered by each participant divided by the total number of unique problems discovered by the sample, while the weight of a problem represents the percentage of the sample that identified it. The sample's level of homogeneity/heterogeneity is calculated as the number of problems discovered by less than 50% of the sample divided by the total number of unique problems identified. In this case, only 2 problems out of the 12 were discovered by less than 50% of the sample.
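The weight and homogeneity calculations just described can be sketched directly from a discovery matrix (the matrix and function names below are hypothetical, for illustration only):

```python
# Sketch: problem weights and the homogeneity/heterogeneity indicator
# described above, computed from a 0/1 problem-discovery matrix.

def weights(matrix):
    """Per-problem weight: share of participants identifying each problem."""
    n = len(matrix)
    return [sum(col) / n for col in zip(*matrix)]

def heterogeneity_ratio(matrix):
    """Share of unique problems found by less than 50% of the sample."""
    w = weights(matrix)
    return sum(1 for x in w if x < 0.5) / len(w)

# Hypothetical 4-participant x 5-problem discovery matrix (1 = found)
m = [[1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0],
     [1, 1, 1, 0, 1],
     [0, 1, 1, 1, 0]]
print(weights(m), heterogeneity_ratio(m))
```

Here only one of the five problems falls below the 50% weight line (ratio 0.2), so this hypothetical sample, like the case-study cohort, would count as homogeneous.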
We estimated the p-value of the sample by applying three of the estimation models discussed earlier, as follows:
• Model 1: the sum of the individual p-values was used in the ROI model to estimate the raw p-value of the sample (i.e., the average of the individual p-values).
• Model 2: the weight of each problem was used in the GT algorithm to recalibrate the p-value, using the assumption that the more users identify the same unique problem, the more evident and potentially findable the problem will be in a real use context (for the complete GT model procedure, see [23]).
• Model 3: the BDB model used the datasheet of problems (TABLE 1) to run a 1000-iteration re-sampling using an algorithm that extracted different factors from the real data in order to refine the p-value estimation (for the complete BDB procedure, see [31,32]).
The results of the discovery likelihood for each model are shown in TABLE 2. Empirical models such as ROI and GT calculate the overall p-value by a probabilistic estimation of the problems that are missed and identified by the cohort. The order of participants in these models is considered an unmodifiable constraint of a usability test; that is, each participant has a specific order of testing and identifies a specific set of problems. Without any intent to estimate a generalizable p-value, evaluators can use the ROI and GT models to obtain information about the trend of participant behavior in discovering problems under the predefined evaluation conditions of a test (i.e., the tasks, scenarios and order of participants' assessment of the product). In contrast to the empirical models, BDB aims to identify an accurate and generalizable p-value irrespective of the actual number of problems identified by participants and their order in the test. This approach, in a similar vein to that reported by Faulkner, estimates the p-value through a random re-sampling simulation of the observed data. TABLE 3 shows the number of users needed to identify between 90 and 97% of the problems, estimated by 10,000 bootstrap re-sampling iterations.
Unlike approaches that have aimed to identify a general rule (that is, how many users are needed to identify a certain percentage of problems in a general usability test), the GP uses all of these estimation models as tools to obtain a range of values that can inform evaluators about participants' behavior in discovering usability problems during a specific test.
Therefore, the estimation models are intended here to be used only as a means of checking the behavior of a cohort and making informed decisions, such as whether there is a need to add more users, during a particular usability test.
As FIGURE 3 shows, by applying the estimation values reported in TABLE 2 to Eq. (4), we find that our cohort of 12 participants discovered between 99.91 and 99.99% of the problems of the device, with a homogeneous discovery behavior.
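Assuming Eq. (4) takes the standard cumulative-discovery form D = 1 − (1 − p)^n, this figure can be checked directly. The p-value endpoints used below (0.44 and 0.54) are illustrative values consistent with the range reported in the text, not the exact TABLE 2 entries.

```python
def discovery(p, n):
    """Cumulative discovery likelihood D = 1 - (1 - p)**n: the expected
    share of problems found by n users when each user finds any given
    problem with probability p."""
    return 1 - (1 - p) ** n

# Illustrative p-value range for the 12-user cohort
for p in (0.44, 0.54):
    print(f"p={p}: D={discovery(p, 12):.4%}")
```

With n = 12, even the lower end of the p-value range drives D above 99.9%, which is why adding further users yields almost no new discoveries.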
In this case, the analysis suggests that there is no need to add more users to the evaluation sample, and that to do so would largely represent a waste of resources; the probability of any subsequent user identifying new problems while completing the same three tasks is between 0.01 and 0.09%.

Applying the GP to the case

Following Nielsen and Landauer [30], we may assign an arbitrary cost of £100 to each unit of analysis (in this case, each user involved in the study) to explore the costs and savings associated with applying the GP. On this basis, discovering the 12 problems cost the manufacturer £1200 (£100 for each of the 12 users involved).
By using the average p-values provided by the three estimation models, we can estimate that the evaluators had identified 90% of the problems after the analysis of the first four users (i.e., D (pROI,pGT,pBDB) = 93.30%) and 97% after the first six (i.e., D (pROI,pGT,pBDB) = 98.27%). In light of this, if the GP had been applied during the assessment of this BPM, the manufacturer could have chosen to stop the assessment after six users, having obtained reliable results, and thereby saved 50% of the budget (£600).
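This stopping point can be recomputed from the same discovery formula. The mean individual p-value of roughly 0.49 and the £100 cost per user are taken from the text; the formula D = 1 − (1 − p)^n is assumed, as above.

```python
def users_needed(p, target):
    """Smallest n such that 1 - (1 - p)**n reaches the target proportion
    of problems discovered."""
    n = 1
    while 1 - (1 - p) ** n < target:
        n += 1
    return n

p_mean = 0.49         # mean individual p-value reported for the cohort
cost_per_user = 100   # arbitrary cost unit, after Nielsen & Landauer [30]

n90 = users_needed(p_mean, 0.90)       # users needed for 90% discovery
n97 = users_needed(p_mean, 0.97)       # users needed for 97% discovery
saving = (12 - n97) * cost_per_user    # budget saved vs. the full cohort
print(n90, n97, saving)
```

With p = 0.49 the formula reproduces the figures in the text: four users for the 90% threshold, six for 97%, and a £600 saving against the 12-user budget.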
To demonstrate this, we can simulate the application of the GP's three steps during the evaluation case, using a threshold percentage (D_th) of 97% of the total problems to identify the point at which the evaluation would be stopped, as follows:
• The manufacturer starts the assessment with a sample of five users, and compares the p-value of this initial sample to the standard for the aimed-for threshold (D ≥ 97% and p ≥ 0.5) to decide whether to stop the assessment or add new users to the sample.
• By looking at TABLE 4, the manufacturer observes that the first five users identified 11 problems with a p-value ranging from 0.43 to 0.60 (M: 0.49). This discovery likelihood is close to the standard, and, by applying the average p-value to the Error Distribution Formula (1), the manufacturer can estimate that this sample of five users identified an average of 96.64% of the problems, with an estimated range of D from 94.38 to 98.98% (TABLE 5). The homogeneity of the sample is marginal: 7 of the 11 problems identified at this point (63.63%) were discovered by more than 50% of the users, while the remaining four problems (36.37%) were discovered by less than half of the sample.
• Since the sample is only marginally homogeneous, the manufacturer should not decide to stop the assessment at this point, because the aimed-for percentage of 97% has not been reached and, moreover, the sample behavior presents a relatively high level of heterogeneity (i.e., 36.37%). In light of this, the evaluator may decide to add at least one additional user to the sample in an attempt to increase the reliability of the evaluation data. Of course, if the allocated budget for the assessment does not allow for the addition of more users, the evaluator could report that they have discovered a high percentage of problems (i.e., 96.64%), but that the relatively high level of heterogeneity suggests that increasing the budget in order to add more users may increase the reliability of the assessment.
In such a way, more informed decisions about the value of adding to the evaluation budget may be made.
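The stop-or-continue check described above can be sketched as a small routine. The discovery matrix passed in is a hypothetical 0/1 participant-by-problem table; the 97% D threshold follows the text, while the 0.75 homogeneity floor is an assumed cut-off, since the paper does not state an explicit one. Homogeneity is computed as the share of problems found by more than half the sample, as in the worked example.

```python
def gp_check(matrix, d_threshold=0.97, homogeneity_floor=0.75):
    """Return (p_mean, d_est, homogeneity, stop) from a
    participant-by-problem 0/1 discovery matrix.

    stop is True when both the estimated discovery percentage and the
    sample homogeneity clear their thresholds."""
    n_users = len(matrix)
    n_problems = len(matrix[0])
    # Mean individual p-value: average share of problems found per user
    p_mean = sum(sum(row) for row in matrix) / (n_users * n_problems)
    # Estimated discovery percentage, assuming D = 1 - (1 - p)**n
    d_est = 1 - (1 - p_mean) ** n_users
    # Homogeneity: fraction of problems found by more than half the users
    per_problem = [sum(row[j] for row in matrix) for j in range(n_problems)]
    homogeneity = sum(c > n_users / 2 for c in per_problem) / n_problems
    stop = d_est >= d_threshold and homogeneity >= homogeneity_floor
    return p_mean, d_est, homogeneity, stop

# Hypothetical 5-user x 4-problem matrix: most problems widely found
demo = [[1, 1, 1, 0],
        [1, 1, 0, 1],
        [1, 1, 1, 1],
        [1, 0, 1, 1],
        [1, 1, 1, 0]]
p, d, h, stop = gp_check(demo)
print(f"p={p:.2f}, D={d:.2%}, homogeneity={h:.0%}, stop={stop}")
```

A real application would re-run this check each time a user is added, exactly as the simulated cycles in the text do.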

Choosing to add another user (user number S6) would take the manufacturer through the GP cycle for a second time. This second cycle of GP analysis (i.e., re-running steps 1, 2 and 3 with this new user) shows an increase in the cohort p-value (0.47 < p < 0.58, M: 0.51); as TABLE 6 shows, a new usability problem was identified, and the sample became more homogeneous (at this point, only 16% of the problems had been discovered by less than 50% of the sample). On the basis of these data, the manufacturer had enough information to stop the assessment and report that the participants, as shown in FIGURE 4, had identified a total of 12 unique problems, representing 98.75% (98% < D < 99.48%) of the possible issues that could be identified by a larger sample of end users interacting with the product during the three evaluation tasks.

Table 4. The specific problems identified by each participant (S1-S5) during the analysis of the three tasks, with a sample of five users.

[Table 4 layout: columns for participants, problems per task (Tasks 1-3) and individual p-values; the cell values are not recoverable from this extraction.]

Table 5. The discovery likelihood of each of the participants (S1-S12) calculated using the three estimation models, and the resulting mean p-value.

In this case, both the overall p-value and the homogeneity of the sample were greatly increased when the sixth user (S6) was added to the cohort. However, adding a new user may sometimes decrease both the homogeneity and the p-value of the cohort. This can happen for different reasons, such as the selection of inappropriate users. In such cases, the manufacturer may have to reconsider the participant selection criteria. Changing the selection criteria at this point in the GP process, however, may bias the results, as the criteria should be consistent for the whole set of sampled users. An alternative, though more costly, approach would be to restart the procedure with new users under revised selection criteria. Such cases should be rare, given that attention will have been paid to specifying the selection criteria clearly and appropriately, and then using them to select users who are as representative as possible of the intended market users of the medical device.
This simulation shows how using the GP enables the analysis to be stopped when the optimal sample size has been reached (that is, when the desired D has been attained), preventing resources from being wasted. Of course, the results of our evaluation case are not generalizable to the assessment of any other BPM. The GP only indicates the reliability of the data gathered during a specific evaluation process, meaning that with other participants or under other evaluation conditions (such as other tasks or another model of BPM), the GP outcomes will vary. As a result, there is no single definitive number of users for reliably testing a certain kind of device, and manufacturers should therefore apply the GP for each evaluation, whether formative or summative.

Expert commentary & five-year view
The GP's value is that, for a specific evaluation setting (i.e., target product and chosen evaluation technique), it can help a manufacturer to decide how to proceed with the evaluation once the first five users have been studied (taking five users as a minimum meaningful sample size). The GP can be used in the evaluation of medical devices by offering a way to control evaluation costs while assuring the representativeness of the sample and the associated quality of the evaluation data. It is important to note here that the GP forces manufacturers to manage and organize the gathered data in a specific way, and that the procedure of behavior analysis may be seen as a restrictive organization of the data. We would suggest, though, that the GP should be used not as a meta-methodology but as a tool that supports manufacturers in complying with the relevant international standards.
By using the GP, manufacturers are driven to manage the data of different kinds of end user, and to report and demonstrate to the monitoring/regulatory institutions the representativeness and the reliability of their verification and validation tests. In light of this, we argue that the GP represents a pragmatic solution and a powerful tool: for controlling evaluation costs while respecting practices in line with the user-centered design approach promoted by the relevant standards, and for releasing medical devices on the basis of evaluation data gathered using techniques and methods that can offer end users greater confidence in a high level of safety in use. Through presentation of data related to the release of unsafe medical devices in recent years, discussion of commonly held beliefs of manufacturers, and analysis of the lack of appropriate appreciation of the relevant standards, we have argued that there is a gap between the needs of manufacturers to conduct sufficient testing while managing their costs, and the requirements of international authorities for reliable and representative evaluation data. On this basis, we have proposed a solution for bridging this gap, the GP, which is a specific procedure for the management of evaluation data that may encourage manufacturers toward a truly UCD approach.

Table 6. The specific problems identified by each participant (S1-S6) during the analysis of the three tasks, and the situation at the point when user number six (S6) is added to the sample. [Table layout: columns for participants, problems per task (Tasks 1-3) and individual p-values; the cell values are not recoverable from this extraction.]
The procedure allows manufacturers to analyze the reliability of the data from their usability tests, enabling them to estimate the sample size needed to identify a given proportion of interaction problems. This method provides a new perspective on the discovery likelihood of problems/issues with devices and on designing evaluation studies, and gives manufacturers the means to use the data from their evaluations to inform critical system/product decisions, providing decision support in relation to when to enlarge the sample, redesign the product or release it. It also allows the reliability of the evaluation to be calculated, which should help manufacturers to conduct efficient evaluation studies and control costs, and should also enable them to demonstrate objectively to regulators and purchasers the reliability of their evaluations.
The further development of the p-value estimation algorithms is a key factor in improving the reliability of usability data for medical devices. The estimation of the discovery likelihood of a cohort remains a keenly debated topic in technology evaluation, and, currently, only the GP proposes a synthesis of the most advanced and well-tested algorithms for the estimation of the p-value. Extensive use over the coming years of approaches based on p-value estimation, such as the GP, could help medical device manufacturers to reduce the uncertainty in design decision-making. Moreover, the spread of these approaches could give manufacturers access to a set of comparable data, both on the representativeness of the assessments carried out on different products and on the reliability of the different evaluation methods applied for testing a range of products. The comparability of the evaluation results and of the usability methods could create a way to define a set of standardized thresholds that a practitioner has to reach in order to establish a high degree of usability and safety for a medical device.

Financial & competing interests disclosure
The authors acknowledge support of this work through the MATCH programme (EPSRC Grants: EP/F063822/1 EP/G012393/1), although the views expressed are entirely their own. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.

Key issues
• The assessment of usability and its place in the management of product safety is an increasingly important aspect of medical device development.
• Medical device manufacturers, especially small companies, often do not have sufficient expertise or knowledge about usability and human factors, and it is hard for them to address suitably all of the steps required by the relevant standards.
• Despite a rigorous framework for designing and assessing a product, an increasing number of medical devices are recalled each year because of safety concerns.
• There is substantial evidence that testing a device with mandated sample sizes could lead manufacturers to evaluate a product without having real control over the reliability of the assessment.
• A new approach to usability data management applied in this paper, called the 'Grounded Procedure', drives manufacturers to estimate the sample size needed to identify a given proportion of interaction problems and to inform critical product decisions.
• Using the Grounded Procedure could enable manufacturers to increase the usability and the safety of their medical devices, and could help practitioners to check the representativeness of the evaluation cohorts, to analyze the significance of specific usability problems and to rethink the user selection criteria for validation testing.