Building Better Nurse Scheduling Algorithms

The aim of this research is twofold: Firstly, to model and solve a complex nurse scheduling problem with an integer programming formulation and evolutionary algorithms. Secondly, to detail a novel statistical method of comparing and hence building better scheduling algorithms by identifying successful algorithm modifications. The comparison method captures the results of algorithms in a single figure that can then be compared using traditional statistical techniques. Thus, the proposed method of comparing algorithms is an objective procedure designed to assist in the process of improving an algorithm. This is achieved even when some results are non-numeric or missing due to infeasibility. The final algorithm outperforms all previous evolutionary algorithms, which relied on human expertise for modification.

comparison's sake. However, this raised the question of what level this should be at and whether one infeasible solution might be better than another.
Here we will present a method of statistically comparing algorithms. The implementation is such that any two versions of our algorithms can be compared and better components or parameter values are identified easily. We will use our nurse scheduling problem as a test bed to explain the workings and to show the effectiveness. Without any experience or relying on intuition, we are able to exceed all previous results by building the best heuristic algorithm to solve our problem to date. The proposed method of comparing algorithms is an objective procedure designed to assist in the process of improving an algorithm. This method has been used to build an improved nurse scheduling algorithm. Other innovations could be employed in an attempt to improve the current algorithm e.g. the column generation approach, and the effect of any additional innovation could be assessed using the method outlined. Thus, the proposed method of analysis is not a methodological search of all possible algorithms or improvements, but can assist in the intelligent improvement of parameter estimates and assess the effect of innovative alterations to an algorithm.
Moreover, one should note that the comparison methodology used is not restricted to the nurse scheduling problem discussed here or even to staff scheduling problems in general and can be used to compare any two algorithms solving the same problem. In particular, it was important to us to construct a comparison method that can cope with missing results due to the failure to reach a feasible solution.
The remainder of this paper is organised as follows. The next section introduces the nurse scheduling problem and states the equivalent integer programming formulation. Section 2 briefly explains Genetic Algorithms and show how they can be used to solve the nurse scheduling problem. The subsequent section then justifies the need and defines the statistical comparison algorithm. In Section 4, we use this comparison method to identify successful modifications to the original implementations to build the final algorithm. The paper concludes with a discussion and conclusions of our findings.

THE NURSE SCHEDULING PROBLEM
Creating good schedules for nurses in today's heavily constrained hospitals is no trivial task. A general overview of nurse scheduling algorithms can be found in Hung (1995), Sitompul and Randhawa (1990) and Bradley and Martin (1990). The rostering problem tackled in this paper can be described as follows. The task is to create weekly schedules for wards of up to 30 nurses by assigning one of a number of possible shift patterns to each nurse. These schedules have to satisfy working contracts and meet the demand for a given number of nurses of different grades on each shift, while being seen to be fair by the staff concerned. The latter objective is achieved by meeting as many of the nurses' requests as possible and considering historical information to ensure that unsatisfied requests and unpopular shifts are evenly distributed.
The problem is complicated by the fact that higher qualified nurses can substitute less qualified nurses but not vice versa. Thus scheduling the different grades independently is not possible. Furthermore, the problem has a special day-night structure as most of the nurses are contracted to work either days or nights in one week but not both. However due to working contracts, the number of days worked is not usually the same as the number of nights. Therefore, it becomes important to schedule the 'right' nurses onto days and nights respectively. These characteristics make this problem challenging for any local search algorithm as finding and maintaining feasible solutions is extremely difficult. Furthermore, due to this special structure previous nurse scheduling algorithms suggested in the literature cannot be used.
Electronic copy available at: https://ssrn.com/abstract=2824160 As described in detail in Dowsland and Thompson (2000) the problem can be decomposed into three independent stages. The first stage ensures that there are enough nurses to provide adequate cover. The second stage assigns the nurses to the correct number of day or night shifts. A final phase allocates those working on a particular day to the early or late shift on that day. Phases 1 and 3 can be solved using classical optimisation models. Thus, this paper deals with the highly constrained second step.
The numbers of days or nights to be worked by each nurse defines the set of feasible weekly work patterns for that nurse. These will be referred to as shift patterns or shift pattern vectors in the following. For example (1111100 0000000) would be a pattern where the nurse works the first 5 days and no nights. Then in phase 3, these days would be split into early and late shifts. Depending on their contacts, some nurses can work either days or nights or combinations of both. For each nurse, i, and each shift pattern, j, all the information concerning the desirability of that pattern for that nurse is captured in a single numeric preference cost p ij. This was done in close consultation with the hospital and is a weighted sum of the following factors: Basic shiftpattern cost, general day / night preferences, specific requests, continuity problems, rotating nights / weekends and other working history information. Further details can be found in Dowsland (1998).
The problem can be formulated as an integer linear program as follows.
Constraint set (I) ensures that every nurse works exactly one shift pattern from his/her feasible set, and constraint set (II) ensures that the demand for nurses is covered for every grade on every day and night. Note that the definition of q is is such that higher graded nurses can substitute those at lower grades if necessary. Typical problem dimensions are 30 nurses of three grades and 411 shift patterns. Thus, the IP formulation has about 12000 binary variables and 100 constraints. Although this is only a moderately sized problem, Fuller (1998) shows that some problem instances remain unsolved after more than 12 hours of computation time on a Pentium II PC using professional software.

GENETIC ALGORITHM IMPLEMENTATION
In recent years, Genetic Algorithms (GAs) have become increasingly popular for solving complex optimisation problems such as those found in the areas of scheduling or timetabling. GAs are generally attributed to Holland (1976) and his students in the 1970s, although evolutionary computation dates back further (refer to Fogel (1998) for an extensive review of early approaches). GAs are stochastic meta-heuristics that model some features of natural evolution Canonical GAs were not intended for function optimisation, as discussed by De Jong (1993).
However, slightly modified versions proved very successful. For an introduction to GAs for function optimisation, see Deb (1996). Many examples of successful implementations can be found in Bäck (1993), Chaiyaratana and Zalzala (1997) and others.
Electronic copy available at: https://ssrn.com/abstract=2824160 In a nutshell, GAs mimic the evolutionary process and use the idea of the survival of the fittest. Starting with a population of randomly created solutions, better ones are more likely to be chosen for recombination into new solutions, i.e. the fitter a solution, the more likely it is to pass on its information to future generations of solutions. In addition to recombining solutions, new solutions may be formed through mutating or randomly changing old solutions. Some of the best solutions of each generation are kept whilst the others are replaced by the newly formed solutions. The process is repeated until stopping criteria are met.
However, constrained optimisation with GAs remains difficult. The root of the problem is that simply following the building block hypothesis, i.e. combining good building blocks or partial solutions to form good full solutions, is no longer enough, as this does not check for constraint consistency. To solve this dilemma, many ideas have been proposed of which the major ones are penalty functions and repair. A good overview of these and most other techniques can be found in Michalewicz (1995).
Here we use two approaches: First, an encoding that follows directly from the Integer Programming formulation. Each individual represents a full one-week schedule, i.e. it is a string of n elements with n being the number of nurses. The ith element of the string is the index of the shift pattern worked by nurse i. For example, if we have 5 nurses, the string (1, 17, 56, 67, 3) represents the schedule in which nurse 1 works pattern 1, nurse 2 pattern 17, and so on. Full details of this algorithm can be found in Aickelin and Dowsland (2000).
The second approach presented here is the combination of an indirect GA with a separate heuristic decoder function. Here, the GA tries to find the best possible ordering of the nurses, which is then fed into the greedy decoder that assembles the actual solution. One way of looking at this decoder is as an extended fitness function calculation, i.e. the decoder determines the fitness of a solution after it has built a schedule from the permutation of nurses. For instance, a very simple (and perhaps not very good) decoder could take the nurses in the order decided by the GA and then assign each one her lowest cost shift pattern. Another simple decoder might take the nurses in order and assign each one to the currently most uncovered shift, taking into account nurses already scheduled. The actual decoder used considers a number of objectives, such as cost, nurses' preferences and cover, simultaneously and tries to find the best shift pattern for each nurse based on all objectives. The GA and the decoders used are fully described to solve a related multiple-choice problem in Aickelin and Dowsland (2002). The advantages of this approach are that all problem specific information is contained within the decoder, whilst the GA can be kept canonical. The only difference from a standard GA is the need for permutation-based crossover and mutation operators as explained for instance in Goldberg (1989).
For both algorithms, the fitness of completed solutions has to be calculated. Unfortunately, feasibility cannot be guaranteed for either method, as otherwise an unlimited supply of nurses would be necessary.
Therefore, we need a penalty function approach. Since the chosen encoding automatically satisfies constraint set (I) of the integer programming formulation, we can use the following formula, where w demand is the penalty weight, to calculate the fitness of solutions. Hence the penalty is proportional to the number of uncovered shifts and the fitness of a solution is calculated as follows.

THE NEED FOR A COMPARISON ALGORITHM
As outlined in the previous section, the GA might build infeasible solutions trying to solve the problem and no polynomial time repair algorithm for this class of problem is known. Even worse, because of this the only way to compare feasible and infeasible solutions within a GA is a penalty based method and hence some infeasible solution will be 'fitter' than feasible solutions and might dominate. Although some of this might be countered with high penalty weights, past work (Aickelin and Dowsland (2000)) has shown that if the weights are set too high often no good solutions are found at all. Furthermore, for this particular nurse scheduling problem it is known that for some data instances only very few feasible solutions exist in a large solution space. Thus, the GA will have to find the needle in the haystack. Consequently, there will be algorithm runs that terminate without ever encountering a feasible solution and hence their best solution will have a value of 'infeasible'.
Clearly, this is undesirable and better algorithms should be less and less likely to end up with an infeasible solution. However, the question remains how to build such well performing algorithms. The standard way of achieving this is to build a simple algorithm, assess how it performs, then add a new feature and compare the results. If the results are improved, the new feature is kept otherwise, it is discarded. This is then repeated for a number of features until the algorithm shows the desired performance. In a similar fashion, parameters can be tuned for better results.
Due to the reliability of heuristics on random numbers fair comparisons are only possible if the same problem instance is solved multiple times to avoid bias. We use 20 runs per problem instance. Thus, when deciding if an improvement is significant, it is traditionally the mean results before and after introducing the new feature that are compared. Alternatively, the best or worst cases or quartiles can be used as a basis for comparison. However, the question remains how to deal with algorithms that terminate with an infeasible solution as their best result. In these situations, the mean of the results can no longer be calculated, but one might argue that this could be substituted by the median. However, as the following examples show, this might not prove to be sufficient.
For instance, suppose we have two algorithms, say ALG1 and ALG2. Further suppose 10 trials of each algorithm were tried on the same problem instance. Let the values of the objective functions obtained for ALG1 be "infeasible" for the first six trials and be costs of 1, 2, 3 and 4 for the other four trials. Likewise, let the solutions obtained for ALG2 for be "infeasible" for the first 6 trials and let the cost for the next four trials be 1, 6, 7 and 8. Let us assume that as in the nurse scheduling problem lower costs are preferred. In this case, both algorithms have the same median solution ("infeasible") and both algorithms have the same cost for their best solution (cost of 1). What would be required is a comparison procedure that would globally rank ALG1 to be better than ALG2.
Although looking at quartiles might have worked for the simple example above, for a more realistic situation comparisons are even more difficult. Figure 1 plots the best solution found for three variants of our algorithms V2, V4 and V6 for the 52 weekly scheduling problems. Inspection of these best costs would tend to suggest that algorithm V6 is inferior to algorithms V2 and V4, and the differences between V4 and V6 are so small that the two algorithms may be viewed as being equally effective. To a certain extent the same conclusions Electronic copy available at: https://ssrn.com/abstract=2824160 would be drawn if attention was focussed on the median solution obtained (see Figure 2). However, if attention is restricted to the upper quartile or to the worst solution obtained then it is noticeable that V4 tends to be superior to V2 (see Figure 3 and Figure 4). So which one is best?
What is needed is a comparison procedure that is sensitive to changes in the location of a distribution but which also depends upon other distribution properties. Let us now construct such a method. Consider the general case where an algorithm i is to be compared with an algorithm j on a variant p of a problem. Suppose that K runs for algorithm i are to be performed with each run differing in initial starting conditions. Likewise, for the same variant of the problem suppose that L runs of algorithm j are to be performed. Let c i,k,p denote the cost obtained for run k of algorithm i on problem p. Likewise let c j,l,p denote the cost obtained for run l of algorithm j on problem p. If c i,k,p < c j,l,p then the solution for algorithm i on run k may be viewed as being better than the solution obtained by algorithm j on trial l.
Note that the "cost" associated with the solution need not be stated explicitly. For instance, suppose that a feasible solution on trial l for algorithm j was not found (c j,l,p = "infeasible") and for the same problem suppose a feasible solution on trial k for algorithm i was found. In many cases of this nature, we would want to interpret this example situation as c i,k,p < c j,l,p .
Suppose K trials of algorithm i and L trials of algorithm j on variant p of a problem are obtained with associated costs c i,k,p and c j,l,p (k = 1, … K; l = 1, … L). Consider Thus, D i,j,k,l,p indicates whether a particular instance of algorithm i is "better", "the same" or "worse" than algorithm j. An aid to assessing whether algorithm i is typically better than algorithm j on problem p is to consider the average value of D i,j,k,l,p over all possible pairwise comparisons i.e., Clearly, E i,,j,p = -E j,i,p and -1 ≤ E i,j,p ≤ +1 with the extreme values (±1) being obtained if one algorithm is "better" than the other in every observed instance. Values of E i,j,p equal to zero would occur when there is no difference between the algorithms or when the pairwise comparison frequency for "better" is equal to the pairwise comparison frequency for "worse".
In addition, if 0 < E h,i,p and 0 < E i,j,p then 0 < E h,j,p (i.e., if on problem p algorithm h is typically better than algorithm i and algorithm i is typically better than algorithm j then algorithm h is also "better than" to j).
Hence, 0 < E i,j,p permits a ranking for the algorithms h, i and j for problem p.
To illustrate suppose that a particular algorithm (ALG3) is used ten times to solve a particular problem.
If the hypothetical algorithms ALG3, ALG4 and ALG5 were applied to a variation of the problem then it is possible that a different rank ordering would be established. That is to say, the rank ordering could vary from one variation of a problem to another. In these instances, interest may focus on whether the rank orderings from problem to problem may be viewed as being "random" (meaning each permutation of rank orderings is equiprobable) or whether there is statistical evidence of some structure in the rank orderings over and above that anticipated by chance. In these cases, Friedman's test may be applied with the algorithms being viewed as levels of a single factor and the blocks being the variants of the problem under consideration. For an explanation of Friedman's test, see Lehmann (1975) or Conover (1980). If there is statistical evidence of some systematic arrangement of the relative rank ordering of the algorithms then scope exists to determine which algorithms differ. This can be achieved either by pairwise application of the Sign test based on the rank positions (i.e., the two-sample equivalent of Friedman's test) or by pairwise application of Wilcoxon's test using E i,j,p (p = 1, … P) as the difference measure between algorithm i and j on problem p. In applying Wilcoxon's test the statistical null hypothesis being tested would be for E i,j,p (p = 1, ... P) to be independent realisations from a symmetric distribution centred on zero. For an explanation of Wilcoxon's Signed Rank Sum Test see Lehmann (1975) or Conover (1980).

ANALYSIS AND DISCUSSION
For the problem under investigation eight algorithms were initially compared (say V1, V2, V3, V4, V5, V6, V7, V8). Brief descriptions of these eight algorithms are given in Table 1. These particular eight algorithms were chosen as they represented milestones in our original research (Aickelin and Dowsland (2000) and (2002)).
Whilst building those algorithms we constantly had to decide which features to keep or what parameter values to set. Typical questions were "Does adding a hillclimber improve results?", "Which algorithm is better, the direct or the indirect one?" or "What crossover should we use?". To answer these questions, we looked at the results obtained, but were faced with the problems outlined in the beginning of the previous section. Ultimately, we had to rely on our expertise to build the 'perfect' algorithm. After studying the problem for 3 years, the best we could build was V4.
In this section, we will show that with the new comparison measure we can improve upon our results because we no longer have to rely on human expertise to decide if a component or parameter setting is successful. First, we confirm that the comparison method works by ranking the milestone algorithms, for which we 'know' the rank positions. Then we will revisit some algorithm components and parameter settings where we had problems deciding the merit of one over another. During these tests we will show that the comparison measure is both sensitive enough to pick up improvements that we had missed previously, whilst being robust enough not to show a preference when no significant improvement had been made.
In order to have representative data, the weekly rostering requirements was obtained for a full calendar year, i.e. for 52 weeks. During all tests all 52 variants of the scheduling problem were considered (i.e., p = 1, … 52) with each variant being the rostering of nurses for each of the 52 weeks. Twenty trials for each algorithm on each variant p were performed. Application of equation 1 to the costs obtained from the trial solutions for each week permitted a rank ordering of the algorithms for each week, i.e., 52 rank orderings of the algorithms. An Electronic copy available at: https://ssrn.com/abstract=2824160 extract of the data for the first three weeks, the associated values for E Vi,Vj,1 and the associated rank orderings are given in Tables 2, 3 and 4 respectively. Statistically significant differences for the algorithms were found using Friedman's test (S = 290.54, df = 7, p-value < 0.001), where S is the value of Friedman's statistic on 7 degrees of freedom (df) and p-value denotes the p-value used for the test. The average rank position of each algorithm provides prima facie evidence of an overall ranking of the algorithms with V3 < V6 < V8 < V7 < V2 < V1 < V5 < V4 (i.e., algorithm V4 is tentatively regarded as the best, V5 the second best and so on through to V3 and V6 which are tentatively regarded as the worst). For illustration purposes Table 5 gives E V6,V3,1 (p = 1, …, 52) for algorithms V6 and V3 and Table 6 gives E V6,V8,1 (p = 1, …, 52).
To investigate whether there is statistical evidence of a difference between algorithm V6 and algorithm V3 one approach would be to use the Wilcoxon Signed Rank test to test whether the sample values E V6,V8,1 (p = 1, … 52) can be thought of as random sample from a symmetric population centred on zero. Another approach would be to test whether the signs of the sample values are considered to come from independent Bernoulli distributions with parameter π = 0.5 (i.e. a negative value just as likely as a positive value after ignoring any values equal to 0.). Application of the Wilcoxon test to the data in Table 5 confirms that there is no statistical evidence of a difference between algorithm V6 and algorithm V3 (T + = 760, T -= 566, Z = 0.909, n = 51, p-value > 0.35, two-sided test). Similarly, note for the sample data in Table 5, there are 30 positive instances of E V6,V3,p , 21 negative instances of E V6,V3,p and 1 instance equal to zero. Application of the Sign test to these frequencies fails to confirm a statistically significant difference between V6 and V3 (B = 30, n = 51, p-value > 0.25, twosided).
Application of the Wilcoxon test to the data in Table 6 shows that algorithm V8 is "better than" V6 (T + = 1267.5, T -= 58.5, Z = 5.67, n = 51, p < 0.001, two-sided test). Similarly for the sample data in Table 6, there are 45 instances of E V6,V8,p which are positive, 6 instances of E V6,V8,p are negative, and 1 instance equal to zero.
Application of the Sign test to these frequencies confirms a statistically significant difference between V8 and V6 (B = 45, n = 51, p-value < 0.001, two-sided).
Application of Wilcoxon's Rank Sum Test and/or the Sign test in the above manner leads to the conclusion that V4 is significantly better than V5. Repeated applications of these tests lead to the conclusion that V5 is significantly better than V1, V1 is significantly better than V2, V2 is significantly better than V7, V7 is significantly better than V8 and V8 is significantly better than V6. The data do not provide statistical evidence of a difference between V6 and V3.
The above conclusions are based on counting the number of instances one algorithm outperforms another. In the counting process, the magnitude of any difference is not explicitly accounted for. Indeed the magnitude of a difference may be difficult to quantify if a feasible solution is not obtained. For these reasons we may consider D i,j,k,l,p (α) defined by: The earlier conclusions for the ranking of the algorithms V1, V2, V3, V4, V5, V6, V7 and V8 where obtained for the special case of D i,j,k,l,p (α) = D i,j,k,l,p (1). In fact, repeat analyses using α = 0.5, 0.6, 0.7, 0.8 and 0.9 leads to the same broad conclusions as when α = 1.0.
Having shown that the comparison method works, let us now turn our attention to algorithm features that we had previously discarded as we could not perceive an improvement whilst using them. Hence, a second set of algorithms U1, U2, U3, U4, U5, U6, U7 and U8 are compared. A description of these algorithms is given in Table 1. Each of these algorithms was used 20 times on each of the 52 weekly scheduling problems and the above process was applied. In summary the poorest performing algorithms were U4 and U6 (no statistical evidence that they differ), followed by U3 which was significantly better than U4 and U6. Algorithms U2 and U8 were identical and significantly better than U3. There was no statistical evidence of a difference between U1 and U5 although both showed a statistically significant improvement over algorithms U2 and U8. Finally, algorithm U7 showed a statistically significant improvement over U1 and U5 and was considered the best from this set of algorithms.
Thus, we stand corrected from our previous result built upon our expertise. Figure 5 shows summary results for U7 and V4 (best indirect GA and our previous champion) and V7 (the best direct GA). The worst solution for each problem instance from U7 was compared with the worst solution found for each problem instance using U8 (which equals V4). In this comparison U7 obtained the better "worse" solution on 22 instances, U8 obtained the better "worse" solution on 14 instances and the same "worse" solution is obtained in 16 instances. If the median solutions are compared then U8 obtains the better median solution on 23 instances, U7 obtains the better median solution on 5 instances and the same median solution is obtained on 24 instances. If the best solutions are compared then U7 obtains the better solution on 9 instances, U8 obtains the better solution on 4 instances and the same best solution is obtained on 39 instances. It is worth pointing out in one instance a feasible solution was not obtained by U7 on one occasion whereas the best solution found by U8 was always feasible.
To confirm that the comparison method is not too sensitive we decided to use a third set of algorithms (W1, W2, W3, W4, W5, W6, W7 and W8) on the scheduling problem in exactly the same way as before. A description of these algorithms is given in Table 1. The algorithms were chosen such that we were reasonably convinced that none was much better than any of the others. Prima facie evidence might suggest that algorithm W4 is marginally better than the others (with W5, W8, W7 and W6 being identical to each other and marginally better than W1, W2 and W3). However, we would expect the comparison algorithm to be robust enough not to show any statistically significant difference. This was confirmed, as the observed differences by our method do not provide statistical evidence to claim a significant difference amongst the algorithms in this set.

CONCLUSIONS
The pairwise comparison procedure outlined provides a method of ranking algorithms on a problem instance that is applicable when some solutions are infeasible. In the presence of "infeasible" some sample statistics such as the mean or the standard deviation cannot be legitimately calculated unless it is justifiable to replace "infeasible" with a known finite cost. In summary, the comparison method condenses the results of algorithms to a single figure that can then be compared using traditional statistical techniques. This is achieved even when some results are "missing" due to infeasibility.
Focussing on the best solution alone is wrong, particularly in the algorithm building and refinement stage. Take  ---4.5 Table 1: Details of the GAs compared in the paper. The bold entries represent the best one of each set, note that for set W the differences were not statistically significant. The rank column gives the rank position of an algorithm relative to others in its set as determined by our comparison method. Full descriptions of these algorithms can be found in Aickelin and Dowsland (2000) and (2002). Briefly, (In)direct refers to the type of algorithm used as described in section 2 of the paper; Bound refers to how intelligently the solutions are built (i.e. not applicable to the direct version); Crossover gives the type of crossover used (automatic meaning the algorithm tries to decide itself), Elitism gives the percentage of the best solution carried over from one generation to the next; Auto-weights indicates that the algorithm tries to optimise some further parameters itself.