The effect of stimulus duration on preferences for gain adjustments in speech

Objectives: In the personalisation of hearing aid fittings, gain is often clinically adjusted to patient preferences using live speech. When using brief sentences as stimuli, the minimum gain adjustments necessary to elicit preferences ('preference thresholds') were previously found to be much greater than typical adjustments in current practice. The current study examined the role of duration on preference thresholds.
Design: Participants heard 2, 4 and 6-s segments of a continuous monologue presented in pairs. Participants judged whether the second stimulus of each pair, with a ±0-12 dB gain adjustment in one of three frequency bands, was "better", "worse" or "no different" from the first at their individual real-ear or prescribed gain.
Study Sample: Twenty-nine adults, all with hearing-aid experience.
Results: The minimum gain adjustments to elicit "better" or "worse" judgments decreased with increasing duration for most adjustments. Inter-participant agreement and intra-participant reliability increased with increasing duration. The effect of duration, however, decreased with increasing duration, with no increase in agreement or reliability for 6-s vs. 4-s segments.
Conclusions: Providing longer stimuli improves the likelihood of patients providing reliable judgments of hearing-aid gain adjustments, but the effect is limited, and alternative fitting methods may be more viable for effective hearing-aid personalisation.


Introduction
In the treatment of hearing loss, clinicians fit hearing aids to reach a balance between audibility and comfort for each patient. The balancing act begins with prescribed gains across frequencies based on each patient's pure-tone thresholds. These prescribed gains, based on average data, are then personalised through adjustments made by the clinician using patient feedback (Anderson et al., 2018; Jenstad et al., 2003; Kuk, 1999; Thielemans et al., 2017). The patient's feedback is often based solely on the effect the adjustments have on the perception of the clinician's voice, the most readily available stimulus in any clinic.

We have previously shown what gain adjustments are discriminable for short sentences presented in quiet. Median just-noticeable differences (JNDs) for gain increments in broad low-, mid- and high-frequency bands were 4, 4 and 7 dB, respectively (Caswell-Midwinter and Whitmer, 2019). Using the same speech corpus, we have subsequently shown what gain adjustments are necessary to elicit preferences (Caswell-Midwinter and Whitmer, 2020). Median preference thresholds ranged from 4 to 12 dB for gain decrements and from 5 to 9 dB for increments in the same broad low-, mid- and high-frequency bands. In Caswell-Midwinter and Whitmer (2019), it was posited that the greater JNDs for speech in quiet relative to speech-shaped noise were due to the spectro-temporal sparsity of the speech. That is, for a given gain adjustment in any given band, the clean speech signal provided fewer glimpses of the adjustment than speech plus noise. In Caswell-Midwinter and Whitmer (2020), it was further hypothesised that the large preference thresholds were due in part to the short duration of the stimuli. Although patients typically make quick comparisons on adjustments in the clinic, audiologists may talk for longer, which might elicit more frequent and reliable preferences.
Previous psychophysical research has shown durational effects on level discriminability, albeit mostly limited to short pure-tone stimuli. Increasing the duration of a 0.5 or 8-kHz tone up to 2 s can improve level discrimination in normal-hearing listeners (Florentine, 1986), and improves discrimination in fixed and roving pedestal level conditions (Oxenham and Buus, 2000). For the discrimination of a tone's level within a complex (i.e., profile analysis), performance improves up to a duration of 100 ms (Green et al., 1984; Dai & Green, 1993). The ability to discriminate a gain adjustment in particular band(s) of speech bears partial resemblance to increment detection, the detection of a temporary increase or 'bump' in level in an ongoing sound. Valente et al. (2011) showed that increasing the duration of the standard tone decreased the threshold more than increasing the duration of the increment of a tone. In all past studies of level discrimination and increment detection with varying duration, though, performance improves with frequency (e.g., Moore et al., 1997), whereas the discriminability of gain adjustments decreases with the frequency band of the adjustment for speech (Caswell-Midwinter and Whitmer, 2019). There is some evidence of a duration effect with broadband stimuli: studying the detection of an 8-dB peak at 3.5 kHz in a broadband noise, Farrar et al. (1987) found that thresholds decreased as duration increased up to 300

CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review)

agreement or reliability in using descriptors (e.g., "tinny") to describe the effect of a gain adjustment in Caswell-Midwinter and Whitmer (2020), the current study only measured preferences.
Whitmer, 2020).

All participants had also performed visual letter and digit monitoring tasks during a previous study (at least 18 months before the current study) as an estimate of their cognitive abilities (specifically working memory; Gatehouse et al., 2006). The tasks involved identifying sequences at two different ISIs (1 and 2 s); a full description is in Caswell-Midwinter and Whitmer (2019b). The resulting d' measures were averaged across letter and digit tasks and ISIs to give a single cognitive score.
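As an illustration of how a single cognitive score of this kind could be formed, the sketch below computes a conventional d' from hit and false-alarm rates and averages it across the four task and ISI combinations. The function names, the data layout and the example rates are hypothetical; the scoring actually used is described in Caswell-Midwinter and Whitmer (2019b) and may differ in detail.

```python
from statistics import NormalDist, mean


def d_prime(hit_rate, false_alarm_rate):
    """Conventional d' = z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1. How extreme rates
    were handled in the original study is not stated in this text.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)


def cognitive_score(rates_by_condition):
    """Average d' over all (task, ISI) conditions (assumed layout)."""
    return mean(d_prime(hit, fa) for hit, fa in rates_by_condition.values())


# Hypothetical (hit rate, false-alarm rate) pairs, not data from the study.
example_rates = {
    ("letters", 1): (0.80, 0.10),
    ("letters", 2): (0.85, 0.10),
    ("digits", 1): (0.75, 0.15),
    ("digits", 2): (0.90, 0.05),
}
score = cognitive_score(example_rates)
```

Averaging the four d' values equally is an assumption made here for simplicity; it matches the description of a single averaged score but not necessarily the exact weighting used.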

Stimuli
The stimuli were consecutive segments of a Sherlock Holmes story read by a professional male actor with a Southern English accent ("The Naval Treaty"; Doyle, 2011). The original stimuli were collapsed from stereo to mono and resampled to 24 kHz from an original recording sample rate of 44.1 kHz. Any silent gaps greater than 250 ms were truncated to 250 ms. On each trial, two consecutive segments were presented to the participant's better ear, both with an equal duration of 2, 4 or 6 s. For each segment, 50-ms linear onset and offset gates were applied. To better mimic adjustments in the clinic, the standard stimulus was always the first stimulus in the pair, and there was no ISI beyond the offset and onset gating.

For the standard stimulus, real-ear or prescribed gain was applied across six frequency bands: a 0.25 kHz low-pass band, four octave bands centred at 0.5, 1, 2 and 4 kHz, and a 6 kHz high-pass band. For the target stimulus, additional gain (ΔGain) of -12, -6, 0, +6 or +12 dB was applied in one of three broad frequency bands: a low-frequency band combining the 0.25 kHz (low-pass) and 0.5 kHz (octave) bands (LF), a mid-frequency band combining the 1 and 2 kHz octave bands (MF), and a high-frequency band combining the 4 kHz (octave) and 6 kHz (high-pass) bands (HF). Stimuli were generated by convolving each segment with a 140-tap finite impulse response filter optimised for NAL-R equalisation at a 24-kHz sample rate by Kates and Arehart (2010). The overall long-term A-weighted presentation level was 60 dB SPL to approximate in-quiet conversation level (Olsen, 1998). Presentation level was verified with an artificial ear and sound level meter (Bruel & Kjaer 4152 and 2260), prior to any prescription or gain adjustment. Audibility of the segments was confirmed with each participant after the first trial.
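Two of the processing steps described above, the 50-ms linear gating and the conversion of a dB gain adjustment into an amplitude factor, can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the segment is a constant-amplitude placeholder, and the gain is applied to the whole signal for simplicity, whereas in the study ΔGain was applied within a single frequency band through the FIR filter.

```python
def db_to_linear(gain_db):
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def apply_linear_gates(samples, fs=24000, gate_ms=50.0):
    """Apply linear onset and offset ramps lasting gate_ms milliseconds."""
    n_gate = int(round(fs * gate_ms / 1000.0))
    if 2 * n_gate > len(samples):
        raise ValueError("segment is shorter than the two gates combined")
    gated = list(samples)
    for i in range(n_gate):
        ramp = i / n_gate        # rises linearly from 0 towards 1
        gated[i] *= ramp         # onset gate
        gated[-1 - i] *= ramp    # offset gate (mirror image of the onset)
    return gated


fs = 24000                      # sample rate stated in the text
segment = [1.0] * (2 * fs)      # placeholder 2-s segment, not real speech
gated = apply_linear_gates(segment, fs)

# Simplification: +6 dB applied broadband rather than in one band.
target = [s * db_to_linear(6.0) for s in gated]
```

At 24 kHz a 50-ms gate spans 1200 samples, so a 2-s segment keeps 1.9 s at full level between the two ramps.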
We additionally analysed the effect of the natural variation in power across the consecutive segments of each trial (i.e., when ΔGain = 0). There were significant mean level differences between the two segments in any given trial as a

were asked on each trial to listen to each presentation and decide "How did the second sound compare to the first sound?" by selecting either the "better", "worse" or "no difference" button on the touch screen.

There were three segment durations (2, 4 and 6 s) and 13 gain adjustments (±6 and ±12 dB adjustments in the LF, MF and HF bands plus a no-adjustment

Mean preference ratings (rates of "better," "worse" and "no difference" judgments) were calculated for each participant for gain adjustments in each frequency band (see Figure 2). A repeated-measures analysis of variance was run on the entire dataset (5 gain adjustments × 3 frequency bands × 3 segment durations) using individual mean combined "better" and "worse" preference rates [P(B||W) = 1 - P(ND)] as the dependent variable (see Table 1).

The greatest rates of "better" and "worse" responses were for LF adjustments. Compared to preferences elicited for short sentences in Caswell-Midwinter and Whitmer (2020; grey triangles and dotted lines in Figure 2), the consecutive segments elicited more "better" and fewer "worse" ratings for +12-dB adjustments in

appear to be more "better" and fewer "worse" ratings in the LF band for +12 dB adjustments (comparing grey with coloured triangles in the left panel of Figure 2) in the current study compared to the previous, but these differences were not statistically significant [t(59) = 1.99 & -1.60; both p > 0.05].
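The dependent variable P(B||W) = 1 - P(ND) follows directly from the three response rates for a given participant and condition. The sketch below shows the arithmetic with a hypothetical function name and made-up responses; it is not the analysis code used in the study.

```python
from collections import Counter


def preference_rates(responses):
    """Return response rates and the combined rate P(B||W) = 1 - P(ND)."""
    counts = Counter(responses)
    n = len(responses)
    p_better = counts["better"] / n
    p_worse = counts["worse"] / n
    p_nd = counts["no difference"] / n
    return {
        "better": p_better,
        "worse": p_worse,
        "no difference": p_nd,
        "better or worse": 1.0 - p_nd,
    }


# Hypothetical responses from one participant for one condition.
responses = ["better"] * 3 + ["worse"] * 1 + ["no difference"] * 4
rates = preference_rates(responses)
```

Because every trial yields exactly one of the three responses, the combined rate equals the sum of the "better" and "worse" rates.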

Preference thresholds
The minimum gain adjustment required to elicit either a "better" or "worse"

Tukey boxplots (Tukey, 1977) in Figure 3 to show the range of preference thresholds for each condition. The Holm-Bonferroni method (Holm, 1979) was used to adjust the rejection probabilities for multiple comparisons where necessary.

band, direction (±) of gain adjustment and segment duration (see Table 2). Preference thresholds decreased with segment duration, increased with frequency band and were greater for decrements than increments. There was a significant frequency band × gain direction interaction; decrement thresholds increased more than increment thresholds with increasing band centre frequency. There was also a significant, albeit modest (η² = 0.11), interaction between gain direction and duration; preference thresholds generally decreased more for increments than decrements. There was additionally a significant but modest three-way interaction in the MF band: preference thresholds decreased with increasing segment duration more for decrements than increments.

The overall rate of change, derived from a linearisation of mean thresholds not including HF decrements, decreased in magnitude as a function of duration from -0.7 to -0.3 dB/s. That is, preference thresholds decreased more for duration changing from 2 to 4 s than from 4 to 6 s.

comparisons (all p > 0.05). HF increment preference thresholds were positively correlated with HF pure-tone thresholds (ρ = 0.44; p = 0.032), and negatively correlated with HF sensation level (ρ = -0.48; p = 0.019). Preference thresholds were not correlated with cognitive score, but the individual decrease in threshold with duration, characterised as the dB/s slope, was negatively correlated with cognitive score (r = -0.50; p = 0.0073). That is, duration had a greater effect on those with greater letter/digit-monitoring ability.

Preference agreement and reliability
Fleiss' κ (Fleiss, 1971) was used to measure inter-participant agreement, comparing participants' most frequent judgment of each adjustment condition. To simplify the analysis, judgments were collapsed across adjustments for each direction and frequency band; the ΔGain = 0 condition was not included in the analysis. Fleiss' κ was 0.39 [0.36-0.42 95% confidence intervals (CI)], 0.50 (0.47-0.53) and 0.50 (0.47-0.53) for segments of 2-s, 4-s and 6-s duration, respectively, representing fair (2 s) and moderate (4 & 6 s) agreement (ibid.). That is, agreement significantly increased from 2 to 4 s, but not from 4 to 6 s.
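For readers unfamiliar with the statistic, a minimal sketch of Fleiss' κ is given below. It assumes that each adjustment condition is treated as an item, each participant as a rater, and "better", "worse" and "no difference" as the categories; the function name and the toy counts are hypothetical, and confidence intervals are not computed.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of raters who assigned item i to
    category j. Every item must be rated by the same number of raters.
    The statistic is undefined (division by zero) when all ratings
    fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Proportion of all ratings falling in each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement among raters for each item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items        # mean observed agreement
    p_e = sum(p * p for p in p_j)     # agreement expected by chance
    return (p_bar - p_e) / (1.0 - p_e)


# Toy example: two conditions, three raters, columns are
# ("better", "worse", "no difference"). Not data from the study.
kappa_perfect = fleiss_kappa([[3, 0, 0], [0, 3, 0]])
kappa_mixed = fleiss_kappa([[2, 1, 0], [1, 2, 0]])
```

In the toy example, complete agreement within each condition gives κ = 1, while split judgments give a negative κ, indicating agreement below chance.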

The copyright holder for this preprint this version posted March 5, 2021. ; https://doi.org/10.1101/2021.02.26.21252511 doi: medRxiv preprint

For each participant, a given gain adjustment was considered reliable if it elicited seven or more "better," "worse" or "no difference" judgments, a reliability threshold based on binomial probability theory (Kuk and Lau, 1995

Figure 4 shows individual proportions of adjustments with reliable preferences. Reliability increased significantly from a median value of 67% for short sentences and 2-s segments to 75% for 4-s and 6-s segments [χ² = 11.10; p = 0.011]. There was no significant difference in reliability between sentences and 2-s segments (z = 0.65; p = 0.51) nor between 4-s and 6-s segments (z = 0.72; p = 0.47). The percentage of participants with ≥ 90% reliable preferences, however, did increase from 14% at 4 s to 28% at 6 s. Individual reliabilities for short sentences and 2-s stimuli were not correlated, but reliabilities for 4-s and 6-s stimuli were (r = 0.61; p = 0.0004).

Whitmer, 2020).

Despite differences in the method, the median preference thresholds in the current study for 2-s segments were similar to the thresholds for 1.6-s average duration sentences in our previous study (Caswell-Midwinter and Whitmer, 2020), and correlated with the previous thresholds. As with the previous study, the

preferences and reliability. There were preference differences between the two studies, with increases in "better" vs. "worse" judgments for MF and HF increments in the current study. The differences in the long-term spectra between the current monologue and previous sentences (0.9, 0.2 and -5.6 dB in the LF, MF and HF bands, respectively) can explain the increase in "better" preferences for the HF band, but not the MF band.

Participants were less likely to respond "no difference" in the current study where

It is not clear from the current results if talking even longer (i.e., for durations > 6 s) would provide even greater discriminability and more reliable preferences. While the thresholds across most conditions decreased significantly from 4 s to 6 s, the trend was asymptotic. The overall rate of change decreased from -0.8 dB/s at 4 s to -0.4 dB/s at 6 s, resembling the modest exponential decay of memory-performance models (e.g., Durlach and Braida, 1969). In line with memory-performance models, there was a negative correlation between participants' monitoring-task cognitive scores and the rate of decrease in their preference thresholds with increasing duration. That is, the better their cognitive scores, the stronger the effect of stimulus duration on preference thresholds. This suggests that the effect of duration in the judgment of gain adjustments is limited by each individual's cognitive abilities. The mean preferences were very similar for 4-s and 6-s stimuli (Figure 2), and there was no increase in inter-participant agreement or intra-participant reliability (Figure 4). It is therefore unlikely that thresholds would decrease, or reliability increase, much further beyond the results here for 6-s stimuli (cf. Sams et al., 1993).
The improvement in thresholds and reliability with stimulus duration is also small relative to the thresholds and reliabilities themselves. Talking or presenting stimuli for 6 s to a hearing-aid wearer in the clinic will help elicit preferences for adjustments, but those adjustments still need to be large: 3-6 dB for increments, 5-12 dB for decrements. These thresholds are still well above common troubleshooting adjustments, especially for adjustments in the higher frequencies. In the personalisation of hearing aids in the clinic, it is therefore important not only to say more than a few words (e.g., "how's that sound?") immediately following an adjustment, but also to ensure the adjustment is large enough to elicit reliable feedback. Given these constraints, alternative methods of fitting, such as self-adjustments (