Noise Invariant Frame Selection: A Simple Method to Address the Background Noise Problem for Text-independent Speaker Verification

The performance of speaker-related systems usually degrades heavily in practical applications, largely due to the presence of background noise. To improve the robustness of such systems in unknown noisy environments, this paper proposes a simple pre-processing method called Noise Invariant Frame Selection (NIFS). Based on several noise constraints, it selects noise invariant frames from utterances to represent speakers. Experiments conducted on the TIMIT database showed that NIFS can significantly improve the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector-based speaker verification systems in different unknown noisy environments with different SNRs, in comparison to their baselines. Moreover, the proposed NIFS-based speaker verification systems achieve similar performance when the constraints (hyper-parameters) or features are changed, which indicates that the method is robust and easy to reproduce. Since NIFS is designed as a general algorithm, it could be further applied to other similar tasks.


I. INTRODUCTION
Speaker verification is the prime example of a speaker-related task, that is, a task for which the data is heavily correlated with speakers. For speaker-related tasks, uncertainty in features can be represented by several speaker models, among which Vector Quantization (VQ), Gaussian mixture models (GMMs) [1] and i-vector [2] are the most successful examples proposed in the past decades. Recently, Deep Neural Networks (DNNs), especially Convolutional Neural Networks (CNNs), have also been widely and successfully applied to extract deep features to represent speakers [3], [4], [5].
However, in practical applications, speaker verification performance degrades heavily in noisy environments due to the acoustic mismatch between clean training conditions and noisy test conditions [6]. Many approaches have been proposed to address this issue: speech enhancement, feature enhancement, model adaptation, etc. Speech enhancement attempts to enhance the signal using the noise information obtained from speech or prior knowledge. To date, enhancement mainly focuses on filtering techniques such as Kalman filtering or spectral subtraction [7], compensation parallel model combination (PMC) [8], Jacobian environmental adaptation [9], and recently DNNs [10], [11], [12]. Model adaptation is another successful approach, which adjusts the parameters of speaker models using different training data while keeping the observation stable [13]. For example, the hybrid Neural Network/Hidden Markov Model [14], Linear Spline Interpolation (LSI) [15], and CNNs [16] have been adopted in recent years. Among these adaptation methods, the most successful one is the Universal Background Model [17], which is a universal GMM trained with large amounts of voice data from different speakers.
All of these approaches have achieved excellent performance in addressing noise problems under certain conditions, but none of them takes data quality into consideration. In real-world applications, speakers can be affected by a variety of biological, environmental, social, or cognitive factors (human factors) when they are talking, resulting in distortion of the feature vectors extracted from their utterances. Although background noise may remain stable over a whole utterance, it may affect different frames differently. For example, some frames may be affected less than others, and would thus be of relatively high quality. In this paper, we propose a pre-processing method called Noise Invariant Frame Selection (NIFS) for selecting a subset of data that is robust to various background noises. Speaker verification is used as the case study here to evaluate the usefulness of NIFS, but the method could be applied to other tasks that are affected by noise in the input signals, even if they are non-audio signals, e.g. object recognition [18] or image retrieval [19]. The performance of NIFS for speaker verification is evaluated on the TIMIT database [20].
The rest of this paper is structured as follows: Section II reviews related work on frame selection. In Section III, the proposed Noise Invariant Frame Selection algorithm is explained in detail. The experimental setup and results are presented in Section IV. The last section is devoted to the conclusion.

II. RELATED WORK
A number of previous works have investigated frame selection for acoustic-related tasks. To select task-specific frames, task-specific constraints or criteria are usually designed and employed.
Dutoit et al. [21] adopted the Viterbi algorithm to select a sequence of frames from the target database that minimizes the distance between the selected frames and the output sequence produced by their GMM-based mapping conversion function. Their results showed that the combination of mapping and frame selection generated the best results among the three experimental systems in their paper.
Ventura et al. [22] presented an audio parameterization method for acoustic recognition of bird species using an integrated frame selection method. More precisely, the proposed frame selection method applies morphological filtering to the spectrogram within the MFCC algorithm. This makes it possible to exclude from further processing certain audio events that could otherwise cause misclassification errors. The experimental results for identifying 40 bird species demonstrated its advantages in both accuracy and speed.
By normalizing the conventional minimum-redundancy maximum-relevancy (mRMR) criterion, Jung et al. [23] proposed the NmRMR criterion. They first extracted features from frames to train an initial feature model. Then, the feature frames used for training and testing were selected according to the NmRMR criterion. These selected frames are expected to have minimum redundancy among themselves and maximum relevancy to the speaker models. The experimental results verified that the selected frames can enhance the performance of a speaker verification system.
Meanwhile, other researchers have applied more than one constraint to select frames and then fused the selection results.
Bocklet et al. [24] proposed a framework that selects eight subsets of MFCC feature vectors from the original speech for speaker recognition based on eight different syllable constraints. Linear logistic regression is then used to combine the selection results at the score level. Experiments with a GMM-UBM system revealed that the proposed frame selection method improved the performance of the baseline system.
The hybrid framework proposed by Prasad et al. [25] has two branches. The first uses voice activity detection (VAD) to discard non-speech frames and conventional Fixed Frame Rate (FFR) analysis to select frames from the remaining active speech. The second branch selects frames according to changes in the temporal characteristics of speech based on Variable Frame Rate (VFR) analysis. Finally, the selected frames from both branches are concatenated for further processing.
Nematollahi et al. [26] extracted linear predictive coefficients (LPC), gain and LP residual from each frame and then proposed three different ways to weight them. The sum of the weighted scores is then used for frame selection. Since a higher weight indicates better speaker discrimination ability, frames with lower weights are discarded.

III. METHODOLOGY
Besides the model architecture, training data is also crucial for a model's generalization ability. A subset of training data may train a model that has a better performance than the original set. The key question is how to select the optimized subset. Motivated by this, we propose a NIFS framework that aims to select noise invariant frames from utterances.

A. Robust frame selection
As shown in Fig. 1, K kinds of additive noise are taken into account as the constraints. By adding each noise to the original utterance separately, K noisy utterances with the same number of frames are generated. It should be noted that the original utterance is not necessarily a clean utterance. Then, K + 1 sets of feature vectors are extracted from these noisy utterances and the original utterance. For each frame of the original utterance, the distances between its feature vector and the feature vectors extracted from the corresponding frames of the K noisy utterances are calculated separately, which can be denoted as
$$ d(f_n^{o}, f_n^{k}) = \lVert f_n^{o} - f_n^{k} \rVert_2, \quad k = 1, 2, \ldots, K, \tag{1} $$
where $d(f_n^{o}, f_n^{k})$ is the distance between the feature vector extracted from the $n$th frame of the original utterance and the feature vector extracted from the corresponding frame of the $k$th noisy utterance. Here, the Euclidean distance is employed as the measure of distortion, but other measures can be adopted for different tasks. Since more than one noise constraint is utilized and their impacts on the same frame may differ, weights and a bias are applied to represent this uncertainty. As a result, the score can be calculated from the distance between an original frame $f_n^{o}$ and a noisy frame $f_n^{k}$ by
$$ s_n^{k} = w_k \, d(f_n^{o}, f_n^{k}), \tag{2} $$
where $w_k$ denotes the weight of the $k$th noisy utterance. After the scores between the original frames and their corresponding noisy frames are calculated, the final score of $f_n$ can be obtained by fusing the scores of all constraints, which can be factorized as
$$ S_n = \sum_{k=1}^{K} w_k \, d(f_n^{o}, f_n^{k}) + w_0, \tag{3} $$
where $w_0$ denotes the bias.
Consequently, frames that are less distorted by noise will have lower scores. By ranking all original frames in ascending order based on their final scores (Equation (4)), a new set of frames that are less sensitive to noise can be generated by selecting the top-ranked frames.
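The weighted fusion and ranking described above can be sketched as follows. This is only an illustration under stated assumptions: the function names, the array layout and the example weights are hypothetical, and the optimization of the weights and bias is not shown.

```python
import numpy as np

def weighted_frame_scores(original, noisy, weights, bias=0.0):
    """Fuse per-constraint distances into one score per frame.

    original: (N, D) feature vectors of the original utterance.
    noisy:    (K, N, D) feature vectors of the K noisy copies (same framing).
    weights:  (K,) per-constraint weights w_k (hypothetical values).
    bias:     bias w_0 (hypothetical value).
    """
    original = np.asarray(original, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Euclidean distance between each original frame and its noisy counterpart: (K, N)
    dist = np.linalg.norm(noisy - original[None, :, :], axis=2)
    # Weighted sum over the K constraints plus the bias: (N,)
    return weights @ dist + bias

def select_top_ranked(scores, keep_fraction):
    """Return the (sorted) indices of the lowest-scoring fraction of frames."""
    n_keep = max(1, int(round(keep_fraction * len(scores))))
    order = np.argsort(scores, kind="stable")  # ascending: least distorted first
    return np.sort(order[:n_keep])
```

Lower scores correspond to frames that change less when noise is added, so selecting from the start of the ascending ranking keeps the frames assumed to be most robust.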
Unfortunately, optimizing the weights and bias of Equations (2), (3) and (4) would take quite a long time. Therefore, a simple function for fusing distances is proposed in Equation (5). It ranks frames in ascending order of distance for each constraint individually, and the final subset is obtained by taking the intersection of the resulting subsets:
$$ SF = \bigcap_{k=1}^{K} RS_k(w), \tag{5} $$
where $RS_k$ is the ranked subset obtained with the $k$th constraint, $w$ is the percentage of frames kept in each $RS_k$, and $SF$ is the final selected frame set. This formulation reduces the number of parameters from K + 1 ($w_1, w_2, \ldots, w_K$ and $w_0$) to only one ($w$) because all weights are set to the same value while the bias is set to zero. The reason for using $w$ rather than a fixed threshold is that it allows the NIFS framework to provide task-specific frame sets by controlling the degree of robustness and the number of selected frames.
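A minimal Python sketch of this simplified selection rule is given below. It is not the authors' MATLAB implementation: the function name `nifs_select`, the array layout, and the treatment of w as a fraction in (0, 1] are assumptions made for illustration.

```python
import numpy as np

def nifs_select(original, noisy, w):
    """Indices of frames that fall in the lowest-distance fraction w under every constraint.

    original: (N, D) feature vectors of the original utterance.
    noisy:    (K, N, D) feature vectors of the K noisy copies.
    w:        fraction of frames kept per constraint, assumed in (0, 1].
    """
    original = np.asarray(original, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    n_frames = original.shape[0]
    n_keep = max(1, int(round(w * n_frames)))
    # Per-constraint Euclidean distance of every frame: (K, N)
    dist = np.linalg.norm(noisy - original[None, :, :], axis=2)
    selected = None
    for d_k in dist:
        # Ranked subset for this constraint: the n_keep least distorted frames
        rs_k = set(np.argsort(d_k, kind="stable")[:n_keep].tolist())
        selected = rs_k if selected is None else selected & rs_k
    # Intersection across constraints, returned in temporal order
    return np.array(sorted(selected), dtype=int)
```

Because the result is an intersection, it can contain fewer than `n_keep` frames, and for small w it may even be empty; a caller would need to decide how to handle that case.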
Given that there are m utterances for training each speaker and K noise constraints are utilized (m > 1, K > 1), the time complexity of parameter optimization for both methods is analyzed as follows. For Equation (4), a linear search is applied to find the best threshold w, while Batch Gradient Descent is used to obtain the optimized weights and bias in Equation (2). Fortunately, for the simple version (Equation (5)), only a linear search is needed to find the best threshold w, for which the time complexity is O(n).

B. Speaker verification using robust frames
The NIFS algorithm can be easily applied to any single utterance to select noise invariant frames, and thus we use speaker verification to validate its effectiveness.
When NIFS is applied to speaker verification as a front-end, it can be adopted in both the training and test phases. When applied to the training phase only, it selects robust training frames, which are then used to train speaker models directly. When applied to both the training and test phases, it first selects robust training frames to train speaker models. Then, when a test utterance arrives, NIFS also selects robust frames from it and feeds these frames to the speaker models for testing.

IV. EXPERIMENTS
A. Experimental setup
In this paper, a 24-dimensional MFCC feature [27] consisting of 12 MFCCs and 12 ∆MFCCs is utilized. Each utterance is divided into frames using a 25 ms Hamming window with a 10 ms shift. The 0th cepstral coefficient is replaced with the log energy. All experiments were carried out on the core condition of the TIMIT database, which contains a total of 6300 voice samples from 630 speakers (438 males and 192 females, each contributing 10 utterances) from 8 major dialect regions of the United States. In our experiments, 80 speakers (57 male and 23 female) were randomly selected (the indices of these speakers and the MATLAB code of NIFS can be found at https://github.com/shuimove1234/Noise-Invariant-Frame-Selection).
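The framing step described above can be illustrated with the following sketch. It covers only the windowing, not the MFCC computation, and the function name and the 16 kHz sampling rate used in the usage note are assumptions for the example.

```python
import numpy as np

def frame_signal(signal, sample_rate, win_ms=25.0, hop_ms=10.0):
    """Split a 1-D signal into overlapping Hamming-windowed frames.

    Returns an array of shape (n_frames, win_samples). Trailing samples that
    do not fill a complete window are dropped (a choice made for this sketch).
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(sample_rate * win_ms / 1000.0))
    hop = int(round(sample_rate * hop_ms / 1000.0))
    if len(signal) < win:
        return np.empty((0, win))
    n_frames = 1 + (len(signal) - win) // hop
    # Row i holds the sample indices of frame i
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(win)
```

At an assumed sampling rate of 16 kHz, a 25 ms window spans 400 samples and a 10 ms shift spans 160 samples, so a one-second signal yields 98 complete frames under this convention.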
In the training phase, eight utterances were used to train the model for each speaker, while the remaining two utterances were used for testing (160 test utterances in total). To test the performance of NIFS in unknown noisy environments, the test utterances were corrupted by four types of additive noise (factory, leopard, machinegun and volvo noise) at SNRs of 15, 20 and 25 dB. A clean condition was also included, in which the original test utterances were used without any added noise. It should be noted that the additive noise types used in the test phase were different from the noise constraints used for training. All noise used in this paper is provided by the NOISEX-92 database [28].
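Corrupting an utterance with additive noise at a prescribed SNR can be sketched as below. The paper does not describe how the noise was scaled or aligned, so the power-based scaling over the whole utterance and the tiling of short noise recordings are assumptions of this example, as is the function name.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Return speech + scaled noise so that the mixture has the requested SNR (dB)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    # Repeat or trim the noise so that it covers the whole utterance
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    if p_noise == 0:
        raise ValueError("noise segment has zero power")
    # Choose the gain so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

The same routine could serve both to build the noisy copies used as constraints and to create noisy test conditions, provided different noise recordings are passed in for each purpose.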
In terms of speaker models, the codebooks of the VQ systems were constructed with 128 clusters, while the GMM models had 128 Gaussian mixture components. The universal background model was trained on speech from 70 males and 30 females who were randomly selected from the remaining TIMIT speakers (1000 speech recordings in total). Because the purpose of the experiments is to determine whether the selected frames can better represent the speaker than the original frames, no other pre-processing or post-processing method was applied.

B. Speaker verification results
In this section, three types of noise (babble, white and pink noise) from the NOISEX-92 database were introduced as the constraints in the training phase of NIFS, with all SNRs set to 20 dB.
The performance of NIFS was evaluated using the equal error rate (EER) in three sub-experiments: 1. NIFS and VQ (NIFS-VQ)-based speaker verification; 2. NIFS and GMM-UBM (NIFS-GMM-UBM)-based speaker verification; 3. NIFS and i-vector (NIFS-i-vector)-based speaker verification (see Fig. 2). More specifically, the original utterances were processed by NIFS to yield the corresponding selected frame sets, which were then fed to the speaker models for training and testing.
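For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below estimates it from two lists of trial scores; the threshold sweep over observed scores and the averaging at the closest point are choices made for this illustration, not a description of the evaluation code used in the paper.

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Estimate the EER for scores where higher means 'more likely the claimed speaker'."""
    target = np.asarray(target_scores, dtype=float)
    impostor = np.asarray(impostor_scores, dtype=float)
    # Candidate thresholds: every observed score, plus +inf (reject everything)
    thresholds = np.append(np.unique(np.concatenate([target, impostor])), np.inf)
    # A trial is accepted when its score is >= the threshold
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptance rate
    frr = np.array([(target < t).mean() for t in thresholds])     # false rejection rate
    i = np.argmin(np.abs(far - frr))
    # With finitely many trials the two rates rarely coincide exactly; average them
    return (far[i] + frr[i]) / 2.0
```

With perfectly separated target and impostor scores this estimate is 0, and it grows as the two score distributions overlap.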
The speaker verification results of these three NIFS-based systems and their corresponding baselines (VQ, GMM-UBM, i-vector) under the clean condition and different noisy conditions are presented in Table I. It is clear that the performance of [...] Fig. 3 illustrates the effectiveness of the proposed NIFS.

C. Constraints analysis
The constraints used in NIFS have three main hyper-parameters: the types of noise, the number of noise types and the noise SNR. It is interesting to explore the impact of these hyper-parameters on speaker verification. Motivated by this, the following three experiments were conducted, where the i-vector was employed as the speaker model.

1) Noise types: The purpose of the first experiment is to discover the influence of the different types of noise constraints adopted in NIFS. Besides the combination of three noise types used in the aforementioned experiment (noise combination 1), we also introduced another two combinations of noise constraints. One is the combination of 'buccaneer1', 'f16' and 'm109' (noise combination 2) and the other is the combination of 'destroyerengine', 'destroyerops' and 'hfchannel' (noise combination 3). Then, the new training data selected by NIFS with noise combination 2 and noise combination 3 were used to train another two i-vector models, respectively. Finally, the trained models were tested in four different noisy environments with a noise SNR of 20 dB. According to Table II, although the frame sets selected with different noise combinations yield different results in the four noisy test environments, the average results are similar and all of them are better than the baseline.
2) Noise SNR: With the same training and testing setup, the second experiment applied noise combination 1 with three different SNRs (15 dB, 20 dB and 25 dB) as the constraints. The results displayed in Table II demonstrate that the SNR of the constraints has a certain impact on the quality of the selected frames. Nevertheless, all selected frame sets outperformed the original frame set.
3) Noise number: The objective of the third experiment is to explore the influence of the number of noise constraints used in NIFS. The first model made use of only one noise (white noise with an SNR of 20 dB) in the selection phase, while the second and third models used two noises (babble and white noise with an SNR of 20 dB) and three noises (babble, white and pink noise with an SNR of 20 dB), respectively. It is clear from the last part of Table II that using only one noise constraint may not select high-quality noise invariant frames. However, when more noise constraints were applied, the frames selected by NIFS further enhanced the performance of the i-vector system for speaker verification.

D. Feature analysis
Since NIFS is proposed as a general framework, it should be suitable for different features. Therefore, besides the 24-dimensional MFCCs, this experiment also employed another two features: 39-dimensional MFCCs (13 MFCC + 13 ∆MFCC + 13 ∆∆MFCC) and 60-dimensional MFCCs (20 MFCC + 20 ∆MFCC + 20 ∆∆MFCC). The hyper-parameters for this experiment were the same as in the experiment of Section IV-B. Consequently, another two i-vector models were trained on the 39-dimensional and 60-dimensional MFCCs. The results displayed in Table III show that NIFS improved the EER for all models in both the clean and noisy environments, which indicates that the proposed NIFS method is a useful front-end that can enhance speaker verification performance for different features.
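The 39- and 60-dimensional features above append first- and second-order time derivatives to the static coefficients. The paper does not state how these derivatives were computed; the sketch below assumes the commonly used regression formula over a window of ±2 frames with edge padding, and the function names are hypothetical.

```python
import numpy as np

def delta(features, width=2):
    """Regression-based time derivative of a (N, D) feature matrix.

    Assumed formula: d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), k = 1..width.
    """
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    # Repeat the first and last frames so that every frame has full context
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(features)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + n]
        behind = padded[width - k : width - k + n]
        out += k * (ahead - behind)
    return out / denom

def stack_with_deltas(static):
    """Concatenate static, delta and delta-delta coefficients along the feature axis."""
    d1 = delta(static)
    d2 = delta(d1)
    return np.hstack([static, d1, d2])
```

Under these assumptions, 13 static coefficients per frame become a 39-dimensional vector and 20 static coefficients become a 60-dimensional vector, matching the dimensionalities quoted above.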
The next experiment was conducted to evaluate the usefulness of NIFS in the training and test phases individually. According to Table IV, the systems that applied NIFS to either the test phase or the training phase outperformed the baseline, which did not use it. When NIFS was applied to both phases, the system yielded better results than when it was applied to only the training phase or only the test phase.

E. Parameter Analysis
There is clearly a trade-off between the quality of the frames and the number of selected frames. To study this trade-off, the relationship between the threshold w adopted in Equation (5) and the average Euclidean distance between the selected frames in the original utterances and their corresponding frames in the noisy utterances is displayed in Fig. 4. The average distance decreased almost linearly as the threshold was reduced, which indicates that NIFS is able to remove the frames that are easily distorted by noise. However, selecting a subset of highly robust frames from the original set leads to fewer frames being used in the training phase. This may negatively affect the performance of the models. Hence, we also studied the relationship between the EER and the threshold w adopted in Equation (5). According to Fig. 5, removing the frames that are most easily distorted can improve the EER. However, beyond a certain percentage (90% to 92%), the benefit of NIFS cannot compensate for the negative impact of the reduced amount of training data. Fortunately, since there is only one threshold to be determined for NIFS, it is easy to optimize.
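The quantity discussed above, the average distance of the selected frames as a function of the threshold, can be computed along the following lines. This is a sketch under assumptions: the function name is hypothetical, w is treated as a fraction in (0, 1], and the selection follows the intersection rule described for the simplified method.

```python
import numpy as np

def mean_distance_of_selection(original, noisy, w):
    """Average original-to-noisy Euclidean distance over the frames kept for a given w.

    original: (N, D) features of the original utterance.
    noisy:    (K, N, D) features of the K noisy copies.
    Returns NaN when the intersection of the ranked subsets is empty.
    """
    original = np.asarray(original, dtype=float)
    noisy = np.asarray(noisy, dtype=float)
    dist = np.linalg.norm(noisy - original[None, :, :], axis=2)  # (K, N)
    n_keep = max(1, int(round(w * dist.shape[1])))
    # Rank of each frame within each constraint (0 = least distorted)
    order = np.argsort(dist, axis=1, kind="stable")
    ranks = np.argsort(order, axis=1, kind="stable")
    # A frame is kept only if it is among the n_keep best for every constraint
    keep = np.all(ranks < n_keep, axis=0)
    if not keep.any():
        return float("nan")
    return float(dist[:, keep].mean())
```

Evaluating this function over a grid of w values for each utterance, and averaging the results, would give a curve of the kind the text describes; the number of frames kept at each w can be tracked alongside it to expose the trade-off.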

V. CONCLUSION
Background noise has unequal impacts on different frames of an utterance, so a subset of robust frames may train a better model for acoustic-related tasks. To test this hypothesis, we have proposed a simple Noise Invariant Frame Selection (NIFS) method of low computational complexity as a front-end for speaker verification. NIFS uses several noisy copies of the input utterances, each corrupted by a different additive noise, as constraints to select robust frames from utterances to represent speakers. The results show that speaker verification performance is improved under almost all conditions (clean and noisy) for each model by using the proposed NIFS as the front-end. Experiments demonstrated that although changing the hyper-parameters of the noise constraints of NIFS can affect the quality of the selected frames and result in different speaker verification results, performance still improves under almost all unknown noisy conditions. Experiments also showed that NIFS is useful for both training and testing and that it is suitable for different features.
In conclusion, there exist some frames that are relatively robust to different kinds of additive noise, and correctly selecting these frames is important for good performance. This paper showed that frames selected by the proposed NIFS can train better speaker models than the original training set. The method is also easy to reproduce. Although NIFS has only been applied to a speaker verification task in this paper, we believe that it can be easily extended as a pre-processing method to other pattern recognition tasks.