Dataset Similarity to Assess Semisupervised Learning Under Distribution Mismatch Between the Labeled and Unlabeled Datasets

Semisupervised deep learning (SSDL) is a popular strategy for leveraging unlabeled data when labeled data is not readily available. In real-world scenarios, different unlabeled data sources are usually available, with varying degrees of distribution mismatch with respect to the labeled dataset. This raises the question of which unlabeled dataset to choose for good SSDL outcomes. Often, semantic heuristics are used to match unlabeled data with labeled data; a quantitative and systematic approach to this selection problem would be preferable. In this work, we first test the SSDL MixMatch algorithm under various distribution mismatch configurations to study the impact on SSDL accuracy. Then, we propose a quantitative unlabeled dataset selection heuristic based on dataset dissimilarity measures, designed to systematically assess how distribution mismatch between the labeled and unlabeled datasets affects MixMatch performance. We refer to the proposed measures as deep dataset dissimilarity measures (DeDiMs). They compare labeled and unlabeled datasets in the feature space of a generic Wide-ResNet, can be applied prior to learning, are quick to evaluate, and are model agnostic. The strong correlation in our tests between MixMatch accuracy and the proposed DeDiMs suggests that this approach is well suited for quantitatively ranking different unlabeled datasets prior to SSDL training.

Impact Statement: Semisupervised deep learning is a technique for training a deep learning model when few labeled observations are available, leveraging unlabeled datasets. Different unlabeled data sources may be available, introducing the possibility of distribution mismatches between the labeled and unlabeled datasets. In this work, we assess the impact of distribution mismatches on the outcomes of the semisupervised MixMatch algorithm. We propose a set of simple feature-space density dataset distances, referred to as deep dataset dissimilarity measures (DeDiMs). In our extensive test bed, the evaluated DeDiMs yield linear correlation coefficients of up to 96% with MixMatch accuracy.
Index Terms: Dataset similarity, deep learning, distribution mismatch, MixMatch, out-of-distribution data, semisupervised deep learning.

I. INTRODUCTION
Training an effective deep learning solution typically requires a considerable amount of labeled data. In specific areas, like medical imaging, high-quality labeled data can be expensive to obtain, leading to a paucity of labeled data [4], [12]. Several approaches have been developed to address this data constraint, including data augmentation, transfer learning, weakly supervised learning, and semisupervised learning, among others [34], [46]. Semisupervised learning targets problems where little labeled data is available or a range of labels is lacking; it leverages unlabeled data, which is often cheap to obtain [44]. Formally, in a semisupervised setting, both labeled and unlabeled datasets are used. Labeled observations $X_l = \{x_1, \ldots, x_{n_l}\}$ and their corresponding labels $Y_l = \{y_1, \ldots, y_{n_l}\}$ make up the labeled dataset $S_l$. The set of unlabeled observations $S_u$ is represented as $X_u = \{x_1, \ldots, x_{n_u}\}$; therefore, $S_u = X_u$. Semisupervised deep learning (SSDL) approaches can be grouped into pretraining-based [14], self-training or pseudo-labeling [15], and regularization-based methods. Regularization techniques include generative approaches, consistency loss terms, and graph-based regularization [12]. A detailed survey on semisupervised learning can be found in [44].
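As a minimal illustration of this setup (array shapes and values are hypothetical, not taken from our experiments), the labeled and unlabeled datasets can be represented as follows.

```python
import numpy as np

# A minimal sketch of the semisupervised data setup; shapes and values are
# hypothetical. S_l pairs inputs X_l with labels Y_l; S_u has inputs only.
rng = np.random.default_rng(0)
X_l = rng.normal(size=(100, 32, 32, 3))   # n_l = 100 labeled observations
Y_l = rng.integers(0, 10, size=100)       # labels y_i in {1, ..., K}, here K = 10
X_u = rng.normal(size=(3000, 32, 32, 3))  # n_u = 3000 unlabeled observations
S_l = (X_l, Y_l)
S_u = X_u                                 # S_u = X_u: no labels available
```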
The practical implementation of SSDL techniques in different contexts has been limited, barring few exceptions [32]. As with other learning paradigms, the transfer of SSDL techniques from the lab to the real world is complicated by, among other reasons, the violation of the independent and identically distributed (IID) assumption. In principle, we would like to exploit available unlabeled data as flexibly as possible. In practice, distribution mismatches between the labeled and unlabeled datasets can lead to serious performance degradation [32]. The following example illustrates this problem. We can train a convolutional neural network (CNN) to classify chest X-ray images of COVID-19 ill and healthy patients, as, for example, seen in [7]. The labeled dataset $S_l$ can include a limited number of observations for each class. However, the unlabeled dataset $S_u$ can include observations of patients with other lung pathologies not sampled in $S_l$, leading to a distribution mismatch between the labeled and unlabeled datasets. The mismatching data can be described as out-of-distribution (OOD) data [23], and it can harm the performance of an SSDL solution [32].
This begs the question of how we can systematically select labeled and unlabeled data in non-IID settings such that performance on the downstream task is increased. A common recourse is what we call semantic matching heuristics. For example, Tiny ImageNet (TI) may be judged more similar to the Canadian Institute for Advanced Research dataset of 10 classes (CIFAR-10) than to the Modified National Institute of Standards and Technology dataset (MNIST) because the first two datasets both contain objects, whereas the last contains handwritten digits. Practices of semantic matching can be traced to other fields of machine learning, too, including out-of-distribution detection [52] and the domain adaptation literature [47], [50]. Insights from generative modeling should, at the very least, make us uneasy about such an approach to determining dataset similarity: similarity can vary drastically depending on whether it is determined through semantic heuristics or quantified through the lens of a machine learning model [28].

A. Problem Statement
The central premise of this work is the quantitative assessment of the impact of distribution mismatch between labeled and unlabeled data on SSDL. This notion stipulates that a mismatch negatively affects the accuracy of models trained with SSDL algorithms [32]. Distribution mismatch occurs when the unlabeled data contains observations that do not correspond to, or are too dissimilar to, observations of any of the classes present in the labeled data. It is not clear, though, what exactly the effect is when this mismatch occurs.
• Does it always harm model accuracy in the context of SSDL?
• Does it help to use unlabeled data that is, supposedly, semantically more similar to the labeled data?
• Furthermore, if certain unlabeled datasets indeed harm the accuracy of SSDL-trained models, is there a reliable way to select the unlabeled data in an informed way prior to SSDL training?

We adopt the following definitions. Given a dataset $S_1$ emanating from the data-generating process $y = f(x)$, with $y \in \mathcal{Y} := \{1, \ldots, K\}$ being a set of labels, and a second dataset $S_2$ emanating from the data-generating process $y = g(x)$, with $y \in \mathcal{Y}' := \{1, \ldots, K'\}$, we define the following concepts.
Definition 1: IID data: Dataset $S_2$ is IID relative to dataset $S_1$ if $f(x) = g(x)$. In particular, we must have that $\mathcal{Y} = \mathcal{Y}'$.
Definition 2: OOD data: Dataset $S_2$ is OOD relative to dataset $S_1$ if $f(x) \neq g(x)$. In particular, we may have that $\mathcal{Y} \neq \mathcal{Y}'$.
Definition 3: Distribution mismatch in SSDL: A distribution mismatch occurs if the unlabeled data $S_u$ used for SSDL is OOD relative to the labeled data $S_l$.
In practice, $f(x)$ and $g(x)$ are typically not known explicitly. Thus, given two datasets $S_1$ and $S_2$, a definite formal verification of the distribution mismatch property is not possible. Instead, it is usually assumed that two different datasets, e.g., CIFAR-10 and MNIST, derive from different data-generating processes. This working definition of OOD data follows the existing literature on distribution mismatch in SSDL [32] as well as OOD detection in deep learning [37]. We adopt this working definition for the OOD scenarios of our test bed. Note that different degrees of OOD contamination for $S_u$ are possible, as we describe in Section IV-A.

B. Contribution
In order to address the questions outlined in Section I-A, we first study the effect of distribution mismatch on SSDL accuracy in a systematic test bed. Then, we present a set of deep dataset dissimilarity measures (DeDiMs) to assess, prior to training, the effectiveness of unlabeled datasets for MixMatch SSDL [5]. A visual summary of the process is provided in Fig. 1. All code and experimental scripts, with automatic download of test bed data for ease of reproduction, are made publicly available.1 Our work entails the following contributions.
• We present and make available a comprehensive simulation sandbox, called non-IID-SSDL, for stress testing SSDL algorithms under various non-IID (distribution mismatch) configurations. We demonstrate that including OOD data in the unlabeled training dataset for the MixMatch algorithm can yield different degrees of accuracy degradation compared to the exclusive use of IOD data. However, in most cases, using unlabeled data with OOD contamination still improves the results when compared to the default fully supervised configuration.
• Markedly, unlabeled data that is supposedly semantically similar to the IOD labeled data does not always lead to the highest accuracy gain. This counterintuitive result suggests that semantic matching heuristics are an unreliable guide for unlabeled dataset selection.
• We propose and evaluate four DeDiMs that can be used to rank unlabeled datasets according to the expected accuracy gain prior to SSDL training. They are inexpensive to compute and model agnostic, which makes them amenable to practical application.
• Our test results reveal a strong correlation between the tested DeDiMs and MixMatch accuracy, making them useful for unlabeled dataset selection. We therefore propose using the tested DeDiMs to select the unlabeled dataset for improved MixMatch accuracy. The best performing DeDiMs use a nonparametric density function approximation of the feature space, which provides a method to quantitatively describe the distribution mismatch between two datasets.

Fig. 1. A summary of the workflow presented in this article. In step 1, a labeled, inside-of-distribution dataset $S_{IOD}$, here MNIST, is paired with different potential unlabeled datasets for semisupervised learning. The unlabeled data $S_{u,OOD}$ in our experiments is of three types $T_{OOD}$: other half (OH), similar (Sim), and different (Diff). In step 2, a pretrained ResNet is used to extract feature representations of the labeled and unlabeled datasets, and a deep dataset dissimilarity measure (DeDiM) is applied. Finally, in step 3, the dissimilarity scores can be used as a proxy for SSDL accuracy to select unlabeled data. This example shows results from the MNIST $S_{IOD}$ experiment. The colors in the last scatter plot designate the number of labeled samples.

II. RELATED WORK
In this work, we address a combination of three overlapping problems that are often dealt with separately in the literature: OOD detection, distribution mismatch in SSDL, and dataset dissimilarity measures.

A. OOD Data Detection
In the context of machine learning, OOD data detection refers to the general problem of detecting observations that belong to a data distribution different from the distribution of the training data [18]. OOD detection can be considered a generalization of outlier detection, since it considers individual and collective outliers [40]. Further variations of the OOD data detection problem are novelty and anomaly detection [33], with applications such as rare event detection and artificial intelligence safety [1], [17]. Classical OOD and anomaly detection methods rely on density estimation, e.g., Gaussian mixture models [24], robust moment estimation, like the minimum covariance determinant method [38], prototyping, e.g., the k-nearest neighbor algorithm [24], as well as kernel-based variants such as support vector data description [43]. A variety of neural network-based approaches for novelty detection can also be found [24], implementing a more data-oriented approach.
With the success of deep learning, recent works have addressed the generic problem of discriminative detection of OOD data for deep learning architectures. In general, discriminative OOD detectors can be categorized into output-based and feature-based. For instance, a simple output-based OOD detection approach was proposed in [18]. The authors framed OOD detection as a prediction confidence estimation problem; the proposed method relies on the softmax output, taking its maximum value as a confidence score. Liang et al. [23] introduced OOD data detection in neural networks using input perturbations. A temperature coefficient $T$ is used in the calculation of the softmax output, with a calibrated decision threshold $\delta$ for OOD data detection.
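As a hedged sketch of this output-based family (our simplified illustration of the general idea, not the exact methods of [18] or [23], which add score calibration and input perturbations), the decision reduces to thresholding a temperature-scaled maximum softmax score:

```python
import torch
import torch.nn.functional as F

def max_softmax_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Maximum softmax probability under temperature T; low scores flag OOD."""
    return F.softmax(logits / T, dim=1).max(dim=1).values

def is_ood(logits: torch.Tensor, delta: float, T: float = 1.0) -> torch.Tensor:
    """Threshold the score with a calibrated delta, as in output-based detectors."""
    return max_softmax_score(logits, T) < delta
```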
More recently, Lee et al. [22] argued that deep neural networks with softmax output layers are overconfident for inputs dissimilar from the training data and hence proposed using the Mahalanobis distance in latent space. Similarly, Tagasovska and Lopez-Paz [41] also exploit latent representations, defining what they refer to as learning certificates: neural networks that map feature vectors to zero for IOD data. A more challenging OOD detection setting was tested, where half of each tested dataset is used as IOD data and the other half as OOD data, making OOD detection harder. Zisselman and Tamar [52] propose an OOD detector that also uses the feature space. The approach fits different parametric distributions in the feature space of the data, and the decision to discriminate between OOD and IOD data is made based on the estimate from the approximated parametric model. Unfortunately, no comparison with other popular OOD methods was presented. A similar approach, with a simpler linear model trained on the statistical moments of the feature space, can be found in [35].
In this concise overview of OOD detection methods, two main categories can be distinguished: output-based and feature-space-based. The datasets selected for benchmarking OOD detection methods usually differ between works, and quantitative evaluation of the difficulty of performing OOD detection is rare.

B. Distribution Mismatch in SSDL
The distribution mismatch between $S_u$ and $S_l$ can be interpreted as a violation of the IID assumption. Different causes for this distribution mismatch can be distinguished, as discussed in [19]. We summarize them as follows.
• Prior probability shift: The density of the targets in $S_l$ is different from the real target densities in $S_u$ (increasing the possibility of sampling noise). Class imbalance in the labeled dataset $S_l$ is a special case of this setting, as discussed in [8].
• Covariate shift: The labeled dataset $S_l$ might sample a different density of the features when compared to the unlabeled dataset $S_u$, causing a distribution mismatch between the two datasets. For example, for handwritten digit recognition, the sample in $S_l$ might capture different stroke widths than $S_u$. Concept drift is a similar setting, where the change of features causes the concept to change semantically.
• Concept shift: It corresponds to a label change for a similar set of features. For instance, for sentiment analysis in audio, an observation might have different labels depending on the labeler (this is also related to label noise). Concept shift is less relevant in the context of distribution mismatch between $S_l$ and $S_u$, as no label information from $S_u$ is used during training.
In this work, we analyze the impact of distribution mismatch between $S_l$ and $S_u$ caused by concept drift as a mild cause of distribution mismatch (for instance, using SVHN as $S_u$ and MNIST as $S_l$). To create more severe distribution mismatch settings, we contaminate the unlabeled dataset $S_u$ with different percentages of observations from completely different datasets (with different labels or features), for example, using MNIST as $S_l$ and composing $S_u$ of 50% Gaussian noise (GN) images and 50% MNIST images.
The authors of [32] call for more extensive testing of SSDL techniques in real-world scenarios. One such scenario is a possible distribution mismatch between the labeled and unlabeled training data, which can adversely impact SSDL results. RealMix [27] was proposed in response, implementing a masking coefficient for OOD data in the unlabeled dataset. The masking coefficient acts as a threshold on the softmax output of the model, discarding unlabeled data in the unsupervised term only. The authors performed limited testing on the significance of using OOD unlabeled data, with relatively few OOD contamination scenarios tested: the OOD dataset consisted of the CIFAR-10 dataset split into two halves with different semantics, and a total of four levels of OOD contamination were tested. We extend OOD testing to more configurations.
More recently, the work in [11] proposed a simple approach to deal with OOD data by using soft labels averaged over the model's outputs across a number of epochs. The evaluation includes a benchmark with different proportions of distribution mismatch. The results demonstrate improved accuracy of the proposed method over other state-of-the-art SSDL approaches when dealing with OOD data in the unlabeled dataset. However, MixMatch is not among the compared approaches. Moreover, the distribution mismatch scenarios were not extensive, testing only different degrees of mismatch contamination and not evaluating the impact of different OOD data sources.
In [51], an SSDL framework robust to OOD data was proposed. The authors claim that OOD data far away from the decision boundaries affects SSDL performance less than OOD data lying very close to the decision boundaries. However, no explicit quantitative measure of distribution similarity was used. The authors also noted a high influence of batch normalization, where normalizing the data using far-away OOD data can impact the accuracy of the model more. To address this issue, the authors proposed a dynamic approach to reweight the observations at both batch-normalization and training time, using a gradient-based optimization approach for both. The model was tested using virtual adversarial training and the Π model, excluding MixMatch. The experiments included different degrees of OOD contamination and different unlabeled datasets; however, no comparison to other approaches explicitly designed for OOD-robust SSDL was performed.
In [11], another approach for OOD-robust SSDL was proposed, also using per-observation reweighting and giving less weight to the observations that are most likely OOD. To calculate the per-observation weights, an uncertainty proxy, as in [16], was implemented, using an ensemble of the models yielded during past epochs. The model was tested on the CIFAR-10 dataset (six classes) with varying degrees of OOD contamination (the four remaining CIFAR-10 classes). No other unlabeled contamination data sources were used.
Unlike previous studies, in this work we aim to quantify the notion of OOD data, correlating it with SSDL accuracy using different unlabeled datasets with varying degrees of OOD contamination and different data sources. This quantification can be used to select one unlabeled dataset among many prior to SSDL training. It also allows us to analyze the influence of OOD data. Finally, the proposed method can be extended to weight how harmful an unlabeled observation can be for SSDL. Using the feature distribution to this end has not been fully explored in previous work.

C. Dataset Dissimilarity Measures
Comparing two datasets, in our case the labeled dataset $S_l$ and the unlabeled dataset $S_u$, to quantify the data mismatch between them prior to training requires dataset comparison measures. Computing a notion of dissimilarity between two sets of points (also known as shape matching [25]) is typically computationally more expensive than calculating the dissimilarity between a set of points and a single point. Strategies to reduce this burden are primarily centered around enriching the object space with a probability measure, which helps guide attention to important areas of comparison [25]. When starting with raw datasets, as is typically the case when trying to decide which data to use for SSDL, additional preprocessing or modeling steps would be necessary to obtain this probability measure. Methods explicitly designed to compute dissimilarities between raw datasets for deep learning are, to the best of our knowledge, rare. Tatti [42] defines a dissimilarity measure based on the Euclidean distance between the frequencies of a given feature function on two datasets, referred to as the constrained measure distance. The proposed measure can be computed efficiently using the covariance matrix of the feature function on the dataset.
More recently, Cabitza and Campagner [6] proposed a dissimilarity index based on the statistical significance of the difference between the distance distributions of the two datasets. To calculate it, each data point in the test set is matched with the training data. After exchanging the associated observations, changes in the topology are assessed using the distance distribution. The p-value of the difference between the two distributions is calculated and used as a dissimilarity measure.
Note that our requirements differ from those of the above OOD detection and dissimilarity measure methods: we are interested in a computationally inexpensive, prior-to-training, SSDL-model-agnostic quantification of the OOD degree between two datasets. Approaches that are computationally expensive or retrospective, i.e., applied after the model has been trained, are not feasible for addressing distribution mismatch before SSDL training.
Closest to our work are the OOD detection ideas developed by Ren et al. [37]. The authors present introductory experiments on the correlation between OOD detection and dataset dissimilarity using a genome distance [36]. We explore a similar comparison: the relationship between SSDL accuracy and OOD-IOD dissimilarity, which can be useful for a prior evaluation of unlabeled datasets for SSDL. This enables interesting quantitative insight into the real impact of OOD data on SSDL accuracy, which we explore in this work.

III. PROPOSED METHOD
Our approach is based on a simple idea: if OOD data indeed affects MixMatch SSDL accuracy, we would like to select the unlabeled data prior to SSDL training such that the resulting test accuracy of the model is maximized. To that end, we propose and evaluate a number of DeDiMs. They provide a quantitative notion of similarity between the inputs of the IOD labeled data and the inputs of the OOD unlabeled data. The DeDiMs are based on dataset subsampling, as image datasets are usually large, following a sampling approach for comparing two populations, as seen in [21]. We compute the dissimilarity measures in the feature space of a generic Wide-ResNet pretrained on ImageNet, making our proposed approach agnostic to the SSDL model to be trained. This enables an evaluation of the unlabeled data before training the SSDL model. The measures proposed in this work are meant to be simple and quick to evaluate, with practical use in mind. We propose and test two Minkowski-based distances, $d_2(S_a, S_b, \tau, C)$ and $d_1(S_a, S_b, \tau, C)$, corresponding to the Euclidean and Manhattan distances, respectively, between two datasets $S_a$ and $S_b$. Additionally, we implement and test two nonparametric density-based dataset divergence measures: the Jensen-Shannon divergence ($d_{JS}$) and the cosine distance ($d_C$). For all the proposed dissimilarity measures, the parameter $\tau$ defines the subsample size used to compute the dissimilarity between the two datasets $S_a$ and $S_b$, and $C$ the total number of samples used to compute the mean sampled dissimilarity measure. The general procedure for all the implemented distances is as follows.
• We randomly subsample each of the datasets $S_a$ and $S_b$ with a sample size of $\tau$, creating the sampled datasets $S_{a,\tau}$ and $S_{b,\tau}$.
• We transform each input observation $x_j \in \mathbb{R}^n$, where $n$ is the dimensionality of the input space, using the feature extractor $f$, yielding the feature vector $h_j = f(x_j)$.
• The feature vector $h_j \in \mathbb{R}^{n'}$ has dimension $n'$, with $n' < n$.

For instance, the implemented feature extractor $f$ uses the ImageNet-pretrained Wide-ResNet architecture, extracting $n' = 512$ features. This yields the two feature sets $H_{a,\tau}$ and $H_{b,\tau}$. For the Minkowski-based distances $d_2(S_a, S_b, \tau, C)$ and $d_1(S_a, S_b, \tau, C)$, we perform the following steps on the feature sets $H_{a,\tau}$ and $H_{b,\tau}$.

• For each feature vector $h_j \in H_{a,\tau}$, find the closest feature vector $h_k \in H_{b,\tau}$ using the $\ell_p$ distance, with $p = 1$ or $p = 2$ for the Manhattan and Euclidean distances, respectively: $d_j = \min_k \|h_j - h_k\|_p$. We do this for $C$ samples, yielding a list of inter-dataset distances $d_1, d_2, \ldots, d_C$.
• We compute a reference list of distances for the same list of samples of the dataset $S_a$ to itself (intra-dataset distance), thereby computing $d_p(S_a, S_a, \tau, C)$. This yields a list of reference distances $\check{d}_1, \check{d}_2, \ldots, \check{d}_C$. In our case, $S_a$ corresponds to the labeled dataset $S_l$, as the distance to different unlabeled datasets $S_u$ is to be computed. We highlight that this should result in values close to zero; however, as different samples are used for each distance computation, the results are not exactly zero.
• To ensure that the absolute differences between the reference and inter-dataset distances, $\tilde{d}_c = |d_c - \check{d}_c|$, are statistically significant, we compute the p-value associated with a Wilcoxon test.
• After the distance set between two datasets, $d_p(S_a, S_b, \tau, C)$, is obtained, its average reference-subtracted distance $\overline{d}$ and the corresponding statistical significance p-value are computed.

For the density-based distances, we follow a similar subsampling approach, with the following steps.
• For each dimension $r = 1, \ldots, n'$ of the feature space, we compute normalized histograms to approximate the density functions $p_{r,a}$ from the sample $H_{a,\tau}$. Similarly, we compute normalized histograms to yield the set of approximate density functions $p_{r,b}$, for $r = 1, \ldots, n'$, using the observations in the sample $H_{b,\tau}$.
• For the Jensen-Shannon divergence ($d_{JS}$) and the cosine distance ($d_C$), we compute the sum of the dissimilarities between the density functions $p_{r,a}$ and $p_{r,b}$ to yield the estimated dissimilarity for sample $j$: $d_j = \sum_{r=1}^{n'} \delta_g(p_{r,a}, p_{r,b})$, where $g = \mathrm{JS}$ and $g = \mathrm{C}$ for the Jensen-Shannon divergence and the cosine distance, respectively. We do this for all $C$ samples, yielding the list of inter-dataset distances $d_1, d_2, \ldots, d_C$. To lower the computational burden, we assume that the dimensions are statistically independent. This assumption also simplifies the likelihood calculation, as seen in other methods [20].
• Similar to the Minkowski distances, we compute the intra-dataset distances for the dataset $S_a$, in this context the labeled dataset $S_l$, to obtain the list of reference distances $\check{d}_1, \check{d}_2, \ldots, \check{d}_C$.
• Similarly, to verify that the inter- and intra-dataset distance differences $\tilde{d}_c = |d_c - \check{d}_c|$ are statistically significant, we compute the p-value associated with a Wilcoxon test.

The distance computation yields the sample mean distance $\overline{d}$ and its statistical significance p-value. The proposed dissimilarity measures do not fulfill the conditions of a mathematical metric or pseudo-metric, since the distance of an object to itself is not strictly zero (but tends to be close to it) and the symmetry property is not fulfilled, for the sake of evaluation speed [13]. Despite these relaxations, we will see that these dissimilarity measures, especially the two density-based ones, are an effective proxy for estimating the $S_{u,OOD}$ accuracy gain.
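A minimal sketch of both procedures is given below, under stated assumptions: we use torchvision's ImageNet-pretrained resnet18 as a stand-in feature extractor (it yields 512-dimensional features, matching $n' = 512$; the generic Wide-ResNet described above would slot in the same way), the cosine variant is omitted for brevity (it would replace the Jensen-Shannon term with scipy's cosine distance), and the function names, histogram bin count, and per-sample averaging are ours.

```python
import numpy as np
import torch
from torchvision import models
from scipy.spatial.distance import cdist, jensenshannon
from scipy.stats import wilcoxon

# Stand-in feature extractor: ImageNet-pretrained resnet18 with the classifier
# head removed, yielding 512-dim feature vectors h_j = f(x_j).
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
extractor = torch.nn.Sequential(*list(backbone.children())[:-1]).eval()

@torch.no_grad()
def features(x: torch.Tensor) -> np.ndarray:
    """Map a batch of (already normalized) images (B, 3, H, W) to features."""
    return extractor(x).flatten(1).cpu().numpy()

def sampled_distances(Xa, Xb, tau=80, C=10, kind="js", p=2, bins=20, seed=0):
    """List of C sampled dissimilarities between datasets S_a and S_b."""
    rng, out = np.random.default_rng(seed), []
    for _ in range(C):
        Ha = features(Xa[rng.choice(len(Xa), tau, replace=False)])
        Hb = features(Xb[rng.choice(len(Xb), tau, replace=False)])
        if kind == "minkowski":
            # d_j = min_k ||h_j - h_k||_p, averaged over the sample.
            out.append(cdist(Ha, Hb, "minkowski", p=p).min(axis=1).mean())
        else:
            # Density-based: per-dimension histograms, summed JS divergences.
            d = 0.0
            for r in range(Ha.shape[1]):
                lo = min(Ha[:, r].min(), Hb[:, r].min())
                hi = max(Ha[:, r].max(), Hb[:, r].max())
                pa, _ = np.histogram(Ha[:, r], bins=bins, range=(lo, hi))
                pb, _ = np.histogram(Hb[:, r], bins=bins, range=(lo, hi))
                d += jensenshannon(pa, pb) ** 2  # squared JS distance = divergence
            out.append(d)
    return np.asarray(out)

def dedim(Xl, Xu, **kw):
    """Mean reference-subtracted distance and Wilcoxon p-value for (S_l, S_u)."""
    inter = sampled_distances(Xl, Xu, **kw)  # inter-dataset distances d_1..d_C
    intra = sampled_distances(Xl, Xl, **kw)  # intra-dataset reference distances
    _, p_value = wilcoxon(inter, intra)
    return np.abs(inter - intra).mean(), p_value
```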
To quantitatively measure the relationship between the $S_l$-to-$S_u$ distances and SSDL accuracy, we calculate the Pearson coefficient between them. This verifies the linear correlation between the two. Table 3 reports the Pearson coefficient for each implemented dissimilarity measure and each SSDL configuration.
In summary, we propose to quantitatively rank a set of candidate unlabeled datasets $S_{u,1}, S_{u,2}, \ldots, S_{u,k}$ according to a dissimilarity measure $d(S_l, S_u)$, instead of using semantic matching heuristics. In all the tests of this work, we used $n' = 512$, $\tau = 80$, and $C = 10$.
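Given the `dedim` sketch above, a hypothetical usage (dataset names, arrays, and accuracy values are placeholders) reduces the ranking to a sort, and the Table 3 analysis to a Pearson correlation:

```python
from scipy.stats import pearsonr

# Hypothetical candidate unlabeled datasets (tensors of images).
candidates = {"SVHN": X_svhn, "TI": X_ti, "GN": X_gn}
scores = {name: dedim(X_l, X_u, kind="js")[0] for name, X_u in candidates.items()}
ranking = sorted(scores, key=scores.get)  # least dissimilar dataset first

# Pearson correlation between DeDiMs and MixMatch test accuracies observed
# after training (placeholder values), mirroring the analysis of Table 3.
accuracies = {"SVHN": 0.91, "TI": 0.93, "GN": 0.62}
r, p = pearsonr([scores[n] for n in candidates], [accuracies[n] for n in candidates])
```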

A. Semisupervised Deep Learning Setup
The basis for all SSDL experiments in this article is the MixMatch algorithm, a state-of-the-art SSDL method [5]. MixMatch estimates pseudo-labels for the unlabeled data $X_u$ and also implements an unsupervised regularization term. Pseudo-label $y_j$ estimation is performed with the average model output over $K$ different transformations of an input $x_j$. The pseudo-labels are further sharpened with a temperature parameter $\rho$. To further augment the data using both labeled and unlabeled samples, MixMatch makes use of the MixUp algorithm [49], which builds linear interpolations between labeled and unlabeled observations. For the supervised and semisupervised loss terms, the cross-entropy and the Euclidean distance are used, respectively. The regularization coefficient $\gamma$ controls the direct influence of unlabeled data. Unlabeled data also influences the labeled data term, since it is used to artificially augment the dataset with the MixUp algorithm. This loss is used at training time; for testing, a regular cross-entropy loss is implemented. For a detailed description of the MixMatch algorithm, we refer to [5]. We use the recommended hyperparameters documented in the supplementary material.
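A minimal sketch of the two MixMatch ingredients described above, pseudo-label averaging with temperature sharpening and MixUp interpolation, could look as follows (parameter names mirror the text, i.e., $K$ augmentations and temperature $\rho$; the default values are illustrative, see [5] for the full algorithm):

```python
import torch

def sharpen(p: torch.Tensor, rho: float = 0.5) -> torch.Tensor:
    """Temperature sharpening of an averaged pseudo-label distribution."""
    p = p ** (1.0 / rho)
    return p / p.sum(dim=1, keepdim=True)

@torch.no_grad()
def pseudo_label(model, x_u, augment, K: int = 2, rho: float = 0.5):
    """Average the model's softmax outputs over K augmentations, then sharpen."""
    p = torch.stack([model(augment(x_u)).softmax(dim=1) for _ in range(K)]).mean(0)
    return sharpen(p, rho)

def mixup(x1, y1, x2, y2, alpha: float = 0.75):
    """MixUp interpolation; MixMatch biases lambda toward the first argument."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    lam = torch.maximum(lam, 1 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```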

B. SSDL With OOD Data Test Bed
To assess the effect of OOD unlabeled data $S_u$ on the accuracy of SSDL models trained with MixMatch, we construct the non-IID-SSDL test bed with five variable parameters: 1) the base data $S_{IOD}$, which constitutes the original task to be learned; 2) the type of OOD data $T_{OOD}$; 3) the OOD data source $S_{u,OOD}$; 4) the relative amount of OOD data among the unlabeled data $\%_{u,OOD}$; and 5) the amount $n_l$ of labeled observations. Each of the five axes is explored by varying only one of the variables at a time while keeping the others constant. This allows us to isolate the effect of the individual variables. We consider three configurations for $S_{IOD}$, comprising MNIST, CIFAR-10, and FashionMNIST. A total of three configurations for $T_{OOD}$ [other half (OH), similar (Sim), and different (Diff)] are tested. We derived the possible types of OOD data from the existing literature cited in Section II. In the OH setting, half of the classes and associated inputs are taken to be the $S_{IOD}$ data, whereas the other half of the classes are taken to be the $S_{u,OOD}$ data. Similar means an $S_{u,OOD}$ dataset that is assumed to be semantically related to $S_{IOD}$, e.g., MNIST and the Street View House Numbers dataset (SVHN). Different means an $S_{u,OOD}$ dataset that is supposedly semantically unrelated to $S_{IOD}$, e.g., MNIST and TI. There are five configurations for $S_{u,OOD}$, as explained above: the other half (OH), a similar dataset, and three different datasets, including two noise datasets (salt-and-pepper noise, SAPN, and Gaussian noise, GN); see Table 1 for the per-task pairings. Each configuration represents a multiclass classification task with $|\mathcal{Y}| = 5$, that is, a random subset of half of the classes of the base data $S_{IOD}$.
We vary the relative amount of OOD data $\%_{u,OOD}$ among 0%, 50%, and 100%, as well as the amount of labeled data points $n_l$ among 60, 100, and 150. We study the behavior of MixMatch in settings with a very limited number of labels, where the benefit of SSDL is usually higher. This makes the impact of distribution mismatch more evident. Note that for each result entry in Table 1, we performed 10 experimental runs and report the mean and standard deviation of the accuracy of the models performing best on the test data from each run, as overfitting is very likely to happen with a low $n_l$. For each run, we sampled disjoint subsets of data from $S_{IOD}$ and $S_{u,OOD}$ to obtain the required numbers of labeled ($n_l$) and unlabeled ($n_u$) samples for the run. Descriptive statistics (mean and standard deviation) for standardizing the neural network inputs were computed only from these subsets, to keep the simulation realistic and not use any information from the global training data. All other parameters (number of unlabeled observations $n_u = 3000$, neural network architecture, the set of optimization hyperparameters, number of training epochs) are kept constant across all experiments to enable direct comparison with respect to the variable parameters of the system. We clarify that the goal of this test bed is to assess the impact of distribution mismatch for MixMatch, rather than to achieve state-of-the-art performance with MixMatch on the given data. These hyperparameters are described in the supplementary material. Note that it is possible to extend the test bed to other effects of interest; we address some of these ideas in Section VI.
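For illustration, the unlabeled set of a single run could be assembled as in the following sketch (the function name, seed handling, and array names are ours, not part of the released test bed code); e.g., `contaminate(X_mnist, X_gn, pct_ood=50)` would reproduce the 50% GN contamination scenario mentioned in Section II-B.

```python
import numpy as np

def contaminate(X_iod: np.ndarray, X_ood: np.ndarray, pct_ood: float,
                n_u: int = 3000, seed: int = 0) -> np.ndarray:
    """Build an unlabeled set S_u with pct_ood percent OOD contamination."""
    rng = np.random.default_rng(seed)
    n_ood = int(n_u * pct_ood / 100)        # e.g., pct_ood=50 -> 1500 OOD images
    iod = X_iod[rng.choice(len(X_iod), n_u - n_ood, replace=False)]
    ood = X_ood[rng.choice(len(X_ood), n_ood, replace=False)]
    S_u = np.concatenate([iod, ood])
    rng.shuffle(S_u)                        # shuffle along the first axis
    return S_u
```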

C. Deep Dataset Dissimilarity Measures
In Table 2, we show the dissimilarity results for the tested labeled and unlabeled dataset combinations. We tested the dissimilarity measures detailed in Section III, namely the Manhattan or $\ell_1$ distance $d_1$, the Euclidean distance $d_2$, the cosine distance $d_C$, and the Jensen-Shannon divergence $d_{JS}$. The distances and divergences are computed without the need to train a model, making the proposed approach appealing for choosing unlabeled datasets before SSDL training. As a complementary qualitative test, in Fig. 2, we show the probability density function approximation plots of some of the features for the MNIST dataset, using both the chosen similar dataset (SVHN) and the selected different dataset (TI). We picked the features presenting the smallest divergences for the chosen datasets. The density functions were built using random samples for both data pairs. The probability density function approximation plots illustrate, in a summarized manner, the similarity computed between the two compared datasets and its correlation with the measured Jensen-Shannon and cosine divergences.

V. RESULTS
The experimental setup used in this work is detailed in the supplementary material. Table 1 shows the results of the distribution mismatch experiment described in Section IV-A. We make a number of observations. In the majority of cases, using IOD unlabeled data or a 50-50 mix of IOD and OOD unlabeled data beats the fully supervised baseline. For instance, compare the results in row 0 with the results in rows 2-7 (for the SSDL model): a clear advantage of the SSDL model over the supervised model is revealed, even under distribution mismatch settings. The gains range from 15% to 25% for MNIST, 10% to 15% for CIFAR-10, and 7% to 13% for FashionMNIST across all $S_{u,OOD}$ and $n_l$. As expected, in most cases, the accuracy is degraded when including OOD data in $S_u$, with a more dramatic hit when noisy datasets (SAPN, GN) are used as OOD data contamination.
Another interesting observation from the experimental results relates semantic matching heuristics to the resulting SSDL accuracy. Sometimes, using an unlabeled dataset that is supposedly semantically less similar can result in greater accuracy. This is observed, for example, in Table 1 when $S_l$ = CIFAR-10, $n_l = 100$, and $n_l = 150$, where OOD unlabeled data from TI (row 16) results in an accuracy similar (with no statistically significant gain, according to the Wilcoxon test performed) to using the other half of CIFAR-10 as $S_{u,OOD}$ (row 14). It is interesting that an $S_{u,OOD}$ dataset of type different can have a benefit similar to that of an $S_{u,OOD}$ dataset of type similar. A clearer case of this tendency is found for FashionMNIST and TI (row 31) versus FP at $n_l = 150$ (row 29). In this case, using the TI (different) dataset brings a higher SSDL accuracy than using the FP (similar) dataset. This contradicts the common heuristic that unlabeled data that appears semantically more related to the labeled data is always the better choice for SSDL. Rather, as we demonstrate in the second set of results below, a notion of distance in the feature space between labeled and unlabeled data offers a more consistent and quantifiable proxy for the expected benefit of an unlabeled dataset.
As a qualitative illustration, Fig. 2 shows an example of the density functions approximated from randomly selected samples for the MNIST-TI and MNIST-SVHN dataset pairs. The plots reveal a stronger density-based similarity between the MNIST and TI datasets than between the MNIST and SVHN datasets, in spite of the higher semantic similarity of SVHN to MNIST (both represent numbers, the former in natural scenes and the latter in handwritten images). This correlates well with the quantitative figures in Table 2. For instance, MNIST is more dissimilar to the SVHN dataset (MNIST contaminated by 100% with the SVHN dataset, row 3) than to the TI dataset (MNIST contaminated by 100% with the TI dataset, row 5). This also correlates strongly with the final SSDL accuracy yielded by both unlabeled datasets (MNIST contaminated by 100% with SVHN, in row 5, and with TI, in row 7) shown in Table 1. MixMatch shows a marginally higher accuracy (with no statistical significance, after performing a Wilcoxon test) when using TI as the unlabeled dataset compared to using SVHN.
The second set of results demonstrates the potential of using distance measures as a systematic and quantitative ranking heuristic when selecting unlabeled datasets for the MixMatch algorithm. The exact distances, as described in Section III, for all OOD configurations from the ablation study can be found in Table 2. We can observe that these distances trace the accuracy results found in Table 1, as confirmed by the Pearson correlation. This correlation is quantified in Table 3, with the cosine-based density measure $d_C$ correlating particularly well with the accuracy results of Table 1. Also, the p-values are consistently lower for the density-based distances (with fewer p-values that exceed 0.05, as shown by the italicized entries in Table 2), meaning that the density-based distances offer more confidence. We suspect that this is related to the quantitative approximation of the feature distribution mismatch implemented in both the $d_{JS}$ and $d_C$ distances. In Table 1, we indicate the distance-based preference ranking in parentheses. The OOD configurations resulting in the best SSDL accuracy are contained in the top two ranked choices in most cases.

VI. CONCLUSION
In this work, we extensively tested the behavior of the MixMatch algorithm under various OOD unlabeled data settings. We introduced a set of quantitative data selection heuristics, DeDiMs, to rank unlabeled datasets prior to model training according to their expected benefit to SSDL. Our results lead us to the following conclusions.
1) In the experiments conducted in this study, the implemented DeDiMs correlate strongly with SSDL accuracy. In particular, density-based measures yield a high correlation with MixMatch accuracy. This suggests that DeDiMs can be applied in SSDL prior to learning, aiding the unlabeled data selection process and mitigating the distribution mismatch problem. The proposed method is agnostic to the downstream SSDL algorithm, simple, and fast to compute, making it particularly suitable for practical application in SSDL. Different OOD detectors [52] use the feature space for building a discrimination criterion to filter OOD data. Our results suggest that online OOD data filtering approaches for SSDL, such as those developed in [11], [27], might benefit from using the feature space for OOD detection. Other criteria for online OOD detection during training, such as the model softmax output used in [27], might discard data that could be useful for learning. This is tested in [10] for a practical application.

2) In real-world usage scenarios of SSDL, the unlabeled dataset $S_u$ may contain observations of classes not present in the labeled dataset $S_l$. We simulated a similar scenario with the OH setting, which resulted in a subtle accuracy degradation in most cases. However, the accuracy gain obtained vis-a-vis the fully supervised baseline is still substantial, making the application of SSDL attractive in such a setting.

3) Another plausible real-world scenario for SSDL is the inclusion of widely available unlabeled datasets, e.g., built with web crawlers, where shifts in crawl queries can lead to different unlabeled datasets. This scenario has been simulated with the OOD types similar and different. We can observe that notions of semantic similarity between labeled and unlabeled dataset pairings, e.g., MNIST-SVHN or FashionMNIST-FP, do not necessarily imply an SSDL accuracy gain. The quantitative comparison of the density function plots in Table 2 suggests a higher similarity for dataset pairs with less semantic similarity, for some of the tested dataset setups. Distance measures, in particular $d_C$, seem to be an accurate and systematic proxy for SSDL accuracy, according to our test results. This is visible when comparing the accuracy and distance results of the previous pairings to MNIST-TI and FashionMNIST-TI, which have higher accuracies and also, surprisingly, lower distance measurements. We speculate that using more diverse data for pretraining might yield an even better feature extractor, similar to results in self-supervised learning methods [48].

4) As suggested, our method can be used to rank different unlabeled datasets. The proposed DeDiMs are efficient to implement, requiring only small samples and no model training, as a pretrained ImageNet feature extractor is used. According to our tests, a ResNet model pretrained on ImageNet without further fine-tuning works surprisingly well for quantifying unlabeled-to-labeled dataset affinity. As preliminary studies show a growing concern for the carbon footprint of training deep learning models [2], inexpensive and quantified data selection heuristics like DeDiMs can help avoid unnecessary computational loads. Further studying our method as a way to decrease training time and resources is an interesting future research path.

5) The claim in [51] regarding the impact of OOD data close to the decision boundary compared to OOD data far from it relies on a Euclidean projection of the data.
In this work, we have gathered evidence that Euclidean-based similarity measures correlate worse with SSDL accuracy than the density function-based measures tested. Using a density-based divergence like Jensen-Shannon might not correlate well with semantic similarity, but, according to our tests, it better explains the obtained SSDL accuracy. This shows how the feature extractor and the resulting feature space projections play a more important role in the final model performance than the original input space, as the feature space is built through nonlinear convolution operations that significantly change the input representation. Based on these results, we can draw a number of recommendations for researchers in the field, which we list as follows.
1) Our results shift the attention to data-oriented approaches for improving model performance. Similar to [26], where dataset sparsity is related to downstream model accuracy, our method allows the use of DeDiMs to assess the impact of unlabeled datasets on SSDL training. This enables the exclusion of datasets that are not beneficial for a given SSDL task.
2) The use of SSDL can also improve other model properties like uncertainty [9]. Hence, exploring the impact of OOD data on other aspects of SSDL performance, such as robustness [30], explainability [39], and confidence [3], as recommended in [29], [31], is a promising next step for distribution mismatch analysis. For instance, in [3], the impact of OOD data on overall model robustness and explainability is tested. Evaluating the impact of distribution mismatch between $S_l$ and $S_u$ on other performance aspects opens up further questions for research.

3) In unsupervised domain adaptation, we find similar challenges, where the target domain presents a different distribution than the source domain. Using SSDL for such a setting can leverage unlabeled data in the target domain. For instance, in [50], an SSDL approach is proposed for unsupervised domain adaptation. Quantifying the degree of OOD contamination of the unlabeled data could improve the analysis of the test results and help estimate the performance of unsupervised domain adaptation.

4) Finally, the proposed test bed and distance measures can be used for a more systematic quantitative evaluation of SSDL algorithms. Counterintuitively, datasets with a high perceived semantic similarity can be less beneficial for SSDL than other unlabeled datasets with less perceived semantic similarity, adding further evidence that we should be wary of conflating human and machine perception. In future work, we plan to extend the test bed to other SSDL variants, depth-first analyses (e.g., fewer tasks with more training epochs), additional axes of test bed variables (e.g., $n_u$), and more testing around the appropriate dissimilarity measure parameters. Further investigating the relationship between generic feature similarity and SSDL downstream performance is a promising topic in data-centric machine learning. The fact that feature dissimilarity scores can be calculated before SSDL training and independently of the SSDL model offers an interesting profile for application. Connections to OOD detection [52], concept drift [45], and distribution mismatch [11] could be explored further. Efficient and effective quantitative dataset evaluation prior to training a deep learning model offers an opportunity to push the envelope in computationally efficient deep learning and to narrow the gap between deep learning research and its real-world application.