
Dataset Similarity to Assess Semisupervised Learning Under Distribution Mismatch Between the Labeled and Unlabeled Datasets

Calderon-Ramirez, Saul; Oala, Luis; Torrentes-Barrena, Jordina; Yang, Shengxiang; Elizondo, David; Moemeni, Armaghan; Colreavy-Donnelly, Simon; Samek, Wojciech; Molina-Cabello, Miguel A.; Lopez-Rubio, Ezequiel


Abstract

Semi-supervised deep learning (SSDL) is a popular strategy for leveraging unlabelled data in machine learning when labelled data is not readily available. In real-world scenarios, several unlabelled data sources are usually available, each with a different degree of distribution mismatch with respect to the labelled dataset. This raises the question of which unlabelled dataset to choose for good SSDL outcomes. Oftentimes, semantic heuristics are used to match unlabelled data with labelled data. However, a quantitative and systematic approach to this selection problem would be preferable. In this work, we first test the SSDL MixMatch algorithm under various distribution mismatch configurations to study the impact on SSDL accuracy. Then, we propose a quantitative unlabelled dataset selection heuristic based on dataset dissimilarity measures, designed to systematically assess how distribution mismatch between the labelled and unlabelled datasets affects MixMatch performance. We refer to these measures as deep dataset dissimilarity measures (DeDiMs). They compare labelled and unlabelled datasets in the feature space of a generic Wide-ResNet, can be applied prior to learning, are quick to evaluate, and are model agnostic. The strong correlation in our tests between MixMatch accuracy and the proposed DeDiMs suggests that this approach is a good fit for quantitatively ranking different unlabelled datasets prior to SSDL training.
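
To make the workflow concrete, below is a minimal Python sketch of one plausible DeDiM-style measure, not the paper's exact implementation: embed both datasets with a frozen, generically (ImageNet-) pretrained Wide-ResNet and compare them in that feature space before any SSDL training. The function names (extract_features, dedim_distance) and the centroid-distance choice are illustrative assumptions; the paper evaluates its own set of dissimilarity measures.

    import torch
    import torchvision.models as models
    from torch.utils.data import DataLoader

    def extract_features(loader: DataLoader, device: str = "cpu") -> torch.Tensor:
        # Frozen, generically pretrained Wide-ResNet; the classifier head is
        # replaced by the identity so the penultimate features are returned.
        # Loaders are assumed to yield (image_batch, anything) pairs.
        net = models.wide_resnet50_2(weights="IMAGENET1K_V1")
        net.fc = torch.nn.Identity()
        net.eval().to(device)
        chunks = []
        with torch.no_grad():
            for images, _ in loader:
                chunks.append(net(images.to(device)).cpu())
        return torch.cat(chunks)

    def dedim_distance(labelled: DataLoader, unlabelled: DataLoader) -> float:
        # Toy dissimilarity: Euclidean distance between the two datasets'
        # feature centroids. This stand-in only illustrates the
        # "compare datasets before training" idea behind DeDiMs.
        mu_l = extract_features(labelled).mean(dim=0)
        mu_u = extract_features(unlabelled).mean(dim=0)
        return torch.linalg.norm(mu_l - mu_u).item()

    # Rank candidate unlabelled datasets prior to SSDL training:
    # a lower distance suggests a closer match to the labelled data.
    # scores = {name: dedim_distance(labelled_loader, dl)
    #           for name, dl in candidate_loaders.items()}

Because the backbone is generic and frozen, the measure is model agnostic and cheap to evaluate: it requires only forward passes, no training of the SSDL model itself.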

Citation

Calderon-Ramirez, S., Oala, L., Torrentes-Barrena, J., Yang, S., Elizondo, D., Moemeni, A., Colreavy-Donnelly, S., Samek, W., Molina-Cabello, M. A., & Lopez-Rubio, E. (2023). Dataset Similarity to Assess Semisupervised Learning Under Distribution Mismatch Between the Labeled and Unlabeled Datasets. IEEE Transactions on Artificial Intelligence, 4(2), 282-291. https://doi.org/10.1109/tai.2022.3168804

Journal Article Type Article
Acceptance Date Apr 22, 2022
Online Publication Date Apr 22, 2022
Publication Date Apr 1, 2023
Deposit Date May 3, 2022
Publicly Available Date May 5, 2022
Journal IEEE Transactions on Artificial Intelligence
Electronic ISSN 2691-4581
Publisher Institute of Electrical and Electronics Engineers
Peer Reviewed Peer Reviewed
Volume 4
Issue 2
Pages 282-291
DOI https://doi.org/10.1109/tai.2022.3168804
Keywords Training, Deep learning, Artificial intelligence, Feature extraction, Semisupervised learning, Data models, Semantics, Dataset similarity, distribution mismatch, MixMatch, out of distribution data, semisupervised deep learning
Public URL https://nottingham-repository.worktribe.com/output/7950447
Publisher URL https://ieeexplore.ieee.org/document/9762063
