ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem

Triguero, Isaac; del Río, Sara; López, Victoria; Bacardit, Jaume; Benítez, José M.; Herrera, Francisco

Authors

Sara del Río

Victoria López

Jaume Bacardit

José M. Benítez

Francisco Herrera



Abstract

The application of data mining and machine learning techniques to biological and biomedical data continues to be a ubiquitous research theme in current bioinformatics. The rapid advances in biotechnology are allowing us to obtain and store large quantities of data about cells, proteins, genes, etc., that should be processed. Moreover, in many of these problems, such as contact map prediction, the problem tackled in this paper, it is difficult to collect representative positive examples. Learning under these circumstances, known as imbalanced big data classification, may not be straightforward for most standard machine learning methods.

In this work we describe the methodology that won the ECBDL’14 big data challenge for a bioinformatics big data problem. This algorithm, named ROSEFW-RF, is based on several MapReduce approaches to (1) balance the class distribution through random oversampling, (2) detect the most relevant features via an evolutionary feature weighting process and a threshold to choose them, (3) build an appropriate Random Forest model from the pre-processed data and finally (4) classify the test data. Throughout the paper, we detail and analyze the decisions made during the competition, presenting an extensive experimental study that characterizes how our methodology works. From this analysis we conclude that this approach is well suited to tackling large-scale bioinformatics classification problems.
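As a rough illustration of the first two stages named in the abstract (random oversampling and threshold-based feature selection), the following is a minimal single-machine sketch in plain Python. It is not the authors' MapReduce implementation. The function names, the `ratio` and `threshold` parameters, and the example weights are all assumptions made for illustration, and the evolutionary process that would produce the feature weights is not reproduced here.

```python
import random
from collections import Counter


def random_oversample(rows, labels, ratio=1.0, seed=0):
    """Duplicate randomly chosen minority-class rows until the minority
    count reaches ratio * majority count (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("this sketch assumes exactly two classes")
    (majority, n_major), (minority, n_minor) = counts.most_common(2)
    target = int(ratio * n_major)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    # Sample with replacement from the minority class to fill the gap.
    extra = [rng.choice(minority_rows) for _ in range(max(0, target - n_minor))]
    return rows + extra, labels + [minority] * len(extra)


def select_features(rows, weights, threshold):
    """Keep only the columns whose weight is at least `threshold`.
    The weights are assumed to come from some feature weighting step."""
    keep = [i for i, w in enumerate(weights) if w >= threshold]
    return [[r[i] for i in keep] for r in rows], keep


# Toy data: three negative examples and one positive example.
rows = [[0.1, 5.0, 1.0], [0.2, 4.0, 0.0], [0.3, 3.0, 1.0], [0.9, 1.0, 0.0]]
labels = [0, 0, 0, 1]

bal_rows, bal_labels = random_oversample(rows, labels)
weights = [0.8, 0.1, 0.6]  # hypothetical weights, not taken from the paper
reduced, kept = select_features(bal_rows, weights, threshold=0.5)
```

The balanced, reduced data would then be passed to a Random Forest learner for stages (3) and (4). In the paper each stage is distributed with MapReduce, which this sketch does not attempt.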

Citation

Triguero, I., del Río, S., López, V., Bacardit, J., Benítez, J. M., & Herrera, F. (2015). ROSEFW-RF: the winner algorithm for the ECBDL’14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowledge-Based Systems, 87, https://doi.org/10.1016/j.knosys.2015.05.027

Journal Article Type: Article
Acceptance Date: May 28, 2015
Online Publication Date: Jun 1, 2015
Publication Date: Oct 1, 2015
Deposit Date: Sep 4, 2017
Publicly Available Date: Sep 4, 2017
Journal: Knowledge-Based Systems
Print ISSN: 0950-7051
Electronic ISSN: 1872-7409
Publisher: Elsevier
Peer Reviewed: Peer Reviewed
Volume: 87
DOI: https://doi.org/10.1016/j.knosys.2015.05.027
Keywords: Bioinformatics; Big data; Hadoop; MapReduce; Imbalance classification; Evolutionary feature selection
Public URL: https://nottingham-repository.worktribe.com/output/981985
Publisher URL: http://www.sciencedirect.com/science/article/pii/S0950705115002130
