GRID distribution supports clustering validation of large mixed microarray data sets

Angelica Tulipano; Carmela Marangi; Leonardo Angelini; Giacinto Donvito; Guido Cuscela; Giorgio Pietro Maggi; Andreas Gisel

doi:10.14806/ej.17.1.205

Authors

Angelica Tulipano (CNR, Istituto Tecnologie Biomediche, Bari)
Carmela Marangi (CNR, Istituto per le Applicazioni del Calcolo, Bari)
Leonardo Angelini (Università degli Studi e Politecnico di Bari, Dipartimento Interateneo di Fisica)
Giacinto Donvito (INFN, Bari)
Guido Cuscela (INFN, Bari)
Giorgio Pietro Maggi (INFN, Bari; Università degli Studi e Politecnico di Bari, Dipartimento Interateneo di Fisica)
Andreas Gisel (CNR, Istituto Tecnologie Biomediche, Bari)

DOI:

https://doi.org/10.14806/ej.17.1.205

Keywords:

grid computing, microarray clustering, statistical validation

Abstract

Microarray data are a rich source of information, containing the collected expression values of thousands of genes for well-defined states of a cell or tissue. Vast amounts of data (thousands of arrays) are publicly available and ready for analysis, e.g. to scrutinise correlations between genes at the level of gene expression. The large variety of available arrays makes it possible to combine independent experiments to extract new knowledge. Starting with a large set of data, relevant information can be isolated for further analysis. Extracting the required information from data sets of such size and complexity requires an appropriate and powerful analysis method. In this study, we chose an unsupervised hierarchical clustering algorithm, Chaotic Map Clustering (CMC), in a coupled two-way approach to analyse such data. However, clustering is intrinsically difficult, both because the structure of the data is unknown and because the clustering results are hard to interpret. It is therefore critical to evaluate the quality of any unsupervised procedure for such a complex set of data and to validate the clustering results, separating out those clusters that are due simply to noise or statistical fluctuations. We used a resampling method to perform this validation. The resampling procedure applies the clustering algorithm to a large number of random sub-samples of the original data matrix; consequently, the whole process becomes computationally intensive and time-consuming. Using Grid technology, we show that we can drastically speed up this process by distributing the clustering of each sub-sampled matrix to a separate worker node, and thus retrieve resampling results within a few hours instead of several days. Further, we offer an online service to cluster large microarray data sets and conduct the subsequent validation described in this paper.
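
The resampling validation outlined in the abstract can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: a simple distance-threshold single-linkage grouping stands in for CMC, a local thread pool stands in for Grid worker nodes, and the stability score (the fraction of co-clustered row pairs that stay together across sub-samples) is only one plausible validation measure. All function names and parameters (`cluster`, `cluster_subsample`, `stability`, `threshold`, `fraction`) are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def cluster(rows, threshold):
    """Stand-in clusterer (NOT CMC): single-linkage grouping of rows whose
    Euclidean distance is below `threshold`. Returns {row_id: cluster_label}."""
    ids = list(rows)
    parent = {i: i for i in ids}

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(ids, 2):
        dist = sum((x - y) ** 2 for x, y in zip(rows[a], rows[b])) ** 0.5
        if dist < threshold:
            parent[find(a)] = find(b)
    return {i: find(i) for i in ids}


def cluster_subsample(job):
    """One independent unit of work: cluster a random subset of the rows.
    On a Grid, each such job would be sent to a separate worker node."""
    rows, fraction, threshold, seed = job
    rng = random.Random(seed)
    keep = rng.sample(sorted(rows), max(2, int(fraction * len(rows))))
    return cluster({i: rows[i] for i in keep}, threshold)


def stability(rows, threshold, n_resamples=50, fraction=0.75, workers=4):
    """Fraction of row pairs co-clustered in the full data that remain
    co-clustered in the sub-samples where both rows are present."""
    full = cluster(rows, threshold)
    pairs = [(a, b) for a, b in combinations(sorted(rows), 2)
             if full[a] == full[b]]
    jobs = [(rows, fraction, threshold, seed) for seed in range(n_resamples)]
    # The jobs are independent, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(cluster_subsample, jobs))
    kept = seen = 0
    for labels in results:
        for a, b in pairs:
            if a in labels and b in labels:
                seen += 1
                kept += labels[a] == labels[b]
    return kept / seen if seen else 0.0
```

Because each sub-sample is clustered independently of the others, the jobs need no communication, which is what makes the procedure a natural fit for distribution over worker nodes. A score near 1 suggests clusters that persist under resampling, while a low score suggests clusters that may reflect noise or statistical fluctuation.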

Published

2011-05-12

Issue

Vol. 17 No. 1: Next Generation Sequencing Data Analysis

Section

Technical Notes

License

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).
