GAM: Genomic Assemblies Merger

A Policriti, S Scalabrin, F Vezzi, R Vicedomini


Motivations. In the last 3 years more than 20 assemblers have been proposed to tackle the hard task of genome assembly. Recent evaluation efforts (Assemblathon 1 and GAGE) demonstrated that none of these tools clearly outperforms the others. However, the results show that some assemblers perform better than others on specific regions and statistics while performing poorly on other regions and evaluation measures. With this picture in mind we developed GAM (Genomic Assemblies Merger), whose primary goal is to merge two or more assemblies in order to obtain a more contiguous one. Moreover, as a by-product of the merging step, GAM is able to correct mis-assemblies. GAM does not need a global alignment between contigs, which makes it unique among Assembly Reconciliation tools. In this way a computationally expensive alignment is avoided, and paralogous sequences (likely to create false connections among contigs) do not represent a problem. The GAM procedure is based only on the information coming from the reads used in the assembly phase, and it can be used even on assemblies obtained from different datasets.

Methods. Let us concentrate on the merging of two assemblies, dubbed M and S. As a preprocessing step, which is an almost mandatory analysis in any case, the reads (or a subset of them) used in the assembly phase are aligned against M and S using a SAM-compatible aligner (e.g., BWA, rNA). GAM takes as input M, S and the two SAM files produced in the preprocessing step. The main idea is to identify fragments of M and S that are highly similar. For this purpose, GAM identifies regions of M and S, named blocks, that share a sufficiently large number of reads (i.e., regions to which the same reads are aligned). After all blocks are identified, the Assembly Graph (AG) is built: each node corresponds to a block, and a directed edge connects block A to block B if A precedes B in either M or S (see Fig. 1). Once AG is available, the merging phase can start. As a first step GAM identifies genomic regions in which the assemblies contradict each other (loops, bifurcations, etc.). These areas represent potential inconsistencies between the two sequences. We chose to be as conservative as possible by electing one assembly (for example, M) to be the Master: all its contigs are assumed to be correct and cannot be contradicted. S becomes the Slave, and wherever an inconsistency is found, M is preferred to S. After the identification and resolution of problematic regions, GAM visits the simplified graph, merges contigs according to the blocks and edges in AG (each merge is performed using a variant of the Smith-Waterman algorithm) and finally outputs the new, improved assembly. GAM is not limited to contigs: it can also work with scaffolds, filling the N's inserted by one assembler but not by the other.
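The block and Assembly Graph construction described above can be illustrated with a minimal Python sketch. It is not GAM's implementation: the function names, the `min_shared` threshold, the toy read placements, and the simplification of one block per (M contig, S contig) pair are all assumptions made for illustration, and strand orientation is ignored. Read placements are given as plain dictionaries rather than parsed from SAM files.

```python
from collections import defaultdict

def find_blocks(aln_m, aln_s, min_shared=2):
    """Group reads aligned to both assemblies by (M contig, S contig).

    aln_m, aln_s: dict mapping read_id -> (contig_name, position).
    A pair of contigs sharing at least min_shared reads yields one block
    (hypothetical simplification: real blocks are sub-contig regions).
    """
    groups = defaultdict(list)
    for read_id, (contig_m, pos_m) in aln_m.items():
        if read_id in aln_s:
            contig_s, pos_s = aln_s[read_id]
            groups[(contig_m, contig_s)].append((pos_m, pos_s))
    blocks = []
    for (contig_m, contig_s), hits in sorted(groups.items()):
        if len(hits) >= min_shared:
            blocks.append({
                "m": contig_m, "m_start": min(p for p, _ in hits),
                "s": contig_s, "s_start": min(q for _, q in hits),
                "reads": len(hits),
            })
    return blocks

def build_assembly_graph(blocks):
    """Return directed edges (i, j) between block indices.

    Block i is linked to block j when i immediately precedes j on the
    same contig of M or on the same contig of S.
    """
    edges = set()
    for key, start in (("m", "m_start"), ("s", "s_start")):
        by_contig = defaultdict(list)
        for idx, block in enumerate(blocks):
            by_contig[block[key]].append((block[start], idx))
        for placed in by_contig.values():
            placed.sort()
            for (_, a), (_, b) in zip(placed, placed[1:]):
                edges.add((a, b))
    return edges

# Toy data (invented): contig m1 of M spans contigs s1 and s2 of S.
aln_m = {"r1": ("m1", 10), "r2": ("m1", 20), "r3": ("m1", 500),
         "r4": ("m1", 520), "r5": ("m2", 5)}
aln_s = {"r1": ("s1", 15), "r2": ("s1", 30), "r3": ("s2", 8),
         "r4": ("s2", 25), "r5": ("s2", 900)}

blocks = find_blocks(aln_m, aln_s)
edges = build_assembly_graph(blocks)
```

In this toy case the pair (m2, s2) shares a single read and is discarded, while the two remaining blocks are joined by one edge because they are adjacent on m1. This is the kind of evidence that would suggest s1 and s2 can be linked, under the assumptions stated above.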

Results. GAM has been tested on several real datasets, in particular on Olea's chloroplast (241X Illumina paired reads and 21X 454 paired reads), Populus trichocarpa (82X Illumina paired reads) and Boa constrictor (40X Illumina paired reads). Illumina reads have an average length of 100 bp and an insert size of 500 bp. All tests were performed on a computer equipped with 8 cores and 32 GB of RAM. ABySS and CLC were selected as assemblers. Results are summarized in Fig. 1. Olea's chloroplast was used as a proof-of-concept experiment. The presence of a reference sequence allowed GAM's output to be validated (using dnadiff). Two assemblies were obtained with CLC, one from the Illumina data and one from the 454 data, and GAM was used to merge them. Figure 1 shows that the GAM assembly is not only more contiguous but also more correct: while Master (CLC-Illumina) and Slave (CLC-454) have 58 and 39 suspicious regions respectively, GAM has only 14. On Populus trichocarpa and Boa constrictor, the CLC assemblies were used as Master because of their better contiguity. In both cases the assemblies returned by GAM were more contiguous (see Fig. 1).
