Analysis workflow for the identification allelic variants associated with rare disorders using whole genome sequencing approach

V Maselli; D Cittaro; E Stupka

doi:10.14806/ej.18.A.436

Authors

V Maselli, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute, Milan
D Cittaro, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute, Milan
E Stupka, Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute, Milan

Keywords:

BITS, next generation sequencing

Abstract

Motivation

Recent advances in sequencing technologies have opened unprecedented possibilities for clinical data analysis. In particular, whole genome sequencing at reasonable coverage has become a cost-effective way to characterize the genetic basis of rare diseases. Here we present an example of whole genome sequencing analysis in a case of a rare recessive disease. Our purpose is to identify allelic variants associated with the syndrome using a whole genome sequencing approach.

Methods

As an example, we performed 100 bp paired-end sequencing of a single human genomic sample from a case of a rare recessive disorder using the Illumina HiSeq 2000 sequencer. Reads were aligned to the hg19 reference genome using BWA [1]. The Genome Analysis Toolkit [2] was used to run the downstream analysis: local realignment around indels, quality score recalibration, SNP and indel calling, and Variant Quality Score Recalibration. We filtered SNVs with a confidence lower than 0.1%. We used the Seattle SNP Annotation tools [3] to annotate the SNPs against the reference genome. We computed preliminary statistics to identify a threshold that yields a reliable subset of SNPs for our purpose. Using the subset of known SNPs as a guide, we identified the SNP quality value above which we are confident in our data. With a quality threshold of 50, we selected a subset of SNPs that we analysed with the HomozygosityMapper web tool [4].
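The thresholding step can be illustrated with a minimal Python sketch. It is not the authors' code: it assumes the calls are available as plain-text VCF lines with the read depth stored in the INFO field as `DP`, and the function names `passes_filters` and `filter_vcf_lines` are hypothetical. The quality threshold of 50 comes from the text above; the depth bounds of 5 and 250 are the ones quoted in the Results.

```python
def passes_filters(vcf_line, min_qual=50.0, min_depth=5, max_depth=250):
    """Return True if a VCF data line meets the quality and depth thresholds."""
    fields = vcf_line.rstrip("\n").split("\t")
    qual = fields[5]  # column 6 of a VCF record is QUAL
    if qual == "." or float(qual) < min_qual:
        return False
    # Column 8 is INFO: semicolon-separated KEY=VALUE pairs (flags have no "=").
    info = dict(item.split("=", 1) for item in fields[7].split(";") if "=" in item)
    if "DP" not in info:
        return False  # assumption: records without a depth value are dropped
    return min_depth <= int(info["DP"]) <= max_depth


def filter_vcf_lines(lines, **thresholds):
    """Yield header lines unchanged and only those records that pass the filters."""
    for line in lines:
        if line.startswith("#") or passes_filters(line, **thresholds):
            yield line


# Toy records for illustration only; they are not data from the study.
example = [
    "##fileformat=VCFv4.1\n",
    "chr1\t100\t.\tA\tG\t313\tPASS\tDP=20\n",   # kept
    "chr1\t200\t.\tC\tT\t30\tPASS\tDP=20\n",    # dropped: QUAL below 50
    "chr1\t300\t.\tG\tA\t90\tPASS\tDP=400\n",   # dropped: depth above 250
]
kept = list(filter_vcf_lines(example))
```

In practice a dedicated VCF parser would be preferable to string splitting; the sketch only makes the two thresholds explicit.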

Results

The sequencing data amounted to 400 million read pairs, giving a 20x average coverage of the genome (sufficient for reliable homozygous SNP calls). We filtered out 2 million SNPs with a read depth of at least 5 and at most 250. The average quality across 400k SNPs is 313 (max 9371, min 30). We found about 100k novel SNPs (10% of which were homozygous). Given the assumption of recessive inheritance in the example under investigation, we focused on homozygous stretches and identified 6 stretches of homozygosity, two of which were also confirmed by previous microsatellite linkage data. This workflow can be easily automated and used for different types of re-sequencing projects in the context of rare recessive disorders.
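As a rough plausibility check on the coverage figure, the total number of sequenced bases can be divided by the genome size. The sketch below is only an upper-bound estimate: the genome size of 3.1 Gb for hg19 is an approximate value assumed here, not a number given in the abstract, and the helper name `raw_coverage` is hypothetical.

```python
def raw_coverage(read_pairs, read_length_bp, genome_size_bp):
    """Upper-bound average coverage: total sequenced bases / genome size."""
    total_bases = read_pairs * 2 * read_length_bp  # two reads per pair
    return total_bases / genome_size_bp


# Assumed approximate size of the hg19 assembly (not stated in the abstract).
HG19_SIZE_BP = 3.1e9

estimate = raw_coverage(
    read_pairs=400_000_000, read_length_bp=100, genome_size_bp=HG19_SIZE_BP
)
print(f"raw coverage upper bound: {estimate:.1f}x")  # prints 25.8x
```

The raw estimate is somewhat higher than the reported 20x, which is the direction one would expect if some reads were unaligned, duplicated, or filtered; the abstract does not give those numbers.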

References

[1] Li and Durbin (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26 (5):589-95
[2] McKenna et al. (2010) The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Research 20 (9):1297-1303
[3] http://snp.gs.washington.edu/SeattleSeqAnnotation134/
[4] Seelow et al. (2009) HomozygosityMapper—an interactive approach to homozygosity mapping. Nucleic Acids Res 37 (suppl 2):W593-W599
