Next generation sequencing

NGS platforms

  • Illumina® (Solexa) Genome Analyzer™ and HiSeq
  • Roche 454 Sequencer™
  • Applied Biosystems SOLiD™

File formats

  • FASTQ: derived from the FASTA format with the addition of quality scores. Each read from a sequencer comprises four lines: an identifier line beginning with @, a sequence line, a separator line beginning with + (optionally repeating the identifier) and finally a quality line. FASTQ files typically form the input to a mapping program (along with a FASTA reference genome). A typical human exome FASTQ file might be around 10-15GB, which can be compressed to 5-6GB using gzip.
  • SAM format: mapped/aligned sequence containing details about alignment, mapping quality etc. This usually contains a subset of the raw reads (as some will have been discarded at the mapping stage). The SAM (or BAM) file is typically used as the input for variant calling algorithms and other analyses.
  • BAM format: binary version of SAM. A typical human exome BAM file might be around 2GB in size.
  • BED format: annotation format that describes genome regions, with the optional addition of annotation data for display of genome browser tracks.
  • Other annotation formats: UCSC describe a number of other formats suitable for generating tracks in genome browsers.
  • VCF (variant call format): contains details about the number of reads at variant sites in the genome, plus a range of quality information. A typical human exome VCF file might contain about 20,000 lines.
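
As an illustration of the four-line FASTQ record layout described above, here is a minimal Python sketch of a reader. The names read_fastq, FastqRecord and phred_scores are made up for this example, and the quality decoding assumes the Phred+33 (Sanger) encoding; older Illumina pipelines used a different offset, so check your data before relying on it. For real work an established library is a safer choice.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry (no line wrapping assumed)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header)
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


# Hypothetical two-read example held in memory rather than in a file.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+read2\n!!II\n"
records = list(read_fastq(io.StringIO(example)))
for record in records:
    print(record.identifier, record.sequence, phred_scores(record.quality))
```

For a gzip-compressed file, the same function should work on a handle opened in text mode, e.g. gzip.open(path, "rt").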

Projects with available data

  • 1000 genomes project
    International consortium working towards sequence data for 1000 human genomes (2 trios at high coverage, 179 low-coverage whole genomes, 697 exomes). VCF files and raw data are downloadable.
    http://www.1000genomes.org/. Also see Nature 467:1061–1073.
  • National Institute of Environmental Health Sciences SNP project
    Complete exome sequencing data for 88 EGP samples, with VCF and BAM data available
    http://snp.gs.washington.edu/niehsExome/
  • Personal genome project
    Harvard University initiative for genome data sharing: aiming for 100,000 participants, but currently only a limited quantity of data is available
    http://www.personalgenomes.org/
  • Illumina’s demo data
    e.g. one Yoruban human genome (sample NA18507), plus analysed indel and SNP information
    http://www.illumina.com/HumanGenome/

Selective approaches to NGS

  • Sequence capture arrays – exome, gene list, specific GWAS-hit regions etc
  • PCR amplification – suitable for smaller scale
  • Pooling to maximise throughput (“barcoded” or anonymous)
  • FAIRE-Seq: identify regions of open chromatin, where regulatory proteins bind (formaldehyde-assisted isolation of regulatory elements)
  • MAINE-Seq: identify regions of closed chromatin by sequencing histone-bound DNA, extracted through MNase-mediated purification of mononucleosomes (MNase-assisted isolation of nucleosomes)
  • ChIP-Seq: identify where transcription factors (TFs) bind on nuclear DNA, using an antibody to the TF (chromatin immunoprecipitation sequencing)
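
To make the "barcoded" pooling idea above concrete, the following Python sketch assigns pooled reads back to their samples. It is only an illustration under stated assumptions: the barcode is taken to be an inline tag at the start of each read, all barcodes have the same length, and matching is exact with no error tolerance. The function name demultiplex, the barcode sequences and the sample names are all hypothetical. Many platforms instead report the barcode in a separate index read, in which case the vendor's own software normally does this step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (identifier, sequence)


def demultiplex(reads: Iterable[Read], barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Group reads by an exact-match barcode at the start of each sequence.

    barcodes maps barcode sequence -> sample name (assumed equal length,
    so no barcode is a prefix of another). The barcode is trimmed from
    assigned reads; unmatched reads are kept under "unassigned".
    """
    bins: Dict[str, List[Read]] = defaultdict(list)
    for identifier, sequence in reads:
        for barcode, sample in barcodes.items():
            if sequence.startswith(barcode):
                bins[sample].append((identifier, sequence[len(barcode):]))
                break
        else:
            # No barcode matched this read.
            bins["unassigned"].append((identifier, sequence))
    return dict(bins)


# Hypothetical barcodes and reads for illustration only.
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
reads = [("r1", "ACGTTTGA"), ("r2", "TGCAGGCC"), ("r3", "NNNNAAAA")]
result = demultiplex(reads, barcodes)
for sample, assigned in result.items():
    print(sample, assigned)
```

Anonymous pooling, by contrast, carries no per-sample tag, so reads cannot be separated this way and only pool-level results can be derived.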

Useful sites

  • SEQanswers: an online forum – extremely useful for NGS information
    http://seqanswers.com
  • Service providers: check with your University. This is a rapidly changing field and most universities are beginning to run systems in-house. Alternatively, commercial NGS services are available in many countries.
  • Illumina:
    http://www.illumina.com
  • 454:
    http://www.454.com
  • ABI SOLiD:
    http://tinyurl.com/ccdk8j