Bowtie Short Sequence Mapping

About Bowtie

Bowtie was developed by Langmead, Trapnell, and colleagues. It is an ultrafast, memory-efficient short-read aligner designed to align large sets of short DNA sequences (reads) to large genomes quickly. It aligns 35-base-pair reads to the human genome at a rate of 25 million reads per hour on a typical workstation. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: for the human genome, the index is typically about 2.2 GB for unpaired alignment or 2.9 GB for paired-end or colorspace alignment. Multiple processors can be used simultaneously to increase alignment speed. Bowtie can also output alignments in the standard SAM format, which lets it interoperate with other tools that support SAM, including the SAMtools consensus, SNP, and indel callers.

Bowtie is not a general-purpose alignment tool like MUMmer, BLAST or Vmatch. Bowtie works best when aligning short reads to large genomes, though it supports arbitrarily small reference sequences (e.g. amplicons) and reads as long as 1024 bases. Bowtie is designed to be extremely fast for sets of short reads where (a) many of the reads have at least one good, valid alignment, (b) many of the reads are relatively high-quality, and (c) the number of alignments reported per read is small (close to 1).

Index and Reference Genome

Bowtie builds index files from a set of DNA sequences. These files are required to align reads to the corresponding reference. The index-building algorithm is based on the blockwise algorithm of Kärkkäinen, and the index itself is based on the FM Index of Ferragina and Manzini, which in turn is based on the Burrows-Wheeler transform.
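To make the idea of a Burrows-Wheeler based index more concrete, here is a minimal toy sketch in Python. It is not Bowtie's implementation: it builds the transform by naively sorting all suffixes (Bowtie uses a blockwise construction) and counts exact matches with FM-index style backward search. The function names bwt_with_suffix_array and fm_count are hypothetical and chosen only for this illustration.

```python
def bwt_with_suffix_array(text):
    """Return (BWT string, suffix array) of text + '$' using naive sorting.

    Illustration only: sorting whole suffixes is far too slow and
    memory-hungry for a real genome.
    """
    s = text + "$"  # '$' is a sentinel that sorts before A, C, G, T
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character for each sorted suffix is the one preceding it
    # (s[-1] wraps around to the sentinel for the suffix starting at 0).
    bwt = "".join(s[i - 1] for i in sa)
    return bwt, sa


def fm_count(bwt, pattern):
    """Count exact occurrences of pattern using backward search."""
    # c_table[ch]: number of characters in the text strictly smaller than ch
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt[:i]; computed naively here, whereas
        # practical FM indexes store sampled checkpoints instead.
        return bwt[:i].count(ch)

    lo, hi = 0, len(bwt)  # current range of matching rows, [lo, hi)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    bwt, sa = bwt_with_suffix_array("GATTACAGATTACA")
    print(bwt)
    print(fm_count(bwt, "ATTA"))
```

The point of the sketch is that a pattern can be matched against the transformed text alone, one character at a time from right to left, which is why an index of this kind can stay small relative to the genome it represents.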

Reference                         Description
H. sapiens, UCSC hg18             N/A
H. sapiens, UCSC hg19             N/A
H. sapiens, NCBI v36              N/A
H. sapiens, NCBI v37              N/A
M. musculus, UCSC mm9             N/A
M. musculus, NCBI v37             N/A
R. norvegicus, UCSC rn4           N/A
B. taurus, UMD v3.0               N/A
C. familiaris, UCSC canFam2       N/A
D. melanogaster, Flybase, r5.22   N/A
A. thaliana, TAIR, TAIR9          N/A
C. elegans, Wormbase, WS200       N/A
S. cerevisiae, CYGD               N/A
E. coli, NCBI, st. 536            N/A

Input Sequence Format

FASTQ files: usually have the extension .fq or .fastq.
FASTA files: usually have the extension .fa, .mfa, .fna or similar.
Raw files: one sequence per line, without quality values or names.

FASTA format

>r0
GAACGATACCCACCCAACTATCGCCATTCCAGCAT
>r1
CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
>r2
TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
>r3
AAAATTTGTGCCTGGATGGCCTGAGTACCNANTAC
>r4
GCAGAGCAGTTGCTAGAAANNNNNTTGAAGAGGTT
>r5
CAGCATAAGTGGATATTCAAAGTTTTGCTGTTTTA

FASTQ format

@r0
GAACGATACCCACCCAACTATCGCCATTCCAGCAT
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r1
CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r2
TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554

RAW format

GAACGATACCCACCCAACTATCGCCATTCCAGCAT
CCGAACTGGATGTCTCATGGGATAAAAATCATCCG
TCAAAATTGTTATAGTATAACACTGTTGCTTTATG
AAAATTTGTGCCTGGATGGCCTGAGTACCNANTAC
GCAGAGCAGTTGCTAGAAANNNNNTTGAAGAGGTT
CAGCATAAGTGGATATTCAAAGTTTTGCTGTTTTA
GGCAGTGATGCAACTGCCCGTTATCAACAGNCNCT
GCATATTGCCAATTTTCGCTTCGGGGATCAGGCTA
GGTTCAGTTCAGTATACGCCTTATCCGGCCTACGG
GGCGATGATTTCATTACCCTCAACGCCGAACAGGC
AATCCCACGGCGGCAGCATGGTCCTAGANAGGNCG
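The three layouts above can be told apart by their first non-empty line: ">" for FASTA, "@" for FASTQ, and a bare sequence for raw input. The following Python sketch reads any of them into (name, sequence, quality) tuples. It is a simplified illustration, not the parser used by Bowtie or by this service: it assumes four-line FASTQ records, and the function name read_sequences is hypothetical.

```python
def read_sequences(lines):
    """Parse FASTA, FASTQ, or raw lines into (name, sequence, quality) tuples.

    The format is guessed from the first non-empty line. Fields that a
    format does not carry (names, qualities) are returned as None.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    if not lines:
        return []

    records = []
    first = lines[0]
    if first.startswith(">"):
        # FASTA: a '>' header followed by one or more sequence lines
        name, chunks = None, []
        for ln in lines:
            if ln.startswith(">"):
                if name is not None:
                    records.append((name, "".join(chunks), None))
                name, chunks = ln[1:], []
            else:
                chunks.append(ln)
        records.append((name, "".join(chunks), None))
    elif first.startswith("@"):
        # FASTQ (assumed four lines per record): @name, sequence, +, qualities
        if len(lines) % 4 != 0:
            raise ValueError("FASTQ input must have 4 lines per record")
        for i in range(0, len(lines), 4):
            header, seq, plus, qual = lines[i:i + 4]
            if not header.startswith("@") or not plus.startswith("+"):
                raise ValueError("malformed FASTQ record at line %d" % (i + 1))
            if len(seq) != len(qual):
                raise ValueError("sequence and quality lengths differ")
            records.append((header[1:], seq, qual))
    else:
        # Raw: one sequence per line, no names or qualities
        records = [(None, ln, None) for ln in lines]
    return records


if __name__ == "__main__":
    fastq = ["@r0", "GAACGATACC", "+", "EDCCCBAAAA"]
    for name, seq, qual in read_sequences(fastq):
        print(name, seq, qual)
```

Checking that each quality string is as long as its sequence catches a common cause of rejected FASTQ uploads before the file is submitted.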

Paired-End Input Sequences

FASTA (2 files)

File 1:
>r0/1
TATTCTTCCGCATCCTTCATACTCCTGCCGGTCAG
>r1/1
TGATAGATCTCTTTTTTCGCGCCGACATCTACGCC
>r2/1
CACGCCCTTTGTAAGTGGACATCACGCCCTGAGCG

File 2:
>r0/2
GAATACTGGCGGATTACCGGGGAAGCTGGAGC
>r1/2
AATGTGAAAACGCCATCGATGGAACAGGCAAT
>r2/2
AACGCGCGTTATCGTGCCGGTCCATTACGCGG

FASTQ (2 files)

File 1:
@r0/1
TATTCTTCCGCATCCTTCATACTCCTGCCGGTCAG
+
EDCCCBAAAA@@@@?>===<;;9:99987776554
@r1/1
TGATAGATCTCTTTTTTCGCGCCGACATCTACGCC
+
EDCCCBAAAA@@@@?>===<;;9:99987776554

File 2:
@r0/2
GAATACTGGCGGATTACCGGGGAAGCTGGAGC
+
EDCCCBAAAA@@@@?>===<;;9:99987776
@r1/2
AATGTGAAAACGCCATCGATGGAACAGGCAAT
+
EDCCCBAAAA@@@@?>===<;;9:99987776
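In the paired-end examples above, the two files list mates in the same order, and the read names differ only in their /1 and /2 suffixes. A quick consistency check before uploading might look like the Python sketch below. It assumes that naming convention holds for your data, which is an assumption drawn from the examples here rather than a documented requirement; the function names mate_base and check_pairs are hypothetical.

```python
def mate_base(name, suffix):
    """Strip the mate suffix ('/1' or '/2') from a read name."""
    if not name.endswith(suffix):
        raise ValueError("read name %r does not end with %r" % (name, suffix))
    return name[:-len(suffix)]


def check_pairs(names_1, names_2):
    """Verify that two lists of read names describe the same pairs in order.

    Returns the number of pairs, or raises ValueError on the first problem.
    """
    if len(names_1) != len(names_2):
        raise ValueError(
            "files contain different numbers of reads: %d vs %d"
            % (len(names_1), len(names_2))
        )
    for n1, n2 in zip(names_1, names_2):
        if mate_base(n1, "/1") != mate_base(n2, "/2"):
            raise ValueError("mates out of order: %r vs %r" % (n1, n2))
    return len(names_1)


if __name__ == "__main__":
    file_1 = ["r0/1", "r1/1", "r2/1"]
    file_2 = ["r0/2", "r1/2", "r2/2"]
    print(check_pairs(file_1, file_2))
```

Note that the mates in a pair need not have the same length: in the examples above, the /1 reads are 35 bases long and the /2 reads are 32 bases long.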

Limitations

Because of network speed constraints, input sequence files must be smaller than 2 GB. The online service is not suitable for larger data sets.
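A file can be checked against this limit locally before attempting an upload. The Python sketch below assumes the limit means 2 × 1024³ bytes; the exact byte threshold enforced by the service is not stated here, so treat the constant as an assumption. The names MAX_UPLOAD_BYTES and fits_upload_limit are hypothetical.

```python
import os

# Assumption: the 2 GB limit is interpreted as 2 * 1024**3 bytes.
MAX_UPLOAD_BYTES = 2 * 1024**3


def fits_upload_limit(path, limit=MAX_UPLOAD_BYTES):
    """Return True if the file at path is strictly smaller than limit bytes."""
    return os.path.getsize(path) < limit


if __name__ == "__main__":
    import sys

    for path in sys.argv[1:]:
        status = "ok" if fits_upload_limit(path) else "too large"
        print(path, status)
```

Files that exceed the limit are better handled with the standalone software.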

Reference

Reference paper: Langmead B, Trapnell C, Pop M, Salzberg SL (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10:R25. The standalone software and manual are available at: http://bowtie-bio.sourceforge.net/manual.shtml