STAR Pipeline

How to Create a Library File

The purpose of a library file is to categorize your sequence data into groups. This is important for monoclonal filtering, since data from the same sequencing library may span several sequence files. If monoclonal filtering is enabled, reads within the same sequencing library that map to the same starting stranded chromosomal coordinate are filtered so that at most one read maps to each stranded chromosomal coordinate.

A sample library file looks like this:

SequencingData1.fastq.gz 1
SequencingData2.fastq.gz 2

or

SequencingData1.mapped.txt.gz 1
SequencingData2.mapped.txt.gz 2

Each line in the library file contains two fields separated by a tab. The first field contains the sequence file name and the second contains the library id. Sequence files that share the same library id are considered part of the same sequencing library; sequences from these files will be combined before being filtered for clonal reads.
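As an illustration of the format described above, here is a minimal Python sketch that reads library file text and groups sequence files by library id. The function name parse_library_file and the example file names are hypothetical; this is not the pipeline's own code.

```python
from collections import defaultdict


def parse_library_file(text):
    """Group sequence file names by library id.

    Assumes one entry per line, with the file name and the library id
    separated by a single tab, as described in the text.
    """
    libraries = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        filename, library_id = line.split("\t")
        libraries[library_id].append(filename)
    return dict(libraries)


# Hypothetical example: two files share library id 1.
example = (
    "SequencingData1.fastq.gz\t1\n"
    "SequencingData2.fastq.gz\t1\n"
    "SequencingData3.fastq.gz\t2\n"
)
groups = parse_library_file(example)
```

Here the files listed under library id 1 would be combined before clonal reads are filtered.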

Mapping Reads

Reads to be mapped must be uploaded in compressed FASTQ format (i.e. files must have a .fastq.gz suffix). Quality scores must be in Phred+64 format. For paired end reads, reads from both ends (i.e. mates 1 and 2) should be in the same FASTQ file.
Reads that are within the same sequencing library must be uniquely labeled for the pipeline's mapping algorithm to work properly. This can become an issue when reads from multiple sequencing runs share the same lane number and are contained within the same sequencing library.
Read IDs may only have suffixes (i.e. mate IDs) of either 1 or 2 (e.g. 1-Parkinsons1:3:1:1578:1416#0/1 and 1-Parkinsons1:3:1:1578:1416#0/2). Mate 2 from an Illumina barcoded run has a mate ID of 3. These reads must be modified so that their mate ID is 2.
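One way to make this change is sketched below in Python. It assumes the mate ID is whatever follows the final "/" in the read ID, as in the examples above; the function name normalize_mate_id is hypothetical.

```python
def normalize_mate_id(read_id):
    """Rewrite a mate ID of 3 to 2; leave every other read ID unchanged.

    Assumes the mate ID is the text after the last "/" in the read ID.
    """
    base, sep, mate = read_id.rpartition("/")
    if sep and mate == "3":
        return base + "/2"
    return read_id


fixed = normalize_mate_id("1-Parkinsons1:3:1:1578:1416#0/3")
```

In a real FASTQ file this would be applied to the ID line of each record (after the leading "@"), leaving sequence and quality lines untouched.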

Uploading Reads

If this option is selected, you may upload already mapped reads in files that have a .mapped.txt.gz suffix. The format of this text file is the following:

ReadID <tab> sequence <tab> quality <tab> strand <tab> chromosome <tab> start (1-based) <tab> stop <tab> mismatches (comma delimited) <new line>

Sequences and quality scores that map to the Watson strand are printed left to right as 5' to 3'. Sequences and quality scores that map to the Crick strand are printed left to right as 3' to 5'. Mismatches appropriately indicate the mismatched bases in the sequence field (0-based counting from left to right). The start position refers to the starting bp of the 5' end for Watson mapping reads and the starting bp of the 3' end for Crick mapping reads. All bp coordinates are relative to the Watson strand. Quality scores must be in Phred+33 format.
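For illustration, a line in this format could be parsed along the following lines. This is a sketch only: the strand symbol ("+") and the individual mismatch entries in the example are assumptions, since the text does not specify how they are written, so both are kept as plain strings. The names MappedRead and parse_mapped_line are hypothetical.

```python
from typing import List, NamedTuple


class MappedRead(NamedTuple):
    read_id: str
    sequence: str
    quality: str
    strand: str            # kept as given; the encoding is not specified here
    chromosome: str
    start: int             # 1-based, relative to the Watson strand
    stop: int
    mismatches: List[str]  # raw comma-delimited entries, left uninterpreted


def parse_mapped_line(line):
    """Split one tab-delimited line of a .mapped.txt file into its 8 fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 tab-separated fields, got %d" % len(fields))
    read_id, seq, qual, strand, chrom, start, stop, mm = fields
    mismatches = [m for m in mm.split(",") if m]
    return MappedRead(read_id, seq, qual, strand, chrom,
                      int(start), int(stop), mismatches)


# Hypothetical example line; all values are made up for illustration.
example_line = "\t".join(
    ["1-Parkinsons1:3:1:1578:1416#0/1", "TTGATG", "IIIIII",
     "+", "chr1", "100", "105", "2,4"]
) + "\n"
read = parse_mapped_line(example_line)
```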

These sequences are still examined for unique mappability and can be filtered for monoclonal reads. Mate pairing within the defined minimum and maximum distances is enforced. You must make sure all read IDs are unique regardless of whether they are in the same sequencing library or not.

Mapping Parameters

Mapping Algorithm

endtoend - This mapping style maps sequences allowing up to a certain number of mismatches.

maq - This mapping style maps a seed of specified length with up to 2 mismatches. The match is then extended towards the 3' end, allowing multiple mismatches. If a read maps to a certain location and the sum of the quality scores of its mismatched bases does not exceed the user-specified e-value, the read is considered to map to this location.

Maq mapping method-related parameters:

seedlength - length of the seed that establishes a candidate mapping site for a read. This seed is allowed to have up to 2 mismatches. The seed starts at the 5' end of the read.

e-value - the sum of the quality scores of mismatched bases may not exceed this number (e-value is based on rounded Phred quality scores and each quality score saturates at 30).
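The e-value check can be sketched in Python as follows. The sketch takes already decoded integer Phred scores and applies only the cap of 30; the rounding step is not specified in detail above, so it is left out. The function names are hypothetical and this is not the pipeline's own implementation.

```python
def mismatch_quality_sum(phred_scores, mismatch_positions, cap=30):
    """Sum the quality scores at mismatched positions, capping each at `cap`."""
    return sum(min(phred_scores[pos], cap) for pos in mismatch_positions)


def passes_e_value(phred_scores, mismatch_positions, e_value):
    """A candidate mapping passes if the capped sum does not exceed e_value."""
    return mismatch_quality_sum(phred_scores, mismatch_positions) <= e_value


# Hypothetical read with mismatches at positions 0 and 3 (0-based).
scores = [40, 20, 35, 10]
total = mismatch_quality_sum(scores, [0, 3])  # 40 is capped to 30, plus 10
```

A mismatch at a high-quality base therefore costs more than one at a low-quality base, but never more than 30.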

Total Mismatches Allowed

When using endtoend mapping, this value specifies the maximum number of mismatches allowed in the entire read.

When using the maq mapping mode, the value specifies how many mismatches are allowed for a mapped read before it is clipped. For example, imagine a 100 bp mapped read has mismatches at positions 0, 25, 40 and 60 (0-based). If you set the total mismatches allowed parameter to 3, the read will be cut after position 59 so that the read contains only 3 mismatches (i.e. the read is now 60 bp long). To disable read clipping, set this value to your maximum read length.
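The clipping rule in this example can be written as a short Python sketch. It assumes mismatch positions are 0-based, sorted, and counted from the 5' end, and that clipping removes bases from the 3' end, which is what the example implies. The function name clipped_length is hypothetical.

```python
def clipped_length(read_length, mismatch_positions, max_mismatches):
    """Return the read length after clipping.

    The read is cut just before the first mismatch that would exceed
    max_mismatches; if there are not that many mismatches, it is unchanged.
    """
    positions = sorted(mismatch_positions)
    if len(positions) <= max_mismatches:
        return read_length
    return positions[max_mismatches]


# The example from the text: a 100 bp read, 3 mismatches allowed.
length = clipped_length(100, [0, 25, 40, 60], 3)
```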

Minimum Distance Between Pairs

If the data contains paired end reads, this parameter specifies the minimum distance between two candidate paired end reads. If the distance between the reads is less than this number, the reads are not considered paired ends.

Maximum Distance Between Pairs

If the data contains paired end reads, this parameter specifies the maximum distance between two candidate paired end reads. If the distance between the reads is more than this number, the reads are not considered paired ends.

Monoclonal Filtering

If reads within the same sequencing library map to the same stranded chromosomal coordinate, only one read (randomly selected) is allowed.
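A minimal Python sketch of this filter is shown below. Reads are represented as hypothetical tuples of (read ID, library id, strand, chromosome, start); the pipeline's own data structures and random selection method are not described here, so both are assumptions.

```python
import random
from collections import defaultdict


def filter_monoclonal(reads, seed=None):
    """Keep one randomly chosen read per (library, strand, chromosome, start)."""
    rng = random.Random(seed)  # a seed makes the selection reproducible
    groups = defaultdict(list)
    for read in reads:
        read_id, library_id, strand, chrom, start = read
        groups[(library_id, strand, chrom, start)].append(read)
    return [rng.choice(group) for group in groups.values()]


# Hypothetical reads: r1 and r2 are clonal; r3 is in another library.
reads = [
    ("r1", "1", "+", "chr1", 100),
    ("r2", "1", "+", "chr1", 100),
    ("r3", "2", "+", "chr1", 100),
]
kept = filter_monoclonal(reads, seed=0)
```

Note that r3 survives even though it shares a coordinate with r1 and r2, because filtering is applied within a sequencing library only.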

E-mail Address

Methylation pipeline updates and output file links are sent to this e-mail address.

 

How Does the Mapping Algorithm Work?

Background

Bisulfite converted sequences (BSC) are difficult to map because sequences that are reverse complementary before bisulfite conversion are no longer reverse complementary after bisulfite conversion*.

 

The above figure shows a reference sequence before and after bisulfite conversion. All unmethylated cytosines are converted to uracils, which appear as thymines after PCR. Base pairing at unmethylated cytosines no longer occurs after bisulfite conversion. Base pairing is maintained at methylated cytosine positions.

*In the case where all cytosines are methylated on both strands, the bisulfite converted reads are reverse complementary.

Mapping BSC Reads

Since the base identity at a cytosine position can be either T or C after PCR amplification, reads must be demethylated in silico so that the methylation states of the cytosines contained within the read do not affect mappability. In silico demethylation involves converting all cytosines on a read to thymines. The original read sequence is stored separately for use after read mapping.
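In Python, in silico demethylation as described here amounts to a single substitution. The sketch below assumes uppercase sequences and returns the original read alongside the converted one, so that the original is available after mapping; the function name is hypothetical.

```python
def demethylate_in_silico(sequence):
    """Convert every C to T; return (converted, original) so the original is kept."""
    return sequence.replace("C", "T"), sequence


converted, original = demethylate_in_silico("TACGCGTT")
```

After conversion, a read maps to the same place whether its cytosines were methylated or not; the stored original is what reveals the methylation state afterwards.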

How Reads Are Mapped to the Genome

During strand detection, the base composition of each read is calculated to determine whether the read represents a bisulfite converted sequence or a PCR product of a bisulfite converted sequence. The strand detection method calculates the ratios of adenines to guanines and thymines to cytosines. Reads with a higher T/C ratio than A/G ratio are considered BSC sequences, because unmethylated cytosines appear as thymines. The PCR product of R2, which is the reverse complement of R2, will thus contain more adenines than guanines, since the excess of thymines on the BSC sequence appears as an excess of adenines in the PCR product. Reads labeled as original BSC sequences are demethylated in silico since they will map directly to the unmethylated genome (either Watson or Crick). The reads that are labeled as PCR sequences undergo a G to A substitution so that their reverse complement will map to the demethylated reference genome.
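The strand detection step can be sketched in Python as follows. To avoid dividing by zero, the comparison T/C > A/G is written as T*G > A*C; how the pipeline treats ties and zero counts is not stated above, so labeling those reads as PCR products here is an assumption. The function names are hypothetical.

```python
def classify_read(sequence):
    """Label a read "BSC" if its T/C ratio exceeds its A/G ratio, else "PCR"."""
    a = sequence.count("A")
    c = sequence.count("C")
    g = sequence.count("G")
    t = sequence.count("T")
    # T/C > A/G rewritten as a cross product so zero counts do not divide.
    return "BSC" if t * g > a * c else "PCR"


def convert_for_mapping(sequence):
    """Apply C to T for BSC reads and G to A for PCR-product reads."""
    if classify_read(sequence) == "BSC":
        return sequence.replace("C", "T")
    return sequence.replace("G", "A")


# Hypothetical T-rich read and its reverse complement (A-rich).
bsc_read = "TTTGTTAGTCGT"
pcr_read = "ACGACTAACAAA"
labels = (classify_read(bsc_read), classify_read(pcr_read))
```

The second read is the reverse complement of the first, so the excess of thymines in one shows up as an excess of adenines in the other.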

The outline of the mapping algorithm is shown below:

 

a) The reference sequences in the strand detection method after in silico demethylation and the BSC sequences as obtained from Illumina sequencing. b) The reads are checked to determine whether they represent the BSC sequence or a PCR product thereof. If a read is marked as a BSC original sequence, it is in silico demethylated and mapped to the in silico demethylated Watson and Crick reference sequences. If a sequence is marked as a PCR product, it undergoes a G to A substitution before it is mapped to the in silico demethylated Watson and Crick strands. c) Read 1 is mapped to the Watson strand. d) Read 2 is mapped to the Crick strand after reverse complementation.