Aligning bisulfite-converted DNA reads to a genome

The easy way

Use Bisulfighter, which wraps LAST.

The hard way

Bisulfite is used to detect methylated cytosines. It converts unmethylated Cs to Ts, but it leaves methylated Cs intact. If we then sequence the DNA and align it to a reference genome, we can infer cytosine methylation.

To align the DNA accurately, we should take the C->T conversion into account. Here is how to do it with LAST.

Let's assume we have bisulfite-converted DNA reads in a file called "reads.fastq" (in fastq-sanger format), and the genome is in "mygenome.fa" (in fasta format). We will also assume that all the reads are from the converted strand, and not its reverse-complement (i.e. they have C->T conversions and not G->A conversions).

First, we need to run lastdb twice, for forward-strand and reverse-strand alignments:

lastdb -uBISF my_f mygenome.fa
lastdb -uBISR my_r mygenome.fa

Then, there are several steps:

  1. Convert all Cs in the reads to Ts. (Debatable: this slightly degrades LAST's ability to align the reads, but it avoids a bias, due to unconverted DNA being easier to align than converted DNA.)

  2. Align the reads, one strand at a time.

  3. Merge the alignments, and find a unique best alignment for each part of each read.

  4. Undo the conversion from step 1.

The last-bisulfite script (in the examples directory) carries out steps 1-4. Its usage is:

last-bisulfite.sh my_f my_r reads.fastq > results.maf

You can parallelize it like this:

parallel-fastq "last-bisulfite.sh my_f my_r" < reads.fastq > results.maf

Tips

  • To go faster, try gapless alignment (add -j1 to the lastal options). Often, this is only minusculely less accurate than gapped alignment.

  • The .tis files created by lastdb (e.g. my_f.tis, my_r.tis) are identical. So you can shave a few GB by linking them:

    ln -f my_f.tis my_r.tis
    

Paired-end DNA reads

You can align paired-end reads by combining the preceding recipe with the one in last-pair-probs.html. This gets a bit complicated, so we provide a last-bisulfite-paired script in the examples directory. Typical usage:

lastdb -uBISF my_f mygenome.fa
lastdb -uBISR my_r mygenome.fa

last-bisulfite-paired.sh my_f my_r reads1.fastq reads2.fastq > results.maf

This assumes that reads1.fastq are all from the converted strand (i.e. they have C->T conversions) and reads2.fastq are all from the reverse-complement (i.e. they have G->A conversions). It requires GNU parallel to be installed. You are encouraged to customize this script.