LAST tutorial

2012-03-08 (updated 2012-10-24, 2013-03-07)

Go to the LAST homepage:

http://last.cbrc.jp/

There are 2 ways of using LAST:

  1. Download it to your own computer.
  2. Web service. (This is basically a "demo": it cannot handle large data.)

LAST web service

Click on "LAST web service".

lastweb.png

We need to input some sequences. Open a new browser window, and go here:

http://last.cbrc.jp/tutorial/webfiles/

Let's compare the human and UNI (ウニ) mitochondria!

Click on human-mito.fasta. The sequence is in the standard fasta format:

human-mito.png

Copy and paste the sequence into the LAST web form.

  • CTRL-a: select all
  • CTRL-c: copy
  • CTRL-v: paste

Next, copy and paste uni-mito.fasta into the other box.

Click "Align".

dotplot.png

The dotplot shows which parts of the human sequence align to which parts of the UNI sequence.

To see the detailed alignments, click "Coloured alignments".

colalign.png

TASK

Try the HOYA (ホヤ) mitochondrion too. Which mitochondria are most similar?

OPTIONAL TASK

Try some other animal mitochondria. You can get them from here:

http://www.ncbi.nlm.nih.gov/genomes/OrganelleResource.cgi?taxid=33208

Click on an Accession, then choose "Display Settings" → "FASTA (text)".

Using LAST on our own computer

To use LAST on our own computer, we must use a unix command line (e.g. Linux, Mac, Windows+Cygwin).

Installing LAST

Go to the LAST homepage:

http://last.cbrc.jp/

Click on "Download LAST", and save the latest version on your computer.

Open a terminal window:

plain-terminal.png

Now you need to figure out which directory LAST got saved in (maybe "Downloads"?) and move there using the cd (change directory) command:

$ cd Downloads

To see what files are present, use the ls (list) command:

$ ls

You should see something like last-274.zip. Unzip it:

$ unzip last-274.zip

This created a new directory called last-274. Move into it:

$ cd last-274

Now, type make to compile the main LAST programs:

$ make

Next, install the programs. You can do either a standard installation (using sudo to request administrator permission):

$ sudo make install

Or a single-user installation:

$ make install prefix=~

In the latter case, you might need to log out and back in before your computer recognizes the new programs.

Compare the human and UNI mitochondria again

Download human-mito.fasta and uni-mito.fasta from here:

http://last.cbrc.jp/tutorial/files/

Open a terminal window, and cd into the directory where the files got saved (maybe "Downloads"?):

$ cd Downloads

First, use lastdb to make an "index" of the human sequence:

$ lastdb my-index human-mito.fasta

Unix hint: You do not have to type all of human-mito.fasta. You can just type the first few letters, then press the TAB key.

Point: You can replace my-index with any name you like. For example, neko.

lastdb created some new files. Type ls to see them:

$ ls
human-mito.fasta  my-index0.des  my-index0.ssp  my-index.prj    y-pestis.fasta
last-274.zip      my-index0.prj  my-index0.suf  plague.fastq
my-index0.bck     my-index0.sds  my-index0.tis  uni-mito.fasta

Next, use lastal to compare the UNI sequence to my-index:

$ lastal my-index uni-mito.fasta > out.maf

Point: You can replace out.maf with any name you like.

This made a new file called out.maf. Type ls to see it:

$ ls
human-mito.fasta  my-index0.des  my-index0.ssp  my-index.prj  uni-mito.fasta
last-274.zip      my-index0.prj  my-index0.suf  out.maf       y-pestis.fasta
my-index0.bck     my-index0.sds  my-index0.tis  plague.fastq

To look at the alignments, use less -S:

$ less -S out.maf

You can use the arrow keys to move around. To quit looking, type "q".

The alignments are in MAF format, which looks like this:

a score=546
s Human 5915 1432 + 16569 CGTTGACTATTCTCTACAAACCACAAAGACATTGGAACACTATACCTATTA
s UNI   5796 1432 + 15650 CGATGATTATTTTCTACTAACCACAAGGACATCGGAACACTTTATTTAATT

a score=316
s Human 14795 1078 + 16569 CATTCATCGACCTCCCCACCCCATCCAACATCTCCGCATGATGAAACTTC
s UNI   14560 1078 + 15650 CATTCGTTGACCTCCCCCTTCCCTCCAACCTTTCCATTTGGTGAAACTCG

a score=224
s Human 9099 892 + 16569 ATCTTCACAATTCTAATTCTACTGACTATCCTAGAAATCGCTGTCGCCTTAA
s UNI   9221 893 + 15650 ATCCTTATTATTTTTATTTTATTATTTGTGCTAGAAATAGGAGTAGCCTGCA

a score=158
s Human 7582 664 + 16569 TTAATGGCACATGCAGCGCAAGTAGGTCTACAAGACGCTACTTCCCCTATCA
s UNI   7707 664 + 15650 TTAATGGGAACTTGAGCACAGTTTGGTCTACAAGATGCATCCTCCCCTCTTA

TASK

Compare these alignments to the ones we found with the web service. Can you guess what all the numbers mean?

(Hint: MAF uses "0-based coordinates", but the web alignments use "1-based coordinates".)

Here is more information on MAF format:

http://genome.ucsc.edu/FAQ/FAQformat.html#format5

Getting help with the commands

If you type each command on its own, it prints a short usage reminder:

$ lastdb
lastdb: please give me an output name and sequence file(s)

Usage: lastdb [options] output-name fasta-sequence-file(s)
Prepare sequences for subsequent alignment with lastal.

Main Options:
-h: show all options and their default settings
-p: interpret the sequences as proteins
-c: soft-mask lowercase letters
$ lastal
lastal: please give me a database name and sequence file(s)

Usage: lastal [options] lastdb-name fasta-sequence-file(s)

The -h option prints a detailed help message, including all options and their default settings:

$ lastdb -h
$ lastal -h

The full documentation is at the LAST website:

Fastq format

We have sequenced some DNA from an outbreak of plague (ペスト)! Our DNA is in fastq format. Let's look at it:

http://last.cbrc.jp/tutorial/files/plague.fastq

It looks like this:

fastq.png

Each sequence has 4 lines.

  • Line 1, starting with @, has a name (e.g. read1).
  • Line 2 has the sequence.
  • Line 3 is useless. (I didn't invent this format.)
  • Line 4 indicates the quality (i.e. reliability) of each base.

Current DNA sequencers often make errors. They indicate the reliability of each base using ASCII-coded numbers. We can understand them by consulting an ASCII table:

http://www.ie.u-ryukyu.ac.jp/~e085739/_images/ascii1.png

The symbols indicate error probabilities, like this:

Symbol ASCII error probability
! 33 1
+ 43 0.1
5 53 0.01
? 63 0.001

TASK: What is the symbol for error probability 0.0001?

Here is more information on fastq format:

Compare plague DNA to a reference genome

We would like to compare our DNA to a reference genome of the plague bacterium, Yersinia pestis.

We have the reference genome in fasta format. Let's look at it:

http://last.cbrc.jp/tutorial/files/y-pestis.fasta

pestis-fasta.png

Download plague.fastq and y-pestis.fasta from here:

http://last.cbrc.jp/tutorial/files/

How should we compare them? Let's check the documentation:

http://last.cbrc.jp/doc/last-tutorial.html → Example 6

First, make an index of the genome:

$ lastdb -m1111110 yp y-pestis.fasta

The -m1111110 option makes it better at finding short, strong alignments. By default, it is better at finding long, weak alignments. (If you are interested, 1111110 represents a spaced seed.)

Next, compare our DNA to the genome:

$ lastal -Q1 yp plague.fastq > hit.maf

The -Q1 option tells lastal that the input is in fastq format.

Let's look at the alignments:

$ less -S hit.maf
a score=300
s Y.pestis 3891 50 + 4653728 ATCGCCAACTGAGGTGCTGCCAGTAAACCTACCACCATCTCTTCGATA
s read360     0 50 +      50 ATCGCCAACTGAGGTGCTGCCAGTAAACCTACCACCATCTCTTCGATA
q read360                    9BBCCCBCCCCCBBCBCCBCBCCBCBCBBCCCB:6@C?CBCB<ACBBC

a score=300
s Y.pestis 17011 50 + 4653728 ATAACCGCTGAAAGCATCTAAGCGGGAAACTTGCCTCGAGATGAGTC
s read520      0 50 +      50 ATAACCGCTGAAAGCATCTAAGCGGGAAACTTGCCTCGAGATGAGTC
q read520                     8CC?CCCC=CC@CBCCBC@BCCCCCBCCCCCCC@CCCC@CBCC@CCC

...
Let's look at the alignment of read724.
While using less, we can search by typing "/".
So type this, followed by ENTER/RETURN:
/read724

We can type "n" to repeat the search, and "N" to search backwards.

TASK

How many alignments are there for read724? What are their scores?

Resolving ambiguous alignments

Often, we assume that each DNA read has at most one true alignment. In that case, last-map-probs can estimate the probability that each alignment is true.

Let's look at the documentation:

http://last.cbrc.jp/doc/last-map-probs.html

Let's follow the "typical usage":

$ lastal -Q1 -e120 yp plague.fastq | last-map-probs.py -s150 > h.maf

Unix point: The "|" is a pipe. It sends the output of one command (lastal) to the input of another command (last-map-probs).

Let's look at the result:

$ less -S h.maf
a score=300 mismap=0
s Y.pestis 1635255 50 + 4653728 CCGGTGGTCCCTGTGGCATAGAGTTCGTGTTGTGCCAGTAGATCT
s read662        0 50 -      50 CCGGTGGTCCCTGTGGCATAGAGTTCGTGTTGTGCCAGTAGATCT
q read662                       @?@@@@><???;?<AAAA>:B=B;B:BA?C@B@CC?@B@=ABB@?

a score=300 mismap=0.00408535
s Y.pestis 1664783 50 + 4653728 CGTTGATGTGCGCACTGTTGCGGCTGTGATCGCCGACTGGACCGG
s read652        0 50 -      50 CGTTGATGTGCGCACTGTTGCGGCTGTGATCGCCGACTGGACCGG
q read652                       >B@=>?:A=@9A><B>??BB<@<@@A:@:AAB@BA75B@@A@>>7

last-map-probs gives each alignment a mismap probability: the probability that it is wrong (assuming that each DNA read has one correct alignment). It discards alignments with mismap > 0.01.

TASK: What happened to read724?

To understand this more clearly, let's keep the alignments with mismap > 0.01. The last-map-probs option -mX means "discard alignments with mismap > X". To keep all the alignments, use -m1:

$ lastal -Q1 -e120 yp plague.fastq | last-map-probs.py -s150 -m1 > z.maf

TASK: What happened to read724 now?

Converting the alignments to another format

We can convert the alignments to another format, using maf-convert.py:

$ maf-convert.py blast h.maf > h.blast

This converted them to BLAST format:

$ less -S h.blast

Which looks like:

Query= read360
         (50 letters)

>Y.pestis
          Length = 4653728

 Score = 300
 Identities = 50/50 (100%)
 Strand = Plus / Plus

Query: 1    ATCGCCAACTGAGGTGCTGCCAGTAAACCTACCACCATCTCTTCGATAAG 50
            ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 3892 ATCGCCAACTGAGGTGCTGCCAGTAAACCTACCACCATCTCTTCGATAAG 3941

...

We can convert them to other formats too. Look at the help message:

$ maf-convert.py -h
Usage:
  maf-convert.py --help
  maf-convert.py axt my-alignments.maf
  maf-convert.py blast my-alignments.maf
  maf-convert.py html my-alignments.maf
  maf-convert.py psl my-alignments.maf
  maf-convert.py sam my-alignments.maf
  maf-convert.py tab my-alignments.maf

Read MAF-format alignments & write them in another format.

TASK

Convert the alignments to some other formats. Which ones are easy to read? Which ones are compact (good for huge data)?

TASK

  • Find a DNA read that has a deletion, compared to the genome.
  • Find a DNA read that has an insertion, compared to the genome.
  • Find a DNA read that has a mismatch, compared to the genome.

(Hint: it might be easier with some formats than others.)