This script tries to find suitable score parameters (substitution and gap scores) for aligning some given sequences.

It (probabilistically) aligns the sequences using some initial score parameters, then estimates better score parameters based on the alignments, and repeats this procedure until the parameters stop changing.

The usage is like this:

lastdb mydb reference.fasta
last-train mydb queries.fasta

last-train prints a summary of each alignment step, followed by the final score parameters, in a format that can be read by lastal's -p option.

You can also pipe sequences into last-train, for example:

bzcat queries.fasta.bz2 | last-train mydb


-h, --help Show a help message, with default option values, and exit.
-v, --verbose Show more details of intermediate steps.

Training options

--revsym Force the substitution scores to have reverse-complement symmetry, e.g. score(A→G) = score(T→C). This is often appropriate, if neither strand is "special".
--matsym Force the substitution scores to have directional symmetry, e.g. score(A→G) = score(G→A).
--gapsym Force the insertion costs to equal the deletion costs.
--pid=PID Ignore alignments with > PID% identity. This aims to optimize the parameters for low-similarity alignments, similarly to the BLOSUM matrices.
--postmask=NUMBER By default, last-train ignores alignments of mostly-lowercase sequence (by using last-postmask). To turn this off, do --postmask=0.
--sample-number=N Use N randomly-chosen chunks of the query sequences. The queries are chopped into fixed-length chunks (as if they were first concatenated into one long sequence). If there are ≤ N chunks, all are picked. Otherwise, if the final chunk is shorter, it is never picked. 0 means use everything.
--sample-length=L Use randomly-chosen chunks of length L.

All options below this point are passed to lastal to do the alignments: they are described in more detail at lastal.html.

Initial parameter options

-r SCORE Initial match score.
-q COST Initial mismatch cost.
-p NAME Initial match/mismatch score matrix.
-a COST Initial gap existence cost.
-b COST Initial gap extension cost.
-A COST Initial insertion existence cost.
-B COST Initial insertion extension cost.

Alignment options

-D LENGTH Query letters per random alignment. (See here.)
-E EG2 Maximum expected alignments per square giga. (See here.)
-s NUMBER Which query strand to use: 0=reverse, 1=forward, 2=both.
-S NUMBER Score matrix applies to forward strand of: 0=reference, 1=query. This matters only if the scores lack reverse-complement symmetry.
-C COUNT Before extending gapped alignments, discard any gapless alignment whose query range lies in COUNT other gapless alignments with higher score-per-length. This aims to reduce run time.
-T NUMBER Type of alignment: 0=local, 1=overlap.
-m COUNT Maximum number of initial matches per query position.
-k STEP Look for initial matches starting only at every STEP-th position in each query.
-P COUNT Number of parallel threads.
-Q NUMBER Query sequence format: 0=fasta, 1=fastq-sanger. Important: if you use option -Q, last-train will only train the gap scores, and leave the substitution scores at their initial values.