This script tries to find suitable score parameters (substitution and gap scores) for aligning some given sequences.

It (probabilistically) aligns the sequences using some initial score parameters, then estimates better score parameters based on the alignments, and repeats this procedure until the parameters stop changing.
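
The align-then-re-estimate loop described above can be pictured with a small Python sketch. This is only an illustration of the control flow, not last-train's actual code: the function names (train, toy_align, toy_estimate) and the single-integer "parameter" are hypothetical stand-ins for the real alignment and estimation steps.

```python
def train(initial_params, align, estimate, max_rounds=100):
    """Repeat align -> estimate until the parameters stop changing."""
    params = initial_params
    for round_number in range(1, max_rounds + 1):
        alignments = align(params)          # align with the current parameters
        new_params = estimate(alignments)   # re-estimate from those alignments
        if new_params == params:            # unchanged: treat as converged
            return params, round_number
        params = new_params
    raise RuntimeError("parameters did not converge")


# Toy stand-ins (assumptions for illustration only): the "alignment" is just
# the parameter itself, and the "estimate" moves it halfway towards 6.
def toy_align(params):
    return params


def toy_estimate(alignments):
    return (alignments + 6) // 2


final_params, rounds = train(1, toy_align, toy_estimate)
print(final_params, rounds)
```

The max_rounds cap is a design choice of this sketch, so that a procedure which never settles raises an error instead of looping forever.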

The usage is like this:

  lastdb mydb reference.fasta
  last-train mydb queries.fasta

last-train prints a summary of each alignment step, followed by the final score parameters in a format that can be read by lastal's -p option.

-h, --help
    Show a help message, with default option values, and exit.

--revsym
    Force the substitution scores to have reverse-complement symmetry, e.g. score(A→G) = score(T→C). This is often appropriate, if neither strand is "special".

--matsym
    Force the substitution scores to have directional symmetry, e.g. score(A→G) = score(G→A).

--gapsym
    Force the insertion costs to equal the deletion costs.

--pid=PID
    Ignore alignments with > PID% identity. This aims to optimize the parameters for low-similarity alignments, similarly to the BLOSUM matrices.

--sample-number=N
    Use N randomly-chosen chunks of the query sequences. The queries are chopped into fixed-length chunks (as if they were first concatenated into one long sequence). If there are ≤ N chunks, all are picked. Otherwise, if the final chunk is shorter, it is never picked. 0 means use everything.

--sample-length=L
    Use randomly-chosen chunks of length L.
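
The chunk-picking rule for --sample-number and --sample-length can be sketched in Python as follows. This is a reading of the description above, not last-train's implementation: the function name sample_chunks, its arguments, and the choice to return (start, end) coordinates in the concatenated query are all assumptions made for illustration.

```python
import random


def sample_chunks(total_length, chunk_length, sample_number, rng=random):
    """Pick chunks of the concatenated queries, as (start, end) pairs."""
    if sample_number == 0:
        # 0 means use everything (modelled here as one whole-length span).
        return [(0, total_length)]
    # Chop the concatenated sequence into fixed-length chunks.
    chunks = [(start, min(start + chunk_length, total_length))
              for start in range(0, total_length, chunk_length)]
    if len(chunks) <= sample_number:
        return chunks  # few enough chunks: all are picked
    last_start, last_end = chunks[-1]
    if last_end - last_start < chunk_length:
        chunks.pop()   # a shorter final chunk is never picked
    return sorted(rng.sample(chunks, sample_number))


# 25 letters in chunks of 10: two full chunks and one short final chunk.
print(sample_chunks(25, 10, 5))                    # all three chunks
print(sample_chunks(25, 10, 2, random.Random(0)))  # only the two full chunks
```

Passing a seeded random.Random instance, as in the last line, makes a sampling run repeatable.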

All options below this point are passed to lastal to do the alignments: they are described in more detail at lastal.html.

-r SCORE
    Initial match score.

-q COST
    Initial mismatch cost.

-p NAME
    Initial match/mismatch score matrix.

-a COST
    Initial gap existence cost.

-b COST
    Initial gap extension cost.

-A COST
    Initial insertion existence cost.

-B COST
    Initial insertion extension cost.

-D LENGTH
    Query letters per random alignment.

-E EG2
    Maximum expected alignments per square giga.

-s NUMBER
    Which query strand to use: 0=reverse, 1=forward, 2=both.

-S NUMBER
    Score matrix applies to forward strand of: 0=reference, 1=query. This matters only if the scores lack reverse-complement symmetry.

-T NUMBER
    Type of alignment: 0=local, 1=overlap.

-m COUNT
    Maximum number of initial matches per query position.

-P COUNT
    Number of parallel threads.

-Q NUMBER
    Query sequence format: 0=fasta, 1=fastq-sanger. Important: if you use option -Q, last-train will only train the gap scores, and leave the substitution scores at their initial values.
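
Since option -S only matters when the scores lack reverse-complement symmetry, it can help to check a matrix for the two symmetries that --revsym and --matsym enforce. The Python sketch below tests the definitions given earlier, score(A→G) = score(T→C) and score(A→G) = score(G→A). The nested-dictionary layout, the helper names, and the score values are made up for illustration and are not a LAST file format.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"


def has_revsym(score):
    """True if score[a][b] == score[comp(a)][comp(b)] for all base pairs."""
    return all(score[a][b] == score[COMPLEMENT[a]][COMPLEMENT[b]]
               for a in BASES for b in BASES)


def has_matsym(score):
    """True if score[a][b] == score[b][a] for all base pairs."""
    return all(score[a][b] == score[b][a] for a in BASES for b in BASES)


def simple_matrix(match=6, mismatch=-18):
    """A plain match/mismatch matrix (made-up values)."""
    return {a: {b: (match if a == b else mismatch) for b in BASES}
            for a in BASES}


# Make A->G and its reverse complement T->C cheaper than other mismatches.
strand_symmetric = simple_matrix()
strand_symmetric["A"]["G"] = strand_symmetric["T"]["C"] = -10

print(has_revsym(strand_symmetric))  # True
print(has_matsym(strand_symmetric))  # False: A->G differs from G->A
```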

last-train assumes that gap lengths roughly follow a geometric distribution. If they do not (which is often the case), the results may be poor.
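
Under a geometric model, the log-probability of a gap falls linearly with its length, which is the shape that an existence cost plus a per-letter extension cost can represent. One rough way to judge the assumption is to compare observed gap-length counts with those of a fitted geometric distribution. The Python sketch below assumes you have already collected a list of gap lengths from some alignments; the function name and the example data are hypothetical, and this is not something last-train itself reports.

```python
from collections import Counter


def geometric_fit(gap_lengths):
    """Fit P(k) = (1 - q) * q**(k - 1), k >= 1, and tabulate the fit.

    Returns q (the maximum-likelihood extension probability, 1 - 1/mean)
    and a list of (length, observed count, expected count) rows.
    """
    n = len(gap_lengths)
    mean = sum(gap_lengths) / n
    q = 1 - 1 / mean
    observed = Counter(gap_lengths)
    rows = []
    for k in range(1, max(gap_lengths) + 1):
        expected = n * (1 - q) * q ** (k - 1)
        rows.append((k, observed.get(k, 0), expected))
    return q, rows


# Made-up gap lengths, for illustration only.
q, rows = geometric_fit([1, 1, 2, 4])
for length, seen, expected in rows:
    print(length, seen, expected)
```

Large, systematic differences between the observed and expected columns (for example, far more long gaps than expected) would suggest that the geometric assumption fits poorly.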

last-train can fail for various reasons, e.g. if the sequences are too dissimilar. If it fails to find any alignments, you could try reducing the alignment significance threshold with option -D.