last-train finds the rates (probabilities) of insertion, deletion, and substitutions between two sets of sequences. It thereby finds suitable substitution and gap scores for aligning them.

It (probabilistically) aligns the sequences using some initial score parameters, then estimates better score parameters based on the alignments, and repeats this procedure until the parameters stop changing.

The usage is like this:

lastdb mydb reference.fasta last-train mydb queries.fasta

last-train prints a summary of each alignment step, followed by the final score parameters, in a format that can be read by lastal's -p option.

last-train can read .gz files, or from pipes:

bzcat queries.fasta.bz2 | last-train mydb

-h, --helpShow a help message, with default option values, and exit. -v, --verboseShow more details of intermediate steps.

--revsymForce the substitution scores to have reverse-complement symmetry, e.g. score(A→G) = score(T→C). This is often appropriate, if neither strand is "special". --matsymForce the substitution scores to have directional symmetry, e.g. score(A→G) = score(G→A). --gapsymForce the insertion costs to equal the deletion costs. --pid=PIDIgnore alignments with > PID% identity. This aims to optimize the parameters for low-similarity alignments (similarly to the BLOSUM matrices). --postmask=NUMBERBy default, last-train ignores alignments of mostly-lowercase sequence (by using last-postmask). To turn this off, do --postmask=0.--sample-number=NUse N randomly-chosen chunks of the query sequences. The queries are chopped into fixed-length chunks (as if they were first concatenated into one long sequence). If there are ≤ N chunks, all are picked. Otherwise, if the final chunk is shorter, it is never picked. 0 means use everything. --sample-length=LUse randomly-chosen chunks of length L. --scale=SOutput scores in units of 1/S bits. Traditional values include 2 for half-bit scores and 3 for 1/3-bit scores. (Note that 1/3-bit scores essentially equal Phred scores a.k.a. decibans, because log10(2) ≈ 3/10.) The default is to infer a scale from the initial score parameters. --codonDo training for DNA query sequences versus protein reference sequences. These options will be ignored: --revsym --matsym --gapsym --pid --postmask -q -p -S. If--codonis used, the "initial parameter options" are initial probabilities, not scores/costs.

All options below this point are passed to lastal to do the alignments: they are described in more detail at lastal.html.

-rSCOREInitial match score. -qCOSTInitial mismatch cost. -pNAMEInitial match/mismatch score matrix. -aCOSTInitial gap existence cost. -bCOSTInitial gap extension cost. -ACOSTInitial insertion existence cost. -BCOSTInitial insertion extension cost. -FLISTInitial frameshift probabilities (only used with --codon).

-DLENGTHQuery letters per random alignment. (See here.) -EEG2Maximum expected alignments per square giga. (See here.) -sNUMBERWhich query strand to use: 0=reverse, 1=forward, 2=both. If specified, this parameter is written in last-train's output, so it will override lastal's default. -SNUMBERSpecify how to use the substitution score matrix for reverse strands. If you use

--revsym, this makes no difference. "0" means that the matrix is used as-is for all alignments. "1" (the default) means that the matrix is used as-is for alignments of query sequence forward strands, and the complemented matrix is used for query sequence reverse strands.This parameter is always written in last-train's output, so it will override lastal's default.

-CCOUNTBefore extending gapped alignments, discard any gapless alignment whose query range lies in COUNT other gapless alignments with higher score-per-length. This aims to reduce run time. -TNUMBERType of alignment: 0=local, 1=overlap. -mCOUNTMaximum number of initial matches per query position. -kSTEPLook for initial matches starting only at every STEP-th position in each query. -PCOUNTNumber of parallel threads. -XNUMBERHow to score a match/mismatch involving N (for DNA) or X (otherwise). By default, the lowest match/mismatch score is used. 0 means the default; 1 means treat reference Ns/Xs as fully-ambiguous letters; 2 means treat query Ns/Xs as ambiguous; 3 means treat reference and query Ns/Xs as ambiguous.

If specified, this parameter is written in last-train's output, so it will override lastal's default.

-QNAMEHow to read the query sequences (the NAME is not case-sensitive):

Default fasta "0", "fastx" fasta or fastq: discard per-base quality data "1", "sanger" fastq-sangerThe

fastqformats are described here: lastal.html. last-train assumes the per-base quality codes indicate substitution error probabilities,notinsertion or deletion error probabilities. If this assumption is dubious (e.g. for data with many insertion or deletion errors), it may be better to discard the quality data. Forfastq-sanger, last-train finds the rates of substitutions not explained by the quality data (ideally, real substitutions as opposed to errors).If specified, this parameter is written in last-train's output, so it will override lastal's default.

last-train (and lastal) uses "Model A", in Figure 5A of btz576.

last-train (and lastal) converts between path and alignment parameters as in Supplementary Section 3.1 of btz576.

last-train uses parameters with "homogeneous letter probabilities" and "balanced length probability" (btz576).

last-train rounds the scores to integers, which makes them slightly inaccurate. It then finds an adjusted scale factor (without changing the scores), which makes the integer-rounded scores correspond to homogeneous letter probabilities and balanced length probability. It writes this adjusted scale (in nats, not bits) as a "-t" option for lastal, e.g. "-t4.4363".

In rare cases, it may be impossible to find such an adjusted scale factor. If that happens, last-train doubles the original scale (to reduce the inaccuracy of integer rounding), until the problem goes away.

last-train assumes that gap lengths roughly follow a geometric distribution. If they do not (which is often the case), the results may be poor.

last-train can fail for various reasons, e.g. if the sequences are too dissimilar. If it fails to find any alignments, you could try reducing the alignment significance threshold with option

`-D`.