LAST seeding schemes

LAST's critical first step is to find seeds, i.e. initial matches between query and reference sequences. It can use various seeding schemes, which allow different kinds of mismatches at different seed positions.

Note

A seeding scheme consists of a seed alphabet, such as:

1  A C G T
0  ACGT
T  AG CT

and one or more patterns, such as this one:

1T1T10T1101101

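Each symbol in a pattern represents a grouping of sequence letters: in this example, T represents the grouping AG CT. At each position in an initial match, mismatches are allowed between letters that are grouped at that position in the pattern. So with this alphabet, 1 requires an exact match (each letter is in its own group), 0 allows any mismatch (all four letters are in one group), and T allows only A:G and C:T mismatches.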
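The matching rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described here, not LAST's actual implementation: the names SEED_ALPHABET, letters_grouped and matches_pattern are hypothetical, and the sketch assumes uppercase DNA with no ambiguous letters.

```python
# Seed alphabet from the example above: each pattern symbol maps to its
# groups of letters. (Hypothetical representation, for illustration only.)
SEED_ALPHABET = {
    "1": ["A", "C", "G", "T"],  # every letter in its own group
    "0": ["ACGT"],              # all letters in one group
    "T": ["AG", "CT"],          # purines together, pyrimidines together
}

def letters_grouped(symbol, a, b, alphabet=SEED_ALPHABET):
    """True if letters a and b are in the same group for this symbol."""
    return any(a in group and b in group for group in alphabet[symbol])

def matches_pattern(pattern, query, reference, alphabet=SEED_ALPHABET):
    """True if query and reference agree at every position of the pattern."""
    if len(query) < len(pattern) or len(reference) < len(pattern):
        return False
    return all(
        letters_grouped(symbol, q, r, alphabet)
        for symbol, q, r in zip(pattern, query, reference)
    )

PATTERN = "1T1T10T1101101"
REFERENCE = "ACGTACGTACGTAC"

# Differs from REFERENCE only at T positions (A:G or C:T) and 0 positions.
print(matches_pattern(PATTERN, "ATGCAGATATGTGC", REFERENCE))
# Differs at the first position, where the symbol is 1.
print(matches_pattern(PATTERN, "CTGCAGATATGTGC", REFERENCE))
```

The first query is accepted and the second is rejected, because a 1 position tolerates no mismatch.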

Although the patterns have fixed lengths, LAST's initial matches do not. LAST finds shorter matches by using a prefix of the pattern, and longer matches by cyclically repeating the pattern.
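As a rough sketch of the prefix and cyclic-repeat behaviour described above, the helper below returns the pattern symbols that would apply to a match of a given length. The function name pattern_for_length is hypothetical, and the sketch says nothing about how LAST chooses match lengths.

```python
from itertools import cycle, islice

def pattern_for_length(pattern, length):
    """Symbols governing a match of the given length: a prefix of the
    pattern if the match is shorter, a cyclic repetition if it is longer."""
    return "".join(islice(cycle(pattern), length))

PATTERN = "1T1T10T1101101"  # 14 symbols

print(pattern_for_length(PATTERN, 5))   # shorter match: prefix
print(pattern_for_length(PATTERN, 20))  # longer match: pattern wraps around
```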

BISF

This seeding scheme is for aligning bisulfite-converted DNA forward strands to a closely related genome (MC Frith, R Mori, K Asai, NAR 2012 40:e100). It uses this seed alphabet:

1  CT A G
0  ACGT

And this pattern:

1111110101100

It sets this lastdb default: -w2

It sets this lastal default: -pBISF

BISR

This seeding scheme is for aligning bisulfite-converted DNA reverse strands to a closely related genome (MC Frith, R Mori, K Asai, NAR 2012 40:e100). It uses this seed alphabet:

1  AG C T
0  ACGT

And this pattern:

1111110101100

It sets this lastdb default: -w2

It sets this lastal default: -pBISR
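The BISF and BISR alphabets differ only in which pair of letters is grouped at 1 positions: C with T, or A with G. The sketch below illustrates why, under two simplifying assumptions that are not from this document's text: every C in a forward-strand read is converted to T, and a reverse-strand read shows the corresponding G to A changes when compared with the forward reference. The names seed_match, BISF, BISR and PLAIN are hypothetical.

```python
# Seed alphabets as listed above; PLAIN is an ordinary exact-match
# alphabet added here only for comparison.
BISF = {"1": ["CT", "A", "G"], "0": ["ACGT"]}
BISR = {"1": ["AG", "C", "T"], "0": ["ACGT"]}
PLAIN = {"1": ["A", "C", "G", "T"], "0": ["ACGT"]}
PATTERN = "1111110101100"

def seed_match(pattern, alphabet, query, reference):
    """True if the two sequences agree at every pattern position."""
    if len(query) < len(pattern) or len(reference) < len(pattern):
        return False
    return all(
        any(q in group and r in group for group in alphabet[symbol])
        for symbol, q, r in zip(pattern, query, reference)
    )

reference = "ACGTCCGATTCAG"
forward_read = reference.replace("C", "T")  # assumed full C->T conversion
reverse_read = reference.replace("G", "A")  # assumed full G->A conversion

print(seed_match(PATTERN, BISF, forward_read, reference))
print(seed_match(PATTERN, PLAIN, forward_read, reference))
print(seed_match(PATTERN, BISR, reverse_read, reference))
```

Under these assumptions, each converted read matches with its own scheme but not with the plain alphabet or with the other scheme.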

MAM4

This DNA seeding scheme is like MAM8, but a bit less sensitive, and uses about half as much memory. [From Frith & Noé 2014 NAR 42:e59 Table S11, row 12.] It uses this seed alphabet:

1  A C G T
0  ACGT
T  AG CT

And these patterns:

11100TT01T00T10TTTT
TTTT110TT0T001T0T1T1
11TT010T01TT0001T
11TT10T1T101TT

MAM8

This DNA seeding scheme finds weak similarities with high sensitivity, but low speed and high memory usage (e.g. ~50 GB for mammal genomes). [From Frith & Noé 2014 NAR 42:e59 Table S12, row 15.] It uses this seed alphabet:

1  A C G T
0  ACGT
T  AG CT

And these patterns:

1101T1T0T1T00TT1TT
1TTTTT010TT0TT01011TTT
1TTTT10010T011T0TTTT1
111T011T0T01T100
1T10T100TT01000TT01TT11
111T101TT000T0T10T00T1T
111100T011TTT00T0TT01T
1T1T10T1101101

MURPHY10

This protein seeding scheme is popular for finding long-and-weak similarities (LR Murphy et al. Protein Eng. 2000 13:149). It uses this seed alphabet:

1  ILMV FWY A C G H P KR ST DENQ

And this pattern:

1

It sets this lastdb default: -p
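Because the pattern is a single 1, the same amino-acid grouping applies at every position of a match. A minimal sketch of that idea follows, assuming uppercase sequences made only of the 20 standard amino acids. The helper names reduce_protein and same_under_murphy10 are hypothetical.

```python
# The ten MURPHY10 groups, exactly as listed in the seed alphabet above.
MURPHY10_GROUPS = "ILMV FWY A C G H P KR ST DENQ".split()

# Map each amino acid to the index of its group.
LETTER_TO_GROUP = {
    aa: index for index, group in enumerate(MURPHY10_GROUPS) for aa in group
}

def reduce_protein(seq):
    """Replace each amino acid by the index of its MURPHY10 group."""
    return [LETTER_TO_GROUP[aa] for aa in seq]

def same_under_murphy10(a, b):
    """True if two equal-length peptides agree group-by-group."""
    return len(a) == len(b) and reduce_protein(a) == reduce_protein(b)

print(same_under_murphy10("MKVLS", "IRLVT"))  # every position stays in-group
print(same_under_murphy10("MKVLS", "MKVLA"))  # S and A are in different groups
```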

NEAR

This DNA seeding scheme is good for finding short-and-strong (near-identical) similarities. It is also good for similarities with many gaps (insertions and deletions), because it can find the short matches between the gaps. (Long-and-weak seeding schemes allow for mismatches but not gaps.) It uses this seed alphabet:

1  A C G T
0  ACGT

And this pattern:

1111110

It sets this lastal default: -r6 -q18 -a21 -b9
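To get a feel for these defaults, the sketch below scores a hypothetical alignment. It assumes lastal's usual option meanings (-r match score, -q mismatch cost, -a gap existence cost, -b gap extension cost, with a gap of length k costing a + b*k). The function alignment_score is illustrative and is not part of LAST.

```python
def alignment_score(matches, mismatches, gap_lengths, r=6, q=18, a=21, b=9):
    """Score an alignment under affine gap costs.

    Defaults are the NEAR values; each gap of length k costs a + b*k.
    """
    gap_cost = sum(a + b * k for k in gap_lengths)
    return r * matches - q * mismatches - gap_cost

# 100 matches, 1 mismatch, and one gap of length 2.
print(alignment_score(100, 1, [2]))
```

With these values a single mismatch cancels three matches, which is consistent with the scheme's focus on near-identical similarities.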

YASS

This DNA seeding scheme is good for finding long-and-weak similarities. It is a good compromise for both protein-coding and non-protein-coding DNA (L Noé & G Kucherov, NAR 2005 33:W540-W543). It uses this seed alphabet:

1  A C G T
0  ACGT
T  AG CT

And this pattern:

1T1001100101