Change last-split -m default to 1
authorMartin C. Frith
Tue Jul 16 07:51:22 2019 +0900 (4 months ago)
changeset 9832a722dbe737a
parent 982 4f77b3fc0720
child 984 fb43f850eba2
Change last-split -m default to 1
doc/last-split.txt
doc/last-tutorial.txt
scripts/last-train
src/split/last-split-main.cc
test/last-split-test.out
test/last-split-test.sh
     1.1 --- a/doc/last-split.txt	Thu Jun 27 13:50:12 2019 +0900
     1.2 +++ b/doc/last-split.txt	Tue Jul 16 07:51:22 2019 +0900
     1.3 @@ -1,8 +1,8 @@
     1.4  last-split
     1.5  ==========
     1.6  
     1.7 -This program estimates "split alignments" (typically for DNA) or
     1.8 -"spliced alignments" (typically for RNA).
     1.9 +This program finds "split alignments" (typically for DNA) or "spliced
    1.10 +alignments" (typically for RNA).
    1.11  
    1.12  It reads candidate alignments of query sequences to a genome, and
    1.13  looks for a unique best alignment for each part of each query.  It
    1.14 @@ -21,7 +21,8 @@
    1.15  can do the alignment like this::
    1.16  
    1.17    lastdb -uNEAR -R01 db genome.fasta
    1.18 -  lastal -Q1 db q.fastq | last-split > out.maf
    1.19 +  last-train -Q1 db q.fastq > train.out
    1.20 +  lastal -p train.out db q.fastq | last-split > out.maf
    1.21  
    1.22  Spliced alignment of RNA reads to a genome
    1.23  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1.24 @@ -32,7 +33,8 @@
    1.25  tells it where the splice signals are (GT, AG, etc)::
    1.26  
    1.27    lastdb -uNEAR -R01 db genome.fasta
    1.28 -  lastal -Q1 -D10 db q.fastq | last-split -g db > out.maf
    1.29 +  last-train -Q1 db q.fastq > train.out
    1.30 +  lastal -p train.out -D10 db q.fastq | last-split -g db > out.maf
    1.31  
    1.32  This will favour splices starting at GT (and to a lesser extent GC and
    1.33  AT), and ending at AG (and to a lesser extent AC).  However, it allows
    1.34 @@ -52,16 +54,7 @@
    1.35  Alignment of two whole genomes
    1.36  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1.37  
    1.38 -We can align the cat and rat genomes like this::
    1.39 -
    1.40 -  lastdb -uMAM8 -cR11 catdb cat.fasta
    1.41 -  lastal -m100 -E0.05 catdb rat.fasta | last-split -m1 > out.maf
    1.42 -
    1.43 -This will align each rat base-pair to at most one cat base-pair, but
    1.44 -not necessarily vice-versa.  We can get 1-to-1 alignments by swapping
    1.45 -the sequences and running last-split again::
    1.46 -
    1.47 -  maf-swap out.maf | last-split -m1 > out2.maf
    1.48 +See `here <https://github.com/mcfrith/last-genome-alignments>`_.
    1.49  
    1.50  FAQ
    1.51  ---
    1.52 @@ -73,16 +66,6 @@
    1.53      can prevent lastal and last-split from wasting time on such
    1.54      matches.
    1.55  
    1.56 -Going faster by parallelization
    1.57 --------------------------------
    1.58 -
    1.59 -For example, split alignment of DNA reads to a genome::
    1.60 -
    1.61 -  parallel-fastq "lastal -Q1 db | last-split" < q.fastq > out.maf
    1.62 -
    1.63 -This requires GNU parallel to be installed
    1.64 -(http://www.gnu.org/software/parallel/).
    1.65 -
    1.66  Output
    1.67  ------
    1.68  
    1.69 @@ -125,7 +108,7 @@
    1.70    Error probability <= 10 ^ -((ASCII value - 33) / 10)
    1.71  
    1.72  The "mismap" is simply the lowest probability from the "p" line.  (If
    1.73 -you run last-split twice, as in the genome alignment example, the
    1.74 +you run last-split twice, as in the genome alignment recipes, the
    1.75  mismap is the lowest combined error probability from both "p" lines.)
    1.76  
    1.77  Split versus spliced alignment
    1.78 @@ -159,8 +142,7 @@
    1.79  Spliced alignment can be slow.  It can be sped up, at a small cost in
    1.80  accuracy, by not favouring cis-splices::
    1.81  
    1.82 -  lastdb -uNEAR -R01 db genome.fasta
    1.83 -  lastal -Q1 -D10 db q.fastq | last-split -c0 -t0.004 -g db > out.maf
    1.84 +  lastal -p train.out -D10 db q.fastq | last-split -c0 -t0.004 -g db > out.maf
    1.85  
    1.86  The -c0 turns off cis-splicing, and the -t0.004 specifies a higher
    1.87  probability of trans-splicing.
    1.88 @@ -172,8 +154,7 @@
    1.89  middle of the query, we can do "spliced" alignment without considering
    1.90  splice signals or favouring cis-splices::
    1.91  
    1.92 -  lastdb -uNEAR -R01 db genome.fasta
    1.93 -  lastal -Q1 db q.fastq | last-split -c0 > out.maf
    1.94 +  lastal -p train.out db q.fastq | last-split -c0 > out.maf
    1.95  
    1.96  Options
    1.97  -------
    1.98 @@ -221,8 +202,6 @@
    1.99  
   1.100    -m, --mismap=PROB
   1.101           Don't write alignments with mismap probability > PROB.
   1.102 -         Low-confidence alignments will be discarded unless you
   1.103 -         increase this value!
   1.104  
   1.105    -s, --score=INT
   1.106           Don't write alignments with score < INT.
     2.1 --- a/doc/last-tutorial.txt	Thu Jun 27 13:50:12 2019 +0900
     2.2 +++ b/doc/last-tutorial.txt	Tue Jul 16 07:51:22 2019 +0900
     2.3 @@ -123,18 +123,16 @@
     2.4  
     2.5  By default, LAST is quite strict, and only reports significant
     2.6  alignments that will rarely occur by chance.  In the preceding
     2.7 -example, the minimum alignment length is about 28 bases (less for
     2.8 +example, the minimum alignment length is about 26 bases (less for
     2.9  smaller genomes).  To find shorter alignments, we must down-tune the
    2.10  strictness::
    2.11  
    2.12    lastdb -uNEAR -R01 humandb human/chr*.fa
    2.13 -  lastal -D100 humandb queries.fa | last-split -m1 > myalns.maf
    2.14 +  lastal -D100 humandb queries.fa | last-split > myalns.maf
    2.15  
    2.16  * -D100 makes lastal report alignments that could occur by chance once
    2.17    per hundred query letters.  (The default is once per million.)
    2.18  
    2.19 -* -m1 tells last-split to keep low-confidence alignments.
    2.20 -
    2.21  In this example, the minimum alignment length is about 20 bases (less
    2.22  for smaller genomes).
    2.23  
     3.1 --- a/scripts/last-train	Thu Jun 27 13:50:12 2019 +0900
     3.2 +++ b/scripts/last-train	Tue Jul 16 07:51:22 2019 +0900
     3.3 @@ -364,7 +364,7 @@
     3.4      if opts.A: x.append("-A" + opts.A)
     3.5      if opts.B: x.append("-B" + opts.B)
     3.6      x += args
     3.7 -    y = ["last-split", "-n"]
     3.8 +    y = ["last-split", "-n", "-m0.01"]  # xxx ???
     3.9      z = ["last-postmask"]
    3.10      p = subprocess.Popen(x, stdout=subprocess.PIPE, universal_newlines=True)
    3.11      q = subprocess.Popen(y, stdin=p.stdout, stdout=subprocess.PIPE,
     4.1 --- a/src/split/last-split-main.cc	Thu Jun 27 13:50:12 2019 +0900
     4.2 +++ b/src/split/last-split-main.cc	Tue Jul 16 07:51:22 2019 +0900
     4.3 @@ -41,7 +41,7 @@
     4.4    opts.trans = 1e-05;
     4.5    opts.mean = 7.0;
     4.6    opts.sdev = 1.7;
     4.7 -  opts.mismap = 0.01;
     4.8 +  opts.mismap = 1.0;
     4.9    opts.score = -1;
    4.10    opts.no_split = false;
    4.11    opts.bytes = 0;
     5.1 --- a/test/last-split-test.out	Thu Jun 27 13:50:12 2019 +0900
     5.2 +++ b/test/last-split-test.out	Tue Jul 16 07:51:22 2019 +0900
     5.3 @@ -13,7 +13,7 @@
     5.4   -t, --trans=PROB   trans-splice probability per base (default=1e-05)
     5.5   -M, --mean=MEAN    mean of ln[intron length] (default=7)
     5.6   -S, --sdev=SDEV    standard deviation of ln[intron length] (default=1.7)
     5.7 - -m, --mismap=PROB  maximum mismap probability (default=0.01)
     5.8 + -m, --mismap=PROB  maximum mismap probability (default=1)
     5.9   -s, --score=INT    minimum alignment score (default=e OR e+t*ln[100])
    5.10   -n, --no-split     write original, not split, alignments
    5.11   -b, --bytes=B      maximum memory (default=8T for split, 8G for spliced)
    5.12 @@ -46828,7 +46828,7 @@
    5.13  # T -21 -11 -20  6
    5.14  # N  0  0  0  0
    5.15  #
    5.16 -# m=0.01 s=110
    5.17 +# m=1 s=110
    5.18  #
    5.19  a score=291 mismap=1e-10
    5.20  s chr19 1395387 97 + 58617616 CAGGCTGCCCACCTACGGCCGAGGCCCTGC-TCTACGGCATCCTGCAGCTGcaGAGGAAGATCAAG-CGGGAGCGG-------AGGCTGCAGATCTGGTACCGCAG
    5.21 @@ -46901,7 +46901,7 @@
    5.22  # T -21 -11 -20  6
    5.23  # N  0  0  0  0
    5.24  #
    5.25 -# m=0.01 s=110
    5.26 +# m=1 s=110
    5.27  #
    5.28  a score=291 mismap=1e-10
    5.29  s chr19 1395387 97 + 58617616 CAGGCTGCCCACCTACGGCCGAGGCCCTGC-TCTACGGCATCCTGCAGCTGcaGAGGAAGATCAAG-CGGGAGCGG-------AGGCTGCAGATCTGGTACCGCAG
     6.1 --- a/test/last-split-test.sh	Thu Jun 27 13:50:12 2019 +0900
     6.2 +++ b/test/last-split-test.sh	Tue Jul 16 07:51:22 2019 +0900
     6.3 @@ -11,17 +11,17 @@
     6.4  {
     6.5      last-split -h
     6.6  
     6.7 -    last-split $maf
     6.8 +    last-split -m0.01 $maf
     6.9  
    6.10 -    last-split -c0 $maf
    6.11 +    last-split -m0.01 -c0 $maf
    6.12  
    6.13 -    last-split -t0 $maf
    6.14 +    last-split -m0.01 -t0 $maf
    6.15  
    6.16 -    last-split -M7.5 -S2 $maf
    6.17 +    last-split -m0.01 -M7.5 -S2 $maf
    6.18  
    6.19      last-split -m0.001 -s180 $maf
    6.20  
    6.21 -    last-split -n $maf
    6.22 +    last-split -m0.01 -n $maf
    6.23  
    6.24      last-split -d0 -m0.001 -s180 $maf
    6.25      last-split -d1 -m0.001 -s180 $maf