Made last-pair-probs able to read interleaved pairs.
authorMartin C. Frith
Wed Dec 10 12:01:46 2014 +0900 (2014-12-10)
changeset 529dc2c88c11662
parent 528 29ee3f694b98
child 530 6aa0a194f64b
Made last-pair-probs able to read interleaved pairs.
doc/last-pair-probs.txt
scripts/fastq-interleave
scripts/parallel-fastq
src/last-pair-probs-main.cc
src/last-pair-probs.cc
src/makefile
test/last-pair-test.out
     1.1 --- a/doc/last-pair-probs.txt	Tue Dec 09 12:07:36 2014 +0900
     1.2 +++ b/doc/last-pair-probs.txt	Wed Dec 10 12:01:46 2014 +0900
     1.3 @@ -1,67 +1,113 @@
     1.4  last-pair-probs
     1.5  ===============
     1.6  
     1.7 -This program reads alignments of paired DNA reads to a genome, and:
     1.8 +This program reads candidate alignments of paired DNA reads to a
     1.9 +genome, and:
    1.10  
    1.11  1. estimates the distribution of distances between paired reads,
    1.12  2. estimates the probability that each alignment represents the
    1.13     genomic source of the read.
    1.14  
    1.15 -The program takes as input two files of alignments, where one read
    1.16 -from each pair is in one file, and the other read is in the other
    1.17 -file.
    1.18 +"Paired" means that the reads come from either end of a DNA fragment,
    1.19 +in tail-to-tail orientation::
    1.20  
    1.21 -Typical usage::
    1.22 +  5' ---------->................................. 3'
    1.23 +  3' .................................<---------- 5'
    1.24  
    1.25 -  lastdb -m1111110 humandb human/chr*.fa
    1.26 -  lastal -Q1 -e120 -i1 humandb reads1.fastq > temp1.maf
    1.27 -  lastal -Q1 -e120 -i1 humandb reads2.fastq > temp2.maf
    1.28 -  last-pair-probs temp1.maf temp2.maf > results.maf
    1.29 +Or head-to-head orientation::
    1.30  
    1.31 -If your reads come from potentially-spliced RNA molecules, use the -r
    1.32 -option::
    1.33 +  5' .................................----------> 3'
    1.34 +  3' <----------................................. 5'
    1.35  
    1.36 -  last-pair-probs -r temp1.maf temp2.maf > results.maf
    1.37 +The program writes the alignments with "mismap" probabilities,
    1.38 +i.e. the probability that the alignment does not represent the genomic
    1.39 +source of the read. By default, it discards alignments with mismap
    1.40 +probability > 0.01.
    1.41 +
    1.42 +Simple usage
    1.43 +------------
    1.44 +
    1.45 +Suppose we have paired DNA reads in a file called "interleaved.fastq"
    1.46 +(in fastq-sanger format), where the first two reads are paired, the
    1.47 +next two reads are paired, and so on.  We can align them to the human
    1.48 +genome like this::
    1.49 +
    1.50 +  lastdb -m1111110 hg human-genome.fasta
    1.51 +  lastal -Q1 -e120 -i1 hg interleaved.fastq > temp.maf
    1.52 +  last-pair-probs temp.maf > out.maf
    1.53 +
    1.54 +Suppose we have paired reads in two files, where the two first reads
    1.55 +are paired, the two second reads are paired, and so on.  We can
    1.56 +interleave them like this::
    1.57 +
    1.58 +  fastq-interleave x.fastq y.fastq | lastal -Q1 -e120 -i1 hg > temp.maf
    1.59 +
    1.60 +Reads from potentially-spliced RNA molecules
    1.61 +--------------------------------------------
    1.62 +
    1.63 +Use the -r option::
    1.64 +
    1.65 +  last-pair-probs -r temp.maf > out.maf
    1.66  
    1.67  Without -r, it assumes the distances between paired reads follow a
    1.68  normal distribution.  With -r, it assumes the distances follow a
    1.69  skewed (log-normal) distribution, which is much more appropriate for
    1.70  spliced RNA.
    1.71  
    1.72 +Efficient usage
    1.73 +---------------
    1.74 +
    1.75 +The preceding recipes make a potentially-huge temp file, and
    1.76 +last-pair-probs reads it twice: first to estimate the distance
    1.77 +distribution, and then to estimate alignment probabilities.  It is
    1.78 +more efficient to estimate the distance distribution from a small
    1.79 +sample of the data::
    1.80 +
    1.81 +  lastal -Q1 -e120 -i1 hg sample.fastq | last-pair-probs -e
    1.82 +
    1.83 +Suppose this tells us that the mean distance is 250 and the standard
    1.84 +deviation is 38.5.  We can use that to estimate the alignment
    1.85 +probabilities::
    1.86 +
    1.87 +  lastal -Q1 -e120 -i1 hg all.fastq | last-pair-probs -f250 -s38.5 > out.maf
    1.88 +
    1.89 +Going faster by parallelization
    1.90 +-------------------------------
    1.91 +
    1.92 +This will run the pipeline on all your CPU cores::
    1.93 +
    1.94 +  fastq-interleave x.fastq y.fastq |
    1.95 +  parallel-fastq "lastal -Q1 -e120 -i1 hg | last-pair-probs -f250 -s38.5" > out.maf
    1.96 +
    1.97 +It requires GNU parallel to be installed
    1.98 +(http://www.gnu.org/software/parallel/).
    1.99 +
   1.100  Details
   1.101  -------
   1.102  
   1.103 -* The program writes the alignments with "mismap" probabilities,
   1.104 -  i.e. the probability that the alignment does not represent the
   1.105 -  genomic source of the read.  By default, it discards alignments with
   1.106 -  mismap probability > 0.01.
   1.107 +* The "distance" between a pair of reads means the distance between
   1.108 +  their 5' ends.  Positive distance indicates tail-to-tail
   1.109 +  orientation, and negative distance indicates head-to-head
   1.110 +  orientation.  Negative distances are not considered when -r is used,
   1.111 +  nor for circular chromosomes.
   1.112  
   1.113 -* It assumes that each pair of reads comes from opposite strands of a
   1.114 -  DNA molecule.  The "distance" between them means the distance
   1.115 -  between their 5' ends.  Positive distance indicates tail-to-tail
   1.116 -  orientation (with the 5' end being the head and the 3' end being the
   1.117 -  tail).  Negative distance indicates head-to-head orientation.
   1.118 -  Negative distances are not considered when -r is used, nor for
   1.119 -  circular chromosomes.
   1.120 +* The program reads one batch of alignments at a time (by looking for
   1.121 +  lines starting with "# batch").  It assumes there is exactly one DNA
   1.122 +  read per batch: if it finds more than one, it will complain.  The
   1.123 +  lastal -i1 option ensures there is one query per batch.
   1.124  
   1.125 -* The program reads one batch of alignments at a time from each file
   1.126 -  (by looking for lines starting with "# batch").  It assumes that
   1.127 -  each pair of batches has the alignments for one pair of reads.  To
   1.128 -  achieve that, the read pairs should be in the same order in the
   1.129 -  fastq files, and the lastal -i1 option ensures there is one query
   1.130 -  per batch.
   1.131 -
   1.132 -* The input files may be in either format produced by lastal (maf or
   1.133 +* The alignments may be in either format produced by lastal (maf or
   1.134    tabular).  They must include header lines (of the kind produced by
   1.135    lastal) describing the alignment parameters.
   1.136  
   1.137 -* By default, the program makes two passes over each file.  So they
   1.138 -  must be real files (not pipes).  However, if you use option -e, or
   1.139 -  both -f and -s, it makes just one pass over each file.
   1.140 +* If a read name ends in neither "/1" nor "/2", the program appends
   1.141 +  "/1" if it is the 1st in a pair or "/2" if it is the 2nd.
   1.142  
   1.143 -* If a read name ends in neither "/1" nor "/2", the program appends
   1.144 -  "/1" if it comes from the first file or "/2" if it comes from the
   1.145 -  second.
   1.146 +* It is also possible to supply the alignments in two files::
   1.147 +
   1.148 +    lastal -Q1 -e120 -i1 hg x.fastq > temp1.maf
   1.149 +    lastal -Q1 -e120 -i1 hg y.fastq > temp2.maf
   1.150 +    last-pair-probs temp1.maf temp2.maf > out.maf
   1.151  
   1.152  Options
   1.153  -------
   1.154 @@ -106,64 +152,9 @@
   1.155  Tips
   1.156  ----
   1.157  
   1.158 -* For greater speed and convenience, first estimate the distance
   1.159 -  distribution from a sample of your data (using -e), then analyze all
   1.160 -  your data with this estimate (using -f and -s).
   1.161 -
   1.162 -* You can avoid temp files by using named pipes::
   1.163 -
   1.164 -    mkfifo pipe1 pipe2
   1.165 -    lastal -Q1 -e120 -i1 humandb reads1.fastq > pipe1 &
   1.166 -    lastal -Q1 -e120 -i1 humandb reads2.fastq > pipe2 &
   1.167 -    last-pair-probs -f250 -s30 pipe1 pipe2 > results.maf
   1.168 -
   1.169 -  This streams the alignments from lastal to last-pair-probs.
   1.170 -
   1.171  * To go faster, try gapless alignment (add -j1 to the lastal options).
   1.172    Often, this is only minusculely less accurate than gapped alignment.
   1.173  
   1.174 -* Tabular output (lastal option -f0) is smaller and faster.
   1.175 -
   1.176 -Using multiple CPUs
   1.177 --------------------
   1.178 -
   1.179 -With large datasets, it's important to go faster by using multiple
   1.180 -CPUs.  One way to do that is by using GNU parallel
   1.181 -(http://www.gnu.org/software/parallel/), as follows.
   1.182 -
   1.183 -* Split reads1.fastq into multiple files, with (say) 100000 reads per
   1.184 -  file::
   1.185 -
   1.186 -    split -l400000 -a5 reads1.fastq 1
   1.187 -
   1.188 -  This assumes that a new fastq record begins every 4th line, i.e. no
   1.189 -  line wrapping.  The created files will be called 1aaaaa, 1aaaab,
   1.190 -  etc.
   1.191 -
   1.192 -* Split reads2.fastq::
   1.193 -
   1.194 -    split -l400000 -a5 reads2.fastq 2
   1.195 -
   1.196 -* Make a file called (say) last-pair.sh, with the following content
   1.197 -  (or similar - you might want to use -r, -j1, etc)::
   1.198 -
   1.199 -    #! /bin/sh
   1.200 -    lastal -Q1 -e120 -i1 "$3" "$4" > /tmp/$$.1
   1.201 -    lastal -Q1 -e120 -i1 "$3" "$5" > /tmp/$$.2
   1.202 -    last-pair-probs -f "$1" -s "$2" /tmp/$$.1 /tmp/$$.2
   1.203 -    rm /tmp/$$.*
   1.204 -
   1.205 -* Set execute permission::
   1.206 -
   1.207 -    chmod +x last-pair.sh
   1.208 -
   1.209 -* Run it in parallel on all your CPU cores::
   1.210 -
   1.211 -    parallel --xapply ./last-pair.sh 250 30 humandb ::: 1* ::: 2* > results.maf
   1.212 -
   1.213 -  Here we have specified a mean distance of 250 and a standard
   1.214 -  deviation of 30.
   1.215 -
   1.216  Reference
   1.217  ---------
   1.218  
     2.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     2.2 +++ b/scripts/fastq-interleave	Wed Dec 10 12:01:46 2014 +0900
     2.3 @@ -0,0 +1,12 @@
     2.4 +#! /bin/bash
     2.5 +
     2.6 +# Reads 2 fastq files, and writes them interleaved.
     2.7 +# Assumes 1 fastq per 4 lines, i.e. no line wrapping.
     2.8 +
     2.9 +paste <(cat "$1" | paste - - - -) <(cat "$2" | paste - - - -) | tr '\t' '\n'
    2.10 +
    2.11 +# Is this better?
    2.12 +#paste <(zcat -f "$1"|paste - - - -) <(zcat -f "$2"|paste - - - -)|tr '\t' '\n'
    2.13 +
    2.14 +# This does not interpret "-" as stdin:
    2.15 +#paste <(paste - - - - < "$1") <(paste - - - - < "$2") | tr '\t' '\n'
     3.1 --- a/scripts/parallel-fastq	Tue Dec 09 12:07:36 2014 +0900
     3.2 +++ b/scripts/parallel-fastq	Wed Dec 10 12:01:46 2014 +0900
     3.3 @@ -5,4 +5,5 @@
     3.4  parallel --gnu --minversion 20130222 > /dev/null ||
     3.5  echo $(basename $0): warning: old version of parallel, might be slow 1>&2
     3.6  
     3.7 -exec parallel --gnu --pipe -L4 "$@"
     3.8 +# use a record size of 8 lines, so that paired sequences stay together:
     3.9 +exec parallel --gnu --pipe -L8 "$@"
     4.1 --- a/src/last-pair-probs-main.cc	Tue Dec 09 12:07:36 2014 +0900
     4.2 +++ b/src/last-pair-probs-main.cc	Wed Dec 10 12:01:46 2014 +0900
     4.3 @@ -21,10 +21,11 @@
     4.4    opts.isSdev = false;
     4.5    opts.isDisjoint = false;
     4.6    opts.progName = argv[0];
     4.7 -  
     4.8 +
     4.9    std::string help = "\
    4.10  Usage:\n\
    4.11    " + std::string(argv[0]) + " --help\n\
    4.12 +  " + std::string(argv[0]) + " [options] interleaved-alignments\n\
    4.13    " + std::string(argv[0]) + " [options] alignments1 alignments2\n\
    4.14  \n\
    4.15  Read alignments of paired DNA reads to a genome, and: (1) estimate the\n\
    4.16 @@ -104,14 +105,19 @@
    4.17      }
    4.18    }
    4.19  
    4.20 +  if (optind == argc && !opts.estdist && (!opts.isFraglen || !opts.isSdev)) {
    4.21 +    std::cerr << help;
    4.22 +    throw std::runtime_error("");
    4.23 +  }
    4.24 +
    4.25    if (!opts.isDisjoint) {
    4.26      opts.disjoint = opts.rna ? 0.02 : 0.01;
    4.27    }
    4.28  
    4.29    opts.inputFileNames.assign(argv + optind, argv + argc);
    4.30  
    4.31 -  if (opts.inputFileNames.size() != 2) {
    4.32 -    throw std::runtime_error("please give me two file names");
    4.33 +  if (opts.inputFileNames.size() > 2) {
    4.34 +    throw std::runtime_error("too many file names");
    4.35    }
    4.36  
    4.37    if (!opts.circular.size()) {
     5.1 --- a/src/last-pair-probs.cc	Tue Dec 09 12:07:36 2014 +0900
     5.2 +++ b/src/last-pair-probs.cc	Wed Dec 10 12:01:46 2014 +0900
     5.3 @@ -2,6 +2,8 @@
     5.4  // Copyright 2014 Martin C. Frith
     5.5  
     5.6  #include "last-pair-probs.hh"
     5.7 +
     5.8 +#include "io.hh"
     5.9  #include "stringify.hh"
    5.10  
    5.11  #include <algorithm>
    5.12 @@ -122,12 +124,6 @@
    5.13    return c;
    5.14  }
    5.15  
    5.16 -static std::istream& openIn(const std::string& fileName, std::ifstream& ifs) {
    5.17 -  ifs.open(fileName.c_str());
    5.18 -  if (!ifs) err("can't open file: " + fileName);
    5.19 -  return ifs;
    5.20 -}
    5.21 -
    5.22  static double logSumExp(const double *beg, const double *end) {
    5.23    // Adds numbers, in log space, to avoid overflow.
    5.24    const double m = *std::max_element(beg, end);
    5.25 @@ -611,29 +607,38 @@
    5.26    opts.maxMissingScore2 = (params2.eGet() - 1) / params2.tGet() + maxLogPrior;
    5.27  }
    5.28  
    5.29 +static void skipOneBatchMarker(std::istream& input) {
    5.30 +  std::string line;
    5.31 +  while (std::getline(input, line))
    5.32 +    if (line.find("# batch ") == 0) break;
    5.33 +}
    5.34 +
    5.35  void lastPairProbs(LastPairProbsOptions& opts) {
    5.36 -  std::string fileName1 = opts.inputFileNames[0];
    5.37 -  std::string fileName2 = opts.inputFileNames[1];
    5.38 +  const std::vector<std::string>& inputs = opts.inputFileNames;
    5.39 +  size_t n = inputs.size();
    5.40  
    5.41    if (!opts.isFraglen || !opts.isSdev) {
    5.42      std::ifstream inFile1, inFile2;
    5.43 -    std::istream& in1 = openIn(fileName1, inFile1);
    5.44 -    std::istream& in2 = openIn(fileName2, inFile2);
    5.45 -    std::vector<long> lengths = readQueryPairs1pass(in1, in2, 1.0, 1.0, opts.circular);
    5.46 +    std::istream& in1 = (n > 0) ? cbrc::openIn(inputs[0], inFile1) : std::cin;
    5.47 +    std::istream& in2 = (n > 1) ? cbrc::openIn(inputs[1], inFile2) : in1;
    5.48 +    if (n < 2) skipOneBatchMarker(in1);
    5.49 +    std::vector<long> lengths = readQueryPairs1pass(in1, in2, 1.0, 1.0,
    5.50 +						    opts.circular);
    5.51      estimateFragmentLengthDistribution(lengths, opts);
    5.52    }
    5.53  
    5.54    if (!opts.estdist) {
    5.55      std::ifstream inFile1, inFile2;
    5.56 -    std::istream& in1 = openIn(fileName1, inFile1);
    5.57 -    std::istream& in2 = openIn(fileName2, inFile2);
    5.58 +    std::istream& in1 = (n > 0) ? cbrc::openIn(inputs[0], inFile1) : std::cin;
    5.59 +    std::istream& in2 = (n > 1) ? cbrc::openIn(inputs[1], inFile2) : in1;
    5.60      AlignmentParameters params1 = readHeaderOrDie(in1);
    5.61 -    AlignmentParameters params2 = readHeaderOrDie(in2);
    5.62 +    AlignmentParameters params2 = (n > 1) ? readHeaderOrDie(in2) : params1;
    5.63      calculateScorePieces(opts, params1, params2);
    5.64      std::cout << "# fraglen=" << opts.fraglen
    5.65                << " sdev=" << opts.sdev
    5.66                << " disjoint=" << opts.disjoint
    5.67                << " genome=" << params1.gGet() << "\n";
    5.68 +    if (n < 2) skipOneBatchMarker(in1);
    5.69      readQueryPairs2pass(in1, in2, params1.tGet(), params2.tGet(), opts);
    5.70    }
    5.71  }
     6.1 --- a/src/makefile	Tue Dec 09 12:07:36 2014 +0900
     6.2 +++ b/src/makefile	Wed Dec 10 12:01:46 2014 +0900
     6.3 @@ -38,7 +38,7 @@
     6.4  split/last-split.o split/cbrc_split_aligner.o split/last-split-main.o	\
     6.5  split/cbrc_unsplit_alignment.o
     6.6  
     6.7 -PPOBJ = last-pair-probs.o last-pair-probs-main.o
     6.8 +PPOBJ = last-pair-probs.o last-pair-probs-main.o io.o
     6.9  
    6.10  MBOBJ = last-merge-batches.o
    6.11  
    6.12 @@ -169,7 +169,8 @@
    6.13  io.o: io.cc io.hh
    6.14  last-pair-probs-main.o: last-pair-probs-main.cc last-pair-probs.hh \
    6.15   stringify.hh
    6.16 -last-pair-probs.o: last-pair-probs.cc last-pair-probs.hh
    6.17 +last-pair-probs.o: last-pair-probs.cc last-pair-probs.hh io.hh \
    6.18 + stringify.hh
    6.19  lastal.o: lastal.cc LastalArguments.hh SequenceFormat.hh \
    6.20   QualityPssmMaker.hh ScoreMatrixRow.hh OneQualityScoreMatrix.hh \
    6.21   TwoQualityScoreMatrix.hh qualityScoreUtil.hh stringify.hh \
     7.1 --- a/test/last-pair-test.out	Tue Dec 09 12:07:36 2014 +0900
     7.2 +++ b/test/last-pair-test.out	Wed Dec 10 12:01:46 2014 +0900
     7.3 @@ -1,5 +1,6 @@
     7.4  Usage:
     7.5    last-pair-probs --help
     7.6 +  last-pair-probs [options] interleaved-alignments
     7.7    last-pair-probs [options] alignments1 alignments2
     7.8  
     7.9  Read alignments of paired DNA reads to a genome, and: (1) estimate the