Removed .py and .sh suffixes from script names.
authorMartin C. Frith
Wed Sep 24 13:13:14 2014 +0900 (2014-09-24)
changeset 479e707702f53f8
parent 478 738937827432
child 480 aec8ba9a675a
Removed .py and .sh suffixes from script names.
doc/last-map-probs.txt
doc/last-pair-probs.txt
doc/last-scripts.txt
doc/maf-convert.txt
doc/maf-cull.txt
examples/last-bisulfite-paired.sh
examples/multiMito.sh
scripts/last-dotplot
scripts/last-dotplot.py
scripts/last-map-probs
scripts/last-map-probs.py
scripts/last-pair-probs
scripts/last-pair-probs.py
scripts/maf-convert
scripts/maf-convert.py
scripts/maf-cull
scripts/maf-cull.py
scripts/maf-join
scripts/maf-join.py
scripts/maf-sort
scripts/maf-sort.sh
scripts/maf-swap
scripts/maf-swap.py
test/maf-convert-test.out
test/maf-convert-test.sh
     1.1 --- a/doc/last-map-probs.txt	Wed Sep 24 11:41:48 2014 +0900
     1.2 +++ b/doc/last-map-probs.txt	Wed Sep 24 13:13:14 2014 +0900
     1.3 @@ -16,7 +16,7 @@
     1.4  These commands map DNA reads to the human genome::
     1.5  
     1.6    lastdb -m1111110 hu human/chr*.fa
     1.7 -  lastal -Q1 -e120 hu reads.fastq | last-map-probs.py > myalns.maf
     1.8 +  lastal -Q1 -e120 hu reads.fastq | last-map-probs > myalns.maf
     1.9  
    1.10  Options
    1.11  -------
    1.12 @@ -54,7 +54,7 @@
    1.13  
    1.14  This will run the pipeline on all your CPU cores::
    1.15  
    1.16 -  parallel-fastq "lastal -Q1 -e120 hu | last-map-probs.py" < reads.fastq > myalns.maf
    1.17 +  parallel-fastq "lastal -Q1 -e120 hu | last-map-probs" < reads.fastq > myalns.maf
    1.18  
    1.19  It requires GNU parallel to be installed
    1.20  (http://www.gnu.org/software/parallel/).
     2.1 --- a/doc/last-pair-probs.txt	Wed Sep 24 11:41:48 2014 +0900
     2.2 +++ b/doc/last-pair-probs.txt	Wed Sep 24 13:13:14 2014 +0900
     2.3 @@ -15,12 +15,12 @@
     2.4    lastdb -m1111110 humandb human/chr*.fa
     2.5    lastal -Q1 -e120 -i1 humandb reads1.fastq > temp1.maf
     2.6    lastal -Q1 -e120 -i1 humandb reads2.fastq > temp2.maf
     2.7 -  last-pair-probs.py temp1.maf temp2.maf > results.maf
     2.8 +  last-pair-probs temp1.maf temp2.maf > results.maf
     2.9  
    2.10  If your reads come from potentially-spliced RNA molecules, use the -r
    2.11  option::
    2.12  
    2.13 -  last-pair-probs.py -r temp1.maf temp2.maf > results.maf
    2.14 +  last-pair-probs -r temp1.maf temp2.maf > results.maf
    2.15  
    2.16  Without -r, it assumes the distances between paired reads follow a
    2.17  normal distribution.  With -r, it assumes the distances follow a
    2.18 @@ -114,7 +114,7 @@
    2.19      mkfifo pipe1 pipe2
    2.20      lastal -Q1 -e120 -i1 humandb reads1.fastq > pipe1 &
    2.21      lastal -Q1 -e120 -i1 humandb reads2.fastq > pipe2 &
    2.22 -    last-pair-probs.py -f250 -s30 pipe1 pipe2 > results.maf
    2.23 +    last-pair-probs -f250 -s30 pipe1 pipe2 > results.maf
    2.24  
    2.25    This streams the alignments from lastal to last-pair-probs.
    2.26  
    2.27 @@ -152,7 +152,7 @@
    2.28      #! /bin/sh
    2.29      lastal -Q1 -e120 -i1 "$3" "$4" > /tmp/$$.1
    2.30      lastal -Q1 -e120 -i1 "$3" "$5" > /tmp/$$.2
    2.31 -    last-pair-probs.py -f "$1" -s "$2" /tmp/$$.1 /tmp/$$.2
    2.32 +    last-pair-probs -f "$1" -s "$2" /tmp/$$.1 /tmp/$$.2
    2.33      rm /tmp/$$.*
    2.34  
    2.35  * Set execute permission::
     3.1 --- a/doc/last-scripts.txt	Wed Sep 24 11:41:48 2014 +0900
     3.2 +++ b/doc/last-scripts.txt	Wed Sep 24 13:13:14 2014 +0900
     3.3 @@ -1,31 +1,31 @@
     3.4  Description of scripts that accompany LAST
     3.5  ==========================================
     3.6  
     3.7 -last-dotplot.py
     3.8 ----------------
     3.9 +last-dotplot
    3.10 +------------
    3.11  
    3.12  This script makes a dotplot, a.k.a. Oxford Grid, of pair-wise
    3.13  alignments in MAF or LAST tabular format.  It requires the Python
    3.14  Imaging Library to be installed.  To get a usage message::
    3.15  
    3.16 -  last-dotplot.py --help
    3.17 +  last-dotplot --help
    3.18  
    3.19  To make a png-format dotplot of alignments in a file called "al"::
    3.20  
    3.21 -  last-dotplot.py al al.png
    3.22 +  last-dotplot al al.png
    3.23  
    3.24  To get a nicer font, try something like::
    3.25  
    3.26 -  last-dotplot.py -f /usr/share/fonts/truetype/freefont/FreeSans.ttf al al.png
    3.27 +  last-dotplot -f /usr/share/fonts/truetype/freefont/FreeSans.ttf al al.png
    3.28  
    3.29  If the fonts are located somewhere different on your computer, change
    3.30  this as appropriate.  To turn off the text and margins completely::
    3.31  
    3.32 -  last-dotplot.py -s0 al al.png
    3.33 +  last-dotplot -s0 al al.png
    3.34  
    3.35  To limit the plot to 500x500 pixels::
    3.36  
    3.37 -  last-dotplot.py -x500 -y500 al al.png
    3.38 +  last-dotplot -x500 -y500 al al.png
    3.39  
    3.40  If there are too many chromosomes, the dotplot will be very cluttered,
    3.41  or the script might give up with an error message.  So you may want to
    3.42 @@ -34,38 +34,38 @@
    3.43  chromosomes with names like "chr1_random"::
    3.44  
    3.45    grep -v 'random' al > plotme
    3.46 -  last-dotplot.py plotme plotme.png
    3.47 +  last-dotplot plotme plotme.png
    3.48  
    3.49  
    3.50 -maf-join.py
    3.51 ------------
    3.52 +maf-join
    3.53 +--------
    3.54  
    3.55  This script joins two or more sets of pairwise (or multiple)
    3.56  alignments into multiple alignments::
    3.57  
    3.58 -  maf-join.py aln1.maf aln2.maf aln3.maf > joined.maf
    3.59 +  maf-join aln1.maf aln2.maf aln3.maf > joined.maf
    3.60  
    3.61  The top genome in each input file should be the same, and the script
    3.62  simply joins alignment columns that are at the same position in the
    3.63  top genome.  IMPORTANT LIMITATION: alignment columns with gaps in the
    3.64  top sequence get joined arbitrarily, and probably wrongly.  Please
    3.65  disregard such columns in downstream analyses.  Each input file must
    3.66 -have been sorted using maf-sort.sh.  For an example of using LAST and
    3.67 -maf-join.py, see multiMito.sh in the examples directory.
    3.68 +have been sorted using maf-sort.  For an example of using LAST and
    3.69 +maf-join, see multiMito.sh in the examples directory.
    3.70  
    3.71  
    3.72 -maf-swap.py
    3.73 ------------
    3.74 +maf-swap
    3.75 +--------
    3.76  
    3.77  This script changes the order of the sequences in MAF-format
    3.78  alignments.  You can use option "-n" to move the "n"th sequence to the
    3.79  top (it defaults to 2)::
    3.80  
    3.81 -  maf-swap.py -n3 my-alignments.maf > my-swapped.maf
    3.82 +  maf-swap -n3 my-alignments.maf > my-swapped.maf
    3.83  
    3.84  
    3.85 -maf-sort.sh
    3.86 ------------
    3.87 +maf-sort
    3.88 +--------
    3.89  
    3.90  This sorts MAF-format alignments by sequence name, then strand, then
    3.91  start position, then end position, of the top sequence.  You can use
    3.92 @@ -80,4 +80,4 @@
    3.93     complex MAF data from elsewhere.
    3.94  
    3.95  2) These scripts do not work for DNA-versus-protein alignments:
    3.96 -   last-dotplot.py, maf-join.py.
    3.97 +   last-dotplot, maf-join.
     4.1 --- a/doc/maf-convert.txt	Wed Sep 24 11:41:48 2014 +0900
     4.2 +++ b/doc/maf-convert.txt	Wed Sep 24 13:13:14 2014 +0900
     4.3 @@ -5,11 +5,11 @@
     4.4  format.  It can write them in these formats: axt, blast, html, psl,
     4.5  sam, tab.  You can use it like this::
     4.6  
     4.7 -  maf-convert.py psl my-alignments.maf > my-alignments.psl
     4.8 +  maf-convert psl my-alignments.maf > my-alignments.psl
     4.9  
    4.10  It's often convenient to pipe in the input, like this::
    4.11  
    4.12 -  ... | maf-convert.py psl > my-alignments.psl
    4.13 +  ... | maf-convert psl > my-alignments.psl
    4.14  
    4.15  The input should be "multiple alignment format" as described in the
    4.16  UCSC Genome FAQ (not "MIRA assembly format" or any other maf).
    4.17 @@ -63,7 +63,7 @@
    4.18    using CreateSequenceDictionary).  Then, concatenate the output of a
    4.19    command like this::
    4.20  
    4.21 -    parallel-fastq "... | maf-convert.py -n sam" < q.fastq
    4.22 +    parallel-fastq "... | maf-convert -n sam" < q.fastq
    4.23  
    4.24  * Here is yet another way to get a sequence dictionary, using samtools
    4.25    (http://samtools.sourceforge.net/).  Assume the reference sequences
     5.1 --- a/doc/maf-cull.txt	Wed Sep 24 11:41:48 2014 +0900
     5.2 +++ b/doc/maf-cull.txt	Wed Sep 24 13:13:14 2014 +0900
     5.3 @@ -13,8 +13,8 @@
     5.4  
     5.5  You can use it like this::
     5.6  
     5.7 -  maf-cull.py --limit=3 sorted-alignments.maf > culled.maf
     5.8 -  maf-cull.py -l3 sorted-alignments.maf > culled.maf
     5.9 +  maf-cull --limit=3 sorted-alignments.maf > culled.maf
    5.10 +  maf-cull -l3 sorted-alignments.maf > culled.maf
    5.11  
    5.12  If you want to cull based on a sequence other than the top one, use
    5.13  maf-swap to bring that sequence to the top.
     6.1 --- a/examples/last-bisulfite-paired.sh	Wed Sep 24 11:41:48 2014 +0900
     6.2 +++ b/examples/last-bisulfite-paired.sh	Wed Sep 24 13:13:14 2014 +0900
     6.3 @@ -40,7 +40,7 @@
     6.4  last-merge-batches $t.t2f $t.t2r > $t.t2
     6.5  rm $t.t2f $t.t2r
     6.6  
     6.7 -last-pair-probs.py -m0.1 $t.t1 $t.t2 |
     6.8 +last-pair-probs -m0.1 $t.t1 $t.t2 |
     6.9  perl -F'(\s+)' -ane '$F[12] =~ y/ta/CG/ if /^s/ and $s++ % 2; print @F'
    6.10  rm $t.t1 $t.t2
    6.11  EOF
     7.1 --- a/examples/multiMito.sh	Wed Sep 24 11:41:48 2014 +0900
     7.2 +++ b/examples/multiMito.sh	Wed Sep 24 13:13:14 2014 +0900
     7.3 @@ -1,6 +1,6 @@
     7.4  #! /bin/sh
     7.5  
     7.6 -# This script demonstrates using LAST and maf-join.py to construct a
     7.7 +# This script demonstrates using LAST and maf-join to construct a
     7.8  # multiple alignment of the human, mouse, chicken, and fugu
     7.9  # mitochondrial genomes.
    7.10  
    7.11 @@ -14,16 +14,16 @@
    7.12  lastdb -c humanMito humanMito.fa
    7.13  
    7.14  # Align the mouse sequence to the human sequence:
    7.15 -lastal -e25 -j4 humanMito mouseMito.fa | last-split | maf-sort.sh > hm.maf
    7.16 +lastal -e25 -j4 humanMito mouseMito.fa | last-split | maf-sort > hm.maf
    7.17  
    7.18  # Align the chicken sequence to the human sequence:
    7.19 -lastal -e25 -j4 humanMito chickenMito.fa | last-split | maf-sort.sh > hc.maf
    7.20 +lastal -e25 -j4 humanMito chickenMito.fa | last-split | maf-sort > hc.maf
    7.21  
    7.22  # Align the fugu sequence to the human sequence:
    7.23 -lastal -e25 -j4 humanMito fuguMito.fa | last-split | maf-sort.sh > hf.maf
    7.24 +lastal -e25 -j4 humanMito fuguMito.fa | last-split | maf-sort > hf.maf
    7.25  
    7.26  # Join the pairwise alignments into a multiple alignment:
    7.27 -maf-join.py hm.maf hc.maf hf.maf
    7.28 +maf-join hm.maf hc.maf hf.maf
    7.29  
    7.30  # Clean up the intermediate files that we made:
    7.31  rm humanMito.??? h?.maf
     8.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     8.2 +++ b/scripts/last-dotplot	Wed Sep 24 13:13:14 2014 +0900
     8.3 @@ -0,0 +1,271 @@
     8.4 +#! /usr/bin/env python
     8.5 +
     8.6 +# Read pair-wise alignments in MAF or LAST tabular format: write an
     8.7 +# "Oxford grid", a.k.a. dotplot.
     8.8 +
     8.9 +# TODO: Currently, pixels with zero aligned nt-pairs are white, and
    8.10 +# pixels with one or more aligned nt-pairs are black.  This can look
    8.11 +# too crowded for large genome alignments.  I tried shading each pixel
    8.12 +# according to the number of aligned nt-pairs within it, but the
    8.13 +# result is too faint.  How can this be done better?
    8.14 +
    8.15 +import sys, os, re, itertools, optparse
    8.16 +
    8.17 +# Try to make PIL/PILLOW work:
    8.18 +try: from PIL import Image, ImageDraw, ImageFont, ImageColor
    8.19 +except ImportError: import Image, ImageDraw, ImageFont, ImageColor
    8.20 +
    8.21 +my_name = os.path.basename(sys.argv[0])
    8.22 +usage = """
    8.23 +  %prog --help
    8.24 +  %prog [options] last-tabular-output dotplot.png
    8.25 +  %prog [options] last-tabular-output dotplot.gif
    8.26 +  etc."""
    8.27 +parser = optparse.OptionParser(usage=usage)
    8.28 +# Replace "width" & "height" with a single "length" option?
    8.29 +parser.add_option("-x", "--width", type="int", dest="width", default=1000,
    8.30 +                  help="maximum width in pixels (default: %default)")
    8.31 +parser.add_option("-y", "--height", type="int", dest="height", default=1000,
    8.32 +                  help="maximum height in pixels (default: %default)")
    8.33 +parser.add_option("-f", "--fontfile", dest="fontfile",
    8.34 +                  help="TrueType or OpenType font file")
    8.35 +parser.add_option("-s", "--fontsize", type="int", dest="fontsize", default=11,
    8.36 +                  help="TrueType or OpenType font size (default: %default)")
    8.37 +parser.add_option("-c", "--forwardcolor", dest="forwardcolor", default="red",
    8.38 +                  help="Color for forward alignments (default: %default)")
    8.39 +parser.add_option("-r", "--reversecolor", dest="reversecolor", default="blue",
    8.40 +                  help="Color for reverse alignments (default: %default)")
    8.41 +(opts, args) = parser.parse_args()
    8.42 +if len(args) != 2: parser.error("2 arguments needed")
    8.43 +
    8.44 +if opts.fontfile:  font = ImageFont.truetype(opts.fontfile, opts.fontsize)
    8.45 +else:              font = ImageFont.load_default()
    8.46 +
    8.47 +# Make these options too?
    8.48 +text_color = "black"
    8.49 +background_color = "white"
    8.50 +pix_tween_seqs = 2  # number of border pixels between sequences
    8.51 +border_shade = 239, 239, 239  # the shade of grey to use for border pixels
    8.52 +label_space = 5     # minimum number of pixels between axis labels
    8.53 +
    8.54 +image_mode = 'RGB'
    8.55 +forward_color = ImageColor.getcolor(opts.forwardcolor, image_mode)
    8.56 +reverse_color = ImageColor.getcolor(opts.reversecolor, image_mode)
    8.57 +overlap_color = tuple([(i+j)//2 for i, j in zip(forward_color, reverse_color)])
    8.58 +
    8.59 +def isGapless(alignmentColumn):
    8.60 +    return "-" not in alignmentColumn
    8.61 +
    8.62 +def matchAndInsertLengths(alignmentColumns):
    8.63 +    for k, v in itertools.groupby(alignmentColumns, isGapless):
    8.64 +        if k:
    8.65 +            matchLength = sum(1 for i in v)
    8.66 +            yield str(matchLength)
    8.67 +        else:
    8.68 +            blockRows = itertools.izip(*v)
    8.69 +            insertLengths = (len(i) - i.count("-") for i in blockRows)
    8.70 +            yield ":".join(map(str, insertLengths))
    8.71 +
    8.72 +def alignmentInput(lines):  # read alignments in either tabular or MAF format
    8.73 +    for line in lines:
    8.74 +        w = line.split()
    8.75 +        if line[0].isdigit():  # tabular format
    8.76 +            yield w
    8.77 +        elif line[0] == "a":  # MAF format
    8.78 +            sLines = []
    8.79 +        elif line[0] == "s":  # MAF format
    8.80 +            sLines.append(w)
    8.81 +            if len(sLines) == 2:
    8.82 +                alignmentRows = (i[6] for i in sLines)
    8.83 +                alignmentColumns = itertools.izip(*alignmentRows)
    8.84 +                blocks = ",".join(matchAndInsertLengths(alignmentColumns))
    8.85 +                yield sLines[0][0:6] + sLines[1][1:6] + [blocks]
    8.86 +
    8.87 +seq_size_dic1 = {}  # sizes of the first set of sequences
    8.88 +seq_size_dic2 = {}  # sizes of the second set of sequences
    8.89 +alignments = []
    8.90 +
    8.91 +f = open(args[0])
    8.92 +sys.stderr.write(my_name + ": reading alignments...\n")
    8.93 +for w in alignmentInput(f):
    8.94 +    seq1, pos1, strand1, size1 = w[1], int(w[2]), w[4], int(w[5])
    8.95 +    seq2, pos2, strand2, size2 = w[6], int(w[7]), w[9], int(w[10])
    8.96 +    blocks = w[11]
    8.97 +    seq_size_dic1[seq1] = size1
    8.98 +    seq_size_dic2[seq2] = size2
    8.99 +    aln = seq1, seq2, pos1, pos2, strand1, strand2, blocks
   8.100 +    alignments.append(aln)
   8.101 +sys.stderr.write(my_name + ": done\n")
   8.102 +f.close()
   8.103 +
   8.104 +if not alignments:
   8.105 +    sys.exit(my_name + ": there are no alignments")
   8.106 +
   8.107 +def natural_sort_key(my_string):
   8.108 +    '''Return a sort key for "natural" ordering, e.g. chr9 < chr10.'''
   8.109 +    parts = re.split(r'(\d+)', my_string)
   8.110 +    parts[1::2] = map(int, parts[1::2])
   8.111 +    return parts
   8.112 +
   8.113 +def get_text_sizes(my_strings):
   8.114 +    '''Get widths & heights, in pixels, of some strings.'''
   8.115 +    if opts.fontsize == 0: return [(0, 0) for i in my_strings]
   8.116 +    image_size = 1, 1
   8.117 +    im = Image.new(image_mode, image_size)
   8.118 +    draw = ImageDraw.Draw(im)
   8.119 +    return [draw.textsize(i, font=font) for i in my_strings]
   8.120 +
   8.121 +def get_seq_info(seq_size_dic):
   8.122 +    '''Return miscellaneous information about the sequences.'''
   8.123 +    seq_names = seq_size_dic.keys()
   8.124 +    seq_names.sort(key=natural_sort_key)
   8.125 +    seq_sizes = [seq_size_dic[i] for i in seq_names]
   8.126 +    name_sizes = get_text_sizes(seq_names)
   8.127 +    margin = max(zip(*name_sizes)[1])  # maximum text height
   8.128 +    return seq_names, seq_sizes, name_sizes, margin
   8.129 +
   8.130 +seq_names1, seq_sizes1, name_sizes1, margin1 = get_seq_info(seq_size_dic1)
   8.131 +seq_names2, seq_sizes2, name_sizes2, margin2 = get_seq_info(seq_size_dic2)
   8.132 +
   8.133 +def div_ceil(x, y):
   8.134 +    '''Return x / y rounded up.'''
   8.135 +    q, r = divmod(x, y)
   8.136 +    return q + (r != 0)
   8.137 +
   8.138 +def tot_seq_pix(seq_sizes, bp_per_pix):
   8.139 +    '''Return the total pixels needed for sequences of the given sizes.'''
   8.140 +    return sum([div_ceil(i, bp_per_pix) for i in seq_sizes])
   8.141 +
   8.142 +def get_bp_per_pix(seq_sizes, pix_limit):
   8.143 +    '''Get the minimum bp-per-pixel that fits in the size limit.'''
   8.144 +    seq_num = len(seq_sizes)
   8.145 +    seq_pix_limit = pix_limit - pix_tween_seqs * (seq_num - 1)
   8.146 +    if seq_pix_limit < seq_num:
   8.147 +        sys.exit(my_name + ": can't fit the image: too many sequences?")
   8.148 +    lower_bound = div_ceil(sum(seq_sizes), seq_pix_limit)
   8.149 +    for bp_per_pix in itertools.count(lower_bound):  # slow linear search
   8.150 +        if tot_seq_pix(seq_sizes, bp_per_pix) <= seq_pix_limit: break
   8.151 +    return bp_per_pix
   8.152 +
   8.153 +sys.stderr.write(my_name + ": choosing bp per pixel...\n")
   8.154 +bp_per_pix1 = get_bp_per_pix(seq_sizes1, opts.width  - margin1)
   8.155 +bp_per_pix2 = get_bp_per_pix(seq_sizes2, opts.height - margin2)
   8.156 +bp_per_pix = max(bp_per_pix1, bp_per_pix2)
   8.157 +sys.stderr.write(my_name + ": bp per pixel = " + str(bp_per_pix) + "\n")
   8.158 +
   8.159 +def get_seq_starts(seq_pix, pix_tween_seqs, margin):
   8.160 +    '''Get the start pixel for each sequence.'''
   8.161 +    seq_starts = []
   8.162 +    pix_tot = margin - pix_tween_seqs
   8.163 +    for i in seq_pix:
   8.164 +        pix_tot += pix_tween_seqs
   8.165 +        seq_starts.append(pix_tot)
   8.166 +        pix_tot += i
   8.167 +    return seq_starts
   8.168 +
   8.169 +def get_pix_info(seq_sizes, margin):
   8.170 +    '''Return pixel information about the sequences.'''
   8.171 +    seq_pix = [div_ceil(i, bp_per_pix) for i in seq_sizes]
   8.172 +    seq_starts = get_seq_starts(seq_pix, pix_tween_seqs, margin)
   8.173 +    tot_pix = seq_starts[-1] + seq_pix[-1]
   8.174 +    return seq_pix, seq_starts, tot_pix
   8.175 +
   8.176 +seq_pix1, seq_starts1, width  = get_pix_info(seq_sizes1, margin1)
   8.177 +seq_pix2, seq_starts2, height = get_pix_info(seq_sizes2, margin2)
   8.178 +seq_start_dic1 = dict(zip(seq_names1, seq_starts1))
   8.179 +seq_start_dic2 = dict(zip(seq_names2, seq_starts2))
   8.180 +hits = [0] * (width * height)  # the image data
   8.181 +
   8.182 +sys.stderr.write(my_name + ": processing alignments...\n")
   8.183 +for aln in alignments:
   8.184 +    seq1, seq2, pos1, pos2, strand1, strand2, blocks = aln
   8.185 +    last1 = seq_size_dic1[seq1] - 1
   8.186 +    last2 = seq_size_dic2[seq2] - 1
   8.187 +    seq_start1 = seq_start_dic1[seq1]
   8.188 +    seq_start2 = seq_start_dic2[seq2]
   8.189 +    my_start = seq_start2 * width + seq_start1
   8.190 +    if strand1 == strand2: store_value = 1
   8.191 +    else:                  store_value = 2
   8.192 +    for i in blocks.split(","):
   8.193 +        if ":" in i:  # it's a gap region: skip over it
   8.194 +            insertLength1, insertLength2 = i.split(":")
   8.195 +            pos1 += int(insertLength1)
   8.196 +            pos2 += int(insertLength2)
   8.197 +        else:  # it's a match region: draw pixels for it
   8.198 +            matchLength = int(i)
   8.199 +            end1 = pos1 + matchLength
   8.200 +            end2 = pos2 + matchLength
   8.201 +            if strand1 == '+': j = xrange(pos1, end1)
   8.202 +            else:              j = xrange(last1 - pos1, last1 - end1, -1)
   8.203 +            if strand2 == '+': k = xrange(pos2, end2)
   8.204 +            else:              k = xrange(last2 - pos2, last2 - end2, -1)
   8.205 +            for real_pos1, real_pos2 in itertools.izip(j, k):
   8.206 +                pix1 = real_pos1 // bp_per_pix
   8.207 +                pix2 = real_pos2 // bp_per_pix
   8.208 +                hits[my_start + pix2 * width + pix1] |= store_value
   8.209 +            pos1 = end1
   8.210 +            pos2 = end2
   8.211 +sys.stderr.write(my_name + ": done\n")
   8.212 +
   8.213 +def make_label(text, text_size, range_start, range_size):
   8.214 +    '''Return an axis label with endpoint & sort-order information.'''
   8.215 +    text_width  = text_size[0]
   8.216 +    label_start = range_start + (range_size - text_width) // 2
   8.217 +    label_end   = label_start + text_width
   8.218 +    sort_key    = text_width - range_size
   8.219 +    return sort_key, label_start, label_end, text
   8.220 +
   8.221 +def get_nonoverlapping_labels(labels):
   8.222 +    '''Get a subset of non-overlapping axis labels, greedily.'''
   8.223 +    nonoverlapping_labels = []
   8.224 +    for i in labels:
   8.225 +        if True not in [i[1] < j[2] + label_space and j[1] < i[2] + label_space
   8.226 +                        for j in nonoverlapping_labels]:
   8.227 +            nonoverlapping_labels.append(i)
   8.228 +    return nonoverlapping_labels
   8.229 +
   8.230 +def get_axis_image(seq_names, name_sizes, seq_starts, seq_pix):
   8.231 +    '''Make an image of axis labels.'''
   8.232 +    min_pos = seq_starts[0]
   8.233 +    max_pos = seq_starts[-1] + seq_pix[-1]
   8.234 +    height = max(zip(*name_sizes)[1])
   8.235 +    labels = [make_label(i, j, k, l) for i, j, k, l in
   8.236 +              zip(seq_names, name_sizes, seq_starts, seq_pix)]
   8.237 +    labels = [i for i in labels if i[1] >= min_pos and i[2] <= max_pos]
   8.238 +    labels.sort()
   8.239 +    labels = get_nonoverlapping_labels(labels)
   8.240 +    image_size = max_pos, height
   8.241 +    im = Image.new(image_mode, image_size, border_shade)
   8.242 +    draw = ImageDraw.Draw(im)
   8.243 +    for i in labels:
   8.244 +        position = i[1], 0
   8.245 +        draw.text(position, i[3], font=font, fill=text_color)
   8.246 +    return im
   8.247 +
   8.248 +image_size = width, height
   8.249 +im = Image.new(image_mode, image_size, background_color)
   8.250 +
   8.251 +for i in range(height):
   8.252 +    for j in range(width):
   8.253 +        store_value = hits[i * width + j]
   8.254 +        xy = j, i
   8.255 +        if   store_value == 1: im.putpixel(xy, forward_color)
   8.256 +        elif store_value == 2: im.putpixel(xy, reverse_color)
   8.257 +        elif store_value == 3: im.putpixel(xy, overlap_color)
   8.258 +
   8.259 +if opts.fontsize != 0:
   8.260 +    axis1 = get_axis_image(seq_names1, name_sizes1, seq_starts1, seq_pix1)
   8.261 +    axis2 = get_axis_image(seq_names2, name_sizes2, seq_starts2, seq_pix2)
   8.262 +    axis2 = axis2.rotate(270)
   8.263 +    im.paste(axis1, (0, 0))
   8.264 +    im.paste(axis2, (0, 0))
   8.265 +
   8.266 +for i in seq_starts1[1:]:
   8.267 +    box = i - pix_tween_seqs, margin2, i, height
   8.268 +    im.paste(border_shade, box)
   8.269 +
   8.270 +for i in seq_starts2[1:]:
   8.271 +    box = margin1, i - pix_tween_seqs, width, i
   8.272 +    im.paste(border_shade, box)
   8.273 +
   8.274 +im.save(args[1])
     9.1 --- a/scripts/last-dotplot.py	Wed Sep 24 11:41:48 2014 +0900
     9.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
     9.3 @@ -1,271 +0,0 @@
     9.4 -#! /usr/bin/env python
     9.5 -
     9.6 -# Read pair-wise alignments in MAF or LAST tabular format: write an
     9.7 -# "Oxford grid", a.k.a. dotplot.
     9.8 -
     9.9 -# TODO: Currently, pixels with zero aligned nt-pairs are white, and
    9.10 -# pixels with one or more aligned nt-pairs are black.  This can look
    9.11 -# too crowded for large genome alignments.  I tried shading each pixel
    9.12 -# according to the number of aligned nt-pairs within it, but the
    9.13 -# result is too faint.  How can this be done better?
    9.14 -
    9.15 -import sys, os, re, itertools, optparse
    9.16 -
    9.17 -# Try to make PIL/PILLOW work:
    9.18 -try: from PIL import Image, ImageDraw, ImageFont, ImageColor
    9.19 -except ImportError: import Image, ImageDraw, ImageFont, ImageColor
    9.20 -
    9.21 -my_name = os.path.basename(sys.argv[0])
    9.22 -usage = """
    9.23 -  %prog --help
    9.24 -  %prog [options] last-tabular-output dotplot.png
    9.25 -  %prog [options] last-tabular-output dotplot.gif
    9.26 -  etc."""
    9.27 -parser = optparse.OptionParser(usage=usage)
    9.28 -# Replace "width" & "height" with a single "length" option?
    9.29 -parser.add_option("-x", "--width", type="int", dest="width", default=1000,
    9.30 -                  help="maximum width in pixels (default: %default)")
    9.31 -parser.add_option("-y", "--height", type="int", dest="height", default=1000,
    9.32 -                  help="maximum height in pixels (default: %default)")
    9.33 -parser.add_option("-f", "--fontfile", dest="fontfile",
    9.34 -                  help="TrueType or OpenType font file")
    9.35 -parser.add_option("-s", "--fontsize", type="int", dest="fontsize", default=11,
    9.36 -                  help="TrueType or OpenType font size (default: %default)")
    9.37 -parser.add_option("-c", "--forwardcolor", dest="forwardcolor", default="red",
    9.38 -                  help="Color for forward alignments (default: %default)")
    9.39 -parser.add_option("-r", "--reversecolor", dest="reversecolor", default="blue",
    9.40 -                  help="Color for reverse alignments (default: %default)")
    9.41 -(opts, args) = parser.parse_args()
    9.42 -if len(args) != 2: parser.error("2 arguments needed")
    9.43 -
    9.44 -if opts.fontfile:  font = ImageFont.truetype(opts.fontfile, opts.fontsize)
    9.45 -else:              font = ImageFont.load_default()
    9.46 -
    9.47 -# Make these options too?
    9.48 -text_color = "black"
    9.49 -background_color = "white"
    9.50 -pix_tween_seqs = 2  # number of border pixels between sequences
    9.51 -border_shade = 239, 239, 239  # the shade of grey to use for border pixels
    9.52 -label_space = 5     # minimum number of pixels between axis labels
    9.53 -
    9.54 -image_mode = 'RGB'
    9.55 -forward_color = ImageColor.getcolor(opts.forwardcolor, image_mode)
    9.56 -reverse_color = ImageColor.getcolor(opts.reversecolor, image_mode)
    9.57 -overlap_color = tuple([(i+j)//2 for i, j in zip(forward_color, reverse_color)])
    9.58 -
    9.59 -def isGapless(alignmentColumn):
    9.60 -    return "-" not in alignmentColumn
    9.61 -
    9.62 -def matchAndInsertLengths(alignmentColumns):
    9.63 -    for k, v in itertools.groupby(alignmentColumns, isGapless):
    9.64 -        if k:
    9.65 -            matchLength = sum(1 for i in v)
    9.66 -            yield str(matchLength)
    9.67 -        else:
    9.68 -            blockRows = itertools.izip(*v)
    9.69 -            insertLengths = (len(i) - i.count("-") for i in blockRows)
    9.70 -            yield ":".join(map(str, insertLengths))
    9.71 -
    9.72 -def alignmentInput(lines):  # read alignments in either tabular or MAF format
    9.73 -    for line in lines:
    9.74 -        w = line.split()
    9.75 -        if line[0].isdigit():  # tabular format
    9.76 -            yield w
    9.77 -        elif line[0] == "a":  # MAF format
    9.78 -            sLines = []
    9.79 -        elif line[0] == "s":  # MAF format
    9.80 -            sLines.append(w)
    9.81 -            if len(sLines) == 2:
    9.82 -                alignmentRows = (i[6] for i in sLines)
    9.83 -                alignmentColumns = itertools.izip(*alignmentRows)
    9.84 -                blocks = ",".join(matchAndInsertLengths(alignmentColumns))
    9.85 -                yield sLines[0][0:6] + sLines[1][1:6] + [blocks]
    9.86 -
    9.87 -seq_size_dic1 = {}  # sizes of the first set of sequences
    9.88 -seq_size_dic2 = {}  # sizes of the second set of sequences
    9.89 -alignments = []
    9.90 -
    9.91 -f = open(args[0])
    9.92 -sys.stderr.write(my_name + ": reading alignments...\n")
    9.93 -for w in alignmentInput(f):
    9.94 -    seq1, pos1, strand1, size1 = w[1], int(w[2]), w[4], int(w[5])
    9.95 -    seq2, pos2, strand2, size2 = w[6], int(w[7]), w[9], int(w[10])
    9.96 -    blocks = w[11]
    9.97 -    seq_size_dic1[seq1] = size1
    9.98 -    seq_size_dic2[seq2] = size2
    9.99 -    aln = seq1, seq2, pos1, pos2, strand1, strand2, blocks
   9.100 -    alignments.append(aln)
   9.101 -sys.stderr.write(my_name + ": done\n")
   9.102 -f.close()
   9.103 -
   9.104 -if not alignments:
   9.105 -    sys.exit(my_name + ": there are no alignments")
   9.106 -
   9.107 -def natural_sort_key(my_string):
   9.108 -    '''Return a sort key for "natural" ordering, e.g. chr9 < chr10.'''
   9.109 -    parts = re.split(r'(\d+)', my_string)
   9.110 -    parts[1::2] = map(int, parts[1::2])
   9.111 -    return parts
   9.112 -
   9.113 -def get_text_sizes(my_strings):
   9.114 -    '''Get widths & heights, in pixels, of some strings.'''
   9.115 -    if opts.fontsize == 0: return [(0, 0) for i in my_strings]
   9.116 -    image_size = 1, 1
   9.117 -    im = Image.new(image_mode, image_size)
   9.118 -    draw = ImageDraw.Draw(im)
   9.119 -    return [draw.textsize(i, font=font) for i in my_strings]
   9.120 -
   9.121 -def get_seq_info(seq_size_dic):
   9.122 -    '''Return miscellaneous information about the sequences.'''
   9.123 -    seq_names = seq_size_dic.keys()
   9.124 -    seq_names.sort(key=natural_sort_key)
   9.125 -    seq_sizes = [seq_size_dic[i] for i in seq_names]
   9.126 -    name_sizes = get_text_sizes(seq_names)
   9.127 -    margin = max(zip(*name_sizes)[1])  # maximum text height
   9.128 -    return seq_names, seq_sizes, name_sizes, margin
   9.129 -
   9.130 -seq_names1, seq_sizes1, name_sizes1, margin1 = get_seq_info(seq_size_dic1)
   9.131 -seq_names2, seq_sizes2, name_sizes2, margin2 = get_seq_info(seq_size_dic2)
   9.132 -
   9.133 -def div_ceil(x, y):
   9.134 -    '''Return x / y rounded up.'''
   9.135 -    q, r = divmod(x, y)
   9.136 -    return q + (r != 0)
   9.137 -
   9.138 -def tot_seq_pix(seq_sizes, bp_per_pix):
   9.139 -    '''Return the total pixels needed for sequences of the given sizes.'''
   9.140 -    return sum([div_ceil(i, bp_per_pix) for i in seq_sizes])
   9.141 -
   9.142 -def get_bp_per_pix(seq_sizes, pix_limit):
   9.143 -    '''Get the minimum bp-per-pixel that fits in the size limit.'''
   9.144 -    seq_num = len(seq_sizes)
   9.145 -    seq_pix_limit = pix_limit - pix_tween_seqs * (seq_num - 1)
   9.146 -    if seq_pix_limit < seq_num:
   9.147 -        sys.exit(my_name + ": can't fit the image: too many sequences?")
   9.148 -    lower_bound = div_ceil(sum(seq_sizes), seq_pix_limit)
   9.149 -    for bp_per_pix in itertools.count(lower_bound):  # slow linear search
   9.150 -        if tot_seq_pix(seq_sizes, bp_per_pix) <= seq_pix_limit: break
   9.151 -    return bp_per_pix
   9.152 -
   9.153 -sys.stderr.write(my_name + ": choosing bp per pixel...\n")
   9.154 -bp_per_pix1 = get_bp_per_pix(seq_sizes1, opts.width  - margin1)
   9.155 -bp_per_pix2 = get_bp_per_pix(seq_sizes2, opts.height - margin2)
   9.156 -bp_per_pix = max(bp_per_pix1, bp_per_pix2)
   9.157 -sys.stderr.write(my_name + ": bp per pixel = " + str(bp_per_pix) + "\n")
   9.158 -
   9.159 -def get_seq_starts(seq_pix, pix_tween_seqs, margin):
   9.160 -    '''Get the start pixel for each sequence.'''
   9.161 -    seq_starts = []
   9.162 -    pix_tot = margin - pix_tween_seqs
   9.163 -    for i in seq_pix:
   9.164 -        pix_tot += pix_tween_seqs
   9.165 -        seq_starts.append(pix_tot)
   9.166 -        pix_tot += i
   9.167 -    return seq_starts
   9.168 -
   9.169 -def get_pix_info(seq_sizes, margin):
   9.170 -    '''Return pixel information about the sequences.'''
   9.171 -    seq_pix = [div_ceil(i, bp_per_pix) for i in seq_sizes]
   9.172 -    seq_starts = get_seq_starts(seq_pix, pix_tween_seqs, margin)
   9.173 -    tot_pix = seq_starts[-1] + seq_pix[-1]
   9.174 -    return seq_pix, seq_starts, tot_pix
   9.175 -
   9.176 -seq_pix1, seq_starts1, width  = get_pix_info(seq_sizes1, margin1)
   9.177 -seq_pix2, seq_starts2, height = get_pix_info(seq_sizes2, margin2)
   9.178 -seq_start_dic1 = dict(zip(seq_names1, seq_starts1))
   9.179 -seq_start_dic2 = dict(zip(seq_names2, seq_starts2))
   9.180 -hits = [0] * (width * height)  # the image data
   9.181 -
   9.182 -sys.stderr.write(my_name + ": processing alignments...\n")
   9.183 -for aln in alignments:
   9.184 -    seq1, seq2, pos1, pos2, strand1, strand2, blocks = aln
   9.185 -    last1 = seq_size_dic1[seq1] - 1
   9.186 -    last2 = seq_size_dic2[seq2] - 1
   9.187 -    seq_start1 = seq_start_dic1[seq1]
   9.188 -    seq_start2 = seq_start_dic2[seq2]
   9.189 -    my_start = seq_start2 * width + seq_start1
   9.190 -    if strand1 == strand2: store_value = 1
   9.191 -    else:                  store_value = 2
   9.192 -    for i in blocks.split(","):
   9.193 -        if ":" in i:  # it's a gap region: skip over it
   9.194 -            insertLength1, insertLength2 = i.split(":")
   9.195 -            pos1 += int(insertLength1)
   9.196 -            pos2 += int(insertLength2)
   9.197 -        else:  # it's a match region: draw pixels for it
   9.198 -            matchLength = int(i)
   9.199 -            end1 = pos1 + matchLength
   9.200 -            end2 = pos2 + matchLength
   9.201 -            if strand1 == '+': j = xrange(pos1, end1)
   9.202 -            else:              j = xrange(last1 - pos1, last1 - end1, -1)
   9.203 -            if strand2 == '+': k = xrange(pos2, end2)
   9.204 -            else:              k = xrange(last2 - pos2, last2 - end2, -1)
   9.205 -            for real_pos1, real_pos2 in itertools.izip(j, k):
   9.206 -                pix1 = real_pos1 // bp_per_pix
   9.207 -                pix2 = real_pos2 // bp_per_pix
   9.208 -                hits[my_start + pix2 * width + pix1] |= store_value
   9.209 -            pos1 = end1
   9.210 -            pos2 = end2
   9.211 -sys.stderr.write(my_name + ": done\n")
   9.212 -
   9.213 -def make_label(text, text_size, range_start, range_size):
   9.214 -    '''Return an axis label with endpoint & sort-order information.'''
   9.215 -    text_width  = text_size[0]
   9.216 -    label_start = range_start + (range_size - text_width) // 2
   9.217 -    label_end   = label_start + text_width
   9.218 -    sort_key    = text_width - range_size
   9.219 -    return sort_key, label_start, label_end, text
   9.220 -
   9.221 -def get_nonoverlapping_labels(labels):
   9.222 -    '''Get a subset of non-overlapping axis labels, greedily.'''
   9.223 -    nonoverlapping_labels = []
   9.224 -    for i in labels:
   9.225 -        if True not in [i[1] < j[2] + label_space and j[1] < i[2] + label_space
   9.226 -                        for j in nonoverlapping_labels]:
   9.227 -            nonoverlapping_labels.append(i)
   9.228 -    return nonoverlapping_labels
   9.229 -
   9.230 -def get_axis_image(seq_names, name_sizes, seq_starts, seq_pix):
   9.231 -    '''Make an image of axis labels.'''
   9.232 -    min_pos = seq_starts[0]
   9.233 -    max_pos = seq_starts[-1] + seq_pix[-1]
   9.234 -    height = max(zip(*name_sizes)[1])
   9.235 -    labels = [make_label(i, j, k, l) for i, j, k, l in
   9.236 -              zip(seq_names, name_sizes, seq_starts, seq_pix)]
   9.237 -    labels = [i for i in labels if i[1] >= min_pos and i[2] <= max_pos]
   9.238 -    labels.sort()
   9.239 -    labels = get_nonoverlapping_labels(labels)
   9.240 -    image_size = max_pos, height
   9.241 -    im = Image.new(image_mode, image_size, border_shade)
   9.242 -    draw = ImageDraw.Draw(im)
   9.243 -    for i in labels:
   9.244 -        position = i[1], 0
   9.245 -        draw.text(position, i[3], font=font, fill=text_color)
   9.246 -    return im
   9.247 -
   9.248 -image_size = width, height
   9.249 -im = Image.new(image_mode, image_size, background_color)
   9.250 -
   9.251 -for i in range(height):
   9.252 -    for j in range(width):
   9.253 -        store_value = hits[i * width + j]
   9.254 -        xy = j, i
   9.255 -        if   store_value == 1: im.putpixel(xy, forward_color)
   9.256 -        elif store_value == 2: im.putpixel(xy, reverse_color)
   9.257 -        elif store_value == 3: im.putpixel(xy, overlap_color)
   9.258 -
   9.259 -if opts.fontsize != 0:
   9.260 -    axis1 = get_axis_image(seq_names1, name_sizes1, seq_starts1, seq_pix1)
   9.261 -    axis2 = get_axis_image(seq_names2, name_sizes2, seq_starts2, seq_pix2)
   9.262 -    axis2 = axis2.rotate(270)
   9.263 -    im.paste(axis1, (0, 0))
   9.264 -    im.paste(axis2, (0, 0))
   9.265 -
   9.266 -for i in seq_starts1[1:]:
   9.267 -    box = i - pix_tween_seqs, margin2, i, height
   9.268 -    im.paste(border_shade, box)
   9.269 -
   9.270 -for i in seq_starts2[1:]:
   9.271 -    box = margin1, i - pix_tween_seqs, width, i
   9.272 -    im.paste(border_shade, box)
   9.273 -
   9.274 -im.save(args[1])
    10.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    10.2 +++ b/scripts/last-map-probs	Wed Sep 24 13:13:14 2014 +0900
    10.3 @@ -0,0 +1,125 @@
    10.4 +#! /usr/bin/env python
    10.5 +
    10.6 +# Copyright 2010, 2011, 2012, 2014 Martin C. Frith
    10.7 +
    10.8 +# Read query-genome alignments: write them along with the probability
    10.9 +# that each alignment is not the true mapping of its query.  These
   10.10 +# probabilities make the risky assumption that one of the alignments
   10.11 +# reported for each query is correct.
   10.12 +
   10.13 +import sys, os, fileinput, math, optparse, signal
   10.14 +
   10.15 +def logsum(numbers):
   10.16 +    """Adds numbers, in log space, to avoid overflow."""
   10.17 +    m = max(numbers)
   10.18 +    s = sum(math.exp(i - m) for i in numbers)
   10.19 +    return m + math.log(s)
   10.20 +
   10.21 +def mafScore(aLine):
   10.22 +    for word in aLine.split():
   10.23 +        if word.startswith("score="):
   10.24 +            return float(word[6:])
   10.25 +    raise Exception("found an alignment without a score")
   10.26 +
   10.27 +def namesAndScores(lines):
   10.28 +    queryNames = []
   10.29 +    scores = []
   10.30 +    for line in lines:
   10.31 +        if line[0] == "a":  # faster than line.startswith("a")
   10.32 +            s = mafScore(line)
   10.33 +            scores.append(s)
   10.34 +            sLineCount = 0
   10.35 +        elif line[0] == "s":
   10.36 +            sLineCount += 1
   10.37 +            if sLineCount == 2: queryNames.append(line.split(None, 2)[1])
   10.38 +        elif line[0].isdigit():  # we have an alignment in tabular format
   10.39 +            w = line.split(None, 7)
   10.40 +            scores.append(float(w[0]))
   10.41 +            queryNames.append(w[6])
   10.42 +    return queryNames, scores
   10.43 +
   10.44 +def scoreTotals(queryNames, scores, temperature):
   10.45 +    queryLists = {}
   10.46 +    for n, s in zip(queryNames, scores):
   10.47 +        queryLists.setdefault(n, []).append(s / temperature)
   10.48 +    return dict((k, logsum(v)) for k, v in queryLists.iteritems())
   10.49 +
   10.50 +def writeOneBatch(lines, queryNames, scores, denominators, opts, temperature):
   10.51 +    isWanted = True
   10.52 +    i = 0
   10.53 +    for line in lines:
   10.54 +        if line[0] == "a":
   10.55 +            s = scores[i]
   10.56 +            p = 1.0 - math.exp(s / temperature - denominators[queryNames[i]])
   10.57 +            i += 1
   10.58 +            if s < opts.score or p > opts.mismap:
   10.59 +                isWanted = False
   10.60 +            else:
   10.61 +                newLineEnd = " mismap=%.3g\n" % p
   10.62 +                line = line.rstrip() + newLineEnd
   10.63 +        elif line[0].isdigit():  # we have an alignment in tabular format
   10.64 +            s = scores[i]
   10.65 +            p = 1.0 - math.exp(s / temperature - denominators[queryNames[i]])
   10.66 +            i += 1
   10.67 +            if s < opts.score or p > opts.mismap: continue
   10.68 +            newLineEnd = "\t%.3g\n" % p
   10.69 +            line = line.rstrip() + newLineEnd
   10.70 +        if isWanted: print line,
   10.71 +        if line.isspace(): isWanted = True  # reset at end of maf paragraph
   10.72 +
   10.73 +def doOneBatch(lines, opts, temperature):
   10.74 +    queryNames, scores = namesAndScores(lines)
   10.75 +    denominators = scoreTotals(queryNames, scores, temperature)
   10.76 +    writeOneBatch(lines, queryNames, scores, denominators, opts, temperature)
   10.77 +
   10.78 +def readHeaderOrDie(lines):
   10.79 +    t = 0.0
   10.80 +    e = -1
   10.81 +    for line in lines:
   10.82 +        if line.startswith("#") or line.isspace():
   10.83 +            print line,
   10.84 +            for i in line.split():
   10.85 +                if i.startswith("t="): t = float(i[2:])
   10.86 +                elif i.startswith("e="): e = int(i[2:])
   10.87 +            if t > 0 and e >= 0: break
   10.88 +        else:
   10.89 +            raise Exception("I need a header with t= and e=")
   10.90 +    return t, e
   10.91 +
   10.92 +def lastMapProbs(opts, args):
   10.93 +    f = fileinput.input(args)
   10.94 +    temperature, e = readHeaderOrDie(f)
   10.95 +    if opts.score < 0: opts.score = e + round(temperature * math.log(1000))
   10.96 +    lines = []
   10.97 +
   10.98 +    for line in f:
   10.99 +        if line.startswith("# batch"):
  10.100 +            doOneBatch(lines, opts, temperature)
  10.101 +            lines = []
  10.102 +        lines.append(line)
  10.103 +    doOneBatch(lines, opts, temperature)
  10.104 +
  10.105 +if __name__ == "__main__":
  10.106 +    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  10.107 +
  10.108 +    usage = """
  10.109 +  %prog --help
  10.110 +  %prog [options] lastal-alignments"""
  10.111 +
  10.112 +    description = "Calculate a mismap probability for each alignment.  This is the probability that the alignment does not reflect the origin of the query sequence, assuming that one reported alignment does reflect the origin of each query."
  10.113 +
  10.114 +    op = optparse.OptionParser(usage=usage, description=description)
  10.115 +    op.add_option("-m", "--mismap", type="float", default=0.01, metavar="M",
  10.116 +                  help="don't write alignments with mismap probability > M (default: %default)")
  10.117 +    op.add_option("-s", "--score", type="float", default=-1, metavar="S",
  10.118 +                  help="don't write alignments with score < S (default: e+t*ln[1000])")
  10.119 +    (opts, args) = op.parse_args()
  10.120 +    if not args and sys.stdin.isatty():
  10.121 +        op.print_help()
  10.122 +        op.exit()
  10.123 +
  10.124 +    try: lastMapProbs(opts, args)
  10.125 +    except KeyboardInterrupt: pass  # avoid silly error message
  10.126 +    except Exception, e:
  10.127 +        prog = os.path.basename(sys.argv[0])
  10.128 +        sys.exit(prog + ": error: " + str(e))
    11.1 --- a/scripts/last-map-probs.py	Wed Sep 24 11:41:48 2014 +0900
    11.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
    11.3 @@ -1,125 +0,0 @@
    11.4 -#! /usr/bin/env python
    11.5 -
    11.6 -# Copyright 2010, 2011, 2012, 2014 Martin C. Frith
    11.7 -
    11.8 -# Read query-genome alignments: write them along with the probability
    11.9 -# that each alignment is not the true mapping of its query.  These
   11.10 -# probabilities make the risky assumption that one of the alignments
   11.11 -# reported for each query is correct.
   11.12 -
   11.13 -import sys, os, fileinput, math, optparse, signal
   11.14 -
   11.15 -def logsum(numbers):
   11.16 -    """Adds numbers, in log space, to avoid overflow."""
   11.17 -    m = max(numbers)
   11.18 -    s = sum(math.exp(i - m) for i in numbers)
   11.19 -    return m + math.log(s)
   11.20 -
   11.21 -def mafScore(aLine):
   11.22 -    for word in aLine.split():
   11.23 -        if word.startswith("score="):
   11.24 -            return float(word[6:])
   11.25 -    raise Exception("found an alignment without a score")
   11.26 -
   11.27 -def namesAndScores(lines):
   11.28 -    queryNames = []
   11.29 -    scores = []
   11.30 -    for line in lines:
   11.31 -        if line[0] == "a":  # faster than line.startswith("a")
   11.32 -            s = mafScore(line)
   11.33 -            scores.append(s)
   11.34 -            sLineCount = 0
   11.35 -        elif line[0] == "s":
   11.36 -            sLineCount += 1
   11.37 -            if sLineCount == 2: queryNames.append(line.split(None, 2)[1])
   11.38 -        elif line[0].isdigit():  # we have an alignment in tabular format
   11.39 -            w = line.split(None, 7)
   11.40 -            scores.append(float(w[0]))
   11.41 -            queryNames.append(w[6])
   11.42 -    return queryNames, scores
   11.43 -
   11.44 -def scoreTotals(queryNames, scores, temperature):
   11.45 -    queryLists = {}
   11.46 -    for n, s in zip(queryNames, scores):
   11.47 -        queryLists.setdefault(n, []).append(s / temperature)
   11.48 -    return dict((k, logsum(v)) for k, v in queryLists.iteritems())
   11.49 -
   11.50 -def writeOneBatch(lines, queryNames, scores, denominators, opts, temperature):
   11.51 -    isWanted = True
   11.52 -    i = 0
   11.53 -    for line in lines:
   11.54 -        if line[0] == "a":
   11.55 -            s = scores[i]
   11.56 -            p = 1.0 - math.exp(s / temperature - denominators[queryNames[i]])
   11.57 -            i += 1
   11.58 -            if s < opts.score or p > opts.mismap:
   11.59 -                isWanted = False
   11.60 -            else:
   11.61 -                newLineEnd = " mismap=%.3g\n" % p
   11.62 -                line = line.rstrip() + newLineEnd
   11.63 -        elif line[0].isdigit():  # we have an alignment in tabular format
   11.64 -            s = scores[i]
   11.65 -            p = 1.0 - math.exp(s / temperature - denominators[queryNames[i]])
   11.66 -            i += 1
   11.67 -            if s < opts.score or p > opts.mismap: continue
   11.68 -            newLineEnd = "\t%.3g\n" % p
   11.69 -            line = line.rstrip() + newLineEnd
   11.70 -        if isWanted: print line,
   11.71 -        if line.isspace(): isWanted = True  # reset at end of maf paragraph
   11.72 -
   11.73 -def doOneBatch(lines, opts, temperature):
   11.74 -    queryNames, scores = namesAndScores(lines)
   11.75 -    denominators = scoreTotals(queryNames, scores, temperature)
   11.76 -    writeOneBatch(lines, queryNames, scores, denominators, opts, temperature)
   11.77 -
   11.78 -def readHeaderOrDie(lines):
   11.79 -    t = 0.0
   11.80 -    e = -1
   11.81 -    for line in lines:
   11.82 -        if line.startswith("#") or line.isspace():
   11.83 -            print line,
   11.84 -            for i in line.split():
   11.85 -                if i.startswith("t="): t = float(i[2:])
   11.86 -                elif i.startswith("e="): e = int(i[2:])
   11.87 -            if t > 0 and e >= 0: break
   11.88 -        else:
   11.89 -            raise Exception("I need a header with t= and e=")
   11.90 -    return t, e
   11.91 -
   11.92 -def lastMapProbs(opts, args):
   11.93 -    f = fileinput.input(args)
   11.94 -    temperature, e = readHeaderOrDie(f)
   11.95 -    if opts.score < 0: opts.score = e + round(temperature * math.log(1000))
   11.96 -    lines = []
   11.97 -
   11.98 -    for line in f:
   11.99 -        if line.startswith("# batch"):
  11.100 -            doOneBatch(lines, opts, temperature)
  11.101 -            lines = []
  11.102 -        lines.append(line)
  11.103 -    doOneBatch(lines, opts, temperature)
  11.104 -
  11.105 -if __name__ == "__main__":
  11.106 -    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  11.107 -
  11.108 -    usage = """
  11.109 -  %prog --help
  11.110 -  %prog [options] lastal-alignments"""
  11.111 -
  11.112 -    description = "Calculate a mismap probability for each alignment.  This is the probability that the alignment does not reflect the origin of the query sequence, assuming that one reported alignment does reflect the origin of each query."
  11.113 -
  11.114 -    op = optparse.OptionParser(usage=usage, description=description)
  11.115 -    op.add_option("-m", "--mismap", type="float", default=0.01, metavar="M",
  11.116 -                  help="don't write alignments with mismap probability > M (default: %default)")
  11.117 -    op.add_option("-s", "--score", type="float", default=-1, metavar="S",
  11.118 -                  help="don't write alignments with score < S (default: e+t*ln[1000])")
  11.119 -    (opts, args) = op.parse_args()
  11.120 -    if not args and sys.stdin.isatty():
  11.121 -        op.print_help()
  11.122 -        op.exit()
  11.123 -
  11.124 -    try: lastMapProbs(opts, args)
  11.125 -    except KeyboardInterrupt: pass  # avoid silly error message
  11.126 -    except Exception, e:
  11.127 -        prog = os.path.basename(sys.argv[0])
  11.128 -        sys.exit(prog + ": error: " + str(e))
    12.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    12.2 +++ b/scripts/last-pair-probs	Wed Sep 24 13:13:14 2014 +0900
    12.3 @@ -0,0 +1,385 @@
    12.4 +#! /usr/bin/env python
    12.5 +
    12.6 +# Copyright 2011, 2012, 2013 Martin C. Frith
    12.7 +
    12.8 +# This script reads alignments of DNA reads to a genome, and estimates
    12.9 +# the probability that each alignment represents the genomic source of
   12.10 +# the read.  It assumes that the reads come in pairs, where each pair
   12.11 +# is from either end of a DNA fragment.
   12.12 +
   12.13 +# Seems to work with Python 2.x, x>=4.
   12.14 +
   12.15 +# The --rna option makes it assume that the genomic fragment lengths
   12.16 +# follow a log-normal distribution (instead of a normal distribution).
   12.17 +# In one test with human RNA, log-normal was a remarkably good fit,
   12.18 +# but not perfect.  The true distribution looked like a mixture of 2
   12.19 +# log-normals: a dominant one for shorter introns, and a minor one for
   12.20 +# huge introns.  Thus, our use of a single log-normal fails to model
   12.21 +# rare, huge introns.  To compensate for that, the default value of
   12.22 +# --disjoint is increased when --rna is used.
   12.23 +
   12.24 +# (Should we try to estimate the prior probability of disjoint mapping
   12.25 +# from the data?  But maybe ignore low-scoring alignments for that?
   12.26 +# Estimate disjoint maps to opposite strands of same chromosome = maps
   12.27 +# to same strand of same chromosome?)
   12.28 +
   12.29 +import itertools, math, operator, optparse, os, signal, sys
   12.30 +
   12.31 +def logSumExp(numbers):
   12.32 +    """Adds numbers, in log space, to avoid overflow."""
   12.33 +    n = list(numbers)
   12.34 +    if not n: return -1e99  # should be -inf
   12.35 +    m = max(n)
   12.36 +    s = sum(math.exp(i - m) for i in n)  # fsum is only Python >= 2.6.
   12.37 +    return math.log(s) + m
   12.38 +
   12.39 +def warn(*things):
   12.40 +    prog = os.path.basename(sys.argv[0])
   12.41 +    text = " ".join(map(str, things))
   12.42 +    sys.stderr.write(prog + ": " + text + "\n")
   12.43 +
   12.44 +def joinby(iterable1, iterable2, keyfunc):
   12.45 +    """Yields pairs from iterable1 and iterable2 that share the same key."""
   12.46 +    groups1 = itertools.groupby(iterable1, keyfunc)
   12.47 +    groups2 = itertools.groupby(iterable2, keyfunc)
   12.48 +    k1, v1 = groups1.next()
   12.49 +    k2, v2 = groups2.next()
   12.50 +    while 1:
   12.51 +        if k1 < k2:
   12.52 +            k1, v1 = groups1.next()
   12.53 +        elif k1 > k2:
   12.54 +            k2, v2 = groups2.next()
   12.55 +        else:
   12.56 +            v2 = list(v2)
   12.57 +            for i1 in v1:
   12.58 +                for i2 in v2:
   12.59 +                    yield i1, i2
   12.60 +            k1, v1 = groups1.next()
   12.61 +            k2, v2 = groups2.next()
   12.62 +
   12.63 +class AlignmentParameters:
   12.64 +    """Parses the score scale factor, minimum score, and genome size."""
   12.65 +
   12.66 +    def __init__(self):  # dummy values:
   12.67 +        self.t = -1  # score scale factor
   12.68 +        self.e = -1  # minimum score
   12.69 +        self.g = -1  # genome size
   12.70 +
   12.71 +    def update(self, line):
   12.72 +        for i in line.split():
   12.73 +            if self.t == -1 and i.startswith("t="):
   12.74 +                self.t = float(i[2:])
   12.75 +                if self.t <= 0: raise Exception("t must be positive")
   12.76 +            if self.e == -1 and i.startswith("e="):
   12.77 +                self.e = float(i[2:])
   12.78 +                if self.e <= 0: raise Exception("e must be positive")
   12.79 +            if self.g == -1 and i.startswith("letters="):
   12.80 +                self.g = float(i[8:])
   12.81 +                if self.g <= 0: raise Exception("letters must be positive")
   12.82 +
   12.83 +    def isValid(self):
   12.84 +        return self.t != -1 and self.e != -1 and self.g != -1
   12.85 +
   12.86 +    def validate(self):
   12.87 +        if self.t == -1: raise Exception("I need a header line with t=")
   12.88 +        if self.e == -1: raise Exception("I need a header line with e=")
   12.89 +        if self.g == -1: raise Exception("I need a header line with letters=")
   12.90 +
   12.91 +def printAlignmentWithMismapProb(alignment, prob, suf):
   12.92 +    lines = alignment[4]
   12.93 +    qName = alignment[5]
   12.94 +    if qName.endswith("/1") or qName.endswith("/2"): suf = ""
   12.95 +    p = "%.3g" % prob
   12.96 +    if len(lines) == 1:  # we have tabular format
   12.97 +        w = lines[0].split()
   12.98 +        w[6] += suf
   12.99 +        w.append(p)
  12.100 +        print "\t".join(w)
  12.101 +    else:  # we have MAF format
  12.102 +        print lines[0].rstrip() + " mismap=" + p
  12.103 +        pad = " " * len(suf)  # spacer to keep the alignment of MAF lines
  12.104 +        rNameEnd = len(alignment[0]) + 1  # where to insert the spacer
  12.105 +        qNameEnd = len(qName) + 2  # where to insert the suffix
  12.106 +        s = 0
  12.107 +        for i in lines[1:]:
  12.108 +            if i[0] in "sq":
  12.109 +                if i[0] == "s": s += 1
  12.110 +                if s == 1:    print i[:rNameEnd] + pad + i[rNameEnd:],
  12.111 +                else:         print i[:qNameEnd] + suf + i[qNameEnd:],
  12.112 +            elif i[0] == "p": print i[:1] + pad + i[1:]
  12.113 +            else:             print i,
  12.114 +        print  # each MAF block should end with a blank line
  12.115 +
  12.116 +def headToHeadDistance(alignment1, alignment2):
  12.117 +    """The 5'-to-5' distance between 2 alignments on opposite strands."""
  12.118 +    length = alignment1[1] + alignment2[1]
  12.119 +    if length > alignment1[2]: length -= alignment1[2]  # for circular chroms
  12.120 +    return length
  12.121 +
  12.122 +def conjointScores(aln1, alns2, fraglen, inner, isRna):
  12.123 +    for i in alns2:
  12.124 +        length = headToHeadDistance(aln1, i)
  12.125 +        if isRna:  # use a log-normal distribution
  12.126 +            if length <= 0: continue
  12.127 +            loglen = math.log(length)
  12.128 +            yield i[3] + inner * (loglen - fraglen) ** 2 - loglen
  12.129 +        else:      # use a normal distribution
  12.130 +            if (length > 0) != (fraglen > 0): continue  # ?
  12.131 +            yield i[3] + inner * (length - fraglen) ** 2
  12.132 +
  12.133 +def probForEachAlignment(alignments1, alignments2, opts):
  12.134 +    x = opts.disjointScore + logSumExp(i[3] for i in alignments2)
  12.135 +
  12.136 +    fraglen = opts.fraglen
  12.137 +    outer = opts.outer
  12.138 +    inner = opts.inner
  12.139 +    isRna = opts.rna
  12.140 +
  12.141 +    groups2 = itertools.groupby(alignments2, operator.itemgetter(0))
  12.142 +    genomeStrand2 = " "  # assume this is < any genomeStrand1
  12.143 +    for aln1 in alignments1:
  12.144 +        genomeStrand1 = aln1[0]
  12.145 +        # get the items in alignments2 that have the same genomeStrand:
  12.146 +        if genomeStrand2 < genomeStrand1:
  12.147 +            for genomeStrand2, alns2 in groups2:
  12.148 +                if genomeStrand2 >= genomeStrand1:
  12.149 +                    alns2 = list(alns2)
  12.150 +                    break
  12.151 +            else:
  12.152 +                genomeStrand2 = "~"  # assume this is > any genomeStrand1
  12.153 +        if genomeStrand1 == genomeStrand2:
  12.154 +            y = outer + logSumExp(conjointScores(aln1, alns2, fraglen, inner, isRna))
  12.155 +            yield aln1[3] + logSumExp((x, y))
  12.156 +        else:  # no items in alignments2 have the same genomeStrand
  12.157 +            yield aln1[3] + x
  12.158 +
  12.159 +def printAlnsForOneRead(alignments1, alignments2, opts, maxMissingScore, suf):
  12.160 +    if alignments2:
  12.161 +        zs = list(probForEachAlignment(alignments1, alignments2, opts))
  12.162 +        w = maxMissingScore + max(i[3] for i in alignments2)
  12.163 +    else:
  12.164 +        zs = [i[3] + opts.disjointScore for i in alignments1]
  12.165 +        w = maxMissingScore
  12.166 +
  12.167 +    z = logSumExp(zs)
  12.168 +    zw = logSumExp((z, w))
  12.169 +
  12.170 +    for i, j in itertools.izip(alignments1, zs):
  12.171 +        prob = 1 - math.exp(j - zw)
  12.172 +        if prob <= opts.mismap: printAlignmentWithMismapProb(i, prob, suf)
  12.173 +
  12.174 +def unambiguousFragmentLength(alignments1, alignments2):
  12.175 +    """Returns the fragment length implied by alignments of a pair of reads."""
  12.176 +    old = None
  12.177 +    for i, j in joinby(alignments1, alignments2, operator.itemgetter(0)):
  12.178 +        new = headToHeadDistance(i, j)
  12.179 +        if old is None: old = new
  12.180 +        elif new != old: return None  # the fragment length is ambiguous
  12.181 +    return old
  12.182 +
  12.183 +def unambiguousFragmentLengths(queryPairs):
  12.184 +    for i, j in queryPairs:
  12.185 +        length = unambiguousFragmentLength(i, j)
  12.186 +        if length is not None: yield length
  12.187 +
  12.188 +def readHeaderOrDie(lines):
  12.189 +    params = AlignmentParameters()
  12.190 +    for line in lines:
  12.191 +        if line[0] == "#":
  12.192 +            params.update(line)
  12.193 +            if params.isValid():
  12.194 +                return params
  12.195 +        elif not line.isspace():
  12.196 +            break
  12.197 +    params.validate()  # die
  12.198 +
  12.199 +def parseAlignment(score, rName, rStart, rSpan, rSize, qName, qStrand, text,
  12.200 +                   strand, scale, circularChroms):
  12.201 +    if qStrand == strand: genomeStrand = rName + "+"
  12.202 +    else:                 genomeStrand = rName + "-"
  12.203 +
  12.204 +    rStart = int(rStart)
  12.205 +    rSize = int(rSize)
  12.206 +
  12.207 +    if qStrand == "+":
  12.208 +        c = -rStart
  12.209 +    else:
  12.210 +        c = rStart + int(rSpan)
  12.211 +        if rName in circularChroms or "." in circularChroms: c += rSize
  12.212 +
  12.213 +    scaledScore = float(score) / scale  # needed in 2nd pass
  12.214 +
  12.215 +    return genomeStrand, c, rSize, scaledScore, text, qName
  12.216 +
  12.217 +def parseMafScore(aLine):
  12.218 +    for i in aLine.split():
  12.219 +        if i.startswith("score="): return i[6:]
  12.220 +    raise Exception("missing score")
  12.221 +
  12.222 +def parseMaf(lines, strand, scale, circularChroms):
  12.223 +    score = parseMafScore(lines[0])
  12.224 +    r, q = [i.split() for i in lines if i[0] == "s"]
  12.225 +    return parseAlignment(score, r[1], r[2], r[3], r[5], q[1], q[4], lines,
  12.226 +                          strand, scale, circularChroms)
  12.227 +
  12.228 +def parseTab(line, strand, scale, circularChroms):
  12.229 +    w = line.split()
  12.230 +    return parseAlignment(w[0], w[1], w[2], w[3], w[5], w[6], w[9], [line],
  12.231 +                          strand, scale, circularChroms)
  12.232 +
  12.233 +def readBatches(lines, strand, scale, circularChroms):
  12.234 +    """Yields alignment data from MAF or tabular format."""
  12.235 +    alns = []
  12.236 +    maf = []
  12.237 +    for line in lines:
  12.238 +        if line[0].isdigit():
  12.239 +            alns.append(parseTab(line, strand, scale, circularChroms))
  12.240 +        elif line[0].isalpha():
  12.241 +            maf.append(line)
  12.242 +        elif line.isspace():
  12.243 +            if maf: alns.append(parseMaf(maf, strand, scale, circularChroms))
  12.244 +            maf = []
  12.245 +        elif line.startswith("# batch "):
  12.246 +            if maf: alns.append(parseMaf(maf, strand, scale, circularChroms))
  12.247 +            maf = []
  12.248 +            yield alns  # might be empty
  12.249 +            alns = []
  12.250 +    if maf: alns.append(parseMaf(maf, strand, scale, circularChroms))
  12.251 +    yield alns  # might be empty
  12.252 +
  12.253 +def readQueryPairs(in1, in2, scale1, scale2, circularChroms):
  12.254 +    batches1 = readBatches(in1, "+", scale1, circularChroms)
  12.255 +    batches2 = readBatches(in2, "-", scale2, circularChroms)
  12.256 +    for i, j in itertools.izip(batches1, batches2):
  12.257 +        i.sort()
  12.258 +        j.sort()
  12.259 +        yield i, j
  12.260 +
  12.261 +def myRound(myFloat):
  12.262 +    """Round a real number to a moderate amount of significant figures."""
  12.263 +    return float("%g" % myFloat)
  12.264 +
  12.265 +def estimateFragmentLengthDistribution(lengths, opts):
  12.266 +    if not lengths:
  12.267 +        raise Exception("can't estimate the distribution of distances")
  12.268 +
  12.269 +    # Define quartiles in the most naive way possible:
  12.270 +    lengths.sort()
  12.271 +    sampleSize = len(lengths)
  12.272 +    quartile1 = lengths[sampleSize // 4]
  12.273 +    quartile2 = lengths[sampleSize // 2]
  12.274 +    quartile3 = lengths[sampleSize * 3 // 4]
  12.275 +
  12.276 +    warn("distance sample size:", sampleSize)
  12.277 +    warn("distance quartiles:", quartile1, quartile2, quartile3)
  12.278 +
  12.279 +    if opts.rna and quartile1 <= 0:
  12.280 +        raise Exception("too many distances <= 0")
  12.281 +
  12.282 +    if opts.rna: thing = "ln[distance]"
  12.283 +    else:        thing = "distance"
  12.284 +
  12.285 +    if opts.fraglen is None:
  12.286 +        if opts.rna: opts.fraglen = myRound(math.log(quartile2))
  12.287 +        else:        opts.fraglen = float(quartile2)
  12.288 +        warn("estimated mean %s: %s" % (thing, opts.fraglen))
  12.289 +
  12.290 +    if opts.sdev is None:
  12.291 +        if opts.rna: iqr = math.log(quartile3) - math.log(quartile1)
  12.292 +        else:        iqr = quartile3 - quartile1
  12.293 +        # Normal Distribution: sdev = iqr / (2 * qnorm(0.75))
  12.294 +        opts.sdev = myRound(iqr / 1.34898)
  12.295 +        warn("estimated standard deviation of %s: %s" % (thing, opts.sdev))
  12.296 +
  12.297 +def safeLog(x):
  12.298 +    if x == 0: return -1e99
  12.299 +    else:      return math.log(x)
  12.300 +
  12.301 +def calculateScorePieces(opts, params1, params2):
  12.302 +    if opts.sdev == 0:
  12.303 +        if opts.rna: opts.outer = opts.fraglen
  12.304 +        else:        opts.outer = 0.0
  12.305 +        opts.inner = -1e99
  12.306 +    else:  # parameters for a Normal Distribution (of fragment lengths):
  12.307 +        opts.outer = -math.log(opts.sdev * math.sqrt(2 * math.pi))
  12.308 +        opts.inner = -1.0 / (2 * opts.sdev ** 2)
  12.309 +
  12.310 +    opts.outer += safeLog(1 - opts.disjoint)
  12.311 +
  12.312 +    if params1.g != params2.g: raise Exception("unequal genome sizes")
  12.313 +    # Multiply genome size by 2, because it has 2 strands:
  12.314 +    opts.disjointScore = safeLog(opts.disjoint) - math.log(params1.g * 2)
  12.315 +
  12.316 +    # Max possible influence of an alignment just below the score threshold:
  12.317 +    maxLogPrior = opts.outer
  12.318 +    if opts.rna: maxLogPrior += opts.sdev ** 2 / 2 - opts.fraglen
  12.319 +    opts.maxMissingScore1 = (params1.e - 1) / params1.t + maxLogPrior
  12.320 +    opts.maxMissingScore2 = (params2.e - 1) / params2.t + maxLogPrior
  12.321 +
  12.322 +def lastPairProbs(opts, args):
  12.323 +    fileName1, fileName2 = args
  12.324 +
  12.325 +    if opts.fraglen is None or opts.sdev is None:
  12.326 +        in1 = open(fileName1)
  12.327 +        in2 = open(fileName2)
  12.328 +        qp = readQueryPairs(in1, in2, 1, 1, opts.circular)
  12.329 +        lengths = list(unambiguousFragmentLengths(qp))
  12.330 +        estimateFragmentLengthDistribution(lengths, opts)
  12.331 +        in1.close()
  12.332 +        in2.close()
  12.333 +
  12.334 +    if not opts.estdist:
  12.335 +        in1 = open(fileName1)
  12.336 +        in2 = open(fileName2)
  12.337 +        params1 = readHeaderOrDie(in1)
  12.338 +        params2 = readHeaderOrDie(in2)
  12.339 +        calculateScorePieces(opts, params1, params2)
  12.340 +        printme = opts.fraglen, opts.sdev, opts.disjoint, params1.g
  12.341 +        print "# fraglen=%s sdev=%s disjoint=%s genome=%.17g" % printme
  12.342 +        qp = readQueryPairs(in1, in2, params1.t, params2.t, opts.circular)
  12.343 +        for i, j in qp:
  12.344 +            printAlnsForOneRead(i, j, opts, opts.maxMissingScore1, "/1")
  12.345 +            printAlnsForOneRead(j, i, opts, opts.maxMissingScore2, "/2")
  12.346 +        in1.close()
  12.347 +        in2.close()
  12.348 +
  12.349 +if __name__ == "__main__":
  12.350 +    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  12.351 +
  12.352 +    usage = """
  12.353 +  %prog --help
  12.354 +  %prog [options] alignments1 alignments2"""
  12.355 +
  12.356 +    description = "Read alignments of paired DNA reads to a genome, and: (1) estimate the distribution of distances between paired reads, (2) estimate the probability that each alignment represents the genomic source of the read."
  12.357 +
  12.358 +    op = optparse.OptionParser(usage=usage, description=description)
  12.359 +    op.add_option("-r", "--rna", action="store_true", help=
  12.360 +                  "assume the reads are from potentially-spliced RNA")
  12.361 +    op.add_option("-e", "--estdist", action="store_true",
  12.362 +                  help="just estimate the distribution of distances")
  12.363 +    op.add_option("-m", "--mismap", type="float", default=0.01, metavar="M",
  12.364 +                  help="don't write alignments with mismap probability > M (default: %default)")
  12.365 +    op.add_option("-f", "--fraglen", type="float", metavar="BP",
  12.366 +                  help="mean distance in bp")
  12.367 +    op.add_option("-s", "--sdev", type="float", metavar="BP",
  12.368 +                  help="standard deviation of distance")
  12.369 +    op.add_option("-d", "--disjoint", type="float",
  12.370 +                  metavar="PROB", help=
  12.371 +                  "prior probability of disjoint mapping (default: 0.02 if -r, else 0.01)")
  12.372 +    op.add_option("-c", "--circular", action="append", metavar="CHROM",
  12.373 +                  help="specifies that chromosome CHROM is circular (default: chrM)")
  12.374 +    (opts, args) = op.parse_args()
  12.375 +    if opts.disjoint is None:
  12.376 +        if opts.rna: opts.disjoint = 0.02
  12.377 +        else:        opts.disjoint = 0.01
  12.378 +    if opts.disjoint < 0: op.error("option -d: should be >= 0")
  12.379 +    if opts.disjoint > 1: op.error("option -d: should be <= 1")
  12.380 +    if opts.sdev and opts.sdev < 0: op.error("option -s: should be >= 0")
  12.381 +    if len(args) != 2: op.error("please give me two file names")
  12.382 +    if opts.circular is None: opts.circular = ["chrM"]
  12.383 +
  12.384 +    try: lastPairProbs(opts, args)
  12.385 +    except KeyboardInterrupt: pass  # avoid silly error message
  12.386 +    except Exception, e:
  12.387 +        warn("error:", e)
  12.388 +        sys.exit(1)
    13.1 --- a/scripts/last-pair-probs.py	Wed Sep 24 11:41:48 2014 +0900
    13.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
    13.3 @@ -1,385 +0,0 @@
    13.4 -#! /usr/bin/env python
    13.5 -
    13.6 -# Copyright 2011, 2012, 2013 Martin C. Frith
    13.7 -
    13.8 -# This script reads alignments of DNA reads to a genome, and estimates
    13.9 -# the probability that each alignment represents the genomic source of
   13.10 -# the read.  It assumes that the reads come in pairs, where each pair
   13.11 -# is from either end of a DNA fragment.
   13.12 -
   13.13 -# Seems to work with Python 2.x, x>=4.
   13.14 -
   13.15 -# The --rna option makes it assume that the genomic fragment lengths
   13.16 -# follow a log-normal distribution (instead of a normal distribution).
   13.17 -# In one test with human RNA, log-normal was a remarkably good fit,
   13.18 -# but not perfect.  The true distribution looked like a mixture of 2
   13.19 -# log-normals: a dominant one for shorter introns, and a minor one for
   13.20 -# huge introns.  Thus, our use of a single log-normal fails to model
   13.21 -# rare, huge introns.  To compensate for that, the default value of
   13.22 -# --disjoint is increased when --rna is used.
   13.23 -
   13.24 -# (Should we try to estimate the prior probability of disjoint mapping
   13.25 -# from the data?  But maybe ignore low-scoring alignments for that?
   13.26 -# Estimate disjoint maps to opposite strands of same chromosome = maps
   13.27 -# to same strand of same chromosome?)
   13.28 -
   13.29 -import itertools, math, operator, optparse, os, signal, sys
   13.30 -
   13.31 -def logSumExp(numbers):
   13.32 -    """Adds numbers, in log space, to avoid overflow."""
   13.33 -    n = list(numbers)
   13.34 -    if not n: return -1e99  # should be -inf
   13.35 -    m = max(n)
   13.36 -    s = sum(math.exp(i - m) for i in n)  # fsum is only Python >= 2.6.
   13.37 -    return math.log(s) + m
   13.38 -
   13.39 -def warn(*things):
   13.40 -    prog = os.path.basename(sys.argv[0])
   13.41 -    text = " ".join(map(str, things))
   13.42 -    sys.stderr.write(prog + ": " + text + "\n")
   13.43 -
   13.44 -def joinby(iterable1, iterable2, keyfunc):
   13.45 -    """Yields pairs from iterable1 and iterable2 that share the same key."""
   13.46 -    groups1 = itertools.groupby(iterable1, keyfunc)
   13.47 -    groups2 = itertools.groupby(iterable2, keyfunc)
   13.48 -    k1, v1 = groups1.next()
   13.49 -    k2, v2 = groups2.next()
   13.50 -    while 1:
   13.51 -        if k1 < k2:
   13.52 -            k1, v1 = groups1.next()
   13.53 -        elif k1 > k2:
   13.54 -            k2, v2 = groups2.next()
   13.55 -        else:
   13.56 -            v2 = list(v2)
   13.57 -            for i1 in v1:
   13.58 -                for i2 in v2:
   13.59 -                    yield i1, i2
   13.60 -            k1, v1 = groups1.next()
   13.61 -            k2, v2 = groups2.next()
   13.62 -
   13.63 -class AlignmentParameters:
   13.64 -    """Parses the score scale factor, minimum score, and genome size."""
   13.65 -
   13.66 -    def __init__(self):  # dummy values:
   13.67 -        self.t = -1  # score scale factor
   13.68 -        self.e = -1  # minimum score
   13.69 -        self.g = -1  # genome size
   13.70 -
   13.71 -    def update(self, line):
   13.72 -        for i in line.split():
   13.73 -            if self.t == -1 and i.startswith("t="):
   13.74 -                self.t = float(i[2:])
   13.75 -                if self.t <= 0: raise Exception("t must be positive")
   13.76 -            if self.e == -1 and i.startswith("e="):
   13.77 -                self.e = float(i[2:])
   13.78 -                if self.e <= 0: raise Exception("e must be positive")
   13.79 -            if self.g == -1 and i.startswith("letters="):
   13.80 -                self.g = float(i[8:])
   13.81 -                if self.g <= 0: raise Exception("letters must be positive")
   13.82 -
   13.83 -    def isValid(self):
   13.84 -        return self.t != -1 and self.e != -1 and self.g != -1
   13.85 -
   13.86 -    def validate(self):
   13.87 -        if self.t == -1: raise Exception("I need a header line with t=")
   13.88 -        if self.e == -1: raise Exception("I need a header line with e=")
   13.89 -        if self.g == -1: raise Exception("I need a header line with letters=")
   13.90 -
   13.91 -def printAlignmentWithMismapProb(alignment, prob, suf):
   13.92 -    lines = alignment[4]
   13.93 -    qName = alignment[5]
   13.94 -    if qName.endswith("/1") or qName.endswith("/2"): suf = ""
   13.95 -    p = "%.3g" % prob
   13.96 -    if len(lines) == 1:  # we have tabular format
   13.97 -        w = lines[0].split()
   13.98 -        w[6] += suf
   13.99 -        w.append(p)
  13.100 -        print "\t".join(w)
  13.101 -    else:  # we have MAF format
  13.102 -        print lines[0].rstrip() + " mismap=" + p
  13.103 -        pad = " " * len(suf)  # spacer to keep the alignment of MAF lines
  13.104 -        rNameEnd = len(alignment[0]) + 1  # where to insert the spacer
  13.105 -        qNameEnd = len(qName) + 2  # where to insert the suffix
  13.106 -        s = 0
  13.107 -        for i in lines[1:]:
  13.108 -            if i[0] in "sq":
  13.109 -                if i[0] == "s": s += 1
  13.110 -                if s == 1:    print i[:rNameEnd] + pad + i[rNameEnd:],
  13.111 -                else:         print i[:qNameEnd] + suf + i[qNameEnd:],
  13.112 -            elif i[0] == "p": print i[:1] + pad + i[1:]
  13.113 -            else:             print i,
  13.114 -        print  # each MAF block should end with a blank line
  13.115 -
  13.116 -def headToHeadDistance(alignment1, alignment2):
  13.117 -    """The 5'-to-5' distance between 2 alignments on opposite strands."""
  13.118 -    length = alignment1[1] + alignment2[1]
  13.119 -    if length > alignment1[2]: length -= alignment1[2]  # for circular chroms
  13.120 -    return length
  13.121 -
  13.122 -def conjointScores(aln1, alns2, fraglen, inner, isRna):
  13.123 -    for i in alns2:
  13.124 -        length = headToHeadDistance(aln1, i)
  13.125 -        if isRna:  # use a log-normal distribution
  13.126 -            if length <= 0: continue
  13.127 -            loglen = math.log(length)
  13.128 -            yield i[3] + inner * (loglen - fraglen) ** 2 - loglen
  13.129 -        else:      # use a normal distribution
  13.130 -            if (length > 0) != (fraglen > 0): continue  # ?
  13.131 -            yield i[3] + inner * (length - fraglen) ** 2
  13.132 -
  13.133 -def probForEachAlignment(alignments1, alignments2, opts):
  13.134 -    x = opts.disjointScore + logSumExp(i[3] for i in alignments2)
  13.135 -
  13.136 -    fraglen = opts.fraglen
  13.137 -    outer = opts.outer
  13.138 -    inner = opts.inner
  13.139 -    isRna = opts.rna
  13.140 -
  13.141 -    groups2 = itertools.groupby(alignments2, operator.itemgetter(0))
  13.142 -    genomeStrand2 = " "  # assume this is < any genomeStrand1
  13.143 -    for aln1 in alignments1:
  13.144 -        genomeStrand1 = aln1[0]
  13.145 -        # get the items in alignments2 that have the same genomeStrand:
  13.146 -        if genomeStrand2 < genomeStrand1:
  13.147 -            for genomeStrand2, alns2 in groups2:
  13.148 -                if genomeStrand2 >= genomeStrand1:
  13.149 -                    alns2 = list(alns2)
  13.150 -                    break
  13.151 -            else:
  13.152 -                genomeStrand2 = "~"  # assume this is > any genomeStrand1
  13.153 -        if genomeStrand1 == genomeStrand2:
  13.154 -            y = outer + logSumExp(conjointScores(aln1, alns2, fraglen, inner, isRna))
  13.155 -            yield aln1[3] + logSumExp((x, y))
  13.156 -        else:  # no items in alignments2 have the same genomeStrand
  13.157 -            yield aln1[3] + x
  13.158 -
  13.159 -def printAlnsForOneRead(alignments1, alignments2, opts, maxMissingScore, suf):
  13.160 -    if alignments2:
  13.161 -        zs = list(probForEachAlignment(alignments1, alignments2, opts))
  13.162 -        w = maxMissingScore + max(i[3] for i in alignments2)
  13.163 -    else:
  13.164 -        zs = [i[3] + opts.disjointScore for i in alignments1]
  13.165 -        w = maxMissingScore
  13.166 -
  13.167 -    z = logSumExp(zs)
  13.168 -    zw = logSumExp((z, w))
  13.169 -
  13.170 -    for i, j in itertools.izip(alignments1, zs):
  13.171 -        prob = 1 - math.exp(j - zw)
  13.172 -        if prob <= opts.mismap: printAlignmentWithMismapProb(i, prob, suf)
  13.173 -
  13.174 -def unambiguousFragmentLength(alignments1, alignments2):
  13.175 -    """Returns the fragment length implied by alignments of a pair of reads."""
  13.176 -    old = None
  13.177 -    for i, j in joinby(alignments1, alignments2, operator.itemgetter(0)):
  13.178 -        new = headToHeadDistance(i, j)
  13.179 -        if old is None: old = new
  13.180 -        elif new != old: return None  # the fragment length is ambiguous
  13.181 -    return old
  13.182 -
  13.183 -def unambiguousFragmentLengths(queryPairs):
  13.184 -    for i, j in queryPairs:
  13.185 -        length = unambiguousFragmentLength(i, j)
  13.186 -        if length is not None: yield length
  13.187 -
  13.188 -def readHeaderOrDie(lines):
  13.189 -    params = AlignmentParameters()
  13.190 -    for line in lines:
  13.191 -        if line[0] == "#":
  13.192 -            params.update(line)
  13.193 -            if params.isValid():
  13.194 -                return params
  13.195 -        elif not line.isspace():
  13.196 -            break
  13.197 -    params.validate()  # die
  13.198 -
  13.199 -def parseAlignment(score, rName, rStart, rSpan, rSize, qName, qStrand, text,
  13.200 -                   strand, scale, circularChroms):
  13.201 -    if qStrand == strand: genomeStrand = rName + "+"
  13.202 -    else:                 genomeStrand = rName + "-"
  13.203 -
  13.204 -    rStart = int(rStart)
  13.205 -    rSize = int(rSize)
  13.206 -
  13.207 -    if qStrand == "+":
  13.208 -        c = -rStart
  13.209 -    else:
  13.210 -        c = rStart + int(rSpan)
  13.211 -        if rName in circularChroms or "." in circularChroms: c += rSize
  13.212 -
  13.213 -    scaledScore = float(score) / scale  # needed in 2nd pass
  13.214 -
  13.215 -    return genomeStrand, c, rSize, scaledScore, text, qName
  13.216 -
  13.217 -def parseMafScore(aLine):
  13.218 -    for i in aLine.split():
  13.219 -        if i.startswith("score="): return i[6:]
  13.220 -    raise Exception("missing score")
  13.221 -
  13.222 -def parseMaf(lines, strand, scale, circularChroms):
  13.223 -    score = parseMafScore(lines[0])
  13.224 -    r, q = [i.split() for i in lines if i[0] == "s"]
  13.225 -    return parseAlignment(score, r[1], r[2], r[3], r[5], q[1], q[4], lines,
  13.226 -                          strand, scale, circularChroms)
  13.227 -
  13.228 -def parseTab(line, strand, scale, circularChroms):
  13.229 -    w = line.split()
  13.230 -    return parseAlignment(w[0], w[1], w[2], w[3], w[5], w[6], w[9], [line],
  13.231 -                          strand, scale, circularChroms)
  13.232 -
  13.233 -def readBatches(lines, strand, scale, circularChroms):
  13.234 -    """Yields alignment data from MAF or tabular format."""
  13.235 -    alns = []
  13.236 -    maf = []
  13.237 -    for line in lines:
  13.238 -        if line[0].isdigit():
  13.239 -            alns.append(parseTab(line, strand, scale, circularChroms))
  13.240 -        elif line[0].isalpha():
  13.241 -            maf.append(line)
  13.242 -        elif line.isspace():
  13.243 -            if maf: alns.append(parseMaf(maf, strand, scale, circularChroms))
  13.244 -            maf = []
  13.245 -        elif line.startswith("# batch "):
  13.246 -            if maf: alns.append(parseMaf(maf, strand, scale, circularChroms))
  13.247 -            maf = []
  13.248 -            yield alns  # might be empty
  13.249 -            alns = []
  13.250 -    if maf: alns.append(parseMaf(maf, strand, scale, circularChroms))
  13.251 -    yield alns  # might be empty
  13.252 -
  13.253 -def readQueryPairs(in1, in2, scale1, scale2, circularChroms):
  13.254 -    batches1 = readBatches(in1, "+", scale1, circularChroms)
  13.255 -    batches2 = readBatches(in2, "-", scale2, circularChroms)
  13.256 -    for i, j in itertools.izip(batches1, batches2):
  13.257 -        i.sort()
  13.258 -        j.sort()
  13.259 -        yield i, j
  13.260 -
  13.261 -def myRound(myFloat):
  13.262 -    """Round a real number to a moderate amount of significant figures."""
  13.263 -    return float("%g" % myFloat)
  13.264 -
  13.265 -def estimateFragmentLengthDistribution(lengths, opts):
  13.266 -    if not lengths:
  13.267 -        raise Exception("can't estimate the distribution of distances")
  13.268 -
  13.269 -    # Define quartiles in the most naive way possible:
  13.270 -    lengths.sort()
  13.271 -    sampleSize = len(lengths)
  13.272 -    quartile1 = lengths[sampleSize // 4]
  13.273 -    quartile2 = lengths[sampleSize // 2]
  13.274 -    quartile3 = lengths[sampleSize * 3 // 4]
  13.275 -
  13.276 -    warn("distance sample size:", sampleSize)
  13.277 -    warn("distance quartiles:", quartile1, quartile2, quartile3)
  13.278 -
  13.279 -    if opts.rna and quartile1 <= 0:
  13.280 -        raise Exception("too many distances <= 0")
  13.281 -
  13.282 -    if opts.rna: thing = "ln[distance]"
  13.283 -    else:        thing = "distance"
  13.284 -
  13.285 -    if opts.fraglen is None:
  13.286 -        if opts.rna: opts.fraglen = myRound(math.log(quartile2))
  13.287 -        else:        opts.fraglen = float(quartile2)
  13.288 -        warn("estimated mean %s: %s" % (thing, opts.fraglen))
  13.289 -
  13.290 -    if opts.sdev is None:
  13.291 -        if opts.rna: iqr = math.log(quartile3) - math.log(quartile1)
  13.292 -        else:        iqr = quartile3 - quartile1
  13.293 -        # Normal Distribution: sdev = iqr / (2 * qnorm(0.75))
  13.294 -        opts.sdev = myRound(iqr / 1.34898)
  13.295 -        warn("estimated standard deviation of %s: %s" % (thing, opts.sdev))
  13.296 -
  13.297 -def safeLog(x):
  13.298 -    if x == 0: return -1e99
  13.299 -    else:      return math.log(x)
  13.300 -
  13.301 -def calculateScorePieces(opts, params1, params2):
  13.302 -    if opts.sdev == 0:
  13.303 -        if opts.rna: opts.outer = opts.fraglen
  13.304 -        else:        opts.outer = 0.0
  13.305 -        opts.inner = -1e99
  13.306 -    else:  # parameters for a Normal Distribution (of fragment lengths):
  13.307 -        opts.outer = -math.log(opts.sdev * math.sqrt(2 * math.pi))
  13.308 -        opts.inner = -1.0 / (2 * opts.sdev ** 2)
  13.309 -
  13.310 -    opts.outer += safeLog(1 - opts.disjoint)
  13.311 -
  13.312 -    if params1.g != params2.g: raise Exception("unequal genome sizes")
  13.313 -    # Multiply genome size by 2, because it has 2 strands:
  13.314 -    opts.disjointScore = safeLog(opts.disjoint) - math.log(params1.g * 2)
  13.315 -
  13.316 -    # Max possible influence of an alignment just below the score threshold:
  13.317 -    maxLogPrior = opts.outer
  13.318 -    if opts.rna: maxLogPrior += opts.sdev ** 2 / 2 - opts.fraglen
  13.319 -    opts.maxMissingScore1 = (params1.e - 1) / params1.t + maxLogPrior
  13.320 -    opts.maxMissingScore2 = (params2.e - 1) / params2.t + maxLogPrior
  13.321 -
  13.322 -def lastPairProbs(opts, args):
  13.323 -    fileName1, fileName2 = args
  13.324 -
  13.325 -    if opts.fraglen is None or opts.sdev is None:
  13.326 -        in1 = open(fileName1)
  13.327 -        in2 = open(fileName2)
  13.328 -        qp = readQueryPairs(in1, in2, 1, 1, opts.circular)
  13.329 -        lengths = list(unambiguousFragmentLengths(qp))
  13.330 -        estimateFragmentLengthDistribution(lengths, opts)
  13.331 -        in1.close()
  13.332 -        in2.close()
  13.333 -
  13.334 -    if not opts.estdist:
  13.335 -        in1 = open(fileName1)
  13.336 -        in2 = open(fileName2)
  13.337 -        params1 = readHeaderOrDie(in1)
  13.338 -        params2 = readHeaderOrDie(in2)
  13.339 -        calculateScorePieces(opts, params1, params2)
  13.340 -        printme = opts.fraglen, opts.sdev, opts.disjoint, params1.g
  13.341 -        print "# fraglen=%s sdev=%s disjoint=%s genome=%.17g" % printme
  13.342 -        qp = readQueryPairs(in1, in2, params1.t, params2.t, opts.circular)
  13.343 -        for i, j in qp:
  13.344 -            printAlnsForOneRead(i, j, opts, opts.maxMissingScore1, "/1")
  13.345 -            printAlnsForOneRead(j, i, opts, opts.maxMissingScore2, "/2")
  13.346 -        in1.close()
  13.347 -        in2.close()
  13.348 -
  13.349 -if __name__ == "__main__":
  13.350 -    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  13.351 -
  13.352 -    usage = """
  13.353 -  %prog --help
  13.354 -  %prog [options] alignments1 alignments2"""
  13.355 -
  13.356 -    description = "Read alignments of paired DNA reads to a genome, and: (1) estimate the distribution of distances between paired reads, (2) estimate the probability that each alignment represents the genomic source of the read."
  13.357 -
  13.358 -    op = optparse.OptionParser(usage=usage, description=description)
  13.359 -    op.add_option("-r", "--rna", action="store_true", help=
  13.360 -                  "assume the reads are from potentially-spliced RNA")
  13.361 -    op.add_option("-e", "--estdist", action="store_true",
  13.362 -                  help="just estimate the distribution of distances")
  13.363 -    op.add_option("-m", "--mismap", type="float", default=0.01, metavar="M",
  13.364 -                  help="don't write alignments with mismap probability > M (default: %default)")
  13.365 -    op.add_option("-f", "--fraglen", type="float", metavar="BP",
  13.366 -                  help="mean distance in bp")
  13.367 -    op.add_option("-s", "--sdev", type="float", metavar="BP",
  13.368 -                  help="standard deviation of distance")
  13.369 -    op.add_option("-d", "--disjoint", type="float",
  13.370 -                  metavar="PROB", help=
  13.371 -                  "prior probability of disjoint mapping (default: 0.02 if -r, else 0.01)")
  13.372 -    op.add_option("-c", "--circular", action="append", metavar="CHROM",
  13.373 -                  help="specifies that chromosome CHROM is circular (default: chrM)")
  13.374 -    (opts, args) = op.parse_args()
  13.375 -    if opts.disjoint is None:
  13.376 -        if opts.rna: opts.disjoint = 0.02
  13.377 -        else:        opts.disjoint = 0.01
  13.378 -    if opts.disjoint < 0: op.error("option -d: should be >= 0")
  13.379 -    if opts.disjoint > 1: op.error("option -d: should be <= 1")
  13.380 -    if opts.sdev and opts.sdev < 0: op.error("option -s: should be >= 0")
  13.381 -    if len(args) != 2: op.error("please give me two file names")
  13.382 -    if opts.circular is None: opts.circular = ["chrM"]
  13.383 -
  13.384 -    try: lastPairProbs(opts, args)
  13.385 -    except KeyboardInterrupt: pass  # avoid silly error message
  13.386 -    except Exception, e:
  13.387 -        warn("error:", e)
  13.388 -        sys.exit(1)
    14.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    14.2 +++ b/scripts/maf-convert	Wed Sep 24 13:13:14 2014 +0900
    14.3 @@ -0,0 +1,622 @@
    14.4 +#! /usr/bin/env python
    14.5 +# Copyright 2010, 2011, 2013, 2014 Martin C. Frith
    14.6 +# Read MAF-format alignments: write them in other formats.
    14.7 +# Seems to work with Python 2.x, x>=4
    14.8 +
    14.9 +# By "MAF" we mean "multiple alignment format" described in the UCSC
   14.10 +# Genome FAQ, not e.g. "MIRA assembly format".
   14.11 +
   14.12 +from itertools import *
   14.13 +import sys, os, fileinput, math, operator, optparse, signal, string
   14.14 +
   14.15 +def maxlen(s):
   14.16 +    return max(map(len, s))
   14.17 +
   14.18 +def joined(things, delimiter):
   14.19 +    return delimiter.join(map(str, things))
   14.20 +
   14.21 +identityTranslation = string.maketrans("", "")
   14.22 +def deleted(myString, deleteChars):
   14.23 +    return myString.translate(identityTranslation, deleteChars)
   14.24 +
   14.25 +def quantify(iterable, pred=bool):
   14.26 +    """Count how many times the predicate is true."""
   14.27 +    return sum(map(pred, iterable))
   14.28 +
   14.29 +class Maf:
   14.30 +    def __init__(self, aLine, sLines, qLines, pLines):
   14.31 +        self.namesAndValues = dict(i.split("=") for i in aLine[1:])
   14.32 +
   14.33 +        if not sLines: raise Exception("empty alignment")
   14.34 +        cols = zip(*sLines)
   14.35 +        self.seqNames = cols[1]
   14.36 +        self.alnStarts = map(int, cols[2])
   14.37 +        self.alnSizes = map(int, cols[3])
   14.38 +        self.strands = cols[4]
   14.39 +        self.seqSizes = map(int, cols[5])
   14.40 +        self.alnStrings = cols[6]
   14.41 +
   14.42 +        self.qLines = qLines
   14.43 +        self.pLines = pLines
   14.44 +
   14.45 +def dieUnlessPairwise(maf):
   14.46 +    if len(maf.alnStrings) != 2:
   14.47 +        raise Exception("pairwise alignments only, please")
   14.48 +
   14.49 +def insertionCounts(alnString):
   14.50 +    gaps = alnString.count("-")
   14.51 +    forwardFrameshifts = alnString.count("\\")
   14.52 +    reverseFrameshifts = alnString.count("/")
   14.53 +    letters = len(alnString) - gaps - forwardFrameshifts - reverseFrameshifts
   14.54 +    return letters, forwardFrameshifts, reverseFrameshifts
   14.55 +
   14.56 +def coordinatesPerLetter(alnString, alnSize):
   14.57 +    letters, forwardShifts, reverseShifts = insertionCounts(alnString)
   14.58 +    if forwardShifts or reverseShifts or letters < alnSize: return 3
   14.59 +    else: return 1
   14.60 +
   14.61 +def mafLetterSizes(maf):
   14.62 +    return map(coordinatesPerLetter, maf.alnStrings, maf.alnSizes)
   14.63 +
   14.64 +def insertionSize(alnString, letterSize):
   14.65 +    """Get the length of sequence included in the alnString."""
   14.66 +    letters, forwardShifts, reverseShifts = insertionCounts(alnString)
   14.67 +    return letters * letterSize + forwardShifts - reverseShifts
   14.68 +
   14.69 +def isMatch(alignmentColumn):
   14.70 +    # No special treatment of ambiguous bases/residues: same as NCBI BLAST.
   14.71 +    first = alignmentColumn[0].upper()
   14.72 +    for i in alignmentColumn[1:]:
   14.73 +        if i.upper() != first: return False
   14.74 +    return True
   14.75 +
   14.76 +def isGapless(alignmentColumn):
   14.77 +    return "-" not in alignmentColumn
   14.78 +
   14.79 +def matchAndInsertSizes(alignmentColumns, letterSizes):
   14.80 +    """Get sizes of gapless blocks, and of the inserts between them."""
   14.81 +    for k, v in groupby(alignmentColumns, isGapless):
   14.82 +        if k:
   14.83 +            sizeOfGaplessBlock = len(list(v))
   14.84 +            yield [sizeOfGaplessBlock]
   14.85 +        else:
   14.86 +            blockRows = zip(*v)
   14.87 +            blockRows = map(''.join, blockRows)
   14.88 +            yield map(insertionSize, blockRows, letterSizes)
   14.89 +
   14.90 +##### Routines for converting to AXT format: #####
   14.91 +
   14.92 +axtCounter = count()
   14.93 +
   14.94 +def writeAxt(maf):
   14.95 +    if maf.strands[0] != "+":
   14.96 +        raise Exception("for AXT, the 1st strand in each alignment must be +")
   14.97 +
   14.98 +    # Convert to AXT's 1-based coordinates:
   14.99 +    alnStarts = imap(operator.add, maf.alnStarts, repeat(1))
  14.100 +    alnEnds = imap(operator.add, maf.alnStarts, maf.alnSizes)
  14.101 +
  14.102 +    rows = zip(maf.seqNames, alnStarts, alnEnds, maf.strands)
  14.103 +    head, tail = rows[0], rows[1:]
  14.104 +
  14.105 +    outWords = []
  14.106 +    outWords.append(axtCounter.next())
  14.107 +    outWords.extend(head[:3])
  14.108 +    outWords.extend(chain(*tail))
  14.109 +    outWords.append(maf.namesAndValues["score"])
  14.110 +
  14.111 +    print joined(outWords, " ")
  14.112 +    for i in maf.alnStrings: print i
  14.113 +    print  # print a blank line at the end
  14.114 +
  14.115 +##### Routines for converting to tabular format: #####
  14.116 +
  14.117 +def writeTab(maf):
  14.118 +    outWords = []
  14.119 +    outWords.append(maf.namesAndValues["score"])
  14.120 +
  14.121 +    cols = maf.seqNames, maf.alnStarts, maf.alnSizes, maf.strands, maf.seqSizes
  14.122 +    rows = zip(*cols)
  14.123 +    outWords.extend(chain(*rows))
  14.124 +
  14.125 +    alignmentColumns = zip(*maf.alnStrings)
  14.126 +    letterSizes = mafLetterSizes(maf)
  14.127 +    gapInfo = matchAndInsertSizes(alignmentColumns, letterSizes)
  14.128 +    gapStrings = imap(joined, gapInfo, repeat(":"))
  14.129 +    gapWord = ",".join(gapStrings)
  14.130 +    outWords.append(gapWord)
  14.131 +
  14.132 +    try: outWords.append(maf.namesAndValues["expect"])
  14.133 +    except KeyError: pass
  14.134 +
  14.135 +    try: outWords.append(maf.namesAndValues["mismap"])
  14.136 +    except KeyError: pass
  14.137 +
  14.138 +    print joined(outWords, "\t")
  14.139 +
  14.140 +##### Routines for converting to PSL format: #####
  14.141 +
  14.142 +def pslBlocks(alignmentColumns, alnStarts, letterSizes):
  14.143 +    """Get sizes and start coordinates of gapless blocks in an alignment."""
  14.144 +    coordinates = alnStarts
  14.145 +    for i in matchAndInsertSizes(alignmentColumns, letterSizes):
  14.146 +        if len(i) == 1:  # we have the size of a gapless block
  14.147 +            yield i + coordinates
  14.148 +            increments = imap(operator.mul, letterSizes, cycle(i))
  14.149 +            coordinates = map(operator.add, coordinates, increments)
  14.150 +        else:  # we have sizes of inserts between gapless blocks
  14.151 +            coordinates = map(operator.add, coordinates, i)
  14.152 +
  14.153 +def pslCommaString(things):
  14.154 +    # UCSC software seems to prefer a trailing comma
  14.155 +    return joined(things, ",") + ","
  14.156 +
  14.157 +def gapRunCount(letters):
  14.158 +    """Get the number of runs of gap characters."""
  14.159 +    uniqLetters = map(operator.itemgetter(0), groupby(letters))
  14.160 +    return uniqLetters.count("-")
  14.161 +
  14.162 +def pslEndpoints(alnStart, alnSize, strand, seqSize):
  14.163 +    alnEnd = alnStart + alnSize
  14.164 +    if strand == "+": return alnStart, alnEnd
  14.165 +    else: return seqSize - alnEnd, seqSize - alnStart
  14.166 +
  14.167 +def caseSensitivePairwiseMatchCounts(columns, isProtein):
  14.168 +    # repMatches is always zero
  14.169 +    # for proteins, nCount is always zero, because that's what BLATv34 does
  14.170 +    standardBases = "ACGTU"
  14.171 +    matches = mismatches = repMatches = nCount = 0
  14.172 +    for i in columns:
  14.173 +        if "-" in i: continue
  14.174 +        x, y = i
  14.175 +        if x in standardBases and y in standardBases or isProtein:
  14.176 +            if x == y: matches += 1
  14.177 +            else: mismatches += 1
  14.178 +        else: nCount += 1
  14.179 +    return matches, mismatches, repMatches, nCount
  14.180 +
  14.181 +def writePsl(maf, isProtein):
  14.182 +    dieUnlessPairwise(maf)
  14.183 +
  14.184 +    alnStrings = map(str.upper, maf.alnStrings)
  14.185 +    alignmentColumns = zip(*alnStrings)
  14.186 +    letterSizes = mafLetterSizes(maf)
  14.187 +
  14.188 +    matchCounts = caseSensitivePairwiseMatchCounts(alignmentColumns, isProtein)
  14.189 +    matches, mismatches, repMatches, nCount = matchCounts
  14.190 +    numGaplessColumns = sum(matchCounts)
  14.191 +
  14.192 +    qNumInsert = gapRunCount(maf.alnStrings[0])
  14.193 +    qBaseInsert = maf.alnSizes[1] - numGaplessColumns * letterSizes[1]
  14.194 +
  14.195 +    tNumInsert = gapRunCount(maf.alnStrings[1])
  14.196 +    tBaseInsert = maf.alnSizes[0] - numGaplessColumns * letterSizes[0]
  14.197 +
  14.198 +    strand = maf.strands[1]
  14.199 +    if max(letterSizes) > 1:
  14.200 +        strand += maf.strands[0]
  14.201 +    elif maf.strands[0] != "+":
  14.202 +        raise Exception("for non-translated PSL, the 1st strand in each alignment must be +")
  14.203 +
  14.204 +    tName, qName = maf.seqNames
  14.205 +    tSeqSize, qSeqSize = maf.seqSizes
  14.206 +
  14.207 +    rows = zip(maf.alnStarts, maf.alnSizes, maf.strands, maf.seqSizes)
  14.208 +    tStart, tEnd = pslEndpoints(*rows[0])
  14.209 +    qStart, qEnd = pslEndpoints(*rows[1])
  14.210 +
  14.211 +    blocks = list(pslBlocks(alignmentColumns, maf.alnStarts, letterSizes))
  14.212 +    blockCount = len(blocks)
  14.213 +    blockSizes, tStarts, qStarts = map(pslCommaString, zip(*blocks))
  14.214 +
  14.215 +    outWords = (matches, mismatches, repMatches, nCount,
  14.216 +                qNumInsert, qBaseInsert, tNumInsert, tBaseInsert, strand,
  14.217 +                qName, qSeqSize, qStart, qEnd, tName, tSeqSize, tStart, tEnd,
  14.218 +                blockCount, blockSizes, qStarts, tStarts)
  14.219 +    print joined(outWords, "\t")
  14.220 +
  14.221 +##### Routines for converting to SAM format: #####
  14.222 +
  14.223 +def readGroupId(readGroupItems):
  14.224 +    for i in readGroupItems:
  14.225 +        if i.startswith("ID:"):
  14.226 +            return i[3:]
  14.227 +    raise Exception("readgroup must include ID")
  14.228 +
  14.229 +def readSequenceLengths(lines):
  14.230 +    """Read name & length of topmost sequence in each maf block."""
  14.231 +    sequenceLengths = {}  # an OrderedDict might be nice
  14.232 +    isSearching = True
  14.233 +    for line in lines:
  14.234 +        if line.isspace(): isSearching = True
  14.235 +        elif isSearching:
  14.236 +            w = line.split()
  14.237 +            if w[0] == "s":
  14.238 +                seqName, seqSize = w[1], w[5]
  14.239 +                sequenceLengths[seqName] = seqSize
  14.240 +                isSearching = False
  14.241 +    return sequenceLengths
  14.242 +
  14.243 +def naturalSortKey(s):
  14.244 +    """Return a key that sorts strings in "natural" order."""
  14.245 +    return [(str, int)[k]("".join(v)) for k, v in groupby(s, str.isdigit)]
  14.246 +
  14.247 +def karyotypicSortKey(s):
  14.248 +    """Attempt to sort chromosomes in GATK's ridiculous order."""
  14.249 +    if s == "chrM": return []
  14.250 +    if s == "MT": return ["~"]
  14.251 +    return naturalSortKey(s)
  14.252 +
  14.253 +def writeSamHeader(sequenceLengths, dictFile, readGroupItems):
  14.254 +    print "@HD\tVN:1.3\tSO:unsorted"
  14.255 +    for k in sorted(sequenceLengths, key=karyotypicSortKey):
  14.256 +        print "@SQ\tSN:%s\tLN:%s" % (k, sequenceLengths[k])
  14.257 +    if dictFile:
  14.258 +        for i in fileinput.input(dictFile):
  14.259 +            if i.startswith("@SQ"): print i,
  14.260 +            elif not i.startswith("@"): break
  14.261 +    if readGroupItems:
  14.262 +        print "@RG\t" + "\t".join(readGroupItems)
  14.263 +
  14.264 +mapqMissing = "255"
  14.265 +mapqMaximum = "254"
  14.266 +mapqMaximumNum = float(mapqMaximum)
  14.267 +
  14.268 +def mapqFromProb(probString):
  14.269 +    try: p = float(probString)
  14.270 +    except ValueError: raise Exception("bad probability: " + probString)
  14.271 +    if p < 0 or p > 1: raise Exception("bad probability: " + probString)
  14.272 +    if p == 0: return mapqMaximum
  14.273 +    phred = -10 * math.log(p, 10)
  14.274 +    if phred >= mapqMaximumNum: return mapqMaximum
  14.275 +    return str(int(round(phred)))
  14.276 +
  14.277 +def cigarCategory(alignmentColumn):
  14.278 +    x, y = alignmentColumn
  14.279 +    if x == "-":
  14.280 +        if y == "-": return "P"
  14.281 +        else: return "I"
  14.282 +    else:
  14.283 +        if y == "-": return "D"
  14.284 +        else: return "M"
  14.285 +
  14.286 +def cigarParts(beg, alignmentColumns, end):
  14.287 +    if beg:
  14.288 +        yield str(beg) + "H"
  14.289 +    # (doesn't handle translated alignments)
  14.290 +    for k, v in groupby(alignmentColumns, cigarCategory):
  14.291 +        yield str(sum(1 for _ in v)) + k
  14.292 +    if end:
  14.293 +        yield str(end) + "H"
  14.294 +
  14.295 +def writeSam(maf, rg):
  14.296 +    if 3 in mafLetterSizes(maf):
  14.297 +        raise Exception("this looks like translated DNA - can't convert to SAM format")
  14.298 +
  14.299 +    try: score = "AS:i:" + str(int(maf.namesAndValues["score"]))
  14.300 +    except (KeyError, ValueError): score = None  # it must be an integer
  14.301 +
  14.302 +    try: evalue = "EV:Z:" + maf.namesAndValues["expect"]
  14.303 +    except KeyError: evalue = None
  14.304 +
  14.305 +    try: mapq = mapqFromProb(maf.namesAndValues["mismap"])
  14.306 +    except KeyError: mapq = mapqMissing
  14.307 +
  14.308 +    rows = zip(maf.seqNames, maf.alnStarts, maf.alnSizes,
  14.309 +               maf.strands, maf.seqSizes, maf.alnStrings)
  14.310 +    head, tail = rows[0], rows[1:]
  14.311 +
  14.312 +    rName, rStart, rAlnSize, rStrand, rSeqSize, rAlnString = head
  14.313 +    if rStrand != "+":
  14.314 +        raise Exception("for SAM, the 1st strand in each alignment must be +")
  14.315 +    pos = str(rStart + 1)  # convert to 1-based coordinate
  14.316 +
  14.317 +    for qName, qStart, qAlnSize, qStrand, qSeqSize, qAlnString in tail:
  14.318 +        alignmentColumns = zip(rAlnString.upper(), qAlnString.upper())
  14.319 +
  14.320 +        qRevStart = qSeqSize - qStart - qAlnSize
  14.321 +        cigar = "".join(cigarParts(qStart, alignmentColumns, qRevStart))
  14.322 +
  14.323 +        seq = deleted(qAlnString, "-")
  14.324 +
  14.325 +        qual = "*"
  14.326 +        if maf.qLines:
  14.327 +            q, qualityName, qualityString = maf.qLines[0]
  14.328 +            # don't try to handle multiple alignments for now (YAGNI):
  14.329 +            if len(maf.qLines) > 1 or len(tail) > 1 or qualityName != qName:
  14.330 +                raise Exception("can't interpret the quality data")
  14.331 +            qual = ''.join(j for i, j in zip(qAlnString, qualityString)
  14.332 +                           if i != "-")
  14.333 +
  14.334 +        # It's hard to get all the pair info, so this is very
  14.335 +        # incomplete, but hopefully good enough.
  14.336 +        # I'm not sure whether to add 2 and/or 8 to flag.
  14.337 +        if qName.endswith("/1"):
  14.338 +            qName = qName[:-2]
  14.339 +            if qStrand == "+": flag = "99"  # 1 + 2 + 32 + 64
  14.340 +            else:              flag = "83"  # 1 + 2 + 16 + 64
  14.341 +        elif qName.endswith("/2"):
  14.342 +            qName = qName[:-2]
  14.343 +            if qStrand == "+": flag = "163"  # 1 + 2 + 32 + 128
  14.344 +            else:              flag = "147"  # 1 + 2 + 16 + 128
  14.345 +        else:
  14.346 +            if qStrand == "+": flag = "0"
  14.347 +            else:              flag = "16"
  14.348 +
  14.349 +        editDistance = sum(1 for x, y in alignmentColumns if x != y)
  14.350 +        # no special treatment of ambiguous bases: might be a minor bug
  14.351 +        editDistance = "NM:i:" + str(editDistance)
  14.352 +
  14.353 +        outWords = [qName, flag, rName, pos, mapq, cigar, "*\t0\t0", seq, qual]
  14.354 +        outWords.append(editDistance)
  14.355 +        if score: outWords.append(score)
  14.356 +        if evalue: outWords.append(evalue)
  14.357 +        if rg: outWords.append(rg)
  14.358 +        print "\t".join(outWords)
  14.359 +
  14.360 +##### Routines for converting to BLAST-like format: #####
  14.361 +
  14.362 +def pairwiseMatchSymbol(alignmentColumn):
  14.363 +    if isMatch(alignmentColumn): return "|"
  14.364 +    else: return " "
  14.365 +
  14.366 +def strandText(strand):
  14.367 +    if strand == "+": return "Plus"
  14.368 +    else: return "Minus"
  14.369 +
  14.370 +def blastCoordinate(oneBasedCoordinate, strand, seqSize):
  14.371 +    if strand == "-":
  14.372 +        oneBasedCoordinate = seqSize - oneBasedCoordinate + 1
  14.373 +    return str(oneBasedCoordinate)
  14.374 +
  14.375 +def chunker(things, chunkSize):
  14.376 +    for i in range(0, len(things), chunkSize):
  14.377 +        yield things[i:i+chunkSize]
  14.378 +
  14.379 +def blastChunker(maf, lineSize, alignmentColumns):
  14.380 +    letterSizes = mafLetterSizes(maf)
  14.381 +    coords = maf.alnStarts
  14.382 +    for chunkCols in chunker(alignmentColumns, lineSize):
  14.383 +        chunkRows = zip(*chunkCols)
  14.384 +        chunkRows = map(''.join, chunkRows)
  14.385 +        starts = [i + 1 for i in coords]  # change to 1-based coordinates
  14.386 +        starts = map(blastCoordinate, starts, maf.strands, maf.seqSizes)
  14.387 +        increments = map(insertionSize, chunkRows, letterSizes)
  14.388 +        coords = map(operator.add, coords, increments)
  14.389 +        ends = map(blastCoordinate, coords, maf.strands, maf.seqSizes)
  14.390 +        yield chunkCols, chunkRows, starts, ends
  14.391 +
  14.392 +def writeBlast(maf, lineSize, oldQueryName):
  14.393 +    dieUnlessPairwise(maf)
  14.394 +
  14.395 +    if maf.seqNames[1] != oldQueryName:
  14.396 +        print "Query= %s" % maf.seqNames[1]
  14.397 +        print "         (%s letters)" % maf.seqSizes[1]
  14.398 +        print
  14.399 +
  14.400 +    print ">%s" % maf.seqNames[0]
  14.401 +    print "          Length = %s" % maf.seqSizes[0]
  14.402 +    print
  14.403 +
  14.404 +    scoreLine = " Score = %s" % maf.namesAndValues["score"]
  14.405 +    try: scoreLine += ", Expect = %s" % maf.namesAndValues["expect"]
  14.406 +    except KeyError: pass
  14.407 +    print scoreLine
  14.408 +
  14.409 +    alignmentColumns = zip(*maf.alnStrings)
  14.410 +
  14.411 +    alnSize = len(alignmentColumns)
  14.412 +    matches = quantify(alignmentColumns, isMatch)
  14.413 +    matchPercent = 100 * matches // alnSize  # round down, like BLAST
  14.414 +    identLine = " Identities = %s/%s (%s%%)" % (matches, alnSize, matchPercent)
  14.415 +    gaps = alnSize - quantify(alignmentColumns, isGapless)
  14.416 +    gapPercent = 100 * gaps // alnSize  # round down, like BLAST
  14.417 +    if gaps: identLine += ", Gaps = %s/%s (%s%%)" % (gaps, alnSize, gapPercent)
  14.418 +    print identLine
  14.419 +
  14.420 +    strands = map(strandText, maf.strands)
  14.421 +    print " Strand = %s / %s" % (strands[1], strands[0])
  14.422 +    print
  14.423 +
  14.424 +    for chunk in blastChunker(maf, lineSize, alignmentColumns):
  14.425 +        cols, rows, starts, ends = chunk
  14.426 +        startWidth = maxlen(starts)
  14.427 +        matchSymbols = map(pairwiseMatchSymbol, cols)
  14.428 +        matchSymbols = ''.join(matchSymbols)
  14.429 +        print "Query: %-*s %s %s" % (startWidth, starts[1], rows[1], ends[1])
  14.430 +        print "       %-*s %s"    % (startWidth, " ", matchSymbols)
  14.431 +        print "Sbjct: %-*s %s %s" % (startWidth, starts[0], rows[0], ends[0])
  14.432 +        print
  14.433 +
  14.434 +##### Routines for converting to HTML format: #####
  14.435 +
  14.436 +def writeHtmlHeader():
  14.437 +    print '''
  14.438 +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  14.439 + "http://www.w3.org/TR/html4/strict.dtd">
  14.440 +<html lang="en"><head>
  14.441 +<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
  14.442 +<title>Reliable Alignments</title>
  14.443 +<style type="text/css">
  14.444 +/* Try to force monospace, working around browser insanity: */
  14.445 +pre {font-family: "Courier New", monospace, serif; font-size: 0.8125em}
  14.446 +.a {background-color: #3333FF}
  14.447 +.b {background-color: #9933FF}
  14.448 +.c {background-color: #FF66CC}
  14.449 +.d {background-color: #FF3333}
  14.450 +.e {background-color: #FF9933}
  14.451 +.f {background-color: #FFFF00}
  14.452 +.key {display:inline; margin-right:2em}
  14.453 +</style>
  14.454 +</head><body>
  14.455 +
  14.456 +<div style="line-height:1">
  14.457 +<pre class="key"><span class="a">  </span> prob &gt; 0.999</pre>
  14.458 +<pre class="key"><span class="b">  </span> prob &gt; 0.99 </pre>
  14.459 +<pre class="key"><span class="c">  </span> prob &gt; 0.95 </pre>
  14.460 +<pre class="key"><span class="d">  </span> prob &gt; 0.9  </pre>
  14.461 +<pre class="key"><span class="e">  </span> prob &gt; 0.5  </pre>
  14.462 +<pre class="key"><span class="f">  </span> prob &le; 0.5  </pre>
  14.463 +</div>
  14.464 +'''
  14.465 +
  14.466 +def probabilityClass(probabilityColumn):
  14.467 +    p = ord(min(probabilityColumn)) - 33
  14.468 +    if   p >= 30: return 'a'
  14.469 +    elif p >= 20: return 'b'
  14.470 +    elif p >= 13: return 'c'
  14.471 +    elif p >= 10: return 'd'
  14.472 +    elif p >=  3: return 'e'
  14.473 +    else: return 'f'
  14.474 +
  14.475 +def identicalRuns(s):
  14.476 +    """Yield (item, start, end) for each run of identical items in s."""
  14.477 +    beg = 0
  14.478 +    for k, v in groupby(s):
  14.479 +        end = beg + len(list(v))
  14.480 +        yield k, beg, end
  14.481 +        beg = end
  14.482 +
  14.483 +def htmlSpan(text, classRun):
  14.484 +    key, beg, end = classRun
  14.485 +    textbit = text[beg:end]
  14.486 +    if key: return '<span class="%s">%s</span>' % (key, textbit)
  14.487 +    else: return textbit
  14.488 +
  14.489 +def multipleMatchSymbol(alignmentColumn):
  14.490 +    if isMatch(alignmentColumn): return "*"
  14.491 +    else: return " "
  14.492 +
  14.493 +def writeHtml(maf, lineSize):
  14.494 +    scoreLine = "Alignment"
  14.495 +    try:
  14.496 +        scoreLine += " score=%s" % maf.namesAndValues["score"]
  14.497 +        scoreLine += ", expect=%s" % maf.namesAndValues["expect"]
  14.498 +    except KeyError: pass
  14.499 +    print "<h3>%s:</h3>" % scoreLine
  14.500 +
  14.501 +    if maf.pLines:
  14.502 +        probRows = [i[1] for i in maf.pLines]
  14.503 +        probCols = izip(*probRows)
  14.504 +        classes = imap(probabilityClass, probCols)
  14.505 +    else:
  14.506 +        classes = repeat(None)
  14.507 +
  14.508 +    nameWidth = maxlen(maf.seqNames)
  14.509 +    alignmentColumns = zip(*maf.alnStrings)
  14.510 +
  14.511 +    print '<pre>'
  14.512 +    for chunk in blastChunker(maf, lineSize, alignmentColumns):
  14.513 +        cols, rows, starts, ends = chunk
  14.514 +        startWidth = maxlen(starts)
  14.515 +        endWidth = maxlen(ends)
  14.516 +        matchSymbols = map(multipleMatchSymbol, cols)
  14.517 +        matchSymbols = ''.join(matchSymbols)
  14.518 +        classChunk = islice(classes, lineSize)
  14.519 +        classRuns = list(identicalRuns(classChunk))
  14.520 +        for n, s, r, e in zip(maf.seqNames, starts, rows, ends):
  14.521 +            spans = [htmlSpan(r, i) for i in classRuns]
  14.522 +            spans = ''.join(spans)
  14.523 +            formatParams = nameWidth, n, startWidth, s, spans, endWidth, e
  14.524 +            print '%-*s %*s %s %*s' % formatParams
  14.525 +        print ' ' * nameWidth, ' ' * startWidth, matchSymbols
  14.526 +        print
  14.527 +    print '</pre>'
  14.528 +
  14.529 +##### Routines for reading MAF format: #####
  14.530 +
  14.531 +def mafInput(lines, isKeepComments):
  14.532 +    a = []
  14.533 +    s = []
  14.534 +    q = []
  14.535 +    p = []
  14.536 +    for i in lines:
  14.537 +        w = i.split()
  14.538 +        if not w:
  14.539 +            if a: yield Maf(a, s, q, p)
  14.540 +            a = []
  14.541 +            s = []
  14.542 +            q = []
  14.543 +            p = []
  14.544 +        elif w[0] == "a":
  14.545 +            a = w
  14.546 +        elif w[0] == "s":
  14.547 +            s.append(w)
  14.548 +        elif w[0] == "q":
  14.549 +            q.append(w)
  14.550 +        elif w[0] == "p":
  14.551 +            p.append(w)
  14.552 +        elif i[0] == "#" and isKeepComments:
  14.553 +            print i,
  14.554 +    if a: yield Maf(a, s, q, p)
  14.555 +
  14.556 +def isFormat(myString, myFormat):
  14.557 +    return myFormat.startswith(myString)
  14.558 +
  14.559 +def mafConvert(opts, args):
  14.560 +    format = args[0].lower()
  14.561 +    if isFormat(format, "sam"):
  14.562 +        if opts.dictionary: d = readSequenceLengths(fileinput.input(args[1:]))
  14.563 +        else: d = {}
  14.564 +        if opts.readgroup:
  14.565 +            readGroupItems = opts.readgroup.split()
  14.566 +            rg = "RG:Z:" + readGroupId(readGroupItems)
  14.567 +        else:
  14.568 +            readGroupItems = []
  14.569 +            rg = ""
  14.570 +        if not opts.noheader: writeSamHeader(d, opts.dictfile, readGroupItems)
  14.571 +    inputLines = fileinput.input(args[1:])
  14.572 +    if isFormat(format, "html") and not opts.noheader: writeHtmlHeader()
  14.573 +    isKeepCommentLines = isFormat(format, "tabular") and not opts.noheader
  14.574 +    oldQueryName = ""
  14.575 +    for maf in mafInput(inputLines, isKeepCommentLines):
  14.576 +        if   isFormat(format, "axt"): writeAxt(maf)
  14.577 +        elif isFormat(format, "blast"):
  14.578 +            writeBlast(maf, opts.linesize, oldQueryName)
  14.579 +            oldQueryName = maf.seqNames[1]
  14.580 +        elif isFormat(format, "html"): writeHtml(maf, opts.linesize)
  14.581 +        elif isFormat(format, "psl"): writePsl(maf, opts.protein)
  14.582 +        elif isFormat(format, "sam"): writeSam(maf, rg)
  14.583 +        elif isFormat(format, "tabular"): writeTab(maf)
  14.584 +        else: raise Exception("unknown format: " + format)
  14.585 +    if isFormat(format, "html") and not opts.noheader: print "</body></html>"
  14.586 +
  14.587 +if __name__ == "__main__":
  14.588 +    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  14.589 +
  14.590 +    usage = """
  14.591 +  %prog --help
  14.592 +  %prog axt mafFile(s)
  14.593 +  %prog blast mafFile(s)
  14.594 +  %prog html mafFile(s)
  14.595 +  %prog psl mafFile(s)
  14.596 +  %prog sam mafFile(s)
  14.597 +  %prog tab mafFile(s)"""
  14.598 +
  14.599 +    description = "Read MAF-format alignments & write them in another format."
  14.600 +
  14.601 +    op = optparse.OptionParser(usage=usage, description=description)
  14.602 +    op.add_option("-p", "--protein", action="store_true",
  14.603 +                  help="assume protein alignments, for psl match counts")
  14.604 +    op.add_option("-n", "--noheader", action="store_true",
  14.605 +                  help="omit any header lines from the output")
  14.606 +    op.add_option("-d", "--dictionary", action="store_true",
  14.607 +                  help="include dictionary of sequence lengths in sam format")
  14.608 +    op.add_option("-f", "--dictfile",
  14.609 +                  help="get sequence dictionary from DICTFILE")
  14.610 +    op.add_option("-r", "--readgroup",
  14.611 +                  help="read group info for sam format")
  14.612 +    op.add_option("-l", "--linesize", type="int", default=60, #metavar="CHARS",
  14.613 +                  help="line length for blast and html formats (default: %default)")
  14.614 +    (opts, args) = op.parse_args()
  14.615 +    if opts.linesize <= 0: op.error("option -l: should be >= 1")
  14.616 +    if opts.dictionary and opts.dictfile: op.error("can't use both -d and -f")
  14.617 +    if len(args) < 1: op.error("I need a format-name and some MAF alignments")
  14.618 +    if opts.dictionary and (len(args) == 1 or "-" in args[1:]):
  14.619 +        op.error("need file (not pipe) with option -d")
  14.620 +
  14.621 +    try: mafConvert(opts, args)
  14.622 +    except KeyboardInterrupt: pass  # avoid silly error message
  14.623 +    except Exception, e:
  14.624 +        prog = os.path.basename(sys.argv[0])
  14.625 +        sys.exit(prog + ": error: " + str(e))
    15.1 --- a/scripts/maf-convert.py	Wed Sep 24 11:41:48 2014 +0900
    15.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
    15.3 @@ -1,622 +0,0 @@
    15.4 -#! /usr/bin/env python
    15.5 -# Copyright 2010, 2011, 2013, 2014 Martin C. Frith
    15.6 -# Read MAF-format alignments: write them in other formats.
    15.7 -# Seems to work with Python 2.x, x>=4
    15.8 -
    15.9 -# By "MAF" we mean "multiple alignment format" described in the UCSC
   15.10 -# Genome FAQ, not e.g. "MIRA assembly format".
   15.11 -
   15.12 -from itertools import *
   15.13 -import sys, os, fileinput, math, operator, optparse, signal, string
   15.14 -
   15.15 -def maxlen(s):
   15.16 -    return max(map(len, s))
   15.17 -
   15.18 -def joined(things, delimiter):
   15.19 -    return delimiter.join(map(str, things))
   15.20 -
   15.21 -identityTranslation = string.maketrans("", "")
   15.22 -def deleted(myString, deleteChars):
   15.23 -    return myString.translate(identityTranslation, deleteChars)
   15.24 -
   15.25 -def quantify(iterable, pred=bool):
   15.26 -    """Count how many times the predicate is true."""
   15.27 -    return sum(map(pred, iterable))
   15.28 -
   15.29 -class Maf:
   15.30 -    def __init__(self, aLine, sLines, qLines, pLines):
   15.31 -        self.namesAndValues = dict(i.split("=") for i in aLine[1:])
   15.32 -
   15.33 -        if not sLines: raise Exception("empty alignment")
   15.34 -        cols = zip(*sLines)
   15.35 -        self.seqNames = cols[1]
   15.36 -        self.alnStarts = map(int, cols[2])
   15.37 -        self.alnSizes = map(int, cols[3])
   15.38 -        self.strands = cols[4]
   15.39 -        self.seqSizes = map(int, cols[5])
   15.40 -        self.alnStrings = cols[6]
   15.41 -
   15.42 -        self.qLines = qLines
   15.43 -        self.pLines = pLines
   15.44 -
   15.45 -def dieUnlessPairwise(maf):
   15.46 -    if len(maf.alnStrings) != 2:
   15.47 -        raise Exception("pairwise alignments only, please")
   15.48 -
   15.49 -def insertionCounts(alnString):
   15.50 -    gaps = alnString.count("-")
   15.51 -    forwardFrameshifts = alnString.count("\\")
   15.52 -    reverseFrameshifts = alnString.count("/")
   15.53 -    letters = len(alnString) - gaps - forwardFrameshifts - reverseFrameshifts
   15.54 -    return letters, forwardFrameshifts, reverseFrameshifts
   15.55 -
   15.56 -def coordinatesPerLetter(alnString, alnSize):
   15.57 -    letters, forwardShifts, reverseShifts = insertionCounts(alnString)
   15.58 -    if forwardShifts or reverseShifts or letters < alnSize: return 3
   15.59 -    else: return 1
   15.60 -
   15.61 -def mafLetterSizes(maf):
   15.62 -    return map(coordinatesPerLetter, maf.alnStrings, maf.alnSizes)
   15.63 -
   15.64 -def insertionSize(alnString, letterSize):
   15.65 -    """Get the length of sequence included in the alnString."""
   15.66 -    letters, forwardShifts, reverseShifts = insertionCounts(alnString)
   15.67 -    return letters * letterSize + forwardShifts - reverseShifts
   15.68 -
   15.69 -def isMatch(alignmentColumn):
   15.70 -    # No special treatment of ambiguous bases/residues: same as NCBI BLAST.
   15.71 -    first = alignmentColumn[0].upper()
   15.72 -    for i in alignmentColumn[1:]:
   15.73 -        if i.upper() != first: return False
   15.74 -    return True
   15.75 -
   15.76 -def isGapless(alignmentColumn):
   15.77 -    return "-" not in alignmentColumn
   15.78 -
   15.79 -def matchAndInsertSizes(alignmentColumns, letterSizes):
   15.80 -    """Get sizes of gapless blocks, and of the inserts between them."""
   15.81 -    for k, v in groupby(alignmentColumns, isGapless):
   15.82 -        if k:
   15.83 -            sizeOfGaplessBlock = len(list(v))
   15.84 -            yield [sizeOfGaplessBlock]
   15.85 -        else:
   15.86 -            blockRows = zip(*v)
   15.87 -            blockRows = map(''.join, blockRows)
   15.88 -            yield map(insertionSize, blockRows, letterSizes)
   15.89 -
   15.90 -##### Routines for converting to AXT format: #####
   15.91 -
   15.92 -axtCounter = count()
   15.93 -
   15.94 -def writeAxt(maf):
   15.95 -    if maf.strands[0] != "+":
   15.96 -        raise Exception("for AXT, the 1st strand in each alignment must be +")
   15.97 -
   15.98 -    # Convert to AXT's 1-based coordinates:
   15.99 -    alnStarts = imap(operator.add, maf.alnStarts, repeat(1))
  15.100 -    alnEnds = imap(operator.add, maf.alnStarts, maf.alnSizes)
  15.101 -
  15.102 -    rows = zip(maf.seqNames, alnStarts, alnEnds, maf.strands)
  15.103 -    head, tail = rows[0], rows[1:]
  15.104 -
  15.105 -    outWords = []
  15.106 -    outWords.append(axtCounter.next())
  15.107 -    outWords.extend(head[:3])
  15.108 -    outWords.extend(chain(*tail))
  15.109 -    outWords.append(maf.namesAndValues["score"])
  15.110 -
  15.111 -    print joined(outWords, " ")
  15.112 -    for i in maf.alnStrings: print i
  15.113 -    print  # print a blank line at the end
  15.114 -
  15.115 -##### Routines for converting to tabular format: #####
  15.116 -
  15.117 -def writeTab(maf):
  15.118 -    outWords = []
  15.119 -    outWords.append(maf.namesAndValues["score"])
  15.120 -
  15.121 -    cols = maf.seqNames, maf.alnStarts, maf.alnSizes, maf.strands, maf.seqSizes
  15.122 -    rows = zip(*cols)
  15.123 -    outWords.extend(chain(*rows))
  15.124 -
  15.125 -    alignmentColumns = zip(*maf.alnStrings)
  15.126 -    letterSizes = mafLetterSizes(maf)
  15.127 -    gapInfo = matchAndInsertSizes(alignmentColumns, letterSizes)
  15.128 -    gapStrings = imap(joined, gapInfo, repeat(":"))
  15.129 -    gapWord = ",".join(gapStrings)
  15.130 -    outWords.append(gapWord)
  15.131 -
  15.132 -    try: outWords.append(maf.namesAndValues["expect"])
  15.133 -    except KeyError: pass
  15.134 -
  15.135 -    try: outWords.append(maf.namesAndValues["mismap"])
  15.136 -    except KeyError: pass
  15.137 -
  15.138 -    print joined(outWords, "\t")
  15.139 -
  15.140 -##### Routines for converting to PSL format: #####
  15.141 -
  15.142 -def pslBlocks(alignmentColumns, alnStarts, letterSizes):
  15.143 -    """Get sizes and start coordinates of gapless blocks in an alignment."""
  15.144 -    coordinates = alnStarts
  15.145 -    for i in matchAndInsertSizes(alignmentColumns, letterSizes):
  15.146 -        if len(i) == 1:  # we have the size of a gapless block
  15.147 -            yield i + coordinates
  15.148 -            increments = imap(operator.mul, letterSizes, cycle(i))
  15.149 -            coordinates = map(operator.add, coordinates, increments)
  15.150 -        else:  # we have sizes of inserts between gapless blocks
  15.151 -            coordinates = map(operator.add, coordinates, i)
  15.152 -
  15.153 -def pslCommaString(things):
  15.154 -    # UCSC software seems to prefer a trailing comma
  15.155 -    return joined(things, ",") + ","
  15.156 -
  15.157 -def gapRunCount(letters):
  15.158 -    """Get the number of runs of gap characters."""
  15.159 -    uniqLetters = map(operator.itemgetter(0), groupby(letters))
  15.160 -    return uniqLetters.count("-")
  15.161 -
  15.162 -def pslEndpoints(alnStart, alnSize, strand, seqSize):
  15.163 -    alnEnd = alnStart + alnSize
  15.164 -    if strand == "+": return alnStart, alnEnd
  15.165 -    else: return seqSize - alnEnd, seqSize - alnStart
  15.166 -
  15.167 -def caseSensitivePairwiseMatchCounts(columns, isProtein):
  15.168 -    # repMatches is always zero
  15.169 -    # for proteins, nCount is always zero, because that's what BLATv34 does
  15.170 -    standardBases = "ACGTU"
  15.171 -    matches = mismatches = repMatches = nCount = 0
  15.172 -    for i in columns:
  15.173 -        if "-" in i: continue
  15.174 -        x, y = i
  15.175 -        if x in standardBases and y in standardBases or isProtein:
  15.176 -            if x == y: matches += 1
  15.177 -            else: mismatches += 1
  15.178 -        else: nCount += 1
  15.179 -    return matches, mismatches, repMatches, nCount
  15.180 -
  15.181 -def writePsl(maf, isProtein):
  15.182 -    dieUnlessPairwise(maf)
  15.183 -
  15.184 -    alnStrings = map(str.upper, maf.alnStrings)
  15.185 -    alignmentColumns = zip(*alnStrings)
  15.186 -    letterSizes = mafLetterSizes(maf)
  15.187 -
  15.188 -    matchCounts = caseSensitivePairwiseMatchCounts(alignmentColumns, isProtein)
  15.189 -    matches, mismatches, repMatches, nCount = matchCounts
  15.190 -    numGaplessColumns = sum(matchCounts)
  15.191 -
  15.192 -    qNumInsert = gapRunCount(maf.alnStrings[0])
  15.193 -    qBaseInsert = maf.alnSizes[1] - numGaplessColumns * letterSizes[1]
  15.194 -
  15.195 -    tNumInsert = gapRunCount(maf.alnStrings[1])
  15.196 -    tBaseInsert = maf.alnSizes[0] - numGaplessColumns * letterSizes[0]
  15.197 -
  15.198 -    strand = maf.strands[1]
  15.199 -    if max(letterSizes) > 1:
  15.200 -        strand += maf.strands[0]
  15.201 -    elif maf.strands[0] != "+":
  15.202 -        raise Exception("for non-translated PSL, the 1st strand in each alignment must be +")
  15.203 -
  15.204 -    tName, qName = maf.seqNames
  15.205 -    tSeqSize, qSeqSize = maf.seqSizes
  15.206 -
  15.207 -    rows = zip(maf.alnStarts, maf.alnSizes, maf.strands, maf.seqSizes)
  15.208 -    tStart, tEnd = pslEndpoints(*rows[0])
  15.209 -    qStart, qEnd = pslEndpoints(*rows[1])
  15.210 -
  15.211 -    blocks = list(pslBlocks(alignmentColumns, maf.alnStarts, letterSizes))
  15.212 -    blockCount = len(blocks)
  15.213 -    blockSizes, tStarts, qStarts = map(pslCommaString, zip(*blocks))
  15.214 -
  15.215 -    outWords = (matches, mismatches, repMatches, nCount,
  15.216 -                qNumInsert, qBaseInsert, tNumInsert, tBaseInsert, strand,
  15.217 -                qName, qSeqSize, qStart, qEnd, tName, tSeqSize, tStart, tEnd,
  15.218 -                blockCount, blockSizes, qStarts, tStarts)
  15.219 -    print joined(outWords, "\t")
  15.220 -
  15.221 -##### Routines for converting to SAM format: #####
  15.222 -
  15.223 -def readGroupId(readGroupItems):
  15.224 -    for i in readGroupItems:
  15.225 -        if i.startswith("ID:"):
  15.226 -            return i[3:]
  15.227 -    raise Exception("readgroup must include ID")
  15.228 -
  15.229 -def readSequenceLengths(lines):
  15.230 -    """Read name & length of topmost sequence in each maf block."""
  15.231 -    sequenceLengths = {}  # an OrderedDict might be nice
  15.232 -    isSearching = True
  15.233 -    for line in lines:
  15.234 -        if line.isspace(): isSearching = True
  15.235 -        elif isSearching:
  15.236 -            w = line.split()
  15.237 -            if w[0] == "s":
  15.238 -                seqName, seqSize = w[1], w[5]
  15.239 -                sequenceLengths[seqName] = seqSize
  15.240 -                isSearching = False
  15.241 -    return sequenceLengths
  15.242 -
  15.243 -def naturalSortKey(s):
  15.244 -    """Return a key that sorts strings in "natural" order."""
  15.245 -    return [(str, int)[k]("".join(v)) for k, v in groupby(s, str.isdigit)]
  15.246 -
  15.247 -def karyotypicSortKey(s):
  15.248 -    """Attempt to sort chromosomes in GATK's ridiculous order."""
  15.249 -    if s == "chrM": return []
  15.250 -    if s == "MT": return ["~"]
  15.251 -    return naturalSortKey(s)
  15.252 -
  15.253 -def writeSamHeader(sequenceLengths, dictFile, readGroupItems):
  15.254 -    print "@HD\tVN:1.3\tSO:unsorted"
  15.255 -    for k in sorted(sequenceLengths, key=karyotypicSortKey):
  15.256 -        print "@SQ\tSN:%s\tLN:%s" % (k, sequenceLengths[k])
  15.257 -    if dictFile:
  15.258 -        for i in fileinput.input(dictFile):
  15.259 -            if i.startswith("@SQ"): print i,
  15.260 -            elif not i.startswith("@"): break
  15.261 -    if readGroupItems:
  15.262 -        print "@RG\t" + "\t".join(readGroupItems)
  15.263 -
  15.264 -mapqMissing = "255"
  15.265 -mapqMaximum = "254"
  15.266 -mapqMaximumNum = float(mapqMaximum)
  15.267 -
  15.268 -def mapqFromProb(probString):
  15.269 -    try: p = float(probString)
  15.270 -    except ValueError: raise Exception("bad probability: " + probString)
  15.271 -    if p < 0 or p > 1: raise Exception("bad probability: " + probString)
  15.272 -    if p == 0: return mapqMaximum
  15.273 -    phred = -10 * math.log(p, 10)
  15.274 -    if phred >= mapqMaximumNum: return mapqMaximum
  15.275 -    return str(int(round(phred)))
  15.276 -
  15.277 -def cigarCategory(alignmentColumn):
  15.278 -    x, y = alignmentColumn
  15.279 -    if x == "-":
  15.280 -        if y == "-": return "P"
  15.281 -        else: return "I"
  15.282 -    else:
  15.283 -        if y == "-": return "D"
  15.284 -        else: return "M"
  15.285 -
  15.286 -def cigarParts(beg, alignmentColumns, end):
  15.287 -    if beg:
  15.288 -        yield str(beg) + "H"
  15.289 -    # (doesn't handle translated alignments)
  15.290 -    for k, v in groupby(alignmentColumns, cigarCategory):
  15.291 -        yield str(sum(1 for _ in v)) + k
  15.292 -    if end:
  15.293 -        yield str(end) + "H"
  15.294 -
  15.295 -def writeSam(maf, rg):
  15.296 -    if 3 in mafLetterSizes(maf):
  15.297 -        raise Exception("this looks like translated DNA - can't convert to SAM format")
  15.298 -
  15.299 -    try: score = "AS:i:" + str(int(maf.namesAndValues["score"]))
  15.300 -    except (KeyError, ValueError): score = None  # it must be an integer
  15.301 -
  15.302 -    try: evalue = "EV:Z:" + maf.namesAndValues["expect"]
  15.303 -    except KeyError: evalue = None
  15.304 -
  15.305 -    try: mapq = mapqFromProb(maf.namesAndValues["mismap"])
  15.306 -    except KeyError: mapq = mapqMissing
  15.307 -
  15.308 -    rows = zip(maf.seqNames, maf.alnStarts, maf.alnSizes,
  15.309 -               maf.strands, maf.seqSizes, maf.alnStrings)
  15.310 -    head, tail = rows[0], rows[1:]
  15.311 -
  15.312 -    rName, rStart, rAlnSize, rStrand, rSeqSize, rAlnString = head
  15.313 -    if rStrand != "+":
  15.314 -        raise Exception("for SAM, the 1st strand in each alignment must be +")
  15.315 -    pos = str(rStart + 1)  # convert to 1-based coordinate
  15.316 -
  15.317 -    for qName, qStart, qAlnSize, qStrand, qSeqSize, qAlnString in tail:
  15.318 -        alignmentColumns = zip(rAlnString.upper(), qAlnString.upper())
  15.319 -
  15.320 -        qRevStart = qSeqSize - qStart - qAlnSize
  15.321 -        cigar = "".join(cigarParts(qStart, alignmentColumns, qRevStart))
  15.322 -
  15.323 -        seq = deleted(qAlnString, "-")
  15.324 -
  15.325 -        qual = "*"
  15.326 -        if maf.qLines:
  15.327 -            q, qualityName, qualityString = maf.qLines[0]
  15.328 -            # don't try to handle multiple alignments for now (YAGNI):
  15.329 -            if len(maf.qLines) > 1 or len(tail) > 1 or qualityName != qName:
  15.330 -                raise Exception("can't interpret the quality data")
  15.331 -            qual = ''.join(j for i, j in zip(qAlnString, qualityString)
  15.332 -                           if i != "-")
  15.333 -
  15.334 -        # It's hard to get all the pair info, so this is very
  15.335 -        # incomplete, but hopefully good enough.
  15.336 -        # I'm not sure whether to add 2 and/or 8 to flag.
  15.337 -        if qName.endswith("/1"):
  15.338 -            qName = qName[:-2]
  15.339 -            if qStrand == "+": flag = "99"  # 1 + 2 + 32 + 64
  15.340 -            else:              flag = "83"  # 1 + 2 + 16 + 64
  15.341 -        elif qName.endswith("/2"):
  15.342 -            qName = qName[:-2]
  15.343 -            if qStrand == "+": flag = "163"  # 1 + 2 + 32 + 128
  15.344 -            else:              flag = "147"  # 1 + 2 + 16 + 128
  15.345 -        else:
  15.346 -            if qStrand == "+": flag = "0"
  15.347 -            else:              flag = "16"
  15.348 -
  15.349 -        editDistance = sum(1 for x, y in alignmentColumns if x != y)
  15.350 -        # no special treatment of ambiguous bases: might be a minor bug
  15.351 -        editDistance = "NM:i:" + str(editDistance)
  15.352 -
  15.353 -        outWords = [qName, flag, rName, pos, mapq, cigar, "*\t0\t0", seq, qual]
  15.354 -        outWords.append(editDistance)
  15.355 -        if score: outWords.append(score)
  15.356 -        if evalue: outWords.append(evalue)
  15.357 -        if rg: outWords.append(rg)
  15.358 -        print "\t".join(outWords)
  15.359 -
  15.360 -##### Routines for converting to BLAST-like format: #####
  15.361 -
  15.362 -def pairwiseMatchSymbol(alignmentColumn):
  15.363 -    if isMatch(alignmentColumn): return "|"
  15.364 -    else: return " "
  15.365 -
  15.366 -def strandText(strand):
  15.367 -    if strand == "+": return "Plus"
  15.368 -    else: return "Minus"
  15.369 -
  15.370 -def blastCoordinate(oneBasedCoordinate, strand, seqSize):
  15.371 -    if strand == "-":
  15.372 -        oneBasedCoordinate = seqSize - oneBasedCoordinate + 1
  15.373 -    return str(oneBasedCoordinate)
  15.374 -
  15.375 -def chunker(things, chunkSize):
  15.376 -    for i in range(0, len(things), chunkSize):
  15.377 -        yield things[i:i+chunkSize]
  15.378 -
  15.379 -def blastChunker(maf, lineSize, alignmentColumns):
  15.380 -    letterSizes = mafLetterSizes(maf)
  15.381 -    coords = maf.alnStarts
  15.382 -    for chunkCols in chunker(alignmentColumns, lineSize):
  15.383 -        chunkRows = zip(*chunkCols)
  15.384 -        chunkRows = map(''.join, chunkRows)
  15.385 -        starts = [i + 1 for i in coords]  # change to 1-based coordinates
  15.386 -        starts = map(blastCoordinate, starts, maf.strands, maf.seqSizes)
  15.387 -        increments = map(insertionSize, chunkRows, letterSizes)
  15.388 -        coords = map(operator.add, coords, increments)
  15.389 -        ends = map(blastCoordinate, coords, maf.strands, maf.seqSizes)
  15.390 -        yield chunkCols, chunkRows, starts, ends
  15.391 -
  15.392 -def writeBlast(maf, lineSize, oldQueryName):
  15.393 -    dieUnlessPairwise(maf)
  15.394 -
  15.395 -    if maf.seqNames[1] != oldQueryName:
  15.396 -        print "Query= %s" % maf.seqNames[1]
  15.397 -        print "         (%s letters)" % maf.seqSizes[1]
  15.398 -        print
  15.399 -
  15.400 -    print ">%s" % maf.seqNames[0]
  15.401 -    print "          Length = %s" % maf.seqSizes[0]
  15.402 -    print
  15.403 -
  15.404 -    scoreLine = " Score = %s" % maf.namesAndValues["score"]
  15.405 -    try: scoreLine += ", Expect = %s" % maf.namesAndValues["expect"]
  15.406 -    except KeyError: pass
  15.407 -    print scoreLine
  15.408 -
  15.409 -    alignmentColumns = zip(*maf.alnStrings)
  15.410 -
  15.411 -    alnSize = len(alignmentColumns)
  15.412 -    matches = quantify(alignmentColumns, isMatch)
  15.413 -    matchPercent = 100 * matches // alnSize  # round down, like BLAST
  15.414 -    identLine = " Identities = %s/%s (%s%%)" % (matches, alnSize, matchPercent)
  15.415 -    gaps = alnSize - quantify(alignmentColumns, isGapless)
  15.416 -    gapPercent = 100 * gaps // alnSize  # round down, like BLAST
  15.417 -    if gaps: identLine += ", Gaps = %s/%s (%s%%)" % (gaps, alnSize, gapPercent)
  15.418 -    print identLine
  15.419 -
  15.420 -    strands = map(strandText, maf.strands)
  15.421 -    print " Strand = %s / %s" % (strands[1], strands[0])
  15.422 -    print
  15.423 -
  15.424 -    for chunk in blastChunker(maf, lineSize, alignmentColumns):
  15.425 -        cols, rows, starts, ends = chunk
  15.426 -        startWidth = maxlen(starts)
  15.427 -        matchSymbols = map(pairwiseMatchSymbol, cols)
  15.428 -        matchSymbols = ''.join(matchSymbols)
  15.429 -        print "Query: %-*s %s %s" % (startWidth, starts[1], rows[1], ends[1])
  15.430 -        print "       %-*s %s"    % (startWidth, " ", matchSymbols)
  15.431 -        print "Sbjct: %-*s %s %s" % (startWidth, starts[0], rows[0], ends[0])
  15.432 -        print
  15.433 -
  15.434 -##### Routines for converting to HTML format: #####
  15.435 -
  15.436 -def writeHtmlHeader():
  15.437 -    print '''
  15.438 -<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
  15.439 - "http://www.w3.org/TR/html4/strict.dtd">
  15.440 -<html lang="en"><head>
  15.441 -<meta http-equiv="Content-type" content="text/html; charset=UTF-8">
  15.442 -<title>Reliable Alignments</title>
  15.443 -<style type="text/css">
  15.444 -/* Try to force monospace, working around browser insanity: */
  15.445 -pre {font-family: "Courier New", monospace, serif; font-size: 0.8125em}
  15.446 -.a {background-color: #3333FF}
  15.447 -.b {background-color: #9933FF}
  15.448 -.c {background-color: #FF66CC}
  15.449 -.d {background-color: #FF3333}
  15.450 -.e {background-color: #FF9933}
  15.451 -.f {background-color: #FFFF00}
  15.452 -.key {display:inline; margin-right:2em}
  15.453 -</style>
  15.454 -</head><body>
  15.455 -
  15.456 -<div style="line-height:1">
  15.457 -<pre class="key"><span class="a">  </span> prob &gt; 0.999</pre>
  15.458 -<pre class="key"><span class="b">  </span> prob &gt; 0.99 </pre>
  15.459 -<pre class="key"><span class="c">  </span> prob &gt; 0.95 </pre>
  15.460 -<pre class="key"><span class="d">  </span> prob &gt; 0.9  </pre>
  15.461 -<pre class="key"><span class="e">  </span> prob &gt; 0.5  </pre>
  15.462 -<pre class="key"><span class="f">  </span> prob &le; 0.5  </pre>
  15.463 -</div>
  15.464 -'''
  15.465 -
  15.466 -def probabilityClass(probabilityColumn):
  15.467 -    p = ord(min(probabilityColumn)) - 33
  15.468 -    if   p >= 30: return 'a'
  15.469 -    elif p >= 20: return 'b'
  15.470 -    elif p >= 13: return 'c'
  15.471 -    elif p >= 10: return 'd'
  15.472 -    elif p >=  3: return 'e'
  15.473 -    else: return 'f'
  15.474 -
  15.475 -def identicalRuns(s):
  15.476 -    """Yield (item, start, end) for each run of identical items in s."""
  15.477 -    beg = 0
  15.478 -    for k, v in groupby(s):
  15.479 -        end = beg + len(list(v))
  15.480 -        yield k, beg, end
  15.481 -        beg = end
  15.482 -
  15.483 -def htmlSpan(text, classRun):
  15.484 -    key, beg, end = classRun
  15.485 -    textbit = text[beg:end]
  15.486 -    if key: return '<span class="%s">%s</span>' % (key, textbit)
  15.487 -    else: return textbit
  15.488 -
  15.489 -def multipleMatchSymbol(alignmentColumn):
  15.490 -    if isMatch(alignmentColumn): return "*"
  15.491 -    else: return " "
  15.492 -
  15.493 -def writeHtml(maf, lineSize):
  15.494 -    scoreLine = "Alignment"
  15.495 -    try:
  15.496 -        scoreLine += " score=%s" % maf.namesAndValues["score"]
  15.497 -        scoreLine += ", expect=%s" % maf.namesAndValues["expect"]
  15.498 -    except KeyError: pass
  15.499 -    print "<h3>%s:</h3>" % scoreLine
  15.500 -
  15.501 -    if maf.pLines:
  15.502 -        probRows = [i[1] for i in maf.pLines]
  15.503 -        probCols = izip(*probRows)
  15.504 -        classes = imap(probabilityClass, probCols)
  15.505 -    else:
  15.506 -        classes = repeat(None)
  15.507 -
  15.508 -    nameWidth = maxlen(maf.seqNames)
  15.509 -    alignmentColumns = zip(*maf.alnStrings)
  15.510 -
  15.511 -    print '<pre>'
  15.512 -    for chunk in blastChunker(maf, lineSize, alignmentColumns):
  15.513 -        cols, rows, starts, ends = chunk
  15.514 -        startWidth = maxlen(starts)
  15.515 -        endWidth = maxlen(ends)
  15.516 -        matchSymbols = map(multipleMatchSymbol, cols)
  15.517 -        matchSymbols = ''.join(matchSymbols)
  15.518 -        classChunk = islice(classes, lineSize)
  15.519 -        classRuns = list(identicalRuns(classChunk))
  15.520 -        for n, s, r, e in zip(maf.seqNames, starts, rows, ends):
  15.521 -            spans = [htmlSpan(r, i) for i in classRuns]
  15.522 -            spans = ''.join(spans)
  15.523 -            formatParams = nameWidth, n, startWidth, s, spans, endWidth, e
  15.524 -            print '%-*s %*s %s %*s' % formatParams
  15.525 -        print ' ' * nameWidth, ' ' * startWidth, matchSymbols
  15.526 -        print
  15.527 -    print '</pre>'
  15.528 -
  15.529 -##### Routines for reading MAF format: #####
  15.530 -
  15.531 -def mafInput(lines, isKeepComments):
  15.532 -    a = []
  15.533 -    s = []
  15.534 -    q = []
  15.535 -    p = []
  15.536 -    for i in lines:
  15.537 -        w = i.split()
  15.538 -        if not w:
  15.539 -            if a: yield Maf(a, s, q, p)
  15.540 -            a = []
  15.541 -            s = []
  15.542 -            q = []
  15.543 -            p = []
  15.544 -        elif w[0] == "a":
  15.545 -            a = w
  15.546 -        elif w[0] == "s":
  15.547 -            s.append(w)
  15.548 -        elif w[0] == "q":
  15.549 -            q.append(w)
  15.550 -        elif w[0] == "p":
  15.551 -            p.append(w)
  15.552 -        elif i[0] == "#" and isKeepComments:
  15.553 -            print i,
  15.554 -    if a: yield Maf(a, s, q, p)
  15.555 -
  15.556 -def isFormat(myString, myFormat):
  15.557 -    return myFormat.startswith(myString)
  15.558 -
  15.559 -def mafConvert(opts, args):
  15.560 -    format = args[0].lower()
  15.561 -    if isFormat(format, "sam"):
  15.562 -        if opts.dictionary: d = readSequenceLengths(fileinput.input(args[1:]))
  15.563 -        else: d = {}
  15.564 -        if opts.readgroup:
  15.565 -            readGroupItems = opts.readgroup.split()
  15.566 -            rg = "RG:Z:" + readGroupId(readGroupItems)
  15.567 -        else:
  15.568 -            readGroupItems = []
  15.569 -            rg = ""
  15.570 -        if not opts.noheader: writeSamHeader(d, opts.dictfile, readGroupItems)
  15.571 -    inputLines = fileinput.input(args[1:])
  15.572 -    if isFormat(format, "html") and not opts.noheader: writeHtmlHeader()
  15.573 -    isKeepCommentLines = isFormat(format, "tabular") and not opts.noheader
  15.574 -    oldQueryName = ""
  15.575 -    for maf in mafInput(inputLines, isKeepCommentLines):
  15.576 -        if   isFormat(format, "axt"): writeAxt(maf)
  15.577 -        elif isFormat(format, "blast"):
  15.578 -            writeBlast(maf, opts.linesize, oldQueryName)
  15.579 -            oldQueryName = maf.seqNames[1]
  15.580 -        elif isFormat(format, "html"): writeHtml(maf, opts.linesize)
  15.581 -        elif isFormat(format, "psl"): writePsl(maf, opts.protein)
  15.582 -        elif isFormat(format, "sam"): writeSam(maf, rg)
  15.583 -        elif isFormat(format, "tabular"): writeTab(maf)
  15.584 -        else: raise Exception("unknown format: " + format)
  15.585 -    if isFormat(format, "html") and not opts.noheader: print "</body></html>"
  15.586 -
  15.587 -if __name__ == "__main__":
  15.588 -    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  15.589 -
  15.590 -    usage = """
  15.591 -  %prog --help
  15.592 -  %prog axt mafFile(s)
  15.593 -  %prog blast mafFile(s)
  15.594 -  %prog html mafFile(s)
  15.595 -  %prog psl mafFile(s)
  15.596 -  %prog sam mafFile(s)
  15.597 -  %prog tab mafFile(s)"""
  15.598 -
  15.599 -    description = "Read MAF-format alignments & write them in another format."
  15.600 -
  15.601 -    op = optparse.OptionParser(usage=usage, description=description)
  15.602 -    op.add_option("-p", "--protein", action="store_true",
  15.603 -                  help="assume protein alignments, for psl match counts")
  15.604 -    op.add_option("-n", "--noheader", action="store_true",
  15.605 -                  help="omit any header lines from the output")
  15.606 -    op.add_option("-d", "--dictionary", action="store_true",
  15.607 -                  help="include dictionary of sequence lengths in sam format")
  15.608 -    op.add_option("-f", "--dictfile",
  15.609 -                  help="get sequence dictionary from DICTFILE")
  15.610 -    op.add_option("-r", "--readgroup",
  15.611 -                  help="read group info for sam format")
  15.612 -    op.add_option("-l", "--linesize", type="int", default=60, #metavar="CHARS",
  15.613 -                  help="line length for blast and html formats (default: %default)")
  15.614 -    (opts, args) = op.parse_args()
  15.615 -    if opts.linesize <= 0: op.error("option -l: should be >= 1")
  15.616 -    if opts.dictionary and opts.dictfile: op.error("can't use both -d and -f")
  15.617 -    if len(args) < 1: op.error("I need a format-name and some MAF alignments")
  15.618 -    if opts.dictionary and (len(args) == 1 or "-" in args[1:]):
  15.619 -        op.error("need file (not pipe) with option -d")
  15.620 -
  15.621 -    try: mafConvert(opts, args)
  15.622 -    except KeyboardInterrupt: pass  # avoid silly error message
  15.623 -    except Exception, e:
  15.624 -        prog = os.path.basename(sys.argv[0])
  15.625 -        sys.exit(prog + ": error: " + str(e))
    16.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    16.2 +++ b/scripts/maf-cull	Wed Sep 24 13:13:14 2014 +0900
    16.3 @@ -0,0 +1,112 @@
    16.4 +#! /usr/bin/env python
    16.5 +
    16.6 +# Read MAF-format alignments.  Write them, omitting alignments whose
    16.7 +# coordinates in the top-most sequence are contained in those of >=
    16.8 +# cullingLimit higher-scoring alignments.
    16.9 +
   16.10 +# Alignments on opposite strands are not considered to contain each
   16.11 +# other.
   16.12 +
   16.13 +# The alignments must be sorted by sequence name, then strand, then
   16.14 +# start coordinate.
   16.15 +
   16.16 +# This algorithm is not theoretically optimal, but it is simple and
   16.17 +# probably fast in practice.  Optimal algorithms are described in:
   16.18 +# Winnowing sequences from a database search.
   16.19 +# Berman P, Zhang Z, Wolf YI, Koonin EV, Miller W.
   16.20 +# J Comput Biol. 2000 Feb-Apr;7(1-2):293-302.
   16.21 +# (Use a "priority search tree" or an "interval tree".)
   16.22 +
   16.23 +# Seems to work with Python 2.x, x>=4
   16.24 +
   16.25 +import fileinput, itertools, operator, optparse, os, signal, sys
   16.26 +
   16.27 +# The intervals must have size > 0.
   16.28 +
   16.29 +def isFresh(oldInterval, newInterval):
   16.30 +    return oldInterval.end > newInterval.start
   16.31 +
   16.32 +def freshIntervals(storedIntervals, newInterval):
   16.33 +    """Yields those storedIntervals that overlap newInterval."""
   16.34 +    return [i for i in storedIntervals if isFresh(i, newInterval)]
   16.35 +
   16.36 +def isDominated(dog, queen):
   16.37 +    return dog.score < queen.score and dog.end <= queen.end
   16.38 +
   16.39 +def isWanted(newInterval, storedIntervals, cullingLimit):
   16.40 +    """Is newInterval dominated by < cullingLimit storedIntervals?"""
   16.41 +    dominators = (i for i in storedIntervals if isDominated(newInterval, i))
   16.42 +    return len(list(dominators)) < cullingLimit
   16.43 +
   16.44 +# Check that the intervals are sorted by start position, and further
   16.45 +# sort them in descending order of score.
   16.46 +def sortedIntervals(intervals):
   16.47 +    oldStart = ()
   16.48 +    for k, v in itertools.groupby(intervals, operator.attrgetter("start")):
   16.49 +        if k < oldStart: raise Exception("the input is not sorted properly")
   16.50 +        oldStart = k
   16.51 +        for i in sorted(v, key=operator.attrgetter("score"), reverse=True):
   16.52 +            yield i
   16.53 +
   16.54 +def culledIntervals(intervals, cullingLimit):
   16.55 +    """Yield intervals contained in < cullingLimit higher-scoring intervals."""
   16.56 +    storedIntervals = []
   16.57 +    for i in sortedIntervals(intervals):
   16.58 +        storedIntervals = freshIntervals(storedIntervals, i)
   16.59 +        if isWanted(i, storedIntervals, cullingLimit):
   16.60 +            yield i
   16.61 +            storedIntervals.append(i)
   16.62 +
   16.63 +class Maf:
   16.64 +    def __init__(self, lines):
   16.65 +        self.lines = lines
   16.66 +        try:
   16.67 +            aLine = lines[0]
   16.68 +            aWords = aLine.split()
   16.69 +            scoreGen = (i for i in aWords if i.startswith("score="))
   16.70 +            scoreWord = scoreGen.next()
   16.71 +            self.score = float(scoreWord.split("=")[1])
   16.72 +        except: raise Exception("missing score")
   16.73 +        try:
   16.74 +            sLine = lines[1]
   16.75 +            sWords = sLine.split()
   16.76 +            seqName = sWords[1]
   16.77 +            alnStart = int(sWords[2])
   16.78 +            alnSize = int(sWords[3])
   16.79 +            strand = sWords[4]
   16.80 +            self.start = seqName, strand, alnStart
   16.81 +            self.end = seqName, strand, alnStart + alnSize
   16.82 +        except: raise Exception("can't interpret the MAF data")
   16.83 +
   16.84 +def mafInput(lines):
   16.85 +    for k, v in itertools.groupby(lines, str.isspace):
   16.86 +        if not k: yield Maf(list(v))
   16.87 +
   16.88 +def filterComments(lines):
   16.89 +    for i in lines:
   16.90 +        if i.startswith("#"): print i,
   16.91 +        else: yield i
   16.92 +
   16.93 +def mafCull(opts, args):
   16.94 +    inputLines = fileinput.input(args)
   16.95 +    inputMafs = mafInput(filterComments(inputLines))
   16.96 +    for maf in culledIntervals(inputMafs, opts.limit):
   16.97 +        for i in maf.lines: print i,
   16.98 +        print  # blank line after each alignment
   16.99 +
  16.100 +if __name__ == "__main__":
  16.101 +    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  16.102 +
  16.103 +    usage = "%prog [options] my-alignments.maf"
  16.104 +    description = "Cull alignments whose top-sequence coordinates are contained in LIMIT or more higher-scoring alignments."
  16.105 +    op = optparse.OptionParser(usage=usage, description=description)
  16.106 +    op.add_option("-l", "--limit", type="int", default=2,
  16.107 +                  help="culling limit (default: %default)")
  16.108 +    (opts, args) = op.parse_args()
  16.109 +    if opts.limit < 1: op.error("option -l: should be >= 1")
  16.110 +
  16.111 +    try: mafCull(opts, args)
  16.112 +    except KeyboardInterrupt: pass  # avoid silly error message
  16.113 +    except Exception, e:
  16.114 +        prog = os.path.basename(sys.argv[0])
  16.115 +        sys.exit(prog + ": error: " + str(e))
    17.1 --- a/scripts/maf-cull.py	Wed Sep 24 11:41:48 2014 +0900
    17.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
    17.3 @@ -1,112 +0,0 @@
    17.4 -#! /usr/bin/env python
    17.5 -
    17.6 -# Read MAF-format alignments.  Write them, omitting alignments whose
    17.7 -# coordinates in the top-most sequence are contained in those of >=
    17.8 -# cullingLimit higher-scoring alignments.
    17.9 -
   17.10 -# Alignments on opposite strands are not considered to contain each
   17.11 -# other.
   17.12 -
   17.13 -# The alignments must be sorted by sequence name, then strand, then
   17.14 -# start coordinate.
   17.15 -
   17.16 -# This algorithm is not theoretically optimal, but it is simple and
   17.17 -# probably fast in practice.  Optimal algorithms are described in:
   17.18 -# Winnowing sequences from a database search.
   17.19 -# Berman P, Zhang Z, Wolf YI, Koonin EV, Miller W.
   17.20 -# J Comput Biol. 2000 Feb-Apr;7(1-2):293-302.
   17.21 -# (Use a "priority search tree" or an "interval tree".)
   17.22 -
   17.23 -# Seems to work with Python 2.x, x>=4
   17.24 -
   17.25 -import fileinput, itertools, operator, optparse, os, signal, sys
   17.26 -
   17.27 -# The intervals must have size > 0.
   17.28 -
   17.29 -def isFresh(oldInterval, newInterval):
   17.30 -    return oldInterval.end > newInterval.start
   17.31 -
   17.32 -def freshIntervals(storedIntervals, newInterval):
   17.33 -    """Yields those storedIntervals that overlap newInterval."""
   17.34 -    return [i for i in storedIntervals if isFresh(i, newInterval)]
   17.35 -
   17.36 -def isDominated(dog, queen):
   17.37 -    return dog.score < queen.score and dog.end <= queen.end
   17.38 -
   17.39 -def isWanted(newInterval, storedIntervals, cullingLimit):
   17.40 -    """Is newInterval dominated by < cullingLimit storedIntervals?"""
   17.41 -    dominators = (i for i in storedIntervals if isDominated(newInterval, i))
   17.42 -    return len(list(dominators)) < cullingLimit
   17.43 -
   17.44 -# Check that the intervals are sorted by start position, and further
   17.45 -# sort them in descending order of score.
   17.46 -def sortedIntervals(intervals):
   17.47 -    oldStart = ()
   17.48 -    for k, v in itertools.groupby(intervals, operator.attrgetter("start")):
   17.49 -        if k < oldStart: raise Exception("the input is not sorted properly")
   17.50 -        oldStart = k
   17.51 -        for i in sorted(v, key=operator.attrgetter("score"), reverse=True):
   17.52 -            yield i
   17.53 -
   17.54 -def culledIntervals(intervals, cullingLimit):
   17.55 -    """Yield intervals contained in < cullingLimit higher-scoring intervals."""
   17.56 -    storedIntervals = []
   17.57 -    for i in sortedIntervals(intervals):
   17.58 -        storedIntervals = freshIntervals(storedIntervals, i)
   17.59 -        if isWanted(i, storedIntervals, cullingLimit):
   17.60 -            yield i
   17.61 -            storedIntervals.append(i)
   17.62 -
   17.63 -class Maf:
   17.64 -    def __init__(self, lines):
   17.65 -        self.lines = lines
   17.66 -        try:
   17.67 -            aLine = lines[0]
   17.68 -            aWords = aLine.split()
   17.69 -            scoreGen = (i for i in aWords if i.startswith("score="))
   17.70 -            scoreWord = scoreGen.next()
   17.71 -            self.score = float(scoreWord.split("=")[1])
   17.72 -        except: raise Exception("missing score")
   17.73 -        try:
   17.74 -            sLine = lines[1]
   17.75 -            sWords = sLine.split()
   17.76 -            seqName = sWords[1]
   17.77 -            alnStart = int(sWords[2])
   17.78 -            alnSize = int(sWords[3])
   17.79 -            strand = sWords[4]
   17.80 -            self.start = seqName, strand, alnStart
   17.81 -            self.end = seqName, strand, alnStart + alnSize
   17.82 -        except: raise Exception("can't interpret the MAF data")
   17.83 -
   17.84 -def mafInput(lines):
   17.85 -    for k, v in itertools.groupby(lines, str.isspace):
   17.86 -        if not k: yield Maf(list(v))
   17.87 -
   17.88 -def filterComments(lines):
   17.89 -    for i in lines:
   17.90 -        if i.startswith("#"): print i,
   17.91 -        else: yield i
   17.92 -
   17.93 -def mafCull(opts, args):
   17.94 -    inputLines = fileinput.input(args)
   17.95 -    inputMafs = mafInput(filterComments(inputLines))
   17.96 -    for maf in culledIntervals(inputMafs, opts.limit):
   17.97 -        for i in maf.lines: print i,
   17.98 -        print  # blank line after each alignment
   17.99 -
  17.100 -if __name__ == "__main__":
  17.101 -    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  17.102 -
  17.103 -    usage = "%prog [options] my-alignments.maf"
  17.104 -    description = "Cull alignments whose top-sequence coordinates are contained in LIMIT or more higher-scoring alignments."
  17.105 -    op = optparse.OptionParser(usage=usage, description=description)
  17.106 -    op.add_option("-l", "--limit", type="int", default=2,
  17.107 -                  help="culling limit (default: %default)")
  17.108 -    (opts, args) = op.parse_args()
  17.109 -    if opts.limit < 1: op.error("option -l: should be >= 1")
  17.110 -
  17.111 -    try: mafCull(opts, args)
  17.112 -    except KeyboardInterrupt: pass  # avoid silly error message
  17.113 -    except Exception, e:
  17.114 -        prog = os.path.basename(sys.argv[0])
  17.115 -        sys.exit(prog + ": error: " + str(e))
    18.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    18.2 +++ b/scripts/maf-join	Wed Sep 24 13:13:14 2014 +0900
    18.3 @@ -0,0 +1,244 @@
    18.4 +#! /usr/bin/env python
    18.5 +
    18.6 +# Copyright 2009, 2010, 2011 Martin C. Frith
    18.7 +
    18.8 +# Join two or more sets of MAF-format multiple alignments into bigger
    18.9 +# multiple alignments.  The 'join field' is the top genome, which
   18.10 +# should be the same for each input.  Each input should be sorted by
   18.11 +# position in the top genome.
   18.12 +
   18.13 +# WARNING: Alignment columns with a gap in the top genome are joined
   18.14 +# arbitrarily!!!
   18.15 +
   18.16 +import sys, os, fileinput, optparse, signal
   18.17 +
   18.18 +signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # stop spurious error message
   18.19 +
   18.20 +class peekable:  # Adapted from Python Cookbook 2nd edition
   18.21 +    """An iterator that supports a peek operation."""
   18.22 +    def __init__(self, iterable):
   18.23 +        self.it = iter(iterable)
   18.24 +        self.cache = []
   18.25 +    def __iter__(self):
   18.26 +        return self
   18.27 +    def next(self):
   18.28 +        if self.cache: return self.cache.pop()
   18.29 +        else: return self.it.next()
   18.30 +    def peek(self):
   18.31 +        if not self.cache: self.cache.append(self.it.next())
   18.32 +        return self.cache[0]
   18.33 +
   18.34 +def maxLen(things): return max(map(len, things))
   18.35 +
   18.36 +class MafBlock:
   18.37 +    def __init__(self, chr, beg, end, strand, chrSize, seq, prob):
   18.38 +        self.chr = chr  # chromosome names
   18.39 +        self.beg = beg  # alignment begin coordinates
   18.40 +        self.end = end  # alignment end coordinates
   18.41 +        self.strand = strand
   18.42 +        self.chrSize = chrSize  # chromosome sizes
   18.43 +        self.seq = seq  # aligned sequences, including gaps
   18.44 +        self.prob = prob  # probabilities (may be empty)
   18.45 +
   18.46 +    def __nonzero__(self):
   18.47 +        return len(self.seq) > 0
   18.48 +
   18.49 +    def __cmp__(self, other):
   18.50 +        return cmp(self.chr[:1] + self.beg[:1], other.chr[:1] + other.beg[:1])
   18.51 +
   18.52 +    def before(self, other):
   18.53 +        return (self.chr[0], self.end[0]) <= (other.chr[0], other.beg[0])
   18.54 +
   18.55 +    def after(self, other):
   18.56 +        return (self.chr[0], self.beg[0]) >= (other.chr[0], other.end[0])
   18.57 +
   18.58 +    def addLine(self, line):
   18.59 +        words = line.split()
   18.60 +        if line.startswith('s'):
   18.61 +            self.chr.append(words[1])
   18.62 +            self.beg.append(int(words[2]))
   18.63 +            self.end.append(int(words[2]) + int(words[3]))
   18.64 +            self.strand.append(words[4])
   18.65 +            self.chrSize.append(words[5])
   18.66 +            self.seq.append(list(words[6]))
   18.67 +        elif line.startswith('p'):
   18.68 +            self.prob.append(words[1])
   18.69 +
   18.70 +    def write(self):
   18.71 +        beg = map(str, self.beg)
   18.72 +        size = [str(e-b) for b, e in zip(self.beg, self.end)]
   18.73 +        seq = [''.join(i) for i in self.seq]
   18.74 +        columns = self.chr, beg, size, self.strand, self.chrSize, seq
   18.75 +        widths = map(maxLen, columns)
   18.76 +        print 'a'
   18.77 +        for row in zip(*columns):
   18.78 +            widthsAndFields = zip(widths, row)
   18.79 +            field0 = "%-*s" % widthsAndFields[0]  # left-justify
   18.80 +            fields = ["%*s" % i for i in widthsAndFields[1:]]  # right-justify
   18.81 +            print 's', field0, ' '.join(fields)
   18.82 +        pad = ' '.join(' ' * i for i in widths[:-1])
   18.83 +        for i in self.prob:
   18.84 +            print 'p', pad, i
   18.85 +        print  # blank line afterwards
   18.86 +
   18.87 +def topSeqBeg(maf): return maf.beg[0]
   18.88 +def topSeqEnd(maf): return maf.end[0]
   18.89 +def emptyMaf(): return MafBlock([], [], [], [], [], [], [])
   18.90 +
   18.91 +def joinOnFirstItem(x, y):
   18.92 +    if x[0] != y[0]:
   18.93 +        raise ValueError('join fields not equal:\n'+str(x[0])+'\n'+str(y[0]))
   18.94 +    return x + y[1:]
   18.95 +
   18.96 +def mafEasyJoin(x, y):
   18.97 +    '''Join two MAF blocks on the top sequence.'''
   18.98 +    xJoin = zip(x.chr, x.beg, x.end, x.strand, x.chrSize, x.seq)
   18.99 +    yJoin = zip(y.chr, y.beg, y.end, y.strand, y.chrSize, y.seq)
  18.100 +    joined = joinOnFirstItem(xJoin, yJoin)
  18.101 +    chr, beg, end, strand, chrSize, seq = zip(*joined)
  18.102 +    prob = x.prob + y.prob
  18.103 +    return MafBlock(chr, beg, end, strand, chrSize, seq, prob)
  18.104 +
  18.105 +def countNonGaps(s): return len(s) - s.count('-')
  18.106 +
  18.107 +def nthNonGap(s, n):
  18.108 +    '''Get the start position of the n-th non-gap.'''
  18.109 +    for i, x in enumerate(s):
  18.110 +        if x != '-':
  18.111 +            if n == 0: return i
  18.112 +            n -= 1
  18.113 +    raise ValueError('non-gap not found')
  18.114 +
  18.115 +def nthLastNonGap(s, n):
  18.116 +    '''Get the end position of the n-th last non-gap.'''
  18.117 +    return len(s) - nthNonGap(s[::-1], n)
  18.118 +
  18.119 +def mafSlice(maf, alnBeg, alnEnd):
  18.120 +    '''Return a slice of a MAF block, using coordinates in the alignment.'''
  18.121 +    beg = [b + countNonGaps(s[:alnBeg]) for b, s in zip(maf.beg, maf.seq)]
  18.122 +    end = [e - countNonGaps(s[alnEnd:]) for e, s in zip(maf.end, maf.seq)]
  18.123 +    seq = [i[alnBeg:alnEnd] for i in maf.seq]
  18.124 +    prob = [i[alnBeg:alnEnd] for i in maf.prob]
  18.125 +    return MafBlock(maf.chr, beg, end, maf.strand, maf.chrSize, seq, prob)
  18.126 +
  18.127 +def mafSliceTopSeq(maf, newTopSeqBeg, newTopSeqEnd):
  18.128 +    '''Return a slice of a MAF block, using coordinates in the top sequence.'''
  18.129 +    lettersFromBeg = newTopSeqBeg - topSeqBeg(maf)
  18.130 +    lettersFromEnd = topSeqEnd(maf) - newTopSeqEnd
  18.131 +    alnBeg = nthNonGap(maf.seq[0], lettersFromBeg)
  18.132 +    alnEnd = nthLastNonGap(maf.seq[0], lettersFromEnd)
  18.133 +    return mafSlice(maf, alnBeg, alnEnd)
  18.134 +
  18.135 +def jumpGaps(sequence, index):
  18.136 +    '''Return the next index of the sequence where there is a non-gap.'''
  18.137 +    nextIndex = index
  18.138 +    while sequence[nextIndex] == '-': nextIndex += 1
  18.139 +    return nextIndex
  18.140 +
  18.141 +def gapsToAdd(sequences):
  18.142 +    '''Return new gaps and their positions, needed to align the non-gaps.'''
  18.143 +    gapInfo = [[] for i in sequences]
  18.144 +    gapBeg = [0 for i in sequences]
  18.145 +    try:
  18.146 +        while True:
  18.147 +            gapEnd = [jumpGaps(s, p) for s, p in zip(sequences, gapBeg)]
  18.148 +            gapSize = [e-b for b, e in zip(gapBeg, gapEnd)]
  18.149 +            maxGapSize = max(gapSize)
  18.150 +            for s, e, i in zip(gapSize, gapEnd, gapInfo):
  18.151 +                if s < maxGapSize:
  18.152 +                    newGap = maxGapSize - s
  18.153 +                    i.append((newGap, e))
  18.154 +            gapBeg = [e+1 for e in gapEnd]
  18.155 +    except IndexError: return gapInfo
  18.156 +
  18.157 +def chunksAndGaps(s, gapsAndPositions, oneGap):
  18.158 +    '''Yield chunks of "s" interspersed with gaps at given positions.'''
  18.159 +    oldPosition = 0
  18.160 +    for gapLen, position in gapsAndPositions:
  18.161 +        yield s[oldPosition:position]
  18.162 +        yield oneGap * gapLen
  18.163 +        oldPosition = position
  18.164 +    yield s[oldPosition:]
  18.165 +
  18.166 +def mafAddGaps(maf, gapsAndPositions):
  18.167 +    '''Add the given gaps at the given positions to a MAF block.'''
  18.168 +    maf.seq = [sum(chunksAndGaps(i, gapsAndPositions, ['-']), [])
  18.169 +               for i in maf.seq]
  18.170 +    maf.prob = [''.join(chunksAndGaps(i, gapsAndPositions, '~'))
  18.171 +                for i in maf.prob]
  18.172 +
  18.173 +def mafJoin(mafs):
  18.174 +    '''Intersect and join overlapping MAF blocks.'''
  18.175 +    newTopSeqBeg = max(map(topSeqBeg, mafs))
  18.176 +    newTopSeqEnd = min(map(topSeqEnd, mafs))
  18.177 +    mafs = [mafSliceTopSeq(i, newTopSeqBeg, newTopSeqEnd) for i in mafs]
  18.178 +    topSeqs = [i.seq[0] for i in mafs]
  18.179 +    gapInfo = gapsToAdd(topSeqs)
  18.180 +    for maf, gapsAndPositions in zip(mafs, gapInfo):
  18.181 +        mafAddGaps(maf, gapsAndPositions)
  18.182 +    return reduce(mafEasyJoin, mafs)
  18.183 +
  18.184 +def mafInput(lines):
  18.185 +    '''Read lines and yield MAF blocks.'''
  18.186 +    maf = emptyMaf()
  18.187 +    for line in lines:
  18.188 +        if line.isspace():
  18.189 +            if maf: yield maf
  18.190 +            maf = emptyMaf()
  18.191 +        else:
  18.192 +            maf.addLine(line)
  18.193 +    if maf: yield maf
  18.194 +
  18.195 +def sortedMafInput(lines):
  18.196 +    '''Read lines and yield MAF blocks, checking that they are in order.'''
  18.197 +    old = emptyMaf()
  18.198 +    for maf in mafInput(lines):
  18.199 +        if maf < old: sys.exit(progName + ": MAF blocks not sorted properly")
  18.200 +        yield maf
  18.201 +        old = maf
  18.202 +
  18.203 +def allOverlaps(sequences, beg, end):
  18.204 +    '''Yield all combinations of MAF blocks that overlap in the top genome.'''
  18.205 +    assert beg < end
  18.206 +    if not sequences:
  18.207 +        yield ()
  18.208 +        return
  18.209 +    for i in sequences[0]:
  18.210 +        if topSeqEnd(i) <= beg: continue
  18.211 +        if topSeqBeg(i) >= end: break  # assumes they're sorted by position
  18.212 +        newBeg = max(beg, topSeqBeg(i))
  18.213 +        newEnd = min(end, topSeqEnd(i))
  18.214 +        for j in allOverlaps(sequences[1:], newBeg, newEnd):
  18.215 +            yield (i,) + j
  18.216 +
  18.217 +def nextWindow(window, input, referenceMaf):
  18.218 +    '''Yield "relevant" MAF blocks, based on overlap with referenceMaf.'''
  18.219 +    for maf in window:
  18.220 +        if not maf.before(referenceMaf): yield maf
  18.221 +    try:
  18.222 +        while True:
  18.223 +            maf = input.peek()
  18.224 +            if maf.after(referenceMaf): break
  18.225 +            maf = input.next()
  18.226 +            if not maf.before(referenceMaf): yield maf
  18.227 +    except StopIteration: pass
  18.228 +
  18.229 +def overlappingMafs(sortedMafInputs):
  18.230 +    '''Yield all combinations of MAF blocks that overlap in the top genome.'''
  18.231 +    if not sortedMafInputs: return
  18.232 +    head, tail = sortedMafInputs[0], sortedMafInputs[1:]
  18.233 +    windows = [[] for t in tail]
  18.234 +    for h in head:  # iterate over MAF blocks in the first input
  18.235 +        windows = [list(nextWindow(w, t, h)) for w, t in zip(windows, tail)]
  18.236 +        for i in allOverlaps(windows, topSeqBeg(h), topSeqEnd(h)):
  18.237 +            yield (h,) + i
  18.238 +
  18.239 +op = optparse.OptionParser(usage="%prog sorted-file1.maf sorted-file2.maf ...")
  18.240 +(opts, args) = op.parse_args()
  18.241 +
  18.242 +progName = os.path.basename(sys.argv[0])
  18.243 +
  18.244 +inputs = [peekable(sortedMafInput(fileinput.input(i))) for i in args]
  18.245 +
  18.246 +for mafs in overlappingMafs(inputs):
  18.247 +    mafJoin(mafs).write()
    19.1 --- a/scripts/maf-join.py	Wed Sep 24 11:41:48 2014 +0900
    19.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
    19.3 @@ -1,244 +0,0 @@
    19.4 -#! /usr/bin/env python
    19.5 -
    19.6 -# Copyright 2009, 2010, 2011 Martin C. Frith
    19.7 -
    19.8 -# Join two or more sets of MAF-format multiple alignments into bigger
    19.9 -# multiple alignments.  The 'join field' is the top genome, which
   19.10 -# should be the same for each input.  Each input should be sorted by
   19.11 -# position in the top genome.
   19.12 -
   19.13 -# WARNING: Alignment columns with a gap in the top genome are joined
   19.14 -# arbitrarily!!!
   19.15 -
   19.16 -import sys, os, fileinput, optparse, signal
   19.17 -
   19.18 -signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # stop spurious error message
   19.19 -
   19.20 -class peekable:  # Adapted from Python Cookbook 2nd edition
   19.21 -    """An iterator that supports a peek operation."""
   19.22 -    def __init__(self, iterable):
   19.23 -        self.it = iter(iterable)
   19.24 -        self.cache = []
   19.25 -    def __iter__(self):
   19.26 -        return self
   19.27 -    def next(self):
   19.28 -        if self.cache: return self.cache.pop()
   19.29 -        else: return self.it.next()
   19.30 -    def peek(self):
   19.31 -        if not self.cache: self.cache.append(self.it.next())
   19.32 -        return self.cache[0]
   19.33 -
   19.34 -def maxLen(things): return max(map(len, things))
   19.35 -
   19.36 -class MafBlock:
   19.37 -    def __init__(self, chr, beg, end, strand, chrSize, seq, prob):
   19.38 -        self.chr = chr  # chromosome names
   19.39 -        self.beg = beg  # alignment begin coordinates
   19.40 -        self.end = end  # alignment end coordinates
   19.41 -        self.strand = strand
   19.42 -        self.chrSize = chrSize  # chromosome sizes
   19.43 -        self.seq = seq  # aligned sequences, including gaps
   19.44 -        self.prob = prob  # probabilities (may be empty)
   19.45 -
   19.46 -    def __nonzero__(self):
   19.47 -        return len(self.seq) > 0
   19.48 -
   19.49 -    def __cmp__(self, other):
   19.50 -        return cmp(self.chr[:1] + self.beg[:1], other.chr[:1] + other.beg[:1])
   19.51 -
   19.52 -    def before(self, other):
   19.53 -        return (self.chr[0], self.end[0]) <= (other.chr[0], other.beg[0])
   19.54 -
   19.55 -    def after(self, other):
   19.56 -        return (self.chr[0], self.beg[0]) >= (other.chr[0], other.end[0])
   19.57 -
   19.58 -    def addLine(self, line):
   19.59 -        words = line.split()
   19.60 -        if line.startswith('s'):
   19.61 -            self.chr.append(words[1])
   19.62 -            self.beg.append(int(words[2]))
   19.63 -            self.end.append(int(words[2]) + int(words[3]))
   19.64 -            self.strand.append(words[4])
   19.65 -            self.chrSize.append(words[5])
   19.66 -            self.seq.append(list(words[6]))
   19.67 -        elif line.startswith('p'):
   19.68 -            self.prob.append(words[1])
   19.69 -
   19.70 -    def write(self):
   19.71 -        beg = map(str, self.beg)
   19.72 -        size = [str(e-b) for b, e in zip(self.beg, self.end)]
   19.73 -        seq = [''.join(i) for i in self.seq]
   19.74 -        columns = self.chr, beg, size, self.strand, self.chrSize, seq
   19.75 -        widths = map(maxLen, columns)
   19.76 -        print 'a'
   19.77 -        for row in zip(*columns):
   19.78 -            widthsAndFields = zip(widths, row)
   19.79 -            field0 = "%-*s" % widthsAndFields[0]  # left-justify
   19.80 -            fields = ["%*s" % i for i in widthsAndFields[1:]]  # right-justify
   19.81 -            print 's', field0, ' '.join(fields)
   19.82 -        pad = ' '.join(' ' * i for i in widths[:-1])
   19.83 -        for i in self.prob:
   19.84 -            print 'p', pad, i
   19.85 -        print  # blank line afterwards
   19.86 -
   19.87 -def topSeqBeg(maf): return maf.beg[0]
   19.88 -def topSeqEnd(maf): return maf.end[0]
   19.89 -def emptyMaf(): return MafBlock([], [], [], [], [], [], [])
   19.90 -
   19.91 -def joinOnFirstItem(x, y):
   19.92 -    if x[0] != y[0]:
   19.93 -        raise ValueError('join fields not equal:\n'+str(x[0])+'\n'+str(y[0]))
   19.94 -    return x + y[1:]
   19.95 -
   19.96 -def mafEasyJoin(x, y):
   19.97 -    '''Join two MAF blocks on the top sequence.'''
   19.98 -    xJoin = zip(x.chr, x.beg, x.end, x.strand, x.chrSize, x.seq)
   19.99 -    yJoin = zip(y.chr, y.beg, y.end, y.strand, y.chrSize, y.seq)
  19.100 -    joined = joinOnFirstItem(xJoin, yJoin)
  19.101 -    chr, beg, end, strand, chrSize, seq = zip(*joined)
  19.102 -    prob = x.prob + y.prob
  19.103 -    return MafBlock(chr, beg, end, strand, chrSize, seq, prob)
  19.104 -
  19.105 -def countNonGaps(s): return len(s) - s.count('-')
  19.106 -
  19.107 -def nthNonGap(s, n):
  19.108 -    '''Get the start position of the n-th non-gap.'''
  19.109 -    for i, x in enumerate(s):
  19.110 -        if x != '-':
  19.111 -            if n == 0: return i
  19.112 -            n -= 1
  19.113 -    raise ValueError('non-gap not found')
  19.114 -
  19.115 -def nthLastNonGap(s, n):
  19.116 -    '''Get the end position of the n-th last non-gap.'''
  19.117 -    return len(s) - nthNonGap(s[::-1], n)
  19.118 -
  19.119 -def mafSlice(maf, alnBeg, alnEnd):
  19.120 -    '''Return a slice of a MAF block, using coordinates in the alignment.'''
  19.121 -    beg = [b + countNonGaps(s[:alnBeg]) for b, s in zip(maf.beg, maf.seq)]
  19.122 -    end = [e - countNonGaps(s[alnEnd:]) for e, s in zip(maf.end, maf.seq)]
  19.123 -    seq = [i[alnBeg:alnEnd] for i in maf.seq]
  19.124 -    prob = [i[alnBeg:alnEnd] for i in maf.prob]
  19.125 -    return MafBlock(maf.chr, beg, end, maf.strand, maf.chrSize, seq, prob)
  19.126 -
  19.127 -def mafSliceTopSeq(maf, newTopSeqBeg, newTopSeqEnd):
  19.128 -    '''Return a slice of a MAF block, using coordinates in the top sequence.'''
  19.129 -    lettersFromBeg = newTopSeqBeg - topSeqBeg(maf)
  19.130 -    lettersFromEnd = topSeqEnd(maf) - newTopSeqEnd
  19.131 -    alnBeg = nthNonGap(maf.seq[0], lettersFromBeg)
  19.132 -    alnEnd = nthLastNonGap(maf.seq[0], lettersFromEnd)
  19.133 -    return mafSlice(maf, alnBeg, alnEnd)
  19.134 -
  19.135 -def jumpGaps(sequence, index):
  19.136 -    '''Return the next index of the sequence where there is a non-gap.'''
  19.137 -    nextIndex = index
  19.138 -    while sequence[nextIndex] == '-': nextIndex += 1
  19.139 -    return nextIndex
  19.140 -
  19.141 -def gapsToAdd(sequences):
  19.142 -    '''Return new gaps and their positions, needed to align the non-gaps.'''
  19.143 -    gapInfo = [[] for i in sequences]
  19.144 -    gapBeg = [0 for i in sequences]
  19.145 -    try:
  19.146 -        while True:
  19.147 -            gapEnd = [jumpGaps(s, p) for s, p in zip(sequences, gapBeg)]
  19.148 -            gapSize = [e-b for b, e in zip(gapBeg, gapEnd)]
  19.149 -            maxGapSize = max(gapSize)
  19.150 -            for s, e, i in zip(gapSize, gapEnd, gapInfo):
  19.151 -                if s < maxGapSize:
  19.152 -                    newGap = maxGapSize - s
  19.153 -                    i.append((newGap, e))
  19.154 -            gapBeg = [e+1 for e in gapEnd]
  19.155 -    except IndexError: return gapInfo
  19.156 -
  19.157 -def chunksAndGaps(s, gapsAndPositions, oneGap):
  19.158 -    '''Yield chunks of "s" interspersed with gaps at given positions.'''
  19.159 -    oldPosition = 0
  19.160 -    for gapLen, position in gapsAndPositions:
  19.161 -        yield s[oldPosition:position]
  19.162 -        yield oneGap * gapLen
  19.163 -        oldPosition = position
  19.164 -    yield s[oldPosition:]
  19.165 -
  19.166 -def mafAddGaps(maf, gapsAndPositions):
  19.167 -    '''Add the given gaps at the given positions to a MAF block.'''
  19.168 -    maf.seq = [sum(chunksAndGaps(i, gapsAndPositions, ['-']), [])
  19.169 -               for i in maf.seq]
  19.170 -    maf.prob = [''.join(chunksAndGaps(i, gapsAndPositions, '~'))
  19.171 -                for i in maf.prob]
  19.172 -
  19.173 -def mafJoin(mafs):
  19.174 -    '''Intersect and join overlapping MAF blocks.'''
  19.175 -    newTopSeqBeg = max(map(topSeqBeg, mafs))
  19.176 -    newTopSeqEnd = min(map(topSeqEnd, mafs))
  19.177 -    mafs = [mafSliceTopSeq(i, newTopSeqBeg, newTopSeqEnd) for i in mafs]
  19.178 -    topSeqs = [i.seq[0] for i in mafs]
  19.179 -    gapInfo = gapsToAdd(topSeqs)
  19.180 -    for maf, gapsAndPositions in zip(mafs, gapInfo):
  19.181 -        mafAddGaps(maf, gapsAndPositions)
  19.182 -    return reduce(mafEasyJoin, mafs)
  19.183 -
  19.184 -def mafInput(lines):
  19.185 -    '''Read lines and yield MAF blocks.'''
  19.186 -    maf = emptyMaf()
  19.187 -    for line in lines:
  19.188 -        if line.isspace():
  19.189 -            if maf: yield maf
  19.190 -            maf = emptyMaf()
  19.191 -        else:
  19.192 -            maf.addLine(line)
  19.193 -    if maf: yield maf
  19.194 -
  19.195 -def sortedMafInput(lines):
  19.196 -    '''Read lines and yield MAF blocks, checking that they are in order.'''
  19.197 -    old = emptyMaf()
  19.198 -    for maf in mafInput(lines):
  19.199 -        if maf < old: sys.exit(progName + ": MAF blocks not sorted properly")
  19.200 -        yield maf
  19.201 -        old = maf
  19.202 -
  19.203 -def allOverlaps(sequences, beg, end):
  19.204 -    '''Yield all combinations of MAF blocks that overlap in the top genome.'''
  19.205 -    assert beg < end
  19.206 -    if not sequences:
  19.207 -        yield ()
  19.208 -        return
  19.209 -    for i in sequences[0]:
  19.210 -        if topSeqEnd(i) <= beg: continue
  19.211 -        if topSeqBeg(i) >= end: break  # assumes they're sorted by position
  19.212 -        newBeg = max(beg, topSeqBeg(i))
  19.213 -        newEnd = min(end, topSeqEnd(i))
  19.214 -        for j in allOverlaps(sequences[1:], newBeg, newEnd):
  19.215 -            yield (i,) + j
  19.216 -
  19.217 -def nextWindow(window, input, referenceMaf):
  19.218 -    '''Yield "relevant" MAF blocks, based on overlap with referenceMaf.'''
  19.219 -    for maf in window:
  19.220 -        if not maf.before(referenceMaf): yield maf
  19.221 -    try:
  19.222 -        while True:
  19.223 -            maf = input.peek()
  19.224 -            if maf.after(referenceMaf): break
  19.225 -            maf = input.next()
  19.226 -            if not maf.before(referenceMaf): yield maf
  19.227 -    except StopIteration: pass
  19.228 -
  19.229 -def overlappingMafs(sortedMafInputs):
  19.230 -    '''Yield all combinations of MAF blocks that overlap in the top genome.'''
  19.231 -    if not sortedMafInputs: return
  19.232 -    head, tail = sortedMafInputs[0], sortedMafInputs[1:]
  19.233 -    windows = [[] for t in tail]
  19.234 -    for h in head:  # iterate over MAF blocks in the first input
  19.235 -        windows = [list(nextWindow(w, t, h)) for w, t in zip(windows, tail)]
  19.236 -        for i in allOverlaps(windows, topSeqBeg(h), topSeqEnd(h)):
  19.237 -            yield (h,) + i
  19.238 -
  19.239 -op = optparse.OptionParser(usage="%prog sorted-file1.maf sorted-file2.maf ...")
  19.240 -(opts, args) = op.parse_args()
  19.241 -
  19.242 -progName = os.path.basename(sys.argv[0])
  19.243 -
  19.244 -inputs = [peekable(sortedMafInput(fileinput.input(i))) for i in args]
  19.245 -
  19.246 -for mafs in overlappingMafs(inputs):
  19.247 -    mafJoin(mafs).write()
    20.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    20.2 +++ b/scripts/maf-sort	Wed Sep 24 13:13:14 2014 +0900
    20.3 @@ -0,0 +1,77 @@
    20.4 +#! /bin/sh
    20.5 +
    20.6 +# Sort MAF-format alignments by sequence name, then strand, then start
    20.7 +# position, then end position, of the top sequence.  Also, merge
    20.8 +# identical alignments.  Comment lines starting with "#" are written
    20.9 +# at the top, in unchanged order.  If option "-d" is specified, then
   20.10 +# alignments that appear only once are omitted (like uniq -d).
   20.11 +
   20.12 +# Minor flaws, that do not matter for typical MAF input:
   20.13 +# 1) It might not work if the input includes TABs.
   20.14 +# 2) Preceding whitespace is considered part of the sequence name.  I
   20.15 +# want to use sort -b, but it seems to be broken in different ways for
   20.16 +# different versions of sort!
   20.17 +# 3) Alignments with differences in whitespace are considered
   20.18 +# non-identical.
   20.19 +
   20.20 +# This script uses perl instead of specialized commands like uniq.
   20.21 +# The reason is that, on some systems (e.g. Mac OS X), uniq doesn't
   20.22 +# work with long lines.
   20.23 +
   20.24 +# Make "sort" use a standard ordering:
   20.25 +LC_ALL=C
   20.26 +export LC_ALL
   20.27 +
   20.28 +uniqOpt=1
   20.29 +whichSequence=1
   20.30 +while getopts hdn: opt
   20.31 +do
   20.32 +    case $opt in
   20.33 +	h)  cat <<EOF
   20.34 +Usage: $(basename $0) [options] my-alignments.maf
   20.35 +
   20.36 +Options:
   20.37 +  -h  show this help message and exit
   20.38 +  -d  only print duplicate alignments
   20.39 +  -n  sort by the n-th sequence (default: 1)
   20.40 +EOF
   20.41 +	    exit
   20.42 +	    ;;
   20.43 +	d)  uniqOpt=2
   20.44 +            ;;
   20.45 +	n)  whichSequence="$OPTARG"
   20.46 +	    ;;
   20.47 +    esac
   20.48 +done
   20.49 +shift $((OPTIND - 1))
   20.50 +
   20.51 +baseField=$((6 * $whichSequence))
   20.52 +a=$(($baseField - 4))
   20.53 +a=$a,$a
   20.54 +b=$(($baseField - 1))
   20.55 +b=$b,$b
   20.56 +c=$(($baseField - 3))
   20.57 +c=$c,$c
   20.58 +d=$(($baseField - 2))
   20.59 +d=$d,$d
   20.60 +
   20.61 +# 1) Add digits to "#" lines, so that sorting won't change their order.
   20.62 +# 2) Replace spaces, except in "s" lines.
   20.63 +# 3) Join each alignment into one big line.
   20.64 +perl -pe '
   20.65 +s/^#/sprintf("#%.9d",$c++)/e;
   20.66 +y/ /\a/ unless /^s/;
   20.67 +y/\n/\b/ if /^\w/;
   20.68 +' "$@" |
   20.69 +
   20.70 +sort -k$a -k$b -k${c}n -k${d}n |  # sort the lines
   20.71 +
   20.72 +# Print only the first (or second) of each run of identical lines:
   20.73 +perl -ne '$c = 0 if $x ne $_; $x = $_; print if ++$c == '$uniqOpt |
   20.74 +
   20.75 +# 1) Remove the digits from "#" lines.
   20.76 +# 2) Restore spaces and newlines.
   20.77 +perl -pe '
   20.78 +s/^#.{9}/#/;
   20.79 +y/\a\b/ \n/;
   20.80 +'
    21.1 --- a/scripts/maf-sort.sh	Wed Sep 24 11:41:48 2014 +0900
    21.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
    21.3 @@ -1,77 +0,0 @@
    21.4 -#! /bin/sh
    21.5 -
    21.6 -# Sort MAF-format alignments by sequence name, then strand, then start
    21.7 -# position, then end position, of the top sequence.  Also, merge
    21.8 -# identical alignments.  Comment lines starting with "#" are written
    21.9 -# at the top, in unchanged order.  If option "-d" is specified, then
   21.10 -# alignments that appear only once are omitted (like uniq -d).
   21.11 -
   21.12 -# Minor flaws, that do not matter for typical MAF input:
   21.13 -# 1) It might not work if the input includes TABs.
   21.14 -# 2) Preceding whitespace is considered part of the sequence name.  I
   21.15 -# want to use sort -b, but it seems to be broken in different ways for
   21.16 -# different versions of sort!
   21.17 -# 3) Alignments with differences in whitespace are considered
   21.18 -# non-identical.
   21.19 -
   21.20 -# This script uses perl instead of specialized commands like uniq.
   21.21 -# The reason is that, on some systems (e.g. Mac OS X), uniq doesn't
   21.22 -# work with long lines.
   21.23 -
   21.24 -# Make "sort" use a standard ordering:
   21.25 -LC_ALL=C
   21.26 -export LC_ALL
   21.27 -
   21.28 -uniqOpt=1
   21.29 -whichSequence=1
   21.30 -while getopts hdn: opt
   21.31 -do
   21.32 -    case $opt in
   21.33 -	h)  cat <<EOF
   21.34 -Usage: $(basename $0) [options] my-alignments.maf
   21.35 -
   21.36 -Options:
   21.37 -  -h  show this help message and exit
   21.38 -  -d  only print duplicate alignments
   21.39 -  -n  sort by the n-th sequence (default: 1)
   21.40 -EOF
   21.41 -	    exit
   21.42 -	    ;;
   21.43 -	d)  uniqOpt=2
   21.44 -            ;;
   21.45 -	n)  whichSequence="$OPTARG"
   21.46 -	    ;;
   21.47 -    esac
   21.48 -done
   21.49 -shift $((OPTIND - 1))
   21.50 -
   21.51 -baseField=$((6 * $whichSequence))
   21.52 -a=$(($baseField - 4))
   21.53 -a=$a,$a
   21.54 -b=$(($baseField - 1))
   21.55 -b=$b,$b
   21.56 -c=$(($baseField - 3))
   21.57 -c=$c,$c
   21.58 -d=$(($baseField - 2))
   21.59 -d=$d,$d
   21.60 -
   21.61 -# 1) Add digits to "#" lines, so that sorting won't change their order.
   21.62 -# 2) Replace spaces, except in "s" lines.
   21.63 -# 3) Join each alignment into one big line.
   21.64 -perl -pe '
   21.65 -s/^#/sprintf("#%.9d",$c++)/e;
   21.66 -y/ /\a/ unless /^s/;
   21.67 -y/\n/\b/ if /^\w/;
   21.68 -' "$@" |
   21.69 -
   21.70 -sort -k$a -k$b -k${c}n -k${d}n |  # sort the lines
   21.71 -
   21.72 -# Print only the first (or second) of each run of identical lines:
   21.73 -perl -ne '$c = 0 if $x ne $_; $x = $_; print if ++$c == '$uniqOpt |
   21.74 -
   21.75 -# 1) Remove the digits from "#" lines.
   21.76 -# 2) Restore spaces and newlines.
   21.77 -perl -pe '
   21.78 -s/^#.{9}/#/;
   21.79 -y/\a\b/ \n/;
   21.80 -'
    22.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
    22.2 +++ b/scripts/maf-swap	Wed Sep 24 13:13:14 2014 +0900
    22.3 @@ -0,0 +1,140 @@
    22.4 +#! /usr/bin/env python
    22.5 +
    22.6 +# Read MAF-format alignments, and write them, after moving the Nth
    22.7 +# sequence to the top in each alignment.
    22.8 +
    22.9 +# Before writing, if the top sequence would be on the - strand, then
   22.10 +# flip all the strands.  But don't do this if the top sequence is
   22.11 +# translated DNA.
   22.12 +
   22.13 +# Seems to work with Python 2.x, x>=4
   22.14 +
   22.15 +import fileinput, itertools, optparse, os, signal, string, sys
   22.16 +
   22.17 +def filterComments(lines):
   22.18 +    for i in lines:
   22.19 +        if i.startswith("#"): print i,
   22.20 +        else: yield i
   22.21 +
   22.22 +def mafInput(lines):
   22.23 +    for k, v in itertools.groupby(lines, str.isspace):
   22.24 +        if not k: yield list(v)
   22.25 +
   22.26 +def indexOfNthSequence(mafLines, n):
   22.27 +    for i, line in enumerate(mafLines):
   22.28 +        if line.startswith("s"):
   22.29 +            if n == 1: return i
   22.30 +            n -= 1
   22.31 +    raise Exception("encountered an alignment with too few sequences")
   22.32 +
   22.33 +def rangeOfNthSequence(mafLines, n):
   22.34 +    """Get the range of lines associated with the Nth sequence."""
   22.35 +    start = indexOfNthSequence(mafLines, n)
   22.36 +    stop = start + 1
   22.37 +    while stop < len(mafLines):
   22.38 +        line = mafLines[stop]
   22.39 +        if not (line.startswith("q") or line.startswith("i")): break
   22.40 +        stop += 1
   22.41 +    return start, stop
   22.42 +
   22.43 +complement = string.maketrans('ACGTNSWRYKMBDHVacgtnswrykmbdhv',
   22.44 +                              'TGCANSWYRMKVHDBtgcanswyrmkvhdb')
   22.45 +# doesn't handle "U" in RNA sequences
   22.46 +def revcomp(seq):
   22.47 +    return seq[::-1].translate(complement)
   22.48 +
   22.49 +def flippedMafS(words):
   22.50 +    alnStart = int(words[2])
   22.51 +    alnSize = int(words[3])
   22.52 +    strand = words[4]
   22.53 +    seqSize = int(words[5])
   22.54 +    alnString = words[6]
   22.55 +    newStart = seqSize - alnStart - alnSize
   22.56 +    if strand == "-": newStrand = "+"
   22.57 +    else:             newStrand = "-"
   22.58 +    newString = revcomp(alnString)
   22.59 +    out = words[0], words[1], newStart, alnSize, newStrand, seqSize, newString
   22.60 +    return map(str, out)
   22.61 +
   22.62 +def flippedMafP(words):
   22.63 +    flippedString = words[1][::-1]
   22.64 +    return words[:1] + [flippedString]
   22.65 +
   22.66 +def flippedMafQ(words):
   22.67 +    qualityString = words[2]
   22.68 +    flippedString = qualityString[::-1]
   22.69 +    return words[:2] + [flippedString]
   22.70 +
   22.71 +def flippedMafLine(mafLine):
   22.72 +    words = mafLine.split()
   22.73 +    if   words[0] == "s": return flippedMafS(words)
   22.74 +    elif words[0] == "p": return flippedMafP(words)
   22.75 +    elif words[0] == "q": return flippedMafQ(words)
   22.76 +    else: return words
   22.77 +
   22.78 +def maxlen(s):
   22.79 +    return max(map(len, s))
   22.80 +
   22.81 +def sLineFieldWidths(mafLines):
   22.82 +    sLines = (i for i in mafLines if i[0] == "s")
   22.83 +    sColumns = zip(*sLines)
   22.84 +    return map(maxlen, sColumns)
   22.85 +
   22.86 +def joinedMafS(words, fieldWidths):
   22.87 +    formatParams = itertools.chain(*zip(fieldWidths, words))
   22.88 +    return "%*s %-*s %*s %*s %*s %*s %*s\n" % tuple(formatParams)
   22.89 +
   22.90 +def joinedMafLine(words, fieldWidths):
   22.91 +    if words[0] == "s":
   22.92 +        return joinedMafS(words, fieldWidths)
   22.93 +    elif words[0] == "q":
   22.94 +        words = words[:2] + [""] * 4 + words[2:]
   22.95 +        return joinedMafS(words, fieldWidths)
   22.96 +    elif words[0] == "p":
   22.97 +        words = words[:1] + [""] * 5 + words[1:]
   22.98 +        return joinedMafS(words, fieldWidths)
   22.99 +    else:
  22.100 +        return " ".join(words) + "\n"
  22.101 +
  22.102 +def flippedMaf(mafLines):
  22.103 +    flippedLines = map(flippedMafLine, mafLines)
  22.104 +    fieldWidths = sLineFieldWidths(flippedLines)
  22.105 +    return (joinedMafLine(i, fieldWidths) for i in flippedLines)
  22.106 +
  22.107 +def isCanonicalStrand(mafLine):
  22.108 +    words = mafLine.split()
  22.109 +    strand = words[4]
  22.110 +    if strand == "+": return True
  22.111 +    alnString = words[6]
  22.112 +    if "/" in alnString or "\\" in alnString: return True  # frameshifts
  22.113 +    alnSize = int(words[3])
  22.114 +    gapCount = alnString.count("-")
  22.115 +    if len(alnString) - gapCount < alnSize: return True  # translated DNA
  22.116 +    return False
  22.117 +
  22.118 +def mafSwap(opts, args):
  22.119 +    inputLines = fileinput.input(args)
  22.120 +    for mafLines in mafInput(filterComments(inputLines)):
  22.121 +        start, stop = rangeOfNthSequence(mafLines, opts.n)
  22.122 +        mafLines[1:stop] = mafLines[start:stop] + mafLines[1:start]
  22.123 +        if not isCanonicalStrand(mafLines[1]):
  22.124 +            mafLines = flippedMaf(mafLines)
  22.125 +        for i in mafLines: print i,
  22.126 +        print  # blank line after each alignment
  22.127 +
  22.128 +if __name__ == "__main__":
  22.129 +    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  22.130 +
  22.131 +    usage = "%prog [options] my-alignments.maf"
  22.132 +    description = "Change the order of sequences in MAF-format alignments."
  22.133 +    op = optparse.OptionParser(usage=usage, description=description)
  22.134 +    op.add_option("-n", type="int", default=2,
  22.135 +                  help="move the Nth sequence to the top (default: %default)")
  22.136 +    (opts, args) = op.parse_args()
  22.137 +    if opts.n < 1: op.error("option -n: should be >= 1")
  22.138 +
  22.139 +    try: mafSwap(opts, args)
  22.140 +    except KeyboardInterrupt: pass  # avoid silly error message
  22.141 +    except Exception, e:
  22.142 +        prog = os.path.basename(sys.argv[0])
  22.143 +        sys.exit(prog + ": error: " + str(e))
    23.1 --- a/scripts/maf-swap.py	Wed Sep 24 11:41:48 2014 +0900
    23.2 +++ /dev/null	Thu Jan 01 00:00:00 1970 +0000
    23.3 @@ -1,140 +0,0 @@
    23.4 -#! /usr/bin/env python
    23.5 -
    23.6 -# Read MAF-format alignments, and write them, after moving the Nth
    23.7 -# sequence to the top in each alignment.
    23.8 -
    23.9 -# Before writing, if the top sequence would be on the - strand, then
   23.10 -# flip all the strands.  But don't do this if the top sequence is
   23.11 -# translated DNA.
   23.12 -
   23.13 -# Seems to work with Python 2.x, x>=4
   23.14 -
   23.15 -import fileinput, itertools, optparse, os, signal, string, sys
   23.16 -
   23.17 -def filterComments(lines):
   23.18 -    for i in lines:
   23.19 -        if i.startswith("#"): print i,
   23.20 -        else: yield i
   23.21 -
   23.22 -def mafInput(lines):
   23.23 -    for k, v in itertools.groupby(lines, str.isspace):
   23.24 -        if not k: yield list(v)
   23.25 -
   23.26 -def indexOfNthSequence(mafLines, n):
   23.27 -    for i, line in enumerate(mafLines):
   23.28 -        if line.startswith("s"):
   23.29 -            if n == 1: return i
   23.30 -            n -= 1
   23.31 -    raise Exception("encountered an alignment with too few sequences")
   23.32 -
   23.33 -def rangeOfNthSequence(mafLines, n):
   23.34 -    """Get the range of lines associated with the Nth sequence."""
   23.35 -    start = indexOfNthSequence(mafLines, n)
   23.36 -    stop = start + 1
   23.37 -    while stop < len(mafLines):
   23.38 -        line = mafLines[stop]
   23.39 -        if not (line.startswith("q") or line.startswith("i")): break
   23.40 -        stop += 1
   23.41 -    return start, stop
   23.42 -
   23.43 -complement = string.maketrans('ACGTNSWRYKMBDHVacgtnswrykmbdhv',
   23.44 -                              'TGCANSWYRMKVHDBtgcanswyrmkvhdb')
   23.45 -# doesn't handle "U" in RNA sequences
   23.46 -def revcomp(seq):
   23.47 -    return seq[::-1].translate(complement)
   23.48 -
   23.49 -def flippedMafS(words):
   23.50 -    alnStart = int(words[2])
   23.51 -    alnSize = int(words[3])
   23.52 -    strand = words[4]
   23.53 -    seqSize = int(words[5])
   23.54 -    alnString = words[6]
   23.55 -    newStart = seqSize - alnStart - alnSize
   23.56 -    if strand == "-": newStrand = "+"
   23.57 -    else:             newStrand = "-"
   23.58 -    newString = revcomp(alnString)
   23.59 -    out = words[0], words[1], newStart, alnSize, newStrand, seqSize, newString
   23.60 -    return map(str, out)
   23.61 -
   23.62 -def flippedMafP(words):
   23.63 -    flippedString = words[1][::-1]
   23.64 -    return words[:1] + [flippedString]
   23.65 -
   23.66 -def flippedMafQ(words):
   23.67 -    qualityString = words[2]
   23.68 -    flippedString = qualityString[::-1]
   23.69 -    return words[:2] + [flippedString]
   23.70 -
   23.71 -def flippedMafLine(mafLine):
   23.72 -    words = mafLine.split()
   23.73 -    if   words[0] == "s": return flippedMafS(words)
   23.74 -    elif words[0] == "p": return flippedMafP(words)
   23.75 -    elif words[0] == "q": return flippedMafQ(words)
   23.76 -    else: return words
   23.77 -
   23.78 -def maxlen(s):
   23.79 -    return max(map(len, s))
   23.80 -
   23.81 -def sLineFieldWidths(mafLines):
   23.82 -    sLines = (i for i in mafLines if i[0] == "s")
   23.83 -    sColumns = zip(*sLines)
   23.84 -    return map(maxlen, sColumns)
   23.85 -
   23.86 -def joinedMafS(words, fieldWidths):
   23.87 -    formatParams = itertools.chain(*zip(fieldWidths, words))
   23.88 -    return "%*s %-*s %*s %*s %*s %*s %*s\n" % tuple(formatParams)
   23.89 -
   23.90 -def joinedMafLine(words, fieldWidths):
   23.91 -    if words[0] == "s":
   23.92 -        return joinedMafS(words, fieldWidths)
   23.93 -    elif words[0] == "q":
   23.94 -        words = words[:2] + [""] * 4 + words[2:]
   23.95 -        return joinedMafS(words, fieldWidths)
   23.96 -    elif words[0] == "p":
   23.97 -        words = words[:1] + [""] * 5 + words[1:]
   23.98 -        return joinedMafS(words, fieldWidths)
   23.99 -    else:
  23.100 -        return " ".join(words) + "\n"
  23.101 -
  23.102 -def flippedMaf(mafLines):
  23.103 -    flippedLines = map(flippedMafLine, mafLines)
  23.104 -    fieldWidths = sLineFieldWidths(flippedLines)
  23.105 -    return (joinedMafLine(i, fieldWidths) for i in flippedLines)
  23.106 -
  23.107 -def isCanonicalStrand(mafLine):
  23.108 -    words = mafLine.split()
  23.109 -    strand = words[4]
  23.110 -    if strand == "+": return True
  23.111 -    alnString = words[6]
  23.112 -    if "/" in alnString or "\\" in alnString: return True  # frameshifts
  23.113 -    alnSize = int(words[3])
  23.114 -    gapCount = alnString.count("-")
  23.115 -    if len(alnString) - gapCount < alnSize: return True  # translated DNA
  23.116 -    return False
  23.117 -
  23.118 -def mafSwap(opts, args):
  23.119 -    inputLines = fileinput.input(args)
  23.120 -    for mafLines in mafInput(filterComments(inputLines)):
  23.121 -        start, stop = rangeOfNthSequence(mafLines, opts.n)
  23.122 -        mafLines[1:stop] = mafLines[start:stop] + mafLines[1:start]
  23.123 -        if not isCanonicalStrand(mafLines[1]):
  23.124 -            mafLines = flippedMaf(mafLines)
  23.125 -        for i in mafLines: print i,
  23.126 -        print  # blank line after each alignment
  23.127 -
  23.128 -if __name__ == "__main__":
  23.129 -    signal.signal(signal.SIGPIPE, signal.SIG_DFL)  # avoid silly error message
  23.130 -
  23.131 -    usage = "%prog [options] my-alignments.maf"
  23.132 -    description = "Change the order of sequences in MAF-format alignments."
  23.133 -    op = optparse.OptionParser(usage=usage, description=description)
  23.134 -    op.add_option("-n", type="int", default=2,
  23.135 -                  help="move the Nth sequence to the top (default: %default)")
  23.136 -    (opts, args) = op.parse_args()
  23.137 -    if opts.n < 1: op.error("option -n: should be >= 1")
  23.138 -
  23.139 -    try: mafSwap(opts, args)
  23.140 -    except KeyboardInterrupt: pass  # avoid silly error message
  23.141 -    except Exception, e:
  23.142 -        prog = os.path.basename(sys.argv[0])
  23.143 -        sys.exit(prog + ": error: " + str(e))
    24.1 --- a/test/maf-convert-test.out	Wed Sep 24 11:41:48 2014 +0900
    24.2 +++ b/test/maf-convert-test.out	Wed Sep 24 13:13:14 2014 +0900
    24.3 @@ -1,11 +1,11 @@
    24.4  Usage: 
    24.5 -  maf-convert.py --help
    24.6 -  maf-convert.py axt mafFile(s)
    24.7 -  maf-convert.py blast mafFile(s)
    24.8 -  maf-convert.py html mafFile(s)
    24.9 -  maf-convert.py psl mafFile(s)
   24.10 -  maf-convert.py sam mafFile(s)
   24.11 -  maf-convert.py tab mafFile(s)
   24.12 +  maf-convert --help
   24.13 +  maf-convert axt mafFile(s)
   24.14 +  maf-convert blast mafFile(s)
   24.15 +  maf-convert html mafFile(s)
   24.16 +  maf-convert psl mafFile(s)
   24.17 +  maf-convert sam mafFile(s)
   24.18 +  maf-convert tab mafFile(s)
   24.19  
   24.20  Read MAF-format alignments & write them in another format.
   24.21  
    25.1 --- a/test/maf-convert-test.sh	Wed Sep 24 11:41:48 2014 +0900
    25.2 +++ b/test/maf-convert-test.sh	Wed Sep 24 13:13:14 2014 +0900
    25.3 @@ -6,7 +6,7 @@
    25.4  
    25.5  PATH=../scripts:$PATH
    25.6  
    25.7 -r=maf-convert.py
    25.8 +r=maf-convert
    25.9  
   25.10  maf1=SRR359290-1k.maf
   25.11  maf2=bs100.maf