doc/last-dotplot.txt
author Martin C. Frith
Wed Aug 21 14:32:10 2019 +0900 (11 months ago)
 
last-dotplot
============

This script makes a dotplot, a.k.a. Oxford Grid, of pair-wise sequence
alignments in MAF or LAST tabular format.  It requires the `Python
Imaging Library <https://pillow.readthedocs.io/>`_ to be installed.
It can be used like this::

  last-dotplot my-alignments my-plot.png

The output can be in any format supported by the Imaging Library::

  last-dotplot alns alns.gif

Terminology
-----------

last-dotplot shows alignments of one set of sequences against another
set of sequences.  This document calls a "set of sequences" a
"genome", though it need not actually be a genome.

Options
-------

  -h, --help
      Show a help message, with default option values, and exit.
  -v, --verbose
      Show progress messages & data about the plot.
  -x INT, --width=INT
      Maximum width in pixels.
  -y INT, --height=INT
      Maximum height in pixels.
  -m M, --maxseqs=M
      Maximum number of horizontal or vertical sequences.  If there
      are >M sequences, the smallest ones (after cutting) will be
      discarded.
  -1 PATTERN, --seq1=PATTERN
      Which sequences to show from the 1st (horizontal) genome.
  -2 PATTERN, --seq2=PATTERN
      Which sequences to show from the 2nd (vertical) genome.
  -c COLOR, --forwardcolor=COLOR
      Color for forward alignments.
  -r COLOR, --reversecolor=COLOR
      Color for reverse alignments.
  --alignments=FILE
      Read secondary alignments.  For example: we could use primary
      alignment data with one human DNA read aligned to the human
      genome, and secondary alignment data with the whole chimpanzee
      versus human genomes.  last-dotplot will show the parts of the
      secondary alignments that are near the primary alignments.
  --sort1=N
      Put the 1st genome's sequences left-to-right in order of: their
      appearance in the input (0), their names (1), their lengths (2),
      the top-to-bottom order of (the midpoints of) their alignments
      (3).  You can use two colon-separated values, e.g. "2:1" to
      specify 2 for primary and 1 for secondary alignments.
  --sort2=N
      Put the 2nd genome's sequences top-to-bottom in order of: their
      appearance in the input (0), their names (1), their lengths (2),
      the left-to-right order of (the midpoints of) their alignments
      (3).
  --strands1=N
      Put the 1st genome's sequences: in forwards orientation (0), in
      the orientation of most of their aligned bases (1).  In the
      latter case, the labels will be colored (in the same way as the
      alignments) to indicate the sequences' orientations.  You can
      use two colon-separated values for primary and secondary
      alignments.
  --strands2=N
      Put the 2nd genome's sequences: in forwards orientation (0), in
      the orientation of most of their aligned bases (1).
  --max-gap1=FRAC
      Maximum unaligned gap in the 1st genome.  For example, if two
      parts of one DNA read align to widely-separated parts of one
      chromosome, it's probably best to cut the intervening region
      from the dotplot.  FRAC is a fraction of the length of the
      (primary) alignments.  You can specify "inf" to keep all
      unaligned gaps.  You can use two comma-separated values,
      e.g. "0.5,3" to specify 0.5 for end-gaps (unaligned sequence
      ends) and 3 for mid-gaps (between alignments).  You can use two
      colon-separated values (each of which may be comma-separated)
      for primary and secondary alignments.
  --max-gap2=FRAC
      Maximum unaligned gap in the 2nd genome.
  --pad=FRAC
      Length of pad to leave when cutting unaligned gaps.
  -j N, --join=N
      Draw horizontal or vertical lines joining adjacent alignments.
      0 means don't join, 1 means draw vertical lines joining
      alignments that are adjacent in the 1st (horizontal) genome, 2
      means draw horizontal lines joining alignments that are adjacent
      in the 2nd (vertical) genome.
  --border-pixels=INT
      Number of pixels between sequences.
  --border-color=COLOR
      Color for pixels between sequences.
  --margin-color=COLOR
      Color for the margins.
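 
The ``--max-gap1`` and ``--max-gap2`` value syntax is compact, so here
is a small Python sketch of how such a value could be read.  It only
illustrates the syntax described above and is not last-dotplot's
actual code: the function name ``parse_max_gap`` is hypothetical, and
it assumes that a single number applies to both end-gaps and mid-gaps,
and that a value without a colon applies to both primary and secondary
alignments::

  def parse_max_gap(text):
      """Return ((end, mid), (end, mid)) for primary, secondary alignments."""
      def parse_part(part):
          # "0.5,3" means 0.5 for end-gaps and 3 for mid-gaps;
          # float() also accepts "inf", which keeps all unaligned gaps
          numbers = [float(x) for x in part.split(",")]
          if len(numbers) == 1:
              numbers *= 2  # assumption: one number covers both gap types
          if len(numbers) != 2:
              raise ValueError("expected 1 or 2 comma-separated values")
          return tuple(numbers)

      parts = [parse_part(p) for p in text.split(":")]
      if len(parts) == 1:
          parts *= 2  # assumption: no colon means the same for both
      if len(parts) != 2:
          raise ValueError("expected 1 or 2 colon-separated values")
      return tuple(parts)

Under these assumptions, ``parse_max_gap("0.5,3:inf")`` returns
``((0.5, 3.0), (inf, inf))``: end-gaps of 0.5 and mid-gaps of 3 for
the primary alignments, and no cutting for the secondary alignments.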
 
Text options
~~~~~~~~~~~~

  -f FILE, --fontfile=FILE
      TrueType or OpenType font file.
  -s SIZE, --fontsize=SIZE
      TrueType or OpenType font size.
  --labels1=N
      Label the displayed regions of the 1st genome with their:
      sequence name (0), name:length (1), name:start:length (2),
      name:start-end (3).
  --labels2=N
      Label the displayed regions of the 2nd genome with their:
      sequence name (0), name:length (1), name:start:length (2),
      name:start-end (3).
  --rot1=ROT
      Text rotation for the 1st genome: h(orizontal) or v(ertical).
  --rot2=ROT
      Text rotation for the 2nd genome: h(orizontal) or v(ertical).
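 
To make the four label styles concrete, here is a minimal Python
sketch that formats one displayed region in each style.  The function
name ``make_label`` is hypothetical, and the sketch assumes plain
integer coordinates where the end is start plus length; the labels
that last-dotplot actually draws may be formatted differently (for
example, with abbreviated numbers)::

  def make_label(name, start, length, style):
      """Format a region label in one of the --labels1/--labels2 styles."""
      if style == 0:
          return name                                   # sequence name
      if style == 1:
          return "%s:%s" % (name, length)               # name:length
      if style == 2:
          return "%s:%s:%s" % (name, start, length)     # name:start:length
      if style == 3:
          # assumption: the end coordinate is start + length
          return "%s:%s-%s" % (name, start, start + length)
      raise ValueError("style should be 0, 1, 2 or 3")

For a region of chr9 that starts at 5000 and is 1000 bases long, the
four styles would give ``chr9``, ``chr9:1000``, ``chr9:5000:1000`` and
``chr9:5000-6000``.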
 
Annotation options
~~~~~~~~~~~~~~~~~~

These options read annotations of sequence segments, and draw them as
colored horizontal or vertical stripes.  This looks good only if the
annotations are reasonably sparse: e.g. you can't sensibly view 20000
gene annotations in one small dotplot.

  --bed1=FILE
      Read `BED-format
      <https://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_
      annotations for the 1st genome.  They are drawn as stripes, with
      coordinates given by the first three BED fields.  The color is
      specified by the RGB field if present, else pale red if the
      strand is "+", pale blue if "-", or pale purple.
  --bed2=FILE
      Read BED-format annotations for the 2nd genome.
  --rmsk1=FILE
      Read repeat annotations for the 1st genome, in RepeatMasker .out
      or rmsk.txt format.  The color is pale purple for "low
      complexity", "simple repeats", and "satellites", else pale red
      for "+" strand and pale blue for "-" strand.
  --rmsk2=FILE
      Read repeat annotations for the 2nd genome.
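 
The color rule for BED annotations can be sketched in Python as
follows.  This is an illustration of the rule stated above, not
last-dotplot's own code: the function name ``bed_stripe_color`` is
hypothetical, the colors are returned as descriptive words rather than
real color values, and the RGB field is treated as "present" simply
when the line has a 9th field (strand being the 6th field, as in the
BED specification)::

  def bed_stripe_color(fields):
      """Describe the stripe color for one BED line, given its fields."""
      if len(fields) > 8:
          return fields[8]  # the RGB field, e.g. "255,0,0"
      strand = fields[5] if len(fields) > 5 else ""
      if strand == "+":
          return "pale red"
      if strand == "-":
          return "pale blue"
      return "pale purple"  # no strand information

So a three-field BED line such as ``chr1 100 200`` would get a pale
purple stripe.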
 
Gene options
~~~~~~~~~~~~

  --genePred1=FILE
      Read gene annotations for the 1st genome in `genePred format
      <https://genome.ucsc.edu/FAQ/FAQformat.html#format9>`_.
  --genePred2=FILE
      Read gene annotations for the 2nd genome.
  --exon-color=COLOR
      Color for exons.
  --cds-color=COLOR
      Color for protein-coding regions.

Unsequenced gap options
~~~~~~~~~~~~~~~~~~~~~~~

Note: these "gaps" are *not* alignment gaps (indels): they are regions
of unknown sequence.

  --gap1=FILE
      Read unsequenced gaps in the 1st genome from an agp or gap file.
  --gap2=FILE
      Read unsequenced gaps in the 2nd genome from an agp or gap file.
  --bridged-color=COLOR
      Color for bridged gaps.
  --unbridged-color=COLOR
      Color for unbridged gaps.

An unsequenced gap will be shown only if it covers at least one whole
pixel.
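 
One possible reading of "covers at least one whole pixel" is sketched
below in Python.  The function name ``covers_whole_pixel`` and the
pixel layout are assumptions, not taken from last-dotplot: the sketch
supposes that each pixel represents a fixed whole number of bases, so
that pixel ``i`` covers bases ``i * bases_per_pixel`` up to (but not
including) ``(i + 1) * bases_per_pixel``::

  def covers_whole_pixel(gap_start, gap_end, bases_per_pixel):
      """True if the gap [gap_start, gap_end) contains a whole pixel."""
      # first pixel that starts at or after gap_start (ceiling division)
      first_pixel = -(-gap_start // bases_per_pixel)
      # number of pixels that end at or before gap_end (floor division)
      pixels_ended = gap_end // bases_per_pixel
      return first_pixel < pixels_ended

Under these assumptions, with 1000 bases per pixel, a gap spanning
bases 500 to 1600 is longer than one pixel yet covers no whole pixel,
so it would not be shown, whereas a gap spanning bases 1000 to 2000
would be.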
 
Choosing sequences
------------------

For example, you can exclude sequences with names like
"chrUn_random522" like this::

  last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png

Option "-1" selects sequences from the 1st (horizontal) genome, and
"-2" selects sequences from the 2nd (vertical) genome.  'chr[!U]*' is
a *pattern* that specifies names starting with "chr", followed by any
character except U, followed by anything.

==========  =============================
Pattern     Meaning
----------  -----------------------------
``*``       zero or more of any character
``?``       any single character
``[abc]``   any character in abc
``[!abc]``  any character not in abc
==========  =============================

If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
compared to both the whole name and the part after the dot.
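 
The pattern syntax in the table is the same as that of Python's
standard ``fnmatch`` module, so the matching rule can be tried out
with a short sketch.  The function name ``name_matches`` is
hypothetical, and two details are assumptions rather than statements
from this document: matching is case-sensitive, and for a name with
several dots "the part after the dot" means the part after the last
dot::

  from fnmatch import fnmatchcase

  def name_matches(name, patterns):
      """True if the whole name, or the part after its dot, matches."""
      candidates = [name]
      if "." in name:
          # assumption: use the part after the last dot
          candidates.append(name.rsplit(".", 1)[-1])
      return any(fnmatchcase(c, p) for c in candidates for p in patterns)

With this sketch, ``name_matches("hg19.chr7", ["chr[!U]*"])`` is true
because the part after the dot matches, and
``name_matches("chrUn_random522", ["chr[!U]*"])`` is false.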
 
You can specify more than one pattern, e.g. this gets sequences with
names starting in "chr" followed by one or two characters::

  last-dotplot -1 'chr?' -1 'chr??' alns alns.png

You can also specify a sequence range; for example this gets the first
1000 bases of chr9::

  last-dotplot -1 chr9:0-1000 alns alns.png
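 
Since ``chr9:0-1000`` selects the first 1000 bases, the range
evidently counts from zero and excludes the end coordinate.  The
Python sketch below splits such a request into its parts; the function
name ``parse_seq_request`` is hypothetical, and treating anything
without a trailing ``:START-END`` as a plain pattern is an assumption,
not necessarily how last-dotplot parses its arguments::

  def parse_seq_request(text):
      """Split 'PATTERN' or 'PATTERN:START-END' into (pattern, start, end)."""
      if ":" in text:
          pattern, _, interval = text.rpartition(":")
          start, dash, end = interval.partition("-")
          if dash and start.isdigit() and end.isdigit():
              return pattern, int(start), int(end)
      return text, None, None  # no range: the whole sequence

Here ``parse_seq_request("chr9:0-1000")`` gives ``("chr9", 0, 1000)``,
and the number of selected bases is the end minus the start.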
 
Text font
---------

You can improve the font quality by increasing its size, e.g. to 20
points::

  last-dotplot -s20 my-alignments my-plot.png

last-dotplot tries to find a nice font on your computer, but may fail
and use an ugly font.  You can specify a font like this::

  last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png

Colors
------

Colors can be specified in `various ways described here
<http://effbot.org/imagingbook/imagecolor.htm>`_.