doc/last-dotplot.txt
author Martin C. Frith
Wed Aug 21 14:32:10 2019 +0900 (11 months ago)
changeset 984 fb43f850eba2
dotplot: change satellite-repeat color to purple
last-dotplot
============

This script makes a dotplot, a.k.a. Oxford Grid, of pairwise sequence
alignments in MAF or LAST tabular format.  It requires the `Python
Imaging Library <https://pillow.readthedocs.io/>`_ to be installed.
It can be used like this::

  last-dotplot my-alignments my-plot.png

The output can be in any format supported by the Imaging Library::

  last-dotplot alns alns.gif

Terminology
-----------

last-dotplot shows alignments of one set of sequences against another
set of sequences.  This document calls a "set of sequences" a
"genome", though it need not actually be a genome.

Options
-------

  -h, --help
      Show a help message, with default option values, and exit.
  -v, --verbose
      Show progress messages and data about the plot.
  -x INT, --width=INT
      Maximum width in pixels.
  -y INT, --height=INT
      Maximum height in pixels.
  -m M, --maxseqs=M
      Maximum number of horizontal or vertical sequences.  If there
      are more than M sequences, the smallest ones (after cutting)
      will be discarded.
  -1 PATTERN, --seq1=PATTERN
      Which sequences to show from the 1st (horizontal) genome.
  -2 PATTERN, --seq2=PATTERN
      Which sequences to show from the 2nd (vertical) genome.
  -c COLOR, --forwardcolor=COLOR
      Color for forward alignments.
  -r COLOR, --reversecolor=COLOR
      Color for reverse alignments.
  --alignments=FILE
      Read secondary alignments.  For example: we could use primary
      alignment data with one human DNA read aligned to the human
      genome, and secondary alignment data with the whole chimpanzee
      versus human genomes.  last-dotplot will show the parts of the
      secondary alignments that are near the primary alignments.
  --sort1=N
      Put the 1st genome's sequences left-to-right in order of: their
      appearance in the input (0), their names (1), their lengths (2),
      the top-to-bottom order of (the midpoints of) their alignments
      (3).  You can use two colon-separated values, e.g. "2:1" to
      specify 2 for primary and 1 for secondary alignments.
  --sort2=N
      Put the 2nd genome's sequences top-to-bottom in order of: their
      appearance in the input (0), their names (1), their lengths (2),
      the left-to-right order of (the midpoints of) their alignments
      (3).
  --strands1=N
      Put the 1st genome's sequences: in forwards orientation (0), or
      in the orientation of most of their aligned bases (1).  In the
      latter case, the labels will be colored (in the same way as the
      alignments) to indicate the sequences' orientations.  You can
      use two colon-separated values for primary and secondary
      alignments.
  --strands2=N
      Put the 2nd genome's sequences: in forwards orientation (0), or
      in the orientation of most of their aligned bases (1).
  --max-gap1=FRAC
      Maximum unaligned gap in the 1st genome.  For example, if two
      parts of one DNA read align to widely separated parts of one
      chromosome, it's probably best to cut the intervening region
      from the dotplot.  FRAC is a fraction of the length of the
      (primary) alignments.  You can specify "inf" to keep all
      unaligned gaps.  You can use two comma-separated values,
      e.g. "0.5,3" to specify 0.5 for end-gaps (unaligned sequence
      ends) and 3 for mid-gaps (between alignments).  You can use two
      colon-separated values (each of which may be comma-separated)
      for primary and secondary alignments.
  --max-gap2=FRAC
      Maximum unaligned gap in the 2nd genome.
  --pad=FRAC
      Length of pad to leave when cutting unaligned gaps.
  -j N, --join=N
      Draw horizontal or vertical lines joining adjacent alignments.
      0 means don't join, 1 means draw vertical lines joining
      alignments that are adjacent in the 1st (horizontal) genome, 2
      means draw horizontal lines joining alignments that are adjacent
      in the 2nd (vertical) genome.
  --border-pixels=INT
      Number of pixels between sequences.
  --border-color=COLOR
      Color for pixels between sequences.
  --margin-color=COLOR
      Color for the margins.
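
The value syntax of "--max-gap1" and "--max-gap2" described above
(comma-separated end-gap and mid-gap limits, optionally given twice
with a colon for primary and secondary alignments) can be illustrated
with a short Python sketch.  This is not last-dotplot's own parser:
the function name ``parse_max_gap`` is hypothetical, and the sketch
assumes that a single value applies to both end-gaps and mid-gaps, and
that a value without a colon applies to both primary and secondary
alignments.

```python
def parse_max_gap(spec):
    """Parse a --max-gap1 style string (illustrative sketch only).

    Returns a list of two (end_gap, mid_gap) tuples: the first for
    primary alignments, the second for secondary alignments.
    """
    def parse_part(part):
        # float() accepts "inf", which means "keep all unaligned gaps"
        values = [float(v) for v in part.split(",")]
        if len(values) == 1:
            # Assumption: one value covers both end-gaps and mid-gaps
            values = values * 2
        if len(values) != 2:
            raise ValueError("expected 1 or 2 comma-separated values: " + part)
        return tuple(values)

    parts = [parse_part(p) for p in spec.split(":")]
    if len(parts) == 1:
        # Assumption: without a colon, secondary uses the same limits
        parts = parts * 2
    if len(parts) != 2:
        raise ValueError("expected 1 or 2 colon-separated values: " + spec)
    return parts


print(parse_max_gap("0.5,3"))
print(parse_max_gap("0.5,3:inf"))
```

Under these assumptions, "0.5,3:inf" would mean: for primary
alignments, cut end-gaps longer than 0.5 and mid-gaps longer than 3
times the alignment length, and for secondary alignments keep all
unaligned gaps.
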

Text options
~~~~~~~~~~~~

  -f FILE, --fontfile=FILE
      TrueType or OpenType font file.
  -s SIZE, --fontsize=SIZE
      TrueType or OpenType font size.
  --labels1=N
      Label the displayed regions of the 1st genome with their:
      sequence name (0), name:length (1), name:start:length (2),
      name:start-end (3).
  --labels2=N
      Label the displayed regions of the 2nd genome with their:
      sequence name (0), name:length (1), name:start:length (2),
      name:start-end (3).
  --rot1=ROT
      Text rotation for the 1st genome: h(orizontal) or v(ertical).
  --rot2=ROT
      Text rotation for the 2nd genome: h(orizontal) or v(ertical).
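
The four label styles selected by "--labels1" and "--labels2" can be
sketched in Python as follows.  This is only an illustration of the
layouts listed above: the function ``format_label`` is hypothetical,
it assumes zero-based start and end coordinates, and the exact number
formatting used by last-dotplot may differ.

```python
def format_label(name, start, end, mode):
    """Build a region label in the style chosen by --labels1/--labels2.

    Illustrative sketch: start and end are assumed to be zero-based,
    with end exclusive, so the length is end - start.
    """
    length = end - start
    if mode == 0:
        return name                                     # sequence name
    if mode == 1:
        return "{}:{}".format(name, length)             # name:length
    if mode == 2:
        return "{}:{}:{}".format(name, start, length)   # name:start:length
    if mode == 3:
        return "{}:{}-{}".format(name, start, end)      # name:start-end
    raise ValueError("mode must be 0, 1, 2 or 3")


for mode in range(4):
    print(mode, format_label("chr7", 100, 250, mode))
```
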

Annotation options
~~~~~~~~~~~~~~~~~~

These options read annotations of sequence segments, and draw them as
colored horizontal or vertical stripes.  This looks good only if the
annotations are reasonably sparse: e.g. you can't sensibly view 20000
gene annotations in one small dotplot.

  --bed1=FILE
      Read `BED-format
      <https://genome.ucsc.edu/FAQ/FAQformat.html#format1>`_
      annotations for the 1st genome.  They are drawn as stripes, with
      coordinates given by the first three BED fields.  The color is
      specified by the RGB field if present, else pale red if the
      strand is "+", pale blue if "-", otherwise pale purple.
  --bed2=FILE
      Read BED-format annotations for the 2nd genome.
  --rmsk1=FILE
      Read repeat annotations for the 1st genome, in RepeatMasker .out
      or rmsk.txt format.  The color is pale purple for "low
      complexity", "simple repeats", and "satellites", else pale red
      for "+" strand and pale blue for "-" strand.
  --rmsk2=FILE
      Read repeat annotations for the 2nd genome.
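
The color rule for BED annotations described under "--bed1" can be
sketched in Python as below.  This is an illustration, not
last-dotplot's code: the function ``bed_stripe_color`` is hypothetical,
the color names are returned as plain descriptive strings rather than
the actual RGB values the script uses, and treating only an "R,G,B"
triple in the 9th field as "present" is an assumption.

```python
def bed_stripe_color(line):
    """Choose a stripe color for one BED line (illustrative sketch).

    Standard BED field order: chrom, chromStart, chromEnd, name,
    score, strand, thickStart, thickEnd, itemRgb, ...
    """
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("a BED line needs at least 3 fields")
    # Use the RGB field (9th field) when it holds an "R,G,B" triple
    if len(fields) > 8 and fields[8].count(",") == 2:
        return tuple(int(x) for x in fields[8].split(","))
    # Otherwise fall back on the strand (6th field), if there is one
    strand = fields[5] if len(fields) > 5 else "."
    if strand == "+":
        return "pale red"
    if strand == "-":
        return "pale blue"
    return "pale purple"


print(bed_stripe_color("chr1 0 100"))
print(bed_stripe_color("chr1 0 100 geneA 0 +"))
print(bed_stripe_color("chr1 0 100 geneB 0 - 0 100 255,0,0"))
```
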

Gene options
~~~~~~~~~~~~

  --genePred1=FILE
      Read gene annotations for the 1st genome in `genePred format
      <https://genome.ucsc.edu/FAQ/FAQformat.html#format9>`_.
  --genePred2=FILE
      Read gene annotations for the 2nd genome.
  --exon-color=COLOR
      Color for exons.
  --cds-color=COLOR
      Color for protein-coding regions.

Unsequenced gap options
~~~~~~~~~~~~~~~~~~~~~~~

Note: these "gaps" are *not* alignment gaps (indels): they are regions
of unknown sequence.

  --gap1=FILE
      Read unsequenced gaps in the 1st genome from an agp or gap file.
  --gap2=FILE
      Read unsequenced gaps in the 2nd genome from an agp or gap file.
  --bridged-color=COLOR
      Color for bridged gaps.
  --unbridged-color=COLOR
      Color for unbridged gaps.

An unsequenced gap will be shown only if it covers at least one whole
pixel.
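
To make the "at least one whole pixel" rule concrete, here is a small
Python sketch.  It rests on a simplified, assumed geometry that may not
match last-dotplot's real layout: each pixel is taken to represent a
fixed integer number of bases, pixel ``i`` is taken to span bases
``i * bases_per_pixel`` up to (but not including)
``(i + 1) * bases_per_pixel``, and the function name
``covers_whole_pixel`` is hypothetical.

```python
def covers_whole_pixel(gap_start, gap_end, bases_per_pixel):
    """Return True if the gap [gap_start, gap_end) spans a full pixel.

    Illustrative sketch under the assumed geometry described above.
    """
    # Index of the first pixel that starts at or after gap_start
    first_full_pixel = -(-gap_start // bases_per_pixel)  # ceiling division
    # Index of the pixel containing gap_end; pixels before it end in time
    end_pixel = gap_end // bases_per_pixel
    return first_full_pixel < end_pixel


print(covers_whole_pixel(100, 200, 100))  # gap exactly fills pixel 1
print(covers_whole_pixel(150, 250, 100))  # straddles two pixels, fills neither
```

In this model, a 100-base gap is drawn when it is aligned to a pixel
boundary but not when it straddles two pixels, which is one way a gap
can be present in the input yet absent from the plot.
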

Choosing sequences
------------------

For example, you can exclude sequences with names like
"chrUn_random522" like this::

  last-dotplot -1 'chr[!U]*' -2 'chr[!U]*' alns alns.png

Option "-1" selects sequences from the 1st (horizontal) genome, and
"-2" selects sequences from the 2nd (vertical) genome.  'chr[!U]*' is
a *pattern* that specifies names starting with "chr", followed by any
character except U, followed by anything.

==========  =============================
Pattern     Meaning
==========  =============================
``*``       zero or more of any character
``?``       any single character
``[abc]``   any character in abc
``[!abc]``  any character not in abc
==========  =============================

If a sequence name has a dot (e.g. "hg19.chr7"), the pattern is
compared to both the whole name and the part after the dot.

You can specify more than one pattern, e.g. this gets sequences with
names starting with "chr" followed by one or two characters::

  last-dotplot -1 'chr?' -1 'chr??' alns alns.png
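
The pattern rules above have the same meaning as the shell-style
wildcards of Python's standard ``fnmatch`` module, so they can be tried
out with a short sketch.  The function ``is_selected`` is hypothetical
and is not taken from last-dotplot; in particular, splitting the name
at its first dot is an assumption, since the text above does not say
which dot is used when a name contains several.

```python
from fnmatch import fnmatchcase


def is_selected(seq_name, patterns):
    """Return True if seq_name matches any of the given patterns.

    Illustrative sketch: a dotted name such as "hg19.chr7" is compared
    both as a whole and as the part after the dot.
    """
    candidates = [seq_name]
    if "." in seq_name:
        # Assumption: "the part after the dot" means after the first dot
        candidates.append(seq_name.split(".", 1)[1])
    return any(fnmatchcase(c, p) for c in candidates for p in patterns)


for name in ["chr7", "chrUn_random522", "hg19.chr7", "chr10", "chr100"]:
    print(name, is_selected(name, ["chr[!U]*"]), is_selected(name, ["chr?", "chr??"]))
```
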

You can also specify a sequence range; for example this gets the first
1000 bases of chr9::

  last-dotplot -1 chr9:0-1000 alns alns.png
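
A value such as "chr9:0-1000" combines a name pattern with a range.
The Python sketch below shows one way such a value could be split up;
it is not last-dotplot's parser.  The function ``parse_seq_spec`` is
hypothetical, and reading the range as zero-based with an exclusive end
is an assumption, chosen because it makes "0-1000" cover exactly the
first 1000 bases.

```python
def parse_seq_spec(spec):
    """Split "name:start-end" into (pattern, start, end).

    Illustrative sketch: start and end are None when no range is given.
    """
    pattern, colon, tail = spec.rpartition(":")
    if colon:
        beg, dash, end = tail.partition("-")
        if dash and beg.isdigit() and end.isdigit():
            return pattern, int(beg), int(end)
    # No trailing ":start-end" part, so the whole value is a pattern
    return spec, None, None


pattern, start, end = parse_seq_spec("chr9:0-1000")
print(pattern, start, end, end - start)
print(parse_seq_spec("chr[!U]*"))
```
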

Text font
---------

You can improve the font quality by increasing its size, e.g. to 20
points::

  last-dotplot -s20 my-alignments my-plot.png

last-dotplot tries to find a nice font on your computer, but may fail
and use an ugly font.  You can specify a font like this::

  last-dotplot -f /usr/share/fonts/liberation/LiberationSans-Regular.ttf alns alns.png

Colors
------

Colors can be specified in `various ways described here
<http://effbot.org/imagingbook/imagecolor.htm>`_.