Utilities
=========

An overview of the utility scripts provided to conduct analysis on fused reads.

whale_watch.py
--------------
.. code-block:: bash

    Parse sequencing_summary.txt files and .paf files to find split reads in an
    Oxford Nanopore Dataset

    General options:
      -h, --help         Show this help and exit
      -d , --distance    Specify the maximum distance between consecutive
                         mappings. This is the difference between 'Target Start'
                         and 'Target End' in the paf file. Defaults to 10000
      -t , --top         Specify how many top processed reads to display. Default
                         is 10
      -D, --debug        Write debug file

    Input sources:
      -s , --summary     A sequencing summary file generated by albacore
      -p , --paf         A paf file generated by minimap2

    Output files:
      -F , --out-fused   Specify name of the fused_read file. This file only
                         contains chains of reads. Defaults to 'fused_reads.txt'

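How `--distance` might be applied can be sketched as follows. This is a minimal illustration, not the logic of whale_watch.py itself: the `Mapping` container, the `within_distance` helper, and the requirement that both reads share a target and strand are assumptions. Only the 10000 default and the use of the paf target start and end columns come from the help text above.

.. code-block:: python

    from typing import NamedTuple


    class Mapping(NamedTuple):
        """A few paf columns for one read (hypothetical container)."""
        read_id: str
        target_name: str
        strand: str
        target_start: int
        target_end: int


    def within_distance(prev: Mapping, curr: Mapping, distance: int = 10000) -> bool:
        """Return True if two consecutively sequenced reads map close together.

        Assumption: the reads must share a target and strand, and the gap is
        measured between the end of one mapping and the start of the next.
        """
        if prev.target_name != curr.target_name or prev.strand != curr.strand:
            return False
        if prev.strand == "+":
            gap = curr.target_start - prev.target_end
        else:
            # Assumed: on the reverse strand the later read maps upstream
            gap = prev.target_start - curr.target_end
        # abs() tolerates small overlaps between adjacent mappings
        return abs(gap) <= distance

Under these assumptions, two reads from the same channel would be treated as one chain only when `within_distance` holds for each consecutive pair.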

Output format
^^^^^^^^^^^^^
.. csv-table::
    :header: "Field", "Description", "Example"

    "coords", "bulkvis position coordinates", "231:30782-32296"
    "run_id", "The run that these reads came from", "8093748fc82dc4c5cc441125d76432dd658c27c8"
    "channel", "Channel that sequenced these reads", "231"
    "start_time", "Time, in seconds, at which the first incorrectly split read started sequencing", "30782.8425"
    "duration", "Time, in seconds, it took for the incorrectly split read to pass through the channel", "1512.46425"
    "combined_length", "Number of bases in the combined reads", "611531"
    "target_name", "The mapping target, determined by minimap2", "chr7"
    "strand", "'+' if query and target on the same strand; '-' if opposite", "\+"
    "start_match", "Start coordinate on the original strand", "46731340"
    "end_match", "End coordinate on the original strand", "46791591"
    "cat_read_id", "Read ids of all the reads in this group", "82eed45a-7774-4778-8f8a-eb17d7010116|6e9c7720-b7a3-47cc-8f42-30e2219add4b"
    "count", "Number of reads in this group", "2"

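The composite fields in the table can be unpacked with plain string handling. The sketch below uses only the example values shown above, and the helper names are hypothetical. The `coords` layout is assumed to be `channel:start-end`, which matches the example row (channel 231, and start_time plus duration is roughly 32296). Reading the file itself is left out because the column delimiter is not specified here.

.. code-block:: python

    def parse_coords(coords: str) -> tuple[int, int, int]:
        """Split 'channel:start-end' into three integers."""
        channel, span = coords.split(":")
        start, end = span.split("-")
        return int(channel), int(start), int(end)


    def split_read_ids(cat_read_id: str) -> list[str]:
        """Split the '|'-separated read ids of one group."""
        return cat_read_id.split("|")


    # Example values taken from the table above
    channel, start, end = parse_coords("231:30782-32296")
    read_ids = split_read_ids(
        "82eed45a-7774-4778-8f8a-eb17d7010116|6e9c7720-b7a3-47cc-8f42-30e2219add4b"
    )

The number of ids returned by `split_read_ids` should agree with the `count` field of the same row.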

whale_merge.py
--------------
.. code-block:: bash

    Parse sequencing_summary.txt files and .paf files to find chained reads in an
    Oxford Nanopore Dataset and output fused fastq files

    General options:
      -h, --help         Show this help and exit
      -d , --distance    Specify the maximum distance between consecutive
                         mappings. This is the difference between 'Target Start'
                         and 'Target End' in the paf file. Defaults to 10000

    Input sources:
      -s , --summary     A sequencing summary file generated by albacore
      -p , --paf         A paf file generated by minimap2
      -f , --readfiles   Full path to the folder containing fastq files you wish
                         to join

    Output files:
      -o , --out-fused   Specify name of the fused_read fastq file. This file will
                         contain fused reads and the remaining singleton reads.
                         Defaults to 'fused_reads.fastq'
      -W                 Outputs just the fused reads


set_config.py
-------------
.. code-block:: bash

    Generate a configuration file required for bulkvis to run

    General options:
      -h, --help          Show this help and exit

    Input sources:
      -b , --bulkfile     A bulk-fast5 file to get labels from
      -i , --input-dir    The path to the folder containing bulk-files for
                          visualisation
      -e , --export-dir   The path to the folder where read-files will be written
                          by bulkvis

    Output:
      -c , --config       Path to the config.ini file in your bulkvis installation


Figure scripts
--------------
whale_plot.py
^^^^^^^^^^^^^
.. code-block:: bash

    Parse sequencing_summary.txt, .paf, and bulk fast5 files to generate CSV files
    containing the distributions of MinKNOW events around read starts and ends.
    These are divided into unique reads, split reads and internal reads. The R
    script, whale.R, is called to generate the plot; this requires the packages:
    ggplot2, tidyr, and dplyr. Note: of the MinKNOW classifications only above,
    adapter, pore, transition, unblocking, and unclassified are included.

    General options:
      -h, --help            Show this help and exit
      -d DISTANCE, --distance DISTANCE
                            Specify the maximum distance, in bases, between
                            consecutive mappings. This is the difference between
                            'Target Start' and 'Target End' in a paf file
                            (default: 10000)
      -V, --verbose         Print verbose output to terminal (default: False)

    Input sources:
      -b BULK_FILE, --bulk-file BULK_FILE
                            An ONT bulk fast5 file containing raw signal (default:
                            None)
      -s SUMMARY, --summary SUMMARY
                            A sequencing summary file generated by albacore
                            (default: None)
      -p PAF, --paf PAF     A paf file generated by minimap2 (default: None)
      -t TIME, --time TIME  +/- time around a strand event in seconds (default:
                            10)

    Output files:
      --no-generate-plot    If set, do not generate density plot (default: False)
      -A A                  CSV of MinKNOW events occurring before and after
                            correctly called read starts (default:
                            unique_read_start.csv)
      -B B                  CSV of MinKNOW events occurring before and after
                            correctly called read ends (default:
                            unique_read_end.csv)
      -C C                  CSV of MinKNOW events occurring before and after the
                            start of the first incorrectly split read in a group
                            (default: split_read_start.csv)
      -D D                  CSV of MinKNOW events occurring before and after
                            incorrectly called read starts, within a group of
                            incorrectly split reads (default:
                            internal_read_start.csv)
      -E E                  CSV of MinKNOW events occurring before and after
                            incorrectly called read ends, within a group of
                            incorrectly split reads (default:
                            internal_read_end.csv)
      -F F                  CSV of MinKNOW events occurring before and after the
                            end of the last incorrectly split read in a group
                            (default: split_read_end.csv)
      --out OUT             Specify the output filename for the plot. File
                            extension must be one of [.eps, .ps, .tex, .pdf,
                            .jpeg, .tiff, .png, .bmp, .svg, .wmf] (default:
                            classification_count.pdf)

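The windowing controlled by `--time` can be pictured as below. This is a sketch under stated assumptions, not the script's implementation: events are assumed to be `(time_in_seconds, label)` pairs, and `count_labels_around` is a hypothetical helper. The six labels are those listed in the help text.

.. code-block:: python

    from collections import Counter

    # Classifications the help text says are included
    INCLUDED = {"above", "adapter", "pore", "transition", "unblocking", "unclassified"}


    def count_labels_around(events, centre, window=10.0):
        """Count included labels before and after `centre`, within +/- `window` s."""
        before, after = Counter(), Counter()
        for time, label in events:
            # Skip labels that are not plotted and events outside the window
            if label not in INCLUDED or abs(time - centre) > window:
                continue
            # An event exactly at `centre` is counted as "after"
            (before if time < centre else after)[label] += 1
        return before, after

Here `centre` would be a read start or read end time, and the two counters correspond to the events before and after it.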

Example plot:
"""""""""""""
.. figure:: _static/images/utilities/01_plot.png
    :class: figure
    :alt: Example whale_plot.py output, showing six columns: unique read start, unique read end, split read start, internal read start, internal read end, split read end. Each column shows the count of different classifications (above, adapter, pore, transition, unblocking, unclassified) around read starts and ends.

    Example plot from whale_plot.py

whale.R
^^^^^^^

This R script is called by whale_plot.py to produce the above plot. It requires `Rscript` and can also be run independently. To run:

.. code-block:: bash

    $ Rscript whale.R col_A.csv col_B.csv col_C.csv col_D.csv col_E.csv col_F.csv <<output filename>> <<run id>>

The order in which the arguments are given is essential; otherwise the labels will not match the data.
The output filename must include a file extension from `[.eps, .ps, .tex, .pdf, .jpeg, .tiff, .png, .bmp, .svg, .wmf]`.
The run id is optional.
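
When calling whale.R from Python, the argument rules above can be checked before the command is run. The helper below is a hypothetical sketch, not part of bulkvis; it only assembles the argument list, assuming the six CSV files are supplied in A to F order.

.. code-block:: python

    from pathlib import Path

    # Extensions listed above as accepted for the output filename
    ALLOWED = {".eps", ".ps", ".tex", ".pdf", ".jpeg", ".tiff", ".png", ".bmp", ".svg", ".wmf"}


    def whale_r_command(csvs, out, run_id=None):
        """Build the argument list for whale.R (hypothetical helper).

        `csvs` must hold the six CSV paths in A-F order.
        """
        if len(csvs) != 6:
            raise ValueError("expected six CSV files, in A-F order")
        if Path(out).suffix.lower() not in ALLOWED:
            raise ValueError(f"unsupported output extension: {out}")
        cmd = ["Rscript", "whale.R", *csvs, out]
        # The run id is optional, so append it only when given
        if run_id is not None:
            cmd.append(run_id)
        return cmd

The returned list can be passed to `subprocess.run(cmd, check=True)`, provided `Rscript` is on the path and whale.R is in the working directory.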

pod_plot.py
^^^^^^^^^^^
.. code-block:: bash

    Generate plots for all reads in a fused_reads.txt file. This uses bokeh to
    render a plot and requires selenium, phantomjs, and Pillow to be installed.
    These are available via conda/pip.

    General options:
      -h, --help         Show this help and exit

    Input sources:
      -f , --fused       A fused read file generated by whale_watch.py
      -b , --bulk-file   An ONT bulk-fast5-file

    Output files:
      -D , --out-dir     Specify the output directory where plots will be saved.
                         Defaults to current working directory

bulk_info.py
-------------
.. code-block:: bash

    Given a directory containing bulk fast5 files output a csv containing the run
    information for them

    General options:
      -h, --help   Show this help and exit

    Input sources:
      -d , --dir   A directory containing bulk-fast5-files

    Output sources:
      -o , --out   Output csv filename

Other scripts
-------------

channelmaps.py
^^^^^^^^^^^^^^
`channelmaps.py` is a utility script designed to be called by other scripts. It contains the physical layout of
ONT MinION flow cells and supports lookup by channel number, reverse lookup by coordinates, and retrieval of the
channels surrounding a given channel.
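
The three kinds of lookup can be illustrated with a toy grid. Everything below is hypothetical: the 3 x 4 layout is NOT the real MinION channel map, and the function names are not those of `channelmaps.py`. "Surrounding" is assumed here to mean the up to eight adjacent cells.

.. code-block:: python

    # Toy 3 x 4 layout; NOT the real MinION channel map
    LAYOUT = [
        [1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
    ]

    # channel -> (row, col), built once for fast lookup
    COORDS = {ch: (r, c) for r, row in enumerate(LAYOUT) for c, ch in enumerate(row)}


    def coords_of(channel):
        """Lookup by channel number."""
        return COORDS[channel]


    def channel_at(row, col):
        """Reverse lookup by coordinates."""
        return LAYOUT[row][col]


    def surrounding(channel):
        """Channels in the (up to) eight cells adjacent to `channel`."""
        row, col = COORDS[channel]
        found = []
        for r in range(row - 1, row + 2):
            for c in range(col - 1, col + 2):
                inside = 0 <= r < len(LAYOUT) and 0 <= c < len(LAYOUT[0])
                if inside and (r, c) != (row, col):
                    found.append(LAYOUT[r][c])
        return found

Channels on an edge or corner of the grid simply return fewer neighbours.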

stitch.py
^^^^^^^^^
`stitch.py` is a utility script that is called from bulkvis. It produces the read fast5 file from the squiggle data.