Utilities

An overview of the utility scripts provided to conduct analysis on fused reads.

whale_watch.py

Parse sequencing_summary.txt files and .paf files to find split reads in an
Oxford Nanopore Dataset

General options:
  -h, --help         Show this help and exit
  -d , --distance    Specify the maximum distance between consecutive
                     mappings. This is the difference between 'Target Start'
                     and 'Target End' in the paf file. Defaults to 10000
  -t , --top         Specify how many top processed reads to display. Default
                     is 10
  -D, --debug        Write debug file

Input sources:
  -s , --summary     A sequencing summary file generated by albacore
  -p , --paf         A paf file generated by minimap2

Output files:
  -F , --out-fused   Specify name of the fused_read file. This file only
                     contains chains of reads. Defaults to 'fused_reads.txt'
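
The --distance check can be pictured with a short sketch. This is an illustration under stated assumptions, not the code that whale_watch.py runs. The Mapping container and the parse_paf_line and is_chained helpers are hypothetical names. The sketch assumes that two consecutive reads belong to the same chain when they map to the same target and strand, and the gap between one mapping's 'Target End' and the next mapping's 'Target Start' is no larger than the distance.

```python
from typing import NamedTuple


class Mapping(NamedTuple):
    """Subset of PAF columns (hypothetical container, not part of bulkvis)."""
    target_name: str
    strand: str
    target_start: int
    target_end: int


def parse_paf_line(line: str) -> Mapping:
    # PAF is tab separated. Counting from 1: column 5 is the strand,
    # column 6 the target name, columns 8 and 9 the target start and end.
    fields = line.rstrip("\n").split("\t")
    return Mapping(fields[5], fields[4], int(fields[7]), int(fields[8]))


def is_chained(prev: Mapping, nxt: Mapping, max_distance: int = 10000) -> bool:
    """Assumed rule: same target, same strand, and a small gap between mappings."""
    if prev.target_name != nxt.target_name or prev.strand != nxt.strand:
        return False
    if prev.strand == "+":
        gap = nxt.target_start - prev.target_end
    else:
        # On the reverse strand, later reads map to lower target coordinates.
        gap = prev.target_start - nxt.target_end
    return abs(gap) <= max_distance
```

The default of 10000 mirrors the documented default for --distance; the treatment of the reverse strand is a guess made for the sketch.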

Output format

Field            Description                                         Example
coords           bulkvis position coordinates                        231:30782-32296
run_id           The run that these reads came from                  8093748fc82dc4c5cc441125d76432dd658c27c8
channel          Channel that sequenced these reads                  231
start_time       Time, in seconds, at which the first incorrectly    30782.8425
                 split read started sequencing
duration         Time, in seconds, that the incorrectly split read   1512.46425
                 took to pass through the channel
combined_length  Number of bases in the combined reads               611531
target_name      The mapping target, determined by minimap2          chr7
strand           '+' if query and target are on the same strand;     +
                 '-' if on opposite strands
start_match      Start coordinate on the original strand             46731340
end_match        End coordinate on the original strand               46791591
cat_read_id      Read ids of all the reads in this group             82eed45a-7774-4778-8f8a-eb17d7010116|6e9c7720-b7a3-47cc-8f42-30e2219add4b
count            Number of reads in this group                       2
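
The coords and cat_read_id fields each pack several values into one string. A minimal sketch for unpacking them, assuming the channel:start-end layout and the '|' separator seen in the example values above; parse_coords and split_read_ids are hypothetical helpers, not part of bulkvis.

```python
from typing import List, Tuple


def parse_coords(coords: str) -> Tuple[int, int, int]:
    """Split 'channel:start-end' into three integers."""
    channel, _, window = coords.partition(":")
    start, _, end = window.partition("-")
    return int(channel), int(start), int(end)


def split_read_ids(cat_read_id: str) -> List[str]:
    """Split the '|' joined read ids of one group."""
    return cat_read_id.split("|")


print(parse_coords("231:30782-32296"))  # (231, 30782, 32296)
```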

whale_merge.py

Parse sequencing_summary.txt files and .paf files to find chained reads in an
Oxford Nanopore Dataset and output fused fastq files

General options:
  -h, --help         Show this help and exit
  -d , --distance    Specify the maximum distance between consecutive
                     mappings. This is the difference between 'Target Start'
                     and 'Target End' in the paf file. Defaults to 10000

Input sources:
  -s , --summary     A sequencing summary file generated by albacore
  -p , --paf         A paf file generated by minimap2
  -f , --readfiles   Full path to the folder containing fastq files you wish
                     to join

Output files:
  -o , --out-fused   Specify name of the fused_read fastq file. This file will
                     contain fused reads and the remaining singleton reads.
                     Defaults to 'fused_reads.fastq'
  -W                 Outputs just the fused reads
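
One way to picture what fusing means for fastq records is to concatenate the sequences and quality strings of the reads in a chain, in order. The sketch below is an assumption about the general approach, not a transcription of whale_merge.py. FastqRecord, fuse_records, and to_fastq are hypothetical names, and joining the read ids with '|' simply mirrors the cat_read_id field written by whale_watch.py.

```python
from typing import List, NamedTuple


class FastqRecord(NamedTuple):
    """One fastq entry (hypothetical container for this sketch)."""
    read_id: str
    sequence: str
    quality: str


def fuse_records(records: List[FastqRecord]) -> FastqRecord:
    """Join the reads of one chain, in order, into a single record."""
    if not records:
        raise ValueError("cannot fuse an empty chain")
    return FastqRecord(
        read_id="|".join(r.read_id for r in records),
        sequence="".join(r.sequence for r in records),
        quality="".join(r.quality for r in records),
    )


def to_fastq(record: FastqRecord) -> str:
    """Render a record as four fastq lines."""
    return f"@{record.read_id}\n{record.sequence}\n+\n{record.quality}\n"
```

Keeping sequence and quality the same length after fusing matters, because fastq parsers generally reject records where the two differ.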

set_config.py

Generate a configuration file required for bulkvis to run

General options:
  -h, --help          Show this help and exit

Input sources:
  -b , --bulkfile     A bulk-fast5 file to get labels from
  -i , --input-dir    The path to the folder containing bulk-files for
                      visualisation
  -e , --export-dir   The path to the folder where read-files will be written
                      by bulkvis

Output:
  -c , --config       Path to the config.ini file in your bulkvis installation

Figure scripts

whale_plot.py

Parse sequencing_summary.txt, .paf, and bulk fast5 files to generate CSV files
containing the distributions of MinKNOW events around read starts and ends.
These are divided into unique reads, split reads, and internal reads. The R
script, whale.R, is called to generate the plot; this requires the R packages
ggplot2, tidyr, and dplyr. Note: of the MinKNOW classifications, only 'above',
'adapter', 'pore', 'transition', 'unblocking', and 'unclassified' are included.

General options:
  -h, --help            Show this help and exit
  -d DISTANCE, --distance DISTANCE
                        Specify the maximum distance, in bases, between
                        consecutive mappings. This is the difference between
                        'Target Start' and 'Target End' in a paf file
                        (default: 10000)
  -V, --verbose         Print verbose output to terminal (default: False)

Input sources:
  -b BULK_FILE, --bulk-file BULK_FILE
                        An ONT bulk fast5 file containing raw signal (default:
                        None)
  -s SUMMARY, --summary SUMMARY
                        A sequencing summary file generated by albacore
                        (default: None)
  -p PAF, --paf PAF     A paf file generated by minimap2 (default: None)
  -t TIME, --time TIME  +/- time around a strand event in seconds (default:
                        10)

Output files:
  --no-generate-plot    If set, do not generate density plot (default: False)
  -A A                  CSV of MinKNOW events occurring before and after
                        correctly called read starts (default:
                        unique_read_start.csv)
  -B B                  CSV of MinKNOW events occurring before and after
                        correctly called read ends (default:
                        unique_read_end.csv)
  -C C                  CSV of MinKNOW events occurring before and after the
                        start of the first incorrectly split read in a group
                        (default: split_read_start.csv)
  -D D                  CSV of MinKNOW events occurring before and after
                        incorrectly called read starts, within a group of
                        incorrectly split reads (default:
                        internal_read_start.csv)
  -E E                  CSV of MinKNOW events occurring before and after
                        incorrectly called read ends, within a group of
                        incorrectly split reads (default:
                        internal_read_end.csv)
  -F F                  CSV of MinKNOW events occurring before and after the
                        end of the last incorrectly split read in a group
                        (default: split_read_end.csv)
  --out OUT             Specify the output filename for the plot. File
                        extension must be one of [.eps, .ps, .tex, .pdf,
                        .jpeg, .tiff, .png, .bmp, .svg, .wmf] (default:
                        classification_count.pdf)

Example plot:

Example whale_plot.py output, showing six columns: unique read start, unique read end, split read start, internal read start, internal read end, split read end. Each column shows the count of different classifications (above, adapter, pore, transition, unblocking, unclassified) around read starts and ends.

Example plot from whale_plot.py

whale.R

This R script is called by whale_plot.py to produce the above plot. It requires Rscript and can also be run independently. To run:

$ Rscript whale.R col_A.csv col_B.csv col_C.csv col_D.csv col_E.csv col_F.csv <<output filename>> <<run id>>

The order in which the arguments are given is essential; otherwise the labels will not match the data. The output filename must include a file extension from [.eps, .ps, .tex, .pdf, .jpeg, .tiff, .png, .bmp, .svg, .wmf]. The run id is optional.
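
Because the argument order matters, it can help to build the command programmatically. The following is a minimal sketch, not part of bulkvis: build_whale_command is a hypothetical helper that assembles the Rscript call in the documented order and checks the output extension against the list above. It assumes whale.R is reachable at the path given by the script parameter.

```python
from pathlib import Path
from typing import List, Optional, Sequence

# Extensions listed in the documentation for the plot output file.
PLOT_EXTENSIONS = {
    ".eps", ".ps", ".tex", ".pdf", ".jpeg",
    ".tiff", ".png", ".bmp", ".svg", ".wmf",
}


def build_whale_command(
    csv_files: Sequence[str],
    output: str,
    run_id: Optional[str] = None,
    script: str = "whale.R",
) -> List[str]:
    """Assemble the Rscript call; csv_files must be in column order A to F."""
    if len(csv_files) != 6:
        raise ValueError("expected six CSV files, in column order A to F")
    if Path(output).suffix.lower() not in PLOT_EXTENSIONS:
        raise ValueError(f"unsupported plot extension in {output!r}")
    command = ["Rscript", script, *csv_files, output]
    if run_id is not None:
        # The run id is optional and goes last.
        command.append(run_id)
    return command


# CSV names below are the whale_plot.py defaults for -A through -F.
command = build_whale_command(
    [
        "unique_read_start.csv", "unique_read_end.csv",
        "split_read_start.csv", "internal_read_start.csv",
        "internal_read_end.csv", "split_read_end.csv",
    ],
    "classification_count.pdf",
)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True) on a machine where Rscript and the required R packages are installed.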

pod_plot.py

Generate plots for all reads in a fused_reads.txt file. This uses bokeh to
render a plot and requires selenium, phantomjs, and Pillow to be installed.
These are available via conda/pip.

General options:
  -h, --help         Show this help and exit

Input sources:
  -f , --fused       A fused read file generated by whale_watch.py
  -b , --bulk-file   An ONT bulk-fast5-file

Output files:
  -D , --out-dir     Specify the output directory where plots will be saved.
                     Defaults to current working directory

gen_bmf.py

Parse sequencing_summary.txt files and .paf files to format mapping info for
bulkvis

General options:
  -h, --help       Show this help and exit

Input sources:
  -s , --summary   A sequencing summary file generated by albacore
  -p , --paf       A paf file generated by minimap2

Output:
  --bmf            Specify the output folder, where files will be written as
                   <run_id>.bmf. This should be the 'map' path specified in
                   the config.ini

bulk_info.py

Given a directory containing bulk fast5 files, output a csv containing the run
information for them

General options:
  -h, --help   Show this help and exit

Input sources:
  -d , --dir   A directory containing bulk-fast5-files

Output sources:
  -o , --out   Output csv filename

Other scripts

channelmaps.py

channelmaps.py is a utility script designed to be called by other scripts. It contains the physical layout of ONT MinION flowcells and supports lookup by channel number, reverse lookup by coordinates, and listing the channels surrounding a given channel.
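
The three kinds of lookup can be illustrated with a toy grid. The sketch below does not reproduce channelmaps.py or its API: TOY_LAYOUT is a made-up 3 x 4 grid, NOT the real MinION channel layout, and coords_of, channel_at, and surrounding are hypothetical function names chosen for this example.

```python
from typing import Dict, List, Tuple

# Made-up layout for illustration only; the real flowcell map is different.
TOY_LAYOUT: List[List[int]] = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

# Reverse index built once: channel number -> (row, column).
COORDS_BY_CHANNEL: Dict[int, Tuple[int, int]] = {
    channel: (r, c)
    for r, row in enumerate(TOY_LAYOUT)
    for c, channel in enumerate(row)
}


def coords_of(channel: int) -> Tuple[int, int]:
    """Lookup by channel number."""
    return COORDS_BY_CHANNEL[channel]


def channel_at(row: int, col: int) -> int:
    """Reverse lookup by coordinates."""
    return TOY_LAYOUT[row][col]


def surrounding(channel: int) -> List[int]:
    """Channels in the (up to) eight cells around the given channel."""
    row, col = coords_of(channel)
    neighbours = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            r, c = row + dr, col + dc
            inside = 0 <= r < len(TOY_LAYOUT) and 0 <= c < len(TOY_LAYOUT[0])
            if (dr, dc) != (0, 0) and inside:
                neighbours.append(TOY_LAYOUT[r][c])
    return sorted(neighbours)
```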

stitch.py

stitch.py is a utility script that is called from bulkvis; it produces the read fast5 file from the squiggle data.