Utilities¶
An overview of the utility scripts provided to conduct analysis on fused reads.
whale_watch.py¶
Parse sequencing_summary.txt files and .paf files to find split reads in an
Oxford Nanopore Dataset
General options:
-h, --help Show this help and exit
-d , --distance Specify the maximum distance between consecutive
mappings. This is the difference between 'Target Start'
and 'Target End' in the paf file. Defaults to 10000
-t , --top Specify how many top processed reads to display. Default
is 10
-D, --debug Write debug file
Input sources:
-s , --summary A sequencing summary file generated by albacore
-p , --paf A paf file generated by minimap2
Output files:
-F , --out-fused Specify name of the fused_read file. This file only
contains chains of reads. Defaults to 'fused_reads.txt'
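The --distance criterion can be sketched as follows. This is a minimal illustration, not the script's actual implementation. It assumes the mappings for one channel are already sorted by read start time and share the same target and strand. The (read_id, target_start, target_end) tuple layout and the function name chain_mappings are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical layout: (read_id, target_start, target_end)
Mapping = Tuple[str, int, int]


def chain_mappings(mappings: List[Mapping], distance: int = 10000) -> List[List[str]]:
    """Group consecutive mappings whose gap is at most `distance` bases."""
    chains: List[List[str]] = []
    current: List[str] = []
    prev_end: Optional[int] = None
    for read_id, target_start, target_end in mappings:
        # Gap between the previous 'Target End' and this 'Target Start'
        if prev_end is not None and abs(target_start - prev_end) <= distance:
            current.append(read_id)
        else:
            # Only groups of two or more reads count as a chain
            if len(current) > 1:
                chains.append(current)
            current = [read_id]
        prev_end = target_end
    if len(current) > 1:
        chains.append(current)
    return chains


example = [
    ("read_a", 1000, 5000),
    ("read_b", 5200, 9000),      # 200 bases after read_a: same chain
    ("read_c", 500000, 510000),  # far away: not chained
]
print(chain_mappings(example))
```

With the default distance of 10000, read_a and read_b form a chain and read_c is left as a singleton.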
Output format¶
Field | Description | Example |
---|---|---|
coords | bulkvis position coordinates | 231:30782-32296 |
run_id | The run that these reads came from | 8093748fc82dc4c5cc441125d76432dd658c27c8 |
channel | Channel that sequenced these reads | 231 |
start_time | Time, in seconds, that the (first) incorrectly split read started sequencing | 30782.8425 |
duration | Time, in seconds, it took for the incorrectly split read to pass through the channel | 1512.46425 |
combined_length | Number of bases in the combined reads | 611531 |
target_name | The mapping target, determined by minimap2 | chr7 |
strand | '+' if query and target on the same strand; '-' if opposite | + |
start_match | Start coordinate on the original strand | 46731340 |
end_match | End coordinate on the original strand | 46791591 |
cat_read_id | Read ids of all the reads in this group, joined by '\|' | 82eed45a-7774-4778-8f8a-eb17d7010116\|6e9c7720-b7a3-47cc-8f42-30e2219add4b |
count | Number of reads in this group | 2 |
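The coords and cat_read_id fields are compound values. A minimal sketch of how they could be unpacked, using the example values from the table, is shown here. The helper names parse_coords and split_read_ids are hypothetical and are not part of bulkvis.

```python
from typing import List, NamedTuple


class Coords(NamedTuple):
    channel: int
    start: int
    end: int


def parse_coords(coords: str) -> Coords:
    """Split a 'channel:start-end' string into integers."""
    channel, span = coords.split(":")
    start, end = span.split("-")
    return Coords(int(channel), int(start), int(end))


def split_read_ids(cat_read_id: str) -> List[str]:
    """Split the '|'-joined read ids of a group into a list."""
    return cat_read_id.split("|")


coords = parse_coords("231:30782-32296")
read_ids = split_read_ids(
    "82eed45a-7774-4778-8f8a-eb17d7010116|6e9c7720-b7a3-47cc-8f42-30e2219add4b"
)
print(coords.channel, coords.start, coords.end)
print(len(read_ids))
```

For this example the number of read ids matches the count field (2).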
whale_merge.py¶
Parse sequencing_summary.txt files and .paf files to find chained reads in an
Oxford Nanopore Dataset and output fused fastq files
General options:
-h, --help Show this help and exit
-d , --distance Specify the maximum distance between consecutive
mappings. This is the difference between 'Target Start'
and 'Target End' in the paf file. Defaults to 10000
Input sources:
-s , --summary A sequencing summary file generated by albacore
-p , --paf A paf file generated by minimap2
-f , --readfiles Full path to the folder containing fastq files you wish
to join
Output files:
-o , --out-fused Specify name of the fused_read fastq file. This file will
contain fused reads and the remaining singleton reads.
Defaults to 'fused_reads.fastq'
-W Outputs just the fused reads
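Conceptually, fusing a chain means joining the sequences and quality strings of its reads into a single FASTQ entry. The sketch below illustrates that idea only. It assumes simple in-order concatenation and a header made of the read ids joined by '|'; both are assumptions and may differ from what whale_merge.py actually writes.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory record: (sequence, quality string)
FastqRecord = Tuple[str, str]


def fuse_reads(records: Dict[str, FastqRecord], chain: List[str]) -> str:
    """Return one FASTQ entry built by concatenating the reads in `chain`."""
    sequence = "".join(records[read_id][0] for read_id in chain)
    quality = "".join(records[read_id][1] for read_id in chain)
    header = "|".join(chain)  # assumed naming scheme, for illustration only
    return f"@{header}\n{sequence}\n+\n{quality}\n"


records = {
    "read_a": ("ACGT", "!!!!"),
    "read_b": ("GGCC", "####"),
}
fused = fuse_reads(records, ["read_a", "read_b"])
print(fused)
```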
set_config.py¶
Generate a configuration file required for bulkvis to run
General options:
-h, --help Show this help and exit
Input sources:
-b , --bulkfile A bulk-fast5 file to get labels from
-i , --input-dir The path to the folder containing bulk-files for
visualisation
-e , --export-dir The path to the folder where read-files will be written
by bulkvis
Output:
-c , --config Path to the config.ini file in your bulkvis installation
Figure scripts¶
whale_plot.py¶
Parse sequencing_summary.txt, .paf, and bulk fast5 files to generate CSV files
containing the distributions of MinKNOW events around read starts and ends.
These are divided into unique reads, split reads and internal reads. The R
script, whale.R, is called to generate the plot; this requires the packages:
ggplot2, tidyr, and dplyr. Note: of the MinKNOW classifications, only 'above',
'adapter', 'pore', 'transition', 'unblocking', and 'unclassified' are included.
General options:
-h, --help Show this help and exit
-d DISTANCE, --distance DISTANCE
Specify the maximum distance, in bases, between
consecutive mappings. This is the difference between
'Target Start' and 'Target End' in a paf file
(default: 10000)
-V, --verbose Print verbose output to terminal (default: False)
Input sources:
-b BULK_FILE, --bulk-file BULK_FILE
An ONT bulk fast5 file containing raw signal (default:
None)
-s SUMMARY, --summary SUMMARY
A sequencing summary file generated by albacore
(default: None)
-p PAF, --paf PAF A paf file generated by minimap2 (default: None)
-t TIME, --time TIME +/- time around a strand event in seconds (default:
10)
Output files:
--no-generate-plot If set, do not generate density plot (default: False)
-A A CSV of MinKNOW events occurring before and after
correctly called read starts (default:
unique_read_start.csv)
-B B CSV of MinKNOW events occurring before and after
correctly called read ends (default:
unique_read_end.csv)
-C C CSV of MinKNOW events occurring before and after the
start of the first incorrectly split read in a group
(default: split_read_start.csv)
-D D CSV of MinKNOW events occurring before and after
incorrectly called read starts, within a group of
incorrectly split reads (default:
internal_read_start.csv)
-E E CSV of MinKNOW events occurring before and after
incorrectly called read ends, within a group of
incorrectly split reads (default:
internal_read_end.csv)
-F F CSV of MinKNOW events occurring before and after the
end of the last incorrectly split read in a group
(default: split_read_end.csv)
--out OUT Specify the output filename for the plot. File
extension must be one of [.eps, .ps, .tex, .pdf,
.jpeg, .tiff, .png, .bmp, .svg, .wmf] (default:
classification_count.pdf)
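The idea of counting MinKNOW classifications within +/- TIME seconds of a read start or end can be sketched as below. This is an illustration under stated assumptions, not the script's implementation: events are represented as hypothetical (time_in_seconds, label) tuples, and count_labels_around is an invented helper name. Only the six classifications listed in the description are counted.

```python
from collections import Counter
from typing import Iterable, Tuple

# Classifications named in the description above
INCLUDED = {"above", "adapter", "pore", "transition", "unblocking", "unclassified"}

# Hypothetical event layout: (time in seconds, MinKNOW classification)
Event = Tuple[float, str]


def count_labels_around(
    events: Iterable[Event], centre: float, window: float = 10.0
) -> Tuple[Counter, Counter]:
    """Count included labels in the `window` seconds before and after `centre`."""
    before: Counter = Counter()
    after: Counter = Counter()
    for time, label in events:
        if label not in INCLUDED:
            continue
        if centre - window <= time < centre:
            before[label] += 1
        elif centre <= time <= centre + window:
            after[label] += 1
    return before, after


events = [
    (95.0, "pore"),
    (99.5, "adapter"),
    (100.0, "strand"),  # not in INCLUDED, so ignored
    (103.0, "above"),
    (120.0, "pore"),    # outside the +/- 10 s window
]
before, after = count_labels_around(events, centre=100.0)
print(dict(before), dict(after))
```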
Example plot:¶
whale.R¶
This R script is called by whale_plot.py to produce the above plot. It requires Rscript and can also be run independently. To run:
$ Rscript whale.R col_A.csv col_B.csv col_C.csv col_D.csv col_E.csv col_F.csv <<output filename>> <<run id>>
The order in which the arguments are given is essential; otherwise the labels will not match. The output filename must include a file extension from [.eps, .ps, .tex, .pdf, .jpeg, .tiff, .png, .bmp, .svg, .wmf]. The run id is optional.
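Because argument order and the output extension both matter, it can help to assemble the command programmatically. The following is a minimal sketch that only builds and checks the argument list. The helper name build_whale_command is hypothetical, and the CSV names are the whale_plot.py defaults for -A to -F.

```python
from pathlib import Path
from typing import List, Optional, Sequence

ALLOWED_EXTENSIONS = {
    ".eps", ".ps", ".tex", ".pdf", ".jpeg",
    ".tiff", ".png", ".bmp", ".svg", ".wmf",
}


def build_whale_command(
    csv_files: Sequence[str], output: str, run_id: Optional[str] = None
) -> List[str]:
    """Build the Rscript argument list, keeping the CSV files in A-F order."""
    if len(csv_files) != 6:
        raise ValueError("expected six CSV files, in A-F order")
    if Path(output).suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"unsupported output extension: {output}")
    command = ["Rscript", "whale.R", *csv_files, output]
    if run_id is not None:
        command.append(run_id)  # the run id is optional
    return command


csv_files = [
    "unique_read_start.csv",    # A
    "unique_read_end.csv",      # B
    "split_read_start.csv",     # C
    "internal_read_start.csv",  # D
    "internal_read_end.csv",    # E
    "split_read_end.csv",       # F
]
command = build_whale_command(csv_files, "classification_count.pdf")
print(" ".join(command))
```

The resulting list could then be passed to subprocess.run(command, check=True) on a machine where Rscript and the required R packages are installed.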
pod_plot.py¶
Generate plots for all reads in a fused_reads.txt file. This uses bokeh to
render a plot and requires selenium, phantomjs, and Pillow to be installed.
These are available via conda/pip.
General options:
-h, --help Show this help and exit
Input sources:
-f , --fused A fused read file generated by whale_watch.py
-b , --bulk-file An ONT bulk-fast5-file
Output files:
-D , --out-dir Specify the output directory where plots will be saved.
Defaults to current working directory
gen_bmf.py¶
Parse sequencing_summary.txt files and .paf files to format mapping info for
bulkvis
General options:
-h, --help Show this help and exit
Input sources:
-s , --summary A sequencing summary file generated by albacore
-p , --paf A paf file generated by minimap2
Output:
--bmf Specify the output folder, where files will be written as
<run_id>.bmf. This should be the 'map' path specified in
the config.ini
bulk_info.py¶
Given a directory containing bulk fast5 files, output a CSV containing the run
information for them
General options:
-h, --help Show this help and exit
Input sources:
-d , --dir A directory containing bulk-fast5-files
Output sources:
-o , --out Output csv filename
Other scripts¶
channelmaps.py¶
channelmaps.py is a utility script that is designed to be called by other scripts. It contains the physical layout of ONT MinION flowcells and allows lookup by channel number, reverse lookup by coordinates, and can return a list of surrounding channels.
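The three kinds of lookup described above can be illustrated with a small grid. The layout below is a made-up 3 x 4 grid for illustration only; it is NOT the real MinION channel map, and the function names are hypothetical rather than those exported by channelmaps.py.

```python
from typing import Dict, List, Tuple

# Made-up 3 x 4 layout; NOT the real MinION flowcell layout.
LAYOUT: List[List[int]] = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

# Forward index: channel number -> (row, column)
COORDS: Dict[int, Tuple[int, int]] = {
    channel: (row, col)
    for row, cells in enumerate(LAYOUT)
    for col, channel in enumerate(cells)
}


def channel_to_coords(channel: int) -> Tuple[int, int]:
    """Lookup by channel number."""
    return COORDS[channel]


def coords_to_channel(row: int, col: int) -> int:
    """Reverse lookup by coordinates."""
    return LAYOUT[row][col]


def surrounding_channels(channel: int) -> List[int]:
    """Return the channels adjacent to `channel`, including diagonals."""
    row, col = COORDS[channel]
    neighbours = []
    for r in range(row - 1, row + 2):
        for c in range(col - 1, col + 2):
            inside = 0 <= r < len(LAYOUT) and 0 <= c < len(LAYOUT[0])
            if inside and (r, c) != (row, col):
                neighbours.append(LAYOUT[r][c])
    return neighbours


print(channel_to_coords(6))
print(coords_to_channel(2, 3))
print(surrounding_channels(1))
```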
stitch.py¶
stitch.py is a utility script that is called from bulkvis; it produces the read fast5 file from the squiggle data.