Long v1.3.0 Updates (24 October 2023)
Note: as of v1.3.0, if you still want the old canu-based `plassembler long` method implemented in v1.2.0 for some reason, you can do this with `--canu_flag`.
While I was mostly happy with the `plassembler long` update in v1.2.0, while testing it on a lot of real-life data when benchmarking hybracter, I found some strange instances where it would give far too many contigs as output. Seemingly, it would sometimes assemble the same plasmid multiple times with slight variations (probably due to various factors or overly deep read sets). In any case, I wanted a better automated solution.
Inspired by this tweet by Ryan Wick, I decided to experiment with treating long reads as both short reads (in the sense of creating a de Bruijn graph-based assembly) and long reads (for scaffolding) in Unicycler - and the results were great.
As you can see in the results, I am confident that, assuming you have good quality R10 Nanopore data, `plassembler long` should now recover small plasmids.
`plassembler long` now implements the following steps (after the Flye assembly and extraction of the plasmid long reads, as before):
- Removes extremely low entropy repetitive reads
- Runs `canu -correct` to simultaneously error correct the long reads and subsample to 100x estimated depth (so as not to give Unicycler too much read depth - see this and this and this) - major thanks to Ryan Wick for suggesting this!
- Runs Unicycler as follows:
`unicycler -s {error_corrected_longreads} -l {all_longreads}`
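To make the steps above a little more concrete, here is a minimal Python sketch of two of them. It is not plassembler's actual code: the entropy measure (Shannon entropy of base composition), the `min_entropy` cutoff and the function names are all assumptions for illustration. The only part taken from this page is the `unicycler -s ... -l ...` pattern; `-o` and `-t` are standard Unicycler options added so the command is complete.

```python
import math
from collections import Counter


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (bits per base) of a read's base composition."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    total = len(seq)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def drop_low_entropy(reads: dict[str, str], min_entropy: float = 1.0) -> dict[str, str]:
    """Keep reads at or above min_entropy (hypothetical cutoff, not plassembler's)."""
    return {name: seq for name, seq in reads.items()
            if shannon_entropy(seq) >= min_entropy}


def unicycler_command(corrected_reads: str, all_reads: str,
                      outdir: str, threads: int = 1) -> list[str]:
    """Build the Unicycler call: corrected long reads go in as 'short' reads (-s),
    and the full long-read set is used for scaffolding (-l)."""
    return ["unicycler", "-s", corrected_reads, "-l", all_reads,
            "-o", outdir, "-t", str(threads)]


# Toy example: a homopolymer read is dropped, a mixed read is kept.
reads = {"homopolymer": "A" * 50, "mixed": "ACGT" * 10}
print(sorted(drop_low_entropy(reads)))
print(" ".join(unicycler_command("corrected.fasta", "plasmid_reads.fastq", "unicycler_out", 8)))
```

Base-composition entropy is the simplest possible measure and will not catch every repeat (an `ATATAT...` read still scores 1 bit per base), so treat it as an illustration of the idea rather than a recommended filter.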
Results
I tested `plassembler long` on 60x simulated reads generated using badread with the `nanopore2023` model on the same isolates from Wick et al., C222 (Houtak et al.) and CAV1217 (Mathers et al.) used in the Plassembler manuscript.
Overall, it seems to work almost perfectly and quite efficiently. In particular, the new approach seems to provide a speed-up vs v1.2.0 for more complicated assemblies (e.g. Klebsiella variicola INF345), as the Unicycler step is quite fast.
See here for the old benchmarking data and times run on the same MacBook, although note they were done with different read sets (simulated R9 vs R10), so it's not truly a fair comparison.
The only misassembly is of the linear plasmid in K. variicola, which is a known limitation of the Unicycler approach. If you have linear plasmids, please use a long-read assembly approach, e.g. with Flye.
Everything was run on my MacBook Pro M1 (2020) with 8 threads. I was too lazy to run QUAST, but I would recommend polishing the output with your favourite long-read polisher (e.g. Medaka) anyway, as implemented in hybracter.
By Isolate
(c) = Unicycler marked the plasmid as circular
Isolate | Ground Truth | plassembler long v1.3.0 | Time (s) | Flye (Within plassembler long) |
---|---|---|---|---|
C222 | 2473 | 2473 (c) | 889 | Nothing - missed 2473 |
Acinetobacter baumannii J9 | 145059; 6078 | 145058 (c); 6078 (c) | 1179 | 145059; 6077 |
CAV1217 | 181436; 70606; 44015; 9294 | 181435 (c); 70605 (c); 44015 (c); 9293 (c) | 1582 | 181433; 70609; 44015; 9294 |
Citrobacter koseri MINF 9D | 64962; 9294 | 64961 (c); 9294 (c) | 1328 | 64962; 18088 |
Enterobacter kobei MSB1 1B | 136482; 108411; 4665; 3715; 2370 | 136480 (c); 108410 (c); 4665 (c); 3715 (c); 2368 (c) | 1579 | 136481; 108410 - missed 3 small plasmids |
Klebsiella oxytoca MSB1 2C | 118161; 58472; 9975; 4574 | 118160 (c); 58471 (c); 9975 (c); 4574 (c) | 1273 | 118161; 58472; 9975 - missed 4574 |
Klebsiella variicola INF345 | 250980; 243620; 31780 (linear); 5783; 3514 | 250976 (c); 243612 (c); 30408 (linear incomplete); 5783 (c); 3514 (c) | 1206 | 250979; 243618; 31742 - missed 5783 + 3514 |
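If you want to run the same kind of comparison on your own isolates, a rough length-based check can be sketched as below. This is a hypothetical helper, not part of plassembler: the function name and the 2% length tolerance are assumptions, and a matching length does not prove the sequence is correct. The example uses the Klebsiella oxytoca MSB1 2C row from the table above.

```python
def missed_plasmids(truth: list[int], assembled: list[int],
                    tolerance: float = 0.02) -> list[int]:
    """Return ground-truth plasmid lengths with no assembled contig of similar length.

    A contig matches if its length is within `tolerance` (as a fraction) of the
    ground-truth length; each contig can match at most one plasmid.
    """
    remaining = list(assembled)
    missed = []
    for length in truth:
        match = next((c for c in remaining
                      if abs(c - length) <= tolerance * length), None)
        if match is None:
            missed.append(length)
        else:
            remaining.remove(match)
    return missed


# Klebsiella oxytoca MSB1 2C, lengths taken from the table above.
truth = [118161, 58472, 9975, 4574]
plassembler_long = [118160, 58471, 9975, 4574]
flye = [118161, 58472, 9975]

print(missed_plasmids(truth, plassembler_long))  # []
print(missed_plasmids(truth, flye))              # [4574]
```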
Overall
Total | Missed Small Plasmids | Misassembled |
---|---|---|
plassembler long v1.3.0 | 0 | 1 (K. variicola linear plasmid) |
Flye (Within plassembler) | 5 | 0 |
Summary
While I'd still recommend short reads if you can get them, I am now confident that if your isolate has small plasmids, `plassembler long` should find them.
Usage: plassembler long [OPTIONS]
Plassembler with long reads only
Options:
-h, --help Show this message and exit.
-V, --version Show the version and exit.
-d, --database PATH Directory of PLSDB database. [required]
-l, --longreads PATH FASTQ file of long reads. [required]
-c, --chromosome INTEGER Approximate lower-bound chromosome length of
bacteria (in base pairs). [default: 1000000]
-o, --outdir PATH Directory to write the output to. [default:
plassembler.output/]
-m, --min_length TEXT minimum length for filtering long reads with
chopper. [default: 500]
-q, --min_quality TEXT minimum quality q-score for filtering long
reads with chopper. [default: 9]
-t, --threads TEXT Number of threads. [default: 1]
-f, --force Force overwrites the output directory.
-p, --prefix TEXT Prefix for output files. This is not required.
[default: plassembler]
--skip_qc Skips qc (chopper and fastp).
--pacbio_model TEXT Pacbio model for Flye. Must be one of pacbio-
raw, pacbio-corr or pacbio-hifi. Use pacbio-
raw for PacBio regular CLR reads (<20 percent
error), pacbio-corr for PacBio reads that were
corrected with other methods (<3 percent
error) or pacbio-hifi for PacBio HiFi reads
(<1 percent error).
-r, --raw_flag Use --nano-raw for Flye. Designed for Guppy
fast configuration reads. By default, Flye
will assume SUP or HAC reads and use --nano-
hq.
--keep_chromosome If you want to keep the chromosome assembly.
--canu_flag Runs canu instead of Unicycler (aka replicates
v1.2.0). As of v1.3.0, Unicycler is the
assembler for long reads. Canu is only
recommended if you have low quality reads
(e.g. ONT R9).
--corrected_error_rate FLOAT Corrected error rate parameter for canu
-correct. For advanced users only.
--flye_directory PATH Directory containing Flye long read assembly.
Needs to contain assembly_info.txt and
assembly_info.fasta. Allows Plassembler to
Skip Flye assembly step.
--flye_assembly PATH Path to file containing Flye long read
assembly FASTA. Allows Plassembler to Skip
Flye assembly step in conjunction with
--flye_info.
--flye_info PATH Path to file containing Flye long read
assembly info text file. Allows Plassembler to
Skip Flye assembly step in conjunction with
--flye_assembly.
--no_chromosome Run Plassembler assuming no chromosome can be
assembled. Use this if your reads only contain
plasmids that you would like to assemble.
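As a usage illustration, the options above can be put together as in the following Python sketch. The wrapper function and the file and directory names (`plsdb_db`, `reads.fastq.gz`, `out`) are hypothetical; only the flags themselves (`-d`, `-l`, `-o`, `-t`, `-c`, `-f`, `--canu_flag`) come from the help text. The sketch only builds and prints the command; pass the list to `subprocess.run` if plassembler is installed.

```python
import shlex


def plassembler_long_command(database: str, longreads: str,
                             outdir: str = "plassembler.output/",
                             threads: int = 1, chromosome: int = 1000000,
                             force: bool = False,
                             canu_flag: bool = False) -> list[str]:
    """Build a `plassembler long` call from the options listed in the help text."""
    cmd = ["plassembler", "long",
           "-d", database,          # PLSDB database directory (required)
           "-l", longreads,         # FASTQ file of long reads (required)
           "-o", outdir,
           "-t", str(threads),
           "-c", str(chromosome)]   # approximate lower-bound chromosome length (bp)
    if force:
        cmd.append("-f")            # overwrite the output directory
    if canu_flag:
        cmd.append("--canu_flag")   # canu instead of Unicycler (replicates v1.2.0)
    return cmd


# Default (Unicycler-based) run with 8 threads; paths are placeholders.
cmd = plassembler_long_command("plsdb_db", "reads.fastq.gz", outdir="out", threads=8)
print(shlex.join(cmd))
```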