Long

Note: This method has been superseded as of v1.3.0, but I am retaining the text below out of interest. If for some reason you want the old plassembler long method from v1.3.0 onwards, you can enable it with --canu_flag.

After reading this paper by Johnson et al, it seemed interesting that while most assemblers failed to recover small (<10 kbp) plasmids, Canu always did, albeit with multiplication. However, Canu indicates the smallest repeating sequence unit of a multiplicated contig in its header, so these contigs can be trimmed easily (see e.g. Ryan Wick's script in Trycycler).
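As a rough illustration of that trimming step, here is a minimal Python sketch. It is not Ryan Wick's script, and it rests on assumptions: that the Canu contig header carries `suggestCircular=yes` and a `trim=START-END` field (as Canu 2.x headers appear to), and that the coordinates can be used directly as a 0-based, end-exclusive slice. The function name `trim_canu_contig` and the example header are hypothetical.

```python
import re


def trim_canu_contig(header: str, seq: str) -> str:
    """Trim a contig to the span suggested in its Canu header.

    Assumption: the header contains 'suggestCircular=yes' and 'trim=START-END'
    for circular contigs. Contigs without both fields are returned unchanged.
    """
    if "suggestCircular=yes" not in header:
        return seq
    match = re.search(r"trim=(\d+)-(\d+)", header)
    if match is None:
        return seq
    start, end = int(match.group(1)), int(match.group(2))
    # Assumption: coordinates are treated as a 0-based, end-exclusive slice.
    return seq[start:end]


# Hypothetical header and toy sequence, for illustration only.
header = (
    "tig00000002 len=12 reads=40 class=contig suggestRepeat=no "
    "suggestBubble=no suggestCircular=yes trim=0-8"
)
print(trim_canu_contig(header, "ACGTTGCAACGT"))  # ACGTTGCA
```

Check the header of your own Canu output before relying on these field names, since they may differ between Canu versions.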

Therefore, from v1.2.0, plassembler long will follow all the same steps as plassembler run (i.e. hybrid mode), but instead of Unicycler, it will run Canu to recover plasmids in the unmapped reads.

Of course, another big issue with long reads is size selection: if your read set doesn't have many reads small enough to be part of your <10 kbp plasmids, then nothing will work, and your best bet is still to get some short-read sequencing data.
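A quick way to check whether size selection is likely to be a problem is to count how many reads fall under the plasmid size of interest. The sketch below is not part of plassembler. It assumes a plain four-lines-per-record FASTQ (no wrapped sequence lines), and the function name `count_short_reads` and the 10 kbp threshold are illustrative choices.

```python
import io


def count_short_reads(handle, max_len=10_000):
    """Count reads shorter than max_len in a 4-line-per-record FASTQ stream.

    Returns (number of short reads, total number of reads).
    """
    short = total = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            total += 1
            if len(line.strip()) < max_len:
                short += 1
    return short, total


# Toy in-memory FASTQ with one 4 bp read and one 12,000 bp read.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\n" + "A" * 12000 + "\n+\n" + "I" * 12000 + "\n"
)
short, total = count_short_reads(demo)
print(f"{short} of {total} reads are shorter than 10 kbp")
```

For a real read set, pass an open file handle instead, for example `open("reads.fastq")` (a hypothetical filename), or `gzip.open(path, "rt")` for compressed reads.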

But assuming they are in the read set, plassembler long should hopefully recover your small plasmids.

Results

I tested plassembler long on the 60x simulated reads that I generated for benchmarking plassembler for the manuscript (they can be found here). The isolates are from Wick et al, plus C222 (Houtak et al) and Cav1217 (Mathers et al) (find more information here and here).

Overall, it seems to work pretty well. It is not perfect (it missed a linear plasmid and a small plasmid in K. variicola, and it also tends to assemble some non-circular chimeric contigs), but it seems that anything Canu marks as circular is a real plasmid from the ground truth.

Other Options

You could try using Canu for the entire assembly and then use Ryan Wick's script to trim the multiplicated plasmid contigs.

Ryan also thinks Canu is a pretty good assembler generally (it is perhaps more accurate than Flye). The downside is time, though: I have found that it takes 5-10x longer to run than Flye on my isolates.

Results

Everything was run on my MacBook Pro M1 (2020) with 8 threads. I was too lazy to run QUAST, but I would recommend polishing the output with your favourite long-read polisher (e.g. Medaka) anyway. Below are the results (in terms of contig lengths in bp).

By Isolate

(c) = Canu marked the plasmid as circular

| Isolate | Ground Truth | plassembler long v1.2.0 | Time (s) | Flye (within plassembler long) |
|---|---|---|---|---|
| C222 | 2473 | 2473 | 1917 | Nothing - missed 2473 |
| Acinetobacter baumannii J9 | 145059; 6078 | 145058 (c); 6078 (c) | 967 | 145059; 10771 |
| CAV1217 | 181436; 70606; 44015; 9294 | 181429 (c); 70603 (c); 44015 (c); 9294 (c) | 1230 | 181433; 70605; 44015; 9294 |
| Citrobacter koseri MINF 9D | 64962; 9294 | 93661 (chimera includes 64962 plasmid); 9294 (c) | 1196 | 64961; 9294 |
| Enterobacter kobei MSB1 1B | 136482; 108411; 4665; 3715; 2370 | 136477 (c); 108402 (c); 4665 (c); 3715 (c); 2367 (c) | 1374 | 136477; 108408 - missed 3 small plasmids |
| Klebsiella oxytoca MSB1 2C | 118161; 58472; 9975; 4574 | 118159 (c); 58467 (c); 9973 (c); 4573 (c); extra 18290 (chimera) | 1646 | 118161; 58472; 9975 - missed 4574 |
| Klebsiella variicola INF345 | 250980; 243620; 31780 (linear); 5783; 3514 | 250968 (c); 243616 (c); 3514 (c) - missing linear plasmid + 5783 | 3544 | 249712; 243617; 31645 - missed 5783 + 3514 |
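
The lengths in the table above can also be compared programmatically. The sketch below is only an illustration and is not how the table was produced: it greedily pairs each ground-truth length with the closest unused assembled length, within a relative tolerance that I have arbitrarily set to 0.1%. The function name `match_by_length` is hypothetical, and the example uses the K. oxytoca row. A length match does not prove that the sequences are identical.

```python
def match_by_length(truth, assembled, tolerance=0.001):
    """Pair ground-truth plasmid lengths with assembled contig lengths.

    Each ground-truth length is matched to the closest unused assembled
    length if they differ by at most `tolerance` (relative to the truth).
    Returns (recovered pairs, missed truth lengths, unmatched contig lengths).
    """
    unused = list(assembled)
    recovered, missed = [], []
    for t in truth:
        best = min(unused, key=lambda a: abs(a - t), default=None)
        if best is not None and abs(best - t) <= tolerance * t:
            recovered.append((t, best))
            unused.remove(best)
        else:
            missed.append(t)
    return recovered, missed, unused


# K. oxytoca MSB1 2C lengths taken from the table above.
truth = [118161, 58472, 9975, 4574]
assembled = [118159, 58467, 9973, 4573, 18290]
recovered, missed, extra = match_by_length(truth, assembled)
print(missed, extra)  # [] [18290]
```

The unmatched 18290 bp contig is the extra chimera noted in the table.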

Overall

| Method | Total Missed Small Plasmids | Misassembled |
|---|---|---|
| plassembler long v1.2.0 | 1 + linear plasmid | 2 (C. koseri 64962 chimera and K. oxytoca chimera) |
| Flye (within plassembler) | 6 | 1 (A. baumannii 6078) |

Summary

It's pretty good! Not perfect and I'd still recommend short reads if you can get them, but not too bad. Flye still seems the best for linear plasmids as well.

Usage: plassembler long [OPTIONS]
Plassembler with long reads only

Options:
  -h, --help                Show this message and exit.
  -V, --version             Show the version and exit.
  -d, --database PATH       Directory of PLSDB database.  [required]
  -l, --longreads PATH      FASTQ file of long reads.  [required]
  -c, --chromosome INTEGER  Approximate lower-bound chromosome length of
                            bacteria (in base pairs).  [default: 1000000]
  -o, --outdir PATH         Directory to write the output to.  [default:
                            plassembler.output/]
  -m, --min_length TEXT     minimum length for filtering long reads with
                            chopper.  [default: 500]
  -q, --min_quality TEXT    minimum quality q-score for filtering long reads
                            with chopper.  [default: 9]
  -t, --threads TEXT        Number of threads.  [default: 1]
  -f, --force               Force overwrites the output directory.
  -p, --prefix TEXT         Prefix for output files. This is not required.
                            [default: plassembler]
  --skip_qc                 Skips qc (chopper and fastp).
  --pacbio_model TEXT       Pacbio model for Flye.  Must be one of pacbio-raw,
                            pacbio-corr or pacbio-hifi.  Use pacbio-raw for
                            PacBio regular CLR reads (<20 percent error),
                            pacbio-corr for PacBio reads that were corrected
                            with other methods (<3 percent error) or pacbio-
                            hifi for PacBio HiFi reads (<1 percent error).
  -r, --raw_flag            Use --nano-raw for Flye.  Designed for Guppy fast
                            configuration reads.  By default, Flye will assume
                            SUP or HAC reads and use --nano-hq.
  --keep_chromosome         If you want to keep the chromosome assembly.
  --flye_directory PATH     Directory containing Flye long read assembly.
                            Needs to contain assembly_info.txt and
                            assembly_info.fasta. Allows Plassembler to Skip
                            Flye assembly step.
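
As a rough example of putting these options together, the sketch below builds a plassembler long command line from Python. It uses only flags listed in the help text above. The database directory, read file and output directory names are hypothetical placeholders, and `build_long_command` is an illustrative helper rather than part of plassembler.

```python
import shlex


def build_long_command(database, longreads, outdir, threads=8, chromosome=1_000_000):
    """Assemble the argument list for a 'plassembler long' run."""
    return [
        "plassembler", "long",
        "-d", str(database),    # directory of the PLSDB database
        "-l", str(longreads),   # FASTQ file of long reads
        "-o", str(outdir),      # output directory
        "-t", str(threads),     # number of threads
        "-c", str(chromosome),  # approximate lower-bound chromosome length (bp)
    ]


# Hypothetical paths, for illustration only.
cmd = build_long_command("plsdb_db", "reads.fastq.gz", "plassembler_out")
print(shlex.join(cmd))
# To actually run it (requires plassembler to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Printing the command first makes it easy to check the arguments before starting a run that may take a while.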