Plassembler Benchmarking

FASTQ Set Up for Wick et al

Firstly, I downloaded the tarballs from this link.
These contained tech_rep_1_rapid_reads.fastq.gz, tech_rep_2_rapid_reads.fastq.gz, tech_rep_1_ligation_reads.fastq.gz, tech_rep_2_ligation_reads.fastq.gz, tech_rep_1_illumina_reads and tech_rep_2_illumina_reads tarballs.

These are the barcodes linked to the species

barcode01 == Acinetobacter_baumannii_J9
barcode02 == Citrobacter_koseri_MINF_9D
barcode03 == Enterobacter_kobei_MSB1_1B
barcode04 == Haemophilus_unknown_M1C132_1
barcode05 == Klebsiella_oxytoca_MSB1_2C
barcode07 == Klebsiella_variicola_INF345

Serratia Marscecens

barcode08 Serratia Marcescens was excluded from benchmarking due to how devilishly difficult it was for Ryan Wick to assemble it, even with Trycycler - see his comments.
I tried assembling it in different ways with different read subsets, but kept getting different results every time depending on what read subset I chose or even how I randomly sampled the reads, indicating there was enormous heterogeneity in this sample. Even when I used all reads (over 400x chromosome coverage), the Flye step in Plassembler could not resolve the 184447bp plasmid which was concerning! So I excluded it from the benchmarking.

Recovering Barcodes

These files contained the reads for all barcodes pooled together. The barcode was indicated in the FASTQ header.
Accordingly, I split the files using grep as follows - run from within the directory of each unzipped tarball:

### tech rep 1
# rapid
gunzip -c tech_rep_1_rapid_reads.fastq.gz | grep -A3 "barcode=barcode01"  | gzip >  barcode01.fastq.gz
gunzip -c tech_rep_1_rapid_reads.fastq.gz | grep -A3 "barcode=barcode02"   | gzip >  barcode02.fastq.gz
gunzip -c tech_rep_1_rapid_reads.fastq.gz | grep -A3 "barcode=barcode03"  | gzip >  barcode03.fastq.gz
gunzip -c tech_rep_1_rapid_reads.fastq.gz | grep -A3 "barcode=barcode04"  | gzip >  barcode04.fastq.gz
gunzip -c tech_rep_1_rapid_reads.fastq.gz | grep -A3 "barcode=barcode05" | gzip >  barcode05.fastq.gz
gunzip -c tech_rep_1_rapid_reads.fastq.gz | grep -A3 "barcode=barcode07" | gzip >  barcode07.fastq.gz


# ligation
gunzip -c tech_rep_1_ligation_reads.fastq.gz | grep -A3 "barcode=barcode01"  | gzip >  barcode01.fastq.gz
gunzip -c tech_rep_1_ligation_reads.fastq.gz | grep -A3 "barcode=barcode02"   | gzip >  barcode02.fastq.gz
gunzip -c tech_rep_1_ligation_reads.fastq.gz | grep -A3 "barcode=barcode03"  | gzip >  barcode03.fastq.gz
gunzip -c tech_rep_1_ligation_reads.fastq.gz | grep -A3 "barcode=barcode04"  | gzip >  barcode04.fastq.gz
gunzip -c tech_rep_1_ligation_reads.fastq.gz | grep -A3 "barcode=barcode05" | gzip >  barcode05.fastq.gz
gunzip -c tech_rep_1_ligation_reads.fastq.gz | grep -A3 "barcode=barcode07" | gzip >  barcode07.fastq.gz


### tech rep 2
# rapid
gunzip -c tech_rep_2_rapid_reads.fastq.gz | grep -A3 "barcode=barcode01" | gzip >  barcode01.fastq.gz
gunzip -c tech_rep_2_rapid_reads.fastq.gz | grep -A3 "barcode=barcode02" | gzip >  barcode02.fastq.gz
gunzip -c tech_rep_2_rapid_reads.fastq.gz | grep -A3 "barcode=barcode03" | gzip >  barcode03.fastq.gz
gunzip -c tech_rep_2_rapid_reads.fastq.gz | grep -A3 "barcode=barcode04" | gzip >  barcode04.fastq.gz
gunzip -c tech_rep_2_rapid_reads.fastq.gz | grep -A3 "barcode=barcode05" | gzip >  barcode05.fastq.gz
gunzip -c tech_rep_2_rapid_reads.fastq.gz | grep -A3 "barcode=barcode07" | gzip >  barcode07.fastq.gz


# ligation
gunzip -c tech_rep_2_ligation_reads.fastq.gz | grep -A3 "barcode=barcode01" | gzip >  barcode01.fastq.gz
gunzip -c tech_rep_2_ligation_reads.fastq.gz | grep -A3 "barcode=barcode02" | gzip >  barcode02.fastq.gz
gunzip -c tech_rep_2_ligation_reads.fastq.gz | grep -A3 "barcode=barcode03" | gzip >  barcode03.fastq.gz
gunzip -c tech_rep_2_ligation_reads.fastq.gz | grep -A3 "barcode=barcode04" | gzip >  barcode04.fastq.gz
gunzip -c tech_rep_2_ligation_reads.fastq.gz | grep -A3 "barcode=barcode05" | gzip >  barcode05.fastq.gz
gunzip -c tech_rep_2_ligation_reads.fastq.gz | grep -A3 "barcode=barcode07" | gzip >  barcode07.fastq.gz

Repairing fastqs

There were some issues with the resulting long read FASTQ files as lengths of quality and sequence did not match for some reads
There seemed to be some stray "--" characters. Accordingly I wrote a little script called validate_fastq.py to get all the validated fastq reads. You can find this in the Zenodo output directory here.

# repeat  for all 4 read sets

validate_fastq.py -i barcode01.fastq.gz -o barcode01_fixed.fastq.gz
validate_fastq.py -i barcode02.fastq.gz -o barcode02_fixed.fastq.gz
validate_fastq.py -i barcode03.fastq.gz -o barcode03_fixed.fastq.gz
validate_fastq.py -i barcode04.fastq.gz -o barcode04_fixed.fastq.gz
validate_fastq.py -i barcode05.fastq.gz -o barcode05_fixed.fastq.gz
validate_fastq.py -i barcode07.fastq.gz -o barcode07_fixed.fastq.gz

This recovered around 99% of the original reads (eyeballing by file size) indicating only a few reads were corrupt (sp not enough to impact downstream given how deeply these were sequenced).

Subsampling Fastqs

I ran rasusa to get 50x coverage on each fixed fastq file to reduce the read sizes and to make sure when I pooled the reads, some subsets with higher depth did not predominate.

conda create -n rasusa rasusa
conda install rasusa

# repeat for all 4 long read sets within the directory

rasusa --coverage 50 --genome-size 4mb --input barcode01_fixed.fastq.gz | gzip > barcode01_subsampled.fastq.gz
rasusa --coverage 50 --genome-size 5mb --input barcode02_fixed.fastq.gz | gzip > barcode02_subsampled.fastq.gz
rasusa --coverage 50 --genome-size 5mb --input barcode03_fixed.fastq.gz | gzip > barcode03_subsampled.fastq.gz
rasusa --coverage 50 --genome-size 2.1mb --input barcode04_fixed.fastq.gz | gzip > barcode04_subsampled.fastq.gz
rasusa --coverage 50 --genome-size 6mb --input barcode05_fixed.fastq.gz | gzip > barcode05_subsampled.fastq.gz
rasusa --coverage 50 --genome-size 6mb --input barcode07_fixed.fastq.gz | gzip > barcode07_subsampled.fastq.gz
conda deactivate

Creating pooled reads

# pool the different library preps

mkdir -p tech_rep_1_pooled

cat tech_rep_1_rapid_reads/barcode01_subsampled.fastq.gz tech_rep_1_ligation_reads/barcode01_subsampled.fastq.gz > tech_rep_1_pooled/barcode01.fastq.gz
cat tech_rep_1_rapid_reads/barcode02_subsampled.fastq.gz tech_rep_1_ligation_reads/barcode02_subsampled.fastq.gz > tech_rep_1_pooled/barcode02.fastq.gz
cat tech_rep_1_rapid_reads/barcode03_subsampled.fastq.gz tech_rep_1_ligation_reads/barcode03_subsampled.fastq.gz > tech_rep_1_pooled/barcode03.fastq.gz
cat tech_rep_1_rapid_reads/barcode04_subsampled.fastq.gz tech_rep_1_ligation_reads/barcode04_subsampled.fastq.gz > tech_rep_1_pooled/barcode04.fastq.gz
cat tech_rep_1_rapid_reads/barcode05_subsampled.fastq.gz tech_rep_1_ligation_reads/barcode05_subsampled.fastq.gz > tech_rep_1_pooled/barcode05.fastq.gz
cat tech_rep_1_rapid_reads/barcode07_subsampled.fastq.gz tech_rep_1_ligation_reads/barcode07_subsampled.fastq.gz > tech_rep_1_pooled/barcode07.fastq.gz


mkdir -p tech_rep_2_pooled

cat tech_rep_2_rapid_reads/barcode01_subsampled.fastq.gz tech_rep_2_ligation_reads/barcode01_subsampled.fastq.gz > tech_rep_2_pooled/barcode01.fastq.gz
cat tech_rep_2_rapid_reads/barcode02_subsampled.fastq.gz tech_rep_2_ligation_reads/barcode02_subsampled.fastq.gz > tech_rep_2_pooled/barcode02.fastq.gz
cat tech_rep_2_rapid_reads/barcode03_subsampled.fastq.gz tech_rep_2_ligation_reads/barcode03_subsampled.fastq.gz > tech_rep_2_pooled/barcode03.fastq.gz
cat tech_rep_2_rapid_reads/barcode04_subsampled.fastq.gz tech_rep_2_ligation_reads/barcode04_subsampled.fastq.gz > tech_rep_2_pooled/barcode04.fastq.gz
cat tech_rep_2_rapid_reads/barcode05_subsampled.fastq.gz tech_rep_2_ligation_reads/barcode05_subsampled.fastq.gz > tech_rep_2_pooled/barcode05.fastq.gz
cat tech_rep_2_rapid_reads/barcode07_subsampled.fastq.gz tech_rep_2_ligation_reads/barcode07_subsampled.fastq.gz > tech_rep_2_pooled/barcode07.fastq.gz


# pool the tech reps
# then I combined the technical reps

mkdir -p pooled_long_reads

cat tech_rep_1_pooled/barcode01.fastq.gz tech_rep_2_pooled/barcode01.fastq.gz > pooled_long_reads/Acinetobacter_baumannii.fastq.gz 
cat tech_rep_1_pooled/barcode02.fastq.gz tech_rep_2_pooled/barcode02.fastq.gz > pooled_long_reads/Citrobacter_koseri.fastq.gz 
cat tech_rep_1_pooled/barcode03.fastq.gz tech_rep_2_pooled/barcode03.fastq.gz > pooled_long_reads/Enterobacter_kobei.fastq.gz 
cat tech_rep_1_pooled/barcode04.fastq.gz tech_rep_2_pooled/barcode04.fastq.gz > pooled_long_reads/Haemophilus_unknown.fastq.gz 
cat tech_rep_1_pooled/barcode05.fastq.gz tech_rep_2_pooled/barcode05.fastq.gz > pooled_long_reads/Klebsiella_oxytoca.fastq.gz 
cat tech_rep_1_pooled/barcode07.fastq.gz tech_rep_2_pooled/barcode07.fastq.gz > pooled_long_reads/Klebsiella_variicola.fastq.gz

Finally, I subsampled the reads to 60x coverage with rasusa

# from inside the pooled_long_reads directory

cd pooled_long_reads

rasusa --coverage 60 --genome-size 4mb --input Acinetobacter_baumannii.fastq.gz | gzip > Acinetobacter_baumannii_subsampled60x.fastq.gz 
rasusa --coverage 60 --genome-size 5mb --input Citrobacter_koseri.fastq.gz | gzip > Citrobacter_koseri_subsampled60x.fastq.gz
rasusa --coverage 60 --genome-size 5mb --input Enterobacter_kobei.fastq.gz | gzip > Enterobacter_kobei_subsampled60x.fastq.gz
rasusa --coverage 60 --genome-size 2.1mb --input Haemophilus_unknown.fastq.gz | gzip > Haemophilus_unknown_subsampled60x.fastq.gz
rasusa --coverage 60 --genome-size 6mb --input Klebsiella_oxytoca.fastq.gz | gzip > Klebsiella_oxytoca_subsampled60x.fastq.gz
rasusa --coverage 60 --genome-size 6mb --input Klebsiella_variicola.fastq.gz | gzip > Klebsiella_variicola_subsampled60x.fastq.gz

Creating Illumina pooled reads

mkdir -p pooled_illumina

cat tech_rep_2_illumina_reads/Acinetobacter_baumannii_J9/J9_S142_L001_R1_001.fastq.gz tech_rep_1_illumina_reads/Acinetobacter_baumannii_J9/plasmids-J9_1.fastq.gz > pooled_illumina/Acinetobacter_baumannii_1.fastq.gz
cat tech_rep_2_illumina_reads/Acinetobacter_baumannii_J9/J9_S142_L001_R2_001.fastq.gz tech_rep_1_illumina_reads/Acinetobacter_baumannii_J9/plasmids-J9_2.fastq.gz > pooled_illumina/Acinetobacter_baumannii_2.fastq.gz

cat tech_rep_2_illumina_reads/Citrobacter_koseri_MINF_9D/MINF-9D_S143_L001_R1_001.fastq.gz tech_rep_1_illumina_reads/Citrobacter_koseri_MINF_9D/plasmids-MINF-9D_1.fastq.gz > pooled_illumina/Citrobacter_koseri_1.fastq.gz
cat tech_rep_2_illumina_reads/Citrobacter_koseri_MINF_9D/MINF-9D_S143_L001_R2_001.fastq.gz tech_rep_1_illumina_reads/Citrobacter_koseri_MINF_9D/plasmids-MINF-9D_2.fastq.gz > pooled_illumina/Citrobacter_koseri_2.fastq.gz

cat tech_rep_2_illumina_reads/Enterobacter_kobei_MSB1_1B/MSB1-1B_S144_L001_R1_001.fastq.gz tech_rep_1_illumina_reads/Enterobacter_kobei_MSB1_1B/plasmids-MSB1-1B_1.fastq.gz > pooled_illumina/Enterobacter_kobei_1.fastq.gz
cat tech_rep_2_illumina_reads/Enterobacter_kobei_MSB1_1B/MSB1-1B_S144_L001_R2_001.fastq.gz tech_rep_1_illumina_reads/Enterobacter_kobei_MSB1_1B/plasmids-MSB1-1B_2.fastq.gz > pooled_illumina/Enterobacter_kobei_2.fastq.gz

cat tech_rep_2_illumina_reads/Haemophilus_unknown_M1C132_1/M1C132-1_S145_L001_R1_001.fastq.gz tech_rep_1_illumina_reads/Haemophilus_unknown_M1C132_1/plasmids-MIC132-1_1.fastq.gz > pooled_illumina/Haemophilus_unknown_1.fastq.gz
cat tech_rep_2_illumina_reads/Haemophilus_unknown_M1C132_1/M1C132-1_S145_L001_R2_001.fastq.gz tech_rep_1_illumina_reads/Haemophilus_unknown_M1C132_1/plasmids-MIC132-1_2.fastq.gz > pooled_illumina/Haemophilus_unknown_2.fastq.gz

cat tech_rep_2_illumina_reads/Klebsiella_oxytoca_MSB1_2C/MSB1-2C_S146_L001_R1_001.fastq.gz  tech_rep_1_illumina_reads/Klebsiella_oxytoca_MSB1_2C/plasmids-MSB1-2C_1.fastq.gz  > pooled_illumina/Klebsiella_oxytoca_1.fastq.gz
cat tech_rep_2_illumina_reads/Klebsiella_oxytoca_MSB1_2C/MSB1-2C_S146_L001_R2_001.fastq.gz  tech_rep_1_illumina_reads/Klebsiella_oxytoca_MSB1_2C/plasmids-MSB1-2C_2.fastq.gz  > pooled_illumina/Klebsiella_oxytoca_2.fastq.gz

cat tech_rep_2_illumina_reads/Klebsiella_variicola_INF345/INF345_S147_L001_R1_001.fastq.gz  tech_rep_1_illumina_reads/Klebsiella_variicola_INF345/plasmids-INF345_1.fastq.gz   > pooled_illumina/Klebsiella_variicola_1.fastq.gz
cat tech_rep_2_illumina_reads/Klebsiella_variicola_INF345/INF345_S147_L001_R2_001.fastq.gz  tech_rep_1_illumina_reads/Klebsiella_variicola_INF345/plasmids-INF345_2.fastq.gz   > pooled_illumina/Klebsiella_variicola_2.fastq.gz

Plassembler Benchmarking

FASTQ Set Up for Wick et al

Recovering Barcodes

Repairing fastqs

Subsampling Fastqs

Creating pooled reads

Creating Illumina pooled reads

Other Read Sets

De Maio

CAV1217 Assemblies