Recommended customisation
If you're running Hecatomb on a HPC cluster, we absolutely recommend setting up a Snakemake profile.
Changing the Hecatomb configuration
The Hecatomb configuration file contains settings related to resources and cutoffs for various stages of the pipeline. The different config settings are outlined further on. You can permanently change the behaviour of your Hecatomb installation by modifying the values in the system config file.
Alternatively, you can copy and modify the default config file. Before Hecatomb runs, it will copy the system default config file to the output directory and use it for your analysis. To customise your run, you can copy the system default config file before running Hecatomb like so:
hecatomb config
You can then edit your new hecatomb.out/hecatomb.config.yaml
file to suit your needs.
It will automatically be used in your Hecatomb run, or if you rename it you can specify the file with --configfile
:
hecatomb run --configfile myRenamedHecatomb.config.yaml
Database location
The databases are large (~55 GB) and if your Hecatomb installation is on a partition with limited on space,
you might want to specify a new location to house the database files.
By default, this config setting is blank and the pipeline will use the install location.
You can specify the directory in the Hecatomb system config file under args: databases:
,
e.g:
args:
databases: /scratch/HecatombDatabases
and rerun the installation
hecatomb install
Default resources
The Hecatomb config file contains some sensible defaults for resources.
While these should work for most datasets, they may fail for larger ones.
You may also have more CPUs etc at your disposal and want to minimise runtime of the pipeline.
Currently, the slowest steps are the MMSeqs searches (under resources: big:
);
increasing the CPUs and RAM can significantly improve runtime.
The other settings (under resources: med:
) will yield more modest improvement.
resources:
big:
mem: 64000 # Memory for big (mostly mmseqs) jobs in megabytes (e.g 64GB = 64000, recommend >= 64000)
cpu: 24 # Threads (recommend >= 16)
time: 1440 # Max runtime in minutes (allows to set max time for the scheduler via snakemake profiles)
med:
mem: 32000 # Memory for most jobs in megabytes (recommend >= 32000)
cpu: 16 # CPUs for most jobs in megabytes (recommend >= 16)
ram:
mem: 16000 # Memory for slightly RAM-hungry jobs in megabytes (recommend >= 16000)
cpu: 2 # CPUs for slightly RAM-hungry jobs (recommend >= 2)
Preprocessing settings
Fastp settings can be modified in the config under qc: fastp:
.
Refer to Fastp's documentation for more details.
qc:
compression: 1 # Compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest. Default is 1, based on assumption of large scratch space
fastp:
--qualified_quality_phred 15
--length_required 90
--cut_tail
--cut_tail_window_size 25
--cut_tail_mean_quality 15
--dedup
--dup_calc_accuracy 4
--trim_poly_x
Alignment filtering
Hecatomb has settings for the various MMSeqs2 steps.
Read-clustering parameters using mmseqs' linclust can be modified under mmseqs: linclustParams:
.
For annotation with MMSeqs2, alignment filtering can be modified for each of the primary (viral) and secondary
(multi-kingdom) searches under filtAAprimary
, filtAAsecondary
, filtNTprimary
, and filtNTsecondary
.
Search sensitivity parameters when running Hecatomb with --search fast
are specified under perfAAfast
and perfNTfast
,
and with the default --search sensitive
under perfAA
and perfNT
.
Refer to the MMSeqs2 documentation for more information on these settings.
Finally, you can specify taxIDs to ignore (i.e. defer to the best hit) under mmseqs: taxIdIgnore:
.
mmseqs:
linclustParams:
--kmer-per-seq-scale 0.3
-c 0.8
--cov-mode 1
--min-seq-id 0.97
--alignment-mode 3
filtAAprimary:
--min-length 30
-e 1e-3
filtAAsecondary:
--min-length 30
-e 1e-5
filtNTprimary:
--min-length 90
-e 1e-10
filtNTsecondary:
--min-length 90
-e 1e-20
perfAA:
--start-sens 1
--sens-steps 3
-s 7
--lca-mode 2
--shuffle 0
perfAAfast:
-s 4.0
--lca-mode 2
--shuffle 0
perfNT:
--start-sens 2
-s 7
--sens-steps 3
perfNTfast:
-s 4.0
taxIdIgnore: 0 1 2 10239 131567 12429 2759
Assembly settings
All assembly settings are specified under assembly:
.
Canu assembler is used for --preprocess longreads
and megahit is used for shortreads.
Flye is used to collapse individual sample contigs into a population assembly.
Refer to Canu, Flye, and Megahit's documentation for more information on these settings.
assembly:
canu:
correctedErrorRate=0.16
maxInputCoverage=10000
minInputCoverage=0
corOutCoverage=10000
corMhapSensitivity=high
corMinCoverage=0
useGrid=False
stopOnLowCoverage=False
genomeSize=10M
-nanopore
# -pacbio
# -pacbio-hifi
megahit:
--k-min 45
--k-max 225
--k-step 26
--min-count 2
--min-contig-len 1000
flye:
-g 1g
Additional Snakemake commands
As mentioned, Hecatomb is powered by Snakemake but runs via a launcher for your convenience. Snakemake itself has many command line options, and the launcher can pass additional commands on to Snakemake.
One such example is if you're not production ready you might wish to do a 'dry-run', where the run is simulated but no
jobs are submitted, just to see if everything is configured correctly.
To do that, Snakemake needs the dry run flag (--dry-run
, --dryrun
, or -n
).
In Hecatomb, simply tack it on to the end of your command:
hecatomb run --reads fasq/ --profile slurm --dry-run
Hecatomb prints the Snakemake command to the terminal window before running and you should see these additional options
added to the Snakemake command. Have a look at the full list of available Snakemake options with snakemake --help
.
Any unrecognised command will be passed on to Snakemake verbatim, so use with caution :p