Commands
hecatomb install
- Install the databases (you should only need to do this once)hecatomb run
- Run the pipelinehecatomb test
- Run the test datasethecatomb config
- Copy the default config file to the current directory (for use with--configfile
)hecatomb listHosts
- List the currently-available host genomeshecatomb addHost
- Add your own host genomehecatomb combine
- Combine output from multiple Hecatomb runs
Input
When working with paired-end reads,
you can either specify a directory of reads, and Hecatomb will infer the sample names and forward/reverse files, or,
you can specify a TSV file to explicitly assign sample names and point to the corresponding read files.
In either case you just use --reads
and Hecatomb will figure out if it's a file or directory.
When you specify a directory of reads, e.g. hecatomb run --reads readDir/
,
Hecatomb expects paired sequencing reads in the format sampleName_R1/R2.fastq(.gz),
and uses the wildcards to match around the _R1 flag like so: {sampleName}_R1{fileExtension}
. e.g.
sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
This allows for some variation, such as sample1_R1.L001.fastq.gz
or sample1_R1.fq.gz
,
but you might run into problems if you mix and match file extensions,
or if you have _R1
as part of a sample's name.
When you specify a TSV file, e.g. hecatomb run --reads samples.tsv
,
Hecatomb expects a 3-column tab separated file with the first column specifying the sample name,
and the other columns the relative or full paths to the forward and reverse read files. e.g.
sample1 /path/to/reads/sample1.1.fastq.gz /path/to/reads/sample1.2.fastq.gz
sample2 /path/to/reads/sample2.1.fastq.gz /path/to/reads/sample2.2.fastq.gz
For single-end sequencing (long or short reads), Hecatomb can accept either FASTA or FASTQ format files.
When specifying a directory of reads, the file extensions must be either fasta.gz
or fastq.gz
,
and they must be consistent across samples.
Hecatomb matches with constrained wildcards like so: {sampleName}.{extension,fast[aq].gz}
.
Alternatively you can pass a TSV file just like before, only it will be a 2-column tab separated file, but you can mix and match FASTAs and FASTQs to your heart's content.
Read annotation + assembly
By default, Hecatomb will annotate your reads and perform an assembly.
If you have more than 32 threads available, you can increase the threads provided to the pipeline with --threads
:
hecatomb run --reads fastq/ --threads 64
If you're running on a HPC cluster, you should first set up a
Snakemake Profile.
More info and example for Hecatomb here.
Then you would specify your profile name when running Hecatomb.
Assuming your profile is called slurm
:
hecatomb run --reads fastq/ --profile slurm
Running Hecatomb on a HPC with a Snakemake profile is THE BEST WAY to run the pipeline.
But if you're feeling lazy, just submit a single job with the max resources and use --threads
.
Run specific stages
Optionally skip specific stages of Hecatomb by specifying the stages you want to run. For instance, skip assembly and run the read-annotations only:
hecatomb run --reads fastq/ --profile slurm preprocessing annotations
View the stages that are available to run:
hecatomb run print_stages
Quicker read annotation
The main pipeline bottleneck is the MMSeqs searches.
Use the --search fast
flag to run Hecatomb with less sensitive settings for MMSeqs.
In limited testing, we find it performs almost as well but with considerable runtime improvements.
hecatomb run --reads fastq/ --profile slurm --search fast
Specifying a host genome
Hecatomb includes a thorough host read removal step which utilises a processed host genome. You can specify a host, or add your own.
By default, Hecatomb will use the human genome. If your sample is from a different source you will need to specify the host genome for your sample source.
To see what host genomes are available:
hecatomb listHosts
The following should be available by default: bat, mouse, camel, celegans, macaque, rat, dog, cat, tick, mosquito, cow, human
So if you are working with mouse samples you would run:
hecatomb run --reads fastq/ --host mouse
Add your own host genome
If the genome for the host you're working with isn't included in the available hosts, or you have a reference genome
which you think is better, you can add it with addHost
.
This script will mask viral-like regions from your genome and add it to your Hecatomb host database.
You will need to specify the host genome FASTA file, as well as a name for this host.
Assuming you want to add the llama genome and the FASTA genome file is called llama.fasta
:
hecatomb addHost --host llama --hostfa llama.fasta
You will then be able to run Hecatomb with your new host genome:
hecatomb run --reads fastq/ --host llama
Combine multiple Hecatomb runs
You can now combine multiple Hecatomb runs. For some files such as the seqtable and the bigtable you can simply concatenate them (but removing duplicate headers). However, the assembly files need to be coalesced with FlyE, and the assembly-associated files need to be regenerated.
To run:
hecatomb combine --comb hecOutDir1/ --comb hecOutDir2/
and use --threads
or --profile
like you normally would.