This tutorial will walk through the process of running Hecatomb and performing some preliminary plots and analyses in R/Rstudio. This first section details the Hecatomb run.
You won't need to actually run Hecatomb, you can simply download the files and continue with the rest of the tutorial.
For this tutorial I'll be running Hecatomb on a 16-core/32-thread workstation with 64 GB of RAM running Ubuntu 18.04 LTS. While this is fine for smaller datasets, larger datasets will require HPC resources.
# create new conda env and install hecatomb conda create -n hecatomb -c conda-forge -c bioconda hecatomb # activate env conda activate hecatomb # install the database files hecatomb install
We will run hecatomb on the test dataset, which is a subset of the dataset described in Handley et al (2016). We use the fast MMSeqs settings with the default 32 threads and generate a summary report. This will give us an assembly and some read annotations.
hecatomb test --search fast
We should now have all the files we need!
Download the files
Download the following (gzipped) files generated by Hecatomb:
I also prepared a metadata file. The samples in the test dataset are the samples for the deceased Macaques from the original study. The metadata contains the individuals' gender and the vaccine that was administered.
krona.html are automatically-generated plots.
You can have a look at these files but don't need to do anything with them for the tutorial.
bigtable.tsv.gz is the read-based annotations and will form the basis for most of the subsequent analysis.
contigSeqTable.tsv.gz combines the read-based annotations with the contig mapping coordinates.
taxonLevelCounts.tsv.gz file summarises the sequence counts for each sample, at every taxon level,
and will make it easy to perform some preliminary comparisons.
Optional: prefilter the bigtable.tsv
The dataset in this tutorial is small and manageable, but if you're working on a large dataset, the size of the bigtable can be too big for loading into R. If you run into this situation check out prefiltering the bigtable.
Setting up R/Rstudio
The remainder of the tutorial will be in R and Rstudio, and will use the packages dplyr, tidyr, and ggplot2. The installation and packages for this tutorial have been tested with a fresh installation of R (4.1.2) and RStudio (2021.09.1+372) for Windows.
Assuming you have installed R and Rstudio for your operating system, you should be able to install the packages like so:
# in R/Rstudio install.packages("tidyr") install.packages("dplyr") install.packages("ggplot2")
Alternatively, you can install the whole tidyverse with:
We'll also need 'rstatix' for some of the statistical tests:
Once installed, load the packages:
library(tidyr) library(dplyr) library(ggplot2) library(rstatix)
Load the data
Make sure you set your working directory and have downloaded the files into this directory.
We will load the
bigtable.tsv.gz into a dataframe called
The file is tab-separated and contains a header.
data = read.csv('bigtable.tsv.gz',header=T,sep='\t')
Next, we will load the
metadata.tsv.gz file into a dataframe called
meta in the same way.
meta = read.csv('metadata.tsv.gz',header=T,sep='\t')
taxonLevelCounts.tsv.gz while we're at it.
taxonCounts = read.csv('taxonLevelCounts.tsv.gz',header=T,sep='\t')
contigSeqTable.tsv.gz file handy, we'll load that into R as well later on.
You can now move onto Part 2: filtering annotations.