This tutorial will walk through the process of running Hecatomb and performing some preliminary plots and analyses in R/Rstudio. This first section details the Hecatomb run.

You won't need to actually run Hecatomb; you can simply download the output files and continue with the rest of the tutorial.

System information

For this tutorial I'll be running Hecatomb on a 16-core/32-thread workstation with 64 GB of RAM running Ubuntu 18.04 LTS. While this is fine for smaller datasets, larger datasets will require HPC resources.

New install

# create new conda env and install hecatomb
conda create -n hecatomb -c conda-forge -c bioconda hecatomb

# activate env
conda activate hecatomb

# install the database files
hecatomb install

Run Hecatomb

We will run Hecatomb on the test dataset, which is a subset of the dataset described in Handley et al. (2016). We use the fast MMseqs2 search settings with the default 32 threads and generate a summary report. This will give us an assembly and some read annotations.

hecatomb test --search fast

We should now have all the files we need!

Download the files

Download the following (gzipped) files generated by Hecatomb: bigtable.tsv.gz, contigSeqTable.tsv.gz, taxonLevelCounts.tsv.gz, Sankey.svg, and krona.html.

I also prepared a metadata file (metadata.tsv.gz). The samples in the test dataset are those from the deceased macaques in the original study, and the metadata records each individual's gender and the vaccine that was administered.

Sankey.svg and krona.html are automatically generated plots. You can have a look at these files, but you don't need to do anything with them for the tutorial. The bigtable.tsv.gz contains the read-based annotations and will form the basis for most of the subsequent analysis. The contigSeqTable.tsv.gz combines the read-based annotations with the contig mapping coordinates. The taxonLevelCounts.tsv.gz file summarises the sequence counts for each sample at every taxon level, which will make it easy to perform some preliminary comparisons.

Optional: prefilter the bigtable.tsv

The dataset in this tutorial is small and manageable, but if you're working on a large dataset, the bigtable may be too big to load into R. If you run into this situation, check out prefiltering the bigtable.
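Alternatively, you can slim things down from within R by reading in only the columns you need. The sketch below is just one way to do this: it assumes you have data.table installed (plus R.utils for gzipped input), and the column names shown are placeholders, so check the header of your own bigtable first.

# sketch: read only a subset of bigtable columns to reduce memory use
# (column names below are assumptions - check your own file's header first)
library(data.table)
big <- fread('bigtable.tsv.gz', sep = '\t',
             select = c('seqID', 'sampleID', 'count', 'kingdom', 'family', 'species'))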

Setting up R/Rstudio

The remainder of the tutorial will be in R and RStudio, and will use the packages dplyr, tidyr, and ggplot2. The installation and packages for this tutorial have been tested with a fresh installation of R (4.1.2) and RStudio (2021.09.1+372) for Windows.

Assuming you have installed R and RStudio for your operating system, you should be able to install the packages like so:

# in R/Rstudio
install.packages("tidyr")
install.packages("dplyr")
install.packages("ggplot2")

Alternatively, you can install the whole tidyverse with:

install.packages("tidyverse")

We'll also need 'rstatix' for some of the statistical tests:

install.packages("rstatix")

Once installed, load the packages:

library(tidyr)
library(dplyr)
library(ggplot2)
library(rstatix)

Load the data

Make sure you have downloaded the files and set your working directory to the folder containing them. We will load bigtable.tsv.gz into a dataframe called data with read.csv(). The file is tab-separated and contains a header.
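If you need to set the working directory from within R, you can use setwd(); the path below is just a placeholder for wherever you saved the downloads.

# set the working directory to wherever you downloaded the tutorial files
# (this path is a placeholder - replace it with your own)
setwd('~/hecatomb_tutorial')

With that in place, read in the bigtable: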

data = read.csv('bigtable.tsv.gz',header=T,sep='\t')
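It's worth a quick sanity check that the table read in correctly; the exact dimensions and column names will depend on your Hecatomb run.

# quick sanity check: number of rows/columns and the first few records
dim(data)
head(data)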

Next, we will load the metadata.tsv.gz file into a dataframe called meta in the same way.

meta = read.csv('metadata.tsv.gz',header=T,sep='\t')

We'll load taxonLevelCounts.tsv.gz while we're at it.

taxonCounts = read.csv('taxonLevelCounts.tsv.gz',header=T,sep='\t')
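Again, a quick look at these two tables is a good idea before moving on.

# peek at the metadata and the per-taxon counts
head(meta)
head(taxonCounts)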

Keep the contigSeqTable.tsv.gz file handy; we'll load it into R later on.

You can now move on to Part 2: filtering annotations.