This tutorial will walk through the process of running Hecatomb and performing some preliminary plots and analyses in Python/PyCharm. This first section details the Hecatomb run.

You won't need to actually run Hecatomb; you can simply download the files and continue with the rest of the tutorial.

System information

For this tutorial I'll be running Hecatomb on a 16-core/32-thread workstation with 64 GB of RAM running Ubuntu 18.04 LTS. While this is fine for smaller datasets, larger datasets will require HPC resources.

New install

# create new conda env and install hecatomb
conda create -n hecatomb -c conda-forge -c bioconda hecatomb

# activate env
conda activate hecatomb

# install the database files
hecatomb install

Run Hecatomb

We will run Hecatomb on the test dataset, which is a subset of the dataset described in Handley et al. (2016). We use the fast MMseqs2 settings with the default 32 threads and generate a summary report. This will give us an assembly and some read annotations.

hecatomb test --search fast

We should now have all the files we need!

Download the files

Download the following files generated by Hecatomb: bigtable.tsv.gz, contigSeqTable.tsv.gz, taxonLevelCounts.tsv.gz, Sankey.svg, and krona.html (the tables are gzipped).

I also prepared a metadata file. The samples in the test dataset are those from the deceased macaques in the original study. The metadata contains each individual's gender and the vaccine that was administered.

The Sankey.svg and krona.html are automatically generated plots. You can have a look at these files, but you don't need to do anything with them for the tutorial. The bigtable.tsv.gz contains the read-based annotations and will form the basis for most of the subsequent analysis. The contigSeqTable.tsv.gz combines the read-based annotations with the contig mapping coordinates. The taxonLevelCounts.tsv.gz file summarises the sequence counts for each sample, at every taxon level, and will make it easy to perform some preliminary comparisons.
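
If you'd like a quick look at these tables before we properly load them, pandas can read just the first few rows. A minimal sketch, assuming the files sit in your working directory:

import pandas as pd

# read only the first five rows to inspect the column layout without loading the whole file
print(pd.read_csv('bigtable.tsv.gz', sep='\t', nrows=5))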

Optional: prefilter the bigtable.tsv

The dataset in this tutorial is small and manageable, but if you're working on a large dataset, the bigtable can be too big to load into Python. If you run into this situation, check out prefiltering the bigtable.
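
To give a rough idea of what prefiltering can look like, here is a minimal sketch that streams the bigtable in chunks and keeps only the viral hits. It assumes the bigtable has a 'kingdom' column; adjust the filter to whatever subset you need:

import pandas as pd

# stream the table in manageable chunks instead of loading it all at once
chunks = pd.read_csv('bigtable.tsv.gz', compression='gzip', header=0, sep='\t', chunksize=100000)

# keep only the rows assigned to Viruses and write out a smaller gzipped table
filtered = pd.concat(chunk[chunk['kingdom'] == 'Viruses'] for chunk in chunks)
filtered.to_csv('bigtable.filtered.tsv.gz', compression='gzip', sep='\t', index=False)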

Setting up Python with PyCharm

The remainder of the tutorial will be in Python using the PyCharm IDE, but you could use a Jupyter notebook or a plain old Python shell. We will use the pandas and Seaborn packages, which come with Anaconda or can be installed with pip or conda. In addition, we will install the scikit-posthocs package for Dunn's test. The installation and packages for this tutorial have been tested with a fresh installation of Python (3.9.7) and PyCharm (2021.2.3 Professional Edition, Build PY-212.5457.59) for Mac.

Assuming you have installed Python and PyCharm for your operating system, you should be able to install the packages like so:

# in PyCharm or bash
pip install seaborn
pip install pandas
pip install numpy
pip install scikit-posthocs

# scipy and statsmodels are imported later for the stats; install them
# explicitly in case they aren't pulled in as dependencies
pip install scipy
pip install statsmodels

If you run into the error message 'matplotlib: RuntimeError: Python is not installed as a framework' on a Mac, creating a file at ~/.matplotlib/matplotlibrc as per this article fixed the issue.
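
The fix boils down to writing a matplotlibrc that selects a non-framework backend. A minimal sketch, assuming the TkAgg backend is available on your machine; afterwards, retry the import below:

import os

# create ~/.matplotlib if needed and point matplotlib at a non-framework backend
cfg_dir = os.path.expanduser('~/.matplotlib')
os.makedirs(cfg_dir, exist_ok=True)
with open(os.path.join(cfg_dir, 'matplotlibrc'), 'w') as f:
    f.write('backend: TkAgg\n')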

# in PyCharm/python
import matplotlib.pyplot as plt

You should also check in PyCharm which Python version is being used for the project, as some of the functions rely on Python 3.6+.

On Windows the options should be under File -> Settings -> Project -> Python Interpreter.

On a Mac they're under PyCharm -> Preferences -> Project: <your project name> -> Python Interpreter.
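
You can also confirm the version from within Python itself; this quick check asserts the 3.6+ requirement mentioned above:

import sys

# print the interpreter version and fail loudly if it's too old for this tutorial
print(sys.version)
assert sys.version_info >= (3, 6), 'this tutorial expects Python 3.6 or newer'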

Load the data

Make sure you have created a new Python project and have downloaded the Hecatomb files into this directory.

Make sure to include the following at the start of your script to load the required libraries.

import scipy.stats as stats
import statsmodels.stats.api as sms
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

We will load the bigtable.tsv.gz into a dataframe called data with pd.read_csv. The file is tab-separated and contains a header.

data = pd.read_csv('bigtable.tsv.gz', compression='gzip', header=0, sep='\t')

Next, we will load the metadata.tsv.gz file into a dataframe called meta in the same way.

meta = pd.read_csv('metadata.tsv.gz', compression='gzip', header=0, sep='\t')

We'll load taxonLevelCounts.tsv.gz while we're at it.

taxonCounts = pd.read_csv('taxonLevelCounts.tsv.gz', compression='gzip', header=0, sep='\t')
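
It's worth a quick sanity check that everything loaded as expected; printing the dimensions and the first few rows of each dataframe will do:

# confirm the dimensions and peek at the first few rows of each table
print(data.shape)
print(data.head())
print(meta.head())
print(taxonCounts.head())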

Keep the contigSeqTable.tsv.gz file handy; we'll load it into Python later on.

You can now move onto Part 2: filtering annotations.