This tutorial will walk through the process of running Hecatomb and performing some preliminary plots and analyses in Python/PyCharm. This first section details the Hecatomb run.
You won't need to actually run Hecatomb; you can simply download the files and continue with the rest of the tutorial.
For this tutorial I'll be running Hecatomb on a 16-core/32-thread workstation with 64 GB of RAM running Ubuntu 18.04 LTS. While this is fine for smaller datasets, larger datasets will require HPC resources.
```bash
# create new conda env and install hecatomb
conda create -n hecatomb -c conda-forge -c bioconda hecatomb

# activate env
conda activate hecatomb

# install the database files
hecatomb install
```
We will run Hecatomb on the test dataset, which is a subset of the dataset described in Handley et al. (2016). We use the fast MMseqs2 settings with the default 32 threads and generate a summary report. This will give us an assembly and some read annotations.
```bash
hecatomb test --search fast
```
We should now have all the files we need!
Download the files
Download the following (gzipped) files generated by Hecatomb:
I also prepared a metadata file. The samples in the test dataset are from the deceased macaques in the original study. The metadata contains each individual's gender and the vaccine that was administered.
The krona.html files are automatically generated Krona plots.
You can have a look at these files but don't need to do anything with them for the tutorial.
bigtable.tsv.gz contains the read-based annotations and will form the basis of most of the subsequent analysis.
contigSeqTable.tsv.gz combines the read-based annotations with the contig mapping coordinates.
taxonLevelCounts.tsv.gz summarises the sequence counts for each sample at every taxon level, which will make it easy to perform some preliminary comparisons.
Optional: prefilter the bigtable.tsv
The dataset in this tutorial is small and manageable, but if you're working on a large dataset, the bigtable can be too big to load into Python. If you run into this situation, check out prefiltering the bigtable.
Setting up Python with PyCharm
The remainder of the tutorial will be in Python using the PyCharm IDE, but you could use a Jupyter notebook or a plain old Python shell. We will use the Pandas and Seaborn packages, which come with Anaconda or can be installed with pip or conda. In addition, for the Dunn's test we will install the scikit-posthocs package. The installation and packages for this tutorial have been tested with a fresh installation of Python (3.9.7) and PyCharm (2021.2.3 Professional Edition, Build PY-212.5457.59) for Mac.
Assuming you have installed Python and PyCharm for your operating system, you should be able to install the packages like so:
```bash
# in PyCharm or bash
pip install seaborn
pip install pandas
pip install numpy
pip install scikit-posthocs
```
If you run into the error message 'matplotlib: RuntimeError: Python is not installed as a framework' on a Mac, creating a file at ~/.matplotlib/matplotlibrc as per this article fixes the issue.
```python
# in PyCharm/python
import matplotlib.pyplot as plt
```
You should also check in PyCharm which Python version is being used for the project, as some of the functions rely on Python 3.6+.
On Windows, the options should be under File -> Settings -> Preferences -> Project Interpreter -> Python Interpreters.
On a Mac, the options are under PyCharm -> Preferences -> Project Name -> Python Interpreters.
Load the data
Make sure you have created a new Python project and have downloaded the Hecatomb files into this directory.
Always include the following at the start of your script to load the required libraries.
```python
import scipy.stats as stats
import statsmodels.stats.api as sms
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
```
We will load bigtable.tsv.gz into a dataframe called data. The file is tab-separated and contains a header.
```python
data = pd.read_csv('bigtable.tsv.gz', compression='gzip', header=0, sep='\t')
```
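If you want to confirm that this read call behaves as expected before working with the real file, here is a minimal round-trip on a synthetic table; the file name and column names are invented for illustration and will differ from Hecatomb's actual output.

```python
import gzip
import pandas as pd

# write a tiny gzipped, tab-separated table with a header
# (mini_bigtable.tsv.gz and its columns are made-up examples)
with gzip.open('mini_bigtable.tsv.gz', 'wt') as fh:
    fh.write('seqID\tsampleID\tcount\n')
    fh.write('read1\ts1\t10\n')
    fh.write('read2\ts2\t5\n')

# same call as for the real bigtable
data = pd.read_csv('mini_bigtable.tsv.gz', compression='gzip', header=0, sep='\t')
print(data.shape)  # (2, 3)
```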
Next, we will load the metadata.tsv.gz file into a dataframe called meta in the same way.
```python
meta = pd.read_csv('metadata.tsv.gz', compression='gzip', header=0, sep='\t')
```
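Later in the tutorial we will want to combine the read annotations with the metadata. The usual pattern is a left merge on the shared sample column; in this sketch, 'sampleID' and the metadata columns are assumed names for illustration only.

```python
import pandas as pd

# toy stand-ins for the real data and meta dataframes
data = pd.DataFrame({'sampleID': ['s1', 's1', 's2'], 'count': [10, 5, 7]})
meta = pd.DataFrame({'sampleID': ['s1', 's2'], 'vaccine': ['A', 'B']})

# left merge keeps every annotation row and attaches its sample's metadata
merged = data.merge(meta, on='sampleID', how='left')
print(merged['vaccine'].tolist())  # ['A', 'A', 'B']
```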
We will also load taxonLevelCounts.tsv.gz into a dataframe called taxonCounts while we're at it.
```python
taxonCounts = pd.read_csv('taxonLevelCounts.tsv.gz', compression='gzip', header=0, sep='\t')
```
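A typical preliminary comparison with this table is to subset it to one taxon level and sum or compare counts across samples. The sketch below uses a toy stand-in; the column names ('sampleID', 'taxonLevel', 'taxonName', 'count') and level names are assumptions for illustration.

```python
import pandas as pd

# toy stand-in for the real taxonCounts dataframe
taxonCounts = pd.DataFrame({
    'sampleID':   ['s1', 's1', 's2'],
    'taxonLevel': ['family', 'genus', 'family'],
    'taxonName':  ['Picornaviridae', 'Enterovirus', 'Picornaviridae'],
    'count':      [120, 80, 45],
})

# keep only family-level rows, then total the counts
fam = taxonCounts[taxonCounts['taxonLevel'] == 'family']
print(fam['count'].sum())  # 165
```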
Keep the contigSeqTable.tsv.gz file handy; we'll load it into Python later on as well.
You can now move onto Part 2: filtering annotations.