This section assumes you have completed Tutorial Part3.
Compare viral loads
In part 3, we compared the viral counts between the two sample groups for Podoviridae, and it appeared as though group B had more viral sequence hits on average than group A. We can compare the normalised counts for these two groups to see if they're significantly different.
Student's T-test
Let's check out the dataframe we made earlier that we'll be using for the test:
View(podoCounts)
We'll use the base-r function t.test()
, which takes two vectors--one with the
group A counts and one with the group B counts.
We can use the filter()
and pull()
functions within the t.test()
function like so:
t.test(
podoCounts %>%
filter(sampleGroup=='A') %>%
pull(n),
podoCounts %>%
filter(sampleGroup=='B') %>%
pull(n),
alternative='two.sided',
paired=F,
var.equal=T)
Two Sample t-test
data: podoCounts %>% filter(sampleGroup == "A") %>% pull(n) and podoCounts %>% filter(sampleGroup == "B") %>% pull(n)
t = -5.3033, df = 8, p-value = 0.0007255
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-615.5267 -242.4563
sample estimates:
mean of x mean of y
81.11595 510.10744
Wilcoxon test
You might prefer to perform a Wilcoxon test; the syntax is very similar to the t.test:
wilcox.test(
podoCounts %>%
filter(sampleGroup=='A') %>%
pull(n),
podoCounts %>%
filter(sampleGroup=='B') %>%
pull(n),
alternative='t',
paired=F)
Wilcoxon rank sum exact test
data: podoCounts %>% filter(sampleGroup == "A") %>% pull(n) and podoCounts %>% filter(sampleGroup == "B") %>% pull(n)
W = 0, p-value = 0.007937
alternative hypothesis: true location shift is not equal to 0
Dunn's test
Let's use Dunn's test to check all the major families at the same time. Dunn's is good for if you have three or more categories for a metadata field, such as our vaccine column. First find out what the major families are by summing the hits for each family and sorting the table.
# collect the family counts
viralFamCounts = virusesFiltered %>%
group_by(family) %>%
summarise(n=sum(percent)) %>%
arrange(desc(n))
# update factor levels to the sorted order
viralFamCounts$family = factor(viralFamCounts$family,levels=viralFamCounts$family)
and plot:
ggplot(viralFamCounts) +
geom_bar(aes(x=family,y=n),stat='identity') +
coord_flip()
Let's focus on Siphoviridae, Adenoviridae, Podoviridae, and Microviridae. Collect summary counts for these families for each sample and include the metadata we want to use:
viralMajorFamCounts = viruses %>%
filter(family %in% c('Siphoviridae','Adenoviridae','Podoviridae','Microviridae')) %>%
group_by(sampleID,family,vaccine) %>%
summarise(n=sum(percent))
Now let's do the dunn's test for these families:
viralMajorFamCounts %>%
group_by(family) %>%
dunn_test(n ~ vaccine,p.adjust.method='holm') %>%
add_significance()
# A tibble: 12 × 10
family .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
1 Adenoviridae n Ad_alone Ad_protein 3 2 0.467 0.641 1 ns
2 Adenoviridae n Ad_alone sham 3 4 0.438 0.661 1 ns
3 Adenoviridae n Ad_protein sham 2 4 -0.105 0.916 1 ns
4 Microviridae n Ad_alone Ad_protein 4 2 -1.43 0.153 0.458 ns
5 Microviridae n Ad_alone sham 4 4 -0.584 0.559 0.681 ns
6 Microviridae n Ad_protein sham 2 4 0.953 0.340 0.681 ns
7 Podoviridae n Ad_alone Ad_protein 4 2 0.477 0.634 0.681 ns
8 Podoviridae n Ad_alone sham 4 4 1.75 0.0798 0.240 ns
9 Podoviridae n Ad_protein sham 2 4 0.953 0.340 0.681 ns
10 Siphoviridae n Ad_alone Ad_protein 4 2 2.48 0.0132 0.0395 *
11 Siphoviridae n Ad_alone sham 4 4 1.40 0.161 0.322 ns
12 Siphoviridae n Ad_protein sham 2 4 -1.33 0.182 0.322 ns
There's only one comparison that is significant. Let's put it on a plot
ggplot(viralMajorFamCounts) +
geom_jitter(aes(x=vaccine,y=n)) +
facet_wrap(~family)
Compare presence/absence
You might not care about viral loads and instead are just interested in comparing the presence or absence of viruses. For this you could use a Fisher's exact test. To perform this test you need to assign a presence '1' or absence '0' for each viral family/genus/etc for each sample. What number of hits you use for deciding if a virus is present is up to you.
I want to be sure about the alignments, so I'll apply some stringent filtering cutoffs. Then I'll assign anything with any hits as 'present' for that viral family. Let's look at Myoviridae ... for no particular reason.
# apply a stringent filter
virusesStringent = viruses %>%
filter(evalue<1e-30,alnlen>150,pident>75,alnType=='aa')
# count hits for Myoviridae and score presence
# NOTE: the absent samples WONT be in the table yet, we have to add them in after
myovirPresAbs = virusesStringent %>%
filter(family=='Myoviridae') %>%
group_by(sampleID) %>%
summarise(n=sum(percent)) %>%
mutate(present=ifelse(n>0,1,0))
# merge in the metadata
# Using all=T will do an outer join and bring in the absent samples
myovirPresAbs = merge(myovirPresAbs,meta,by='sampleID',all=T)
# convert na's to zeros
myovirPresAbs[is.na(myovirPresAbs)] = 0
To do the Fisher's exact test we need to specify a 2x2 grid; The first column will be the number with Myoviridae for each group. The second column will be the numbers without for each group.
# matrix rows
mtxGroupA = c(
myovirPresAbs %>%
filter(sampleGroup=='A',present==1) %>%
summarise(n=n()) %>%
pull(n),
myovirPresAbs %>%
filter(sampleGroup=='A',present==0) %>%
summarise(n=n()) %>%
pull(n))
mtxGroupB = c(
myovirPresAbs %>%
filter(sampleGroup=='B',present==1) %>%
summarise(n=n()) %>%
pull(n),
myovirPresAbs %>%
filter(sampleGroup=='B',present==0) %>%
summarise(n=n()) %>%
pull(n))
# create the 2x2 matrix
myovirFishMtx = matrix(c(mtxGroupA,mtxGroupB),nrow = 2)
# this bit is not necessary, but lets add row and col names to illustrate the matrix layout
colnames(myovirFishMtx) = c('GroupA','GroupB')
row.names(myovirFishMtx) = c('present','absent')
# view
View(myovirFishMtx)
# Run Fisher's exact test
fisher.test(myovirFishMtx)
Fisher's Exact Test for Count Data
data: myovirFishMtx
p-value = 0.04762
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.000000 0.975779
sample estimates:
odds ratio
0
We don't have many samples, so our significance won't be great regardless.
In Part 5 we will look at the contigs' read-based annotations.