This section assumes you have completed Tutorial Part3.

# Compare viral loads

In part 3, we compared the viral counts between the two sample groups for *Podoviridae*,
and it appeared as though group B had more viral sequence hits on average than group A.
We can compare the normalised counts for these two groups to see if they're significantly different.

**Student's T-test**

Let's check out the dataframe we made earlier that we'll be using for the test:

```
View(podoCounts)
```

We'll use the base-r function `t.test()`

, which takes two vectors--one with the
group A counts and one with the group B counts.
We can use the `filter()`

and `pull()`

functions within the `t.test()`

function like so:

```
t.test(
podoCounts %>%
filter(sampleGroup=='A') %>%
pull(n),
podoCounts %>%
filter(sampleGroup=='B') %>%
pull(n),
alternative='two.sided',
paired=F,
var.equal=T)
```

```
Two Sample t-test
data: podoCounts %>% filter(sampleGroup == "A") %>% pull(n) and podoCounts %>% filter(sampleGroup == "B") %>% pull(n)
t = -5.3033, df = 8, p-value = 0.0007255
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-615.5267 -242.4563
sample estimates:
mean of x mean of y
81.11595 510.10744
```

**Wilcoxon test**

You might prefer to perform a Wilcoxon test; the syntax is very similar to the t.test:

```
wilcox.test(
podoCounts %>%
filter(sampleGroup=='A') %>%
pull(n),
podoCounts %>%
filter(sampleGroup=='B') %>%
pull(n),
alternative='t',
paired=F)
```

```
Wilcoxon rank sum exact test
data: podoCounts %>% filter(sampleGroup == "A") %>% pull(n) and podoCounts %>% filter(sampleGroup == "B") %>% pull(n)
W = 0, p-value = 0.007937
alternative hypothesis: true location shift is not equal to 0
```

**Dunn's test**

Let's use Dunn's test to check all the major families at the same time. Dunn's is good for if you have three or more categories for a metadata field, such as our vaccine column. First find out what the major families are by summing the hits for each family and sorting the table.

```
# collect the family counts
viralFamCounts = virusesFiltered %>%
group_by(family) %>%
summarise(n=sum(percent)) %>%
arrange(desc(n))
# update factor levels to the sorted order
viralFamCounts$family = factor(viralFamCounts$family,levels=viralFamCounts$family)
```

and plot:

```
ggplot(viralFamCounts) +
geom_bar(aes(x=family,y=n),stat='identity') +
coord_flip()
```

Let's focus on *Siphoviridae*, *Adenoviridae*, *Podoviridae*, and *Microviridae*.
Collect summary counts for these families for each sample and include the metadata we want to use:

```
viralMajorFamCounts = viruses %>%
filter(family %in% c('Siphoviridae','Adenoviridae','Podoviridae','Microviridae')) %>%
group_by(sampleID,family,vaccine) %>%
summarise(n=sum(percent))
```

Now let's do the dunn's test for these families:

```
viralMajorFamCounts %>%
group_by(family) %>%
dunn_test(n ~ vaccine,p.adjust.method='holm') %>%
add_significance()
```

```
# A tibble: 12 × 10
family .y. group1 group2 n1 n2 statistic p p.adj p.adj.signif
* <chr> <chr> <chr> <chr> <int> <int> <dbl> <dbl> <dbl> <chr>
1 Adenoviridae n Ad_alone Ad_protein 3 2 0.467 0.641 1 ns
2 Adenoviridae n Ad_alone sham 3 4 0.438 0.661 1 ns
3 Adenoviridae n Ad_protein sham 2 4 -0.105 0.916 1 ns
4 Microviridae n Ad_alone Ad_protein 4 2 -1.43 0.153 0.458 ns
5 Microviridae n Ad_alone sham 4 4 -0.584 0.559 0.681 ns
6 Microviridae n Ad_protein sham 2 4 0.953 0.340 0.681 ns
7 Podoviridae n Ad_alone Ad_protein 4 2 0.477 0.634 0.681 ns
8 Podoviridae n Ad_alone sham 4 4 1.75 0.0798 0.240 ns
9 Podoviridae n Ad_protein sham 2 4 0.953 0.340 0.681 ns
10 Siphoviridae n Ad_alone Ad_protein 4 2 2.48 0.0132 0.0395 *
11 Siphoviridae n Ad_alone sham 4 4 1.40 0.161 0.322 ns
12 Siphoviridae n Ad_protein sham 2 4 -1.33 0.182 0.322 ns
```

There's only one comparison that is significant. Let's put it on a plot

```
ggplot(viralMajorFamCounts) +
geom_jitter(aes(x=vaccine,y=n)) +
facet_wrap(~family)
```

# Compare presence/absence

You might not care about viral loads and instead are just interested in comparing the presence or absence of viruses. For this you could use a Fisher's exact test. To perform this test you need to assign a presence '1' or absence '0' for each viral family/genus/etc for each sample. What number of hits you use for deciding if a virus is present is up to you.

I want to be sure about the alignments, so I'll apply some stringent filtering cutoffs.
Then I'll assign anything with *any* hits as 'present' for that viral family.
Let's look at *Myoviridae* ... for no particular reason.

```
# apply a stringent filter
virusesStringent = viruses %>%
filter(evalue<1e-30,alnlen>150,pident>75,alnType=='aa')
# count hits for Myoviridae and score presence
# NOTE: the absent samples WONT be in the table yet, we have to add them in after
myovirPresAbs = virusesStringent %>%
filter(family=='Myoviridae') %>%
group_by(sampleID) %>%
summarise(n=sum(percent)) %>%
mutate(present=ifelse(n>0,1,0))
# merge in the metadata
# Using all=T will do an outer join and bring in the absent samples
myovirPresAbs = merge(myovirPresAbs,meta,by='sampleID',all=T)
# convert na's to zeros
myovirPresAbs[is.na(myovirPresAbs)] = 0
```

To do the Fisher's exact test we need to specify a 2x2 grid;
The first column will be the number with *Myoviridae* for each group.
The second column will be the numbers without for each group.

```
# matrix rows
mtxGroupA = c(
myovirPresAbs %>%
filter(sampleGroup=='A',present==1) %>%
summarise(n=n()) %>%
pull(n),
myovirPresAbs %>%
filter(sampleGroup=='A',present==0) %>%
summarise(n=n()) %>%
pull(n))
mtxGroupB = c(
myovirPresAbs %>%
filter(sampleGroup=='B',present==1) %>%
summarise(n=n()) %>%
pull(n),
myovirPresAbs %>%
filter(sampleGroup=='B',present==0) %>%
summarise(n=n()) %>%
pull(n))
# create the 2x2 matrix
myovirFishMtx = matrix(c(mtxGroupA,mtxGroupB),nrow = 2)
# this bit is not necessary, but lets add row and col names to illustrate the matrix layout
colnames(myovirFishMtx) = c('GroupA','GroupB')
row.names(myovirFishMtx) = c('present','absent')
# view
View(myovirFishMtx)
```

```
# Run Fisher's exact test
fisher.test(myovirFishMtx)
```

```
Fisher's Exact Test for Count Data
data: myovirFishMtx
p-value = 0.04762
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
0.000000 0.975779
sample estimates:
odds ratio
0
```

We don't have many samples, so our significance won't be great regardless.

In Part 5 we will look at the contigs' read-based annotations.