Contents

Estimating multiple infections from variant data obtained using massively parallel sequencing experiments can be done in several ways using moimix.

The core function in moimix is binomix which is an interface to a binomial mixture model estimation procedure using the R package flexmix. The input to this function is a matrix of read counts supporting the reference and alternate alleles for all the variants in data.

In this vignette, we will introduce some key functions in moimix and how to set up your variant data to estimate some measures of multiplicity of infection.

0.1 Key functions in moimix

Function Description
binommix estimate MOI using binomial mixture model
alleleCounts extract a matrix of read counts
getBAF estimate B-allele frequencies from read counts
plotBAF plot the B-allele frequency spectra genome-wide
getMAF esimate minor allele frequencies from read counts
getFws estimate within isolate diversity according to Manske et. al, 2012
callMajor call major haplotypes using B-allele frequencies
extractCOIL Output a COIL barcode file
extractPED Create haploid PLINK PED file

0.2 Data Formats

moimix assumes that variants have been called either from whole-genome/exome or RNA sequencing and the variant calling algorithm used outputs a variant call format (VCF) version 4.1 file. We also assume that each sample in the VCF file has been annotated with a AD tag at the variant sites for both the alternate and reference alleles. Currently, we have tested the functions using output from GATK UnifiedGenotyper and HaplotypeCaller algorithms.

moimix requires that the VCF file is then converted to the Genomic Data Storage (GDS) format. The GDS format provides a fast and convienent way of manipulating large VCF files in R. We make use of the SeqArray package for both VCF conversion and manipulating GDS files.

If you are just interested in running the binomix procedure, then you do not need to have a GDS file. For each sample construct a matrix with 2 columns correpsonding the read counts for the alternate and referene alleles. Then pass this matrix to the binomix function.

0.2.1 Converting a VCF to GDS format

To convert a tabixed gzipped VCF file to the GDS format use the SeqArray package.

library(SeqArray)
seqVCF2GDS("/path/to/my/file.vcf.gz", "/path/to/my/file.gds")

To read the GDS file into R, use the seqOpen function and to check that its output matches the orginal VCF file use seqSummary.

my_vcf <-seqOpen("/path/to/my/file.gds")
seqSummary(my_vcf)

1 Case study: Plasmodium falciparum isolates from The Gambia.

Here we present an example analysis from data obtained from the Malaria Genomics Consoritum community project. The data consists of 65 Plasmodium falciparum isolates from The Gambia with approximately 78,000 high-quality biallelic SNVs on 14 nuclear chromosomes and the apicoplast.

For details on how the data was processed please see the README file located at ftp://ngs.sanger.ac.uk/production/pf3k/release_4/

First, we use SeqArray to load our GDS file and look at it.

library(SeqArray)
## Loading required package: gdsfmt
gds_file <- system.file("extdata", "gambian_isolates.gds", package = "moimix")
isolates <- seqOpen(gds_file)
seqSummary(isolates)
## File: /Users/lee.s/Library/R/3.2/library/moimix/extdata/gambian_isolates.gds
## Format Version: v1.0
## Reference: file:///lustre/scratch109/malaria/pf3k_methods/resources/Pfalciparum.genome.fasta
## Ploidy: 2
## Number of samples: 65
## Number of variants: 77241
## Chromosomes:
##  Pf3D7_01_v3  Pf3D7_02_v3  Pf3D7_03_v3  Pf3D7_04_v3  Pf3D7_05_v3 
##         2040         2781         3256         4518         4540 
##  Pf3D7_06_v3  Pf3D7_07_v3  Pf3D7_08_v3  Pf3D7_09_v3  Pf3D7_10_v3 
##         3942         4881         4709         4791         5595 
##  Pf3D7_11_v3  Pf3D7_12_v3  Pf3D7_13_v3  Pf3D7_14_v3 Pf3D7_API_v3 
##         6915         6905        10117        12205           46 
##         <NA> 
##            0 
## Number of alleles per site:
##     2 
## 77241 
## Annotation, INFO variable(s):
##  AC, A, Integer, Allele count in genotypes, for each ALT allele, in the same order as listed
##  AF, A, Float, Allele Frequency, for each ALT allele, in the same order as listed
##  AN, 1, Integer, Total number of alleles in called genotypes
##  BaseQRankSum, 1, Float, Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities
##  ClippingRankSum, 1, Float, Z-score From Wilcoxon rank sum test of Alt vs. Ref number of hard clipped bases
##  DP, 1, Integer, Approximate read depth; some reads may have been filtered
##  DS, 0, Flag, Were any of the samples downsampled?
##  END, 1, Integer, Stop position of the interval
##  FS, 1, Float, Phred-scaled p-value using Fisher's exact test to detect strand bias
##  GC, 1, Float, GC content around the variant (see docs for window size details)
##  HaplotypeScore, 1, Float, Consistency of the site with at most two segregating haplotypes
##  InbreedingCoeff, 1, Float, Inbreeding coefficient as estimated from the genotype likelihoods per-sample when compared against the Hardy-Weinberg expectation
##  MLEAC, A, Integer, Maximum likelihood expectation (MLE) for the allele counts (not necessarily the same as the AC), for each ALT allele, in the same order as listed
##  MLEAF, A, Float, Maximum likelihood expectation (MLE) for the allele frequency (not necessarily the same as the AF), for each ALT allele, in the same order as listed
##  MQ, 1, Float, RMS Mapping Quality
##  MQRankSum, 1, Float, Z-score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities
##  NEGATIVE_TRAIN_SITE, 0, Flag, This variant was used to build the negative training set of bad variants
##  POSITIVE_TRAIN_SITE, 0, Flag, This variant was used to build the positive training set of good variants
##  QD, 1, Float, Variant Confidence/Quality by Depth
##  RPA, ., Integer, Number of times tandem repeat unit is repeated, for each allele (including reference)
##  RU, 1, String, Tandem repeat unit (bases)
##  ReadPosRankSum, 1, Float, Z-score from Wilcoxon rank sum test of Alt vs. Ref read position bias
##  RegionType, 1, String, The type of genome region within which the variant is found. SubtelomericRepeat: repetitive regions at the ends of the chromosomes. SubtelomericHypervariable: subtelomeric region of poor conservation between the 3D7 reference genome and other samples. InternalHypervariable: chromosome-internal region of poor conservation between the 3D7 reference genome and other samples. Centromere: start and end coordinates of the centromere genome annotation. Core: everything else.
##  SNPEFF_AMINO_ACID_CHANGE, 1, String, Old/New amino acid for the highest-impact effect resulting from the current variant (in HGVS style)
##  SNPEFF_CODON_CHANGE, 1, String, Old/New codon for the highest-impact effect resulting from the current variant
##  SNPEFF_EFFECT, 1, String, The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)
##  SNPEFF_EXON_ID, 1, String, Exon ID for the highest-impact effect resulting from the current variant
##  SNPEFF_FUNCTIONAL_CLASS, 1, String, Functional class of the highest-impact effect resulting from the current variant: [NONE, SILENT, MISSENSE, NONSENSE]
##  SNPEFF_GENE_BIOTYPE, 1, String, Gene biotype for the highest-impact effect resulting from the current variant
##  SNPEFF_GENE_NAME, 1, String, Gene name for the highest-impact effect resulting from the current variant
##  SNPEFF_IMPACT, 1, String, Impact of the highest-impact effect resulting from the current variant [MODIFIER, LOW, MODERATE, HIGH]
##  SNPEFF_TRANSCRIPT_ID, 1, String, Transcript ID for the highest-impact effect resulting from the current variant
##  SOR, 1, Float, Symmetric Odds Ratio of 2x2 contingency table to detect strand bias
##  STR, 0, Flag, Variant is a short tandem repeat
##  VQSLOD, 1, Float, Log odds of being a true variant versus being false under the trained gaussian mixture model
##  VariantType, 1, String, Variant type description
##  culprit, 1, String, The annotation which was the worst performing in the Gaussian mixture model, likely the reason why the variant was filtered out
##  set, 1, String, Source VCF for the merged record in CombineVariants
##  PRA, 1, String, Plasmodium reichenowi allele
## Annotation, FORMAT variable(s):
##  AD, ., Integer, Allelic depths for the ref and alt alleles in the order listed
##  DP, 1, Integer, Approximate read depth (reads with MQ=255 or with bad mates are filtered)
##  GQ, 1, Integer, Genotype Quality
##  MIN_DP, 1, Integer, Minimum DP observed within the GVCF block
##  PGT, 1, String, Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another
##  PID, 1, String, Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group
##  PL, G, Integer, Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification
##  RGQ, 1, Integer, Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)
##  SB, 4, Integer, Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.
## Annotation, sample variable(s):
##  
# save sample identifiers
sample.id <- seqGetData(isolates, "sample.id")

We also filter the variants located on the apicoplast before any downstream analysis.

library(moimix)
## Loading required package: flexmix
## Loading required package: lattice
# get genomic coordinates of all variants
coords <- getCoordinates(isolates)
head(coords)
##    chromosome position variant.id
## 1 Pf3D7_02_v3   106099          1
## 2 Pf3D7_02_v3   106230          2
## 3 Pf3D7_02_v3   108524          3
## 4 Pf3D7_02_v3   108889          4
## 5 Pf3D7_02_v3   109059          5
## 6 Pf3D7_02_v3   109073          6
# filter variant.ids not on apicoplast
seqSetFilter(isolates, 
             variant.id = coords$variant.id[coords$chromosome != "Pf3D7_API_v3"])
## # of selected variants: 77195

1.1 Estimating the BAF matrix

The first step in our clonality analysis is to estiamte the B-allele frequencies for each isolate directly from the sequencing depth. We construct a bafMatrix object which stores the estimates of the B-allele frequencies in a matrix over the entire cohort and the genomic coordinates of the variants.

isolate_baf <- bafMatrix(isolates)
class(isolate_baf)
## [1] "bafMatrix"
str(isolate_baf)
## List of 2
##  $ baf_matrix: num [1:65, 1:77195] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ sample : chr [1:65] "PA0007-C" "PA0008-C" "PA0011-C" "PA0012-C" ...
##   .. ..$ variant: chr [1:77195] "1" "2" "3" "4" ...
##  $ coords    :'data.frame':  77195 obs. of  3 variables:
##   ..$ chromosome: chr [1:77195] "Pf3D7_02_v3" "Pf3D7_02_v3" "Pf3D7_02_v3" "Pf3D7_02_v3" ...
##   ..$ position  : int [1:77195] 106099 106230 108524 108889 109059 109073 109248 109343 109601 109623 ...
##   ..$ variant.id: int [1:77195] 1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "class")= chr "bafMatrix"

You can plot the genome-wide B-allele frequencies by passing the bafMatrix object and a sample identifer to plot.

Here’s an example for an isolate that is from a single-clone infection.

plot(isolate_baf, "PA0022-C")

And an example of multiple-clone infection.

plot(isolate_baf,  "PA0021-C")

1.2 Estimating MOI with binommix

To estimate MOI we use the binomix function. MOI is estimated on a per-isolate basis using a binomial mixture model implemented in the R package flexmix. In order to fit a model for an isolate, we need a matrix of read counts and to set the value of \(k=2c\) where \(c\) is the number clonal infections.

We fit a mixture model with 4 bands on sample “PA0021-C” and report the results.

set.seed(2002)
# first generate matrix of read counts for ref and alt alleles
counts <- alleleCounts(isolates)

# fit a model on sample PA0021-C, multiple clone infection
m1 <- binommix(counts, sample.id = "PA0021-C", k = 2)
## 2 : * * * * * * * * * *
# to inspect the model and view parameter estimates
summary(m1)
## 
## Call:
## flexmix::initFlexmix(y_obs ~ 1, model = flexmix::FLXMRglm(y_obs ~ 
##     ., family = "binomial"), k = 2, control = list(iter.max = niter, 
##     minprior = 0), nrep = 10)
## 
##        prior size post>0 ratio
## Comp.1 0.496 4081   4125 0.989
## Comp.2 0.504 4143   4199 0.987
## 
## 'log Lik.' -38145.25 (df=3)
## AIC: 76296.51   BIC: 76317.55
param.estimates <- getTheta(m1)
param.estimates
##      pi.hat    mu.hat
## 1 0.4960888 0.2559665
## 2 0.5039112 0.7484539
# visualise model output
plot(isolate_baf, "PA0021-C")
abline(h = param.estimates$mu.hat)

1.3 Estimating MOI with \(F_{ws}\)

An alternative way of assessing clonality is to estimte the \(F_{ws}\) statistic. An \(F_{ws} < 0.95\) is indicative of a clonal infection.

fws_all <- getFws(isolates)

hist(fws_all)

# see if our sample that we estimated is multiclonal according to fws
fws_all["PA0021-C"] < 0.95
## PA0021-C 
##     TRUE

2 Session Info

sessionInfo()
## R version 3.2.4 (2016-03-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.9.5 (Mavericks)
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] moimix_0.0.0.0  flexmix_2.3-13  lattice_0.20-33 SeqArray_1.10.6
## [5] gdsfmt_1.6.2    BiocStyle_1.8.0
## 
## loaded via a namespace (and not attached):
##  [1] SummarizedExperiment_1.0.2 modeltools_0.2-21         
##  [3] VariantAnnotation_1.16.4   colorspace_1.2-6          
##  [5] htmltools_0.3.5            GWASExactHW_1.01          
##  [7] stats4_3.2.4               rtracklayer_1.30.3        
##  [9] yaml_2.1.13                GenomicFeatures_1.22.13   
## [11] MCMCpack_1.3-5             XML_3.98-1.4              
## [13] DBI_0.3.1                  BiocParallel_1.4.3        
## [15] Rgraphviz_2.14.0           BiocGenerics_0.16.1       
## [17] RColorBrewer_1.1-2         lambda.r_1.1.7            
## [19] plyr_1.8.3                 foreach_1.4.3             
## [21] stringr_1.0.0              zlibbioc_1.16.0           
## [23] Biostrings_2.38.4          MatrixModels_0.4-1        
## [25] munsell_0.4.3              futile.logger_1.4.1       
## [27] codetools_0.2-14           coda_0.18-1               
## [29] evaluate_0.8.3             Biobase_2.30.0            
## [31] knitr_1.12.3               IRanges_2.4.8             
## [33] SparseM_1.7                biomaRt_2.26.1            
## [35] GenomeInfoDb_1.6.3         quantreg_5.21             
## [37] parallel_3.2.4             AnnotationDbi_1.32.3      
## [39] Rcpp_0.12.3                scales_0.4.0              
## [41] BSgenome_1.38.0            formatR_1.3               
## [43] S4Vectors_0.8.11           graph_1.48.0              
## [45] XVector_0.10.0             Rsamtools_1.22.0          
## [47] mcmc_0.9-4                 digest_0.6.9              
## [49] stringi_1.0-1              GenomicRanges_1.22.4      
## [51] grid_3.2.4                 tools_3.2.4               
## [53] bitops_1.0-6               magrittr_1.5              
## [55] RCurl_1.95-4.8             RSQLite_1.0.0             
## [57] futile.options_1.0.0       MASS_7.3-45               
## [59] Matrix_1.2-4               rmarkdown_0.9.5           
## [61] iterators_1.0.8            SeqVarTools_1.8.1         
## [63] GenomicAlignments_1.6.3    nnet_7.3-12

3 References

  1. flexmix package - https://cran.r-project.org/web/packages/flexmix/index.html
  2. SeqArray package - http://bioconductor.org/packages/release/bioc/html/SeqArray.html