The analysis of positive selection in SIMMAP 1.5 uses the mutational histories to estimate a number of statistics relevant to determining whether a gene or site has a signature of positive selection. The information below describes the statistics and how they are derived.
Equation 1: Number of non-synonymous sites in a codon. This is a sum over each mutational interval in the mutational history for each position in the codon. Notice that it is taken as a time-weighted average (ν is the evolutionary time, s is the probability of a non-synonymous change, and i indexes the positions in the codon). This value is not reported but is used in subsequent calculations.
Equation 2: Number of synonymous sites in a codon. This is a sum over each mutational interval in the mutational history for each position in the codon. Notice that it is taken as a time-weighted average (ν is the evolutionary time, s is the probability of a synonymous change, and i indexes the positions in the codon). This value is not reported but is used in subsequent calculations.
Equation 3: The rate of non-synonymous changes, where Kn is the number of non-synonymous changes observed in a sampled mutational history, Sn is defined in Equation 1 above, and T is the total evolutionary time. dn and Kn are reported.
Equation 4: The rate of synonymous changes, where Ks is the number of synonymous changes observed in a sampled mutational history, Ss is defined in Equation 2 above, and T is the total evolutionary time. ds and Ks are reported.
Equation 5: The adaptive rate, ω, taken as the ratio of dn to ds.
Each of the indicated statistics is reported for each codon selected in an analysis and for the gene as a whole. The whole-gene estimate of ω is taken as the ratio of the expectations of dn and ds.
In the case of summary statistics, all of the values described above are reported as their expected values.
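The formulas themselves are not given in the text above, so the following is only a minimal sketch of one reading of the verbal descriptions. It assumes that the number of sites is the sum over codon positions of the time-weighted average of s, that a rate is the number of changes divided by (sites × T), and that the whole-gene ω is the ratio of the mean dn to the mean ds over sampled histories. The function names and the example codon are hypothetical and are not part of SIMMAP.

```python
from statistics import mean
from typing import Sequence, Tuple

# One mutational interval: (duration nu, probability s of the change type).
Interval = Tuple[float, float]


def time_weighted_sites(positions: Sequence[Sequence[Interval]]) -> float:
    """Sum over codon positions of the time-weighted average of s (assumed form)."""
    total = 0.0
    for intervals in positions:
        elapsed = sum(nu for nu, _ in intervals)
        total += sum(nu * s for nu, s in intervals) / elapsed
    return total


def rate(changes: float, sites: float, total_time: float) -> float:
    """Changes per site per unit of evolutionary time (assumed form)."""
    return changes / (sites * total_time)


def whole_gene_omega(dn_samples: Sequence[float], ds_samples: Sequence[float]) -> float:
    """Ratio of the expectations of dn and ds over sampled histories."""
    return mean(dn_samples) / mean(ds_samples)


# Hypothetical codon: position 1 has two intervals, positions 2 and 3 have one.
codon = [[(1.0, 0.5), (1.0, 1.0)], [(2.0, 1.0)], [(2.0, 0.0)]]
s_n = time_weighted_sites(codon)  # 0.75 + 1.0 + 0.0
d_n = rate(2, s_n, 2.0)           # two changes over total time T = 2.0
```

Note that the ratio of expectations used for the whole gene is generally not the same as the expectation of the per-sample ratios dn/ds.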
For more information on setting up an analysis of positive selection see here and here.
What are predictive distributions?
Predictive distributions are a Bayesian approach to hypothesis testing, similar to the better-known parametric bootstrap method. The goal of predictive distributions is to generate an appropriate null distribution of the desired test statistic, T(•). The test statistic is designed to quantify some aspect of the observed data, X. In the case of character histories, this statistic summarizes some aspect of the histories, such as a correlation among character states.
The predictive distribution is generated by first sampling the parameters of the evolutionary model and phylogeny, θ, from their posterior distribution, p(θ|X). Given a sample of θ, a predictive or null data set is simulated (see Figure 1 below). This differs from the parametric bootstrap (left side of Figure 1 below) in that it samples values in proportion to their posterior probability rather than using a single point estimate such as the maximum likelihood estimate. In this way the posterior predictive data sets explicitly accommodate uncertainty in the evolutionary model and phylogeny.
Character histories are sampled on these predictive data sets, and the test statistic is calculated from them. The test statistics are often evaluated as their posterior expectation, so the null data sets and sampled character histories are evaluated over all of the samples from the posterior. The practical implication of this is not shown below but can be seen in Figure 4 of Huelsenbeck et al. (Syst. Biol., 2003): for each null data set, character histories are sampled for each set of topology and model values from the posterior, and these are summarized using the test statistic.
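The sampling scheme described above can be sketched in a few lines. This is a toy illustration only, not SIMMAP's implementation: θ is reduced to a single probability, the "data set" is a list of coin flips, and the test statistic is their sum. All names here are hypothetical, and the additional step of sampling character histories on each null data set is omitted.

```python
import random
from typing import Callable, List, Sequence


def predictive_statistics(
    posterior_samples: Sequence[float],
    simulate: Callable[[float, random.Random], List[int]],
    statistic: Callable[[List[int]], float],
    rng: random.Random,
) -> List[float]:
    """Simulate one null data set per posterior draw of theta and summarize it."""
    return [statistic(simulate(theta, rng)) for theta in posterior_samples]


def simulate_flips(theta: float, rng: random.Random, n: int = 10) -> List[int]:
    """Toy stand-in for simulating a data set under parameters theta."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]


rng = random.Random(1)
posterior_draws = [0.2, 0.5, 0.8]  # hypothetical sample from p(theta | X)
null_values = predictive_statistics(posterior_draws, simulate_flips, sum, rng)
```

In this sketch, a parametric bootstrap would correspond to passing the same point estimate repeatedly (for example `[0.5] * 3`) instead of distinct posterior draws.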
How SIMMAP 1.5 evaluates posterior predictive p-values
P-values are determined by where the observed statistic falls within the predictive distribution (tail-area probabilities). This approach asks whether the observed value of ω could have been produced simply by chance, i.e., could a value as extreme as the one observed be due simply to the mutational process (essentially a model of neutral evolution that accommodates the mutational biases)?
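A tail-area probability of this kind can be sketched as the fraction of predictive (null) values at least as large as the observed value. The text does not say whether SIMMAP uses a one- or two-tailed area or how it treats ties, so the upper-tail, inclusive form below is an assumption, and the function name is hypothetical.

```python
from typing import Sequence


def upper_tail_p_value(observed: float, null_values: Sequence[float]) -> float:
    """Fraction of null values that are at least as large as the observed value."""
    if not null_values:
        raise ValueError("null_values must not be empty")
    exceed = sum(1 for value in null_values if value >= observed)
    return exceed / len(null_values)


# Hypothetical numbers: observed omega of 2.0 against four null values.
p = upper_tail_p_value(2.0, [0.5, 1.0, 2.0, 3.0])
```

With only a finite number of predictive data sets, the smallest nonzero P-value this can return is 1 divided by the number of null values, which limits the resolution of the test.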
A P-value less than 0.05, or your chosen level of significance, indicates significance. When looking for codons under selection, it is probably wise to apply a multiple-testing correction (such as the FDR) when the number of codons exceeds 10 or so.
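As an illustration of an FDR correction applied to per-codon P-values, here is a minimal sketch of the Benjamini-Hochberg step-up procedure. The text does not specify which FDR method to use, so the choice of Benjamini-Hochberg is an assumption, and the function name and example values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Return, for each P-value, whether it is significant at FDR level alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= alpha * k / m.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            largest_k = rank
    # Reject every hypothesis ranked at or below that k.
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


# Hypothetical P-values for four codons.
significant = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Because the procedure is step-up, a P-value that misses its own threshold can still be declared significant if a larger P-value further down the ranking meets its threshold.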