Character Associations (Correlations)
Often we would like to know whether two characters, or particular character states, covary on the phylogeny. Examples include morphological studies addressing hypotheses of correlated evolution and ecological studies testing for associations between a character and a particular environment. Character histories can be used to address these questions. For a more in-depth description of the use of character histories for measuring character correlation, see Huelsenbeck et al. (2003).
Brief Overview of Method
Let us look briefly at the method for one of the statistics for measuring covariation discussed below. Imagine we have two characters with two states each (binary characters). We can sample a realization of a character (mutational) history for each. See the figure to the right, which shows two possible character histories for each site. The histories are very similar, the only difference being the reconstruction at the root for the lower (L) character. For two binary characters there are 4 possible state configurations (associations between the two characters). For each state configuration we calculate the observed time spent in this configuration along the tree (see the Observed 2x2 association table to the right). These values reflect the observed association for the sampled character history. However, we would like to correct for the probability that the observed configuration is due simply to a chance association on the phylogeny. Luckily, this can be done in a straightforward manner by taking the product of the independent amounts of time spent in each state, for example state 0 for the U character and state 0 for the L character. The independent value for each character state is the corresponding row or column sum of the observed table. For example, the expected frequency of seeing a 0-0 (U-L) state configuration is: (0.36+0.0) x (0.36+0.14) = 0.18. Most of the statistics for measuring covariation in SIMMAP summarize histories in a similar way. This, of course, is not the only way. Using the options for saving the raw character histories, the user can develop a customized test of character correlation (see here for more information on saving the raw character histories).
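The expected-frequency calculation above can be sketched in a few lines of Python. This is only an illustration, not SIMMAP's code: the function name expected_table is hypothetical, and the (1, 1) cell of the observed table is an assumed value (0.50), chosen so that the table sums to 1, because the text quotes only the other three cells. The final observed-minus-expected table is one simple way to compare the two and is likewise an assumption.

```python
# Observed fraction of time spent in each (U, L) state configuration.
# Rows are the state of U, columns are the state of L.
# The 0.36, 0.00 and 0.14 cells come from the worked example in the text;
# the 0.50 cell is ASSUMED so that the table sums to 1.
observed = [
    [0.36, 0.00],  # U = 0: L = 0, L = 1
    [0.14, 0.50],  # U = 1: L = 0, L = 1
]


def expected_table(obs):
    """Expected frequencies under independence: row sum times column sum."""
    row_totals = [sum(row) for row in obs]
    col_totals = [sum(col) for col in zip(*obs)]
    return [[r * c for c in col_totals] for r in row_totals]


expected = expected_table(observed)

# Illustrative comparison only: observed minus expected for every cell.
difference = [
    [o - e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

print(round(expected[0][0], 2))  # (0.36 + 0.0) x (0.36 + 0.14)
```

A positive entry in difference means the configuration was seen more often than expected by chance on this sampled history, and a negative entry means less often.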
Configuring the Analysis
SIMMAP 1.5 will estimate whether two characters are correlated (or associated) using mutational maps as described in Huelsenbeck et al., 2003, Sys. Biol. and Bollback, 2005, in Statistical Genetics.
A correlation analysis can include the following data types: nucleotides or morphological/standard characters. Correlations cannot be performed on codons or amino acid residues.
As with a standard mutational analysis, a correlation analysis requires that a model be configured as described here.
Once the substitution model has been configured, set up the correlation analysis by opening the Analysis window (see Figures 2 and 3 below): select Analysis->Configure Analysis (cmd-2) from the main menu.
General Tab -
The first step is to select Character association (correlation) in the Analysis Type box. Next, select the characters for which associations are to be estimated. Note: SIMMAP 1.0 only permitted two characters to be selected for a single analysis. The new version has relaxed this restriction and allows correlations to be estimated between all characters in the Included table. It is not recommended that this be used as a way of "fishing" for correlations.
If desired, both individual and summary (expectations) statistics can be saved to files during the analysis. This is accomplished by selecting the appropriate check boxes (Save individual statistics to file and/or Save summary statistics to file) and then setting the file name and location using the Set button. For more information on the correlation statistics see here.
Sampling Tab -
The final task is to configure the sampling design to be used. Most of this has been covered in detail here, so I will forgo a detailed description and focus mostly on the behavior of predictive sampling.
Perform predictive sampling: this activates posterior predictive sampling to determine p-values for the associations, i.e., whether the association observed could have arisen by chance alone.
Number of predictive samples: sets the number of predictive samples to draw in addition to the number of samples that has been set. This is a multiplier of the number of samples (and prior draws). Perform predictive sampling must be active, i.e., checked.
Save predictive maps to file: the posterior predictive mutational maps, generated when performing predictive sampling, are saved to a Nexus tree file (w/ translate block) when this is checked. (See here for more details on the file written.) Perform predictive sampling must be active, i.e., checked. A file must be defined using the Set button which allows a file name and location to be chosen.
Save predictive statistics to file: the posterior predictive association statistics being collected during a predictive test are saved to a file. (See here for more details on the file written.) These are the posterior expected values obtained from each simulated replicate. These are the values used to determine the posterior p-value. Perform predictive sampling must be active, i.e., checked. A file must be defined using the Set button which allows a file name and location to be chosen.
What are predictive distributions?
Predictive distributions are a Bayesian approach to hypothesis testing, similar to the better-known parametric bootstrap method. The goal of predictive distributions is to generate an appropriate null distribution of the desired test statistic, T(•). The test statistic is designed to quantify some aspect of the observed data, X. In the case of character histories, this statistic summarizes some aspect of the histories, such as a correlation among character states.
The predictive distribution is generated by first sampling the parameters of the evolutionary model and phylogeny, θ, from their posterior distribution, p(θ|X). Given a sample of θ, a predictive (null) data set is simulated (see Figure 4 below). This is different from the parametric bootstrap (left side of Figure 4 below) in that it samples values in proportion to their posterior probability rather than using a single point estimate such as the maximum likelihood estimate. In this way the posterior predictive data sets explicitly accommodate uncertainty in the evolutionary model and phylogeny.
Using these predictive data sets, character histories are sampled, from which the test statistic is calculated. The test statistics are often evaluated as their posterior expectation, and therefore the null data sets and sampled character histories are evaluated over all of the samples from the posterior. The practical implication of this is not shown below but can be seen in Figure 4 of Huelsenbeck et al. (Sys. Biol, 2003) - for each null data set, character histories are sampled for each topology and set of model values from the posterior, and these are summarized using the test statistic.
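The difference between the two simulation schemes can be sketched with a deliberately simple toy model in Python. Everything here is an assumption made for illustration: θ is a single probability rather than a phylogeny and substitution model, the simulator and statistic are stand-ins, and all function names are hypothetical. The sketch also omits the step of sampling character histories over every posterior sample for each null data set.

```python
import random


def simulate_dataset(theta, n_sites, rng):
    """Toy stand-in for a simulator: each site is 1 with probability theta."""
    return [1 if rng.random() < theta else 0 for _ in range(n_sites)]


def summary_statistic(dataset):
    """Toy statistic T(X): the fraction of sites in state 1."""
    return sum(dataset) / len(dataset)


def posterior_predictive_statistics(posterior_draws, n_sites, rng):
    """Simulate one null data set per posterior draw of theta, then summarize."""
    return [
        summary_statistic(simulate_dataset(theta, n_sites, rng))
        for theta in posterior_draws
    ]


def parametric_bootstrap_statistics(point_estimate, n_replicates, n_sites, rng):
    """Simulate every null data set from one point estimate of theta."""
    return [
        summary_statistic(simulate_dataset(point_estimate, n_sites, rng))
        for _ in range(n_replicates)
    ]


rng = random.Random(1)
posterior_draws = [0.2, 0.3, 0.4, 0.5]  # hypothetical draws from p(theta | X)
predictive = posterior_predictive_statistics(posterior_draws, 100, rng)
bootstrap = parametric_bootstrap_statistics(0.35, 4, 100, rng)
```

In the predictive version the spread of the simulated statistics reflects uncertainty in θ as well as sampling noise, whereas the bootstrap version reflects sampling noise only.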
How SIMMAP 1.5 evaluates posterior predictive p-values
P-values are determined by where the observed statistic falls within the predictive distribution (tail-area probabilities). This approach asks whether the observed value could have been produced simply by chance, i.e., could a value as extreme as the observed be simply due to chance. SIMMAP 1.5 attempts to be intelligent in deciding in which direction (positive or negative) the test should be performed. The overall statistics, D and M, are always positive, so the test asks whether the predictive values are greater than or equal to the observed values. However, the individual state statistics, mij and dij, can range from -1 to 1. Therefore, if the observed value is positive the test determines whether the predictive statistic is greater than or equal to the observed, while if the observed value is negative the test asks whether the predictive statistic is less than or equal to the observed. Given this approach, a significant p-value will be less than 0.05 (or whichever cut-off is desired, e.g., 0.01, etc.).
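The direction rule described above can be sketched as a small Python function. This is an illustration of the rule as stated in the text, not SIMMAP's code: the function name and its signed flag are hypothetical, and treating an observed value of exactly zero as an upper-tail test is an assumption, since the text does not cover that case.

```python
def predictive_p_value(observed, predictive, signed=True):
    """Tail-area probability of the observed statistic.

    observed   -- the observed value of the statistic
    predictive -- values of the same statistic from the predictive replicates
    signed     -- True for statistics that can be negative (such as mij, dij),
                  False for statistics that are always positive (such as D, M)
    """
    if not predictive:
        raise ValueError("at least one predictive value is required")
    if signed and observed < 0:
        # Negative observed value: count predictive values <= observed.
        extreme = sum(1 for t in predictive if t <= observed)
    else:
        # Positive observed value (zero is handled here by ASSUMPTION):
        # count predictive values >= observed.
        extreme = sum(1 for t in predictive if t >= observed)
    return extreme / len(predictive)


p_upper = predictive_p_value(0.3, [0.1, 0.2, 0.3, 0.4])
p_lower = predictive_p_value(-0.3, [-0.4, -0.1, 0.0, 0.2])
```

In the first call two of the four predictive values are at least 0.3, and in the second call one of the four is at most -0.3.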
For most circumstances this formulation is appropriate. However, there are situations where the values in the predictive distribution are much larger than the observed value (say, a positive observed value), so that the observed value lies below the lower tail of the distribution. In this case it isn't exactly clear what the interpretation should be. This situation can only be detected by actually plotting the predictive distributions from the saved predictive statistics, which is recommended regardless. Are the observed values being depressed (i.e., are they constrained)? The user will need to make these determinations.
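As a quick numerical companion to plotting, the position of the observed value within the predictive distribution can be checked from both sides. The Python sketch below is hypothetical: the function name is invented, and it takes a plain list of predictive values, because the layout of the saved predictive statistics file is not described in this section.

```python
def tail_fractions(observed, predictive):
    """Return the fractions of predictive values below and above the observed."""
    n = len(predictive)
    if n == 0:
        raise ValueError("at least one predictive value is required")
    below = sum(1 for t in predictive if t < observed) / n
    above = sum(1 for t in predictive if t > observed) / n
    return below, above


# A positive observed value that sits under every predictive value.
below, above = tail_fractions(0.05, [0.4, 0.5, 0.6, 0.7])
```

When the observed value is positive and nearly all predictive values lie above it, as here, the one-directional p-value will be close to 1 and the result falls into the unclear case described above.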