This is a brief review of the concepts in inferential statistics to support the study of subjects such as statistical analysis in bioinformatics. For more detail, see the web pages at Khan Academy, e.g. https://www.khanacademy.org/math/probability/statistics-inferential, and a good statistics text, such as Norman and Streiner, "Biostatistics: The Bare Essentials", 4th ed., People's Medical Publishing House, Shelton, CT, 2014, ISBN 978-1-60795-178-0.
Inferential statistics is concerned with quantitative ways to estimate the probabilities of hypotheses related to data. In general, the hypotheses are based on models of the systems that may or may not have produced the data, including hypotheses about the relevant underlying probability distributions.
In bioinformatics, many of the models describe mechanisms relating to macromolecular sequences, including models of evolutionary development.
The major topics are: probabilistic models of sequence data, statistical inference, regression analysis, correlation, and non-parametric statistics, including survival analysis.
Much of the data in bioinformatics is "sequence data" because many biologically important macromolecules are based on linear sequences of monomers put together like a string of pearls that has been folded to expose some of those monomers and bury others. Proteins are composed of strands of amino acid residues, and DNAs and RNAs are composed of strands of nucleotides (nucleic acid residues), in each case like a text made from a restricted alphabet of letters.
The nucleotides in RNA are Cytosine, Guanine, Adenine and Uracil (CGAU). The nucleotides in DNA are Cytosine, Guanine, Thymine and Adenine (CGTA). The major amino acid residues from which proteins are made are alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine, with 3-letter codes ala, arg, asn, asp, cys, glu, gln, gly, his, ile, leu, lys, met, phe, pro, ser, thr, trp, tyr, and val (A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V). The processes of evolution and mutation can be thought of as changing one or more letters in very long words made from these letters. The most popular and effective models used to estimate the probabilities of such changes in nucleotide sequences are the Jukes-Cantor and Kimura models (see http://nematodes.org/teaching/tutorials/phylogenetics/Bayesian_Workshop/PDFs/Kimura%20J%20Mol%20Evol%201980.pdf). For proteins, heavy use is made of Hidden Markov Models (see https://cbse.soe.ucsc.edu/sites/default/files/hmm.all_.pdf).
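To make the Jukes-Cantor model concrete, here is a minimal Python sketch applying the standard Jukes-Cantor distance formula, $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the observed proportion of differing sites between two aligned sequences. The example sequences are made-up for illustration:

```python
import math

def jukes_cantor_distance(seq1: str, seq2: str) -> float:
    """Estimate evolutionary distance (substitutions per site) between
    two aligned nucleotide sequences under the Jukes-Cantor model."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # p is the observed proportion of sites that differ
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    p = diffs / len(seq1)
    if p >= 0.75:
        raise ValueError("p >= 3/4: sequences too diverged for this model")
    # The correction accounts for multiple substitutions at the same
    # site, which the raw proportion p underestimates.
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

# Example: two short aligned DNA fragments (made-up data)
print(jukes_cantor_distance("GATTACA", "GACTATA"))  # p = 2/7
```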
Having useful models, we need statistical tools to see how well the data we have fits those models. We distinguish the data instances, or samples, from the theoretical population of data implied by the models we are testing. The process we follow in inferential statistics is to take the data instances we have, organize them into groups appropriate to the hypothesis being tested, and estimate the probability that these samples are (or are not) consistent with the hypothesis, as well as the probability that these samples were encountered purely by chance. We do this by computing various descriptive statistics of the groups into which we organized the samples. We are especially interested in means and variances, but we need to deal with two different sets of descriptive statistics: those computed from the samples themselves and those describing the theoretical population.
See https://en.wikipedia.org/wiki/Statistical_inference
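As a concrete illustration of the sample-versus-population distinction, here is a short Python sketch (with made-up measurements) contrasting the population variance, which divides by $N$, with the sample variance, which divides by $n - 1$ (Bessel's correction) to give an unbiased estimate of the population variance:

```python
def mean(xs):
    return sum(xs) / len(xs)

def population_variance(xs):
    # Divide by N: appropriate when xs is the entire population.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    # Divide by n - 1 (Bessel's correction): an unbiased estimate of
    # the population variance when xs is only a sample drawn from it.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [4.1, 5.0, 5.9, 4.7, 5.3]  # made-up measurements
print(mean(data), sample_variance(data), population_variance(data))
```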
Regression analysis is a statistical approach to estimating relationships among variables. Some variables are treated as independent variables. The rest are tested for being suitable as dependent variables, i.e. as variables for which there is a functional dependence on the independent variables. The parameters of that functional dependence (usually the coefficients of a line or hyperplane that best fits the data) are set at values that minimize the sum of the squares of the lengths of the vector differences between the data instance values of the dependent variables and the values that the assumed functional dependence would assign to those dependent variables on the basis of the values of the independent variables for that data instance. This is called a least squares fit. The descriptive statistics of the errors in the resulting fit are used to infer the goodness of the fit. The most commonly used measure of goodness of fit is $r^2$, the coefficient of determination, which is the portion of the variance of the dependent variables that is accounted for by the variance of the independent variables. In the case of one independent variable $x$, one dependent variable $y$, and a linear fit, we can take the sum of the squares of the differences of the predicted values of $y$ from the mean of $y$, $SS_{reg}$, and divide by the total sum of squares, which is $SS_{reg}$ plus the sum of the squares of the differences between the actual and predicted values of $y$, $SS_{res}$, so that $r^2 = \frac{SS_{reg}}{SS_{reg} + SS_{res}}$.
See https://en.wikipedia.org/wiki/Regression_analysis and the links from that page.
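The following Python sketch (using made-up data) illustrates a least squares fit of a line to paired data and the computation of $r^2$ as $SS_{reg}/(SS_{reg} + SS_{res})$, matching the formula above:

```python
def linear_least_squares(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

def r_squared(xs, ys):
    a, b = linear_least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    preds = [a + b * x for x in xs]
    ss_reg = sum((p - y_bar) ** 2 for p in preds)          # explained
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # residual
    return ss_reg / (ss_reg + ss_res)                      # = SS_reg / SS_tot

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data, roughly y = 2x
print(r_squared(xs, ys))  # close to 1 for a near-linear relationship
```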
The square root of the coefficient of determination is an example of a correlation coefficient. Correlation refers to the hypothesis that two variables or sets of data have a linear relationship to each other. It is a mistake to assume that a demonstration of correlation shows that one variable depends on the other. Otherwise we could just as easily say that lung cancer causes smoking and other environmental insults to lung tissue as we can say that smoking and other environmental insults to lung tissue cause lung cancer. See https://en.wikipedia.org/wiki/Correlation_and_dependence
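For comparison with the regression sketch above, here is a minimal Python computation of the Pearson correlation coefficient on the same made-up data; for a simple linear fit, its square equals the coefficient of determination:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: the covariance of x and y
    divided by the product of their standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(xs, ys)
print(r, r ** 2)  # r**2 matches r_squared from the regression sketch
```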
For this course, the most important case of dealing with non-parametric statistics is survival analysis. In doing this we must be careful in our choice of how to measure survival. Survival measured from the onset of treatment, for example, creates a bias (lead-time bias) in favor of extensive use of screening tests, which is best removed by measuring survival in terms of age at death.
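As one standard non-parametric tool for survival analysis (not the only choice), here is a minimal Python sketch of the Kaplan-Meier product-limit estimator of the survival function; the follow-up times and censoring flags are made-up for illustration:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate of the survival function.
    times:  time of death or censoring for each subject
    events: True if death was observed at that time, False if censored
    Returns a list of (time, estimated survival probability) steps."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        n_at_t = 0
        # Group all subjects sharing the same event/censoring time
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t
    return curve

# Made-up follow-up data: months observed, and whether death was observed
times  = [3, 5, 5, 8, 12, 12, 15]
events = [True, True, False, True, True, False, False]
print(kaplan_meier(times, events))
```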