Statistical Inference, a Brief Introduction

© Copyright 2008, 2012, 2017 Herbert J. Bernstein

This is a brief introduction to the concepts of statistical inference. Descriptive statistics (https://en.wikipedia.org/wiki/Descriptive_statistics) provide numerical descriptors for sets of data. Inferential statistics use descriptive statistics to estimate the probability of the truth of a hypothesis.

Models, Truth and Probabilities

In order to make inferences with statistics, we start with data and models of the systems we wish to test as the generators of that data. We pose a testable hypothesis about the mechanisms of those models; i.e. a hypothesis which might be disproven by examination of the data, or perhaps by collecting additional sample data instances. In other words, we try to find a proof of the null hypothesis, the hypothesis that the original hypothesis (the alternative hypothesis) is false.

If the data demonstrates that the hypothesis is absolutely contradicted, then the null hypothesis is proven. More often the data will allow us to estimate a probability of the null hypothesis being true and to estimate a probability of the alternative hypothesis being true. We need some criteria to apply to those probabilities. For sociological data, the most common criterion for rejecting a hypothesis is that the probability of it being true is less than 5% (Z < -1.645 for a one-sided tail, |Z| > 1.96 for two-sided tails). In the hard sciences, tighter limits are the norm, typically 1% (Z < -2.326 vs. |Z| > 2.576), 0.1% (Z < -3.09 vs. |Z| > 3.29), or 0.01% (Z < -3.72 vs. |Z| > 3.89). For some serious engineering problems, the standard is six sigma, giving a probability of less than 0.0000001%.
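These cutoffs come directly from the standard normal distribution. The following is a minimal sketch, assuming SciPy is available (the text itself does not name any software), of reproducing the one-sided and two-sided critical Z values from the significance level:

    # Reproduce the critical Z values quoted above from the significance level.
    from scipy.stats import norm

    for alpha in (0.05, 0.01, 0.001, 0.0001):
        z_one_sided = norm.ppf(alpha)            # e.g. -1.645 for 5%
        z_two_sided = norm.ppf(1.0 - alpha / 2)  # e.g.  1.960 for 5%
        print(f"alpha = {alpha:>7}: one-sided Z < {z_one_sided:.3f}, "
              f"two-sided |Z| > {z_two_sided:.3f}")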

Experiments produce the data from real systems. A real system could be producing data according to the model under the null hypothesis, H0, or according to the model under the alternative hypothesis, H1. Once we have enough data to estimate probabilities, we can estimate α, the probability that a sample drawn under H0 supports H1, and β, the probability that a sample drawn under H1 supports H0.
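As a hedged illustration of these two error probabilities (the text does not specify particular distributions), the sketch below draws sample means under a hypothetical H0 (population mean 0) and H1 (population mean 1) and counts how often each falls on the wrong side of an arbitrary decision cutoff:

    # Estimate alpha and beta by simulation under invented H0 and H1 models.
    import random
    import statistics

    random.seed(0)
    n, trials, cutoff = 16, 10_000, 0.5   # cutoff chosen for illustration only

    def sample_mean(mu):
        # mean of n draws from a normal population with mean mu, sd 1
        return statistics.fmean(random.gauss(mu, 1.0) for _ in range(n))

    alpha = sum(sample_mean(0.0) > cutoff for _ in range(trials)) / trials
    beta  = sum(sample_mean(1.0) <= cutoff for _ in range(trials)) / trials
    print(f"estimated alpha ~ {alpha:.3f}, estimated beta ~ {beta:.3f}")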

In most cases we cannot see the actual population. We only get to estimate the population probability distributions from sample distributions, i.e. we only get to estimate the probability of truth, not to know the truth.

See https://en.wikipedia.org/wiki/Statistical_inference

For more detail see the web pages in the Khan Academy, e.g. https://www.khanacademy.org/math/probability/statistics-inferential, and see a good statistics text, such as Norman and Streiner, "Biostatistics: The Bare Essentials", 4th ed., People's Medical Publishing House, Shelton, CT, 2014, ISBN 978-1-60795-178-0.

Inferential statistics are concerned with quantitative ways in which to estimate probabilities of hypotheses related to data. In general the hypotheses are based on models of the systems that may or may not have produced the data, including hypotheses about the relevant underlying probability distributions.

Many of the hypotheses are concerned with whether different populations form separate clusters or belong to the same cluster. In the simplest cases we will identify clusters by their centroids (their means) and call clusters the same if the distances among the centroids are within some reasonable number of standard deviations of one another. The general class of techniques used is called analysis of variance.
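For instance, a one-way analysis of variance asks whether several group means could plausibly belong to a single cluster. The sketch below uses SciPy's f_oneway on invented data; neither the function nor the numbers come from the text:

    # One-way analysis of variance on three hypothetical groups.
    from scipy.stats import f_oneway

    group_a = [5.1, 4.9, 5.4, 5.0, 5.2]
    group_b = [5.0, 5.3, 4.8, 5.1, 5.2]
    group_c = [6.1, 6.4, 5.9, 6.2, 6.0]

    f_stat, p_value = f_oneway(group_a, group_b, group_c)
    print(f"F = {f_stat:.2f}, p = {p_value:.4f}")  # small p suggests the means differ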

In bioinformatics, many of the models describe mechanisms relating to macromolecular sequences, including models of evolutionary development.

The major topics are: Evolutionary Sequence Models, Statistical Inference, Regression and Correlation, and Non-Parametric Statistics.

Evolutionary Sequence Models

Much of the data in bioinformatics is "sequence data" because many biologically important macromolecules are based on linear sequences of monomers put together like a string of pearls that has been folded to expose some of those monomers and bury others. Proteins are composed of strands of amino acid residues, and DNAs and RNAs are composed of strands of nucleotides (nucleic acid residues), in each case like a text made from a restricted alphabet of letters.

The nucleotides in RNA are Cytosine, Guanine, Adenine and Uracil (CGAU). The nucleotides in DNA are Cytosine, Guanine, Thymine and Adenine (CGTA). The major amino acid residues from which proteins are made are alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine, with 3-letter codes ala, arg, asn, asp, cys, glu, gln, gly, his, ile, leu, lys, met, phe, pro, ser, thr, trp, tyr, and val (A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V). The processes of evolution and mutation can be thought of as changing one or more letters in very long words made from these letters. The most popular and effective models used to estimate the probabilities of such changes in nucleotide sequences are the Jukes-Cantor and Kimura models (see http://nematodes.org/teaching/tutorials/phylogenetics/Bayesian_Workshop/PDFs/Kimura%20J%20Mol%20Evol%201980.pdf). For proteins, heavy use is made of Hidden Markov Models (see https://cbse.soe.ucsc.edu/sites/default/files/hmm.all_.pdf).
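As one concrete example, the Jukes-Cantor model leads to the corrected distance $d = -\frac{3}{4}\ln(1 - \frac{4}{3}p)$, where $p$ is the observed fraction of mismatched sites between two aligned sequences. The sketch below implements that formula; the example sequences are invented:

    # Jukes-Cantor corrected distance between two aligned nucleotide sequences.
    import math

    def jukes_cantor_distance(seq1, seq2):
        """d = -(3/4) * ln(1 - (4/3) * p), where p is the observed fraction of
        mismatched sites between two equal-length aligned sequences."""
        if len(seq1) != len(seq2):
            raise ValueError("sequences must be aligned to the same length")
        p = sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)
        if p >= 0.75:
            raise ValueError("p >= 3/4: too diverged for this correction")
        return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)

    print(jukes_cantor_distance("ACGTACGTACGT", "ACGTACGAACGA"))  # ~0.19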

Statistical Inference

Having useful models, we need statistical tools to see how well the data we have fits those models. We distinguish the data instances or samples from the theoretical population of data implied by the models we are testing. The process we follow in inferential statistics is to take the data instances we have, organize them into groups appropriate to the hypothesis being tested, and estimate the probability that these samples are (or are not) consistent with the hypothesis and the probability that these samples were encountered purely by chance. We do this by computing various descriptive statistics of the various groups into which we organized the samples. We are especially interested in means and variances, but we need to deal with two different sets of descriptive statistics: those of the samples we actually have and those of the underlying population they are meant to estimate.

The sample descriptive statistics are used as estimators of the population statistics. It must always be remembered that the estimates are not likely to be exact, so we need to estimate the errors in the sample statistics, e.g. the estimated standard error of the mean, which is computed by dividing the sample standard deviation by the square root of the number of samples.
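A minimal sketch of that computation on a hypothetical set of samples (Python's statistics module supplies the sample mean and standard deviation):

    # Estimated standard error of the mean: sample sd / sqrt(number of samples).
    import math
    import statistics

    samples = [4.8, 5.1, 5.0, 5.3, 4.9, 5.2]   # hypothetical data instances
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)             # sample (n-1) standard deviation
    se = sd / math.sqrt(len(samples))
    print(f"mean = {mean:.3f}, sd = {sd:.3f}, standard error = {se:.3f}")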

Regression and Correlation

Regression analysis is a statistical approach to estimating relationships among variables. Some variables are treated as independent variables. The rest are tested for being suitable as dependent variables, i.e. as variables for which there is a functional dependence on the independent variables. The parameters of that functional dependence (usually the coefficients of a line or hyperplane that best fits the data) are set at the values that minimize the sum of the squares of the lengths of the vector differences between the data instance values of the dependent variables and the values that the assumed functional dependence would assign to those dependent variables on the basis of the values of the independent variables for that data instance. This is called a least squares fit. The descriptive statistics of the errors in the resulting fit are used to infer the goodness of the fit. The most commonly used measure of goodness of fit is $r^2$, the coefficient of determination, which is the portion of the variance of the dependent variables that is accounted for by the variance of the independent variables. In the case of one independent variable $x$, one dependent variable $y$, and a linear fit, we can take the sum of the squares of the differences of the predicted values of $y$ from the mean of the predicted $y$, $SS_{reg}$ ($reg$ for regression), and divide by that same quantity added to the sum of the squares of the differences of the actual and predicted values of $y$, $SS_{res}$ ($res$ for residual), so that

$$r^2 = \frac{SS_{reg}}{SS_{res}+SS_{reg}}$$
A value of 1 indicates a very good fit; a value of 0 indicates an extremely poor fit. $r^2$ is a measure of similarity, rather than a measure of distance.
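A hedged sketch of a one-variable least squares fit with $r^2$ computed exactly as above, from $SS_{reg}$ and $SS_{res}$; the data are invented, and statistics.linear_regression requires Python 3.10 or later:

    # Least squares line and coefficient of determination for one x and one y.
    import statistics

    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [2.1, 3.9, 6.2, 7.8, 10.1]

    # slope and intercept of the least squares line y = intercept + slope * x
    slope, intercept = statistics.linear_regression(x, y)
    y_hat = [intercept + slope * xi for xi in x]
    y_hat_mean = statistics.fmean(y_hat)

    ss_reg = sum((yh - y_hat_mean) ** 2 for yh in y_hat)        # explained by the fit
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # residual
    r_squared = ss_reg / (ss_res + ss_reg)
    print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, r^2 = {r_squared:.4f}")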

When we don't know which variables are independent and which are dependent, we can try different choices and see which result in the highest $r^2$. See Allocating_Sum_of_Squares_in_Multiple_Regression.

See https://en.wikipedia.org/wiki/Regression_analysis and the links from that page.

The square root ($r$) of the coefficient of determination ($r^2$) is an example of a correlation coefficient. Correlation refers to the hypothesis that two variables or sets of data have a linear relationship to each other. It is a mistake to assume that a demonstration of correlation shows that one variable depends on the other. Otherwise we could just as easily say that lung cancer causes smoking and other environmental insults to lung tissue as we can say that smoking and other environmental insults to lung tissue cause lung cancer. See https://en.wikipedia.org/wiki/Correlation_and_dependence

Non-Parametric Statistics

The law of large numbers allows us to use normal distributions and the usual mean and variance for much of what we do in statistics, but there are times when other statistical distributions are more appropriate, or when there is not enough information to use any particular statistical distribution as a model. We are then in the realm of non-parametric statistics. In non-parametric statistics we do not assume we have parameters such as a mean and standard deviation. We learn what the meaningful parameters are from a training set. See https://en.wikipedia.org/wiki/Nonparametric_statistics and the fourth section of the book.
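One common rank-based example (chosen here for illustration; the text does not single it out) is the Mann-Whitney U test, which compares two samples without assuming either is normally distributed:

    # Mann-Whitney U test on two hypothetical samples (requires SciPy).
    from scipy.stats import mannwhitneyu

    group_a = [1.2, 3.4, 2.2, 5.1, 2.8, 4.0]   # hypothetical measurements
    group_b = [6.3, 5.9, 7.2, 6.8, 8.1, 5.5]

    u_stat, p_value = mannwhitneyu(group_a, group_b, alternative="two-sided")
    print(f"U = {u_stat}, p = {p_value:.4f}")  # small p suggests the groups differ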

For this course, the most important case of dealing with non-parametric statistics is survival analysis. In doing this we must be careful in our choice of how to measure survival. Survival measured from onset of treatment, for example, creates a bias in favor of extensive use of screening tests, which is best removed by measuring survival in terms of age at death.
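A standard tool for survival data is the Kaplan-Meier estimator, which handles censored subjects (those still alive, or lost to follow-up, at the end of observation). The sketch below is a minimal implementation on invented times and censoring flags; the text does not prescribe this particular estimator:

    # Minimal Kaplan-Meier survival estimate on invented data.
    def kaplan_meier(times, events):
        """Return (time, survival probability) pairs.
        times: observation times; events: 1 = death observed, 0 = censored.
        Subjects are processed one at a time in time order (stable for ties)."""
        order = sorted(range(len(times)), key=lambda i: times[i])
        at_risk = len(times)
        survival = 1.0
        curve = []
        for i in order:
            if events[i]:                   # a death reduces the survival estimate
                survival *= (at_risk - 1) / at_risk
                curve.append((times[i], survival))
            at_risk -= 1                    # censored subjects just leave the risk set
        return curve

    times  = [2, 3, 3, 5, 7, 8, 11]
    events = [1, 1, 0, 1, 0, 1, 1]
    print(kaplan_meier(times, events))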

