This is a brief review of the concepts in inferential statistics to support the study of subjects such as statistical analysis in bioinformatics. For more detail, see the web pages at Khan Academy, e.g. https://www.khanacademy.org/math/probability/statistics-inferential, and a good statistics text, such as Norman and Streiner, "Biostatistics: The Bare Essentials", 4th ed., People's Medical Publishing House, Shelton, CT, 2014, ISBN 978-1-60795-178-0.
Inferential statistics is concerned with quantitative ways to estimate the probabilities of hypotheses related to data. In general, the hypotheses are based on models of the systems that may or may not have produced the data, including hypotheses about the relevant underlying probability distributions.
In bioinformatics, many of the models describe mechanisms relating to macromolecular sequences, including models of evolutionary development.
The major topics are: probabilistic models of sequence data, statistical inference, regression analysis, correlation, and non-parametric statistics, including survival analysis.
Much of the data in bioinformatics is "sequence data" because many biologically important macromolecules are based on linear sequences of monomers put together like a string of pearls that has been folded to expose some of those monomers and bury others. Proteins are composed of strands of amino acid residues, and DNAs and RNAs are composed of strands of nucleotides (nucleic acid residues), in each case like a text made from a restricted alphabet of letters.
The nucleotides in RNA are Cytosine, Guanine, Adenine and Uracil (CGAU). The nucleotides in DNA are Cytosine, Guanine, Thymine and Adenine (CGTA). The major amino acid residues from which proteins are made are alanine, arginine, asparagine, aspartic acid, cysteine, glutamic acid, glutamine, glycine, histidine, isoleucine, leucine, lysine, methionine, phenylalanine, proline, serine, threonine, tryptophan, tyrosine, and valine, with 3-letter codes ala, arg, asn, asp, cys, glu, gln, gly, his, ile, leu, lys, met, phe, pro, ser, thr, trp, tyr, and val (A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V). The processes of evolution and mutation can be thought of as changing one or more letters in very long words made from these letters. The most popular and effective models used to estimate the probabilities of such changes in nucleotide sequences are the Jukes-Cantor and Kimura models (see http://nematodes.org/teaching/tutorials/phylogenetics/Bayesian_Workshop/PDFs/Kimura%20J%20Mol%20Evol%201980.pdf). For proteins, heavy use is made of Hidden Markov Models (see https://cbse.soe.ucsc.edu/sites/default/files/hmm.all_.pdf).
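To make the Jukes-Cantor model concrete, here is a minimal Python sketch applying the standard Jukes-Cantor distance formula, $d = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$, where $p$ is the observed proportion of differing sites between two aligned sequences. The example sequences are made-up for illustration:

```python
import math

def jukes_cantor_distance(seq1: str, seq2: str) -> float:
    """Estimate evolutionary distance (substitutions per site) between
    two aligned nucleotide sequences under the Jukes-Cantor model."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    # p is the observed proportion of sites that differ
    diffs = sum(1 for a, b in zip(seq1, seq2) if a != b)
    p = diffs / len(seq1)
    if p >= 0.75:
        raise ValueError("p >= 3/4: sequences too diverged for this model")
    # The correction accounts for multiple substitutions at the same
    # site, which the raw proportion p underestimates.
    return -0.75 * math.log(1 - (4.0 / 3.0) * p)

# Example: two short aligned DNA fragments (made-up data)
print(jukes_cantor_distance("GATTACA", "GACTATA"))  # p = 2/7
```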
Having useful models, we need statistical tools to see how well the data we have fits those models. We distinguish the data instances, or samples, from the theoretical population of data implied by the models we are testing. The process we follow in inferential statistics is to take the data instances we have, organize them into groups appropriate to the hypothesis being tested, and estimate the probability that these samples are (or are not) consistent with the hypothesis, as well as the probability that these samples were encountered purely by chance. We do this by computing various descriptive statistics of the groups into which we organized the samples. We are especially interested in means and variances, but we need to deal with two different sets of descriptive statistics: those computed from the samples themselves and those describing the theoretical population.
See https://en.wikipedia.org/wiki/Statistical_inference
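As a concrete illustration of the sample-versus-population distinction, here is a short Python sketch (with made-up measurements) contrasting the population variance, which divides by $N$, with the sample variance, which divides by $n - 1$ (Bessel's correction) to give an unbiased estimate of the population variance:

```python
def mean(xs):
    return sum(xs) / len(xs)

def population_variance(xs):
    # Divide by N: appropriate when xs is the entire population.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def sample_variance(xs):
    # Divide by n - 1 (Bessel's correction): an unbiased estimate of
    # the population variance when xs is only a sample drawn from it.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

data = [4.1, 5.0, 5.9, 4.7, 5.3]  # made-up measurements
print(mean(data), sample_variance(data), population_variance(data))
```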
Regression analysis is a statistical approach to estimating relationships among variables. Some variables are treated as independent variables. The rest are tested for being suitable as dependent variables, i.e. as variables for which there is a functional dependence on the independent variables. The parameters of that functional dependence (usually the coefficients of a line or hyperplane that best fits the data) are set at values that minimize the sum of the squares of the lengths of the vector differences between the data instance values of the dependent variables and the values that the assumed functional dependence would assign to those dependent variables on the basis of the values of the independent variables for that data instance. This is called a least squares fit. The descriptive statistics of the errors in the resulting fit are used to infer the goodness of the fit. The most commonly used measure of goodness of fit is $r^2$, the coefficient of determination, which is the portion of the variance of the dependent variables that is accounted for by the variance of the independent variables. In the case of one independent variable $x$, one dependent variable $y$, and a linear fit, we can take the sum of the squares of the differences of the predicted values of $y$ from the mean of $y$, $SS_{reg}$, and divide by the total sum of squares, which is $SS_{reg}$ plus the sum of the squares of the differences between the actual and predicted values of $y$, $SS_{res}$, so that $r^2 = \frac{SS_{reg}}{SS_{reg} + SS_{res}}$.
See https://en.wikipedia.org/wiki/Regression_analysis and the links from that page.
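The following Python sketch (using made-up data) illustrates a least squares fit of a line to paired data and the computation of $r^2$ as $SS_{reg}/(SS_{reg} + SS_{res})$, matching the formula above:

```python
def linear_least_squares(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
        sum((x - x_bar) ** 2 for x in xs)
    a = y_bar - b * x_bar
    return a, b

def r_squared(xs, ys):
    a, b = linear_least_squares(xs, ys)
    y_bar = sum(ys) / len(ys)
    preds = [a + b * x for x in xs]
    ss_reg = sum((p - y_bar) ** 2 for p in preds)          # explained
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))  # residual
    return ss_reg / (ss_reg + ss_res)                      # = SS_reg / SS_tot

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # made-up data, roughly y = 2x
print(r_squared(xs, ys))  # close to 1 for a near-linear relationship
```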
The square root of the coefficient of determination is an example of a correlation coefficient. Correlation refers to the hypothesis that two variables or sets of data have a linear relationship to each other. It is a mistake to assume that a demonstration of correlation shows that one variable depends on the other. Otherwise we could just as easily say that lung cancer causes smoking and other environmental insults to lung tissue as we can say that smoking and other environmental insults to lung tissue cause lung cancer. See https://en.wikipedia.org/wiki/Correlation_and_dependence
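For comparison with the regression sketch above, here is a minimal Python computation of the Pearson correlation coefficient on the same made-up data; for a simple linear fit, its square equals the coefficient of determination:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient: the covariance of x and y
    divided by the product of their standard deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - x_bar) ** 2 for x in xs))
    sy = math.sqrt(sum((y - y_bar) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
r = pearson_r(xs, ys)
print(r, r ** 2)  # r**2 matches r_squared from the regression sketch
```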
For this course, the most important case of dealing with non-parametric statistics is survival analysis. In doing this we must be careful in our choice of how to measure survival. Survival measured from the onset of treatment, for example, creates a bias (lead-time bias) in favor of extensive use of screening tests, which is best removed by measuring survival in terms of age at death.
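As one standard non-parametric tool for survival analysis (not the only choice), here is a minimal Python sketch of the Kaplan-Meier product-limit estimator of the survival function; the follow-up times and censoring flags are made-up for illustration:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier product-limit estimate of the survival function.
    times:  time of death or censoring for each subject
    events: True if death was observed at that time, False if censored
    Returns a list of (time, estimated survival probability) steps."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        n_at_t = 0
        # Group all subjects sharing the same event/censoring time
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]
            n_at_t += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= n_at_t
    return curve

# Made-up follow-up data: months observed, and whether death was observed
times  = [3, 5, 5, 8, 12, 12, 15]
events = [True, True, False, True, True, False, False]
print(kaplan_meier(times, events))
```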