Review of Descriptive Statistics

© Copyright Herbert J. Bernstein 2015

This is a brief review of the concepts in descriptive statistics to support study of subjects such as statistical analysis in bioinformatics. For more detail see the Khan Academy web pages, e.g. https://www.khanacademy.org/math/probability/descriptive-statistics, and see a good statistics text, such as Norman and Streiner, "Biostatistics: The Bare Essentials", 4th ed., People's Medical Publishing House, Shelton, CT, 2014, ISBN 978-1-60795-178-0.

Descriptive statistics are concerned with quantitative ways in which to present and describe data that may be useful in estimating probabilities of hypotheses related to that data. Even if the data itself is not numeric, we can derive numbers from the data by classifying the data and using counts of populations in various classes as numeric data.

The major topics are:

Variables and Dimension

Variables are the descriptors of data. We may be concerned with residue type or carcinogenicity or color or molecular weight or binding affinity, etc. The values that are permitted for a variable may be numeric (quantitative) or non-numeric (nominal -- just providing a name for the data value). You can think of a variable as a container for values. In statistics we usually (but not always) limit the values that a single variable may contain to something simple and measurable, such as an integer or a real number or a one-word name, but more complex variables, such as complex numbers, quaternions, or complete sequences, can also be handled with greater or lesser effectiveness.

In a set of data, the values of variables that were all collected at the same time or from the same individual, or that are otherwise closely enough related to be considered to belong together as essentially a single point in the total space of all the data, are called a data instance, and the number of variables involved in an instance is the dimension.

If the data is numeric, we can often apply the terminology and techniques of linear algebra to the data, treating each data instance as a vector. See Linear_Algebra_Module.
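As a minimal sketch of treating data instances as vectors, the following Python fragment stores each instance as a tuple and computes the dimension and a component-wise average. The variables (molecular weight and binding affinity), the values, and the helper names `add` and `scale` are all made up for illustration.

```python
# Hypothetical data: each instance holds two variables,
# (molecular weight in Da, binding affinity in nM).
instances = [
    (180.2, 12.0),
    (342.3, 4.5),
    (505.6, 30.0),
]

# The dimension is the number of variables in one instance.
dimension = len(instances[0])


def add(u, v):
    """Component-wise sum of two instances treated as vectors."""
    return tuple(a + b for a, b in zip(u, v))


def scale(c, v):
    """Multiply every component of the vector v by the scalar c."""
    return tuple(c * a for a in v)


# Vector sum of all instances, then scaled by 1/n to give the centroid.
total = instances[0]
for inst in instances[1:]:
    total = add(total, inst)
centroid = scale(1 / len(instances), total)

print(dimension, centroid)
```

Only vector addition and scalar multiplication are used here, which is why the same idea carries over unchanged to instances of any dimension.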

Classification, Frequencies

Sometimes each data instance is unique, and the only classification we can apply is to say that each data instance is unique. More often, however, we can establish some overall classification. For example, we may have a variable containing molecular weights of molecules and classify data instances into small molecules versus macromolecules, or have a variable containing sequences and classify data into bins containing molecules of high sequence homology, or have another variable containing 3D coordinates and classify data by high structural homology, or have yet another variable with a list of active sites in a data instance molecule and classify data by structural homology of active sites, or have still another variable with a list of molecular functions and classify data by function.

If the classification for a given variable is numeric, we have the possibility of evaluating measures such as central tendency and dispersion, but if it is not numeric, or if the numbers are not appropriate for such things as computing averages because the classification is either nominal or does not have simple linear relationships among the possible values, we may need to confine our attention to counting the numbers of data instances in various classification bins, i.e. computing frequencies.

For one-dimensional data, simple histograms of the frequencies can be very descriptive. For two-dimensional data, 3-D perspective bar charts can be effective. For higher-dimensional data, color and motion can help to carry the other dimensions, but past 4 dimensions, visual techniques become problematic. Both for frequency data and for variables that provide numerical values directly, application of numeric measures becomes essential.
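The small-molecule versus macromolecule classification can be sketched in Python as follows. The weights and the 1000 Da cutoff are assumptions chosen only for illustration, and `classify` and `text_histogram` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical molecular weights in daltons.
weights = [180, 342, 505, 14300, 66500, 890, 64500, 152]

# Assumed cutoff, chosen only for illustration.
CUTOFF = 1000


def classify(weight):
    """Assign a molecular weight to one of two nominal bins."""
    return "macromolecule" if weight >= CUTOFF else "small molecule"


# Frequencies: the number of data instances in each classification bin.
frequencies = Counter(classify(w) for w in weights)


def text_histogram(freqs):
    """Render one line per bin, with one '#' per data instance."""
    return [f"{label:15s} {'#' * count} ({count})"
            for label, count in sorted(freqs.items())]


for line in text_histogram(frequencies):
    print(line)
```

Once the data are reduced to frequencies, the bin labels themselves need not be numeric; only the counts are.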

Measures of Central Tendency and Dispersion

Some data is organized around a central cluster point. For variables from some simple numeric linear space, the average or (arithmetic) mean of the data values is a possible indicator of the location of such a cluster point. For more general, but quantitative, data, the value at or below which half the values lie is called the median. For nominal data we convert first to frequencies. If all frequencies are essentially the same, there is no central tendency in this data, but if some particular frequency is significantly larger than the others, the value that occurs with that frequency is called the mode.
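A short Python sketch of the three measures, using the standard `statistics` module for the mean and median and `collections.Counter` to convert nominal data to frequencies. The numeric values and residue names are invented for illustration.

```python
import statistics
from collections import Counter

# Hypothetical numeric data: mean and median apply directly.
values = [2, 3, 3, 5, 7, 10]
mean = statistics.mean(values)      # sum of the values divided by their number
median = statistics.median(values)  # middle value (average of the two middle ones here)

# Hypothetical nominal data: convert to frequencies first.
residues = ["Ala", "Gly", "Gly", "Ser", "Gly", "Ala"]
freqs = Counter(residues)

# The mode is the value with the largest frequency.
mode, mode_count = freqs.most_common(1)[0]

print(mean, median, mode, mode_count)
```

Whether the largest frequency is "significantly" larger than the others is a judgment this sketch does not attempt; it simply reports the most frequent value.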

All the data may be at the mean, the median or the mode. Often it is more widely dispersed. In general, the range is a term for the set of possible values for a variable, often presented as an interval, but in biostatistics, when working with numeric data values, the range is a single number, the difference between the highest and lowest observed values for a variable, and the more general definition of range is used with modifiers. The interquartile range, interquintile range, etc. are based on dividing the ordered data into quarters, fifths, etc.; the interquartile range, for example, is the distance between the first and third quartiles and so spans the middle half of the data. Percentiles refer to dividing the ordered data into one hundred bins. The 5th percentile is the maximum of the lowest twentieth of the data values. The 95th percentile is the maximum of the data that remain after excluding the highest twentieth of the data values. These two percentiles give a sense of how dispersed the data is.

Another measure of dispersion is the mean deviation or mean absolute deviation (MAD), which is the average of the absolute deviations of the data from the mean. A better measure to use when combining variables in more than one dimension is the (estimated) standard deviation (ESD), which is the square root of the variance. The variance is the average of the squares of the deviations of the data from the mean. Unless the true value of the mean is known without using the data points themselves, in general it is best in computing either the MAD or the ESD to divide, not by the number of data instances, but by one less, to account for the fact that accuracy may have been lost by estimating the mean of the data from the data instances, rather than having the value of the mean a priori.
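The dispersion measures described here can be sketched in Python as follows. The sample is hypothetical, `percentile_max` is a made-up helper that uses a nearest-rank convention (statistical libraries often interpolate instead, so their results may differ), and the mean is assumed to be estimated from the data, so the divisor is one less than the number of instances.

```python
import math

# Hypothetical sample; the mean is estimated from the data themselves.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
n = len(data)
mean = sum(data) / n

# Range in the biostatistics sense: a single number.
data_range = max(data) - min(data)


def percentile_max(ordered, p):
    """Largest value among the lowest p percent of the ordered data.

    Nearest-rank convention, assumed here for illustration only.
    """
    k = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[k - 1]


ordered = sorted(data)
p5 = percentile_max(ordered, 5)
p95 = percentile_max(ordered, 95)

# Divide by n - 1 because the mean was estimated from the same data.
divisor = n - 1
mad = sum(abs(x - mean) for x in data) / divisor
variance = sum((x - mean) ** 2 for x in data) / divisor
esd = math.sqrt(variance)

print(data_range, p5, p95, mad, variance, esd)
```

If the mean were known a priori, the only change would be to set `divisor = n`.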

If the data is being described in terms of how well it fits a Gaussian distribution, two more statistics are used: skew (how asymmetric the data is around the mean, see https://en.wikipedia.org/wiki/Skewness) and kurtosis (how different the data is from the bell shape of a Gaussian, i.e. how peaked or flat it is, see https://en.wikipedia.org/wiki/Kurtosis).
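One common convention, assumed here, defines both statistics from central moments: skewness as the third central moment divided by the 1.5 power of the second, and excess kurtosis as the fourth central moment divided by the square of the second, minus 3 so that a Gaussian scores zero. The Python sketch below uses that convention without any small-sample correction; the function names and data are illustrative only.

```python
def central_moment(values, k):
    """Average of the k-th powers of the deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def skewness(values):
    """Moment-based skewness: zero for data symmetric about the mean."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a Gaussian would give zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric sample
right_tailed = [1, 1, 1, 2, 10]  # hypothetical sample with a long right tail

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_tailed))
```

A positive skewness indicates a longer tail toward high values, a negative one a longer tail toward low values.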

Multimodal Data

Not all data has a single cluster point. In many cases there are multiple cluster points for the data. We call such data multimodal. It can be very difficult to determine the number of appropriate clusters from the data itself. For example, when working with height and weight data from a mixed gender population, you are likely to see one mode for the male subjects and a very different mode for the female subjects. This suggests that one could form the necessary separate clusters on which to do descriptive statistics by carefully examining one variable at a time, finding one that shows distinct clusters, and then segregating the data on that basis. Most high-dimensional data does not permit such a simple approach. See the Wikipedia cluster analysis page https://en.wikipedia.org/wiki/Cluster_analysis for a starting point.
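As a minimal sketch of segregating data on the basis of one variable, the Python fragment below groups made-up height records by a recorded sex label and describes each group separately. The records and variable names are hypothetical, and the sketch assumes the segregating variable is already known.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sex, height in cm) records from a mixed population.
records = [("F", 160), ("M", 178), ("F", 164),
           ("M", 182), ("F", 162), ("M", 180)]

# Segregate the heights on the basis of the one distinguishing variable.
groups = defaultdict(list)
for sex, height in records:
    groups[sex].append(height)

# Describe each cluster separately, and the pooled data for comparison.
group_means = {label: mean(heights) for label, heights in groups.items()}
overall_mean = mean([height for _, height in records])

print(group_means, overall_mean)
```

In this made-up sample the pooled mean falls between the two group means, at a height that no individual subject actually has, which is why a single measure of central tendency describes multimodal data poorly.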

For even the simplest multimodal case, bimodal data, it is difficult to do more with descriptive statistics (as opposed to inferential statistics) than to histogram the data looking for multiple peaks separated by significant dips, because in order to describe multimodal data beyond that point you need a model (such as the sum of Gaussians) against which to fit the data.
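A rough Python sketch of histogramming data and looking for multiple peaks separated by significant dips follows. The bin layout, the `min_dip` threshold, the data, and all function names are assumptions made for illustration; what counts as a "significant" dip is left to the analyst.

```python
def histogram(values, low, width, nbins):
    """Count the values falling in nbins equal-width bins starting at low."""
    counts = [0] * nbins
    for v in values:
        i = int((v - low) // width)
        if 0 <= i < nbins:
            counts[i] += 1
    return counts


def local_maxima(counts):
    """Bins at least as tall as the left neighbor and taller than the right."""
    padded = [0] + counts + [0]
    return [i for i in range(len(counts))
            if padded[i + 1] > 0
            and padded[i + 1] >= padded[i]
            and padded[i + 1] > padded[i + 2]]


def separated_peaks(counts, min_dip):
    """Keep only maxima separated from the previous kept peak by a dip
    of at least min_dip counts below the lower of the two peaks."""
    kept = []
    for i in local_maxima(counts):
        if not kept:
            kept.append(i)
            continue
        prev = kept[-1]
        valley = min(counts[prev:i + 1])
        if min(counts[prev], counts[i]) - valley >= min_dip:
            kept.append(i)
        elif counts[i] > counts[prev]:
            kept[-1] = i  # same peak region; remember its tallest bin
    return kept


# Hypothetical heights in cm from a mixed population.
heights = [158, 160, 161, 162, 163, 164, 166,
           176, 178, 179, 180, 181, 182, 184]
counts = histogram(heights, low=155, width=5, nbins=6)
peaks = separated_peaks(counts, min_dip=2)
print(counts, peaks)
```

The result depends strongly on the bin width and on `min_dip`, so a sketch like this can suggest bimodality but does not replace fitting a model.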