A variable is a label for a container of information that may assume different values. A constant is a label for a container of information that may assume only one fixed value. For example, we might have a variable for height, a variable for weight, and a variable for gender. Think of these as column headings for a table of such values:
Height (in) | Weight (lb) | Gender |
---|---|---|
70 | 210 | Male |
65 | 130 | Female |
73.5 | 240 | Male |
68.5 | 240 | Male |
60 | 103 | Female |
... | ... | ... |
The contents of a variable might be the result of a calculation, an observed value, or something completely random. The contents may have a specific type, such as a string of letters, an integer, or a floating point number, or they may be unrestricted as to type. The attributes of a variable need to be specified.
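As a concrete illustration, here is a minimal sketch in Python of how the attributes of each variable, including its type, might be declared explicitly. The record and field names are illustrative assumptions matching the table above:

```python
from dataclasses import dataclass

@dataclass
class Person:
    height_in: float   # an observed value, stored as a floating point number
    weight_lb: float   # likewise a floating point measurement
    gender: str        # a label used just as a name, stored as a string

# Each row of the table becomes one typed record.
rows = [
    Person(70.0, 210.0, "Male"),
    Person(65.0, 130.0, "Female"),
    Person(73.5, 240.0, "Male"),
]
```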
Depending on the context, a variable might be something whose value we can simply specify -- an independent variable. Alternatively, the value of a variable may be the result of a calculation or of the operation of a system in a way that depends on other variables -- a dependent variable.
Data values may be constrained to a finite list of specific values -- an enumeration.
Data values may be constrained to some finite subset of the integers, or a finite subset of the real numbers, or some finite set of objects. All of these are enumerations. They are also discrete.
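A minimal sketch of such an enumeration in Python; the category names are illustrative assumptions taken from the table above:

```python
from enum import Enum

class Gender(Enum):
    MALE = "Male"
    FEMALE = "Female"

g = Gender("Female")    # look up a member by its value
print(g)                # Gender.FEMALE
print(list(Gender))     # the complete, finite list of allowed values
# Gender("Other") would raise ValueError: values are constrained to the list.
```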
Data values may be constrained to just the integers, or to just the non-negative whole numbers. These are also discrete, but not finite. They are countable.
Data values may be constrained to intervals of real numbers. These are infinite, but not countable. If only a single interval is involved, the values are continuous and uncountable. If more than one interval is involved, they are only piecewise continuous.
Even though all information can be represented as numbers, some data does not make use of the ordering of numbers, the ability to do arithmetic on numbers, or the ability to compute a distance between one number and another. Such values are intended to be used just as names. That is nominal data. Other data may be genuinely numeric.
For some data there is a clear ordering of the values. This may be the case for either nominal or numeric data. This is ordinal data.
For some data we do not just know an ordering; we know precisely how much bigger or smaller one value is compared to another. This is interval data. It is meaningful to subtract such values.
For some data we also know a zero-point. This is ratio data. It is meaningful to divide such data.
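A small sketch of the interval/ratio distinction, using temperature as a standard example (Celsius is interval data with an arbitrary zero; kelvin is ratio data with a true zero):

```python
# Interval data: differences are meaningful, ratios are not.
c1, c2 = 10.0, 20.0                  # degrees Celsius
print(c2 - c1)                       # 10.0 -- a meaningful temperature difference
# c2 / c1 == 2.0, but 20 C is not "twice as hot" as 10 C:
# the Celsius zero point is arbitrary.

# Ratio data: a true zero point makes division meaningful.
k1, k2 = c1 + 273.15, c2 + 273.15    # the same temperatures in kelvin
print(k2 / k1)                       # ~1.035 -- a meaningful ratio
```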
When we have multiple columns of data, the ordering between columns may or may not be meaningful. If the ordering of columns is meaningful, we can treat each row as a vector in a space of dimension equal to the number of columns. If the ordering of columns is not meaningful, each row is just a set. Each such row, as a vector or a set, is called a data instance. We may do linear algebra on data instances with numeric values.
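A minimal sketch, assuming numpy is available, of treating each row of the numeric columns above as a vector and doing linear algebra on data instances:

```python
import numpy as np

# Each row is a data instance: a vector in a 2-dimensional space
# (height in inches, weight in pounds).
X = np.array([[70.0, 210.0],
              [65.0, 130.0],
              [73.5, 240.0]])

diff = X[0] - X[1]            # vector difference of two data instances
dist = np.linalg.norm(diff)   # Euclidean distance between them
print(diff, dist)
```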
Because we use numbers to represent all information, it can be very tempting to assume that all data has a zero point, that all data is ordered, and that all data can be added, subtracted, multiplied, and divided. This is often a mistake. We should perform operations on data that are appropriate to its meaning, not just to its representation. If we were to decide to use the number 1 for the male gender and 2 for the female gender, we could say male is less than female, but if we reversed the assignments we would get the opposite answer.
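The pitfall is easy to demonstrate; the numeric codes below are arbitrary assumptions, which is exactly the point:

```python
coding_a = {"Male": 1, "Female": 2}
coding_b = {"Male": 2, "Female": 1}

print(coding_a["Male"] < coding_a["Female"])   # True
print(coding_b["Male"] < coding_b["Female"])   # False
# The comparison reflects the arbitrary encoding, not the data itself.
```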
For all data we can count the instances of particular values or ranges of values. Because we can count, we can compute probabilities of occurrence. We can also make histograms -- aggregated counts of values in particular ranges or bins. See https://en.wikipedia.org/wiki/Histogram. Once we have made such histograms, the bin populations and probabilities are numbers we can add, subtract, multiply, and divide.
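A minimal sketch using numpy.histogram to turn raw values into bin counts and empirical probabilities of occurrence:

```python
import numpy as np

heights = np.array([70.0, 65.0, 73.5, 68.5, 60.0])
counts, edges = np.histogram(heights, bins=3)   # aggregated counts per bin
probs = counts / counts.sum()                   # empirical probabilities
print(counts, edges, probs)
```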
Once we have data we can add, subtract, multiply, and divide, either by making histograms or because the original data is of an appropriate type, we can compute descriptive statistics. See https://en.wikipedia.org/wiki/Descriptive_statistics.
Which statistics are appropriate depends on the data. If the data clusters around some central value, it is useful to compute a mean to quantify that central value, perhaps transforming the data first. If there are multiple clusters, one usually breaks up the data into separate clusters first. If there is enough data, one can also compute meaningful measures of dispersion, such as the variance, and higher moments that help in understanding the shape of the cluster.
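A minimal sketch of such descriptive statistics, assuming scipy is available for the higher moments (skewness and kurtosis help describe the shape of the cluster):

```python
import numpy as np
from scipy import stats

# Synthetic data clustered around a single central value.
data = np.random.default_rng(0).normal(loc=170.0, scale=10.0, size=1000)

print(np.mean(data))          # the central value
print(np.var(data, ddof=1))   # dispersion (sample variance)
print(stats.skew(data))       # asymmetry of the cluster
print(stats.kurtosis(data))   # heaviness of the tails
```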
Data with one well-identified central value for a single cluster is called unimodal. If there are two distinct clusters, the data is bimodal, etc. For example, pooled human height or weight data is bimodal because there are distinct gender-dependent means. In general, data is multimodal, and one of the primary tasks in analyzing the data is to identify the clusters. This can be very difficult for data in high dimensions.
Actually identifying clusters is a very complex process when working with real data, especially in high dimensions. See https://en.wikipedia.org/wiki/Cluster_analysis. In high dimensions, we encounter the "Curse of Dimensionality" [Bellman, Richard. "Dynamic programming and Lagrange multipliers." Proceedings of the National Academy of Sciences 42, no. 10 (1956): 767-769]. In high dimensions all data instances appear to be far from all other data instances, making it hard to form meaningful clusters of neighboring instances. In low dimensions (e.g. three dimensions), nearest-neighbor algorithms [Andrews, Lawrence C., and Herbert J. Bernstein. "NearTree, a data structure and a software toolkit for the nearest-neighbor problem." Journal of Applied Crystallography 49, no. 3 (2016)] perform well, but degrade sharply even for slightly higher dimensions, such as six or seven.
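The distance concentration behind the curse is easy to observe numerically. This sketch, assuming numpy is available, compares pairwise distances among random points in 3 dimensions and in 100 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in (3, 100):
    X = rng.uniform(size=(200, dim))             # 200 random points in the unit cube
    # All pairwise Euclidean distances between the points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d = d[np.triu_indices_from(d, k=1)]          # keep each pair once
    # The relative spread of distances shrinks as dim grows: every point
    # looks roughly equally far from every other point.
    print(dim, round(d.mean(), 3), round(d.std() / d.mean(), 3))
```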
Therefore it is very desirable to find ways to identify subsets of the variables that carry most of the information on how the data varies, using techniques such as principal component analysis (https://en.wikipedia.org/wiki/Principal_component_analysis) to identify a small number of linear combinations of variables that carry most of the variance of the data.
Such efforts can be distorted if different variables are presented on different scales, so that the contributions of some variables are artificially inflated and others artificially suppressed. Therefore it can be very useful to rescale each variable to be commensurate with the others before looking for principal components, as in the sketch below. The risk in doing this purely from the data is that such a transformation can itself distort the contributions of particular variables.
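A minimal sketch, assuming scikit-learn is available and using synthetic data, of rescaling each variable before extracting principal components:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two correlated variables on very different scales (illustrative data):
# height in metres and weight in grams.
height_m = rng.normal(1.7, 0.1, size=200)
weight_g = 60000 + 50000 * (height_m - 1.7) + rng.normal(0, 3000, size=200)
X = np.column_stack([height_m, weight_g])

# Without rescaling, the gram-scale variable dominates the variance.
print(PCA().fit(X).explained_variance_ratio_)

# Rescaling makes the variables commensurate before PCA.
Xs = StandardScaler().fit_transform(X)
print(PCA().fit(Xs).explained_variance_ratio_)
```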