Centering and Scaling

Outline

Your predictive models, which take some number of input predictors, can be disrupted by values that are on different scales or in different units of measurement. This can cause confusion about the numerical relevance of a predictor, or give you poor results when an unscaled value distorts your model. Data with these difficult inputs can be shifted to have a mean of 0, or rescaled to sit within a simple range such as 0-1. The former method is usually called centering and the latter scaling.
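As a minimal sketch of the two operations, the snippet below centers a column of values and, separately, rescales it into the range 0-1. The function names `center` and `min_max_scale` and the `heights_cm` data are made up for illustration, and min-max rescaling is only one common way to scale; it is assumed here because it matches the 0-1 range mentioned above.

```python
from statistics import mean

def center(values):
    """Shift values so that their mean is 0."""
    m = mean(values)
    return [v - m for v in values]

def min_max_scale(values):
    """Rescale values linearly into the range 0-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical example data: four heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0]

centered = center(heights_cm)        # values now average to 0
scaled = min_max_scale(heights_cm)   # smallest value maps to 0, largest to 1
```

Note that the two are alternatives rather than steps of one procedure: a centered column has negative values, so it cannot also lie within 0-1 unless it is constant.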

The problem

Someone building a model will usually start out with some set of data. The type of data will vary with each circumstance, but the predictors themselves might start off on different scales or be in different units. Someone passing data through a regression model might have three predictors, one of which is highly correlated with the other two. In that case the regression model can produce inflated standard errors, and so find it difficult to determine the true relationship between each predictor and the outcome. Or a modeller might run into difficulty when trying to model with data that is not bounded to a common scale. A value could have a minor or non-existent relationship with what you are trying to predict, but being on a larger scale can mean that the model treats it as more important.
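One way to see the scale problem concretely is with distances between data points. The sketch below assumes a distance-based view of the data, as a nearest-neighbour method would take; that choice is an assumption for illustration, not something the text prescribes. The ages, incomes, and the helper `min_max_columns` are all hypothetical.

```python
from math import dist

# Hypothetical rows: (age in years, income in dollars).
a = (25.0, 50_000.0)
b = (60.0, 50_500.0)   # very different age, almost the same income
c = (27.0, 52_000.0)   # almost the same age, slightly different income
d = (40.0, 90_000.0)   # an extra row that widens the income range

def min_max_columns(rows):
    """Rescale each column of a list of equal-length tuples into 0-1."""
    scaled_cols = []
    for col in zip(*rows):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # guard against a constant column
        scaled_cols.append([(v - lo) / span for v in col])
    return [tuple(row) for row in zip(*scaled_cols)]

# On the raw data, income differences swamp age differences.
raw_ab, raw_ac = dist(a, b), dist(a, c)

# After rescaling each column, both predictors contribute comparably.
sa, sb, sc, sd = min_max_columns([a, b, c, d])
scaled_ab, scaled_ac = dist(sa, sb), dist(sa, sc)

print(raw_ab < raw_ac)        # raw data: b looks closer to a than c does
print(scaled_ac < scaled_ab)  # scaled data: c is closer to a than b is
```

On the raw values the 35-year age gap between `a` and `b` barely registers next to a few hundred dollars of income, simply because income is recorded in larger numbers. Once both columns share the 0-1 range, the ordering of the neighbours changes.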

The development of the problem

Problems such as these have been encountered since the early days of statistics. In his 1936 paper 'The use of multiple measurements in taxonomic problems', Ronald Fisher begins with a dataset of flower measurements and tries to find the linear function $X = \lambda_{1}x_1 + \lambda_{2}x_2 + \lambda_{3}x_3 + \lambda_{4}x_4$ of the four measured values that will maximise the ratio of the difference between the specific means to the standard deviations within species.