When preparing to conduct a fair lending regression analysis, the first step is to determine the loan sample to be analyzed. This is usually accomplished by first selecting a particular loan product on which to focus and a time period.
Any given loan sample often includes loans that were treated differently for various reasons. A loan may be atypical in any number of ways and nonetheless be present in the sample.
The question then becomes how these loans should be handled in the analysis. Generally, the best practice is to exclude such loans, but those unfamiliar with statistical methods often object, suggesting that exclusion somehow makes the analysis invalid or incomplete.
In fact, it is quite the opposite, and we will demonstrate this through simulation in an upcoming white paper. For now, however, I want to address the assertion that excluding loans is somehow a flawed approach.
In fair lending analysis, the question being addressed is whether a lending institution is discriminatory in its lending practices. This is an unknown. Since it is impossible to observe every facet of the institution's lending, we rely on a sample of data. How this sample is chosen is critical, as it must be representative and unbiased. This is the crux of the proper application of statistical inference.
One of the best-known examples of why sample selection matters is the 1936 presidential election between Roosevelt and Landon. The Literary Digest polled millions of people and predicted a win for Landon, whereas the Gallup organization relied on a much smaller but more representative sample and correctly predicted the Roosevelt win.
Gallup is still around today, while the Literary Digest ceased publication not long after. A well-selected small sample will generally beat a poorly selected large one.
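The polling example above can be sketched as a small simulation. This is only an illustration under made-up assumptions: the population size, the 30% "listed" share, and the support rates are hypothetical numbers chosen for the sketch, not historical figures.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 30% of voters appear on a mailing list, and
# listed voters favor candidate A less often than unlisted voters do.
population = []
for _ in range(200_000):
    listed = rng.random() < 0.30
    favors_a = rng.random() < (0.40 if listed else 0.70)
    population.append((listed, favors_a))


def share_for_a(voters):
    """Fraction of the given voters who favor candidate A."""
    return sum(favors for _, favors in voters) / len(voters)


true_share = share_for_a(population)

# Large sample, but drawn only from the mailing list (selection bias).
listed_only = [voter for voter in population if voter[0]]
large_biased = rng.sample(listed_only, 50_000)

# Small sample, drawn at random from the whole population.
small_random = rng.sample(population, 1_000)

err_large = abs(share_for_a(large_biased) - true_share)
err_small = abs(share_for_a(small_random) - true_share)

print(f"true share for A:           {true_share:.3f}")
print(f"large biased sample error:  {err_large:.3f}")
print(f"small random sample error:  {err_small:.3f}")
```

Under these assumptions the large sample's error does not shrink as more listed voters are added, because the error comes from who is on the list rather than from how many are polled.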
The second, broader point concerns the regression model itself. The purpose of using regression analysis is to control for the other sources of variation in the outcome variable (such as how a loan was priced) so that the effect of protected class status can be isolated and measured. Introducing variation that is difficult to account for is counterproductive and can produce biased estimates.
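To make the risk of biased estimates concrete, here is a minimal simulation sketch. It is not the simulation promised for the white paper, and every number in it is an assumption: pricing depends only on a hypothetical credit score, there is no true protected-class effect, and a small set of "special" loans receives a flat discount for a reason the model does not observe. Special loans are assumed to be rarer among protected-class borrowers. The protected-class coefficient is computed with the Frisch-Waugh-Lovell approach (partial the score out of both variables, then regress residual on residual), using only the standard library.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable


def fit_line(x, y):
    """Return (intercept, slope) of a one-variable least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(x, y):
    """Residuals of y after a least-squares fit on x."""
    intercept, slope = fit_line(x, y)
    return [b - (intercept + slope * a) for a, b in zip(x, y)]


def protected_coefficient(loans):
    """Coefficient on `protected` in the model rate ~ score + protected."""
    score = [loan["score"] for loan in loans]
    rate_res = residuals(score, [loan["rate"] for loan in loans])
    prot_res = residuals(score, [float(loan["protected"]) for loan in loans])
    return fit_line(prot_res, rate_res)[1]


# Hypothetical loan data with NO true protected-class pricing effect.
loans = []
for _ in range(20_000):
    protected = rng.random() < 0.30
    score = rng.gauss(700, 50)
    rate = 6.0 - 0.01 * (score - 700) + rng.gauss(0, 0.25)
    # Assumed: special loans are less common among protected borrowers
    # and carry a flat one-point discount the model knows nothing about.
    special = rng.random() < (0.01 if protected else 0.10)
    if special:
        rate -= 1.0
    loans.append({"protected": protected, "score": score,
                  "rate": rate, "special": special})

coef_all = protected_coefficient(loans)
coef_typical = protected_coefficient(
    [loan for loan in loans if not loan["special"]])

print(f"protected coefficient, all loans:       {coef_all:+.3f}")
print(f"protected coefficient, special removed: {coef_typical:+.3f}")
```

In this setup the full sample shows an apparent pricing gap that comes entirely from the special loans, while the estimate on typical loans sits near zero, which is the true value by construction. The sketch shows only how unmodeled atypical loans can shift the coefficient. It does not settle how such loans should be identified in practice.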
To further illustrate this point, let’s suppose that our research question was not fair lending but whether a health supplement produces weight loss. In assembling a sample, one might think that it should include people from all walks of life, of all ages and genders. However, this is the opposite of what should be done if the true question is whether the supplement works. Such a broad sample would make it very difficult to isolate the supplement's effect from outside influences.
In most instances, special case loans should be excluded from analysis rather than trying to include and control for them. We will cover this topic in further detail in an upcoming white paper.