Table of Contents
- Context
- Formulas and Derivation
- Useful Resources
Context
Introduction to Linear Regression R Squared Formula
In part 3 I derived the formulas for the linear regression F test, which tests the explanatory power of the model as a whole. The linear regression R squared formula (aka the coefficient of determination formula) is similar to the F test in that it uses components from the F test; however, it is not a statistical hypothesis test.
The R Squared formula represents the percentage of squared error that is “explained” by the final model compared to the “total” amount of squared error in the baseline model.
\[\large{
R^2 = 1 \;-\; \frac{\widehat{\sigma}_\text{final}^2}{\widehat{\sigma}_\text{baseline}^2}
}\]
A model with a higher R squared percentage is more explanatory (i.e. has less squared error) than a model with a lower R squared percentage. However, keep in mind that simply comparing a lower value to a higher value does not tell you whether the two are significantly different.
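To make that concrete, here is a minimal sketch with simulated data (the factor names and parameters are invented for the example): the larger of two nested models always reports an equal or higher unadjusted R squared, and a partial F test via anova() is what actually tests whether the difference is significant.
# two nested models on simulated data: the larger model's unadjusted R squared
# is never lower, but that alone does not establish a significant difference
set.seed(7)
n <- 100
x1 <- runif(n,-100,100)
x2 <- runif(n,-100,100) # pure noise factor, unrelated to y
y <- 40 + 2.5*x1 + rnorm(n,0,30)
fit.small <- lm(y ~ x1)
fit.large <- lm(y ~ x1 + x2)
summary(fit.small)$r.squared
summary(fit.large)$r.squared # equal or slightly higher, by construction
anova(fit.small, fit.large) # partial F test of the added factor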
R Squared Range
R squared can take values from approximately 0 to 1, where "approximately" means the adjusted value can actually dip slightly below 0.
\[\large{
0\approx R^2 \le 1
}\]
Under the null hypothesis the baseline and final model variances are equal: \(\text{H}_0\longrightarrow \sigma_\text{final}^2 = \sigma_\text{baseline}^2\). However, that is a statement about the population variances. Estimators (e.g. \(\widehat{\sigma}_\text{final}^2\) and \(\widehat{\sigma}_\text{baseline}^2\)) computed from sample data will not exactly equal their population counterparts. As such the final variance estimator can be higher than the baseline estimator in absolute terms (while still being statistically equal).
Run the sample code below several times and you will see that the adjusted R squared value can be negative (adjusted simply means using variance estimators, which are bias adjusted).
rm(list=ls())
n <- 100
# y does not depend on x at all, so the true final and baseline variances are equal
df <- data.frame(
  x = runif(n,-100,100)
  ,e = rnorm(n,0,30)
)
df$y <- 40 + df$e
fit <- lm(y ~ x + 1, df)
# bias-adjusted variance estimators for the baseline and final models
var_.baseline <- sum((df$y-mean(df$y))^2) / (n-1)
var_.final <- sum(fit$residuals^2) / (n-2)
# Adjusted R-Squared and Variances
summary(fit)
var_.baseline
var_.final
1 - var_.final / var_.baseline
No Assumption of Normality Required
The assumption that the errors are normally distributed is required to construct the T and F test statistic formulas, but it is NOT required for the R squared formula. The linear regression coefficient estimator formula chooses the vector that minimizes squared error regardless of whether or not the errors are normally distributed.
So while the T test, T test confidence intervals, and F test are all invalid if the errors are not sufficiently normal, the R squared value is still valid because it is simply a measure of the percentage of total squared error explained by the final model.
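As a quick illustration, here is a minimal sketch using deliberately skewed (exponential) errors; the names and parameters are invented for the example. The R squared reported by lm is still a valid descriptive measure of the share of squared error explained, even though T and F based inference would be suspect.
# R squared with heavily skewed, non-normal errors
set.seed(42)
n <- 200
x <- runif(n,-100,100)
e <- rexp(n,1/30) - 30 # exponential errors shifted to mean zero
y <- 10 + 2*x + e
fit <- lm(y ~ x)
summary(fit)$r.squared # share of total squared error explained
1 - sum(fit$residuals^2) / sum((y-mean(y))^2) # same value computed directly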
Adjusted vs Unadjusted
R outputs both the R squared and the adjusted R squared, where "adjusted" means the variance estimates have been bias adjusted. See part 2 for an explanation of estimator bias.
# Residual standard error: 42.97 on 47 degrees of freedom
# Multiple R-squared: 0.9185, Adjusted R-squared: 0.915
# F-statistic: 264.8 on 2 and 47 DF, p-value: < 2.2e-16
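The unadjusted ("Multiple") and adjusted values are linked by a simple identity for the constant factor baseline: replace the sums of squares with bias adjusted variance estimators. A minimal sketch using the output above (n = 50 and q = 3 coefficients are inferred from the 2 and 47 degrees of freedom):
# converting unadjusted (Multiple) R squared to adjusted R squared
n <- 50 # sample size implied by n - q = 47 residual df and q - 1 = 2 model df
q <- 3 # number of coefficients, including the constant factor
R2 <- 0.9185 # Multiple R-squared from the output above
1 - (1 - R2) * (n - 1) / (n - q) # reproduces Adjusted R-squared: 0.915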
WARNING! On Changing Baselines
Part 3 of this series introduced the concept of a “baseline” model in the context of an F test of a linear regression model. A baseline is a model made up of some subset of the factors in the final model. The same concept is used to calculate the R squared value.
The two ubiquitous baselines are the constant and null factor baselines. Both are built into R's formula interface (+ 1 includes the constant factor, + 0 excludes it). However, changing the baseline does NOT change the final model; it only changes the baseline used to judge that model.
To highlight this, see the example below.
##--> https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data <--##
rm(list=ls())
# import data
df <- read.csv(
  "C:\\**YOURFILELOCATION**\\train.csv"
  ,header=TRUE
  ,stringsAsFactors=FALSE
)
# build linear model using null factor baseline
lm1 <- lm(SalePrice ~ (MSZoning)*(GrLivArea) + 0,df)
# build SAME linear model using constant factor baseline
lm2 <- lm(SalePrice ~ (MSZoning)*(GrLivArea) + 1,df)
# calculate total and unexplained variance for both models
tot1 <- sum((df$SalePrice)^2)/nrow(df)
tot2 <- sum((df$SalePrice-mean(df$SalePrice))^2)/(nrow(df)-1)
unexp1 <- sum((df$SalePrice-lm1$fitted.values)^2)/(nrow(df)-5)
unexp2 <- sum((df$SalePrice-lm2$fitted.values)^2)/(nrow(df)-5)
The unexplained variance will be EXACTLY the same for the two models, since they produce exactly the same predicted values for the same actual values. Both fits therefore report EXACTLY the same final aka "unexplained" variance.
# > unexp1;unexp2
# [1] 2744337851
# [1] 2744337851
# >
However, the total variance will be very different between the two. A null factor baseline uses 0 for every predicted value in the baseline model. A constant factor baseline uses the mean of the dependent variable for every predicted value in the baseline model, which inherently produces a lower total variance.
By increasing the total variance, the null factor baseline makes the unexplained variance appear smaller as a percentage of the total, even though that change came from the baseline and NOT from a more accurate model producing less unexplained variance.
# > tot1;tot2
# [1] 39039267708
# [1] 6311111264
#
# > summary(lm1) # null factor
#
# Residual standard error: 75150 on 1455 degrees of freedom
# Multiple R-squared: 0.8558, Adjusted R-squared: 0.8553
# F-statistic: 1727 on 5 and 1455 DF, p-value: < 2.2e-16
#
# > summary(lm2) # constant factor
#
# Residual standard error: 75150 on 1455 degrees of freedom
# Multiple R-squared: 0.1076, Adjusted R-squared: 0.1051
# F-statistic: 43.84 on 4 and 1455 DF, p-value: < 2.2e-16
CHANGING THE BASELINE MAKES R SQUARED VALUES NOT COMPARABLE: EVEN IF TWO MODELS ARE EXACTLY THE SAME, USING A DIFFERENT BASELINE PRODUCES A DIFFERENT PERCENTAGE!
R Squared Only Valid For Linear Regression
In part 1 of this series I mentioned that linear regression is only one member of an entire family of linear models (called the generalized linear model). This was in the context of choosing the "best" estimate for the model coefficients.
I detailed a method for finding the best coefficient estimates called maximum likelihood estimation (MLE). The reason I detailed this method and not the much more common "least squares" is that MLE produces EXACTLY the same result as least squares for linear regression, but unlike least squares it functions for the entire generalized linear model family.
A common question when people are introduced to the generalized linear model is: why doesn’t R report the R Squared value for the GLM?
The answer is that, because the GLM does NOT seek to minimize squared error, you cannot expect the coefficient estimates to be chosen to produce the lowest possible squared error. You can calculate the canonical R squared for any GLM member, as it is merely a function of the baseline and final variances. However, it is invalid to compare models using a measure oriented around squared error while selecting parameters meant to optimize for something else that is NOT squared error.
There are pseudo R squared analogues that can be used for other GLM members.
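For example, a common analogue for logistic regression is McFadden's pseudo R squared, which replaces the variance ratio with a ratio of log-likelihoods. A minimal sketch with simulated data (the names and parameters are invented for the example):
# McFadden's pseudo R squared for a logistic regression
set.seed(1)
n <- 500
x <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.2*x))) # true success probabilities
y <- rbinom(n,1,p)
glm.final <- glm(y ~ x, family=binomial) # final model
glm.baseline <- glm(y ~ 1, family=binomial) # constant factor baseline
1 - as.numeric(logLik(glm.final)) / as.numeric(logLik(glm.baseline))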
Formulas and Derivation
Linear Regression R Squared Formulas
\[\large{
\begin{align}
&y = \stackrel{n\times q}{X}b + \varepsilon & && &\small\text{(regression model)} \\ \\
&\widehat{b} = (X^\text{T}X)^{-1}X^\text{T}y,\quad\widehat{y}=X\widehat{b} & && &\small\text{(final estimate)} \\ \\
&y = \stackrel{n\times o}{W}b + \varepsilon,\;\; W=X_{[,1:o]},\;\;o<q & && &\small\text{(baseline model)} \\ \\
&\tilde{b} = (W^\text{T}W)^{-1}W^\text{T}y,\;\;\tilde{y}=W\tilde{b} & && &\small\text{(baseline estimate)} \\ \\
&df_\text{final} = (n\;-\;q) & && &\small\text{(final degrees of freedom)} \\ \\
&df_\text{baseline} = (n\;-\;o) & && &\small\text{(baseline degrees of freedom)}\\ \\
&\widehat{\sigma}_\text{final}^2=\frac{1}{df_\text{final}}\sum_{i=1}^n (y_i \;-\; \widehat{y}_i)^2 & && &\small\text{(final variance)}\\
&\widehat{\sigma}_\text{baseline}^2=\frac{1}{df_\text{baseline}}\sum_{i=1}^n (y_i \;-\; \tilde{y}_i)^2 & && &\small\text{(baseline variance)}\\
&R^2 = 1 \;-\; \frac{\widehat{\sigma}^2_{\text{final}}}{\widehat{\sigma}^2_{\text{baseline}}} & && &\small\text{(R Squared Statistic)}
\end{align}
}\]
R Code For Linear Regression R Squared Formulas
The R code below manually implements the formulas above, uses R's standard lm functionality to achieve the same results, and then compares the two.
If you are new to R I suggest RStudio as an IDE.
######################################
## Generate Data, Declare Variables ##
######################################
rm(list = ls())
`%+%` <- function(a, b) paste(a, b, sep="")
IsConstFactor <- T # control if constant factor in model
IsSigFactors <- T # control if significant factors in model
IsNonSigFactor <- T # control if non-significant factor in model
n <- 100 # sample size
sigma.model <- 40 # error standard deviation
# independent factors aka design matrix
X <- cbind(
  if(IsConstFactor == T){rep(1,n)}else{NULL}
  ,if(IsSigFactors == T){runif(n,-100,100)}
  ,if(IsSigFactors == T){rpois(n,10)}
  ,if(IsNonSigFactor == T){rexp(n,0.1)}else{NULL}
)
# coefficient vector
b <- rbind(
  if(IsConstFactor == T){40}else{NULL}
  ,if(IsSigFactors == T){2.5}
  ,if(IsSigFactors == T){4}
  ,if(IsNonSigFactor == T){0}else{NULL}
)
# error, linear regression model, baseline estimate
e <- cbind(rnorm(n,0,sigma.model))
y <- X %*% b + e
baseline <-
  if(IsConstFactor == T) {
    mean(y)
  } else {0}
# QR factorization of X for more
# efficient processing
qr <- qr(X)
Q <- qr.Q(qr)
R <- qr.R(qr)
rm(qr)
# labels
colnames(X) <- c("X" %+% seq(as.numeric(!IsConstFactor),
                             ncol(X) - as.numeric(IsConstFactor)))
rownames(b) <- c("b" %+% seq(as.numeric(!IsConstFactor),
                             nrow(b) - as.numeric(IsConstFactor)))
###############################
## Linear Regression Using R ##
###############################
model.formula <- if(IsConstFactor == T) {
  "y ~ 1" %+% paste(" + " %+% colnames(X)[2:ncol(X)], collapse='')
} else {"y ~ 0 " %+% paste(" + " %+% colnames(X), collapse='')}
linear.model <- lm(model.formula,as.data.frame(X))
#######################################
## Perform Linear Regression Manually ##
#######################################
b_ <- solve(R) %*% t(Q) %*% y # estimated coefficients
#b_ <- solve(t(X) %*% X) %*% t(X) %*% y
rownames(b_) <- rownames(b)
y_ <- X %*% b_ # estimated model
# degrees of freedom
df.baseline <- if(IsConstFactor == T) {n - 1} else {n}
df.final <- n - nrow(b_)
# residuals
res <- cbind(
  c(y - baseline) # baseline/"total" error
  ,c(y - y_)      # final/"unexplained" error
); colnames(res) <- c("baseline","final")
# variances
var_.baseline <- sum(res[,"baseline"]^2) / df.baseline
var_.final <- sum(res[,"final"]^2) / df.final
# R-squared value
R2 <- 1 - var_.final / var_.baseline
R2_unadj <- 1 - var_.final * df.final / var_.baseline / df.baseline
ret.R2 <- cbind(R2,R2_unadj)
colnames(ret.R2) <- c("R-squared","R-squared Unadj.")
#############
## Compare ##
#############
summary(linear.model)
ret.R2
Derivation
Every component of the linear regression R squared formula is also a component of the regression F test, and as such they are all derived in part 3 of this series.
The R squared formula simply expresses the final aka "unexplained" variance as a percentage of the baseline aka "total" variance and subtracts that percentage from 1 to obtain the percentage of the total that is explained by the final model.
\[\large{
R^2 = 1 \;-\; \frac{\widehat{\sigma}_\text{final}^2}{\widehat{\sigma}_\text{baseline}^2}
}\]
To obtain the formula unadjusted for bias, simply multiply each bias adjusted variance estimator by its degrees of freedom (the denominator of that estimator), leaving only the sums of squared residuals. There is no need to then divide each sum by \(n\), since that factor would cancel between the numerator and the denominator.
\[\require{cancel}\large{
\begin{align}
{R^2}^* &= 1 \;-\; \frac{df_\text{final}\widehat{\sigma}_\text{final}^2}{df_\text{baseline}\widehat{\sigma}_\text{baseline}^2} \\ \\
&= 1 \;-\; \frac{\cancel{\frac{1}{n}}\cancel{df_\text{final}}\frac{1}{\cancel{df_\text{final}}}\sum_{i=1}^n res_\text{final}^2}{\cancel{\frac{1}{n}}\cancel{df_\text{baseline}}\frac{1}{\cancel{df_\text{baseline}}}\sum_{i=1}^n res_\text{baseline}^2} \\ \\
&^*\normalsize{\text{ means unadjusted for bias}}
\end{align}
}\]