Standardize Variables Except For Dummy Variables, Using std_selected_boot()
R
regression
categorical variables
stdmod
standardized
bootstrapping
confidence-intervals
Author
Shu Fai Cheung
Published
June 25, 2023
This post shows one simple way to get meaningful standardized regression coefficients in multiple linear regression with dummy variables, with appropriate confidence intervals, using std_selected_boot() from the stdmod package.
(Note: This post and some related posts, e.g., this one on moderated regression, have sections that are identical or highly similar. This is intentional, to make each article self-contained. Readers do not need to refer to other articles to understand the points.)
“Betas”
It is common in my area, psychology, to report standardized coefficients, the so-called "betas," when reporting the results of multiple regression or related methods. It is so common that some programs report the "betas" by default, alongside the unstandardized coefficients (the "Bs"). However, even if it is justifiable to interpret the effects of some variables in the standardized metric, dummy variables, if any, should still not be standardized.
Dummy Variables Standardized Are No Longer Dummy Variables
A standardized dummy variable can no longer be interpreted as a dummy variable because it no longer takes only the values 0 and 1.
So the "betas" of dummy variables, though perhaps not "wrong," are not interpretable.
Nothing new. This issue has been raised by others (e.g., Hayes, 2022, p. 44). Some programs (e.g., jamovi) do not standardize dummy variables even when “betas” are requested.
Unfortunately, standardizing them is quite common (including in my own work).
The solutions are simple:
Do Not Standardize Dummy Variables
Standardize the other variables, but leave the dummy variables as they are.
However, common statistical programs standardize all variables. No choice. All or none.
To do standardization right when dummy variables are present, we need to do the standardization manually. This is doable but inconvenient.
The function std_selected_boot() from the R package stdmod, which I and my collaborators developed (Cheung et al., 2022), can do the standardization right without any major changes to the common workflow.
How to Use std_selected_boot()
Workflow
Fit a regression model by lm() as usual.
Pass the output to std_selected_boot() and select variables to be standardized.
It was designed to give users full control over which variables to standardize (see this article on how). It has an additional feature:
What It Won’t Do (Correctly)
It automatically skips factors and string variables in standardization. That is, their dummy variables will never be standardized.
The variables x1 and x2 are continuous predictors.1 The variable group is a string variable, with three possible values: Gp1, Gp2, and Gp3. The outcome variable is y.
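For reference, this is a minimal sketch of this first step, assuming the data frame is named dat and the fitted model is stored in lm_out (the names that appear in the output below and in the later call to std_selected_boot()):

# Fit the regression model by lm() as usual
lm_out <- lm(y ~ x1 + x2 + group, data = dat)
summary(lm_out)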
Call:
lm(formula = y ~ x1 + x2 + group, data = dat)
Residuals:
Min 1Q Median 3Q Max
-42.755 -7.165 0.355 6.597 25.317
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.8205 2.2545 4.356 2.14e-05 ***
x1 0.4080 0.1021 3.997 9.10e-05 ***
x2 4.1642 0.3051 13.649 < 2e-16 ***
groupGp2 6.9809 2.0704 3.372 0.000901 ***
groupGp3 6.3161 1.7076 3.699 0.000282 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.49 on 195 degrees of freedom
Multiple R-squared: 0.5385, Adjusted R-squared: 0.5291
F-statistic: 56.89 on 4 and 195 DF, p-value: < 2.2e-16
Do Standardization Right
Install stdmod and load it:
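If stdmod has not been installed yet, it can be installed from CRAN first (a one-time step):

# Install stdmod from CRAN (skip if already installed)
install.packages("stdmod")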
library(stdmod)
If we want to standardize all variables except for categorical variables, if any, we just pass the output to std_selected_boot(), and set to_standardize to ~ .. The right-hand side of ~ denotes the variables to be standardized. If set to ., then all numeric variables, including the outcome variable (y), will be standardized.2
But getting the point estimates right is not the only concern. There is another issue with standardization: the confidence intervals.
Beware of the t-based SE and CI
If a variable is standardized, the usual t-based standard errors and confidence intervals of the coefficients that involve it may be biased.3
This is because (a) they do not take into account the sampling variation of the standard deviations used in standardization (Yuan & Chan, 2011), and (b) the coefficients after standardization, the "betas," are not normally distributed (though they may be close to normal). Many statistical programs do not report confidence intervals for the "betas," for a good reason.
This is why std_selected_boot() reports nonparametric bootstrap percentile confidence intervals by default, just in case the bias is large.
To have stable and reproducible confidence intervals, call set.seed() before calling std_selected_boot() and set nboot to the desired number of bootstrap samples (at least 2000 but 5000 or more is recommended):
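A minimal sketch of this step, assuming lm_out is the lm() output above (the seed value below is arbitrary; any fixed number will do):

set.seed(860516)  # an arbitrary seed, for reproducible bootstrap results
lm_out_std <- std_selected_boot(lm_out = lm_out,
                                to_standardize = ~ .,
                                nboot = 5000)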
We can call summary() as usual to print the results:
summary(lm_out_std)
Call to std_selected_boot():
std_selected_boot(lm_out = lm_out, to_standardize = ~., nboot = 5000)
Selected variable(s) are centered by mean and/or scaled by SD
- Variable(s) centered: y x1 x2 group
- Variable(s) scaled: y x1 x2 group
centered_by scaled_by Note
y 40.36635 15.286176 Standardized (mean = 0, SD = 1)
x1 9.41020 7.342158 Standardized (mean = 0, SD = 1)
x2 5.35520 2.454193 Standardized (mean = 0, SD = 1)
group NA NA Nonnumeric
Note:
- Categorical variables will not be centered or scaled even if
requested.
- Nonparametric bootstrapping 95% confidence intervals computed.
- The number of bootstrap samples is 5000.
Call:
lm(formula = y ~ x1 + x2 + group, data = dat_mod)
Residuals:
Min 1Q Median 3Q Max
-2.79695 -0.46875 0.02324 0.43155 1.65619
Coefficients:
Estimate CI Lower CI Upper Std. Error t value Pr(>|t|)
(Intercept) -0.28825 -0.42260 -0.16397 0.08545 -3.374 0.000895 ***
x1 0.19598 0.10064 0.28696 0.04903 3.997 9.1e-05 ***
x2 0.66856 0.58901 0.73888 0.04898 13.649 < 2e-16 ***
groupGp2 0.45668 0.22507 0.68592 0.13544 3.372 0.000901 ***
groupGp3 0.41319 0.20033 0.63209 0.11171 3.699 0.000282 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6863 on 195 degrees of freedom
Multiple R-squared: 0.5385, Adjusted R-squared: 0.5291
F-statistic: 56.89 on 4 and 195 DF, p-value: < 2.2e-16
Note:
- Estimates and their statistics are based on the data after
mean-centering, scaling, or standardization.
- [CI Lower, CI Upper] are bootstrap percentile confidence intervals.
- Std. Error are not bootstrap SEs.
The output has one additional section: the variables that are standardized,4 and the variables that are not transformed.
The other sections are similar to those of a usual multiple regression. Note that the column Estimate is intentionally not labelled as Beta because it is possible that only some variables are standardized. Labelling it as Beta, though common, is misleading.
Interpret The Output
In the regression output, because y, x1, and x2 are standardized, the coefficients of x1 and x2 are the usual "betas." However, group is not standardized, so groupGp2 and groupGp3 still take only two possible values, 0 and 1, and can still be interpreted as usual.
For example, the coefficient of groupGp2 is 0.457. That is, compared to Gp1, the reference group, the predicted value of y is 0.457 SD (of y) higher in Gp2.
On the other hand, the coefficient of groupGp3 is 0.413. That is, compared to Gp1, the predicted value of y is 0.413 SD (of y) higher in Gp3.
In short, the coefficients of the dummy variables can be interpreted as usual, except that the difference in the outcome (dependent) variable is in SD units of that variable when it is standardized.
Why The Common Practice Is Problematic
Standardize All Variables
Assume that we standardize all variables, including the dummy variables, as in some statistical program. To simulate this, I manually create the dummy variables, standardize them, and do regression:
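A minimal sketch of this approach, assuming the same data frame dat. The dummy variables groupGp2 and groupGp3 are stored in dat (they are used again below); dat_all_std and lm_all_std are just illustrative names:

# Create the dummy variables manually (0/1 indicators; Gp1 is the reference group)
dat$groupGp2 <- as.numeric(dat$group == "Gp2")
dat$groupGp3 <- as.numeric(dat$group == "Gp3")
# Standardize everything, including the dummy variables, and refit the model
dat_all_std <- data.frame(scale(dat[, c("y", "x1", "x2", "groupGp2", "groupGp3")]))
lm_all_std <- lm(y ~ x1 + x2 + groupGp2 + groupGp3, data = dat_all_std)
summary(lm_all_std)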
The results of this regression are what common statistical programs report when they standardize all variables, including the dummy variables, to yield the "betas".
The coefficients of groupGp2 and groupGp3, when they are standardized, are not meaningful. After standardization, they are no longer either 0 or 1.
For example, the so-called “beta” of groupGp2, 0.188, is not the difference between Gp2 and Gp1 on standardized y. It is the increase in standardized y when groupGp2 “increases by one SD”, which is meaningless.
What’s worse, the ranking of the coefficients changed. If we interpret the coefficients with dummy variables not standardized, the Gp2 vs. Gp1 difference is larger than the Gp3 vs. Gp1 difference (0.457 vs 0.413).
However, if we interpret the coefficients with dummy variables standardized, we would have a different conclusion: the Gp2 vs. Gp1 difference is smaller than the Gp3 vs. Gp1 difference (0.188 vs 0.206), though only slightly.
The reason is that there are more cases in Gp3 than in Gp2. The differences in the SDs of the dummy variables are large enough to reverse the ranking:
table(dat$group)
Gp1 Gp2 Gp3
65 43 92
sd(dat$groupGp2)
[1] 0.4118533
sd(dat$groupGp3)
[1] 0.4996481
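The reversal follows from how fully standardized coefficients are computed: the "beta" of a standardized dummy variable is the unstandardized coefficient multiplied by SD(dummy) and divided by SD(y). Using the unstandardized coefficients and SDs reported above:

# "Beta" of a standardized dummy = B (unstandardized) * SD(dummy) / SD(y)
6.9809 * sd(dat$groupGp2) / sd(dat$y)   # about 0.188 for groupGp2
6.3161 * sd(dat$groupGp3) / sd(dat$y)   # about 0.206 for groupGp3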
I created the data to demonstrate that this reversal in ranking is possible. This may not happen in real data. However, having groups with different numbers of cases is probably the norm rather than the exception in real data.5
Use t-Based Confidence Intervals
Some programs give confidence intervals of the "betas" using t statistics. That is, they standardize the variables, do the regression, and form the confidence intervals as if the standardized variables were the original data.
Let us compare the bootstrap confidence intervals with the usual OLS confidence intervals based on the t-statistics.
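For reference, the bootstrap percentile intervals (the ones shown in the summary() output above) can also be printed on their own; for this output, confint() returns them by default:

# Bootstrap percentile confidence intervals (the default for this output)
round(confint(lm_out_std), 3)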
For the output of std_selected_boot(), we can request the t-based confidence intervals by setting type to "lm" when calling confint(), the function commonly used to form confidence intervals:
# OLS confidence intervals
round(confint(lm_out_std, type = "lm"), 3)
As shown above, the confidence intervals of x1 by the two methods are close to each other. However, the bootstrap confidence interval of x2 is narrower than the t-based confidence interval.
We can compare the widths of the confidence intervals by examining their ratios (ratio = CI_Width_t / CI_Width_boot):
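This is one way to compute the ratios (ci_lm and ci_boot are arbitrary names used only for this illustration):

ci_lm   <- confint(lm_out_std, type = "lm")   # t-based confidence intervals
ci_boot <- confint(lm_out_std)                # bootstrap percentile intervals
# Ratio of widths: values above 1 mean the t-based interval is wider
round((ci_lm[, 2] - ci_lm[, 1]) / (ci_boot[, 2] - ci_boot[, 1]), 3)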
The widths of the confidence intervals are nearly identical for x1. However, for x2, the t-based confidence interval is about 28.9% wider than the bootstrap confidence interval.
For groupGp2, even though the dummy variable is correctly not standardized, its confidence interval can still be affected because the outcome variable, y, is standardized. Its t-based confidence interval is also wider than the bootstrap confidence interval, by about 15.9%.
Final Remarks
What std_selected_boot() does can be implemented by researchers using base R only. The function was developed to make it easier to do standardization right, without adopting the all-or-none approach.
More information on std_selected_boot() can be found on its help page.
Other Resources
There are other programs and functions that also correctly skip categorical variables when doing standardization. For example, the linear regression procedure in jamovi does not standardize the dummy variables formed from categorical variables when standardized estimates are requested.
For confidence intervals, other methods have also been proposed to take into account the sampling variability of the standard deviations. Examples are the methods proposed by Dudgeon (2017) and Jones & Waller (2013), in addition to the one proposed by Yuan & Chan (2011).
For issues on this post, such as corrections and mistakes, please open an issue for the GitHub repository for this blog. Thanks.
References
Cheung, S. F., Cheung, S.-H., Lau, E. Y. Y., Hui, C. H., & Vong, W. N. (2022). Improving an old way to measure moderation effect in standardized units. Health Psychology, 41(7), 502–505. https://doi.org/10.1037/hea0001188
Dudgeon, P. (2017). Some improvements in confidence intervals for standardized regression coefficients. Psychometrika, 82(4), 928–951. https://doi.org/10.1007/s11336-017-9563-z
Hayes, A. F. (2022). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach (Third edition). The Guilford Press.
Jones, J. A., & Waller, N. G. (2013). Computing confidence intervals for standardized regression coefficients. Psychological Methods, 18(4), 435–453. https://doi.org/10.1037/a0033269
Yuan, K.-H., & Chan, W. (2011). Biases and standard errors of standardized regression coefficients. Psychometrika, 76(4), 670–690. https://doi.org/10.1007/s11336-011-9224-6
Footnotes
x1 and x2 are intentionally generated from nonnormal distributions, for illustration. Note that OLS (ordinary least squares) regression does not assume that the predictors are multivariate normal. Using nonnormal predictors violates no assumption of OLS regression.↩︎
Suppose we want to standardize only x1 and y because x2 is measured in a meaningful unit. We can then set to_standardize to ~ x1 + y. The order of the variables does not matter.↩︎
Note that there are indeed cases in which they are still unbiased, and cases in which the biases are negligible. See Yuan & Chan (2011) for a detailed discussion.↩︎
std_selected_boot() also allows users to do only mean-centering or only scaling by standard deviation. Mean-centering and then scaling by standard deviation is equivalent to standardization.↩︎
It can be argued that size differences should be taken into account. They should be, but probably not by standardizing the dummy variables and making the coefficients not interpretable.↩︎