Standardize Variables Except For Dummy Variables, Using std_selected_boot()

R
regression
categorical variables
stdmod
standardized
bootstrapping
confidence-intervals
Author

Shu Fai Cheung

Published

June 25, 2023

This post shows one simple way to get meaningful standardized regression coefficients in multiple linear regression with dummy variables, with appropriate confidence intervals, using std_selected_boot() from the stdmod package.

(Note: This post and some related posts, e.g., this one on moderated regression, have sections that are identical or highly similar. This is intentional, to make each article self-contained. Readers do not need to refer to other articles to understand the points.)

“Betas”

It is common in my area, psychology, to report standardized coefficients, the so-called “betas,” when reporting the results of multiple regression or related methods. It is so common that some programs report the “betas” by default, alongside the unstandardized coefficients (the “Bs”). However, even if it is justifiable to interpret the effects of some variables in the standardized metric, dummy variables, if any, should still not be standardized:

Dummy Variables Standardized Are No Longer Dummy Variables

A standardized dummy variable can no longer be interpreted as a dummy variable because it no longer takes only the values 0 and 1.

So, the “betas” for dummy variables, though maybe not “wrong”, are not interpretable.
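To see this concretely, here is a tiny base-R sketch with toy values: standardizing a five-case dummy variable yields two values, neither of which is 0 or 1.

```r
# A toy illustration: once standardized, a dummy variable no longer
# takes the values 0 and 1.
d <- c(0, 0, 0, 1, 1)
round(unique(scale(d)[, 1]), 3)
# two values, approximately -0.730 and 1.095
```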

Nothing new. This issue has been raised by others (e.g., Hayes, 2022, p. 44). Some programs (e.g., jamovi) do not standardize dummy variables even when “betas” are requested.

Unfortunately, standardizing them is quite common (including in my own work).

The solutions are simple:

Do Not Standardize Dummy Variables

Standardize other variables except for the dummy variables.

However, common statistical programs standardize all variables. No choice. All or none.

To do standardization right when dummy variables are present, we need to do the standardization manually. Doable, but inconvenient.
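As a rough sketch of that manual route (toy data; the variable names here are hypothetical): standardize the numeric variables with scale(), leave the categorical variable alone, and refit the model.

```r
# Manual standardization, skipping the categorical variable.
set.seed(1234)
d <- data.frame(x = rnorm(100),
                group = sample(c("Gp1", "Gp2"), 100, replace = TRUE))
d$y <- d$x + (d$group == "Gp2") + rnorm(100)
d_std <- d
d_std$y <- scale(d_std$y)[, 1]  # standardize the numeric variables ...
d_std$x <- scale(d_std$x)[, 1]
# ... but leave group as-is; lm() still forms a 0/1 dummy for it
lm_manual <- lm(y ~ x + group, d_std)
```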

The function std_selected_boot() from the R package stdmod, which I and my collaborators developed (Cheung et al., 2022), can do the standardization right without any major changes to the common workflow.

How to Use std_selected_boot()

Workflow
  1. Fit a regression model by lm() as usual.

  2. Pass the output to std_selected_boot() and select variables to be standardized.

It was designed to let users have full control over which variables to standardize (see this article on how). It has an additional feature:

What It (Correctly) Won’t Do

It automatically skips factors and string variables in standardization. That is, their dummy variables will never be standardized.

Example

Do Regression As Usual

Suppose this is the data set:1

Show the code for generating data
set.seed(1453243)
n <- 200
group <- sample(c("Gp1", "Gp2", "Gp3"),
                size = n,
                prob = c(.30, .20, .50),
                replace = TRUE)
x1 <- (rchisq(n, 2) - 2) * 4 + 10
x2 <- runif(n, 1, 10)
y <- 10 + 0.45 / 1 * x1 + 4 / 1 * x2 + sapply(group, switch,
                                   Gp1 = 1,
                                   Gp2 = 8,
                                   Gp3 = 6) + rnorm(n, 0, 10)
dat <- data.frame(x1 = round(x1, 2),
                  x2 = round(x2, 2),
                  group,
                  y = round(y, 2))
write.csv(dat, "dat.csv")

This is the data file.

head(dat)
    x1   x2 group     y
1 9.97 7.84   Gp1 44.99
2 2.45 5.11   Gp1 47.27
3 8.50 8.87   Gp2 60.00
4 4.10 5.60   Gp1 37.05
5 8.16 7.18   Gp3 45.51
6 2.53 9.19   Gp1 54.71

The variables x1 and x2 are continuous predictors. The variable group is a string variable, with three possible values: Gp1, Gp2, and Gp3. The outcome variable is y.

This is the regression model:

lm_out <- lm(y ~ x1 + x2 + group, dat)
summary(lm_out)

Call:
lm(formula = y ~ x1 + x2 + group, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-42.755  -7.165   0.355   6.597  25.317 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   9.8205     2.2545   4.356 2.14e-05 ***
x1            0.4080     0.1021   3.997 9.10e-05 ***
x2            4.1642     0.3051  13.649  < 2e-16 ***
groupGp2      6.9809     2.0704   3.372 0.000901 ***
groupGp3      6.3161     1.7076   3.699 0.000282 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.49 on 195 degrees of freedom
Multiple R-squared:  0.5385,    Adjusted R-squared:  0.5291 
F-statistic: 56.89 on 4 and 195 DF,  p-value: < 2.2e-16

Do Standardization Right

Install stdmod and load it:

library(stdmod)

If we want to standardize all variables except for categorical variables, if any, we just pass the output to std_selected_boot(), and set to_standardize to ~ .. The right-hand side of ~ denotes the variables to be standardized. If set to ., then all numeric variables, including the outcome variable (y), will be standardized.2

But this is not just about the coefficients. There is one more issue with standardization: confidence intervals.

Beware of the t-based SE and CI

If a variable is standardized, the usual t-based standard errors and confidence intervals of the coefficients that involve it may be biased.3

This is because (a) they do not take into account the sampling variation of the standard deviations used in standardization (Yuan & Chan, 2011), and (b) the coefficients after standardization, the “betas,” are not normally distributed (though they may be close to normal). Many statistical programs do not report confidence intervals for the “betas,” for a good reason.

This is why std_selected_boot() reports nonparametric bootstrap percentile confidence intervals by default, just in case the bias is large.

To have stable and reproducible confidence intervals, call set.seed() before calling std_selected_boot() and set nboot to the desired number of bootstrap samples (at least 2000 but 5000 or more is recommended):

set.seed(870516)
lm_out_std <- std_selected_boot(lm_out,
                                to_standardize = ~ .,
                                nboot = 5000)

We can call summary() as usual to print the results:

summary(lm_out_std)

Call to std_selected_boot():
std_selected_boot(lm_out = lm_out, to_standardize = ~., nboot = 5000)

Selected variable(s) are centered by mean and/or scaled by SD
- Variable(s) centered: y x1 x2 group
- Variable(s) scaled: y x1 x2 group

      centered_by scaled_by                            Note
y        40.36635 15.286176 Standardized (mean = 0, SD = 1)
x1        9.41020  7.342158 Standardized (mean = 0, SD = 1)
x2        5.35520  2.454193 Standardized (mean = 0, SD = 1)
group          NA        NA Nonnumeric                     

Note:
- Categorical variables will not be centered or scaled even if
  requested.
- Nonparametric bootstrapping 95% confidence intervals computed.
- The number of bootstrap samples is 5000.

Call:
lm(formula = y ~ x1 + x2 + group, data = dat_mod)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.79695 -0.46875  0.02324  0.43155  1.65619 

Coefficients:
            Estimate CI Lower CI Upper Std. Error t value Pr(>|t|)    
(Intercept) -0.28825 -0.42260 -0.16397    0.08545  -3.374 0.000895 ***
x1           0.19598  0.10064  0.28696    0.04903   3.997  9.1e-05 ***
x2           0.66856  0.58901  0.73888    0.04898  13.649  < 2e-16 ***
groupGp2     0.45668  0.22507  0.68592    0.13544   3.372 0.000901 ***
groupGp3     0.41319  0.20033  0.63209    0.11171   3.699 0.000282 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6863 on 195 degrees of freedom
Multiple R-squared:  0.5385,    Adjusted R-squared:  0.5291 
F-statistic: 56.89 on 4 and 195 DF,  p-value: < 2.2e-16

Note:
- Estimates and their statistics are based on the data after
  mean-centering, scaling, or standardization.
- [CI Lower, CI Upper] are bootstrap percentile confidence intervals.
- Std. Error are not bootstrap SEs.

The output has one additional section:

  • Variables that are standardized4, and variables that are not transformed.

The other sections are similar to those of a usual multiple regression. Note that the column Estimate is intentionally not labelled as Beta because it is possible that only some variables are standardized. Labelling it as Beta, though common, is misleading.

Interpret The Output

In the regression output, because y, x1, and x2 are standardized, they are the usual “betas.” However, group is not standardized and so groupGp2 and groupGp3 still take only two possible values, 0 and 1, and so can still be interpreted as usual.

For example, the coefficient of groupGp2 is 0.457. That is, compared to Gp1, the reference group, the predicted value of y is 0.457 SD (of y) higher in Gp2.

On the other hand, the coefficient of groupGp3 is 0.413. That is, compared to Gp1, the predicted value of y is 0.413 SD (of y) higher in Gp3.

In short, the coefficients of the dummy variables can be interpreted as usual, except that the difference in the outcome variable (dependent variable) is now expressed in SDs of that variable, because it is standardized.
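As a quick sanity check, multiplying a coefficient on this standardized-y scale by the SD of y (about 15.29 in the output above) recovers the unstandardized group difference:

```r
# Converting the groupGp2 coefficient back to the original metric of y:
0.457 * 15.286  # approx. 6.99, close to the unstandardized 6.98
```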

Why The Common Practice Is Problematic

Standardize All Variables

Assume that we standardize all variables, including the dummy variables, as some statistical programs do. To simulate this, I manually create the dummy variables, standardize them, and do the regression:

Show the code
dat$groupGp2 <- ifelse(dat$group == "Gp2", 1, 0)
dat$groupGp3 <- ifelse(dat$group == "Gp3", 1, 0)
dat_std <- dat[, c("x1", "x2", "groupGp2", "groupGp3", "y")]
dat_std <- as.data.frame(scale(dat_std))
head(dat_std)
           x1          x2   groupGp2   groupGp3          y
1  0.07624462  1.01247114 -0.5220306 -0.9206479  0.3024726
2 -0.94797747 -0.09991063 -0.5220306 -0.9206479  0.4516270
3 -0.12396901  1.43216096  1.9060186 -0.9206479  1.2844056
4 -0.72324789  0.09974764 -0.5220306 -0.9206479 -0.2169509
5 -0.17027692  0.74354368 -0.5220306  1.0807606  0.3364903
6 -0.93708149  1.56255004 -0.5220306 -0.9206479  0.9383413
Show the code
psych::describe(dat_std, range = FALSE, skew = FALSE)
         vars   n mean sd   se
x1          1 200    0  1 0.07
x2          2 200    0  1 0.07
groupGp2    3 200    0  1 0.07
groupGp3    4 200    0  1 0.07
y           5 200    0  1 0.07
Show the code
lm_out_std_wrong <- lm(y ~ x1 + x2 + groupGp2 + groupGp3, dat_std)

The following results are what is reported by common statistical programs that standardize all variables, including the dummy variables, to yield the “betas”:

Show the code
printCoefmat(summary(lm_out_std_wrong)$coefficients,
             zap.ind = 1:4,
             P.values = TRUE)
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.000000   0.048525  0.0000  1.00000    
x1          0.195980   0.049034  3.9968    9e-05 ***
x2          0.668560   0.048983 13.6488  < 2e-16 ***
groupGp2    0.188080   0.055783  3.3717  0.00090 ***
groupGp3    0.206450   0.055817  3.6987  0.00028 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Let us compare the results.

p-values

The p-values are the same, which is expected:

Show the code
coef_std <- summary(lm_out_std)$coefficients
coef_std_wrong <- summary(lm_out_std_wrong)$coefficients
round(cbind(`p-value (Std Selected)` = coef_std[, "Pr(>|t|)"],
            `p-value (Std All)` = coef_std_wrong[, "Pr(>|t|)"]), 3)
            p-value (Std Selected) p-value (Std All)
(Intercept)                  0.001             1.000
x1                           0.000             0.000
x2                           0.000             0.000
groupGp2                     0.001             0.001
groupGp3                     0.000             0.000

Coefficient Estimates

However, the coefficients for the dummy variables are different:

Show the code
round(cbind(`Estimate (Std Selected)` = coef_std[, "Estimate"],
            `Estimate (Std All)` = coef_std_wrong[, "Estimate"]), 3)
            Estimate (Std Selected) Estimate (Std All)
(Intercept)                  -0.288              0.000
x1                            0.196              0.196
x2                            0.669              0.669
groupGp2                      0.457              0.188
groupGp3                      0.413              0.206

The coefficients of groupGp2 and groupGp3, when they are standardized, are not meaningful. After standardization, they are no longer either 0 or 1.

For example, the so-called “beta” of groupGp2, 0.188, is not the difference between Gp2 and Gp1 on standardized y. It is the increase in standardized y when groupGp2 “increases by one SD”, which is meaningless.

What’s worse, the ranking of the coefficients changed. If we interpret the coefficients with dummy variables not standardized, the Gp2 vs. Gp1 difference is larger than the Gp3 vs. Gp1 difference (0.457 vs 0.413).

However, if we interpret the coefficients with dummy variables standardized, we would have a different conclusion: the Gp2 vs. Gp1 difference is smaller than the Gp3 vs. Gp1 difference (0.188 vs 0.206), though only slightly.

The reason is that there are more cases in Gp3 than in Gp2. The differences in the SDs of the dummy variables are large enough to reverse the ranking:

table(dat$group)

Gp1 Gp2 Gp3 
 65  43  92 
sd(dat$groupGp2)
[1] 0.4118533
sd(dat$groupGp3)
[1] 0.4996481

I created the data to demonstrate that this reversal in ranking is possible. This may not happen in real data. However, having groups with different numbers of cases is probably the norm rather than the exception in real data.5
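The two sets of dummy coefficients are linked by a simple rescaling: standardizing a dummy multiplies its coefficient by the SD of that dummy, and the SD of a dummy is determined only by the group proportion. This sketch checks both facts against the numbers above:

```r
# The "beta" of a standardized dummy is the correct coefficient
# (already in SDs of y) times the SD of the dummy:
0.457 * 0.4119  # approx. 0.188, the "beta" of groupGp2
0.413 * 0.4996  # approx. 0.206, the "beta" of groupGp3

# The SD of a 0/1 dummy depends only on the group proportion:
n <- 200
p_gp3 <- 92 / n
sqrt(p_gp3 * (1 - p_gp3) * n / (n - 1))  # approx. 0.4996
```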

Use t-Based Confidence Intervals

Some programs give confidence intervals for the “betas” using t statistics: standardize the variables, do the regression, and form the confidence intervals as if the standardized variables were the original data.

Let us compare the bootstrap confidence intervals with the usual OLS confidence intervals based on the t-statistics.

For the output of std_selected_boot(), we can request the t-based confidence intervals by setting type to "lm" when calling confint(), the function commonly used to form confidence intervals:

# OLS confidence intervals
round(confint(lm_out_std, type = "lm"), 3)
             2.5 % 97.5 %
(Intercept) -0.457 -0.120
x1           0.099  0.293
x2           0.572  0.765
groupGp2     0.190  0.724
groupGp3     0.193  0.634

Without setting type, the nonparametric bootstrap confidence intervals are returned (if available):

# Bootstrap Confidence Intervals
round(confint(lm_out_std), 3)
             2.5 % 97.5 %
(Intercept) -0.423 -0.164
x1           0.101  0.287
x2           0.589  0.739
groupGp2     0.225  0.686
groupGp3     0.200  0.632

As shown above, the confidence intervals of x1 by the two methods are close to each other. However, the bootstrap confidence interval of x2 is narrower than the t-based confidence interval.

We can compare the widths of the confidence intervals by examining their ratios (ratio = CI_Width_t / CI_Width_boot):

Show the code
ci_t <- confint(lm_out_std, type = "lm")
ci_b <- confint(lm_out_std)
width_t <- ci_t[, 2] - ci_t[, 1]
width_b <- ci_b[, 2] - ci_b[, 1]
ci_compare <- data.frame(CI_Width_t = width_t,
                         CI_Width_boot = width_b)
ci_compare$ratio <- ci_compare[, 1] / ci_compare[, 2]
round(ci_compare, 3)
            CI_Width_t CI_Width_boot ratio
(Intercept)      0.337         0.259 1.303
x1               0.193         0.186 1.038
x2               0.193         0.150 1.289
groupGp2         0.534         0.461 1.159
groupGp3         0.441         0.432 1.021

The widths of the confidence intervals are nearly identical for x1. However, for x2, the t-based confidence interval is about 29% wider than the bootstrap confidence interval.

For groupGp2, even though the dummy variable is (correctly) not standardized, its confidence interval can still be affected because the outcome variable, y, is standardized. Its t-based confidence interval is also wider than the bootstrap confidence interval, by about 16%.

Final Remarks

What std_selected_boot() does can be implemented in R code by researchers using base R packages only. The function was developed to make it easier for researchers to do standardization right, without adopting the all-or-none approach.
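For the curious, here is a base-R sketch of what std_selected_boot() automates, using the boot package (shipped with R). The key point is that the variables are re-standardized within each bootstrap resample, so the intervals reflect the sampling variation of the SDs. The data and variable names below are toy examples, not the post's data set.

```r
# Bootstrap percentile CIs for selectively standardized coefficients.
library(boot)
set.seed(870516)
d <- data.frame(x1 = rnorm(200),
                group = sample(c("Gp1", "Gp2"), 200, replace = TRUE))
d$y <- d$x1 + (d$group == "Gp2") + rnorm(200)

std_coef <- function(data, i) {
  di <- data[i, ]
  di$y  <- scale(di$y)[, 1]    # re-standardize within each resample
  di$x1 <- scale(di$x1)[, 1]
  coef(lm(y ~ x1 + group, di)) # group stays a 0/1 dummy
}

boot_out <- boot(d, std_coef, R = 2000)
boot.ci(boot_out, type = "perc", index = 2)  # percentile CI for x1
```

This is only a sketch: std_selected_boot() handles the bookkeeping (which variables to standardize, factors and string variables, the summary output) for you.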

More information on std_selected_boot() can be found from its help page.

Other Resources

There are other programs and functions that also correctly skip categorical variables when doing standardization. For example, the linear regression procedure in jamovi does not standardize the dummy variables formed from categorical variables when standardized estimates are requested.

For confidence intervals, other methods have also been proposed to take into account the sampling variability of the standard deviations. Examples are the methods proposed by Dudgeon (2017) and Jones & Waller (2013), in addition to the one proposed by Yuan & Chan (2011).

Revision History and Issues

The revision history of this post can be found in the GitHub history of the source file.

For issues on this post, such as corrections and mistakes, please open an issue for the GitHub repository for this blog. Thanks.

References

Cheung, S. F., Cheung, S.-H., Lau, E. Y. Y., Hui, C. H., & Vong, W. N. (2022). Improving an old way to measure moderation effect in standardized units. Health Psychology, 41(7), 502–505. https://doi.org/10.1037/hea0001188
Dudgeon, P. (2017). Some improvements in confidence intervals for standardized regression coefficients. Psychometrika, 82(4), 928–951. https://doi.org/10.1007/s11336-017-9563-z
Hayes, A. F. (2022). Introduction to mediation, moderation, and conditional process analysis: A regression-based approach (Third edition). The Guilford Press.
Jones, J. A., & Waller, N. G. (2013). Computing confidence intervals for standardized regression coefficients. Psychological Methods, 18(4), 435–453. https://doi.org/10.1037/a0033269
Yuan, K.-H., & Chan, W. (2011). Biases and standard errors of standardized regression coefficients. Psychometrika, 76(4), 670–690. https://doi.org/10.1007/s11336-011-9224-6

Footnotes

  1. x1 and x2 are intentionally generated from nonnormal distributions, for illustration. Note that OLS (ordinary least squares) regression does not assume that the predictors are multivariate normal. Using nonnormal predictors violates no assumption of OLS regression.↩︎

  2. Suppose we want to standardize only x1 and y because x2 is measured in a meaningful unit. We can set to_standardize to ~ x1 + y. The order of the variables does not matter.↩︎

  3. Note that there are indeed cases in which they are still unbiased, and cases in which the biases are negligible. See Yuan & Chan (2011) for a detailed discussion.↩︎

  4. std_selected_boot() also allows users to do only mean-centering or only scaling by standard deviation. Mean-centering and then scaling by standard deviation is equivalent to standardization.↩︎

  5. It can be argued that size differences should be taken into account. They should be, but probably not by standardizing the dummy variables and making the coefficients not interpretable.↩︎