```r
# Create a Test Dataset
set.seed(586045)
n <- 100
dat <- data.frame(x1 = rnorm(n, 5, 1),
                  x2 = rnorm(n, 10, 2))
dat$y <- 2 * dat$x1 + 1 * dat$x2 + rnorm(n, 0, 15)
# Do regression
lm_out <- lm(y ~ x1 + x2, dat)
```
I have been thinking about a difference between R and SPSS in doing an analysis: one function for one analysis, versus several functions for one analysis.
Though not always the case, in R it is common to do an analysis using several functions. One of them is the “main” function that does the main analysis. Other functions are used to extract information or compute further statistics.
For example, to do multiple regression, this is what we may do:
The main analysis is done by `lm()`. We then use other functions on the output of `lm()`. For example, we can use `summary()` to print commonly requested results:
```r
summary(lm_out)
```

```
Call:
lm(formula = y ~ x1 + x2, data = dat)

Residuals:
    Min      1Q  Median      3Q     Max
-36.807 -10.733   0.153   9.472  37.611

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -20.7634    12.0270  -1.726  0.08746 .
x1            2.8688     1.6483   1.740  0.08495 .
x2            2.5771     0.8555   3.012  0.00331 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.69 on 97 degrees of freedom
Multiple R-squared:  0.1113,    Adjusted R-squared:  0.09295
F-statistic: 6.072 on 2 and 97 DF,  p-value: 0.003276
```
Confidence intervals and the variance-covariance matrix of the estimates can be obtained by `confint()` and `vcov()`:
```r
confint(lm_out)
```

```
                  2.5 %   97.5 %
(Intercept) -44.6335974 3.106785
x1           -0.4025881 6.140222
x2            0.8791251 4.275063
```
```r
vcov(lm_out)
```

```
            (Intercept)            x1           x2
(Intercept)  144.647827 -13.347713331 -7.460851584
x1           -13.347713   2.716868988 -0.005532551
x2            -7.460852  -0.005532551  0.731913160
```
There are many other functions for additional statistics, such as influence statistics and model comparison.
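For instance, a quick sketch using base R functions on the model above (the reduced model below is added here just for illustration):

```r
# Influence statistics: Cook's distance for the first few cases
head(cooks.distance(lm_out))

# Model comparison: compare the full model with a reduced model
lm_reduced <- lm(y ~ x1, dat)
anova(lm_reduced, lm_out)
```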
In SPSS, to do an analysis, we usually open a dialog box from the pull-down menu, select variables, check some checkboxes, use some buttons to open other dialog boxes and set further options, and click OK, and all the requested results appear in the output.
I used to think that this approach was due to the graphical user interface (GUI), which is a strength of SPSS. I forgot that (a) the GUI is a “syntax generator,” and (b) the format of SPSS syntax we have nowadays is very similar to that of SPSS before it had a GUI. Actually, when I first learned SPSS in the 1990s, I did not even have access to a PC version with a text menu. Syntax commands were the only way to do an analysis in SPSS. For example, `REGRESSION` is the command, and all the checkboxes and options are values of its subcommands, like arguments in R functions.
So, the common way we do an analysis in SPSS, with one command for one analysis, is not due to the GUI. It has always been this way, at least in the version I used in the early 1990s, before systems like Windows became popular.
So, for an analysis such as multiple regression: one function, or many functions?
When I write functions or develop packages, I generally adopt the do-one-thing-and-do-it-well principle, though what constitutes “one thing” is not always clear. This principle makes it easy for me to write, debug, and maintain a function or package.
However, for users who are used to a GUI, using one function to do many things in an analysis is conceptually similar to using a dialog box, though without the dialog box. The many-function approach does not fit well with the experience of using a dialog box.
In R, we certainly can write a function that calls other functions, simulating commands like `REGRESSION` in SPSS.
So, I think this is not a debate about which approach is better. In R, we can do both, and let users do an analysis in whichever approach they like. For development, the do-one-thing-and-do-it-well approach is better. However, for users, especially when developing a GUI, the one-function approach may be more convenient. The function in the one-function approach, like `REGRESSION`, is a wrapper around a collection of functions: an interface to them.
For example, we can write an R function similar to `REGRESSION` in SPSS. In SPSS, if all the default options are what we need, this command is sufficient:
```
REGRESSION
  /DEPENDENT y
  /ENTER x1 x2.
```
To request confidence intervals (`confint()` in R) and the variance-covariance matrix of the estimates (`vcov()` in R), this will do:
```
REGRESSION
  /STATISTICS DEFAULT BCOV CI(95)
  /DEPENDENT y
  /ENTER x1 x2.
```
A similar function can be written in R:
```r
regression(data = dat,
           dep = "y",
           ivs = c("x1", "x2"))
```
We can write it in a more “R” way:

```r
regression(data = dat,
           model = y ~ x1 + x2)
```
The default printout can be something similar to SPSS output. The output can be a list of tables (data frames), with a print method for printing them.
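As a rough sketch of how such a wrapper could work (the function `regression()` and its arguments here are hypothetical, matching the calls above; it is not an existing package function):

```r
# A minimal sketch of a hypothetical regression() wrapper:
# an interface to lm(), summary(), confint(), and vcov().
regression <- function(data, model = NULL, dep = NULL, ivs = NULL,
                       ci_level = .95) {
  # Accept either a formula (model) or variable names (dep, ivs)
  if (is.null(model)) {
    model <- reformulate(ivs, response = dep)
  }
  fit <- lm(model, data = data)
  out <- list(coefficients = summary(fit)$coefficients,
              conf_int     = confint(fit, level = ci_level),
              vcov         = vcov(fit))
  class(out) <- "regression"
  out
}

# A simple print method that displays the tables one by one
print.regression <- function(x, ...) {
  for (nm in names(x)) {
    cat("----", nm, "----\n")
    print(x[[nm]])
    cat("\n")
  }
  invisible(x)
}
```

Note that the wrapper itself does almost nothing: each piece of work is still done by a function that does one thing.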
Actually, we can still say that we are adopting the do-one-thing-and-do-it-well approach, although the “one thing” is “an interface to a set of functions.”
I am not trying to argue that we should use this or that approach. They are not mutually exclusive. I am just wondering how to make doing analysis in R by writing scripts more accessible to users who are used to a GUI, while still keeping the do-one-thing-and-do-it-well principle. Writing these kinds of wrappers may also make it easier to create GUIs for them. For example, as long as `...` is not used, a generic function can be developed to inspect the arguments of a function from its definition and then automatically generate a dialog box for it (see the sketch below). For a wrapper with a lot of arguments, a configuration file can be used to customize the dialog box.
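A minimal sketch of the argument-inspection step, assuming the hypothetical `regression()` wrapper above (`describe_args()` is likewise a made-up name, and generating the actual dialog box is beyond this sketch):

```r
# Inspect a function's arguments from its definition.
# This is the information a dialog-box generator would need.
describe_args <- function(f) {
  args <- formals(f)
  # `...` cannot be enumerated, so refuse it, as discussed above
  if ("..." %in% names(args)) {
    stop("Functions with `...` are not supported.")
  }
  data.frame(argument = names(args),
             default  = vapply(args, deparse1, character(1)),
             row.names = NULL)
}

# One row per argument, with its default value (if any)
describe_args(regression)
```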
P.S.: jamovi is already doing something similar. Behind the dialog boxes are, in effect, wrapper functions. However, though they can be used in the console, the modules are, naturally, meant to be used inside jamovi.