Almost every biomedical research paper requires a “Table 1: baseline patient characteristics.” Many developers have published tools to help streamline the construction of such tables. The qwraps2::summary_table function is my contribution to the toolbox.
I have constructed hundreds of Table 1s while working as a biostatistics consultant in a biostatistics department. I’ve also constructed many helper functions to try to streamline the production of such tables. However, every project, every data set, every lead author, every target journal, etc., will present slightly different requirements for the contents and formatting of the tables. As such, functions which tried to “do it all” required constant modification to provide the needed output for each nuanced project.
The tableone package does a lot of good things, and is a great tool for quickly building the baseline summary tables. What my experience has taught me is that each row group, or even each row, might require some specific formatting. A function that treats all continuous variables one way and all categorical variables another way may work for many cases, but not all.
The approach I now take to building the tables is to explicitly define the summary statistics I want for each variable in the data set, along with the formatting of those statistics, in a way that is easy to use with one or more grouping variables.
The function summary_table within my qwraps2 package is the tool I and a few colleagues have started to rely on for building baseline patient characteristic tables. (qwraps2, “quick wraps 2”, is a package of formatting functions I’ve found useful for formatting results and generating some graphics when authoring .Rmd and .Rnw files.)
Load and attach the qwraps2 namespace. We’ll set the qwraps2_markup option to markdown. If this option is not set, qwraps2 uses getOption("qwraps2_markup", "latex"), so LaTeX is the default markup language.
library(qwraps2)
options(qwraps2_markup = 'markdown') # default is latex
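The fallback behaviour described above comes from base R's getOption, which returns its second argument when the option is unset. A minimal sketch using only base R (the variable names are illustrative and not part of qwraps2):

```r
# Temporarily clear the option, remembering the previous value.
old <- options(qwraps2_markup = NULL)

# With the option unset, the supplied default is returned.
markup_unset <- getOption("qwraps2_markup", "latex")

# Once the option is set, its value takes precedence over the default.
options(qwraps2_markup = "markdown")
markup_set <- getOption("qwraps2_markup", "latex")

# Restore whatever was set before this sketch ran.
options(old)
```

Here markup_unset holds "latex" and markup_set holds "markdown".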
We’ll use the mtcars dataset for our examples. Let’s report several summary statistics for miles per gallon, number of cylinders, and weight of the vehicles. The following summary is provided to illustrate the functions and thus will include some summaries that would not be used in a publication.
The data summary we want will be:
- Miles Per Gallon
  - min
  - mean (sd)
  - median (IQR)
  - max
- Cylinders
  - mean
  - n (%) of four-cylinder engines
  - n (%) of six-cylinder engines
  - n (%) of eight-cylinder engines
- Weight
  - range
For cylinders we’ll report several things: the mean number of cylinders, and the count (%) of 4-, 6-, and 8-cylinder cars. In a publication we would likely not report such a summary, treating cylinders as both a continuous and a categorical value. However, doing so here helps to illustrate the flexibility of the summary_table method.
Outlining the wanted summary statistics above as a list-of-lists helps to explain the construction of the summary object below. The summary_table method takes two arguments: .data, a data.frame or a grouped_df object, and summaries, a list-of-lists of right-hand-sided formulae defining the summary statistics.
The construction of the summary table is achieved via dplyr::summarize_.
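To make the idea of a right-hand-sided formula concrete: such a formula stores an unevaluated expression that can later be evaluated against any data set. The sketch below uses only base R and is not how summary_table is implemented (that goes through dplyr::summarize_); the variable names are illustrative.

```r
# A one-sided formula has two parts: the `~` operator and its right-hand side.
f <- ~ min(mpg)
rhs <- f[[2]]  # the unevaluated call min(mpg)

# Evaluate the stored expression with the columns of mtcars in scope.
overall_min <- eval(rhs, envir = mtcars)

# The same expression can be reused on each subset of a split data frame.
min_by_am <- sapply(split(mtcars, mtcars$am),
                    function(d) eval(rhs, envir = d))
```

Because the expression is only evaluated when data is supplied, one summary definition can serve the whole data set and every group within it.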
The mtcar_summaries object constructed below defines each needed row of the summary table via a formula. I’ve included the qwraps2 namespace for clarity.
# Outer names are the row groups; inner names are the row labels.
mtcar_summaries <-
  list("Miles Per Gallon" =
         list("min:"         = ~ min(mpg),
              "mean (sd)"    = ~ qwraps2::mean_sd(mpg, denote_sd = "paren"),
              "median (IQR)" = ~ qwraps2::median_iqr(mpg),
              "max:"         = ~ max(mpg)),
       "Cylinders:" =
         list("mean"             = ~ mean(cyl),
              "mean (formatted)" = ~ qwraps2::frmt(mean(cyl)),
              "4 cyl, n (%)"     = ~ qwraps2::n_perc0(cyl == 4),
              "6 cyl, n (%)"     = ~ qwraps2::n_perc0(cyl == 6),
              "8 cyl, n (%)"     = ~ qwraps2::n_perc0(cyl == 8)),
       "Weight" =
         list("Range" = ~ paste(range(wt), collapse = ", "))
  )
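Each formula above must return a single formatted value per group. For readers without qwraps2 at hand, here is a rough base R sketch of what two of the helpers produce. The functions mean_sd_sketch and n_perc0_sketch are hypothetical stand-ins written for this illustration; they are not the qwraps2 implementations and do not cover their options.

```r
# Hypothetical stand-in: "mean (sd)" with a fixed number of decimals.
mean_sd_sketch <- function(x, digits = 2) {
  fmt <- paste0("%.", digits, "f (%.", digits, "f)")
  sprintf(fmt, mean(x), sd(x))
}

# Hypothetical stand-in: "n (%)" for a logical vector, percent with no decimals.
n_perc0_sketch <- function(x) {
  sprintf("%d (%.0f)", sum(x), 100 * mean(x))
}

mpg_summary  <- mean_sd_sketch(mtcars$mpg)
four_cyl     <- n_perc0_sketch(mtcars$cyl == 4)
weight_range <- paste(range(mtcars$wt), collapse = ", ")
```

The point of the design is that any expression returning a length-one value, formatted however the project requires, can serve as a table row.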
The table is constructed and printed with ease:
summary_table(mtcars, mtcar_summaries)
##
##
## | |mtcars (N = 32) |
## |:-----------------------------|:--------------------|
## |**Miles Per Gallon** | |
## | min: |10.4 |
## | mean (sd) |20.09 (6.03) |
## | median (IQR) |19.20 (15.43, 22.80) |
## | max: |33.9 |
## |**Cylinders:** | |
## | mean |6.1875 |
## | mean (formatted) |6.19 |
## | 4 cyl, n (%) |11 (34) |
## | 6 cyl, n (%) |7 (22) |
## | 8 cyl, n (%) |14 (44) |
## |**Weight** | |
## | Range |1.513, 5.424 |
The markdown output, rendered as HTML, is:
| | mtcars (N = 32) |
|---|---|
| **Miles Per Gallon** | |
| min: | 10.4 |
| mean (sd) | 20.09 (6.03) |
| median (IQR) | 19.20 (15.43, 22.80) |
| max: | 33.9 |
| **Cylinders:** | |
| mean | 6.1875 |
| mean (formatted) | 6.19 |
| 4 cyl, n (%) | 11 (34) |
| 6 cyl, n (%) | 7 (22) |
| 8 cyl, n (%) | 14 (44) |
| **Weight** | |
| Range | 1.513, 5.424 |
To extend the table to show the same summary by a grouping variable, here am (transmission: 0 = automatic, 1 = manual), pass a grouped data frame:
summary_table(dplyr::group_by(mtcars, am), mtcar_summaries)
##
##
## | |am: 0 (N = 19) |am: 1 (N = 13) |
## |:-----------------------------|:--------------------|:--------------------|
## |**Miles Per Gallon** | | |
## | min: |10.4 |15.0 |
## | mean (sd) |17.15 (3.83) |24.39 (6.17) |
## | median (IQR) |17.30 (14.95, 19.20) |22.80 (21.00, 30.40) |
## | max: |24.4 |33.9 |
## |**Cylinders:** | | |
## | mean |6.947368 |5.076923 |
## | mean (formatted) |6.95 |5.08 |
## | 4 cyl, n (%) |3 (16) |8 (62) |
## | 6 cyl, n (%) |4 (21) |3 (23) |
## | 8 cyl, n (%) |12 (63) |2 (15) |
## |**Weight** | | |
## | Range |2.465, 5.424 |1.513, 3.57 |
| | am: 0 (N = 19) | am: 1 (N = 13) |
|---|---|---|
| **Miles Per Gallon** | | |
| min: | 10.4 | 15.0 |
| mean (sd) | 17.15 (3.83) | 24.39 (6.17) |
| median (IQR) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
| max: | 24.4 | 33.9 |
| **Cylinders:** | | |
| mean | 6.947368 | 5.076923 |
| mean (formatted) | 6.95 | 5.08 |
| 4 cyl, n (%) | 3 (16) | 8 (62) |
| 6 cyl, n (%) | 4 (21) | 3 (23) |
| 8 cyl, n (%) | 12 (63) | 2 (15) |
| **Weight** | | |
| Range | 2.465, 5.424 | 1.513, 3.57 |
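As a sanity check on the grouped columns, the group sizes and the mean (sd) row for miles per gallon can be reproduced with base R alone. This is only a cross-check sketch, not part of the qwraps2 workflow, and the variable names are illustrative.

```r
# Number of vehicles per transmission type (0 = automatic, 1 = manual).
n_by_am <- table(mtcars$am)

# Mean and standard deviation of mpg within each transmission type.
mean_mpg_by_am <- as.vector(tapply(mtcars$mpg, mtcars$am, mean))
sd_mpg_by_am   <- as.vector(tapply(mtcars$mpg, mtcars$am, sd))

# Format as "mean (sd)" with two decimals, matching the table row.
mean_sd_by_am <- sprintf("%.2f (%.2f)", mean_mpg_by_am, sd_mpg_by_am)
```

The counts in n_by_am correspond to the N values in the column headers, and mean_sd_by_am to the mean (sd) row.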
Lastly, one table with a column for the whole data set and a column for each transmission type is built with cbind:
cbind(summary_table(mtcars, mtcar_summaries),
summary_table(dplyr::group_by(mtcars, am), mtcar_summaries))
##
##
## | |mtcars (N = 32) |am: 0 (N = 19) |am: 1 (N = 13) |
## |:-----------------------------|:--------------------|:--------------------|:--------------------|
## |**Miles Per Gallon** | | | |
## | min: |10.4 |10.4 |15.0 |
## | mean (sd) |20.09 (6.03) |17.15 (3.83) |24.39 (6.17) |
## | median (IQR) |19.20 (15.43, 22.80) |17.30 (14.95, 19.20) |22.80 (21.00, 30.40) |
## | max: |33.9 |24.4 |33.9 |
## |**Cylinders:** | | | |
## | mean |6.1875 |6.947368 |5.076923 |
## | mean (formatted) |6.19 |6.95 |5.08 |
## | 4 cyl, n (%) |11 (34) |3 (16) |8 (62) |
## | 6 cyl, n (%) |7 (22) |4 (21) |3 (23) |
## | 8 cyl, n (%) |14 (44) |12 (63) |2 (15) |
## |**Weight** | | | |
## | Range |1.513, 5.424 |2.465, 5.424 |1.513, 3.57 |
| | mtcars (N = 32) | am: 0 (N = 19) | am: 1 (N = 13) |
|---|---|---|---|
| **Miles Per Gallon** | | | |
| min: | 10.4 | 10.4 | 15.0 |
| mean (sd) | 20.09 (6.03) | 17.15 (3.83) | 24.39 (6.17) |
| median (IQR) | 19.20 (15.43, 22.80) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
| max: | 33.9 | 24.4 | 33.9 |
| **Cylinders:** | | | |
| mean | 6.1875 | 6.947368 | 5.076923 |
| mean (formatted) | 6.19 | 6.95 | 5.08 |
| 4 cyl, n (%) | 11 (34) | 3 (16) | 8 (62) |
| 6 cyl, n (%) | 7 (22) | 4 (21) | 3 (23) |
| 8 cyl, n (%) | 14 (44) | 12 (63) | 2 (15) |
| **Weight** | | | |
| Range | 1.513, 5.424 | 2.465, 5.424 | 1.513, 3.57 |
Using dplyr::group_by allows you to build the table with more than one grouping variable. For example:
cbind(summary_table(mtcars, mtcar_summaries),
summary_table(dplyr::group_by(mtcars, am, vs), mtcar_summaries))
##
##
## | |mtcars (N = 32) |am: 0 vs: 0 (N = 12) |am: 0 vs: 1 (N = 7) |am: 1 vs: 0 (N = 6) |am: 1 vs: 1 (N = 7) |
## |:-----------------------------|:--------------------|:--------------------|:--------------------|:--------------------|:--------------------|
## |**Miles Per Gallon** | | | | | |
## | min: |10.4 |10.4 |17.8 |15.0 |21.4 |
## | mean (sd) |20.09 (6.03) |15.05 (2.77) |20.74 (2.47) |19.75 (4.01) |28.37 (4.76) |
## | median (IQR) |19.20 (15.43, 22.80) |15.20 (14.05, 16.62) |21.40 (18.65, 22.15) |20.35 (16.78, 21.00) |30.40 (25.05, 31.40) |
## | max: |33.9 |19.2 |24.4 |26.0 |33.9 |
## |**Cylinders:** | | | | | |
## | mean |6.1875 |8.000000 |5.142857 |6.333333 |4.000000 |
## | mean (formatted) |6.19 |8.00 |5.14 |6.33 |4.00 |
## | 4 cyl, n (%) |11 (34) |0 (0) |3 (43) |1 (17) |7 (100) |
## | 6 cyl, n (%) |7 (22) |0 (0) |4 (57) |3 (50) |0 (0) |
## | 8 cyl, n (%) |14 (44) |12 (100) |0 (0) |2 (33) |0 (0) |
## |**Weight** | | | | | |
## | Range |1.513, 5.424 |3.435, 5.424 |2.465, 3.46 |2.14, 3.57 |1.513, 2.78 |
| | mtcars (N = 32) | am: 0 vs: 0 (N = 12) | am: 0 vs: 1 (N = 7) | am: 1 vs: 0 (N = 6) | am: 1 vs: 1 (N = 7) |
|---|---|---|---|---|---|
| **Miles Per Gallon** | | | | | |
| min: | 10.4 | 10.4 | 17.8 | 15.0 | 21.4 |
| mean (sd) | 20.09 (6.03) | 15.05 (2.77) | 20.74 (2.47) | 19.75 (4.01) | 28.37 (4.76) |
| median (IQR) | 19.20 (15.43, 22.80) | 15.20 (14.05, 16.62) | 21.40 (18.65, 22.15) | 20.35 (16.78, 21.00) | 30.40 (25.05, 31.40) |
| max: | 33.9 | 19.2 | 24.4 | 26.0 | 33.9 |
| **Cylinders:** | | | | | |
| mean | 6.1875 | 8.000000 | 5.142857 | 6.333333 | 4.000000 |
| mean (formatted) | 6.19 | 8.00 | 5.14 | 6.33 | 4.00 |
| 4 cyl, n (%) | 11 (34) | 0 (0) | 3 (43) | 1 (17) | 7 (100) |
| 6 cyl, n (%) | 7 (22) | 0 (0) | 4 (57) | 3 (50) | 0 (0) |
| 8 cyl, n (%) | 14 (44) | 12 (100) | 0 (0) | 2 (33) | 0 (0) |
| **Weight** | | | | | |
| Range | 1.513, 5.424 | 3.435, 5.424 | 2.465, 3.46 | 2.14, 3.57 | 1.513, 2.78 |
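With two grouping variables, each column corresponds to one combination of am and vs, and the header reports that combination's size. The base R sketch below rebuilds those header labels from a cross-tabulation. It illustrates where the N values come from and makes no claim about how qwraps2 constructs the headers internally; the object names are illustrative.

```r
# Cross-tabulate the two grouping variables and flatten to one row per cell.
counts <- as.data.frame(with(mtcars, table(am, vs)), responseName = "n")

# Order by am, then vs, to match the column order of the table.
counts <- counts[order(counts$am, counts$vs), ]

# Build labels of the form "am: 0 vs: 0 (N = 12)".
column_labels <- sprintf("am: %s vs: %s (N = %d)",
                         as.character(counts$am),
                         as.character(counts$vs),
                         counts$n)
```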
The formatting of the output is controlled by the qwraps2:::print.qwraps2_summary_table and qwraps2::qable functions.
args(qwraps2:::print.qwraps2_summary_table)
## function (x, rgroup = attr(x, "rgroups"), rnames = rownames(x),
## cnames = colnames(x), ...)
## NULL
args(qwraps2::qable)
## function (x, rtitle, rgroup, rnames = rownames(x), cnames = colnames(x),
## markup = getOption("qwraps2_markup", "latex"), ...)
## NULL
The print method for qwraps2_summary_table objects calls qable, which is a wrapper around knitr::kable. The row groups, rgroup, row names, rnames, and column names, cnames, are explicitly set in the print.qwraps2_summary_table method. The ... passes additional arguments to qwraps2::qable, which can then continue to pass them on to knitr::kable.
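The way ... carries arguments down a chain of wrappers can be seen in a toy example. The three functions below are hypothetical stand-ins for the print method, qable, and kable respectively; they only paste strings together and share nothing with the real functions beyond the forwarding pattern.

```r
# Innermost function: consumes `align`, ignores anything else in `...`.
render_sketch <- function(x, align = "l", ...) {
  paste0(x, " [align = ", align, "]")
}

# Middle wrapper: consumes `rtitle`, forwards the rest.
table_sketch <- function(x, rtitle = "", ...) {
  paste0(rtitle, " | ", render_sketch(x, ...))
}

# Outer wrapper: names neither argument, so both travel through `...`.
print_sketch <- function(x, ...) {
  table_sketch(x, ...)
}

result <- print_sketch("by_am",
                       rtitle = "Vehicle Characteristics",
                       align = "lcc")
```

Each layer picks off the arguments it names and passes the remainder along, which is why arguments such as rtitle and align can be supplied at the outermost call.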
A quick example of modifying a table:
by_am <- summary_table(dplyr::group_by(mtcars, am), mtcar_summaries)

# Columns are ordered am: 0 (automatic), then am: 1 (manual).
print(by_am,
      cnames = c("_Automatic_", "_Manual_"),
      rtitle = "Vehicle Characteristics",
      align = "lcc")
##
##
## |Vehicle Characteristics | _Automatic_ | _Manual_ |
## |:-----------------------------|:--------------------:|:--------------------:|
## |**Miles Per Gallon** | | |
## | min: | 10.4 | 15.0 |
## | mean (sd) | 17.15 (3.83) | 24.39 (6.17) |
## | median (IQR) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
## | max: | 24.4 | 33.9 |
## |**Cylinders:** | | |
## | mean | 6.947368 | 5.076923 |
## | mean (formatted) | 6.95 | 5.08 |
## | 4 cyl, n (%) | 3 (16) | 8 (62) |
## | 6 cyl, n (%) | 4 (21) | 3 (23) |
## | 8 cyl, n (%) | 12 (63) | 2 (15) |
## |**Weight** | | |
## | Range | 2.465, 5.424 | 1.513, 3.57 |
| Vehicle Characteristics | _Automatic_ | _Manual_ |
|---|---|---|
| **Miles Per Gallon** | | |
| min: | 10.4 | 15.0 |
| mean (sd) | 17.15 (3.83) | 24.39 (6.17) |
| median (IQR) | 17.30 (14.95, 19.20) | 22.80 (21.00, 30.40) |
| max: | 24.4 | 33.9 |
| **Cylinders:** | | |
| mean | 6.947368 | 5.076923 |
| mean (formatted) | 6.95 | 5.08 |
| 4 cyl, n (%) | 3 (16) | 8 (62) |
| 6 cyl, n (%) | 4 (21) | 3 (23) |
| 8 cyl, n (%) | 12 (63) | 2 (15) |
| **Weight** | | |
| Range | 2.465, 5.424 | 1.513, 3.57 |
I hope that some readers will find this approach to building summary tables useful. If you find bugs or have suggestions on how to extend and improve this tool, please create an issue on GitHub.