table_one

Synopsis

"Table 1" is a common term for the first table in a paper that summarizes demographic and other individual data of the population that is being studied. In general terms, it is a table where different columns from the source table are summarized separately, stacked along the rows. The types of analysis can be chosen manually, or will be selected given the column types. Optionally, there can be grouping applied along the columns as well.

In this example, several variables of a hypothetical population are analyzed split by sex.

using SummaryTables
using DataFrames

data = DataFrame(
    sex = ["m", "m", "m", "m", "f", "f", "f", "f", "f", "f"],
    age = [27, 45, 34, 85, 55, 44, 24, 29, 37, 76],
    blood_type = ["A", "0", "B", "B", "B", "A", "0", "A", "A", "B"],
    smoker = [true, false, false, false, true, true, true, false, false, false],
)

table_one(
    data,
    [:age => "Age (years)", :blood_type => "Blood type", :smoker => "Smoker"],
    groupby = :sex => "Sex",
    show_n = true
)
                                  Sex
                   Overall        f              m
                   (n=10)         (n=6)          (n=4)
Age (years)
Mean (SD)          45.6 (20.7)    44.2 (19.1)    47.8 (25.9)
Median [Min, Max]  40.5 [24, 85]  40.5 [24, 76]  39.5 [27, 85]
Blood type
0                  2 (20%)        1 (16.7%)      1 (25%)
A                  4 (40%)        3 (50%)        1 (25%)
B                  4 (40%)        2 (33.3%)      2 (50%)
Smoker
false              6 (60%)        3 (50%)        3 (75%)
true               4 (40%)        3 (50%)        1 (25%)

Argument 1: table

The first argument can be any object that is a table compatible with the Tables.jl API. Here are some common examples:

DataFrame

using DataFrames
using SummaryTables

data = DataFrame(x = [1, 2, 3], y = ["4", "5", "6"])

table_one(data, [:x, :y])
Overall
x
Mean (SD) 2 (1)
Median [Min, Max] 2 [1, 3]
y
4 1 (33.3%)
5 1 (33.3%)
6 1 (33.3%)

NamedTuple of Vectors

using SummaryTables

data = (; x = [1, 2, 3], y = ["4", "5", "6"])

table_one(data, [:x, :y])
Overall
x
Mean (SD) 2 (1)
Median [Min, Max] 2 [1, 3]
y
4 1 (33.3%)
5 1 (33.3%)
6 1 (33.3%)

Vector of NamedTuples

using SummaryTables

data = [(; x = 1, y = "4"), (; x = 2, y = "5"), (; x = 3, y = "6")]

table_one(data, [:x, :y])
Overall
x
Mean (SD) 2 (1)
Median [Min, Max] 2 [1, 3]
y
4 1 (33.3%)
5 1 (33.3%)
6 1 (33.3%)
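
For an object that is not one of the types above, one way to check compatibility is to ask Tables.jl directly. The sketch below assumes the Tables.jl package is installed in the active environment. Tables.istable and Tables.columntable belong to Tables.jl, not to SummaryTables, and the variable names cols and rows are only illustrative.

```julia
using Tables  # assumption: Tables.jl is available in the active environment

cols = (; x = [1, 2, 3], y = ["4", "5", "6"])                        # NamedTuple of Vectors
rows = [(; x = 1, y = "4"), (; x = 2, y = "5"), (; x = 3, y = "6")]  # Vector of NamedTuples

# Ask Tables.jl whether each layout is recognized as a table
cols_ok = Tables.istable(cols)
rows_ok = Tables.istable(rows)

# Convert the row-oriented layout into the column-oriented one
converted = Tables.columntable(rows)
```

If both checks return true, either object should be usable as the first argument of table_one in the same way as in the examples above.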

Argument 2: analyses

The second argument takes a vector specifying analyses, with one entry for each "row section" of the resulting table. If only one analysis is passed, the vector can be omitted. Each analysis can have up to three parts: the variable, the analysis function and the label.

The variable is passed as a Symbol, corresponding to a column in the input data, and must always be specified. The other two parts are optional.

If you specify only variables, the analysis functions are chosen automatically based on the column types, and the labels are equal to the variable names. Numeric variables show the mean, standard deviation, median, minimum and maximum. String variables and other non-numeric variables show the count and percentage of each distinct value.

using SummaryTables

data = (; x = [1, 2, 3], y = ["a", "b", "a"])

table_one(data, [:x, :y])
Overall
x
Mean (SD) 2 (1)
Median [Min, Max] 2 [1, 3]
y
a 2 (66.7%)
b 1 (33.3%)
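
To make the default rows concrete, the numbers in the table above can be reproduced by hand. This is a minimal sketch using only the Statistics standard library; it is not how SummaryTables computes its output internally, and the names numeric_summary and categorical_summary are made up for illustration.

```julia
using Statistics

x = [1, 2, 3]
y = ["a", "b", "a"]

# Numeric column: the five statistics behind "Mean (SD)" and "Median [Min, Max]"
numeric_summary = (
    mean = mean(x),
    sd = std(x),
    median = median(x),
    min = minimum(x),
    max = maximum(x),
)

# Non-numeric column: count and percentage for each distinct value
categorical_summary = [
    (
        value = v,
        count = count(==(v), y),
        percent = round(100 * count(==(v), y) / length(y); digits = 1),
    )
    for v in sort(unique(y))
]
```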

In the next example, we rename the x variable by passing a String in a Pair.

using SummaryTables

data = (; x = [1, 2, 3], y = ["a", "b", "a"])

table_one(data, [:x => "Variable X", :y])
Overall
Variable X
Mean (SD) 2 (1)
Median [Min, Max] 2 [1, 3]
y
a 2 (66.7%)
b 1 (33.3%)

Labels can be any type except <:Function (that type signals that an analysis function has been passed). One example of a non-string label is Concat in conjunction with Superscript.

using SummaryTables

data = (; x = [1, 2, 3], y = ["a", "b", "a"])

table_one(data, [:x => Concat("X", Superscript("with superscript")), :y])
Overall
X^(with superscript)
Mean (SD) 2 (1)
Median [Min, Max] 2 [1, 3]
y
a 2 (66.7%)
b 1 (33.3%)

Any object that is a subtype of Function is assumed to be an analysis function. An analysis function takes a data column as input and returns a Tuple in which each entry corresponds to one analysis row. Each entry is a Pair with the analysis result on the left and the row label on the right.

Here is an example of a custom analysis function for a numeric column. Note the use of Concat to build content out of multiple parts. This is preferable to interpolating into a string, because interpolation discards the original objects and with them the possibility of automatic rounding or other special post-processing or display behavior.

using SummaryTables
using Statistics

data = (; x = [1, 2, 3])

function custom_analysis(column)
    (
        minimum(column) => "Minimum",
        maximum(column) => "Maximum",
        Concat(mean(column), " (", std(column), ")") => "Mean (SD)",
    )
end

table_one(data, :x => custom_analysis)
Overall
x
Minimum 1
Maximum 3
Mean (SD) 2 (1)

Finally, all three parts (variable, analysis function and label) can be combined:

using SummaryTables
using Statistics

data = (; x = [1, 2, 3])

function custom_analysis(column)
    (
        minimum(column) => "Minimum",
        maximum(column) => "Maximum",
        Concat(mean(column), " (", std(column), ")") => "Mean (SD)",
    )
end

table_one(data, :x => custom_analysis => "Variable X")
Overall
Variable X
Minimum 1
Maximum 3
Mean (SD) 2 (1)
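
The same contract (a column goes in, a Tuple of result => label Pairs comes out) can carry other statistics as well. The sketch below defines a hypothetical quartile_analysis function using only the Statistics standard library; the function name and row labels are assumptions made for illustration and are not part of SummaryTables.

```julia
using Statistics

# Hypothetical analysis function: takes a column and returns a Tuple of
# `result => label` Pairs, one Pair per analysis row
function quartile_analysis(column)
    q1, q2, q3 = quantile(column, [0.25, 0.5, 0.75])
    (
        q1 => "Q1",
        q2 => "Median",
        q3 => "Q3",
        (q3 - q1) => "IQR",
    )
end

# Calling it directly shows the structure that would be handed to the table
rows = quartile_analysis([1, 2, 3, 4, 5])
```

Passing it to table_one would follow the same pattern as custom_analysis above, for example table_one(data, :x => quartile_analysis => "Variable X").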

Keyword: groupby

The groupby keyword takes a vector of column name symbols with optional labels. If there is only one grouping column, the vector can be omitted. Each analysis is then computed separately for each group.

using SummaryTables

data = (; x = [1, 2, 3, 4, 5, 6], y = ["a", "a", "a", "b", "b", "b"])

table_one(data, :x, groupby = :y)
                               y
                   Overall     a         b
x
Mean (SD)          3.5 (1.87)  2 (1)     5 (1)
Median [Min, Max]  3.5 [1, 6]  2 [1, 3]  5 [4, 6]

In this example, we rename the grouping column:

using SummaryTables

data = (; x = [1, 2, 3, 4, 5, 6], y = ["a", "a", "a", "b", "b", "b"])

table_one(data, :x, groupby = :y => "Column Y")
                               Column Y
                   Overall     a         b
x
Mean (SD)          3.5 (1.87)  2 (1)     5 (1)
Median [Min, Max]  3.5 [1, 6]  2 [1, 3]  5 [4, 6]

If there are multiple grouping columns, they are shown in a nested fashion, with the first group at the highest level:

using SummaryTables

data = (;
    x = [1, 2, 3, 4, 5, 6],
    y = ["a", "a", "b", "b", "c", "c"],
    z = ["d", "e", "d", "e", "d", "e"],
)

table_one(data, :x, groupby = [:y, :z => "Column Z"])
                               y
                               a                   b                   c
                               Column Z            Column Z            Column Z
                   Overall     d         e         d         e         d         e
x
Mean (SD)          3.5 (1.87)  1 (NaN)   2 (NaN)   3 (NaN)   4 (NaN)   5 (NaN)   6 (NaN)
Median [Min, Max]  3.5 [1, 6]  1 [1, 1]  2 [2, 2]  3 [3, 3]  4 [4, 4]  5 [5, 5]  6 [6, 6]
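
The NaN entries in this table are expected: every combination of y and z contains exactly one observation, and the sample standard deviation divides by n - 1, which is zero for a single value. The following sketch reproduces this with the Statistics standard library alone. It assumes the default summary uses the same sample standard deviation as Statistics.std, which the output suggests but this page does not state.

```julia
using Statistics

single = [1]                   # one observation, like each y/z cell in the table
overall = [1, 2, 3, 4, 5, 6]   # all observations together

# Sample standard deviation; for a single value the n - 1 denominator is zero
sd_single = std(single)
sd_overall = std(overall)

# Round to three significant digits for comparison with the "Overall" column
sd_overall_rounded = round(sd_overall; sigdigits = 3)
```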

Keyword: show_n

When show_n is set to true, the size of each group is shown under its name.

using SummaryTables

data = (; x = [1, 2, 3, 4, 5, 6], y = ["a", "a", "a", "a", "b", "b"])

table_one(data, :x, groupby = :y, show_n = true)
                               y
                   Overall     a           b
                   (n=6)       (n=4)       (n=2)
x
Mean (SD)          3.5 (1.87)  2.5 (1.29)  5.5 (0.707)
Median [Min, Max]  3.5 [1, 6]  2.5 [1, 4]  5.5 [5, 6]

Keyword: show_overall

When show_overall is set to false, the column summarizing all groups together is hidden. Use this only when groupby is set; otherwise the resulting table will be empty.

using SummaryTables

data = (; x = [1, 2, 3, 4, 5, 6], y = ["a", "a", "a", "a", "b", "b"])

table_one(data, :x, groupby = :y, show_overall = false)
                   y
                   a           b
x
Mean (SD)          2.5 (1.29)  5.5 (0.707)
Median [Min, Max]  2.5 [1, 4]  5.5 [5, 6]

Keyword: sort

By default, group entries are sorted. If you need to maintain the order of entries from your dataset, set sort = false.

In the following two examples, the group labels appear as "dos", "tres", "uno" when sorted, but as "uno", "dos", "tres" when not sorted. These are Spanish for "two", "three" and "one", so their alphabetical order differs from their natural order ("uno", "dos", "tres"). To preserve the natural order, set sort = false.

using SummaryTables

data = (; x = [1, 2, 3, 4, 5, 6], y = ["uno", "uno", "dos", "dos", "tres", "tres"])

table_one(data, :x, groupby = :y)
                               y
                   Overall     dos          tres         uno
x
Mean (SD)          3.5 (1.87)  3.5 (0.707)  5.5 (0.707)  1.5 (0.707)
Median [Min, Max]  3.5 [1, 6]  3.5 [3, 4]   5.5 [5, 6]   1.5 [1, 2]

table_one(data, :x, groupby = :y, sort = false)
                               y
                   Overall     uno          dos          tres
x
Mean (SD)          3.5 (1.87)  1.5 (0.707)  3.5 (0.707)  5.5 (0.707)
Median [Min, Max]  3.5 [1, 6]  1.5 [1, 2]   3.5 [3, 4]   5.5 [5, 6]

Warning

If you have multiple groups, sort = false can lead to splitting of higher-level groups if they are not correctly ordered in the source data.

Compare the following two tables. In the second one, the group "A" is split by "B" so the label appears twice.

using SummaryTables

data = (; x = [1, 2, 3, 4, 5, 6], y = ["A", "A", "B", "B", "B", "A"], z = ["C", "C", "C", "D", "D", "D"])

table_one(data, :x, groupby = [:y, :z])
                               y
                               A                         B
                               z                         z
                   Overall     C            D            C            D
x
Mean (SD)          3.5 (1.87)  1.5 (0.707)  6 (NaN)      3 (NaN)      4.5 (0.707)
Median [Min, Max]  3.5 [1, 6]  1.5 [1, 2]   6 [6, 6]     3 [3, 3]     4.5 [4, 5]

table_one(data, :x, groupby = [:y, :z], sort = false)
                               y
                               A            B                         A
                               z            z                         z
                   Overall     C            C            D            D
x
Mean (SD)          3.5 (1.87)  1.5 (0.707)  3 (NaN)      4.5 (0.707)  6 (NaN)
Median [Min, Max]  3.5 [1, 6]  1.5 [1, 2]   3 [3, 3]     4.5 [4, 5]   6 [6, 6]
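
One way to avoid such a split while still using sort = false is to reorder the rows beforehand so that each top-level group is contiguous, with the groups kept in order of first appearance. The sketch below does this with Base functions only. The names rank, perm and regrouped are illustrative, and the table layout described afterwards is an expectation, not verified output.

```julia
data = (;
    x = [1, 2, 3, 4, 5, 6],
    y = ["A", "A", "B", "B", "B", "A"],
    z = ["C", "C", "C", "D", "D", "D"],
)

# Rank each top-level group by the position of its first appearance
rank = Dict(g => i for (i, g) in enumerate(unique(data.y)))

# Stable sort by rank: rows keep their original relative order within a group
perm = sortperm([rank[g] for g in data.y]; alg = MergeSort)

# Apply the same permutation to every column of the NamedTuple
regrouped = map(col -> col[perm], data)
```

Calling table_one(regrouped, :x, groupby = [:y, :z], sort = false) should then show a single "A" header followed by a single "B" header, because all rows of each top-level group are now adjacent.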