overview_table

Synopsis

The overview_table function gives a quick overview of the columns in a dataset. It displays different statistical summaries depending on the type of each column, including bar graphs that give a quick sense of how the values are distributed. It is styled after the dfSummary function from R's summarytools package.

```julia
using SummaryTables
using RDatasets

# Load the diamonds dataset from the ggplot2 collection in RDatasets
df = dataset("ggplot2", "diamonds")

overview_table(df)
```

| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | Carat [Float64] | Mean (sd): 0.798 (0.474) | 273 distinct values | | 53940 (100%) | 0 (0%) |
| | | min ≤ med ≤ max: 0.2 ≤ 0.7 ≤ 5.01 | | | | |
| | | IQR (CV): 0.64 (0.594) | | | | |
| 2 | Cut [CategoricalValue{String, UInt8}] | 1. Ideal | 21551 (40%) | | 53940 (100%) | 0 (0%) |
| | | 2. Premium | 13791 (25.6%) | | | |
| | | 3. Very Good | 12082 (22.4%) | | | |
| | | 4. Good | 4906 (9.1%) | | | |
| | | 5. Fair | 1610 (3%) | | | |
| 3 | Color [CategoricalValue{String, UInt8}] | 1. G | 11292 (20.9%) | | 53940 (100%) | 0 (0%) |
| | | 2. E | 9797 (18.2%) | | | |
| | | 3. F | 9542 (17.7%) | | | |
| | | 4. H | 8304 (15.4%) | | | |
| | | 5. D | 6775 (12.6%) | | | |
| | | 6. I | 5422 (10.1%) | | | |
| | | 7. J | 2808 (5.2%) | | | |
| 4 | Clarity [CategoricalValue{String, UInt8}] | 1. SI1 | 13065 (24.2%) | | 53940 (100%) | 0 (0%) |
| | | 2. VS2 | 12258 (22.7%) | | | |
| | | 3. SI2 | 9194 (17%) | | | |
| | | 4. VS1 | 8171 (15.1%) | | | |
| | | 5. VVS2 | 5066 (9.4%) | | | |
| | | 6. VVS1 | 3655 (6.8%) | | | |
| | | 7. IF | 1790 (3.3%) | | | |
| | | 8. I1 | 741 (1.4%) | | | |
| 5 | Depth [Float64] | Mean (sd): 61.7 (1.43) | 184 distinct values | | 53940 (100%) | 0 (0%) |
| | | min ≤ med ≤ max: 43 ≤ 61.8 ≤ 79 | | | | |
| | | IQR (CV): 1.5 (0.0232) | | | | |
| 6 | Table [Float64] | Mean (sd): 57.5 (2.23) | 127 distinct values | | 53940 (100%) | 0 (0%) |
| | | min ≤ med ≤ max: 43 ≤ 57 ≤ 95 | | | | |
| | | IQR (CV): 3 (0.0389) | | | | |
| 7 | Price [Int32] | Mean (sd): 3933 (3989) | 11602 distinct values | | 53940 (100%) | 0 (0%) |
| | | min ≤ med ≤ max: 326 ≤ 2401 ≤ 18823 | | | | |
| | | IQR (CV): 4374 (1.01) | | | | |
| 8 | X [Float64] | Mean (sd): 5.73 (1.12) | 554 distinct values | | 53940 (100%) | 0 (0%) |
| | | min ≤ med ≤ max: 0 ≤ 5.7 ≤ 10.7 | | | | |
| | | IQR (CV): 1.83 (0.196) | | | | |
| 9 | Y [Float64] | Mean (sd): 5.73 (1.14) | 552 distinct values | | 53940 (100%) | 0 (0%) |
| | | min ≤ med ≤ max: 0 ≤ 5.71 ≤ 58.9 | | | | |
| | | IQR (CV): 1.82 (0.199) | | | | |
| 10 | Z [Float64] | Mean (sd): 3.54 (0.706) | 375 distinct values | | 53940 (100%) | 0 (0%) |
| | | min ≤ med ≤ max: 0 ≤ 3.53 ≤ 31.8 | | | | |
| | | IQR (CV): 1.13 (0.199) | | | | |

Dimensions: 53940 x 10
Duplicate rows: 146
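
The numeric rows above combine a handful of standard summary statistics. As a rough illustration, the sketch below computes the same kinds of quantities with Julia's Statistics standard library. The helper name numeric_summary is hypothetical and not part of SummaryTables. The use of the sample standard deviation and of Julia's default quantile definition is an assumption, since this page does not state how overview_table computes its values. The CV figures in the table are at least consistent with sd / mean (for Carat, 0.474 / 0.798 ≈ 0.594).

```julia
using Statistics

# Hypothetical helper (not part of SummaryTables): computes the kinds of
# quantities shown in the "Stats / Values" cell of a numeric column.
function numeric_summary(x)
    m = mean(x)
    s = std(x)  # sample standard deviation (assumption)
    q1, med, q3 = quantile(x, [0.25, 0.5, 0.75])  # default quantile definition (assumption)
    return (; mean = m, sd = s,
            min = minimum(x), med = med, max = maximum(x),
            iqr = q3 - q1,  # interquartile range
            cv = s / m)     # coefficient of variation
end

stats = numeric_summary([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
println("Mean (sd): ", stats.mean, " (", round(stats.sd; sigdigits = 3), ")")
println("min ≤ med ≤ max: ", stats.min, " ≤ ", stats.med, " ≤ ", stats.max)
println("IQR (CV): ", stats.iqr, " (", round(stats.cv; sigdigits = 3), ")")
```

The rounding to three significant digits is only chosen to resemble the table; the rounding rule overview_table actually applies is not documented here.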

Keyword: max_categories

At most max_categories categories per column are listed individually; the rest are lumped together. By default, only the 10 most frequent categories are displayed.

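```julia
using SummaryTables

# One column of letters in which the i-th letter of the alphabet appears i^2 times
# ("A" once, "B" four times, ..., "Z" 676 times)
data = (;
    letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
)

t = overview_table(data)
```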
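
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | letters [String] | 1. Z | 676 (10.9%) | | 6201 (100%) | 0 (0%) |
| | | 2. Y | 625 (10.1%) | | | |
| | | 3. X | 576 (9.3%) | | | |
| | | 4. W | 529 (8.5%) | | | |
| | | 5. V | 484 (7.8%) | | | |
| | | 6. U | 441 (7.1%) | | | |
| | | 7. T | 400 (6.5%) | | | |
| | | 8. S | 361 (5.8%) | | | |
| | | 9. R | 324 (5.2%) | | | |
| | | 10. Q | 289 (4.7%) | | | |
| | | [16 others] | 1496 (24.1%) | | | |

Dimensions: 6201 x 1
Duplicate rows: 6175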
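
The Duplicate rows figure in the footer is consistent with counting every row that repeats an earlier one: the data has 6201 rows but only 26 distinct letters, and 6201 - 26 = 6175. A minimal sketch of that count in plain Julia follows, assuming this definition. count_duplicate_rows is a hypothetical helper, not a SummaryTables function.

```julia
# Hypothetical helper (not part of SummaryTables): number of rows that repeat
# an earlier row, computed as total rows minus distinct rows.
count_duplicate_rows(rows) = length(rows) - length(unique(rows))

letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])

# Represent each table row as a NamedTuple so that whole rows are compared
rows = [(; letters = l) for l in letters]

n_dup = count_duplicate_rows(rows)
println("Duplicate rows: ", n_dup)
```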

We can reduce the number of individually listed categories by setting max_categories = 5:

```julia
t = overview_table(data; max_categories = 5)
```
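
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | letters [String] | 1. Z | 676 (10.9%) | | 6201 (100%) | 0 (0%) |
| | | 2. Y | 625 (10.1%) | | | |
| | | 3. X | 576 (9.3%) | | | |
| | | 4. W | 529 (8.5%) | | | |
| | | 5. V | 484 (7.8%) | | | |
| | | [21 others] | 3311 (53.4%) | | | |

Dimensions: 6201 x 1
Duplicate rows: 6175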
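
To make the lumping concrete, the sketch below reproduces the counts from the two tables above in plain Julia: it tallies each value, sorts by descending frequency, keeps the first max_categories entries, and sums the remainder. lumped_counts is a hypothetical helper written for illustration. It is not how SummaryTables is implemented, and details such as tie-breaking between equally frequent categories are assumptions.

```julia
# Hypothetical helper (not part of SummaryTables): list the most frequent
# categories individually and lump the remaining ones together.
function lumped_counts(xs; max_categories = 10)
    counts = Dict{eltype(xs), Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    # Sort category => count pairs by descending count
    sorted = sort(collect(counts); by = last, rev = true)
    n_shown = min(max_categories, length(sorted))
    shown = sorted[1:n_shown]
    rest = sorted[n_shown+1:end]
    return shown, length(rest), sum(last, rest; init = 0)
end

letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])

shown, n_others, others_total = lumped_counts(letters)
for (category, n) in shown
    println(category, ": ", n, " (", round(100 * n / length(letters); digits = 1), "%)")
end
println("[", n_others, " others]: ", others_total)
```

With the default of 10 this yields 16 other categories totalling 1496 values, and with max_categories = 5 it yields 21 others totalling 3311, matching the tables.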

Keyword: label_metadata_key

If column label metadata is found, a label column is added to the output. This keyword determines which metadata key is used for the lookup; the default is "label".

```julia
using SummaryTables
using DataFrames

data = DataFrame(
    letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
)
# Attach two labels to the same column under different metadata keys
DataFrames.colmetadata!(data, :letters, "label", "Letters of the alphabet")
DataFrames.colmetadata!(data, :letters, "spanish_label", "Letras del alfabeto")

t = overview_table(data)
```
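
| No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|---|
| 1 | letters [String] | Letters of the alphabet | 1. Z | 676 (10.9%) | | 6201 (100%) | 0 (0%) |
| | | | 2. Y | 625 (10.1%) | | | |
| | | | 3. X | 576 (9.3%) | | | |
| | | | 4. W | 529 (8.5%) | | | |
| | | | 5. V | 484 (7.8%) | | | |
| | | | 6. U | 441 (7.1%) | | | |
| | | | 7. T | 400 (6.5%) | | | |
| | | | 8. S | 361 (5.8%) | | | |
| | | | 9. R | 324 (5.2%) | | | |
| | | | 10. Q | 289 (4.7%) | | | |
| | | | [16 others] | 1496 (24.1%) | | | |

Dimensions: 6201 x 1
Duplicate rows: 6175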

We can pick the alternative label by specifying label_metadata_key = "spanish_label":

```julia
t = overview_table(data; label_metadata_key = "spanish_label")
```
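
| No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|---|
| 1 | letters [String] | Letras del alfabeto | 1. Z | 676 (10.9%) | | 6201 (100%) | 0 (0%) |
| | | | 2. Y | 625 (10.1%) | | | |
| | | | 3. X | 576 (9.3%) | | | |
| | | | 4. W | 529 (8.5%) | | | |
| | | | 5. V | 484 (7.8%) | | | |
| | | | 6. U | 441 (7.1%) | | | |
| | | | 7. T | 400 (6.5%) | | | |
| | | | 8. S | 361 (5.8%) | | | |
| | | | 9. R | 324 (5.2%) | | | |
| | | | 10. Q | 289 (4.7%) | | | |
| | | | [16 others] | 1496 (24.1%) | | | |

Dimensions: 6201 x 1
Duplicate rows: 6175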
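
For readers who want to see what a key-based label lookup might look like outside of overview_table, here is a small sketch using the column metadata functions from DataFrames. column_label is a hypothetical helper, and returning nothing for a missing key is an assumption made for this example. How SummaryTables performs the lookup internally is not shown on this page.

```julia
using DataFrames

df = DataFrame(letters = ["A", "B", "B"])
colmetadata!(df, :letters, "label", "Letters of the alphabet")
colmetadata!(df, :letters, "spanish_label", "Letras del alfabeto")

# Hypothetical helper (not part of SummaryTables): return the metadata value
# stored under `key` for a column, or `nothing` if that key is absent.
column_label(df, col; key = "label") = colmetadata(df, col, key, nothing)

println(column_label(df, :letters))
println(column_label(df, :letters; key = "spanish_label"))
println(column_label(df, :letters; key = "no_such_key") === nothing)
```

Passing a default as the fourth argument to colmetadata avoids an error when the key is absent, which mirrors the documented behavior that the label column only appears if label metadata is found.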