overview_table
Synopsis
The overview_table table is intended to give a quick overview of the columns that a dataset consists of. It displays different statistical summaries depending on the column types, including bar graphs that give a quick intuition for the distribution of values. It is styled after the dfSummary function from R's summarytools package.
using SummaryTables
using RDatasets
df = dataset("ggplot2", "diamonds")
overview_table(df)
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
1 | Carat [Float64] | Mean (sd): 0.798 (0.474); min ≤ med ≤ max: 0.2 ≤ 0.7 ≤ 5.01; IQR (CV): 0.64 (0.594) | 273 distinct values | | 53940 (100%) | 0 (0%) |
2 | Cut [CategoricalValue{String, UInt8}] | 1. Ideal; 2. Premium; 3. Very Good; 4. Good; 5. Fair | 21551 (40%); 13791 (25.6%); 12082 (22.4%); 4906 (9.1%); 1610 (3%) | | 53940 (100%) | 0 (0%) |
3 | Color [CategoricalValue{String, UInt8}] | 1. G; 2. E; 3. F; 4. H; 5. D; 6. I; 7. J | 11292 (20.9%); 9797 (18.2%); 9542 (17.7%); 8304 (15.4%); 6775 (12.6%); 5422 (10.1%); 2808 (5.2%) | | 53940 (100%) | 0 (0%) |
4 | Clarity [CategoricalValue{String, UInt8}] | 1. SI1; 2. VS2; 3. SI2; 4. VS1; 5. VVS2; 6. VVS1; 7. IF; 8. I1 | 13065 (24.2%); 12258 (22.7%); 9194 (17%); 8171 (15.1%); 5066 (9.4%); 3655 (6.8%); 1790 (3.3%); 741 (1.4%) | | 53940 (100%) | 0 (0%) |
5 | Depth [Float64] | Mean (sd): 61.7 (1.43); min ≤ med ≤ max: 43 ≤ 61.8 ≤ 79; IQR (CV): 1.5 (0.0232) | 184 distinct values | | 53940 (100%) | 0 (0%) |
6 | Table [Float64] | Mean (sd): 57.5 (2.23); min ≤ med ≤ max: 43 ≤ 57 ≤ 95; IQR (CV): 3 (0.0389) | 127 distinct values | | 53940 (100%) | 0 (0%) |
7 | Price [Int32] | Mean (sd): 3933 (3989); min ≤ med ≤ max: 326 ≤ 2401 ≤ 18823; IQR (CV): 4374 (1.01) | 11602 distinct values | | 53940 (100%) | 0 (0%) |
8 | X [Float64] | Mean (sd): 5.73 (1.12); min ≤ med ≤ max: 0 ≤ 5.7 ≤ 10.7; IQR (CV): 1.83 (0.196) | 554 distinct values | | 53940 (100%) | 0 (0%) |
9 | Y [Float64] | Mean (sd): 5.73 (1.14); min ≤ med ≤ max: 0 ≤ 5.71 ≤ 58.9; IQR (CV): 1.82 (0.199) | 552 distinct values | | 53940 (100%) | 0 (0%) |
10 | Z [Float64] | Mean (sd): 3.54 (0.706); min ≤ med ≤ max: 0 ≤ 3.53 ≤ 31.8; IQR (CV): 1.13 (0.199) | 375 distinct values | | 53940 (100%) | 0 (0%) |
Dimensions: 53940 x 10; Duplicate rows: 146
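To make the numeric entries easier to read, here is a minimal sketch of how statistics of this kind can be computed with Julia's Statistics standard library. The helper numeric_summary is hypothetical and not part of SummaryTables. It assumes that CV means standard deviation divided by mean and that IQR is the difference between the 75% and 25% quantiles, which agrees with the numbers shown for Carat (0.474 / 0.798 ≈ 0.594). The exact quantile method the package uses is not stated here, so small differences are possible.

```julia
using Statistics

# Hypothetical helper (not part of SummaryTables): summary statistics
# similar to those shown for numeric columns.
function numeric_summary(x)
    m = mean(x)
    s = std(x)                                  # sample standard deviation
    q25, q75 = quantile(x, [0.25, 0.75])        # Julia's default quantile definition
    return (
        mean = m,
        sd = s,
        min = minimum(x),
        med = median(x),
        max = maximum(x),
        iqr = q75 - q25,                        # assumed definition of IQR
        cv = s / m,                             # assumed definition of CV
    )
end

summary = numeric_summary([1.0, 2.0, 3.0, 4.0, 5.0])
```

The returned named tuple can be inspected field by field, for example summary.iqr or summary.cv.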
Keyword: max_categories
At most max_categories categories per column are listed individually; the remaining ones are lumped together into a single entry. By default, only the 10 most frequent categories are displayed.
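The lumping idea can be sketched in plain Julia as follows. This is an illustration only: lump_categories is a hypothetical helper, not a SummaryTables function, and the package's actual implementation (for example its tie-breaking order) may differ.

```julia
# Hypothetical helper (not part of SummaryTables): keep the `max_categories`
# most frequent values and lump the remaining ones into one "others" entry.
function lump_categories(values; max_categories = 10)
    counts = Dict{eltype(values), Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    # Sort the value => count pairs by descending count
    sorted = sort!(collect(counts); by = last, rev = true)
    n = min(max_categories, length(sorted))
    top = sorted[1:n]
    n_others = length(sorted) - n
    others_count = sum(last, sorted[n+1:end]; init = 0)
    return top, n_others, others_count
end

# "A" once, "B" four times, ..., "Z" 676 times
letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
top, n_others, others_count = lump_categories(letters; max_categories = 5)
```

Here top holds the five most frequent letters with their counts, while n_others and others_count describe the lumped remainder.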
using SummaryTables
data = (;
    letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
)
t = overview_table(data)
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
1 | letters [String] | 1. Z; 2. Y; 3. X; 4. W; 5. V; 6. U; 7. T; 8. S; 9. R; 10. Q; [16 others] | 676 (10.9%); 625 (10.1%); 576 (9.3%); 529 (8.5%); 484 (7.8%); 441 (7.1%); 400 (6.5%); 361 (5.8%); 324 (5.2%); 289 (4.7%); 1496 (24.1%) | | 6201 (100%) | 0 (0%) |
Dimensions: 6201 x 1; Duplicate rows: 6175
We can reduce the number of categories by setting max_categories = 5:
t = overview_table(data; max_categories = 5)
No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
1 | letters [String] | 1. Z; 2. Y; 3. X; 4. W; 5. V; [21 others] | 676 (10.9%); 625 (10.1%); 576 (9.3%); 529 (8.5%); 484 (7.8%); 3311 (53.4%) | | 6201 (100%) | 0 (0%) |
Dimensions: 6201 x 1; Duplicate rows: 6175
Keyword: label_metadata_key
If column label metadata is found, a Label column is added to the output. This keyword determines which metadata key is used for the lookup; the default is "label".
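As a rough illustration of what a key-based label lookup looks like, the sketch below reads column metadata with DataFrames' colmetadata and a fallback value. Whether overview_table performs its lookup exactly this way is an assumption; the data frame df and the key "other_key" are made up for this example.

```julia
using DataFrames

# Illustrative data frame, not taken from the examples in this page
df = DataFrame(x = [1, 2, 3])
colmetadata!(df, :x, "label", "Some label")

# Look up a key; the fourth argument is returned if the key does not exist
label = colmetadata(df, :x, "label", nothing)
fallback = colmetadata(df, :x, "other_key", nothing)
```

In this sketch label is the stored string, while fallback is nothing because no metadata was stored under "other_key".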
using SummaryTables
using DataFrames
data = DataFrame(
    letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
)
DataFrames.colmetadata!(data, :letters, "label", "Letters of the alphabet")
DataFrames.colmetadata!(data, :letters, "spanish_label", "Letras del alfabeto")
t = overview_table(data)
No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
1 | letters [String] | Letters of the alphabet | 1. Z; 2. Y; 3. X; 4. W; 5. V; 6. U; 7. T; 8. S; 9. R; 10. Q; [16 others] | 676 (10.9%); 625 (10.1%); 576 (9.3%); 529 (8.5%); 484 (7.8%); 441 (7.1%); 400 (6.5%); 361 (5.8%); 324 (5.2%); 289 (4.7%); 1496 (24.1%) | | 6201 (100%) | 0 (0%) |
Dimensions: 6201 x 1; Duplicate rows: 6175
We can pick the alternative label by specifying label_metadata_key = "spanish_label":
t = overview_table(data; label_metadata_key = "spanish_label")
No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
1 | letters [String] | Letras del alfabeto | 1. Z; 2. Y; 3. X; 4. W; 5. V; 6. U; 7. T; 8. S; 9. R; 10. Q; [16 others] | 676 (10.9%); 625 (10.1%); 576 (9.3%); 529 (8.5%); 484 (7.8%); 441 (7.1%); 400 (6.5%); 361 (5.8%); 324 (5.2%); 289 (4.7%); 1496 (24.1%) | | 6201 (100%) | 0 (0%) |
Dimensions: 6201 x 1; Duplicate rows: 6175