`overview_table`

Synopsis

`overview_table` gives a quick overview of the columns in a dataset. It displays different statistical summaries depending on each column's type, including bar graphs that give a quick intuition for the distribution of values. It is styled after R's `dfSummary` function from the `summarytools` package.

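```julia
using SummaryTables
using RDatasets

# Load the diamonds dataset from the ggplot2 collection in RDatasets
df = dataset("ggplot2", "diamonds")

# Summarize every column of the dataset in a single table
overview_table(df)
```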


| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | Carat [Float64] | Mean (sd): 0.798 (0.474); min ≤ med ≤ max: 0.2 ≤ 0.7 ≤ 5.01; IQR (CV): 0.64 (0.594) | 273 distinct values | | 53940 (100%) | 0 (0%) |
| 2 | Cut [CategoricalValue{String, UInt8}] | 1. Ideal 2. Premium 3. Very Good 4. Good 5. Fair | 21551 (40%) 13791 (25.6%) 12082 (22.4%) 4906 (9.1%) 1610 (3%) | | 53940 (100%) | 0 (0%) |
| 3 | Color [CategoricalValue{String, UInt8}] | 1. G 2. E 3. F 4. H 5. D 6. I 7. J | 11292 (20.9%) 9797 (18.2%) 9542 (17.7%) 8304 (15.4%) 6775 (12.6%) 5422 (10.1%) 2808 (5.2%) | | 53940 (100%) | 0 (0%) |
| 4 | Clarity [CategoricalValue{String, UInt8}] | 1. SI1 2. VS2 3. SI2 4. VS1 5. VVS2 6. VVS1 7. IF 8. I1 | 13065 (24.2%) 12258 (22.7%) 9194 (17%) 8171 (15.1%) 5066 (9.4%) 3655 (6.8%) 1790 (3.3%) 741 (1.4%) | | 53940 (100%) | 0 (0%) |
| 5 | Depth [Float64] | Mean (sd): 61.7 (1.43); min ≤ med ≤ max: 43 ≤ 61.8 ≤ 79; IQR (CV): 1.5 (0.0232) | 184 distinct values | | 53940 (100%) | 0 (0%) |
| 6 | Table [Float64] | Mean (sd): 57.5 (2.23); min ≤ med ≤ max: 43 ≤ 57 ≤ 95; IQR (CV): 3 (0.0389) | 127 distinct values | | 53940 (100%) | 0 (0%) |
| 7 | Price [Int32] | Mean (sd): 3933 (3989); min ≤ med ≤ max: 326 ≤ 2401 ≤ 18823; IQR (CV): 4374 (1.01) | 11602 distinct values | | 53940 (100%) | 0 (0%) |
| 8 | X [Float64] | Mean (sd): 5.73 (1.12); min ≤ med ≤ max: 0 ≤ 5.7 ≤ 10.7; IQR (CV): 1.83 (0.196) | 554 distinct values | | 53940 (100%) | 0 (0%) |
| 9 | Y [Float64] | Mean (sd): 5.73 (1.14); min ≤ med ≤ max: 0 ≤ 5.71 ≤ 58.9; IQR (CV): 1.82 (0.199) | 552 distinct values | | 53940 (100%) | 0 (0%) |
| 10 | Z [Float64] | Mean (sd): 3.54 (0.706); min ≤ med ≤ max: 0 ≤ 3.53 ≤ 31.8; IQR (CV): 1.13 (0.199) | 375 distinct values | | 53940 (100%) | 0 (0%) |

Dimensions: 53940 x 10; Duplicate rows: 146
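The numeric rows report Mean (sd), min ≤ med ≤ max, and IQR (CV). The following is a rough sketch of how comparable values could be computed by hand with Julia's `Statistics` standard library. `numeric_summary` is a hypothetical helper and not part of SummaryTables. CV is taken as sd / mean, which agrees with the output shown (for Carat, 0.474 / 0.798 ≈ 0.594). The quantile definition behind the IQR is Julia's default, which is an assumption and may differ from what the package uses.

```julia
using Statistics

# Hypothetical helper (not part of SummaryTables): summary statistics
# for one numeric column, ignoring missing values.
function numeric_summary(x)
    vals = collect(skipmissing(x))
    m = mean(vals)
    s = std(vals)                                   # sample standard deviation
    q25, q50, q75 = quantile(vals, [0.25, 0.5, 0.75])
    return (;
        mean = m,
        sd = s,
        min = minimum(vals),
        med = q50,
        max = maximum(vals),
        iqr = q75 - q25,                            # interquartile range
        cv = s / m,                                 # coefficient of variation
    )
end

stats = numeric_summary([1.0, 2.0, 3.0, 4.0, 5.0])
```

Applied to a real column, for example `numeric_summary(df.Carat)`, this should give values close to the first row of the table, up to rounding and the quantile assumption noted above.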

Keyword: `max_categories`

At most `max_categories` categories per column are listed individually; the remaining categories are lumped together into a single entry. By default, only the 10 most frequent categories are displayed.

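```julia
using SummaryTables

# One string column in which the i-th letter of the alphabet occurs i^2 times,
# so "A" appears once and "Z" appears 676 times
data = (;
    letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
)

t = overview_table(data)
```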


| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | letters [String] | 1. Z 2. Y 3. X 4. W 5. V 6. U 7. T 8. S 9. R 10. Q [16 others] | 676 (10.9%) 625 (10.1%) 576 (9.3%) 529 (8.5%) 484 (7.8%) 441 (7.1%) 400 (6.5%) 361 (5.8%) 324 (5.2%) 289 (4.7%) 1496 (24.1%) | | 6201 (100%) | 0 (0%) |

Dimensions: 6201 x 1; Duplicate rows: 6175
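The footer reports 6175 duplicate rows for 6201 rows with 26 distinct letters, which is consistent with counting every row beyond the first occurrence of each distinct row. The sketch below reproduces that number with Base Julia only. This reading is inferred from the output shown here, not from the package's source, so treat it as an assumption.

```julia
# Same column as in the example above
letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])

# Represent each row as a NamedTuple so that whole rows can be compared
rows = [(; letters = l) for l in letters]

n_rows = length(rows)
n_distinct = length(unique(rows))

# Assumed definition: rows that repeat an earlier row
n_duplicates = n_rows - n_distinct
```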

We can reduce the number of categories listed individually by setting `max_categories = 5`:

```julia
t = overview_table(data; max_categories = 5)
```


| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|
| 1 | letters [String] | 1. Z 2. Y 3. X 4. W 5. V [21 others] | 676 (10.9%) 625 (10.1%) 576 (9.3%) 529 (8.5%) 484 (7.8%) 3311 (53.4%) | | 6201 (100%) | 0 (0%) |

Dimensions: 6201 x 1; Duplicate rows: 6175
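To make the lumping concrete, here is a minimal sketch in Base Julia that keeps the most frequent categories and sums the rest into one "others" entry. `lump_categories` is a hypothetical helper written for illustration; it is not the function SummaryTables uses internally, and the tie-breaking rule (alphabetical order among equal counts) is an assumption.

```julia
# Hypothetical helper (not part of SummaryTables): keep the `max_categories`
# most frequent values and lump everything else together.
function lump_categories(values; max_categories = 10)
    counts = Dict{eltype(values), Int}()
    for v in values
        counts[v] = get(counts, v, 0) + 1
    end
    # Sort by descending count; ties are broken by the value itself (assumption)
    sorted = sort(collect(counts); by = p -> (-p.second, p.first))
    n = min(max_categories, length(sorted))
    top = sorted[1:n]
    rest = sorted[n+1:end]
    others = (; n_categories = length(rest), count = sum(last, rest; init = 0))
    return top, others
end

letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
top, others = lump_categories(letters; max_categories = 5)
```

For this data, `top` holds the pairs for Z, Y, X, W, and V, and `others` covers the remaining 21 letters with a combined count of 3311, matching the `[21 others]` entry in the table.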

Keyword: `label_metadata_key`

If column label metadata is found, a label column is added to the output. This keyword determines which metadata key is used for the lookup; the default is `"label"`.

```julia
using SummaryTables
using DataFrames

data = DataFrame(
    letters = reduce(vcat, [fill(str, i) for (str, i) in zip(string.('A':'Z'), (1:26) .^ 2)])
)
# Attach two labels to the column under different metadata keys
DataFrames.colmetadata!(data, :letters, "label", "Letters of the alphabet")
DataFrames.colmetadata!(data, :letters, "spanish_label", "Letras del alfabeto")

# Uses the default key "label"
t = overview_table(data)
```


| No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|---|
| 1 | letters [String] | Letters of the alphabet | 1. Z 2. Y 3. X 4. W 5. V 6. U 7. T 8. S 9. R 10. Q [16 others] | 676 (10.9%) 625 (10.1%) 576 (9.3%) 529 (8.5%) 484 (7.8%) 441 (7.1%) 400 (6.5%) 361 (5.8%) 324 (5.2%) 289 (4.7%) 1496 (24.1%) | | 6201 (100%) | 0 (0%) |

Dimensions: 6201 x 1; Duplicate rows: 6175

We can pick the alternative label by specifying `label_metadata_key = "spanish_label"`:

```julia
t = overview_table(data; label_metadata_key = "spanish_label")
```


| No | Variable | Label | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing |
|---|---|---|---|---|---|---|---|
| 1 | letters [String] | Letras del alfabeto | 1. Z 2. Y 3. X 4. W 5. V 6. U 7. T 8. S 9. R 10. Q [16 others] | 676 (10.9%) 625 (10.1%) 576 (9.3%) 529 (8.5%) 484 (7.8%) 441 (7.1%) 400 (6.5%) 361 (5.8%) 324 (5.2%) 289 (4.7%) 1496 (24.1%) | | 6201 (100%) | 0 (0%) |

Dimensions: 6201 x 1; Duplicate rows: 6175
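Before choosing a value for `label_metadata_key`, it can help to check which metadata keys a column actually carries. The sketch below uses the DataFrames.jl functions `colmetadatakeys` and `colmetadata` on a small stand-in table. `label_for` is a hypothetical helper, and its fallback to the column name is an illustrative choice, not a description of how `overview_table` behaves when a key is absent.

```julia
using DataFrames

# Small stand-in table with the same metadata keys as the example above
data = DataFrame(letters = ["A", "B", "B"])
colmetadata!(data, :letters, "label", "Letters of the alphabet")
colmetadata!(data, :letters, "spanish_label", "Letras del alfabeto")

# List the metadata keys stored for the column (order is not guaranteed)
keys_found = collect(colmetadatakeys(data, :letters))

# Hypothetical helper: look up a label, falling back to the column name
label_for(df, col, key) = colmetadata(df, col, key, string(col))

english = label_for(data, :letters, "label")
spanish = label_for(data, :letters, "spanish_label")
fallback = label_for(data, :letters, "german_label")  # key does not exist
```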
