The sumvar package provides simple, easy-to-use tools for summarising continuous and categorical data, inspired by Stata’s “sum” and “tab” commands. All functions are tidyverse/dplyr pipe-friendly and return tibbles.
When I first moved from Stata to R about five years ago, the main thing I missed was the simplicity of the “sum” and “tab” commands for exploring data efficiently. Template code in introductory R books and tutorials, e.g. https://r4ds.hadley.nz/data-tidy.html, typically takes 3-5 lines to replicate these commands in R. I couldn’t find a package that explored data quite as simply and efficiently.
Sumvar is fast and easy to use, and brings these variable summary functions to R.
Call dist_sum() to explore a continuous variable.
The output tibble shows the number of rows and the number missing, the median and interquartile range (25th and 75th centiles), the mean, the standard deviation, the 95% confidence interval using the Wald method (normal approximation), and the minimum and maximum values.
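As a rough illustration of the Wald (normal approximation) interval mentioned above, the base R sketch below computes mean ± z × sd/√n. The helper name wald_ci is hypothetical and this is not sumvar's own code; it is only a sketch of the standard formula, which does agree with the example output in this document (mean 56.9, sd 18.2, n = 100 gives roughly 53.3 to 60.5).

```r
# Hypothetical helper (not part of sumvar): Wald 95% CI for a mean.
wald_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                        # drop missing values
  n <- length(x)
  se <- stats::sd(x) / sqrt(n)             # standard error of the mean
  z <- stats::qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(mean = mean(x),
    ci_lower = mean(x) - z * se,
    ci_upper = mean(x) + z * se)
}

# Small made-up vector with one missing value
ci <- wald_ci(c(2, 4, 4, 4, 5, 5, 7, 9, NA))
print(ci)
```

The interval is symmetric around the mean by construction, so it can be a poor description of strongly skewed variables; the median and centiles in the same output are the safer summary there.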
dist_sum() also shows a density plot and histogram for a single variable, or a grouped density plot when there is a grouping variable.
You can save the output from dist_sum() as a tibble and use the estimates for downstream analysis, e.g.
sum_df <- df %>% dist_sum(age, sex)
# Load sumvar and dplyr (dplyr provides the %>% pipe)
library(sumvar)
library(dplyr)
# Example data
set.seed(123)
df <- tibble::tibble(
  age = rnorm(100, mean = 50, sd = 20),
  sex = sample(c("male", "female"), 100, replace = TRUE)) %>%
  dplyr::mutate(age = dplyr::if_else(sex == "male", age + 10, age))
# Call dist_sum
df %>% dist_sum(age)
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> # A tibble: 1 × 11
#> n n_miss median p25 p75 mean sd ci_lower ci_upper min max
#> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 100 0 55.6 44.0 68.1 56.9 18.2 53.3 60.5 13.8 101.
df %>% dist_sum(age, sex)
#> # A tibble: 2 × 14
#> sex n n_miss median p25 p75 mean sd min max ci_lower
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 female 49 0 52.5 41.1 65.6 54.7 17.6 16.3 93.7 49.8
#> 2 male 51 0 57.2 46.8 71.3 59.0 18.6 13.8 101. 53.9
#> # ℹ 3 more variables: ci_upper <dbl>, p_ttest <dbl>, p_wilcox <dbl>
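To illustrate using the saved estimates downstream, the sketch below works on a stand-in data frame whose values are copied by hand from the grouped output printed above (so it runs without sumvar). In practice you would use the tibble returned by dist_sum() directly; the names mean_diff and pooled_mean are just illustrative.

```r
# Stand-in for the grouped dist_sum() result; values copied from the printed output
sum_df <- data.frame(
  sex  = c("female", "male"),
  n    = c(49L, 51L),
  mean = c(54.7, 59.0),
  sd   = c(17.6, 18.6),
  stringsAsFactors = FALSE
)

# Difference in mean age, male minus female
mean_diff <- sum_df$mean[sum_df$sex == "male"] - sum_df$mean[sum_df$sex == "female"]

# Mean across both groups, weighted by group size
pooled_mean <- sum(sum_df$n * sum_df$mean) / sum(sum_df$n)

print(mean_diff)
print(pooled_mean)
```

The size-weighted mean comes out close to the ungrouped mean reported earlier (56.9), which is a quick sanity check that the grouped and ungrouped summaries agree.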
To explore the distribution of dates, call dist_date(), which works like dist_sum() and can also be grouped by a second variable. With a single date variable, a histogram is shown; when a grouping variable is supplied, a density plot is shown.
df3 <- tibble::tibble(
  dates = as.Date("2022-01-01") + rnorm(n = 100, sd = 50, mean = 0),
  group = sample(c("A", "B"), 100, TRUE)) %>%
  # dt shifts group A by 10 days; note the calls below summarise `dates`, not `dt`
  dplyr::mutate(dt = dplyr::case_when(group == "A" ~ dates + 10, TRUE ~ dates))
df3 %>% dist_date(dates)
#> # A tibble: 1 × 7
#> n n_miss min p25 median p75 max
#> <int> <int> <date> <date> <date> <date> <date>
#> 1 100 0 2021-10-25 2021-11-26 2021-12-22 2022-01-28 2022-06-12
df3 %>% dist_date(dates, group)
#> # A tibble: 2 × 8
#> group n n_miss min p25 median p75 max
#> <chr> <int> <int> <date> <date> <date> <date> <date>
#> 1 A 43 0 2021-10-25 2021-11-25 2021-12-17 2022-01-16 2022-06-12
#> 2 B 57 0 2021-10-27 2021-12-01 2022-01-03 2022-02-07 2022-04-20
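For readers wondering how centiles of dates can be computed at all, here is a minimal base R sketch: dates are stored as day counts, so the quantiles are taken on the numbers and converted back. This is an assumption about the general approach, not a description of how dist_date() is implemented, and the five dates are made up.

```r
# Five made-up dates, ten days apart
dates <- as.Date("2022-01-01") + c(0, 10, 20, 30, 40)

# Quantiles on the underlying day counts, then back to Date
q_num <- stats::quantile(as.numeric(dates), probs = c(0.25, 0.5, 0.75))
q <- as.Date(unname(q_num), origin = "1970-01-01")

print(range(dates))  # min and max
print(q)             # p25, median, p75
```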
tab1() produces a tibble showing the distribution of a categorical variable and illustrates it with a horizontal bar chart.
#> # A tibble: 4 × 3
#> Category Frequency Percent
#> <chr> <int> <chr>
#> 1 C 71 35.5
#> 2 A 66 33.0
#> 3 B 63 31.5
#> 4 Total 200 100.0
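For comparison, a frequency table of the same shape can be built in base R in a few lines. The sketch below is not tab1()'s implementation; it rebuilds a vector with the counts shown above (66 A, 63 B, 71 C) and assumes rows are ordered by descending frequency with a Total row appended, as in the printed output.

```r
# Rebuild a categorical vector with the counts from the table above
x <- rep(c("A", "B", "C"), times = c(66, 63, 71))

# Count each category, most frequent first
counts <- sort(table(x), decreasing = TRUE)

freq <- data.frame(
  Category  = names(counts),
  Frequency = as.integer(counts),
  Percent   = sprintf("%.1f", 100 * as.integer(counts) / length(x)),
  stringsAsFactors = FALSE
)

# Append the Total row
freq <- rbind(
  freq,
  data.frame(Category = "Total", Frequency = length(x), Percent = "100.0",
             stringsAsFactors = FALSE)
)
print(freq)
```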
To explore the proportion of duplicate values and missing values in a variable, pass it to dup().
example_data <- dplyr::tibble(id = 1:200, age = round(rnorm(200, mean = 30, sd = 50), digits=0))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
example_data %>% dup(age)
#> # A tibble: 1 × 7
#> Variable n n_unique n_duplicate percent_duplicate n_missing
#> <chr> <int> <int> <int> <dbl> <int>
#> 1 age 200 119 66 35.7 15
#> # ℹ 1 more variable: percent_missing <dbl>
If you pass a whole data frame to dup(), it produces a summary of duplicates and missingness for every variable. dup() illustrates the result with a stacked bar chart.
example_data <- dplyr::tibble(
  age = round(rnorm(200, mean = 30, sd = 50), digits = 0),
  sex = sample(c("Male", "Female"), 200, TRUE),
  favourite_colour = sample(c("Red", "Blue", "Purple"), 200, TRUE))
example_data$age[sample(1:200, size = 15)] <- NA # Replace 15 values with missing.
example_data$sex[sample(1:200, size = 32)] <- NA # Replace 32 values with missing.
dup(example_data)
#> # A tibble: 3 × 7
#> Variable n n_unique n_duplicate percent_duplicate n_missing
#> <chr> <int> <int> <int> <dbl> <int>
#> 1 age 200 117 68 36.8 15
#> 2 sex 200 2 166 98.8 32
#> 3 favourite_colour 200 3 197 98.5 0
#> # ℹ 1 more variable: percent_missing <dbl>
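The columns in the output above appear to be related by simple arithmetic: in every printed row, n = n_unique + n_duplicate + n_missing, and percent_duplicate matches n_duplicate divided by the number of non-missing values (for example 66 / 185 ≈ 35.7). The base R sketch below reproduces that arithmetic. It is inferred from the printed numbers, not taken from sumvar's source; the helper name dup_sketch is hypothetical, and the percent_missing formula (share of all rows) is an assumption because that column is truncated in the output.

```r
# Hypothetical helper (not part of sumvar): duplicate and missing counts for one vector
dup_sketch <- function(x) {
  n           <- length(x)
  n_missing   <- sum(is.na(x))
  n_unique    <- length(unique(x[!is.na(x)]))  # distinct non-missing values
  n_duplicate <- n - n_missing - n_unique      # non-missing values beyond the first of each
  data.frame(
    n, n_unique, n_duplicate,
    percent_duplicate = round(100 * n_duplicate / (n - n_missing), 1),
    n_missing,
    percent_missing   = round(100 * n_missing / n, 1)  # assumed denominator: all rows
  )
}

res <- dup_sketch(c(1, 1, 2, 3, 3, 3, NA, NA))
print(res)
```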