DataExplorer, customize univariate distribution


I am trying to use DataExplorer to help with quick EDA. I like how it shows univariate distributions. Here is a reproducible example.

library(DataExplorer)
library(magrittr)  # provides the %>% pipe used below

A <- c(rep(c(1,2,3,4,5), 200))
A<- factor(A)
B <- c(x=rnorm(1000))
C <- c(x= rnorm(1000, mean = 100, sd=2))
D <- c(x= rnorm(1000, 2, 2))
df<- data.frame(A, B, C, D)
df %>%
  create_report(
    output_file = "trial",
    y= "A", #to get barplots, QQ plots and scatterplots by grouping variable "A"
    report_title = "trial_EDA",
    config = configure_report(
      add_plot_density = TRUE  #To add density plots to report
    )
  )

I want to visualize the density by the grouping variable "A", as shown in the attached picture.

But I don't know how to use plot_density_args properly to do this. Also, please suggest other packages that make it easy to explore large datasets in a preliminary analysis. Thanks!

Answer by Marek Fiołka (best answer)

You have not specified which variable (B, C or D) the density plot should show. If it is only one of them, e.g. B, then do it like this:

library(tidyverse)
library(ggpubr)

A <- c(rep(c(1,2,3,4,5), 200))
A<- factor(A)
B <- c(x=rnorm(1000))
C <- c(x= rnorm(1000, mean = 100, sd=2))
D <- c(x= rnorm(1000, 2, 2))
df<- data.frame(A, B, C, D)

df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(B, fill=A)) +
  geom_density(alpha=0.2)


You can also do it separately for each of the variables on one plot.

pB = df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(B, fill=A)) +
  geom_density(alpha=0.2)
pC = df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(C, fill=A)) +
  geom_density(alpha=0.2)

pD = df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(D, fill=A)) +
  geom_density(alpha=0.2)

ggarrange(pB, pC, pD, 
          labels = c("B", "C", "D"))
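
As an alternative to building three plots and arranging them, the same picture can be sketched with facets. This is only a sketch under a few assumptions: it uses the same simulated data as above, the names df_long and p_facets are mine, the seed is added purely for reproducibility, and scales = "free" is a choice made here because C lives on a very different scale from B and D.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, 2, 2)
)

# Reshape so that each numeric column becomes rows of (var, val)
df_long <- df %>%
  pivot_longer(B:D, names_to = "var", values_to = "val")

# One panel per variable, one density curve per level of A
p_facets <- ggplot(df_long, aes(val, fill = A)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~var, scales = "free")

print(p_facets)
```

With this layout, adding another numeric column only requires widening the B:D selection; no extra plot object is needed.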


And if you don't like the filled areas, you can map A to the line color instead:

df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(B, color=A)) +
  geom_density()


Update 1

It is possible to create charts for any number of columns, as the example below shows. First, we'll do it in a very simple, even trivial way.

library(tidyverse)
df = tibble(
  A = rep(c(1,2,3,4,5), 200) %>% factor(),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd=2),
  D = rnorm(1000, 2, 2)
)

fPlot = function(x, group) tibble(x=x, group=group) %>% 
  ggplot(aes(x, color=group)) +
    geom_density()

# select() accepts the B:D range directly; select_at(vars(...)) is superseded
df %>% select(B:D) %>% 
    map(~fPlot(.x, df$A))

As you can see, we created three plots for variables B, C and D.
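
The plots produced this way do not say which column they belong to. One possible way to add a title, sketched here with purrr::imap(), is to pass the column name along with the values. The list name plots and the seed are my additions; fPlot is the function defined above.

```r
library(dplyr)
library(purrr)
library(ggplot2)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, 2, 2)
)

# Same helper as in the answer: density of x, one curve per group level
fPlot <- function(x, group) {
  data.frame(x = x, group = group) %>%
    ggplot(aes(x, color = group)) +
    geom_density()
}

# imap() passes each column as .x and its name as .y,
# so the name can be used as the plot title
plots <- df %>%
  select(B:D) %>%
  imap(~ fPlot(.x, df$A) + ggtitle(.y))

print(plots$B)  # show one of the three plots
```

The result is a named list, so an individual plot can be picked out as plots$C or plots[["D"]].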

The second way is a bit harder to follow, but it brings some extra benefits.

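# group_map() calls this with the one-row nested tibble for a group (df)
# and a one-row tibble holding the group key (group), so the title is group$var
fPlot2 = function(df, group) df$data[[1]] %>% 
  ggplot(aes(val, color=A)) +
  geom_density() +
  ggtitle(group$var)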

df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% 
  group_by(var) %>% 
  nest() %>% 
  group_map(fPlot2)

Note that your tibble after df %>% pivot_longer(B:D, names_to = "var", values_to = "val") looks like this.

# A tibble: 3,000 x 3
   A     var        val
   <fct> <chr>    <dbl>
 1 1     B       1.06  
 2 1     C     100.    
 3 1     D       3.54  
 4 2     B      -0.652 
 5 2     C     100.    
 6 2     D       1.12  
 7 3     B       0.652 
 8 3     C      97.3   
 9 3     D       3.57  
10 4     B      -0.0972
# ... with 2,990 more rows

After df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% group_by(var) %>% nest(), it looks like this:

# A tibble: 3 x 2
# Groups:   var [3]
  var   data                
  <chr> <list>              
1 B     <tibble [1,000 x 2]>
2 C     <tibble [1,000 x 2]>
3 D     <tibble [1,000 x 2]>

As you can see, the data has been collapsed into three nested tibbles stored in the list column data. This approach lets you easily calculate all the statistics for each column separately. Look at this.

fStat = function(df) df$data[[1]] %>% 
  group_by(A) %>% 
  summarise(
    n = n(),
    min = min(val),
    mean = mean(val),
    max = max(val),
    median = median(val),
    sd = sd(val),
    sw.stat = stats::shapiro.test(val)$statistic,
    sw.p = stats::shapiro.test(val)$p.value,
  )

df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% 
  group_by(var) %>% 
  nest() %>% 
  group_modify(~fStat(.x))

Output:

# A tibble: 15 x 10
# Groups:   var [3]
   var   A         n   min      mean    max     median    sd sw.stat  sw.p
   <chr> <fct> <int> <dbl>     <dbl>  <dbl>      <dbl> <dbl>   <dbl> <dbl>
 1 B     1       200 -2.14   0.139     3.16   0.153    0.960   0.994 0.561
 2 B     2       200 -2.00   0.0185    2.61   0.0162   0.923   0.992 0.373
 3 B     3       200 -3.15   0.0245    2.42   0.0718   1.02    0.992 0.347
 4 B     4       200 -2.75   0.00112   2.73  -0.00691  1.02    0.993 0.496
 5 B     5       200 -3.32  -0.00758   3.23  -0.000105 0.993   0.991 0.250
 6 C     1       200 94.6   99.8     104.    99.8      1.97    0.992 0.365
 7 C     2       200 94.8  100.      104.   100.       1.85    0.991 0.290
 8 C     3       200 94.5  100.      106.   100.       1.94    0.996 0.877
 9 C     4       200 94.4   99.9     107.    99.9      1.97    0.993 0.463
10 C     5       200 94.3   99.8     106.    99.8      2.07    0.996 0.887
11 D     1       200 -4.89   1.81      8.11   1.90     2.09    0.995 0.750
12 D     2       200 -5.42   2.15      7.57   2.18     2.14    0.995 0.726
13 D     3       200 -4.38   2.09      7.10   2.02     1.97    0.989 0.111
14 D     4       200 -4.73   2.13      8.98   1.93     1.99    0.989 0.138
15 D     5       200 -2.19   2.24      7.79   2.25     1.87    0.996 0.867
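
For comparison, a table of the same shape can probably be obtained without nesting by grouping on both var and A. This is a sketch rather than a replacement for the approach above: the name stats_flat and the seed are my additions, only a subset of the statistics is computed, and the values will differ from the output shown because the data is random.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: seed added only so the sketch is reproducible
df <- data.frame(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, 2, 2)
)

# Group by variable name and by A, then summarise each of the 15 cells
stats_flat <- df %>%
  pivot_longer(B:D, names_to = "var", values_to = "val") %>%
  group_by(var, A) %>%
  summarise(
    n = n(),
    mean = mean(val),
    sd = sd(val),
    sw.p = stats::shapiro.test(val)$p.value,  # Shapiro-Wilk p-value per cell
    .groups = "drop"
  )

print(stats_flat)
```

The nested version is still the better fit when each variable needs its own plot or model; the flat version is enough when only a summary table is wanted.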

Isn't that cool?