DataExplorer, customize univariate distribution

Question

Asked by Nidhi Desai on 18 September 2021 at 19:09

I am trying to use DataExplorer to help with quick EDA. I like how it shows univariate distributions. Here is a reproducible example.

library(DataExplorer)
library(dplyr)  # provides the %>% pipe

A <- c(rep(c(1,2,3,4,5), 200))
A <- factor(A)
B <- c(x = rnorm(1000))
C <- c(x = rnorm(1000, mean = 100, sd = 2))
D <- c(x = rnorm(1000, 2, 2))
df <- data.frame(A, B, C, D)
df %>%
  create_report(
    output_file = "trial",
    y= "A", #to get barplots, QQ plots and scatterplots by grouping variable "A"
    report_title = "trial_EDA",
    config = configure_report(
      add_plot_density = TRUE  #To add density plots to report
    )
  )

I want to visualize density by grouping variable, "A", as shown in the picture attached.

But I don't know how to use plot_density_args properly to do this. Also, please suggest other packages for easily navigating large datasets during preliminary analysis. Thanks!

**Marek Fiołka** · Accepted Answer · 2021-09-19T15:32:25.553000

You have not specified which variable (B, C or D) the density plot should show. If it is only one, e.g. B, do it like this:

library(tidyverse)
library(ggpubr)

A <- c(rep(c(1,2,3,4,5), 200))
A<- factor(A)
B <- c(x=rnorm(1000))
C <- c(x= rnorm(1000, mean = 100, sd=2))
D <- c(x= rnorm(1000, 2, 2))
df<- data.frame(A, B, C, D)

df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(B, fill=A)) +
  geom_density(alpha=0.2)

You can also build a separate plot for each variable and arrange them in a single figure.

pB = df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(B, fill=A)) +
  geom_density(alpha=0.2)
pC = df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(C, fill=A)) +
  geom_density(alpha=0.2)

pD = df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(D, fill=A)) +
  geom_density(alpha=0.2)

ggarrange(pB, pC, pD, 
          labels = c("B", "C", "D"))

And if you don't like the fills, you can use colored outlines instead:

df %>% mutate(A = A %>% fct_inorder()) %>% 
  ggplot(aes(B, color=A)) +
  geom_density()
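
The ggarrange() layout above needs one plot object per variable. As an alternative, here is a minimal sketch that reshapes the data to long format and lets ggplot2 draw one panel per variable with facet_wrap(). The names long_df and p_facets and the fixed seed are assumptions made for illustration; they do not appear in the original answer.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- data.frame(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, mean = 2, sd = 2)
)

# Reshape to long format: one row per (A, variable name, value)
long_df <- df %>%
  pivot_longer(B:D, names_to = "var", values_to = "val")

# One panel per variable; free scales because B, C and D differ in range
p_facets <- ggplot(long_df, aes(val, fill = A)) +
  geom_density(alpha = 0.2) +
  facet_wrap(~var, scales = "free")

print(p_facets)
```

Faceting keeps a single shared legend for A, which the arranged plots above repeat once per panel.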

Update 1

It is possible to create charts for any number of columns. I will show it to you in the example below. First, we'll do it in a very simple, even trivial way.

library(tidyverse)
df = tibble(
  A = rep(c(1,2,3,4,5), 200) %>% factor(),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd=2),
  D = rnorm(1000, 2, 2)
)

fPlot = function(x, group) tibble(x=x, group=group) %>% 
  ggplot(aes(x, color=group)) +
    geom_density()

# select() replaces the superseded select_at(vars(...))
df %>% select(B:D) %>% 
    map(~fPlot(.x, df$A))

As you can see, we created three plots for variables B, C and D.
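
The plots produced this way carry no titles, so it can be hard to tell which variable each one shows. A possible variation, sketched under the assumption that purrr's imap() is acceptable here: it passes each column together with its name, so the name can be used as the title. The helper fPlotTitled and its title argument are hypothetical additions, not part of the original answer.

```r
library(dplyr)
library(purrr)
library(ggplot2)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- tibble(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, mean = 2, sd = 2)
)

# Same idea as fPlot, with an extra (hypothetical) title argument
fPlotTitled <- function(x, group, title) {
  tibble(x = x, group = group) %>%
    ggplot(aes(x, color = group)) +
    geom_density() +
    ggtitle(title)
}

# imap() passes each column as .x and its column name as .y
plots <- df %>%
  select(B:D) %>%
  imap(~ fPlotTitled(.x, df$A, .y))

names(plots)
```

The result is a named list, so a single plot can be retrieved with, for example, plots$C.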

The second way is a bit harder to understand, but it brings some extra benefits.

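# group_map() calls this with df = the nested rows of one group and
# group = a one-row tibble holding the group key (var)
fPlot2 = function(df, group) df$data[[1]] %>% 
  ggplot(aes(val, color=A)) +
  geom_density() +
  ggtitle(group$var)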

df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% 
  group_by(var) %>% 
  nest() %>% 
  group_map(fPlot2)

Note that your tibble after df %>% pivot_longer(B:D, names_to = "var", values_to = "val") looks like this.

# A tibble: 3,000 x 3
   A     var        val
   <fct> <chr>    <dbl>
 1 1     B       1.06  
 2 1     C     100.    
 3 1     D       3.54  
 4 2     B      -0.652 
 5 2     C     100.    
 6 2     D       1.12  
 7 3     B       0.652 
 8 3     C      97.3   
 9 3     D       3.57  
10 4     B      -0.0972
# ... with 2,990 more rows

After df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% group_by(var) %>% nest(), it looks like this:

# A tibble: 3 x 2
# Groups:   var [3]
  var   data                
  <chr> <list>              
1 B     <tibble [1,000 x 2]>
2 C     <tibble [1,000 x 2]>
3 D     <tibble [1,000 x 2]>

As you can see, the data has been collapsed into three inner tibbles stored in the list column data. This approach lets you easily calculate statistics for each column separately. Look at this.

fStat = function(df) df$data[[1]] %>% 
  group_by(A) %>% 
  summarise(
    n = n(),
    min = min(val),
    mean = mean(val),
    max = max(val),
    median = median(val),
    sd = sd(val),
    sw.stat = stats::shapiro.test(val)$statistic,
    sw.p = stats::shapiro.test(val)$p.value,
  )

df %>% pivot_longer(B:D, names_to = "var", values_to = "val") %>% 
  group_by(var) %>% 
  nest() %>% 
  group_modify(~fStat(.x))

Output:

# A tibble: 15 x 10
# Groups:   var [3]
   var   A         n   min      mean    max     median    sd sw.stat  sw.p
   <chr> <fct> <int> <dbl>     <dbl>  <dbl>      <dbl> <dbl>   <dbl> <dbl>
 1 B     1       200 -2.14   0.139     3.16   0.153    0.960   0.994 0.561
 2 B     2       200 -2.00   0.0185    2.61   0.0162   0.923   0.992 0.373
 3 B     3       200 -3.15   0.0245    2.42   0.0718   1.02    0.992 0.347
 4 B     4       200 -2.75   0.00112   2.73  -0.00691  1.02    0.993 0.496
 5 B     5       200 -3.32  -0.00758   3.23  -0.000105 0.993   0.991 0.250
 6 C     1       200 94.6   99.8     104.    99.8      1.97    0.992 0.365
 7 C     2       200 94.8  100.      104.   100.       1.85    0.991 0.290
 8 C     3       200 94.5  100.      106.   100.       1.94    0.996 0.877
 9 C     4       200 94.4   99.9     107.    99.9      1.97    0.993 0.463
10 C     5       200 94.3   99.8     106.    99.8      2.07    0.996 0.887
11 D     1       200 -4.89   1.81      8.11   1.90     2.09    0.995 0.750
12 D     2       200 -5.42   2.15      7.57   2.18     2.14    0.995 0.726
13 D     3       200 -4.38   2.09      7.10   2.02     1.97    0.989 0.111
14 D     4       200 -4.73   2.13      8.98   1.93     1.99    0.989 0.138
15 D     5       200 -2.19   2.24      7.79   2.25     1.87    0.996 0.867

Isn't that cool?
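
If only the summary table is needed, the nesting step may be optional. A minimal sketch, assuming the same simulated data, that groups by var and A directly and should produce one row per combination; the name stats_tbl and the reduced set of statistics are choices made for illustration only.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
df <- tibble(
  A = factor(rep(1:5, 200)),
  B = rnorm(1000),
  C = rnorm(1000, mean = 100, sd = 2),
  D = rnorm(1000, mean = 2, sd = 2)
)

# Group by variable name and by A, then summarise each group
stats_tbl <- df %>%
  pivot_longer(B:D, names_to = "var", values_to = "val") %>%
  group_by(var, A) %>%
  summarise(
    n = n(),
    mean = mean(val),
    sd = sd(val),
    sw.p = stats::shapiro.test(val)$p.value,  # Shapiro-Wilk p-value
    .groups = "drop"
  )

stats_tbl
```

The nested version remains useful when each variable needs more than a summary, for example a plot and a table produced from the same inner tibble.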
