Loop through chisq.test RStudio


I am trying to perform chi-squared tests on about 30 variables. I tried to write a for loop, but had no luck. The loop should also save the p-value of each test.

I have used this kind of setup before, in other instances, but I recognise that using a vector with the dataframe$variable names does not work on this occasion. I suspect that there is something fundamental I do not understand about the translation from text to variable name.

Example:

survey <- data.frame(
  sex = c(1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  health = c(1, 2, 3, 4, 5, 1, 3, 2, 4, 5),
  happiness = c(1, 3, 4, 5, 1, 2, 4, 2, 3, 5)
)

variables <- c("survey$health", "survey$happiness")
nLoops <- length(variables)

result <- matrix(nrow = nLoops, ncol = 2)

for (i in 1:nLoops){
  test <- chisq.test(variables[i], survey$sex)
  result[, 1] <- test$data.name
  result[, 2] <- test$p.value
}

There are 3 best solutions below

Limey On

A base R solution, changing your for loop to an lapply call:

lapply (
  c("health", "happiness"),
  function(var) {
    test <- chisq.test(survey[[var]], survey$sex)
    c("name" = paste(var, "vs sex"), "p.val" = test$p.value)
  }
) 
[[1]]
               name               p.val 
    "health vs sex" "0.796763382262977" 

[[2]]
               name               p.val 
 "happiness vs sex" "0.211945584372718" 
Warning messages:
1: In chisq.test(survey[[var]], survey$sex) :
  Chi-squared approximation may be incorrect
2: In chisq.test(survey[[var]], survey$sex) :
  Chi-squared approximation may be incorrect

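The warning messages are caused by your small sample size: with only 10 observations, the expected cell counts are too low for the chi-squared approximation to be reliable.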
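To see where the warning comes from, it can help to inspect the expected cell counts. A minimal sketch, assuming the survey data frame from the question: chisq.test() stores the expected counts in the `expected` component of its result, and it warns when any of them is below 5. The names `test`, `min_expected` and `any_small` are only illustrative.

```r
survey <- data.frame(
  sex = c(1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  health = c(1, 2, 3, 4, 5, 1, 3, 2, 4, 5),
  happiness = c(1, 3, 4, 5, 1, 2, 4, 2, 3, 5)
)

# suppressWarnings() only hides the message; the approximation is still doubtful
test <- suppressWarnings(chisq.test(survey$health, survey$sex))

# Expected counts under independence, one row per health level
print(test$expected)

min_expected <- min(test$expected)   # smallest expected count in the table
any_small <- any(test$expected < 5)  # TRUE here, which is what triggers the warning
```

With two observations per health level split across two sexes, every expected count is well below 5, so the warning should be taken seriously rather than silenced.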

A tidyverse solution, which I believe is more robust as it is independent of the names of the columns you wish to analyse. It can easily be generalised to be robust with respect to your grouping variable as well.

library(tidyverse)

survey %>% 
  pivot_longer(
    -sex,
    names_to = "Variable",
    values_to = "Value"
  ) %>% 
  group_by(Variable) %>% 
  group_map(
    function(.x, .y) {
      test <- chisq.test(.x$Value, .x$sex)
      c("name" = paste(.y$Variable, "vs sex"), "p.val" = test$p.value)
    }
  )

Results are identical to the above.
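For comparison, the same generalisation (every other column tested against a grouping variable that is chosen by name) can also be sketched in base R. This is an illustration only: `chisq_against` is a hypothetical helper, not part of any package, and it assumes all remaining columns are categorical.

```r
survey <- data.frame(
  sex = c(1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  health = c(1, 2, 3, 4, 5, 1, 3, 2, 4, 5),
  happiness = c(1, 3, 4, 5, 1, 2, 4, 2, 3, 5)
)

# Hypothetical helper: test every column except `group` against `group`
chisq_against <- function(data, group) {
  others <- setdiff(names(data), group)
  p_values <- vapply(
    others,
    function(var) chisq.test(data[[var]], data[[group]])$p.value,
    numeric(1)  # each test must yield exactly one numeric p-value
  )
  data.frame(variable = others, p_value = unname(p_values))
}

# Warnings are silenced here only to keep the output short
res <- suppressWarnings(chisq_against(survey, "sex"))
print(res)
```

Because the column names are derived from the data frame itself, nothing needs to change when more variables are added.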

maksobelser On

A solution with a loop:

variables <- c("health", "happiness")
p_values <- NULL

for (var in variables) {
  # survey[[var]] extracts the column as a plain vector
  test <- chisq.test(survey[[var]], survey$sex)
  p_values <- c(p_values, test$p.value)
}

result <- data.frame(variables = variables, p_values = p_values)
result

You weren't referring to the variables correctly. "survey$health" is just a character string, so R does not treat it as a reference to a column of the data frame. I also built a data.frame for the results, for a tidier presentation.

You could also wrap each string from your original snippet in eval(parse(text = "survey$health")) to turn it into an actual expression, but that is an over-complication; just index with [[ ]]. I mention it only to clarify why your snippet didn't work.
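The difference between a string and a column can be checked directly. A small sketch, assuming the survey data frame from the question; `x`, `col_name` and `msg` are illustrative names only.

```r
survey <- data.frame(
  sex = c(1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  health = c(1, 2, 3, 4, 5, 1, 3, 2, 4, 5),
  happiness = c(1, 3, 4, 5, 1, 2, 4, 2, 3, 5)
)

x <- "survey$health"
is.character(x)  # TRUE: a string, not the column
length(x)        # 1, whereas survey$sex has 10 elements

# Passing the string to chisq.test() fails; capture the error message
msg <- tryCatch(
  chisq.test(x, survey$sex),
  error = function(e) conditionMessage(e)
)

# Indexing with [[ ]] accepts a column name stored in a variable
col_name <- "health"
identical(survey[[col_name]], survey$health)  # TRUE

# $ does not evaluate its argument: this looks for a column named "col_name"
is.null(survey$col_name)  # TRUE
```

This is why a vector of column names combined with [[ ]] works in a loop, while a vector of "survey$..." strings does not.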

jay.sf On

You could use a custom function my_chi2 in an lapply and rbind.

> my_chi2 <- \(x, y, data=survey, ...) {
+   ct <- chisq.test(data[[x]], data[[y]], ...)
+   ct$data.name <- sprintf('%s and %s', x, y)
+   as.data.frame(ct[c('data.name', 'p.value')])
+ }

Usage

> (res <- lapply(variables, my_chi2, 'sex') |> do.call(what='rbind'))
          data.name   p.value
1    health and sex 0.7967634
2 happiness and sex 0.2119456
Warning messages:
1: In chisq.test(data[[x]], data[[y]], ...) :
  Chi-squared approximation may be incorrect
2: In chisq.test(data[[x]], data[[y]], ...) :
  Chi-squared approximation may be incorrect

where

> str(res)
'data.frame':   2 obs. of  2 variables:
 $ data.name: chr  "health and sex" "happiness and sex"
 $ p.value  : num  0.797 0.212

You may pass arguments, e.g.

> lapply(variables, my_chi2, 'sex', simulate.p.value=TRUE) |> do.call(what='rbind')
          data.name   p.value
1    health and sex 1.0000000
2 happiness and sex 0.6046977
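Note that simulate.p.value=TRUE uses Monte Carlo simulation, so the p-values change slightly from run to run. A minimal sketch of making them reproducible with set.seed(), assuming the survey data from the question; the seed value 42 is arbitrary and B = 2000 is simply the documented default written out.

```r
survey <- data.frame(
  sex = c(1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  health = c(1, 2, 3, 4, 5, 1, 3, 2, 4, 5),
  happiness = c(1, 3, 4, 5, 1, 2, 4, 2, 3, 5)
)

# Fix the random number generator state before each simulated test
set.seed(42)  # arbitrary seed, for illustration only
p1 <- chisq.test(survey$happiness, survey$sex,
                 simulate.p.value = TRUE, B = 2000)$p.value

set.seed(42)
p2 <- chisq.test(survey$happiness, survey$sex,
                 simulate.p.value = TRUE, B = 2000)$p.value

identical(p1, p2)  # TRUE: same seed, same simulated p-value
```

Without a seed, the figures printed above will generally not be reproduced exactly.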

NB: If you depend on your for loop, you would need to do the following,

> variables <- c("health", "happiness")
> nLoops <- length(variables)
> result <- matrix(nrow = nLoops, ncol = 2)
> for (i in 1:nLoops) {
+   test <- chisq.test(survey[[variables[i]]], survey$sex)
+   result[i, 1] <- test$data.name
+   result[i, 2] <- test$p.value
+ }

but that isn't very R-ish.
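One more thing to be aware of with the matrix approach: a matrix can hold only one type, so storing a name next to a p-value turns the p-value into text. A sketch of a loop that keeps the p-values numeric by pre-allocating a numeric vector instead, assuming the same survey and variables as above; `m`, `p_values` and `result` are illustrative names.

```r
survey <- data.frame(
  sex = c(1, 2, 2, 1, 1, 2, 1, 1, 2, 1),
  health = c(1, 2, 3, 4, 5, 1, 3, 2, 4, 5),
  happiness = c(1, 3, 4, 5, 1, 2, 4, 2, 3, 5)
)
variables <- c("health", "happiness")

# Mixing a string and a number in one matrix coerces everything to character
m <- matrix(nrow = 1, ncol = 2)
m[1, 1] <- "health"
m[1, 2] <- 0.5
typeof(m)  # "character"

# Pre-allocate a named numeric vector and fill it by name
p_values <- numeric(length(variables))
names(p_values) <- variables
for (var in variables) {
  # warnings are silenced only to keep the output short
  p_values[var] <- suppressWarnings(
    chisq.test(survey[[var]], survey$sex)
  )$p.value
}

result <- data.frame(variable = variables, p_value = unname(p_values))
is.numeric(result$p_value)  # TRUE
```

Keeping the p-values numeric matters if they are sorted, filtered or adjusted later.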


Data:

> dput(survey)
structure(list(sex = c(1, 2, 2, 1, 1, 2, 1, 1, 2, 1), health = c(1, 
2, 3, 4, 5, 1, 3, 2, 4, 5), happiness = c(1, 3, 4, 5, 1, 2, 4, 
2, 3, 5)), class = "data.frame", row.names = c(NA, -10L))
> dput(variables)
c("health", "happiness")