na.omit is not removing NAs

I am trying to remove NAs in R. I have tried to replicate a simple example that I found in multiple places online, but I am getting unexpected output. I cannot find the error by searching online. What am I doing wrong?
I am using R version 4.3.2. I have restarted R and cleared the global variables (and restarted R again) and consistently get this result with anything I try.

a <- c(1,2,NA,3,4,NA,5,6)
b <- na.omit(a)
b

The output is

[1] 1 2 3 4 5 6
attr(,"na.action")
[1] 3 6
attr(,"class")
[1] "omit"

I was expecting to get the output 1 2 3 4 5 6

I have found that I can use b <- a[!(is.na(a))] instead, but I am curious why the commonly suggested na.omit does not work.

r2evans On BEST ANSWER

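You do get the intended values in the output. What I think you are missing is that the attr(,"na.action") and attr(,"class") lines are just attributes printed alongside the numeric vector with six non-NA numbers in it: "na.action" is attached to the vector and records the positions that were dropped (3 and 6), and "class" belongs to that index vector. The vector itself still behaves normally. If you do b + 1, you'll get the values incremented, with the attributes carried along: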

b + 1
# [1] 2 3 4 5 6 7
# attr(,"na.action")
# [1] 3 6
# attr(,"class")
# [1] "omit"
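To see that nothing is wrong with the data itself, here is a minimal base-R sketch that inspects b directly. The particular checks are only illustrative, and `dropped` is a name made up for this example:

```r
a <- c(1, 2, NA, 3, 4, NA, 5, 6)
b <- na.omit(a)

# The data part is an ordinary double vector with the six non-NA values.
typeof(b)
# [1] "double"
length(b)
# [1] 6
sum(b)
# [1] 21

# The only attribute on b itself is "na.action" ...
names(attributes(b))
# [1] "na.action"

# ... which stores the positions of the removed NAs and carries class "omit".
dropped <- attr(b, "na.action")
class(dropped)
# [1] "omit"
as.integer(dropped)
# [1] 3 6
```

So the extra lines in the printout are bookkeeping about what was removed, not extra elements of the vector.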

If you really want to use na.omit and remove the attributes, you can do:

attributes(b) <- NULL
b
# [1] 1 2 3 4 5 6
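If you would rather not modify b in place, another option (a sketch, not something from the question) is as.vector(), which for an atomic vector returns a copy with its attributes stripped. `b_plain` is just an illustrative name:

```r
a <- c(1, 2, NA, 3, 4, NA, 5, 6)

# as.vector() drops all attributes of an atomic vector, including "na.action"
b_plain <- as.vector(na.omit(a))
b_plain
# [1] 1 2 3 4 5 6

# the result is the same as with logical indexing
identical(b_plain, a[!is.na(a)])
# [1] TRUE
```

Note that as.vector() also drops names, so this only fits when you want a completely bare vector.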

Ultimately, though, a[!is.na(a)] is much faster and should be just as safe. Look at the `itr/sec` field to see that a[!is.na(a)] is ~10x faster on this small vector.

# requires the bench package
bench::mark(
  isna         = a[!is.na(a)],
  omit         = na.omit(a),
  omit_no_attr = `attributes<-`(na.omit(a), NULL),
  check = FALSE)
# # A tibble: 3 × 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time                gc                   
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <list>              <list>               
# 1 isna         311.88ns 325.96ns  2319673.        NA      0   10000     0     4.31ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
# 2 omit            2.8µs   3.29µs   236026.        NA     53.8  4389     1    18.59ms <NULL> <NULL> <bench_tm [4,390]>  <tibble [4,390 × 3]> 
# 3 omit_no_attr   2.91µs   3.38µs   286354.        NA      0   10000     0    34.92ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
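A note on check = FALSE: as far as I understand bench::mark(), it compares the results of the expressions by default, and na.omit(a) differs from the other two in its attributes even though the values match. A base-R sketch of that difference (x and y are just illustrative names):

```r
a <- c(1, 2, NA, 3, 4, NA, 5, 6)
x <- a[!is.na(a)]   # plain vector
y <- na.omit(a)     # same values plus the "na.action" attribute

# Not identical, because y carries an extra attribute
identical(x, y)
# [1] FALSE

# all.equal() compares attributes by default, so it also reports a difference
isTRUE(all.equal(x, y))
# [1] FALSE

# Ignoring attributes, the values are the same
isTRUE(all.equal(x, y, check.attributes = FALSE))
# [1] TRUE
```

Turning the check off lets expressions with different attributes be timed side by side.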

Even on a medium-sized vector (8,000 elements), it's still faster, though only by about 1.8x:

a_medium <- rep(a, 1000)
bench::mark(isna = a_medium[!is.na(a_medium)], omit = na.omit(a_medium), omit_no_attr = `attributes<-`(na.omit(a_medium), NULL), check = FALSE)
# # A tibble: 3 × 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time                gc                   
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <list>              <list>               
# 1 isna           16.2µs   18.3µs    53627.        NA     5.36  9999     1      186ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
# 2 omit           29.4µs   33.4µs    29944.        NA     0    10000     0      334ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>
# 3 omit_no_attr   29.5µs   33.7µs    29215.        NA     2.92  9999     1      342ms <NULL> <NULL> <bench_tm [10,000]> <tibble [10,000 × 3]>

But if it gets a lot larger, the gap narrows a bit further (roughly 1.5x by median time):

a_big <- rep(a, 100000)
bench::mark(isna = a_big[!is.na(a_big)], omit = na.omit(a_big), omit_no_attr = `attributes<-`(na.omit(a_big), NULL), check = FALSE)
# # A tibble: 3 × 13
#   expression        min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory time             gc                
#   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list> <list>           <list>            
# 1 isna           2.03ms   2.19ms      452.        NA     2.10   215     1      475ms <NULL> <NULL> <bench_tm [216]> <tibble [216 × 3]>
# 2 omit           3.08ms    3.3ms      259.        NA     2.05   126     1      487ms <NULL> <NULL> <bench_tm [127]> <tibble [127 × 3]>
# 3 omit_no_attr    3.1ms   3.27ms      302.        NA     2.05   147     1      487ms <NULL> <NULL> <bench_tm [148]> <tibble [148 × 3]>

But since we're talking on the order of 2-3 ms for a vector 800,000 elements long, the speedup may not be worth worrying about.