Trouble with NA's in large dataframe


I'm having trouble standardizing my data. So, first things first, I create the data frame object from my file, with my desired row names (and I remove the 1st column, as it is not needed):

EXPGli <- read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/MergedEXP3.txt", row.names = 2)
EXPGli <- EXPGli[, -1]           # drop the unneeded first column
EXPGli <- as.data.frame(EXPGli)  # read.delim already returns a data frame, so this line is a no-op
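
A quick sanity check on the import would be to look at the column types, since character columns here would already explain arithmetic problems later (nothing below is specific to my file):

str(EXPGli)            # structure: dimensions, column names and column types
sapply(EXPGli, class)  # class of every column; all should be numeric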

Then I am supposed to convert all the columns to Z-scores (each column = gene expression values; each row = sample). The idea here is to convert every gene expression value into its Z-score for each cell:

Z_score <- function(x) {(x - mean(x)) / sd(x)}
apply(EXPGli, 2, Z_score)
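
For reference, base R's scale() performs the same column-wise standardization (centering and scaling), so it can serve as a cross-check, assuming all columns are numeric; EXPGli_scaled is just an illustrative name:

EXPGli_scaled <- scale(EXPGli)  # matrix of column-wise Z-scores
head(EXPGli_scaled)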

The apply() call prints [ reached 'max' / getOption("max.print") -- omitted 1143 rows ] and now my whole data frame consists of NA cells. Indeed, there are several NAs in the dataset, some full rows and even some full columns.

I tried several approaches to remove the NAs:

EXPGli <- na.omit(EXPGli)                     # drops every row containing any NA; an all-NA column removes all rows
EXPGli %>% drop_na()                          # tidyr equivalent; result is printed, not assigned back
print(EXPGli[rowSums(is.na(EXPGli)) == 0, ])  # keep only rows with zero NAs
na.exclude(EXPGli)                            # also not assigned back
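
Since some columns are entirely NA, any row-wise filter would drop every row anyway. A small diagnostic sketch (nothing here is specific to my data) to see where the NAs are, and to drop all-NA columns before filtering rows:

colSums(is.na(EXPGli))  # NAs per column
rowSums(is.na(EXPGli))  # NAs per row

# drop columns that are entirely NA before filtering rows
EXPGli <- EXPGli[, colSums(is.na(EXPGli)) < nrow(EXPGli)]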

Yet apparently, none of these removal attempts works. Additionally, is.na(EXPGli) returns FALSE for all fields. I would like to understand what I am doing wrong here; it seems the issue might be that the NAs are not being recognized by R as real NA values, but I couldn't find a fix for this. Any input is very appreciated, thanks in advance!
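
For what it's worth, if the missing values are stored as text (e.g. the literal string "NA", "NaN", or an empty field), the affected columns import as character and is.na() correctly reports FALSE everywhere. One way to check, and to re-read the file with explicit missing-value markers (the na.strings tokens below are guesses about the file), would be:

sapply(EXPGli, class)  # character columns are the telltale sign

# tell read.delim which tokens mark missing values
EXPGli <- read.delim("C:/Users/i5/Dropbox/Guilherme Vergara/Doutorado/Data/Datasets/MergedEXP3.txt",
                     row.names = 2, na.strings = c("NA", "NaN", ""))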

1 Answer

Answer by GuedesBF:

You may want to set the argument na.rm = TRUE in your calls to mean(x) and sd(x) inside the Z_score function; otherwise these calls return NA for any vector that contains NAs, which turns every standardized value in that column into NA.

Z_score <- function(x) {(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)}
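
A small usage sketch with the fixed function, assuming the columns of EXPGli are numeric (EXPGli_z is an illustrative name). Note that columns whose non-NA values are all identical still standardize to NaN (0 divided by sd of 0), and all-NA columns stay NA:

EXPGli_z <- apply(EXPGli, 2, Z_score)  # column-wise Z-scores, NAs ignored

# columns with no usable standardized values (zero variance or all NA)
colnames(EXPGli_z)[colSums(!is.na(EXPGli_z)) == 0]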