I'm trying to write a function that edits workbooks imported via loadWorkbook and have a problem where blank cells get ignored rather than treated as NAs. The following code demonstrates this issue.
df <- data.frame(x = c(1, 2, 3, 4, NA, 6), y = c(7, 8, 9, 10, NA, 12))
xlsx::write.xlsx(df, file = 'testdf.xlsx', showNA = FALSE, row.names = FALSE)
testwb <- loadWorkbook(file = 'testdf.xlsx')
sheetnames <- c('Sheet1')
sheets <- lapply(sheetnames, function(name) getSheets(testwb)[[name]])
rows <- getRows(sheets[[1]], rowIndex=1:7)
cells <- getCells(rows, colIndex=1:2)
values <- sapply(unlist(cells), getCellValue)
values
length(values)
When I run it as is I get what I expected, with NAs included:
1.1 1.2 2.1 2.2 3.1 3.2 4.1 4.2 5.1 5.2 6.1 6.2 7.1 7.2
"x" "y" "1" "7" "2" "8" "3" "9" "4" "10" NA NA "6" "12"
However, if I open the excel file and manually delete the first row of data and then run it again, I get this:
1.1 1.2 3.1 3.2 4.1 4.2 5.1 5.2 7.1 7.2
"x" "y" "2" "8" "3" "9" "4" "10" "6" "12"
This causes problems for me because when doing comparisons between two sheets they end up with different numbers of values if one has more blank cells than the other. How do I force it to include NA values?
I tried manually rebuilding a copy of the workbook so I could use the keepNA argument in writeData like this:
testwb2 <- openxlsx::createWorkbook()
addWorksheet(testwb2, 'Sheet1')
writeData(testwb2, 'Sheet1', read_xlsx('testdf.xlsx', sheet = 1), keepNA = TRUE)
sheetnames <- c('Sheet1')
sheets <- lapply(sheetnames, function(name) getSheets(testwb2)[[name]])
rows <- getRows(sheets[[1]], rowIndex=1:7)
cells <- getCells(rows, colIndex=1:2)
values <- sapply(unlist(cells), getCellValue)
values
length(values)
but then I get the error
Error in envRefInferField(x, what, getClass(class(x)), selfEnv) : ‘getNumberOfSheets’ is not a valid field or method name for reference class “Workbook”
which I also don't understand. Is there a simple way to handle this?
The question is a little confusing, but the problem has probably been answered a few times already. Here's what I think is happening: When you remove the first row, your spreadsheet software does some kind of housecleaning and removes the empty row from the data (it is no longer part of the xml structure in the xlsx file you wrote), and the
xlsxpackage returns only the cells it could read in the workbook.I have tried to fix similar problems with
openxlsx, so I assume thatxlsxbehaves similarly.Below I have created a file that behaves as you described the problem, showing that
openxlsxalso behaves naughty, but that there are other packages that behave nice.