I'm a newbie to the bigstatsr package. I have an SQLite database which I want to convert to an FBM matrix of 40k rows (genes) by 60k columns (samples) for later consumption. I found examples of how to populate the matrix with random values, but I'm not sure what would be the best way to populate it with values from my SQLite database.
Currently I do it sequentially, here's some mock code:
library(bigstatsr)
library(RSQLite)
library(dplyr)

number_genes <- 50e3
number_samples <- 70e3

large_genomic_matrix <- bigstatsr::FBM(nrow = number_genes,
                                       ncol = number_samples,
                                       type = "double",
                                       backingfile = "fbm_large_genomic_matrix")

# Fetch one sample's data frame at a time and copy it into the FBM
database_connection <- dbConnect(RSQLite::SQLite(), "database.sqlite")

sample_index_counter <- 1
for (current_sample in vector_with_sample_names) {
  sqlite_df <- dplyr::tbl(database_connection, "genomic_data") %>%
    dplyr::filter(sample == current_sample) %>%
    dplyr::collect()
  large_genomic_matrix[, sample_index_counter] <- sqlite_df$value
  sample_index_counter <- sample_index_counter + 1
}

big_write(large_genomic_matrix, "large_genomic_matrix.out", every_nrow = 1000, progress = interactive())
I have two questions:
- Is there a way to populate the matrix more efficiently? I'm not sure whether big_apply could be used here, or perhaps foreach.
- Do I always have to use big_write in order to load my matrix later? If so, why can't I just use the .bk file?
Thanks in advance
That is a very good first try that you came up with on your own.
What is inefficient here is to test for dplyr::filter(sample == current_sample) for every single sample. I would try to use match() first to get the indices. Then, what would still be a bit inefficient is to populate each column individually; as you said, you could use big_apply() to do this by blocks.

big_write() is for writing the FBM to some text file (e.g. a CSV). What you want here is to use FBM()$save() (second line of the example in the README), and then use big_attach() on the .rds file (next line of the README).
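Here is a minimal sketch of what that block-wise filling could look like. It reuses large_genomic_matrix, database_connection and vector_with_sample_names from your code, and assumes the genomic_data table has sample, gene and value columns; vector_with_gene_names (the genes in the FBM's row order) is a name I'm making up for illustration.

# Sketch only: assumes `large_genomic_matrix`, `database_connection` and
# `vector_with_sample_names` exist as in the question, that the `genomic_data`
# table has columns `sample`, `gene` and `value`, and that
# `vector_with_gene_names` holds the genes in the FBM's row order.
library(bigstatsr)
library(RSQLite)
library(dplyr)

big_apply(large_genomic_matrix, a.FUN = function(X, ind) {
  block_samples <- vector_with_sample_names[ind]
  # one query per block of columns instead of one query per sample
  block_df <- dplyr::tbl(database_connection, "genomic_data") %>%
    dplyr::filter(sample %in% block_samples) %>%
    dplyr::collect()
  for (k in seq_along(ind)) {
    one_sample <- block_df[block_df$sample == block_samples[k], ]
    # match() gives the row positions of this sample's genes in the FBM
    X[match(one_sample$gene, vector_with_gene_names), ind[k]] <- one_sample$value
  }
  NULL  # the FBM is modified on disk; nothing needs to be returned
}, ind = cols_along(large_genomic_matrix), block.size = 500)

Note that I keep ncores at its default of 1 here, since a single SQLite connection cannot be shared across parallel workers.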
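For reusing the matrix later, here is a short sketch of the save/attach workflow (the .rds file name follows from the backingfile used above). The .bk file alone holds only the raw data; the .rds written by $save() stores the metadata (dimensions, type, path to the .bk) needed to re-attach it.

# Save the FBM's metadata as an .rds next to the .bk backing file; $save()
# returns the FBM, so it can also be chained directly onto FBM(...).
large_genomic_matrix <- large_genomic_matrix$save()

# In a later R session, re-attach the matrix; the data stays memory-mapped in
# the .bk file and is not read into RAM.
library(bigstatsr)
large_genomic_matrix <- big_attach("fbm_large_genomic_matrix.rds")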