How do I extract a JSON column in R when some entries are NULL and the keys present differ between rows?

I have a dataset in R that is in this format:

Name    Company      submitted_data
Bob     Bob Inc.     {"dob":2002-03-04, "tel": 1234}
Fadela  Fadela Co.   NULL
Andy    Andrew Inc.  {"dob": 1999-10-10, "industry": retail}

I wish to extract the data in the "submitted_data" column into separate columns, using the respective values and preserving NULLs, for example the above should look something like:

Name    Company      dob         tel   industry
Bob     Bob Inc.     2002-03-04  1234  null
Fadela  Fadela Co.   NULL        NULL  NULL
Andy    Andrew Inc.  1999-10-10  null  retail

I know I need to use the jsonlite package, but so far it has only thrown errors and I have not been able to get anywhere. Thank you.

Answer by SamR

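Essentially, your problem is that your column is not valid JSON:

The only unquoted entities allowed, apart from numbers, objects and arrays, are null, true and false.

In your data the bare values 2002-03-04 and retail are none of these, so the parser rejects them. They need to be quoted strings.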
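To see the problem concretely, here is a minimal sketch that runs jsonlite::validate() on one of the raw strings and on a hand-quoted version of it. The variable names are illustrative only.

```r
library(jsonlite)

# One of the raw strings from the question: the date is a bare, unquoted token
raw_txt <- '{"dob":2002-03-04, "tel": 1234}'

# The same object with the date written as a JSON string
quoted_txt <- '{"dob":"2002-03-04", "tel": 1234}'

# validate() returns a logical (with the parser message attached on failure)
raw_ok    <- isTRUE(as.logical(validate(raw_txt)))
quoted_ok <- isTRUE(as.logical(validate(quoted_txt)))

print(c(raw = raw_ok, quoted = quoted_ok))
```

Only the quoted version should pass, which is why the values have to be wrapped in quotes before calling fromJSON().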

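We can use the pattern gsub(":(.*?)([,}])", ':"\\1"\\2', txt) to wrap every value that follows a : in quotes (e.g. replacing "dob":2002-03-04 with "dob":"2002-03-04"). This relies on no value being quoted already and no value containing a comma or closing brace, which holds for the sample data. Note that the captured value includes any space after the colon, so it is worth running trimws() on the parsed values.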
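As a quick check of what the substitution does, the sketch below applies it to the two sample strings and parses the first result. quote_values is a hypothetical helper name introduced here for illustration, not part of jsonlite.

```r
library(jsonlite)

# Hypothetical helper: wrap every value that follows a colon in double quotes
quote_values <- function(txt) {
    gsub(":(.*?)([,}])", ':"\\1"\\2', txt)
}

samples <- c(
    '{"dob":2002-03-04, "tel": 1234}',
    '{"dob": 1999-10-10, "industry": retail}'
)

# gsub() is vectorised, so both strings are rewritten in one call
fixed <- quote_values(samples)
print(fixed)

# Every value is now a JSON string, so the text parses
parsed <- fromJSON(fixed[1])

# Values keep any space that followed the colon, hence trimws()
tel <- trimws(parsed$tel)
dob <- trimws(parsed$dob)
```

Because everything is quoted, numeric fields such as tel come back as character and would need as.numeric() if a number is wanted.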

I've used dplyr::bind_rows() here as it's an easy way to bind a list of data frames (one per parsed JSON object) which do not have the same columns for each row; any column a row lacks is filled with NA.
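A small sketch of that behaviour, using three hand-built one-row data frames that mirror the sample rows (the object names are illustrative):

```r
library(dplyr)

# One-row data frames with different sets of columns
bob    <- data.frame(Name = "Bob", dob = "2002-03-04", tel = "1234")
fadela <- data.frame(Name = "Fadela")
andy   <- data.frame(Name = "Andy", dob = "1999-10-10", industry = "retail")

# bind_rows() matches columns by name and fills the gaps with NA
combined <- bind_rows(list(bob, fadela, andy))
print(combined)
```

Base rbind() would stop with an error here because the column sets differ, which is the reason for reaching for bind_rows().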

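library(dplyr)

# Turn each row's submitted_data into a one-row data frame keyed by Name
dat_list <- lapply(split(dat, seq(nrow(dat))), \(row) {
    txt <- row$submitted_data
    name_df <- data.frame(Name = row$Name)
    if (is.na(txt) || txt == "NULL") {
        # Nothing to parse: keep only the key so the join fills in NA
        df_out <- name_df
    } else {
        # Quote every value so the text is valid JSON, then parse it
        json_txt <- jsonlite::fromJSON(gsub(":(.*?)([,}])", ':"\\1"\\2', txt))
        # The quoted values keep any space that followed the colon, so trim them
        df_out <- cbind(name_df, data.frame(lapply(json_txt, trimws)))
    }
    df_out
})

# Name is the join key, so it must be unique per row
dat |>
    select(-submitted_data) |>
    left_join(bind_rows(dat_list), by = "Name")

#     Name     Company        dob  tel industry
# 1    Bob    Bob Inc. 2002-03-04 1234     <NA>
# 2 Fadela  Fadela Co.       <NA> <NA>     <NA>
# 3   Andy Andrew Inc. 1999-10-10 <NA>   retail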

Input data:

dat  <- structure(list(Name = c("Bob", "Fadela", "Andy"), Company = c("Bob Inc.", 
"Fadela Co.", "Andrew Inc."), submitted_data = c("{\"dob\":2002-03-04, \"tel\": 1234}", 
"NULL", "{\"dob\": 1999-10-10, \"industry\": retail}")), class = "data.frame", row.names = c(NA, 
-3L))