Splitting out my data when the length of each varies

42 Views Asked by At

I have a string which I want to split into a dataframe. As you can see from my example and data below I want to split my current data out from a single cell into a position for each number, just using NAs where we have no value. I've tried splitting this out using a gsub but as the length varies from one to three '-' in each cell I don't get the result I am looking for. I thought using unlist(strsplit()) would be good for this but wasnt able to split the original data up correctly, again due to the differing lengths.

mydat
c("1-2 1-1", "1-2 1-1 3-3", "1-1 2-1 4-1", "1-1")

newdat
one   two   three     four    five    six
1     2      1        1       NA       NA
1     2      1        1       3        3
1     1      2        1       4        1
1     1      NA       NA      NA       NA
3

There are 3 best solutions below

0
jay.sf On BEST ANSWER

Using strsplit and adapting the `length<-`s.

> fn <- \(x) {
+   s <- strsplit(x, '-|\\s')
+   t(sapply(s, `length<-`, max(lengths(s))) )|> 
+     as.data.frame() |> type.convert(as.is=TRUE)
+ }
> fn(x) |> setNames(c('one', 'two', 'three', 'four', 'five', 'six'))
  one two three four five six
1   1   2     1    1   NA  NA
2   1   2     1    1    3   3
3   1   1     2    1    4   1
4   1   1    NA   NA   NA  NA

Data:

> dput(x)
c("1-2 1-1", "1-2 1-1 3-3", "1-1 2-1 4-1", "1-1")
0
Maël On

With tidyr::separate_wider_delim, you can set too_few = "align_start" to deal with varying lengths:

library(tidyr)
library(stringr)

mydat <- data.frame(col1 = c("1-2 1-1", "1-2 1-1 3-3", "1-1 2-1 4-1", "1-1"))
mydat |> 
  separate_wider_delim(col1, delim = regex(" |-"), names = as.character(english::english(1:6)), 
                       too_few = "align_start")

#   one two three four five  six
# 1   1   2     1    1 <NA> <NA>
# 2   1   2     1    1    3    3
# 3   1   1     2    1    4    1
# 4   1   1  <NA> <NA> <NA> <NA>
0
Muhammad Uzair Kabeer On

So, In order to do this you have to do the following steps:

  1. Split each each string by space and then by '-'
  2. then you have to find maximum length to the size of array
  3. Prepare a matrix of the required size
  4. Fill the matrix
  5. Display the matrix by making it a dataframe.

If you follow all of the above steps your code should look something like below:

mydat <- c("1-2 1-1", "1-2 1-1 3-3", "1-1 2-1 4-1", "1-1")

# step 1
split_data <- lapply(strsplit(mydat, " "), function(x) {
  sapply(strsplit(x, "-"), function(y) y)
})

# step 2
max_length <- max(sapply(split_data, function(x) length(unlist(x))))

# step 3
aligned_data <- matrix(NA, nrow = length(mydat), ncol = max_length)

# step 4
for (i in seq_along(split_data)) {
  row_data <- unlist(split_data[[i]])
  aligned_data[i, seq_along(row_data)] <- row_data
}

# step 5
data <- as.data.frame(aligned_data)
colnames(data) <- c("one", "two", "three", "four", "five", "six")

print(data)