Sepereate string into table by variable number of whitespaces

70 Views Asked by At

I have some data that get imported to R like that (character string):

> dput(my_data)
c("S Leistung Sub Text Ergebnis Einheit Normal Auffällig Katalog Datum Zeit Kommentar ", 
"   APOA_S      Apo A1   1.11   g/l   1.04 - 2.02      01   30.03.2023   06:56   ", 
"  ", "", "   APOB_S      Apo B   1.09   g/l   0.66 - 1.33      01   30.03.2023   06:56   ", 
"  ", "", "   B-BA_E      Basophile Granulozyten absolut   0.04   exp 9/l   0 - 0.1      01   27.03.2023   11:56   ", 
"  ", "", "   B-DBB_E      Differentialblutbild   ·            01   27.03.2023   11:45   ", 
"  ")                        
 

there are more lines, than displayed here. I need to bring it in the form of a table as follows:

S  Leistung  Sub  Text    Ergebnis  Einheit  ...   ...
   APOA_S         Apo A1  1.11      g/l
   APOB_S         Apo B1  1.09      g/l
   ...
   ...

I only really need the column named "Leistung" and "Ergebnis", but an ouput with all of them is good, too!

Problem is that my data:

  • it doesn't use a regular separator
  • The only "seperator" I could identify are multiple blank spaces (2-15 spaces)
  • However, 1 blank space can happen within a value (like "Apo A1") and should not be a separator.
  • within the header, only 1 space serves as a seperator
  • Fixed width doesn't work.
  • the column named "Sub" is always empty.

Is there a way to separate it by (multiple) blank spaces? How to deal with the empty "Sub" column?

2

There are 2 best solutions below

4
Merijn van Tilborg On BEST ANSWER

update

With missing data without a clear seperator, it is hard to assign what misses, but according to OP he needs not all data, so here a solution that get limited columns.

data <- do.call(rbind, lapply(stri_trim_both(my_data)[-1], \(line) {
  as.data.frame(t(na.omit(strsplit(line, "\\s{2,}")[[1]])[1:3]))
}))

# Just hardcode them
names(data) <- c("Leistung", "Text", "Ergebnis")

data

#   Leistung                           Text Ergebnis
# 1   APOA_S                         Apo A1     1.11
# 2   APOB_S                          Apo B     1.09
# 3  B-DBB_E           Differentialblutbild        ·
# 4   B-BA_E Basophile Granulozyten absolut     0.04

Update 2

Looking at the data - IF you always have the first 2 columns and the last three columns and either missing data is either a dot or missing at all we could do something like this to get the full data.

data <- do.call(rbind, lapply(stri_trim_both(my_data)[-1], \(line) {
  
  v <- strsplit(line, "\\s{2,}")[[1]]
  l <- length(v)
  if (l > 0) {
    if (l < 8) {
      v <- c(v[1:2], rep(NA, 3), v[(l-2):l])
    }
    as.data.frame(t(v))
  }
  
}))

names(data) <- c("Leistung", "Text", "Ergebnis", "Einheit", "Normal_Auffällig", "Katalog", "Datum", "Zeit")

data

  Leistung                           Text Ergebnis Einheit Normal_Auffällig Katalog      Datum  Zeit
1   APOA_S                         Apo A1     1.11     g/l      1.04 - 2.02      01 30.03.2023 06:56
2   APOB_S                          Apo B     1.09     g/l      0.66 - 1.33      01 30.03.2023 06:56
3  B-DBB_E           Differentialblutbild     <NA>    <NA>             <NA>      01 27.03.2023 11:45
4   B-BA_E Basophile Granulozyten absolut     0.04 exp 9/l          0 - 0.1      01 27.03.2023 11:56

New test.txt

S Leistung Sub Text Ergebnis Einheit Normal Auffällig Katalog Datum Zeit Kommentar 
   APOA_S      Apo A1   1.11   g/l   1.04 - 2.02      01   30.03.2023   06:56   

   APOB_S      Apo B   1.09   g/l   0.66 - 1.33      01   30.03.2023   06:56  

   B-DBB_E      Differentialblutbild   ·            01   27.03.2023   11:45             
  B-BA_E      Basophile Granulozyten absolut   0.04   exp 9/l   0 - 0.1      01   27.03.2023   11:56   

original answer

I assume you read in a text file somehow, lets simulate that.

test.txt

S Leistung Sub Text Ergebnis Einheit Normal Auffällig Katalog Datum Zeit Kommentar 
   APOA_S      Apo A1   1.11   g/l   1.04 - 2.02      01   30.03.2023   06:56   
   APOB_S      Apo B   1.09   g/l   0.66 - 1.33      01   30.03.2023   06:56   

When we read your data you get more or less what you show as my_data

my_data <- readLines("test.txt")

my_data

[1] "S Leistung Sub Text Ergebnis Einheit Normal Auffällig Katalog Datum Zeit Kommentar " "   APOA_S      Apo A1   1.11   g/l   1.04 - 2.02      01   30.03.2023   06:56   "   
[3] "   APOB_S      Apo B   1.09   g/l   0.66 - 1.33      01   30.03.2023   06:56   "    

From there we take two steps, I made some guesses here for the naming and the amount of data records I found in your rows. So I removed S, Sub and Kommentar and I guessed that Normal and Auffällig could be merged indicating the two values. You might adjust that if I was wrong there.

library(stringi) # used for some replacements

header <- stri_replace_all_fixed(my_data[1], "Normal Auffällig", "Normal_Auffällig")
header <- strsplit(header, " ")[[1]]
header <- header[!header %in% c("S", "Sub", "Kommentar")]

I assume these are the ones to "keep"

[1] "Leistung"         "Text"             "Ergebnis"         "Einheit"          "Normal_Auffällig" "Katalog"          "Datum"            "Zeit"    

Then we grab your data, which starts without the first header line, we trim it left and right first and then split by more than one space.

data <- do.call(rbind, lapply(stri_trim_both(my_data)[-1], \(line) {
  as.data.frame(t(strsplit(line, "\\s{2,}")[[1]]))
}))

Now add your headers

names(data) <- header

The final result

data

  Leistung   Text Ergebnis Einheit Normal_Auffällig Katalog      Datum  Zeit
1   APOA_S Apo A1     1.11     g/l      1.04 - 2.02      01 30.03.2023 06:56
2   APOB_S  Apo B     1.09     g/l      0.66 - 1.33      01 30.03.2023 06:56
1
Ben Bolker On

Here's a start:

## split on *two* spaces, trim white space, drop empty elements
x2 <- strsplit(d, "  ") |>
   sapply( trimws ) |> 
      sapply( function(x) x[nzchar(x)])
## drop empty rows
x3 <- x2[lengths(x2) > 0]
## skip header, take first four columns
x4 <- lapply(x3[-1], function(x) x[1:4])
## combine
do.call(rbind, x4)
    [,1]      [,2]                             [,3]   [,4]     
[1,] "APOA_S"  "Apo A1"                         "1.11" "g/l"    
[2,] "APOB_S"  "Apo B"                          "1.09" "g/l"    
[3,] "B-BA_E"  "Basophile Granulozyten absolut" "0.04" "exp 9/l"
[4,] "B-DBB_E" "Differentialblutbild"           "·"    "01"     

You'll still have to parse the header to extract the column names; convert to a data frame and convert columns to numeric as appropriate; figure out what's up with the formatting/structure in lines 3-4; etc. ...