I'd like to split a dataset made of character strings into columns specified by start and end.
My dataset looks something like this:
>head(templines,3)
[1] "201801 1 78"
[2] "201801 2 67"
[3] "201801 1 13"
and i'd like to split it by specifying my columns using the data dictionary:
>dictionary
col_name col_start col_end
year 1 4
week 5 6
gender 8 8
age 11 12
so it becomes:
year week gender age
2018 01 1 78
2018 01 2 67
2018 01 1 13
In reality the data comes from a long running survey and the white spaces between some columns represent variables that are no longer collected. It has many variables so i need a solution that would scale.
In tidyr::separate
it looks like you can only split by specifying the position to split at, rather than the start and end positions. Is there a way to use start / end?
I thought of doing this with read_fwf
but I can't seem to be able to use it on my already loaded dataset. I only managed to get it to work by first exporting as a txt and then reading from this .txt:
write_lines(templines,"t1.txt")
read_fwf("t1.txt",
fwf_positions(start = dictionary$col_start,
end = dictionary$col_end,
col_names = dictionary$col_name)
is it possible to use read_fwf
on an already loaded dataset?
Answering your question directly: yes, it is possible to use
read_fwf
with already loaded data. The relevant part of the docs is the part about the argumentfile
:Thus, you can simply collapse your data and then use
read_fwf
:This should scale to multiple columns, and is fast for many rows (on my machine for 1 million rows and four columns about half a second).
There are a few warnings regarding parsing failures, but they stem from your dictionary. If you change the last line to
age, 11, 12
it works as expected.