I have a large table of contract information (several hundred contracts) all collected in a single column of a table. Each contract occupies 6 sequential rows. I have been able to add another column (CAT) indicating the content of each row in the column: Company, Address, CitySTZip, Contact, Contract, Title. [Using R]
A reproducible version of the data after adding the second column to represent column names, the data looks like this:
textFile <- "col1|col2
XYZCo|Company
123 Main Street|Address
Yourtown, MA 12345|CityStZip
Joe Smith|Contact
20-234-56/3|Contract
Process for Work|Title
ZZTop Co|Company
123 Jefferson Street|Address
Chicago, IL 60636|CityStZip
Jane Doe|Contact
23-274-11/3|Contract
Yet Another One|Title"
data <- read.csv(text=textFile,header = TRUE,sep="|")
data
col1 col2
1 XYZCo Company
2 123 Main Street Address
3 Yourtown, MA 12345 CityStZip
4 Joe Smith Contact
5 20-234-56/3 Contract
6 Process for Work Title
7 ZZTop Co Company
8 123 Jefferson Street Address
9 Chicago, IL 60636 CityStZip
10 Jane Doe Contact
11 23-274-11/3 Contract
12 Yet Another One Title
I would like to reorder all the data so that each contract occupies a single row in the data table with the CAT values as the column headers.
Per comments on the answer posted below, an attempt at reformatting the file using for() loops was unsuccessful.
for(i in 1:nrow(data)){
for(j in 1:6){
# got stuck here...
}
}
The desired output would look like this:
Company Address CitySTZip Contact Contract Title
XYZCo 123 Main Street Yourtown, MA 12345 Joe Smith 20-234-56/3 Process for Work
After posting my original answer I realized that the data may be different than my assumptions due to the fact that the content in the original post referenced a column 1 and a column 2 in the raw data. If the data looks like the following, there is a relatively straightforward answer that combines
dplyrwithtidyr::pivot_wider().First, we'll read the data and print the resulting data frame, 2 columns including data values and column names.
The data frame looks like this:
In order to pivot the data frame to wide format tidy data, we'll need to add an ID column to distinguish one observation from other observations. We can use
dplyr::mutate()for this, along with theceiling()function. Theceiling()function is needed because we want the ID value to be constant for each 6 rows of input data. When we divide the result ofseq_along()by 6, it generates the desired vector.Once we've added the ID column, pivoting to wide format is relatively straightforward.
...and the output:
Original Answer
The interesting challenge with this question is that an observation spans 6 rows of data, yet is not in a fixed record layout so we can't use
read.fwf()orread.fortran()to read the file.Instead, we'll use
readLines()to read the data into a vector, then we'll write it to a temporary file, combining every 6 rows into a single output record. Finally, we'll read the reshaped data withread.csv().The original post wasn't clear whether the column names were distinguishable from the rest of the data in the raw data file, so this solution assumes that we need to parse them out of the desired result data frame.
First we read the data into a character vector with
readLines().Next, we strip out the column name data from the vector. Since there are only 6 column names, I was lazy and just used
sub()repeatedly. The original question indicates that the column names were added after the data was loaded into R, so this code may be unnecessary.Next, we loop through the vector, and combine every 6 rows into a single output record, using the pipe
|as a separator because the data includes commas in theCityStZipfield.Finally, we read the file we just created, and specify
sep = '|'as the separator between columns. We also use thecol.namesargument to set the column names....and the output: