loading .dat into R, no column names

40 Views Asked by At

I am having trouble reading a large .dat file into R.

I am using

data <- read.table("...2018029_ascii/FRSS108PUF.dat", fill=TRUE)

This results in a large dataframe with V1, V2 as column names.

I am using the ASCII file at this link: https://nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2018029

"...nameoffolder/2018029_ascii/FRSS108PUF.dat"

1

There are 1 best solutions below

0
r2evans On

That is not a "something"-delimited file, it's fixed width.

readLines(unz("~/Downloads/2018029_ascii.zip", "FRSS108PUF.dat"), n=3)
# [1] "100011331 1 1 1 1 2 1 2 2 2 2 2 4 1 1 1 1 1 1 2 1 2 3 3 3 3 3 2 2 3 3 3 2 3 2 1 2 3 2 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 5 3 3 3 3 3 3 3 5 1 5 4 4 4 4 4 4 4 4 12000000000000000000000000000000000000000000000000000000000000000000000000017.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.30382293817.303822938 18.5837245717.927828408 18.5837245718.02058140118.02058140117.92782840818.471095936           0 18.5837245718.471095936 18.5837245718.47109593617.92782840818.36509251618.47109593617.92782840818.47109593618.36509251617.927828408"
# [2] "100022331 2 1 2 2 2 1 2 2 2 2 2 4 1 1 2 1 1 1 2 2 1 4 3 3 2 3 3 4 2 3 4 4 3 2 1 4 1 2 2 2 2 2 1 4 2 2 1 1 3 1 2 1 2 4 4 4 4 4 3 3 2 4 1 4 3 4 3 3 3 2 2 3 1200000000000000000000000000000000000000000000000000000000000000000000000005.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.4137039431 5.318726681 5.3187266815.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.4137039431 5.318726681 5.3187266815.33290239175.33290239175.4137039431           0 5.318726681 5.3187266815.41370394315.4137039431 5.3187266815.22702449685.41370394315.41370394315.41370394315.41370394315.41370394315.39811066135.39811066135.39811066135.41370394315.41370394315.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.2292535051"
# [3] "100032331 2 1 1 1 2 1 2 2 2 2 2 4 1 1 1 2 2 1 1 1 3 5 4 4 4 3 4 5 5 5 5 4 4 5 4 4 3 5 2 2 5 1 1 4 1 1 1 1 3 1 1 1 2 3 4 4 5 2 2 1 1 3 1 4 3 4 5 2 2 1 1 2 1200000000000000000000000000000000000000000000000000000000000000000000000005.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.4137039431 5.318726681 5.3187266815.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.41370394315.4137039431 5.318726681 5.3187266815.33290239175.33290239175.41370394315.4137039431 5.318726681 5.3187266815.41370394315.4137039431 5.3187266815.22702449685.41370394315.4137039431           05.41370394315.41370394315.39811066135.39811066135.39811066135.41370394315.41370394315.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.22925350515.2292535051"

The hard part about fixed-width formats is determining the widths of each field. Fortunately (somewhat), the documentation zip has LayoutPUF.pdf that contains each field and the columns for each.

The widths for that file should total 1441, since that's what we're getting from the file:

nchar(readLines(unz("~/Downloads/2018029_ascii.zip", "FRSS108PUF.dat"), n=3))
# [1] 1441 1441 1441

Counting up the columns, we can use

widths <- c(5, rep(1, 4), rep(2, 73), rep(1, 74), rep(12, 101))
out <- read.fwf(unz("~/Downloads/2018029_ascii.zip", "FRSS108PUF.dat"), widths = widths)
# Warning in readLines(file, n = thisblock) :
#   incomplete final line found on '~/Downloads/2018029_ascii.zip:FRSS108PUF.dat'
str(out)
# 'data.frame': 1527 obs. of  253 variables:
#  $ V1  : int  10001 10002 10003 10004 10005 10006 10007 10008 10009 10010 ...
#  $ V2  : int  1 2 2 3 2 2 1 1 2 3 ...
#  $ V3  : int  3 3 3 2 2 2 4 4 3 2 ...
#  $ V4  : int  3 3 3 1 1 1 2 4 4 2 ...
#  $ V5  : int  1 1 1 1 1 1 1 1 1 1 ...
#  $ V6  : int  1 2 2 2 2 1 2 2 2 2 ...
#  $ V7  : int  1 1 1 1 1 2 1 1 1 1 ...
#  $ V8  : int  1 2 1 2 2 2 2 1 2 1 ...
#  $ V9  : int  1 2 1 2 2 2 2 2 2 2 ...
#   [list output truncated]

Over to you to name all 253 columns. You can transcribe from the pdf (you might be able to scrape it, but that doesn't look like an awesome scrape-able pdf), starting with something like

colnames(out) <- c("IDNUMBER", "DSIZCL3", "URBAN", "OEREG", "Q1", "Q2", ...)

It will be laborious, no doubt.


Edit: try this.

m <- pdftools::pdf_text(unz("~/Downloads/2018029_documentation.zip", "LayoutPUF.pdf")) |>
  strsplit("\n") |>
  unlist()
head(m, 20)
#  [1] "U.S. DEPARTMENT OF EDUCATION"                                               "NATIONAL CENTER FOR EDUCATION STATISTICS (NCES)"                           
#  [3] "FAST RESPONSE SURVEY SYSTEM"                                                "CAREER AND TECHNICAL EDUCATION PROGRAMS IN PUBLIC SCHOOL DISTRICTS"        
#  [5] "PUBLIC USE FILE"                                                            ""                                                                          
#  [7] ""                                                                           " Variable Name   Type   Column(s) Description"                             
#  [9] "--------------- ------ ----------- -----------"                             "IDNUMBER         Char    1-5       Random Number assigned to each record"  
# [11] ""                                                                           "DSIZCL3         Num      6        District enrollment size in 3 categories"
# [13] "                                  1 = Less than 2,000"                      "                                  2 = 2,000-4,999"                         
# [15] "                                  3 = 5,000 or more"                        ""                                                                          
# [17] "URBAN           Num      7        Community type"                           "                                  1 = City"                                
# [19] "                                  2 = Suburban"                             "                                  3 = Town"                                
length(m)
# [1] 1323
m[1280:1289]
#  [1] ""                                                                                                           
#  [2] "FWT98    Num   1406-1417 Replicate Weight 98"                                                               
#  [3] ""                                                                                                           
#  [4] "FWT99    Num   1418-1429 Replicate Weight 99"                                                               
#  [5] ""                                                                                                           
#  [6] "FWT100   Num   1430-1441 Replicate Weight 100"                                                              
#  [7] ""                                                                                                           
#  [8] ""                                                                                                           
#  [9] ""                                                                                                           
# [10] ""                                                                                                           

head(m[10:1289])
# [1] "IDNUMBER         Char    1-5       Random Number assigned to each record"   ""                                                                          
# [3] "DSIZCL3         Num      6        District enrollment size in 3 categories" "                                  1 = Less than 2,000"                     
# [5] "                                  2 = 2,000-4,999"                          "                                  3 = 5,000 or more"                       
tail(m[10:1289])
# [1] ""                                              "FWT100   Num   1430-1441 Replicate Weight 100" ""                                             
# [4] ""                                              ""                                              ""                                             
nms <- grep("^\\S+", m[10:1289], value = TRUE) |>
  grep(x = _, "[", fixed = TRUE, invert = TRUE, value = TRUE)
head(nms)
# [1] "IDNUMBER         Char    1-5       Random Number assigned to each record"                         
# [2] "DSIZCL3         Num      6        District enrollment size in 3 categories"                       
# [3] "URBAN           Num      7        Community type"                                                 
# [4] "OEREG           Num      8        REGION"                                                         
# [5] "Q1    Num   9         Does your district offer CTE programs to students at the high school level?"
# [6] "Q2A   Num   10-11     Does an area/regional CTE center or a group/consortium of school districts" 
head(sub(" .*", "", x = nms))
# [1] "IDNUMBER" "DSIZCL3"  "URBAN"    "OEREG"    "Q1"       "Q2A"     
colnames(out) <- sub(" .*", "", x = nms)
out[1:3, 1:10]
#   IDNUMBER DSIZCL3 URBAN OEREG Q1 Q2A Q2B Q2C Q2D Q2E
# 1    10001       1     3     3  1   1   1   1   1   2
# 2    10002       2     3     3  1   2   1   2   2   2
# 3    10003       2     3     3  1   2   1   1   1   2

If you need the questions themselves, you may be able to get some of it from the nms object (some substring might be useful), though the filtering will leave many of those sentences incomplete.