"Duplicate labels" error message using the intsvy package in RStudio


I am using the package intsvy to analyze PISA data. Using the merge function, I am trying to combine the 2015 student file with the school file. However, I get an error telling me there are duplicate labels.

What is strange is that this code worked for the past two months, then inexplicably stopped working and started producing the error message below. The two files do share some variable labels, but it was my understanding that the merge function recognizes this and combines the two datasets.

Any insights as to why this error is suddenly occurring?

library(intsvy)

PISA2015 <- pisa.select.merge(folder = "/Users/x/Desktop/x/EPICER/Analysis/R Script and Supporting Datasets",
                              school.file = "2015_SCHQ.sav",
                              student.file = "2015_STUQ1.sav",
                              student = c("ESCS", "PARED"), 
                              school = c("CLSIZE", "SCHSIZE"),
                              countries = c("PRT"))

File character set is 'WINDOWS-1252'.
Converting character set to UTF-8.
File character set is 'WINDOWS-1252'.
Converting character set to UTF-8.
Error in as.factor(x) : Duplicate labels
In addition: Warning messages:
  1: 11 variables have duplicated labels:
  CNTRYID, Region, STRATUM, SUBNATIO, ST011D17TA, ST011D18TA,
ST011D19TA, PROGN, OCOD1, OCOD2, OCOD3 
2: 4 variables have duplicated labels:
  CNTRYID, Region, STRATUM, SUBNATIO 

I have tried deleting the original PISA data files and redownloading them, but the issue persists. I have also tried uninstalling and reinstalling both the package and RStudio, but that did not work either.

L Tyrone On

I can't say why the issue is occurring (it may be a bug), but here is a reprex using data from the PISA 2015 database. You can replace the file paths with your own.

The approach outlined below bypasses the intsvy package and instead uses the dplyr and haven packages. I tried your method using intsvy and received the same error. I have never used intsvy but perhaps some other settings need to be declared. Either way, this works:

library(haven) # For importing .sav into R
library(dplyr) # For data manipulation
options(scipen = 999) # For this reprex, so school IDs are not displayed in scientific notation

# PISA data downloaded from https://www.oecd.org/pisa/data/2015database/ for this reprex
# Load previously unzipped .sav files into R (replace paths with your file paths)
SCHQ_2015 <- read_sav("C:/test/CY6_MS_CMB_SCH_QQQ.sav")
STUQ1_2015 <- read_sav("C:/test/CY6_MS_CMB_STU_QQQ.sav") # May take a while depending on your computer

# Subset schools data
schqprt15 <- SCHQ_2015 %>%
  select(CNT, CNTSCHID, CLSIZE, SCHSIZE) %>%
  filter(CNT == "PRT")

# Subset students data
stuqprt15 <- STUQ1_2015 %>%
  filter(CNT == "PRT") %>%
  select(CNTSCHID, ESCS, PARED)

# Join data: add the school variables to each student record by school ID
PISA2015 <- stuqprt15 %>%
  left_join(schqprt15, by = "CNTSCHID")

# Result (rows 100 to 110 of the merged data)
data.frame(PISA2015[100:110, ])
   CNTSCHID    ESCS PARED CNT CLSIZE SCHSIZE
1  62000005  0.3794    12 PRT     23     680
2  62000005  0.3238    15 PRT     23     680
3  62000005 -0.5893     9 PRT     23     680
4  62000005 -1.9881     6 PRT     23     680
5  62000005  0.5125    15 PRT     23     680
6  62000005 -1.3411     9 PRT     23     680
7  62000005  1.2722    17 PRT     23     680
8  62000005 -1.4580     9 PRT     23     680
9  62000005 -0.4288    12 PRT     23     680
10 62000005  0.9064    15 PRT     23     680
11 62000005 -0.2307    12 PRT     23     680
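
If you want to confirm the join behaved as expected, a few quick checks can help. The sketch below uses small made-up data frames (students, schools and their values are hypothetical stand-ins, not PISA data) so it runs on its own. The same checks should carry over to stuqprt15 and schqprt15, assuming CNTSCHID identifies each school exactly once in the school file.

library(dplyr)

# Toy stand-ins for the student and school subsets (made-up values, not PISA data)
students <- tibble(
  CNTSCHID = c(1, 1, 2, 3),
  ESCS     = c(0.4, -0.6, 1.3, 0.1)
)
schools <- tibble(
  CNTSCHID = c(1, 2),
  CLSIZE   = c(23, 28)
)

merged <- left_join(students, schools, by = "CNTSCHID")

# 1. A many-to-one left join should keep exactly one row per student
same_row_count <- nrow(merged) == nrow(students)

# 2. School IDs should be unique, otherwise student rows get duplicated
duplicated_school_ids <- schools %>%
  count(CNTSCHID) %>%
  filter(n > 1)

# 3. Students whose school is missing from the school data (their school columns are NA)
unmatched_students <- anti_join(students, schools, by = "CNTSCHID")

In the toy data, the student in school 3 has no matching school, so that row ends up in unmatched_students and has NA for CLSIZE in merged. With the real files, an empty duplicated_school_ids and an empty unmatched_students would suggest the merge is clean.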