How to account for single apostrophe/quotation in read.table?


I have the following data frame:

1            1                                        What percent of the world\xd5s population is between 15 and 64 years old?
2            2                                               What percent of the world\xd5s airports are in the United States? 
3            3                                            The area of the USA is what percent of the area of the Pacific Ocean?
4            4                                                      What percent of the earth\xd5s surface is covered by water?
5            5 What percent of the goods exported worldwide are mineral fuels (including oil, coal, gas, and refined products)?
6            6                    What percent of the world\xd5s countries have a higher fertility rate than the United States?
7            7                        What percent of the worldwide gross domestic product (GDP) comes from the service sector?
8            8                                    What percent of the worldwide income does the richest 10% of households earn?
9            9      What percent of the worldwide gross domestic product (GDP) is re-invested (\xd2gross fixed investment\xd3)?
10          10                                      What percent of the worldwide labor force works in the agricultural sector?
11          11                                             What percent of the worldwide land mass is not used for agriculture?
12          12                           What percent of the world\xd5s population speaks Mandarin Chinese as a first language?
13          13                What percentage of the world\xd5s countries have a higher life expectancy than the United States?
14          14                             What percent of the world\xd5s population aged 15 years or older can read and write?
15          15      What percent of the worldwide gross domestic product (GDP) is used for the military (military expenditure)?
16          16                                                    Saudi Arabia consumes what percentage of the oil it produces?
17          17                   What percent of the world\xd5s population lives in either China, India, or the European Union?
18          18                                                          What percent of the world\xd5s population is Christian?
19          19                                                               What percent of the world\xd5s roads are in India?
20          20                         What percent of the world\xd5s telephone lines are in China, USA, or the European Union?

Each question is supposed to contain an apostrophe in possessive words such as world's or earth's, but as you can see, the apostrophes are not being read the way I would like. I have tried expressions such as DF <- read.table("mydata.csv", header=TRUE, sep="\t", quote=""), to no avail. Surprisingly, it is extremely difficult to find an answer to this issue.

There are 4 best solutions below

0
On BEST ANSWER

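I eventually got this to work with DF1 <- read.csv("mydata.csv", header=TRUE, sep=",", quote=""). Setting quote="" turns off quote handling entirely, so apostrophes in the text are not treated as quoting characters.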
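As a minimal sketch of what quote="" does, the example below reads two tab-separated rows from an in-memory string rather than from mydata.csv. The sample text and the column names id and question are assumptions made for illustration only; they are not taken from the original file.

```r
# Illustrative in-memory data (assumed layout: tab-separated, with a header).
txt <- paste(
  "id\tquestion",
  "1\tWhat percent of the world's population is Christian?",
  "2\tWhat percent of the earth's surface is covered by water?",
  sep = "\n"
)

# quote = "" disables quote handling, so each apostrophe stays ordinary text
# instead of opening a quoted field that runs on into the next line.
df <- read.table(text = txt, header = TRUE, sep = "\t", quote = "",
                 stringsAsFactors = FALSE)

print(df$question)
```

Note that read.table treats both " and ' as quote characters by default, whereas read.csv only treats " as one, which is one reason the two functions can behave differently on the same file.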

1
On

If the problem cannot be fixed at read-in time, it can be patched afterwards with a regular expression. For example:

x <- "What percent of the world\xd5s population"
gsub("\\\xd5", "'", x)
[1] "What percent of the world's population"

You seem to have other mangled quotation marks as well; these can be handled by adding alternatives to the pattern (though, interestingly, not by regex shorthand classes such as \\d for digits):

x <- c("What percent of the world\xd5s population", 
       "gross domestic product (GDP) is re-invested (\xd2gross fixed investment\xd3)")
gsub("\\\xd5|\\\xd2|\\\xd3", "'", x)
[1] "What percent of the world's population"                                
[2] "gross domestic product (GDP) is re-invested ('gross fixed investment')"
4
On

You can read the file with readLines and exploit the fact that the first two columns together always appear to occupy 14 characters.

# Pass the path directly so that no unclosed file connection is left behind
r <- trimws(readLines("mydata.csv"))

res <- data.frame(do.call(rbind, strsplit(substring(r, 1, 14), "\\s+")), 
                  X3=trimws(substring(r, 15, nchar(r))))

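Then do the cleaning. Note that the patterns below match the literal four-character text \xd5 (backslash, x, d, 5), as it would appear if the file stores the escape sequences as plain text.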

within(res, {
  X1 <- as.numeric(X1)
  X2 <- as.numeric(X2)
  X3 <- gsub("\\\\xd5", "'", X3)
  X3 <- gsub("\\\\xd2|\\\\xd3", '"', X3)
})
#    X1 X2                                                                                                               X3
# 1   1  1                                           What percent of the world's population is between 15 and 64 years old?
# 2   2  2                                                   What percent of the world's airports are in the United States?
# 3   3  3                                            The area of the USA is what percent of the area of the Pacific Ocean?
# 4   4  4                                                         What percent of the earth's surface is covered by water?
# 5   5  5 What percent of the goods exported worldwide are mineral fuels (including oil, coal, gas, and refined products)?
# 6   6  6                       What percent of the world's countries have a higher fertility rate than the United States?
# 7   7  7                        What percent of the worldwide gross domestic product (GDP) comes from the service sector?
# 8   8  8                                    What percent of the worldwide income does the richest 10% of households earn?
# 9   9  9            What percent of the worldwide gross domestic product (GDP) is re-invested ("gross fixed investment")?
# 10 10 10                                      What percent of the worldwide labor force works in the agricultural sector?
# 11 11 11                                             What percent of the worldwide land mass is not used for agriculture?
# 12 12 12                              What percent of the world's population speaks Mandarin Chinese as a first language?
# 13 13 13                   What percentage of the world's countries have a higher life expectancy than the United States?
# 14 14 14                                What percent of the world's population aged 15 years or older can read and write?
# 15 15 15      What percent of the worldwide gross domestic product (GDP) is used for the military (military expenditure)?
# 16 16 16                                                    Saudi Arabia consumes what percentage of the oil it produces?
# 17 17 17                      What percent of the world's population lives in either China, India, or the European Union?
# 18 18 18                                                             What percent of the world's population is Christian?
# 19 19 19                                                                  What percent of the world's roads are in India?
# 20 20 20                            What percent of the world's telephone lines are in China, USA, or the European Union?
0
On

The string

What percent of the world\xd5s population is between 15 and 64 years old?

is most likely the result of reading a text file that contains non-ASCII characters. Here, the sequence \xd5 is how R prints a single byte with the value 0xD5, not the 4 characters \ x d 5. The byte values suggest the file is encoded in Mac OS Roman, where 0xD5 is the right single quote mark (the typographic apostrophe), and 0xD2 and 0xD3 are the left and right double quote marks respectively. So your file is being read correctly; it's just not being printed in the way you expect.
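If the file really is Mac OS Roman (an assumption based only on the byte values above, not something the question confirms), the bytes can be converted to proper characters instead of being patched one by one. The sketch below uses iconv with the encoding name "macintosh", which the R documentation describes as the most portable name for Mac Roman; whether that name is available depends on the platform's iconv.

```r
# A string containing the raw byte 0xD5, as it would arrive from the file.
x <- "What percent of the world\xd5s population"

# Reinterpret the bytes as Mac OS Roman and convert them to UTF-8.
# Assumption: the source encoding is Mac OS Roman ("macintosh").
y <- iconv(x, from = "macintosh", to = "UTF-8")

# Optionally normalise the typographic apostrophe (U+2019) to ASCII.
z <- gsub("\u2019", "'", y, fixed = TRUE)
print(z)

# The same idea at read-in time, assuming a tab-separated file:
# DF <- read.table("mydata.csv", header = TRUE, sep = "\t", quote = "",
#                  fileEncoding = "macintosh")
```

Converting the encoding up front has the advantage that every non-ASCII character in the file is handled, not only the three quote marks seen in this sample.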

To convert \xd5 into a regular ASCII quote mark:

gsub("\xd5", "'", x)  # no extra backslashes needed

And similarly, to convert \xd2 and \xd3 into ASCII double quote marks:

gsub("\xd2|\xd3", '"', x)

(And you should also read your data in with read.table(*, stringsAsFactors=FALSE) if you're using a version of R < 4.0.)
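In a UTF-8 locale, gsub may complain that strings containing bytes like \xd5 are invalid. A hedged variant of the replacements above is to match byte-wise with fixed = TRUE and useBytes = TRUE. The helper name replace_byte below is hypothetical and introduced only for this sketch.

```r
x <- c("What percent of the world\xd5s population",
       "re-invested (\xd2gross fixed investment\xd3)")

# Hypothetical helper: replace one raw byte with an ASCII string.
# fixed = TRUE treats the pattern literally; useBytes = TRUE compares
# byte by byte, so the input does not have to be valid in the locale.
replace_byte <- function(s, byte, ascii) {
  gsub(byte, ascii, s, fixed = TRUE, useBytes = TRUE)
}

clean <- replace_byte(x, "\xd5", "'")      # apostrophe
clean <- replace_byte(clean, "\xd2", '"')  # opening double quote
clean <- replace_byte(clean, "\xd3", '"')  # closing double quote
print(clean)
```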