I am trying to collect information from the World Bank, in particular the data from this website:
https://microdata.worldbank.org/index.php/catalog/3761/get-microdata
The "DDI/XML" link contains the metadata behind this dataset. If you download the DDI/XML file, and search for "a3_1", you will notice that the data under the tag "qstnLit" has important information commented out, for example:
"[CDATA[ Please identify which of the following you consider the most important development priorities in Vietnam. (Choose no more than THREE) - Food safety ]]"
I can't seem to collect this information. My code is the following:
from bs4 import BeautifulSoup
file = "VNM_2020_WBCS_v01_M.xml"
import pandas as pd
with open(file, "r", encoding = "utf-8") as f:
    sauce = f.read()
soup = BeautifulSoup(sauce, features = "lxml")
ref = pd.DataFrame()
soup = soup.select("codeBook")[0].select("dataDscr")[0].select("var")
for txt in soup:
    try:
        part = pd.DataFrame(data = {"ID" : [txt.attrs["name"]],
                                    "Qn" : [txt.find("qstn").find("qstnlit").text],
                                    "Lbl" : [txt.find("labl").text],
                                    "Max" : [txt.find(attrs = {"type":"max"}).text],
                                    "Min" : [txt.find(attrs = {"type":"min"}).text]
                                    })
        ref = ref.append(part)
    except:
        pass
I can pick up text that is not commented out, but no text that has been commented out.
Is there a way to recognise comments?
I think you have a different problem than you've described in your question. If I run your code as written, it prints this warning:
And produces no output (so how can we tell if it's working or not?).
If we modify the code to address that warning:
And to get rid of that blank except, which you should never use, so that the for loop looks like this:
We see it fail like this:
Now we're seeing useful information! That tells us that txt.find('qstn') has returned no results, so maybe we should check for that.
A second problem we see here is that you have misspelled qstnLit as qstnlit, so we need to fix that, too.
That gets us:
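The two fixes described above (checking whether the qstn lookup returned anything, and spelling qstnLit with its exact case) can be sketched as follows. This is only an illustration: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup so that it runs without extra packages, and the two-variable sample XML and the helper name question_text are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-variable sample; the real DDI file is far larger.
SAMPLE = """
<dataDscr>
  <var name="a3_1">
    <qstn><qstnLit><![CDATA[ Most important priorities - Food safety ]]></qstnLit></qstn>
  </var>
  <var name="id">
    <labl>Respondent ID</labl>
  </var>
</dataDscr>
"""


def question_text(var):
    """Return the question text of a <var> element, or None if it has none."""
    qstn = var.find("qstn")
    if qstn is None:  # some variables carry no question at all
        return None
    lit = qstn.find("qstnLit")  # tag names are case-sensitive in XML
    if lit is None or lit.text is None:
        return None
    return lit.text.strip()


root = ET.fromstring(SAMPLE.strip())
questions = {var.get("name"): question_text(var) for var in root.iter("var")}
print(questions)
```

Returning None for a missing question keeps the variable in the results instead of discarding it, which is a design choice rather than something the original code did.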
With those problems resolved, we have a new error:
The question becomes: how do we handle entries that are missing these attributes? Since you were previously discarding the entire entry in these situations, we can continue to do that by re-introducing the try/except block, now that we've solved the problem around the question text:
Note that rather than using a blank except, I'm capturing the specific exception we expect to receive when there is a missing attribute.
This code now runs without errors. But does it work? If we print out ref after the loop:
We seem to have found some results:
The tl;dr here is that your problems had nothing to do with the CDATA blocks. Rather, your try/except block was hiding errors that would help you resolve the problem. By at least temporarily removing that block, we were able to detect and correct code errors.
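To illustrate both points, here is a small sketch that assumes the standard library's xml.etree.ElementTree in place of BeautifulSoup and an invented one-variable sample. It suggests that a CDATA section comes back as ordinary element text, and that a blank except silently swallows the AttributeError caused by the misspelled tag name, whereas catching AttributeError explicitly lets you report it.

```python
import xml.etree.ElementTree as ET

# Invented sample with the question text wrapped in a CDATA section.
SAMPLE = (
    '<var name="a3_1"><qstn><qstnLit>'
    "<![CDATA[ Development priorities - Food safety ]]>"
    "</qstnLit></qstn></var>"
)
var = ET.fromstring(SAMPLE)

# CDATA is not a comment: the parser hands it back as plain text.
cdata_text = var.find("qstn").find("qstnLit").text.strip()
print(cdata_text)

# Anti-pattern: a blank except hides the real cause (misspelled "qstnlit").
try:
    hidden = var.find("qstn").find("qstnlit").text
except:  # noqa: E722 - shown only to demonstrate the problem
    hidden = None

# Catching the specific exception keeps the failure visible.
try:
    reported = var.find("qstn").find("qstnlit").text
except AttributeError as exc:
    reported = f"lookup failed: {exc}"
print(reported)
```

In the first try block nothing tells you why hidden is None; in the second, the message points straight at the lookup that returned None.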