I am trying to parse the XML (or maybe HTML?) output of the San Francisco transit Operators API (free API key required):
https://511.org/open-data/transit
I pasted the full XML string into this Gist, since it is long and I have not minimized the example: https://gist.github.com/MichaelChirico/7a3a5bb95d577d8d83ebea37c44320d0
I'm using R's xml2 package, which uses libxml2 as a backend, to process this.
For some reason, I can't find the Operator nodes in the normal way:
library(xml2)
s = ' <xml string here> '
xml = read_xml(s)
xml_find_all(xml, "//Operator")
# {xml_nodeset (0)}
However, name() finds Operator as the correct node name:
# Using '*' because some intermediate nodes have the same issue,
# basically anything nested beyond a `siri:` node.
xml_find_chr(xml, 'name(*/*/*/*/*/*)')
# [1] "Operator"
And this convoluted approach works:
xml_find_all(xml, '//*[name() = "Operator"]') |> head()
# {xml_nodeset (6)}
# [1] <Operator id="5E" version="any">\n <Extensions>\n <Monitored>false</Monitored>\n <OtherM ...
# [2] <Operator id="5F" version="any">\n <Extensions>\n <Monitored>false</Monitored>\n <OtherM ...
# [3] <Operator id="5O" version="any">\n <Extensions>\n <Monitored>false</Monitored>\n <OtherM ...
# [4] <Operator id="5S" version="any">\n <Extensions>\n <Monitored>false</Monitored>\n <OtherM ...
# [5] <Operator id="AC" version="any">\n <Extensions>\n <Monitored>true</Monitored>\n <OtherMo ...
# [6] <Operator id="CE" version="any">\n <Extensions>\n <Monitored>false</Monitored>\n <OtherM ...
Is this a bug, or am I doing something wrong?
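This is not a bug. The XML in question has multiple namespaces, and in XPath 1.0 an unprefixed name such as Operator matches only elements that are in no namespace. The name() comparison works because it compares the element name as written in the document, which carries no prefix.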
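Two of them are relevant to the <Operator> XML element: the namespace bound to the siri: prefix on its ancestors, and the default (unprefixed) namespace that <Operator> itself belongs to.
So, the fully qualified XPath expression has to prefix each element name with a namespace alias, where ns1 is an alias for the default namespace.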
As a result, you need to add namespace handling to your code and use properly qualified XPath expressions.
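A minimal sketch of what that namespace handling could look like in xml2. The document below is a hypothetical miniature, not the real 511 response, and its namespace URIs are placeholders; the alias ns1 is simply chosen to match the wording above (xml2 itself generates aliases such as d1 for default namespaces).

```r
library(xml2)

# Hypothetical miniature of the response: a prefixed root and a child that
# declares a default namespace. The URIs are placeholders, not the real ones.
s <- '<siri:Siri xmlns:siri="http://example.com/siri">
  <siri:ServiceDelivery>
    <dataObjects xmlns="http://example.com/default">
      <Operator id="5E"/>
      <Operator id="AC"/>
    </dataObjects>
  </siri:ServiceDelivery>
</siri:Siri>'
doc <- read_xml(s)

# xml_ns() lists the namespaces used in the document; default (unprefixed)
# namespaces get generated aliases such as d1.
ns <- xml_ns(doc)

# Rename the generated alias d1 to ns1 (an arbitrary choice for readability).
ns <- xml_ns_rename(ns, d1 = "ns1")

# An unqualified name matches only elements in no namespace, so this is empty.
unqualified <- xml_find_all(doc, "//Operator")

# Qualifying the element name with the alias finds the nodes.
operators <- xml_find_all(doc, "//ns1:Operator", ns = ns)
ids <- xml_attr(operators, "id")
```

If the namespaces carry no information you need, xml2 also provides xml_ns_strip(), which removes default namespaces from the document so that unqualified expressions like "//Operator" match; whether that is appropriate depends on whether element names could collide across namespaces.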