I'm trying to do a find all from a Word document for <v:imagedata r:id="rId7" o:title="1-REN"/> with namespace xmlns:v="urn:schemas-microsoft-com:vml" and I cannot figure out what on earth the syntax is.
The docs only cover the very straightforward case, and with the URN and VML combo thrown in I can't seem to get any of the examples I've seen online to work. Does anyone happen to know what it is?
I'm trying to do something like this:
namespace = {'v': "urn:schemas-microsoft-com:vml"}
results = ET.fromstring(xml).findall("imagedata", namespace)
for image_id in results:
    print(image_id)
Edit: What @aneroid wrote is 1000% the right answer and super helpful. You should upvote it. That said, after understanding all that, I went with the BS4 answer because it does the entire job in two lines, exactly how I need it to. If you don't actually care about the namespaces, it seems waaaaaaay easier.
ET.findall() vs BS4.find_all():
ET's findall() is not recursive by default*. It's only going to find direct children of the node provided. So in your case, it's only searching for image nodes directly under the root element.
* Prefixing the match argument (tag or path) with ".//" will search for that node anywhere in the tree, since findall() supports XPath syntax.
BS4's find_all() searches all descendants. So it searches for 'imagedata' nodes anywhere in the tree.
However, ElementTree.iter() does search all descendants, which you can try on the 'working with namespaces' example in the docs.
ET.iterfind(), which works with namespaces as a dict (like ET.findall), also does not search descendants, only direct children by default*, just like ET.findall. Apart from how empty strings '' in the tags are treated wrt the namespace, and one returns a list while the other returns an iterator, I can't say there's a meaningful difference between ET.findall and ET.iterfind.
* As with ET.findall(), prefixing ".//" makes it search the entire tree (matches with any node).
When you use the namespaces with ET, you still need the namespace prefix on the tag. The results line should search for "v:imagedata" instead of "imagedata".
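To make the difference concrete, here is a minimal sketch. The XML string is a made-up, trimmed-down stand-in for a Word document part (not the asker's real file), and the variable names are only for illustration. It compares the unprefixed search from the question, the prefixed search, and iter(), which takes the tag in {uri}localname form rather than a namespace dict.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one imagedata directly under the root, one nested deeper.
xml = """\
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:o="urn:schemas-microsoft-com:office:office">
  <v:imagedata r:id="rId1" o:title="top-level"/>
  <w:body>
    <v:shape>
      <v:imagedata r:id="rId7" o:title="1-REN"/>
    </v:shape>
  </w:body>
</w:document>
"""

namespace = {"v": "urn:schemas-microsoft-com:vml"}
root = ET.fromstring(xml)

# No prefix: looks for a tag named "imagedata" in no namespace, so nothing matches.
no_prefix = root.findall("imagedata", namespace)

# With the prefix: matches, but only among direct children of the root.
direct = root.findall("v:imagedata", namespace)

# iter() walks all descendants; the tag is written as {namespace-uri}localname.
everywhere = list(root.iter("{urn:schemas-microsoft-com:vml}imagedata"))

print(len(no_prefix), len(direct), len(everywhere))  # 0 1 2
```

The nested element only shows up in the iter() result, which is the "direct children only" behaviour described above.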
Also, the 'v' doesn't need to be a 'v'; you could change it to something more meaningful if needed, as long as the tag in the search uses the same prefix as the key in the namespace dict.
Of course, this still won't necessarily get you all the imagedata elements if they aren't direct children of the root. You could write a recursive function to do it for you (see this answer on SO for how), but note that a recursive search is likely to hit Python's recursion limit if the descendant depth is too...deep.
To get all the imagedata elements anywhere in the tree, it's simpler to use the ".//" prefix, i.e. search for ".//v:imagedata" with the same namespace dict.
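As a rough sketch of both points, the snippet below uses a renamed prefix ('vml' instead of 'v') together with the ".//" search. The sample XML, the R_NS constant and the rel_ids list are assumptions added for illustration; reading the r:id attribute is not part of the original answer, but it is probably what you want from each match.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with imagedata elements nested at different depths.
xml = """\
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
            xmlns:v="urn:schemas-microsoft-com:vml"
            xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
            xmlns:o="urn:schemas-microsoft-com:office:office">
  <w:body>
    <v:shape>
      <v:imagedata r:id="rId7" o:title="1-REN"/>
    </v:shape>
    <w:p>
      <w:r>
        <v:shape>
          <v:imagedata r:id="rId8" o:title="2-REN"/>
        </v:shape>
      </w:r>
    </w:p>
  </w:body>
</w:document>
"""

# The prefix in the dict is our own choice; only the URI has to match the document.
ns = {"vml": "urn:schemas-microsoft-com:vml"}
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

root = ET.fromstring(xml)

# ".//" makes findall() search the whole tree instead of direct children only.
images = root.findall(".//vml:imagedata", ns)

# Namespaced attributes are keyed as {namespace-uri}name.
rel_ids = [img.get(f"{{{R_NS}}}id") for img in images]
print(rel_ids)  # ['rId7', 'rId8']
```

Because ".//" is handled by findall() itself, there is no hand-written recursion here and so no recursion-limit concern.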