Hello dear Overflowers,
I work mainly with Python to process XML files, mostly EAD but in this case DC. I have to compare two datasets: 1. EAD and 2. DC. The goal is to remove all duplicates from the DC dataset based on a list of EAD IDs.
The DC file looks like this and has a relatively flat hierarchy:
NOTE: the main focus is on the <dc:identifier type="providerItemId"> element.
<?xml version="1.0" encoding="UTF-8"?><metadata xmlns:europeana="http://www.europeana.eu/schemas/ese/"
xmlns:dcterms="http://purl.org/dc/terms/" xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:doc="http://www.lyncode.com/xoai" xmlns:dc="http://purl.org/dc/elements/1.1/">
<record>
<dc:title xml:lang="">Vorlass Autonomes Frauen- Lesbenreferat der RUB (VL-FLRub)</dc:title>
<dc:identifier type="URL"
>https://meta-katalog.eu/Record/VorlassAutonomesFrauen-LesbenreferatderRUBVL-FLRublieselle</dc:identifier>
<dc:identifier type="providerItemId"
>VorlassAutonomesFrauen-LesbenreferatderRUBVL-FLRublieselle</dc:identifier>
<dc:identifier type="providerId">oid1553513926447</dc:identifier>
<dc:source xml:lag="de">Vorlass Autonomes Frauen- Lesbenreferat der RUB (VL-FLRub). </dc:source>
<dc:type type="document" xml:lang="de">Archivgut</dc:type>
</record>
<record>
<dc:title xml:lang="">Let's talk about Funk'n Flug</dc:title>
<dc:description type="object" xml:lag="de">Konzept und Stimmen: Charlotte Kaiser, Rebecca
Schröder, Katja Teichmann Skript: Rebecca Schröder, Katja Teichmann Schnitt: Rebecca
Schröder Sounddesign: Isabel Hintzen</dc:description>
<dc:rights type="binary"
>https://www.deutsche-digitale-bibliothek.de/content/lizenzen/rv-fz</dc:rights>
<dc:source xml:lag="de">Sammlung FrauenLesbenRadio Funk'n Flug Bochum
(NL-FF)</dc:source>
<dcterms:extent xml:lag="de"/>
<dc:identifier type="URL">https://meta-katalog.eu/Record/229lieselle</dc:identifier>
<dc:identifier type="providerItemId">229lieselle</dc:identifier>
<dc:identifier type="providerId">oid1553513926447</dc:identifier>
<dc:identifier type="binary"
>https://manifests.meta-katalog.eu/api/thumbnail/229lieselle_1</dc:identifier>
<dc:source xml:lag="de">Let's talk about Funk'n Flug. In: Sammlung
FrauenLesbenRadio Funk'n Flug Bochum (NL-FF). </dc:source>
<dc:type type="document" xml:lang="de">Tonträger</dc:type>
<dcterms:created/>
<dc:format>image/jpg</dc:format>
<dc:publisher/>
<dc:publisher resource=""/>
<dc:subject type="subject" xml:lag="de">Feminismus</dc:subject>
<dc:subject type="subject" xml:lag="de">Feministische Kritik</dc:subject>
<dc:subject type="subject" xml:lag="de">Frauenbewegung</dc:subject>
<dc:subject type="subject" xml:lag="de">Lesben</dc:subject>
<dc:subject type="subject" xml:lag="de">Medien</dc:subject>
<dc:subject type="subject" xml:lag="de">Öffentlichkeit</dc:subject>
</record>
My code is as follows:
from lxml import etree
from loguru import logger

def parse_xml_content(xml_findbuch_in, input_type, input_file):
    namespaces = {"dc": "http://purl.org/dc/elements/1.1/"}
    #.xpath('//dc:identifier[@type="ID"]', namespaces=namespaces)
    madonna_ead_identifiers = open("ead_cfiles.txt")
    dc_records = xml_findbuch_in.findall("{http://www.openarchives.org/OAI/2.0/}record")
    for line in madonna_ead_identifiers:
        for dc_record in dc_records:
            dc_record_id_pre = dc_record.xpath("dc:identifier[@type='providerItemId']", namespaces=namespaces)
            for dc_record_id in dc_record_id_pre:
                if dc_record_id.attrib["type"] == "providerItemId":
                    dc_record_id_txt = dc_record_id.text
                    if dc_record_id_txt in line.strip():
                        if dc_record.getparent() is not None:
                            dc_record.getparent().remove(dc_record)
    return xml_findbuch_in
The list I use is a .txt file with 97 items. Part of it is shown here:
229lieselle
277lieselle
278lieselle
279lieselle
15lieselle
3lieselle
My aim is that the ID "229lieselle", which appears in the XML as <dc:identifier type="providerItemId">229lieselle</dc:identifier>, is found in the DC dataset and that its parent <record> element is removed from the DC dataset.
Unfortunately, the records that actually get removed have the following IDs:
229lieselle
29lieselle
9lieselle
lieselle
277lieselle
77lieselle
7lieselle
and so on... It should remove just 229lieselle. It should not break the ID down into fragments (229lieselle into 29lieselle/9lieselle/lieselle) and remove records with those IDs from the dataset as well.
Am I missing something?
Thanks for your help!
Dear Overflowers and @Parfait,
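The mistake is in the last part of the code, in the line
if dc_record_id_txt in line.strip():
The operator should be "==", not "in". On strings, "in" is a substring test, so a record ID such as "29lieselle" also matches the list entry "229lieselle" and its record is removed too. With "==" only exact matches are removed.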
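
For anyone who runs into the same thing, here is a minimal sketch of the exact-match version. It is not the original function: the names remove_duplicates and load_ids and the shortened sample XML are made up for illustration, and it assumes every record is a direct child of the root element, as in the file above. Reading the IDs into a set once also avoids looping over all records again for every line of the list.

```python
try:
    from lxml import etree
except ImportError:  # stdlib fallback; the calls used below exist in both
    import xml.etree.ElementTree as etree

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# On strings, "in" is a substring test, "==" / "!=" compare the whole string.
assert "29lieselle" in "229lieselle"
assert "29lieselle" != "229lieselle"


def load_ids(path):
    """Hypothetical helper: read one ID per line into a set, skipping blanks."""
    with open(path, encoding="utf-8") as fh:
        return {line.strip() for line in fh if line.strip()}


def remove_duplicates(root, ead_ids):
    """Remove each <record> whose providerItemId is exactly one of ead_ids."""
    removed = []
    # findall() returns a list, so removing records while looping is safe.
    for record in root.findall(OAI + "record"):
        ident = record.find("dc:identifier[@type='providerItemId']", namespaces=NS)
        item_id = (ident.text or "").strip() if ident is not None else ""
        if item_id in ead_ids:  # set membership compares whole strings
            root.remove(record)
            removed.append(item_id)
    return removed


# Shortened, made-up sample in the same shape as the DC file above.
sample = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
  <record><dc:identifier type="providerItemId">229lieselle</dc:identifier></record>
  <record><dc:identifier type="providerItemId">29lieselle</dc:identifier></record>
  <record><dc:identifier type="providerItemId">lieselle</dc:identifier></record>
</metadata>"""

root = etree.fromstring(sample)
ead_ids = {"229lieselle", "277lieselle"}  # in practice: load_ids("ead_cfiles.txt")
removed = remove_duplicates(root, ead_ids)
remaining = [el.text for el in root.iter(DC + "identifier")]
print(removed)    # ['229lieselle']
print(remaining)  # ['29lieselle', 'lieselle']
```

Note that "item_id in ead_ids" is membership in a set here, not a substring test: it is only true when the whole ID equals one of the entries.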