XML Python output not as expected (EAD & DC)

Hello dear Overflowers,

I work mainly with Python, processing XML files, mostly EAD but in this case DC. I have to compare two datasets: (1) EAD and (2) DC. The goal is to remove all duplicates from the DC dataset based on a list of EAD IDs.

The DC-file looks like this and has a relatively flat hierarchy:

NOTE: the main focus is on the <dc:identifier type="providerItemId"> element.

<?xml version="1.0" encoding="UTF-8"?><metadata xmlns:europeana="http://www.europeana.eu/schemas/ese/"
xmlns:dcterms="http://purl.org/dc/terms/" xmlns="http://www.openarchives.org/OAI/2.0/"
xmlns:doc="http://www.lyncode.com/xoai" xmlns:dc="http://purl.org/dc/elements/1.1/">
<record>
    <dc:title xml:lang="">Vorlass Autonomes Frauen- Lesbenreferat der RUB (VL-FLRub)</dc:title>
    <dc:identifier type="URL"
        >https://meta-katalog.eu/Record/VorlassAutonomesFrauen-LesbenreferatderRUBVL-FLRublieselle</dc:identifier>
    <dc:identifier type="providerItemId"
        >VorlassAutonomesFrauen-LesbenreferatderRUBVL-FLRublieselle</dc:identifier>
    <dc:identifier type="providerId">oid1553513926447</dc:identifier>
    <dc:source xml:lag="de">Vorlass Autonomes Frauen- Lesbenreferat der RUB (VL-FLRub). </dc:source>
    <dc:type type="document" xml:lang="de">Archivgut</dc:type>
</record>
<record>
    <dc:title xml:lang="">Let's talk about Funk'n Flug</dc:title>
    <dc:description type="object" xml:lag="de">Konzept und Stimmen: Charlotte Kaiser, Rebecca
        Schröder, Katja Teichmann Skript: Rebecca Schröder, Katja Teichmann Schnitt: Rebecca
        Schröder Sounddesign: Isabel Hintzen</dc:description>
    <dc:rights type="binary"
        >https://www.deutsche-digitale-bibliothek.de/content/lizenzen/rv-fz</dc:rights>
    <dc:source xml:lag="de">Sammlung FrauenLesbenRadio Funk&apos;n Flug Bochum
        (NL-FF)</dc:source>
    <dcterms:extent xml:lag="de"/>
    <dc:identifier type="URL">https://meta-katalog.eu/Record/229lieselle</dc:identifier>
    <dc:identifier type="providerItemId">229lieselle</dc:identifier>
    <dc:identifier type="providerId">oid1553513926447</dc:identifier>
    <dc:identifier type="binary"
        >https://manifests.meta-katalog.eu/api/thumbnail/229lieselle_1</dc:identifier>
    <dc:source xml:lag="de">Let&apos;s talk about Funk&apos;n Flug. In: Sammlung
        FrauenLesbenRadio Funk&apos;n Flug Bochum (NL-FF). </dc:source>
    <dc:type type="document" xml:lang="de">Tonträger</dc:type>
    <dcterms:created/>
    <dc:format>image/jpg</dc:format>
    <dc:publisher/>
    <dc:publisher resource=""/>
    <dc:subject type="subject" xml:lag="de">Feminismus</dc:subject>
    <dc:subject type="subject" xml:lag="de">Feministische Kritik</dc:subject>
    <dc:subject type="subject" xml:lag="de">Frauenbewegung</dc:subject>
    <dc:subject type="subject" xml:lag="de">Lesben</dc:subject>
    <dc:subject type="subject" xml:lag="de">Medien</dc:subject>
    <dc:subject type="subject" xml:lag="de">Öffentlichkeit</dc:subject>
</record>

My code is as follows:

from lxml import etree
from loguru import logger

def parse_xml_content(xml_findbuch_in, input_type, input_file):
    namespaces = {"dc": "http://purl.org/dc/elements/1.1/"}

    #.xpath('//dc:identifier[@type="ID"]', namespaces=namespaces)

    madonna_ead_identifiers = open("ead_cfiles.txt")
    dc_records = xml_findbuch_in.findall("{http://www.openarchives.org/OAI/2.0/}record")

    for line in madonna_ead_identifiers:
        for dc_record in dc_records:
            dc_record_id_pre = dc_record.xpath("dc:identifier[@type='providerItemId']", namespaces=namespaces)
            for dc_record_id in dc_record_id_pre:
                if dc_record_id.attrib["type"] == "providerItemId":
                    dc_record_id_txt = dc_record_id.text

                    if dc_record_id_txt in line.strip():
                        if dc_record.getparent() is not None:
                            dc_record.getparent().remove(dc_record)

    return xml_findbuch_in

The list I use contains 97 items in a .txt file. Part of the list is shown here:

229lieselle
277lieselle
278lieselle
279lieselle
15lieselle
3lieselle

So my aim is that the ID "229lieselle", which appears in the XML as <dc:identifier type="providerItemId">229lieselle</dc:identifier>, is found in the DC dataset and its parent <record> element is removed from the DC dataset.

Unfortunately, the elements that get removed are the following:

229lieselle
29lieselle
9lieselle
lieselle
277lieselle
77lieselle
7lieselle

and so on... It should remove just 229lieselle. It should not break the ID into its fragments (229lieselle into 29lieselle/9lieselle/lieselle) and remove those from the dataset as well.

Am I missing something?

Thanks for your help!

There is 1 best solution below

DokiDok On

Dear Overflowers and @Parfait,

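The mistake lies in the last chunk of code:

if dc_record_id_txt in line.strip():
    if dc_record.getparent() is not None:
        dc_record.getparent().remove(dc_record)

In the line "if dc_record_id_txt in line.strip():" the operator should be "==", not "in". Between two strings, "in" is a substring test, so "29lieselle" in "229lieselle" is true, and every record whose ID happens to be a fragment of a list entry gets removed as well.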
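
For anyone who wants to try the exact-match behaviour in isolation, here is a minimal sketch. It is not the original function: it assumes the records are direct children of the root element, it swaps lxml for the standard library's xml.etree.ElementTree so it runs without extra packages, and the names remove_duplicate_records and SAMPLE are made up for illustration. The IDs are loaded into a set, and "in" against a set checks for an equal member, unlike "in" against a string.

```python
import xml.etree.ElementTree as ET

OAI_NS = "http://www.openarchives.org/OAI/2.0/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def remove_duplicate_records(root, ead_ids):
    """Remove each <record> whose providerItemId exactly equals one of ead_ids.

    Hypothetical helper: assumes <record> elements are direct children of root.
    """
    # Strip newlines and drop empty lines, then use a set for exact lookups.
    wanted = {line.strip() for line in ead_ids if line.strip()}
    removed = []
    # Copy the list first so removing children does not disturb the iteration.
    for record in list(root.findall(f"{{{OAI_NS}}}record")):
        path = f"{{{DC_NS}}}identifier[@type='providerItemId']"
        for ident in record.findall(path):
            text = (ident.text or "").strip()
            if text in wanted:  # set membership: whole-string equality
                root.remove(record)
                removed.append(text)
                break
    return removed


# Tiny made-up sample: only the first record is on the ID list.
SAMPLE = """<metadata xmlns="http://www.openarchives.org/OAI/2.0/"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
<record><dc:identifier type="providerItemId">229lieselle</dc:identifier></record>
<record><dc:identifier type="providerItemId">29lieselle</dc:identifier></record>
<record><dc:identifier type="providerItemId">9lieselle</dc:identifier></record>
</metadata>"""

root = ET.fromstring(SAMPLE)
removed = remove_duplicate_records(root, ["229lieselle\n", "277lieselle\n"])
remaining = [el.text for el in root.iter(f"{{{DC_NS}}}identifier")]
print(removed)    # ['229lieselle']
print(remaining)  # ['29lieselle', '9lieselle']
```

With lxml the same idea should carry over: keep the xpath call and getparent().remove(...) from the question, and only change the comparison. Reading the 97 IDs into a set once also avoids looping over the file for every record.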