How to keep UTF-8 encoding when parsing XML, changing an attribute value, and writing the file? (Python)


I'm writing a program in Python. My goal is to:

  • read the input XML file one line at a time
  • for each line, find the "CH" attribute
  • change the attribute value: translate it from French to Portuguese
  • write the changed line to the output XML file
  • since I handle text in various languages, keep UTF-8 encoding so that accented and other special characters appear literally in the output file

My code:

import os
import xml.etree.ElementTree as ET
from googletrans import Translator



with open("input file.txt", "r", encoding='utf-8') as input_file:
    with open("output file.txt", "w", encoding='utf-8') as output_file:
        # Read the input file one line at a time
        for ligne in input_file:
            # Parse the line
            root = ET.fromstring(ligne)

            # Change the CH attribute value: translate from French (fr) to Portuguese (pt)
            current_text = root.get("CH")
            translator = Translator()
            translated_text = translator.translate(dest="pt", src="fr", text=current_text)
            root.attrib["CH"] = translated_text.text

            # Convert bytes to string
            decoded_string = ET.tostring(root).decode("utf-8")

            # Write to the output file
            output_file.write(decoded_string)

The problem is that the output file contains escaped character references instead of the actual characters. For example, with the input file below:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="vive l'empereur"/>
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I get this result:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />                                          
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

instead of the expected result:


<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" /> 
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I have checked by printing that translated_text.text is correct ("A vitória é nossa"), but the value of decoded_string is wrong even though I specified UTF-8: <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />

I do not understand why I get this result. Could you please help me?

Answer by Mark Tolonen

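ET.tostring() defaults to encoding="us-ascii", so every non-ASCII character is written as a numeric character reference such as &#243;. That is still valid XML and parses back to the same text, but it is not the literal UTF-8 output you want. Instead of processing the file line by line, parse the whole tree and iterate over the ITEXT nodes. The following demonstrates how to change and write the text. Write the modified tree with the .write() method, requesting an XML declaration and declaring the encoding: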

# pip install googletrans==4.0.0rc1
# Note 3.0.0 didn't work
import xml.etree.ElementTree as ET
import googletrans as gt

tree = ET.parse('input file.txt')
translator = gt.Translator()
for itext in tree.iterfind('*/ITEXT'):
    current_text = itext.get('CH')
    itext.attrib['CH'] = translator.translate(dest="pt", src="fr", text=current_text).text
tree.write('output file.txt', xml_declaration=True, encoding='UTF-8')

output file.txt

<?xml version='1.0' encoding='UTF-8'?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0" />
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
                <para ALIGN="1" LINESP="10" />
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
                <trail ALIGN="1" LINESP="10" />
        </StoryText>
</SCRIBUSUTF8NEW>
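
For the line-by-line approach in the question, the escaping can be reproduced without any translation call. The sketch below is only an illustration: it assumes the translated string is already available (the value is hard-coded instead of calling googletrans) and uses a shortened ITEXT element. It shows that ET.tostring() with its default encoding emits numeric character references, while encoding="unicode" returns a str with the accented characters intact, so no .decode() step is needed:

```python
import xml.etree.ElementTree as ET

# A shortened ITEXT element; the translated value is hard-coded here
# (assumption: it stands in for the result of translator.translate(...).text).
elem = ET.fromstring('<ITEXT KERN="0" CH="placeholder"/>')
elem.set("CH", "A vitória é nossa")

# Default encoding is "us-ascii": non-ASCII characters are written
# as numeric character references (&#243; for ó, &#233; for é).
ascii_bytes = ET.tostring(elem)
print(ascii_bytes.decode("utf-8"))

# encoding="unicode" returns a str and leaves the characters as they are.
text = ET.tostring(elem, encoding="unicode")
print(text)

# encoding="utf-8" returns bytes that contain the UTF-8 encoded characters.
utf8_bytes = ET.tostring(elem, encoding="utf-8")
print(utf8_bytes.decode("utf-8"))

# Both serializations parse back to the same attribute value.
same_value = ET.fromstring(ascii_bytes).get("CH") == ET.fromstring(text).get("CH")
print(same_value)
```

Both forms are equivalent XML, since a conforming parser reads &#243; back as ó. The choice mainly affects how readable the file is in a text editor.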