How to keep UTF8 encoding during xml parse, attribute value change and file writing? (python)

Question

36 Views Asked by JB B At 03 February 2024 at 20:04

I'm writing a program in Python. My goal is to:

read the input XML file one line at a time
for each line, find the "CH" attribute
change the attribute value: translate it from French to Portuguese
write the changed line to the output XML file

As I manipulate text in various languages, I'd like to keep UTF-8 encoding so that special characters display correctly in the output file.

My code:

import os
import xml.etree.ElementTree as ET
from googletrans import Translator

with open("input file.txt", "r", encoding='utf-8') as input_file:
    with open("output file.txt", "w", encoding='utf-8') as output_file:
        # Read the input file one line at a time
        for ligne in input_file:
            # Parse the line
            root = ET.fromstring(ligne)

            # Change the CH attribute value: translate from French (fr) to Portuguese (pt)
            current_text = root.get("CH")
            translator = Translator()
            translated_text = translator.translate(dest="pt", src="fr", text=current_text)
            root.attrib["CH"] = translated_text.text

            # Convert bytes to string
            decoded_string = ET.tostring(root).decode("utf-8")

            # Write to the output file
            output_file.write(decoded_string)

The problem is that the output file contains numeric character references (such as &#243;) instead of the accented characters. For example, with the input file below:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="vive l'empereur"/>
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I get this result:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />                                          
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

instead of the expected result:

<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" /> 
                <para ALIGN="1" LINESP="10"/>
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />                 
                <trail ALIGN="1" LINESP="10"/>
        </StoryText>
</SCRIBUSUTF8NEW>

I have checked with print statements that translated_text.text is well formed ("A vitória é nossa"), but the decoded_string value is wrong despite the UTF-8 encoding specification: <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />

I do not understand why I get this result. Could you please help me?

**Mark Tolonen** · Answer 1 · 2024-02-03T23:59:37.527000

Parse the whole tree and iterate the ITEXT nodes. The following demonstrates how to change and write the text. Write the modified tree with the .write() method using an XML declaration and declaring the encoding:

# pip install googletrans==4.0.0rc1
# Note 3.0.0 didn't work
import xml.etree.ElementTree as ET
import googletrans as gt

tree = ET.parse('input file.txt')
translator = gt.Translator()
for itext in tree.iterfind('*/ITEXT'):
    current_text = itext.get('CH')
    itext.attrib['CH'] = translator.translate(dest="pt", src="fr", text=current_text).text
tree.write('output file.txt', xml_declaration=True, encoding='UTF-8')

output file.txt

<?xml version='1.0' encoding='UTF-8'?>
<SCRIBUSUTF8NEW Version="1.5.5">
        <StoryText>
                <DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0" />
                <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
                <para ALIGN="1" LINESP="10" />
                <ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
                <trail ALIGN="1" LINESP="10" />
        </StoryText>
</SCRIBUSUTF8NEW>
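
As background on why the question's code produced &#243; and &#233;: ET.tostring() uses encoding="us-ascii" unless told otherwise, so any non-ASCII character is written as a numeric character reference, and decoding those bytes afterwards does not turn them back. The sketch below is only an illustration under assumptions: it skips googletrans and hard-codes the translated string, and the single ITEXT element is a stand-in for one line of the real file.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in element; the translation is hard-coded (assumption: no googletrans call here).
elem = ET.fromstring('<ITEXT KERN="0" CH="la victoire est à nous"/>')
elem.set("CH", "A vitória é nossa")

# Default encoding is "us-ascii": non-ASCII characters become character references.
ascii_bytes = ET.tostring(elem)

# encoding="unicode" returns a str and leaves the characters as they are.
unicode_text = ET.tostring(elem, encoding="unicode")

# Writing through ElementTree.write() with an explicit encoding also keeps them.
buf = io.BytesIO()
ET.ElementTree(elem).write(buf, encoding="UTF-8", xml_declaration=True)
written = buf.getvalue().decode("utf-8")

print(ascii_bytes)
print(unicode_text)
print(written)
```

If the line-by-line approach has to stay, passing encoding="unicode" to ET.tostring() and writing the resulting str to a file opened with encoding='utf-8' should be enough. Parsing the whole tree, as above, is still simpler because lines such as the XML declaration or a lone opening tag are not complete elements on their own.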
