I'm writing a program in Python. My goal is to:
- read the input XML file one line at a time
- for each line, find the "CH" attribute
- change the attribute value: translate it from French to Portuguese
- write the changed line to the output XML file
- as I manipulate text in various languages, I'd like to keep UTF-8 encoding so that foreign special characters display correctly in the output file
My code:
import os
import xml.etree.ElementTree as ET
from googletrans import Translator
with open("input file.txt", "r", encoding='utf-8') as input_file:
    with open("output file.txt", "w", encoding='utf-8') as output_file:
        # Read input file
        for ligne in input_file:
            # Parse the line
            root = ET.fromstring(ligne)
            # Change CH attribute value: translate from French (fr) to Portuguese (pt)
            current_text = root.get("CH")
            translator = Translator()
            translated_text = translator.translate(dest="pt", src="fr", text=current_text)
            root.attrib["CH"] = translated_text.text
            # Convert bytes to string
            decoded_string = ET.tostring(root).decode("utf-8")
            # Write output file
            output_file.write(decoded_string)
The problem is that in the output file I get non-encoded characters. For example, with the input file below:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="la victoire est à nous"/>
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="vive l'empereur"/>
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
I get this result:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
instead of the expected result:
<?xml version="1.0" encoding="UTF-8"?>
<SCRIBUSUTF8NEW Version="1.5.5">
<StoryText>
<DefaultStyle ALIGN="1" SCALEV="100" BASEO="0" KERN="0"/>
<ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vitória é nossa" />
<para ALIGN="1" LINESP="10"/>
<ITEXT FONT="Times New Roman Bold" SCALEH="100" SCALEV="100" BASEO="0" KERN="0" CH="Vida longa ao" />
<trail ALIGN="1" LINESP="10"/>
</StoryText>
</SCRIBUSUTF8NEW>
I have checked by printing it that translated_text.text is well formed ("A vitória é nossa"), but the decoded_string value is wrong despite the UTF-8 encoding specification: <ITEXT FONT="Times New Roman Bold" BASEO="0" KERN="0" CH="A vit&#243;ria &#233; nossa" />
I do not understand why I get this result. Could you please help me?
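ET.tostring() serializes to US-ASCII when no encoding is given, so every non-ASCII character is written as a numeric character reference (&#243; for ó); decoding those bytes as UTF-8 does not undo that. Instead of parsing line by line, parse the whole tree and iterate over the ITEXT nodes, changing each CH attribute.
Then write the modified tree to output file.txt with the .write() method, using an XML declaration and declaring the encoding (xml_declaration=True, encoding="utf-8").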
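A minimal sketch of that approach, not necessarily the exact code originally posted here. The names translate_ch_attributes and serialize_both_ways are hypothetical helpers introduced for illustration, and the translation step is passed in as a plain callable so the example runs without googletrans or network access:

```python
import xml.etree.ElementTree as ET


def translate_ch_attributes(src_path, dst_path, translate):
    """Pass every ITEXT "CH" attribute through translate() and save as UTF-8.

    `translate` is a hypothetical callable (str -> str) standing in for the
    googletrans call used in the question.
    """
    tree = ET.parse(src_path)
    # iter() walks the whole tree, so nested ITEXT elements are found too.
    for itext in tree.getroot().iter("ITEXT"):
        current_text = itext.get("CH")
        if current_text is not None:
            itext.set("CH", translate(current_text))
    # Declaring the encoding keeps characters such as "ó" as UTF-8 bytes
    # instead of numeric character references.
    tree.write(dst_path, encoding="utf-8", xml_declaration=True)


def serialize_both_ways(text):
    """Return the default (US-ASCII) and the Unicode serialization of one element."""
    elem = ET.Element("ITEXT", {"CH": text})
    # Default: bytes in US-ASCII, non-ASCII characters become &#NNN; references.
    ascii_form = ET.tostring(elem).decode("utf-8")
    # encoding="unicode" returns a str with the characters left untouched.
    unicode_form = ET.tostring(elem, encoding="unicode")
    return ascii_form, unicode_form
```

With the setup from the question, this could be called as translate_ch_attributes("input file.txt", "output file.txt", lambda s: translator.translate(s, src="fr", dest="pt").text), assuming a googletrans version whose translate() is synchronous, as in the question's code. If you prefer to keep the line-by-line approach, ET.tostring(root, encoding="unicode") returns a str directly and makes the .decode("utf-8") step unnecessary.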