Sanitizing XML text in Python (&amp)


I am writing a Python script that imports work items from IBM RTC and exports them to Microsoft ADS. One issue I found is that some strings from the RTC XML data are imported with stray character entities such as &amp:

9.	Customize the rules for Feature work item
9. Customize the rules for Feature work item

1.	Send out the on-boarding form to Capsule Tech to understand the features/Tools/Customization used by them
1. Send out the on-boarding form to Capsule Tech to understand the features/Tools/Customization used by them

Speed Up RTC->ADS queries
Speed Up RTC->ADS queries

I've tried using the following code to sanitize and normalize the text:

from bs4 import BeautifulSoup
from html import unescape

# rtc_title holds the raw title string read from the RTC XML data
soup = BeautifulSoup(unescape(rtc_title), 'lxml')
ads_title = soup.text

But it is replacing the characters with tabs most of the time, which is incorrect:

1.\tSend out the on-boarding form to Capsule Tech to understand the features/Tools/Customization used by them

Is there a better way to parse and normalize these strings taken from IBM RTC XML data? Thanks
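For reference, here is a minimal sketch of the behaviour I am after, assuming the raw RTC values contain entities such as `&#9;` (tab) and `&gt;`, possibly escaped more than once. The function name `normalize_title` and the sample inputs are hypothetical, not taken from the RTC data. It decodes entities with `html.unescape` and then collapses any run of whitespace, tabs included, into a single space:

```python
from html import unescape


def normalize_title(raw: str) -> str:
    """Decode character entities, then collapse whitespace runs to one space."""
    text = raw
    # Decode a few times in case the text was escaped more than once
    # (e.g. "&amp;gt;"). Stop as soon as decoding changes nothing.
    for _ in range(3):
        decoded = unescape(text)
        if decoded == text:
            break
        text = decoded
    # str.split() with no arguments splits on any whitespace, including
    # the tab produced by decoding "&#9;", and drops leading/trailing space.
    return " ".join(text.split())


# Hypothetical sample inputs, for illustration only.
print(normalize_title("1.&#9;Send out the on-boarding form"))
print(normalize_title("Speed Up RTC-&gt;ADS queries"))
```

One caveat with this approach: decoding repeatedly would also decode text that was meant to stay as a literal entity, so the loop may need to be removed if the titles can legitimately contain strings like `&amp;`.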
