BS4 replace_with for replacing with new tag


I need to find certain words in an HTML file and replace them with links, so that when the file is displayed in a browser the links can be clicked as usual. Beautiful Soup automatically escapes the tag when I insert it. How can I avoid that behaviour?

Minimal Example

#!/usr/bin/env python3
from bs4 import BeautifulSoup
import re

html = \
'''
   Identify
'''
soup = BeautifulSoup(html,features="html.parser")
for txt in soup.findAll(text=True):
  if re.search('identi',txt,re.I) and txt.parent.name != 'a':
    newtext = re.sub('identify', '<a href="test.html"> test </a>', txt.lower())
    txt.replace_with(newtext)
print(soup)

Result:

&lt;a href="test.html"&gt; test &lt;/a&gt;

Intended result:

<a href="test.html"> test </a>

There are 2 best solutions below

Answer by Andrej Kesely (best answer)

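You can parse the new markup into its own BeautifulSoup object and pass that to .replace_with(). A plain string is inserted as text and therefore escaped on output, whereas a parsed soup is inserted as tags. For example: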

import re
from bs4 import BeautifulSoup


html = '''
   Other Identify Other
'''
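soup = BeautifulSoup(html, features="html.parser")
# find_all() returns a list, so replacing nodes inside the loop is safe
for txt in soup.find_all(string=True):
    # skip text that is already inside a link
    if re.search('identi', txt, re.I) and txt.parent.name != 'a':
        new_txt = re.sub(r'identi[^\s]*', '<a href="test.html">test</a>', txt, flags=re.I)
        # parse the markup so it is inserted as tags, not as escaped text
        txt.replace_with(BeautifulSoup(new_txt, 'html.parser'))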

print(soup)

Prints:

   Other <a href="test.html">test</a> Other
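
One caveat with re-parsing the text node: any literal "<" or "&" that was already in the text is parsed as markup too. A possible variation, sketched here under my own assumptions, builds the link with soup.new_tag() and splits the text node around each match instead. The helper name link_matches and the choice to use the matched word as the link text are illustrative, not part of the answer above.

```python
import re
from bs4 import BeautifulSoup, NavigableString


def link_matches(soup, pattern, href):
    """Wrap each regex match in a text node in an <a> tag (hypothetical helper)."""
    regex = re.compile(pattern, re.I)
    # find_all() returns a list, so the tree can be modified while looping.
    for node in soup.find_all(string=True):
        # Only touch plain text: skip comments, script/style content, etc.
        if type(node) is not NavigableString:
            continue
        # Leave text alone if it is already inside a link or has no match.
        if node.parent.name == 'a' or not regex.search(node):
            continue
        pieces, pos = [], 0
        for m in regex.finditer(node):
            if m.start() > pos:
                # Text before the match stays plain text.
                pieces.append(NavigableString(node[pos:m.start()]))
            link = soup.new_tag('a', href=href)
            link.string = m.group(0)  # assumption: link text is the matched word
            pieces.append(link)
            pos = m.end()
        if pos < len(node):
            # Text after the last match stays plain text.
            pieces.append(NavigableString(node[pos:]))
        # Put the pieces in front of the old node, then drop the old node.
        for piece in pieces:
            node.insert_before(piece)
        node.extract()


soup = BeautifulSoup('<p>Other Identify Other &lt;b&gt;</p>', 'html.parser')
link_matches(soup, r'identi\S*', 'test.html')
print(soup)
```

Here the matched word becomes a real link, while the literal "<b>" in the surrounding text stays escaped as &lt;b&gt; because it is never re-parsed.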

Answer by Just for fun

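You can use w3lib's replace_entities() function, which converts HTML entities in a string back into the characters they represent. Apply it to the serialized soup so that the escaped link becomes real markup again.

To install: pip install w3lib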

from bs4 import BeautifulSoup
import re
from w3lib.html import replace_entities
html = \
'''
   Identify
'''
soup = BeautifulSoup(html,features="html.parser")
for txt in soup.findAll(text=True):
  if re.search('identi',txt,re.I) and txt.parent.name != 'a':
    newtext = re.sub('identify', r'<a href="test.html"> test </a>', txt.lower())
    txt.replace_with(newtext)

print(replace_entities(str(soup)))  # str(soup) is needed because soup is a BeautifulSoup object, not a str

#Output
>>> <a href="test.html"> test </a>
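
Worth keeping in mind with this approach: unescaping the serialized document affects every entity in it, not only the inserted link. The sketch below illustrates that with the standard library's html.unescape as a stand-in for replace_entities (an assumption on my part, so that the example runs without w3lib); the sample HTML is made up.

```python
from html import unescape  # stdlib stand-in for w3lib's replace_entities

from bs4 import BeautifulSoup

# The page already contains an intentionally escaped "<b>" in its text.
soup = BeautifulSoup('<p>Use &lt;b&gt; for bold. Identify</p>', 'html.parser')

# Inserting markup as a plain string: Beautiful Soup treats it as text.
node = soup.p.string
node.replace_with(node.replace('Identify', '<a href="test.html">test</a>'))

serialized = str(soup)            # the link and the "<b>" are both escaped here
unescaped = unescape(serialized)  # both are turned into markup

print(serialized)
print(unescaped)
```

After unescaping, the link is clickable, but the "<b>" that the author had escaped on purpose also turns into a real tag. If the document may contain such text, inserting parsed markup or Tag objects into the tree avoids the problem.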