I have a number of p tags and table tags that I am retrieving, in document order, into a list called content_items. I am trying to join consecutive p tags and, once a table is found, append what I have collected so far and then add the table as a separate item in the list. I am able to collect the tables, but for some reason I am unable to collect and join the p tags that come before each table. Code so far:
    import json
    from bs4 import BeautifulSoup, NavigableString
    import html2text
    converter = html2text.HTML2Text()
    soup = BeautifulSoup(data3, 'html.parser')
    content_items = []  # List to store the content items
    for tag in soup.descendants:
        content_dict = {'Title': "35.23.060 - DR Zone Standards", 'Content': ''}
        if tag.name == "p":
            content_dict['Content'] += converter.handle(str(tag))
        elif tag.name == "table":
            if content_dict['Content']:
                content_items.append(content_dict)
            content_dict['Content'] = converter.handle(str(tag))
            content_items.append(content_dict)
    # Print the extracted data
    print(json.dumps(content_items, indent=4))
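The problem lies in the placement of the content_dict initialization inside the loop. With your current code, a new dictionary is created on every iteration, so the paragraph content collected on earlier iterations is discarded before a table is ever reached. Move the initialization outside the loop, before the for statement, so the dictionary persists across iterations, and create a fresh dictionary each time you append one to content_items. The second part matters because appending stores a reference, not a copy: if you append content_dict and then assign the table text to content_dict['Content'], the paragraph item already in the list changes too. Two further things to check: paragraphs that come after the last table need one final append after the loop ends, and soup.descendants also visits any p tags nested inside a table, so those would be collected a second time.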
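To show the accumulate-and-flush pattern on its own, here is a minimal sketch that uses only the standard library. It assumes the tags have already been reduced to (tag name, converted text) pairs. The helper name group_blocks and the sample data are hypothetical and do not come from your code or from BeautifulSoup.

```python
import json


def group_blocks(blocks, title):
    """Join consecutive 'p' texts; emit each 'table' as its own item."""
    items = []
    pending = []  # paragraph texts collected since the last table
    for name, text in blocks:
        if name == "p":
            pending.append(text)
        elif name == "table":
            if pending:
                # Flush the collected paragraphs as one item, in a new dict
                items.append({"Title": title, "Content": "".join(pending)})
                pending = []
            # The table gets its own new dict, never a reused one
            items.append({"Title": title, "Content": text})
    if pending:
        # Paragraphs after the last table would otherwise be lost
        items.append({"Title": title, "Content": "".join(pending)})
    return items


# Hypothetical sample standing in for (tag.name, converted text) pairs
sample = [
    ("p", "Intro.\n\n"),
    ("p", "More text.\n\n"),
    ("table", "| a | b |\n"),
    ("p", "Closing note.\n\n"),
]
content_items = group_blocks(sample, "35.23.060 - DR Zone Standards")
print(json.dumps(content_items, indent=4))
```

In your script the pairs could come from something like (tag.name, converter.handle(str(tag))) for each tag in soup.find_all(['p', 'table']). If your tables contain p tags, you would probably want to skip any p for which tag.find_parent('table') is not None. Building a new dict at each append is what keeps the list items independent of one another.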