I need to extract information about a particular organism from a webpage (an NCBI Assembly search result) using BeautifulSoup and/or Selenium, but I'm running into difficulties.
I tried this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Define the search term
search_term = "Streptomyces anthocyanicus JCM 5058"

# Open a Chrome browser
driver = webdriver.Chrome()

# Construct the search URL for assembly
search_url = f"https://www.ncbi.nlm.nih.gov/assembly/?term={search_term.replace(' ', '+')}"

# Navigate to the search URL
driver.get(search_url)

# Find elements containing the text "JCM 5058"
elements = driver.find_elements(By.XPATH, "//*[contains(text(), 'JCM 5058')]")

if elements:
    print("Text 'JCM 5058' found on the webpage.")
    # Loop through elements and extract text
    text_to_print = ""
    for element in elements:
        text_to_print += element.text + "\n"  # Add newline for readability
    # Print the extracted text
    print(text_to_print)
else:
    print("Text 'JCM 5058' not found on the webpage.")
and I got this output:
Text 'JCM 5058' found on the webpage.
JCM 5058
("Streptomyces anthocyanicus"[Organism] AND ("Streptomyces anthocyanicus"[Organism] OR JCM 5058[All Fields])) AND (latest[filter] AND all[filter] NOT anomalous[filter])
Streptomyces anthocyanicus JCM 5058 AND (latest[filter] AND all[f... (6)
but the matched section looks like this on the web page:
ASM1465115v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]
I want to extract and print all of this information, either as-is or as a table.
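Once the text of that block is available as a single string (for example from the `.text` of whichever element wraps it; the right locator is not shown in this post and would have to be found by inspecting the page), one possible way to turn it into a table is to split each line on the first ": ". This is only a minimal sketch: `parse_summary`, `format_table`, and the "Assembly" key used for the unlabeled first line are names made up for illustration, and `SAMPLE` is simply the block copied from above.

```python
# Sample text copied from the matched section shown above.
SAMPLE = """ASM1465115v1
Organism: Streptomyces anthocyanicus (high G+C Gram-positive bacteria)
Infraspecific name: Strain: JCM 5058
Submitter: WFCC-MIRCEN World Data Centre for Microorganisms (WDCM)
Date: 2020/09/12
Assembly level: Scaffold
Genome representation: full
Relation to type material: assembly from type material
GenBank assembly accession: GCA_014651155.1 (latest)
RefSeq assembly accession: GCF_014651155.1 (latest)
IDs: 8121141 [UID] 22194358 [GenBank] 22446388 [RefSeq]"""


def parse_summary(text):
    """Turn 'Label: value' lines into a dict, preserving their order."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first ": " only, so "Strain: JCM 5058" stays in the value.
        label, sep, value = line.partition(": ")
        if sep:
            record[label] = value
        else:
            # Hypothetical key for a line without a label (the assembly name).
            record.setdefault("Assembly", line)
    return record


def format_table(record):
    """Render the dict as a simple two-column text table."""
    width = max((len(label) for label in record), default=0)
    return "\n".join(f"{label.ljust(width)} | {value}" for label, value in record.items())


record = parse_summary(SAMPLE)
print(format_table(record))
```

The same dictionary could also be passed to something like `pandas.DataFrame` if a real table object is needed; plain string formatting is used here only to keep the sketch free of extra dependencies.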
I found an answer while working around the problem, but I don't know whether it is the correct approach or not,
will print this
another easy way is
print