I'm trying to extract the contents of the class name. How to do I extract all the contents including the ones inside the 'em' tags and after the 'em' tags too? See picture below:
I tried the following and these were the results:
Trial 1:
driver = webdriver.Chrome(options=options)
sel = Selector(text = driver.page_source)
sel.xpath("//*[@class ='st']").extract()
Output 1:
>> <span class="st"><span class="f">Nov 26, 2018 - </span>First #<em>GDPR fine</em> awarded in Germany. 330,000 user data stolen. Usernames and passwords stored in plaintext. €20,000 <em>fine</em>. Why "so low"?</span>
Trial 2:
driver = webdriver.Chrome(options=options)
sel = Selector(text = driver.page_source)
sel.xpath("//*[@class ='st']/text()").extract()
Output 2:
>> First #
Ideally, the output I want to get is:
>> Nov 26, 2018 - First #GDPR fine awarded in Germany. 330,000 user data stolen. Usernames and passwords stored in plaintext. €20,000 fine. Why "so low"?
I eventually found a way to solve the problem though not an elegant one, would still welcome a more elegant solution.
I pulled in the contents of the class name using:
I then defined a function that stripped the html away from the text:
Looping through the contents in the list and stripping the html one at a time gave me the result I wanted: