How to extract every string between two substrings in a paragraph?


After web scraping, I get the following:

[<p>xxx<p>, <p>1.apple</p>, <p>aaa</p>, <p>xxxxx</p>, <p>xxxxx</p>, <p>2.orange</p>, <p>aaa</p>, <p>xxxxx</p>,<p>3.banana</p>, <p>aaa</p>, <p>xxxxx</p>]

In this list, the "xxxx" entries are useless values. The values I want follow a pattern: each one sits between two substrings. Substring1 = "<p>1" / "<p>2" / "<p>3"; Substring2 = "</p>, <p>aaa".

Assume this pattern repeats hundreds of times. How do I get the result in Python? Many thanks!

My target result is :

  1. apple

  2. orange

  3. banana

I have tried using split and slicing with [sub1:sub2], but neither works.


There is 1 best solution below

Swifty On

From what I infer from your question (assuming the words you're looking for always follow a marker of the form <p>number. ), a regex would do the job:

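import re
# html_string is assumed to hold the scraped markup as a single string
# The dot is escaped so it matches a literal "." after the number
print(re.findall(r'<p>\d+\.([^<]+)', html_string))

# ['apple', 'orange', 'banana']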
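If you also want the numbers, as in the target result from the question, one option is to capture the number and the word separately and join them afterwards. The following is only a sketch under a few assumptions: `html_string` is a sample built from the data in the question (with the first tag closed as `</p>`), and `ITEM_PATTERN` and `extract_items` are names made up for this illustration.

```python
import re

# Assumed sample input: the scraped tags from the question as one string.
html_string = (
    "<p>xxx</p>, <p>1.apple</p>, <p>aaa</p>, <p>xxxxx</p>, <p>xxxxx</p>, "
    "<p>2.orange</p>, <p>aaa</p>, <p>xxxxx</p>,<p>3.banana</p>, <p>aaa</p>, <p>xxxxx</p>"
)

# Group 1 is the number, group 2 is the text after the literal dot.
# Only <p> tags whose content starts with "number." are matched.
ITEM_PATTERN = re.compile(r"<p>\s*(\d+)\.\s*([^<]+?)\s*</p>")


def extract_items(markup):
    """Return strings like '1. apple' for every numbered <p> entry."""
    return [f"{num}. {text}" for num, text in ITEM_PATTERN.findall(markup)]


items = extract_items(html_string)
print("\n".join(items))
# 1. apple
# 2. orange
# 3. banana
```

If what you have is a list of tag objects rather than a string (for example from BeautifulSoup, which the question does not confirm), converting the list with str() first should give a string this pattern can search.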