I want to match the titles of h1 to h6 in an HTML file, without returning the h tags themselves, using regular expressions.
Consider the following piece of an HTML file. I want to match "Welcome to my Homepage", "SQL", "RegEx", but not "This is not a valid HTML" (which is surrounded by a pair of unmatched tags).
<body>
<H1>Welcome to my Homepage</H1>
Content is divided into two sections:<br/>
<h2>SQL</h2>
Information about SQL.
<h2>RegEx</h2>
Information about Regular Expressions.
<h3>This is not a valid HTML</h4>
</body>
I use (?<=<[hH]([1-6])>).*?(?=<\/[hH]\1>) at regex101.com. However, it also matches the numbers 1 and 2 from the tags <H1> and <h2>.
How can I fix it?
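It doesn't really match them. The full match contains only the content. The numbers 1 and 2 come from the capturing group ([1-6]) inside your lookbehind, which the \1 backreference in the lookahead needs in order to pair the opening and closing tags. You can just ignore that group.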
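To see the difference between the full match and the group outside regex101, here is a minimal sketch in Python (my choice of language, the question does not name one). It assumes the sample markup from the question and Python's standard re module; the variable names are only for illustration. Group 0 is the whole match, and group 1 is the heading level captured in the lookbehind.

```python
import re

# Sample markup copied from the question.
html = """<body>
<H1>Welcome to my Homepage</H1>
Content is divided into two sections:<br/>
<h2>SQL</h2>
Information about SQL.
<h2>RegEx</h2>
Information about Regular Expressions.
<h3>This is not a valid HTML</h4>
</body>"""

# The pattern from the question, unchanged.
pattern = re.compile(r"(?<=<[hH]([1-6])>).*?(?=<\/[hH]\1>)")

# group(0) is the full match: only the text between the tags.
titles = [m.group(0) for m in pattern.finditer(html)]

# group(1) is the digit captured inside the lookbehind; it is needed
# for the \1 backreference but is not part of the matched text.
levels = [m.group(1) for m in pattern.finditer(html)]

print(titles)  # ['Welcome to my Homepage', 'SQL', 'RegEx']
print(levels)  # ['1', '2', '2']
```

One thing to watch for in Python specifically: re.findall returns the group contents instead of the full match when the pattern has a capturing group, so pattern.findall(html) would give ['1', '2', '2'] here. Reading group(0) from finditer, as above, avoids that.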