I'm trying to match Hearst patterns with a Java regex. This is my regex:
<np>(\w+)<\/np> such as (?:(?:, | or | and )?<np>(\w+)<\/np>)*
If I have an annotated sentence like:
I have a <np>car</np> such as <np>BMW</np>, <np>Audi</np> or <np>Mercedes</np> and this can drive fast.
I want to get the groups:
1. car
2. [BMW, Audi, Mercedes]
UPDATE: Here is my current Java code:
Pattern pattern = Pattern.compile("<np>(\\w+)<\\/np> such as (?:(?:, | or | and )?<np>(\\w+)<\\/np>)*");
Matcher matcher = pattern.matcher("I have a <np>car</np> such as <np>BMW</np>, <np>Audi</np> or <np>Mercedes</np> and this can drive fast.");
while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
}
But the 2nd group only contains Mercedes. How can I get all the matches for the 2nd group (maybe as an array)? Is this possible with Java's Pattern and Matcher? And if yes, what is my mistake?
If you want to be sure to get contiguous results, you can use the \G anchor, which forces each match to start exactly where the previous match ended.
Note: the \G anchor matches at the end of the previous match or at the start of the string. To avoid matching at the start of the string, you can add the negative lookbehind (?<!^) after the \G.
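As for why the original attempt prints only Mercedes: a capturing group inside a repeated group keeps just the capture from its last iteration, so a single match cannot return all three nouns. Here is a minimal sketch of the \G approach instead, where each hyponym is its own match. The class name HearstSketch, the method extract, and the exact alternation are my own assumptions for illustration, not the only way to write it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
public class HearstSketch {
    // Branch 1 captures the hypernym in group 1 ("<np>car</np> such as").
    // Branch 2 captures one hyponym in group 2 and, because of \G(?<!^),
    // can only match directly after the previous match, never at the string start.
    private static final Pattern HEARST = Pattern.compile(
        "<np>(\\w+)</np> such as"
        + "|\\G(?<!^)(?:,| or| and)? <np>(\\w+)</np>");

    // Returns hypernym -> list of hyponyms, in order of appearance.
    static Map<String, List<String>> extract(String sentence) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        Matcher m = HEARST.matcher(sentence);
        while (m.find()) {
            if (m.group(1) != null) {
                // Branch 1 matched: start (or reuse) the list for this hypernym.
                current = result.computeIfAbsent(m.group(1), k -> new ArrayList<>());
            } else if (current != null) {
                // Branch 2 matched: a hyponym contiguous with the previous match.
                current.add(m.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "I have a <np>car</np> such as <np>BMW</np>, "
                 + "<np>Audi</np> or <np>Mercedes</np> and this can drive fast.";
        System.out.println(extract(s)); // prints {car=[BMW, Audi, Mercedes]}
    }
}
```

In the loop, a non-null group(1) means the hypernym branch matched; otherwise group(2) holds the next hyponym. The chain stops at " and this" because no <np> follows, and \G prevents the second branch from matching anywhere else later in the sentence.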