Identify substrings between delimiters with regex


Background

I want to identify the text between a starting and an ending delimiter, but I also want to capture the other text that lies outside the delimited sections. You can see this example on regexr.com.

Current solution

Input

Text:

"aaa_s_123abc_e_bbbccc_s_456def_e_bbbddd_s_7890_e_wwwddd"

Pattern:

"/(.*?)(_s_.*?_e_)(.*?)/"

Result

0: aaa_s_123abc_e_
1: aaa
2: _s_123abc_e_
3: 
------------------
0: bbbccc_s_456def_e_
1: bbbccc
2: _s_456def_e_
3: 
------------------
0: bbbddd_s_7890_e_
1: bbbddd
2: _s_7890_e_
3: 
------------------

Problem

I am missing the string "wwwddd" at the end.

Question

Why is group 3 empty? How do I get the text after the ending delimiter "_e_"?
Any idea how to update the pattern?

Answer by trincot

Why is group 3 empty?

Because the empty string is a match for it, and there is nothing in the pattern after that group, so the empty string suffices for the regex to succeed. Be aware that the trailing ? makes .* lazy, meaning it matches as few characters as possible. If you dropped that last ?, making the .* greedy, the third group would contain all remaining characters on that line. That would not be what you want either, because it captures too much, including all the other _s_ and _e_ delimiters.
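
To make this concrete, here is a minimal sketch in Python (chosen only for illustration, since the question does not name a language) that runs the original pattern and the greedy variant with the standard re module. The variable names are made up for the example.

```python
import re

text = "aaa_s_123abc_e_bbbccc_s_456def_e_bbbddd_s_7890_e_wwwddd"

# Original pattern: the final lazy group has nothing after it,
# so it is satisfied by the empty string in every match.
lazy_matches = re.findall(r"(.*?)(_s_.*?_e_)(.*?)", text)
for groups in lazy_matches:
    print(groups)

# Greedy variant: group 3 now swallows the rest of the line,
# including the remaining _s_ ... _e_ sections.
greedy_first = re.search(r"(.*?)(_s_.*?_e_)(.*)", text)
print(greedy_first.group(3))
```

In both cases "wwwddd" never shows up as a clean, separate capture, which is the problem described in the question.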

How do I get the text after the ending delimiter "_e_"?

By:

  • repeating the execution of the regex as many times as there are matches. Your programming language is likely to have a function for such repetition. For instance, PHP has preg_match_all; and
  • allowing a match (in capture group 1) to be followed by either _s_ or by the end of the input ($).

Any idea how to update the pattern?

Drop the third capture group, as you want successive matches to be captured by the first capture group.

Proposed regex:

(.*?)(?:_s_.*?_e_|$)
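
As a rough sketch of how the proposed regex could be applied repeatedly, here it is in Python, with re.finditer playing the role that preg_match_all plays in PHP. The names text, pattern and outside are assumptions made for this example.

```python
import re

text = "aaa_s_123abc_e_bbbccc_s_456def_e_bbbddd_s_7890_e_wwwddd"
pattern = re.compile(r"(.*?)(?:_s_.*?_e_|$)")

# finditer repeats the match across the whole input.
# A zero-width match can still occur at the very end of the string,
# so matches whose overall text is empty are skipped.
outside = [m.group(1) for m in pattern.finditer(text) if m.group(0)]
print(outside)
```

Filtering on group(0) rather than group(1) keeps an empty capture between two directly adjacent delimited sections, and drops only the zero-width match at the end of the input.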

Answer by sln

When a * quantifier (zero or more) is made non-greedy with ?, as in X*?, and it sits
at the very end of a regex, it will only ever match the empty string, because it
prefers the "zero" option and nothing forces it to take more.

However, if something must match right after it, the engine is forced to try the
"or more" part until that following piece can match.
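
A tiny illustration of that behaviour, sketched in Python with two throwaway patterns that are not part of the question:

```python
import re

# Lazy group at the very end: nothing forces it to grow, so it stays empty.
at_end = re.match(r"a(.*?)", "abc")
print(repr(at_end.group(1)))

# Lazy group followed by a required "c": it must grow until "c" can match.
forced = re.match(r"a(.*?)c", "abc")
print(repr(forced.group(1)))
```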

With that knowledge, you can make the last segment match only at the end of the string. This can be written in different ways; I like this one, which simply captures a trailing run of non-underscore characters (so it assumes the text after the last _e_ contains no underscores).

(.*?)(_s_.*?_e_)((?:[^_]+$)?)

https://regex101.com/r/zUJ6Fh/1

Overview

( .*? )                       # (1)
( _s_ .*? _e_ )               # (2)
(                             # (3 start)
   (?:                           # Cluster an optional segment
      [^_]+                         # that excludes underscore
      $                             # up to the end of string
   )?                            # Optional characters at end of string only
)                             # (3 end)
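
For anyone who wants to try this outside regex101, the following is a small Python sketch. It assumes Python's re.VERBOSE flag as the free-spacing mode for the expanded form; the comments inside the pattern paraphrase the overview above, and the variable names are made up for the example.

```python
import re

text = "aaa_s_123abc_e_bbbccc_s_456def_e_bbbddd_s_7890_e_wwwddd"

compact = re.compile(r"(.*?)(_s_.*?_e_)((?:[^_]+$)?)")

# Same pattern in free-spacing form; re.VERBOSE ignores the whitespace
# and treats everything after # as a comment.
expanded = re.compile(
    r"""
    ( .*? )              # (1) text before the delimited section
    ( _s_ .*? _e_ )      # (2) the delimited section
    (                    # (3) trailing text
       (?: [^_]+ $ )?    #     only present at the end of the string
    )
    """,
    re.VERBOSE,
)

compact_result = compact.findall(text)
expanded_result = expanded.findall(text)
print(expanded_result)
```

Group 3 stays empty for the first two matches and holds "wwwddd" only in the last one, since that is the only place where a run of non-underscore characters reaches the end of the string.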