Refining this name parse pattern

This pattern

/^(.*?)\b((?:[Vv][ao]n|(?:[Dd][eu]\s+)?[Ll]a|[Dd][eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)(.*)/gum

parses name tokens.

I had help deriving this pattern (ECMAScript Flavor) and have made small adjustments, but I'm stuck on the third name token in the test string.

Van H. Manning properly parses to Van H. Manning (just use trim() to remove extra space)

Lionel Van Deerlin properly parses to Lionel Van Deerlin

But Van Taylor does not parse to Van Taylor

Can this pattern be adjusted to properly parse Van Taylor along with the other instances of Van?

I'm still working out how this pattern works and how to understand this particular regex wizardry.

TIA

** Update **

Fool's errand though it may be, I am trying to produce the best parse I can.

Per the comments, Van H. Manning is distinct because Van is a first name whereas Van Deerlin is a surname.

Similarly to Van H. Manning, Van Taylor consists of Van as a first name and Taylor as a surname.

I can see that part of the logic is that Van occurring at the beginning of the string distinguishes a first name from a surname. However, the pattern already groups Van \w+ correctly, so it seems only a small adjustment is needed.

As for Van H. Manning being parsed as Van H. Manning, I am using a conditional to handle that. How to fold that case into the regex along with everything else is beyond me, and I've already asked for a lot of heavy lifting here.

Patrick Janser (best answer)

I think handling every case will get rather complicated because, as everybody pointed out, the first name may come before or after the surname (last name or family name). In some countries, I believe, the last name is even derived from a parent's first name, so imagine how hard it is to detect the order reliably.

But if you want to stick to a regular expression, you could rely on your assumption that Van at the beginning of the string is a first name. In this case, use two alternatives in your regular expression and capture the parts in several groups. I've named the groups for easier access than with indexed groups. You'll then need some logic to check which groups are filled and which are empty.
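A minimal sketch of that group-checking logic, using the JavaScript expression given further down in this answer. The helper name `splitName` and the shape of the returned object are my own choices for illustration, not part of the original pattern. The g flag is dropped here because each call parses a single line.

```javascript
// Expression copied from the JavaScript version in this answer, minus the g flag.
const nameRegexp = /^(?:(?<firstname_van>Van)\s(?<lastname_van>.*)|(?<firstname>.*?)\s*(?<lastname>\b(?:(?<!^)V[ao]n|(?:D[eu]\s+)?La|D[eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)[ \t]*(?:(?<senority>(?:[JS]r\.?|[IVX]+))|(?<more>.*)))$/umi;

// Hypothetical helper: returns null when neither alternative matches
// (for example a name without any of the listed particles).
function splitName(line) {
  const match = nameRegexp.exec(line.trim());
  if (match === null) return null;
  const g = match.groups;

  // First alternative: "Van" at the start of the line is taken as the first name.
  // Everything after it, middle initials included, ends up in `last`.
  if (g.firstname_van !== undefined) {
    return { first: g.firstname_van, last: g.lastname_van.trim(), suffix: "", rest: "" };
  }

  // Second alternative: groups that did not take part are undefined.
  return {
    first: g.firstname.trim(),
    last: g.lastname,
    suffix: g.senority ?? "",
    rest: (g.more ?? "").trim(),
  };
}

for (const name of ["Van Taylor", "Lionel Van Deerlin", "Robert M. La Follette Sr."]) {
  console.log(name, splitName(name));
}
```

Checking `firstname_van` against `undefined` rather than for truthiness keeps the two alternatives apart even when a group matched an empty string.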

I also used the i flag for case-insensitive matching instead of spelling out character classes such as [Dd].
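One thing to keep in mind with that change: the i flag applies to the whole pattern, so parts that were case-sensitive in the original (such as Le) become case-insensitive too. The two reduced patterns below are only an illustration and are not the full expression.

```javascript
// Case handled per letter, as in the question: only D/d varies, "Le" stays case-sensitive.
const explicit = /^(?:[Dd][eu]|Le)\s+\p{L}+$/u;
// Case handled by the i flag, as in this answer: every letter matches in either case.
const withFlag = /^(?:D[eu]|Le)\s+\p{L}+$/ui;

const samples = ["De Armond", "de Lallande", "Le Blond", "le blond", "DE ARMOND"];
for (const s of samples) {
  console.log(s, "explicit:", explicit.test(s), "i flag:", withFlag.test(s));
}
```

Here "le blond" and "DE ARMOND" are accepted only by the version with the i flag. For this data set that is probably harmless, but it is a difference worth knowing about.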

I personally think that having several regular expressions or trying to find a library to handle that for you might be a better idea, especially if you also know the origin of the person, which could help to use specific rules by region of the planet.
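To give an idea of what "several regular expressions" could look like, here is a rough sketch that strips a generational suffix first and then tries a short list of simpler patterns in order. The rule names, the `parseName` helper and the fallback "last token is the surname" rule are assumptions of mine for illustration. They are not part of the pattern in the question.

```javascript
// Generational suffix at the very end, e.g. "Sr.", "Jr", "III" (case-sensitive on purpose).
const SUFFIX = /\s+([JS]r\.?|[IVX]+)$/u;

// Hypothetical rule list, tried from top to bottom; the first match wins.
const RULES = [
  // "Van" at the start is assumed to be a first name; the rest goes to `last`.
  { name: "leading-van", re: /^(?<first>Van)\s+(?<last>.+)$/u },
  // A known particle followed by one word at the end of the name.
  { name: "particle", re: /^(?<first>.+?)\s+(?<last>(?:V[ao]n|(?:D[eu]\s+)?La|D[eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)$/ui },
  // Fallback: the last whitespace-separated token is taken as the surname.
  { name: "plain", re: /^(?<first>.+)\s+(?<last>\S+)$/u },
];

function parseName(raw) {
  let name = raw.trim();
  let suffix = "";
  const s = SUFFIX.exec(name);
  if (s !== null) {
    suffix = s[1];
    name = name.slice(0, s.index); // drop the suffix and the whitespace before it
  }
  for (const rule of RULES) {
    const m = rule.re.exec(name);
    if (m !== null) {
      return { rule: rule.name, first: m.groups.first, last: m.groups.last, suffix };
    }
  }
  // Single-token names and anything else the rules do not cover.
  return { rule: "none", first: "", last: name, suffix };
}

for (const n of ["Van Taylor", "Lionel Van Deerlin", "Robert M. La Follette Jr.", "John Smith"]) {
  console.log(n, parseName(n));
}
```

Each rule stays small enough to reason about on its own, and region-specific rules could be added or reordered without touching the others.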

The PCRE regex:

/^
(?: # Where "Van" would be the first name:
  (?<firstname_van>Van)\s(?<lastname_van>.*)
|
  # Other cases: the first name is probably first, but not sure.
  (?<firstname>.*?)\s*
  (?<lastname>
    \b
    (?:
      (?<!^)V[ao]n
      |(?:D[eu]\s+)?La
      |D[eu]
      |St\.
      |Le
      |Auf\s+der
    )
    \s+\p{L}+\.?
  )
  \h*
  (?:
    (?<senority>(?:[JS]r\.?|[IVX]+))
    |
    (?<more>.*)
  )
)
$/gumix

The JavaScript version, to be enhanced as needed:

const regexp = /^(?:(?<firstname_van>Van)\s(?<lastname_van>.*)|(?<firstname>.*?)\s*(?<lastname>\b(?:(?<!^)V[ao]n|(?:D[eu]\s+)?La|D[eu]|St\.|Le|Auf\s+der)\s+\p{L}+\.?)[ \t]*(?:(?<senority>(?:[JS]r\.?|[IVX]+))|(?<more>.*)))$/gumi;

const input = `Van H. Manning 
Lionel Van Deerlin
Van Taylor
Emile La Sére
George A. La Dow
Gilbert De La Matyr
Robert M. La Follette
William Leroy La Follette
Robert M. La Follette Sr.
Robert M. La Follette Jr.
Charles M. La Follette
Monica De La Cruz
David A. De Armond
Justin De Witt Bowersock
De Witt C. Giddings
Julien de Lallande Poydras
Henry St. John
Edward St. Loe Livermore
Oscar L. Auf der Heide
Kika de la Garza
Francis Celeste Le Blond
Robert Le Roy Livingston`;

let i = 1;
let match; // declared explicitly so the loop also works in strict mode and ES modules
while ((match = regexp.exec(input)) !== null) {
  console.log(`Match ${i++}`, match.groups);
}
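
The piece of the expression that keeps Van Taylor out of the surname branch is the `(?<!^)` lookbehind in front of `V[ao]n`. The reduced pattern below is only meant to isolate that behaviour; the variable names are made up for this illustration.

```javascript
// (?<!^) succeeds only where the current position is NOT the start of a line
// (the m flag makes ^ match at every line start).
const notAtLineStart = /(?<!^)\bV[ao]n\s+\p{L}+/gum;

const text = "Van Taylor\nLionel Van Deerlin";
const hits = text.match(notAtLineStart);

console.log(hits); // [ 'Van Deerlin' ]
```

"Van Taylor" is skipped because its Van sits at the start of a line, which leaves it to the first alternative of the full expression.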