Extract an abbreviation from the words longer than 3 letters in a string using regex

string1 =  'Department of the Federal Treasury "IFTS No. 43"'
string2 =  'Federal Treasury Company "Light-8"'

I need to get the capital first letters of the words longer than 3 characters that come before the opening quote, and also extract the quoted expression, using one pattern that works for both strings.

Final string should be:

  • for string1: 'IFTS No. 43, DFT'.
  • for string2: 'Light-8, FTC'.

I would like a single pattern that works for both strings so that I can later apply it to a DataFrame column.

Answer by bobble bubble

You can use a capturing group and alternation.

"([^"]+)"|\b[A-Z]

See this demo at regex101 (FYI read: The Trick)

The pattern either matches a quoted part, capturing everything between the double quotes in the first capturing group, or matches a single capital letter right after a \b word boundary (the start of a word).
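To see why the quoted alternative comes first, here is a small sketch comparing the bare \b[A-Z] pattern with the combined one on string1 from the question. The variable names `bare` and `combined` are only illustrative.

```python
import re

s = 'Department of the Federal Treasury "IFTS No. 43"'

# Bare pattern: capitals inside the quotes are picked up as well.
bare = re.findall(r'\b[A-Z]', s)

# Combined pattern: the quoted span is consumed as one match,
# so the capitals inside it are never matched on their own.
combined = [(m.group(0), m.group(1))
            for m in re.finditer(r'"([^"]+)"|\b[A-Z]', s)]

print(bare)      # ['D', 'F', 'T', 'I', 'N']
print(combined)  # last item is ('"IFTS No. 43"', 'IFTS No. 43')
```

For the single-letter matches, group(1) is None, which is what the check on m.group(1) relies on to tell the two alternatives apart.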

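import re

# Group 1 captures the quoted text; the second alternative matches
# a capital letter at the start of a word.
regex = r'"([^"]+)"|\b[A-Z]'

s = 'Department of the Federal Treasury "IFTS No. 43"'

res = ["", ""]  # [quoted part, initials]

for m in re.finditer(regex, s):
    if m.group(1):
        res[0] += m.group(1)  # the quoted alternative matched
    else:
        res[1] += m.group(0)  # a single capital letter matched

print(res)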

Output (from the Python demo at tio.run):

['IFTS No. 43', 'DFT']
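
The code above returns the two parts as a list and does not check word length; it works on the sample strings because "of" and "the" are lowercase. A minimal sketch of one way to build the final string and enforce the length rule follows. The helper name `abbreviate` and the `(?=\w{3})` lookahead are assumptions added here, not part of the original answer, and the pattern still does not verify that the words come before the opening quote.

```python
import re

# Quoted text goes to group 1, OR a capital letter that starts a word
# of at least 4 word characters (the lookahead requires 3 more).
PATTERN = re.compile(r'"([^"]+)"|\b[A-Z](?=\w{3})')

def abbreviate(s):
    """Return '<quoted part>, <initials>' for one input string."""
    quoted, initials = [], []
    for m in PATTERN.finditer(s):
        if m.group(1):
            quoted.append(m.group(1))    # the quoted alternative matched
        else:
            initials.append(m.group(0))  # initial of a long enough word
    return f"{' '.join(quoted)}, {''.join(initials)}"

print(abbreviate('Department of the Federal Treasury "IFTS No. 43"'))  # IFTS No. 43, DFT
print(abbreviate('Federal Treasury Company "Light-8"'))                # Light-8, FTC
```

In pandas, a function like this could be applied to a column with something like df["name"].apply(abbreviate), where "name" is a hypothetical column name.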