Using Power Query to separate name from other attributes

Question

Using Power Query to separate name from other attributes

72 Views Asked by Jim Naroski At 06 March 2024 at 16:30

I have some data where I'm trying to separate player names from a series of a finite set of acronym-based attributes that are added to the end of the player's name using Power Query. The number of attributes varies from player to player. There are 20 or so known attributes like UER, RC, COR and all are in all caps. Things are further complicated by suffixes after the name, like Jr., Sr., etc. Here is an example of the data:

Data
Bob Brenly
Jamie Moyer
Cal Ripken, Jr.
Ken Caminiti UER
Kenny Rogers RC
Lloyd McClendon ERR
Lloyd McClendon COR, RC

I want a table that looks like this:

Name	Attributes
Bob Brenly
Jamie Moyer
Cal Ripken, Jr.
Ken Caminiti	UER
Kenny Rogers	RC
Lloyd McClendon	ERR
Lloyd McClendon	COR, ERR

I've tried all sorts of column splitting, columns by example, etc. but been unable to separate this content without also pulling the Jr., Sr., etc. into the Attributes column. Any thoughts on how to do this correctly using either Power Query or even Excel?

Original Q&A

There are 3 best solutions below

**Alexis Olson** · Answer 1 · 2024-03-06T16:54:29.510000

I can't figure out how to do it in a single step, but you can try looking for the last lowercase (or period) and using that as an index to split on.

let
  Source = Table.FromRows(
    {
      {"Bob Brenly"},
      {"Jamie Moyer"},
      {"Cal Ripken, Jr."},
      {"Ken Caminiti UER"},
      {"Kenny Rogers RC"},
      {"Lloyd McClendon ERR"},
      {"Lloyd McClendon COR, RC"}
    },
    type table [Data = text]
  ),
  AddSplitIndex = Table.AddColumn(
    Source,
    "SplitIndex",
    each Text.PositionOfAny([Data], {"a" .. "z", "."}, Occurrence.Last),
    Int64.Type
  ),
  AddName = Table.AddColumn(
    AddSplitIndex,
    "Name",
    each Text.Start([Data], [SplitIndex] + 1),
    type text
  ),
  AddAttributes = Table.AddColumn(
    AddName,
    "Attributes",
    each Text.Trim(Text.RemoveRange([Data], 0, [SplitIndex] + 1)),
    type text
  )
in
  AddAttributes

**Ron Rosenfeld** · Answer 2 · 2024-03-06T21:27:43.430000

Here's a method using Regular Expressions in Power Query.

Add a custom function (blank query)
Name it something like (I used) fnRegexPosition
It will return the starting position of the match
- defined as the first space separated substring that has at least two uppercase letters and no subsequent lower case letters up to the end of the line

let   fx=(text,regex)=>
    Web.Page(
        "<script>
            var x='"&text&"';
            var y=new RegExp('"&regex&"','g');
            var b=x.match(y).index;
            document.write(b);
        </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}

in
fx

Then, in the main code, you can

Add a custom column
Use the position to split the text
Expand the column to show the results

let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Data", type text}}),
    
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each let 
            pos = try Number.From(fnRegexPosition([Data], "\\s[A-Z]{2}[^a-z]*$")) otherwise 999,
            part1 = Text.Trim(Text.Start([Data],pos)),
            part2 = Text.Trim(Text.Middle([Data],pos))
        in 
            Record.FromList({part1,part2},{"Name", "Attributes"}), type record),
    
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Data"}),
    #"Expanded Custom" = Table.ExpandRecordColumn(#"Removed Columns", "Custom", {"Name", "Attributes"})

in
    #"Expanded Custom"

**Mayukh Bhattacharya** · Answer 3 · 2024-03-07T00:03:59.803000

Found the question interesting and nice, so tried to attempt it using Excel Formulas, this assumes one must be using Modern Versions of Excel aka MS365 in order to run the formulas in their spreadsheet. I am aware the post subject line and in even OP it says specifically about using Power Query still i tried for learning purposes:

• Formula used in cell B2

=BYROW(A2:A11,LAMBDA(a,
 LET(b, TEXTSPLIT(a,{" ",", "}),
 c, DROP(b,,2),
 d, (EXACT(UPPER(c),c))*(ISERR(FIND(".",c))),
 TEXTJOIN(", ",1,IFERROR(FILTER(c,d,""),"")))))

To start with we are passing the range A2:A11 using BYROW() function here, it applies a custom LAMBDA() calculations to each row in an array and returns one single output per row, this can also be performed using MAP() which applies the custom LAMBDA() functions to each element in an array to return a single dynamic array as output.
Using the LET() function within the above to define variables for functions which may be repeated more than once and as well as it helps in reading the functions easily.
The variable b is defined here for the function TEXTSPLIT() applied to the range which is defined as a. TEXTSPLIT() function split the string into columns by using delimiter (used are space and comas).
Next, I am specifically assuming the names will have only First Name and Last Name and no middle name or maiden name in between followed by other attributes and suffixes hence to remove those two i am using the DROP() function which removes the first two columns which returned using the above steps, and leaves me with the ones which are required.
Now, I am doing some conditional checks by using UPPER() function to change all the values of variables c to upper case letters and then comparing it with the original c using EXACT() function

As the said function will return TRUE if its exactly the same else will return FALSE also note that the function is case-sensitive thus which gives me an opportunity to use it.

Along with the above, I am also using the FIND() function to find if there is a . dot or not in the variable c, if it does then it returns a number else returns error. So to bypass the error to return it as TRUE we use ISERR() function (this is because we dont need any which have a dot, like the one suffixes) and then using * --> AND() to verify if both condition satisfies with our given conditions or not.

Finally, from the above we are taking the ones which are TRUE values using FILTER() and wrapping with TEXTJOIN() function with a delimeter comma.

Notes: The Excel Formula attempt is only to see whether it can be achieved or not, but it is better and best to use POWER QUERY because one of the biggest drawback of the above function is TEXTJOIN() as per MSFT documentations if the resulting string (which returns as one single array for many rows/records of data) exceeds 32767 characters (cell limit), then it returns the #VALUE! error. However, it will definitely work smoothly without giving any error if used for each row instead of using it within BYROW(). Shown below.

Assume you are using Structured References aka Tables then the formula would be:

=LET(
     _a,TEXTSPLIT(A2,{" ",", "}),
     _b,DROP(_a,,2),
     _c,(EXACT(UPPER(_b),_b))*(ISERR(FIND(".",_b))),
     TEXTJOIN(", ",1,IFERROR(FILTER(_b,_c,""),"")))

Using Power Query to separate name from other attributes

There are 3 best solutions below

Related Questions in EXCEL

Related Questions in EXCEL-FORMULA

Related Questions in POWERQUERY

Related Questions in M

Trending Questions

Popular # Hahtags

Popular Questions