Using Power Query to separate name from other attributes

72 Views Asked by At

I have some data where I'm trying to separate player names from a series of a finite set of acronym-based attributes that are added to the end of the player's name using Power Query. The number of attributes varies from player to player. There are 20 or so known attributes like UER, RC, COR and all are in all caps. Things are further complicated by suffixes after the name, like Jr., Sr., etc. Here is an example of the data:

Data
Bob Brenly
Jamie Moyer
Cal Ripken, Jr.
Ken Caminiti UER
Kenny Rogers RC
Lloyd McClendon ERR
Lloyd McClendon COR, RC

I want a table that looks like this:

Name Attributes
Bob Brenly
Jamie Moyer
Cal Ripken, Jr.
Ken Caminiti UER
Kenny Rogers RC
Lloyd McClendon ERR
Lloyd McClendon COR, ERR

I've tried all sorts of column splitting, columns by example, etc. but been unable to separate this content without also pulling the Jr., Sr., etc. into the Attributes column. Any thoughts on how to do this correctly using either Power Query or even Excel?

3

There are 3 best solutions below

0
Alexis Olson On

I can't figure out how to do it in a single step, but you can try looking for the last lowercase (or period) and using that as an index to split on.

let
  Source = Table.FromRows(
    {
      {"Bob Brenly"},
      {"Jamie Moyer"},
      {"Cal Ripken, Jr."},
      {"Ken Caminiti UER"},
      {"Kenny Rogers RC"},
      {"Lloyd McClendon ERR"},
      {"Lloyd McClendon COR, RC"}
    },
    type table [Data = text]
  ),
  AddSplitIndex = Table.AddColumn(
    Source,
    "SplitIndex",
    each Text.PositionOfAny([Data], {"a" .. "z", "."}, Occurrence.Last),
    Int64.Type
  ),
  AddName = Table.AddColumn(
    AddSplitIndex,
    "Name",
    each Text.Start([Data], [SplitIndex] + 1),
    type text
  ),
  AddAttributes = Table.AddColumn(
    AddName,
    "Attributes",
    each Text.Trim(Text.RemoveRange([Data], 0, [SplitIndex] + 1)),
    type text
  )
in
  AddAttributes

Output Screenshot

1
Ron Rosenfeld On

Here's a method using Regular Expressions in Power Query.

  • Add a custom function (blank query)
  • Name it something like (I used) fnRegexPosition
  • It will return the starting position of the match
    • defined as the first space separated substring that has at least two uppercase letters and no subsequent lower case letters up to the end of the line
let   fx=(text,regex)=>
    Web.Page(
        "<script>
            var x='"&text&"';
            var y=new RegExp('"&regex&"','g');
            var b=x.match(y).index;
            document.write(b);
        </script>")[Data]{0}[Children]{0}[Children]{1}[Text]{0}

in
fx

Then, in the main code, you can

  • Add a custom column
  • Use the position to split the text
  • Expand the column to show the results
let
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    #"Changed Type" = Table.TransformColumnTypes(Source,{{"Data", type text}}),
    
    #"Added Custom" = Table.AddColumn(#"Changed Type", "Custom", each let 
            pos = try Number.From(fnRegexPosition([Data], "\\s[A-Z]{2}[^a-z]*$")) otherwise 999,
            part1 = Text.Trim(Text.Start([Data],pos)),
            part2 = Text.Trim(Text.Middle([Data],pos))
        in 
            Record.FromList({part1,part2},{"Name", "Attributes"}), type record),
    
    #"Removed Columns" = Table.RemoveColumns(#"Added Custom",{"Data"}),
    #"Expanded Custom" = Table.ExpandRecordColumn(#"Removed Columns", "Custom", {"Name", "Attributes"})

in
    #"Expanded Custom"

enter image description here

1
Mayukh Bhattacharya On

Found the question interesting and nice, so tried to attempt it using Excel Formulas, this assumes one must be using Modern Versions of Excel aka MS365 in order to run the formulas in their spreadsheet. I am aware the post subject line and in even OP it says specifically about using Power Query still i tried for learning purposes:

enter image description here


• Formula used in cell B2

=BYROW(A2:A11,LAMBDA(a,
 LET(b, TEXTSPLIT(a,{" ",", "}),
 c, DROP(b,,2),
 d, (EXACT(UPPER(c),c))*(ISERR(FIND(".",c))),
 TEXTJOIN(", ",1,IFERROR(FILTER(c,d,""),"")))))

  • To start with we are passing the range A2:A11 using BYROW() function here, it applies a custom LAMBDA() calculations to each row in an array and returns one single output per row, this can also be performed using MAP() which applies the custom LAMBDA() functions to each element in an array to return a single dynamic array as output.
  • Using the LET() function within the above to define variables for functions which may be repeated more than once and as well as it helps in reading the functions easily.
  • The variable b is defined here for the function TEXTSPLIT() applied to the range which is defined as a. TEXTSPLIT() function split the string into columns by using delimiter (used are space and comas).
  • Next, I am specifically assuming the names will have only First Name and Last Name and no middle name or maiden name in between followed by other attributes and suffixes hence to remove those two i am using the DROP() function which removes the first two columns which returned using the above steps, and leaves me with the ones which are required.
  • Now, I am doing some conditional checks by using UPPER() function to change all the values of variables c to upper case letters and then comparing it with the original c using EXACT() function

As the said function will return TRUE if its exactly the same else will return FALSE also note that the function is case-sensitive thus which gives me an opportunity to use it.

Along with the above, I am also using the FIND() function to find if there is a . dot or not in the variable c, if it does then it returns a number else returns error. So to bypass the error to return it as TRUE we use ISERR() function (this is because we dont need any which have a dot, like the one suffixes) and then using * --> AND() to verify if both condition satisfies with our given conditions or not.

  • Finally, from the above we are taking the ones which are TRUE values using FILTER() and wrapping with TEXTJOIN() function with a delimeter comma.

Notes: The Excel Formula attempt is only to see whether it can be achieved or not, but it is better and best to use POWER QUERY because one of the biggest drawback of the above function is TEXTJOIN() as per MSFT documentations if the resulting string (which returns as one single array for many rows/records of data) exceeds 32767 characters (cell limit), then it returns the #VALUE! error. However, it will definitely work smoothly without giving any error if used for each row instead of using it within BYROW(). Shown below.


Assume you are using Structured References aka Tables then the formula would be:

enter image description here


=LET(
     _a,TEXTSPLIT(A2,{" ",", "}),
     _b,DROP(_a,,2),
     _c,(EXACT(UPPER(_b),_b))*(ISERR(FIND(".",_b))),
     TEXTJOIN(", ",1,IFERROR(FILTER(_b,_c,""),"")))