In open refine how to merge cells for a particular key value when another column value matches

199 Views Asked by At

This is a typical one to many / many to one data transform. I've gotten things reformatted most of the way but this last step - combining column rows based on matching value in another column is something I'm stuck on.

Schema of current and desired table

In excel I'd use something with VLOOKUP to do this. I can't quite figure out how to do this in OpenRefine even after RTFM.

I've tried various transforms but cannot get it working. When using the rows to columns transform it asks how many columns but it doesn't seem to handle this exact case either

1

There are 1 best solutions below

1
b2m On

I think what makes your transformation "problematic" is, that you have two keys (columns "Key" and "New"). So I would create a temporary double key column to sort and transform the data accordingly.

Transforming using Records and multi-valued cells

One way is to use the records mode of OpenRefine by creating records based on a temporary double key column.

  1. Add new a temporary key column via "Key" => "Join column...". Join the columns "Key" and "New" into a new column named "tmp_key".
  2. Move the column "tmp_key" via "tmp_key" => "Edit column" => "Move column to beginning" to the beginning.
  3. Sort by column "tmp_key" and make the sort permanent (just to make sure).
  4. Blank down the column "tmp_key" via "tmp_key" => "Edit cells" => "Blank down".
  5. Check whether there are empty cells in column "Old". If this is the case fill them with a temporary placeholder like "EMPTY".
  6. Join the values in column "Old" via "Old" => Edit cells" => Join multi-valued cells...".
  7. Delete rows with empty cells in column "Old".
  8. Remove "EMPTY" placeholder and temporary key column.

Transforming using cross

You could use the cross function to look up other cells. But you still need a unique key.

  1. Add new a temporary key column via "Key" => "Join column...". Join the columns "Key" and "New" into a new column named "tmp_key".
  2. Transform the column "Old" via "Old" => "Edit cells" => "Transform..." and the following GREL expression:
cross(cells["tmp_key"], "", "tmp_key").cells["Old"].value.join(", ")
  1. Remove duplicated rows e.g. by the strategy used above (move column "tmp_key" to the beginning, sort by column "tmp_key", blank down on "tmp_key", blank down on column "Old", filter on empty cells in "Old", delete filtered rows. Remember to check for empty rows beforehand.