How to add data that doesn't exist on one list to another?

80 Views Asked by At

I have two sheets with thousands of records. There are overlaps between the two, in that each record on the two docs use identifiers. I need to migrate the records of what doesn't already exist into the other doc. How would I do that?

1

There are 1 best solutions below

8
b2m On

So this is possible, but if you are familiar with scripting languages like Python or R I currently would recommend using them to perform the merge.

Nevertheless here is a rudimentary recipe to perform the merge using only OpenRefine.

Note that this behavior might change in the future, watch GitHub Issue #556 for that.

Assuming you have two projects called A and B and want to merge into Project B every row from Project A that is not already there yet. The two projects share a common id column.

1. Prepare Project B for synchronization

In Project A mark the rows that are already in Project B using cross. For that add a new column named "Sync" in Project A based on your ID column using the following GREL expression:

if(
    cell.cross("Project B", "ID column").cells["ID column"].value.length() > 0,
    "Found",
    row.index
)

This will use the index of the row as temporary id for synchronization, but only for rows that are not already in Project B.

2. Prepare Project B for synchronization

In Project B we also add a new column named "Sync" using the following GREL expression:

with(6000, rowsInProjectA,
  with(7000, rowsInProjectB,
    if(row.index + 1 == rowsInProjectB,
      "," + forRange(0, rowsInProjectA, 1, v, v).join(","),
      ""
    )
  )
)

This will add a string ,0,1,2,...,6000 in the last row of column "Sync" in Project B. Note that you manually have to determine and set the two variables rowsInProjectA (currently 6000) and rowsInProjectB (currently 7000).

Then we use Split multi-valued cells on column "Sync" in Project B using the comma , as separator. This will basically add new rows to Project B containing only a value in the column "Sync" to be able to load the missing rows from Project A.

3. Load rows from Project A

In Project B we use cross again to to load the missing rows from Project A. For that we use the transformation dialog in the ALL column to be able to load several columns in one step.

if(isNonBlank(row.cells["Sync"].value),
  row.cells["Sync"].cross("Project A", "Sync").cells[columnName].value[0],
  value
)

This GREL expression is assuming that the columns in Project A and Project B have the same names. Otherwise you would have to use the transform dialog on each column separately and manually map the column names from Project A and Project B.

4. Clean up

  1. Delete rows in Project B that only contain a value in "Sync" but in no other column (these are rows that already are in Project B and therefore have not been loaded in step 3).
  2. Delete column "Sync" in Project B
  3. Delete column "Sync" in Project A