Deedle Equivalent to pandas.merge

1k Views Asked by At

I am looking to merge two Deedle (F#) frames based on a specific column in each frame in a similar manner as pandas.DataFrame.Merge.The perfect example of this would be a primary frame that contains columns of data and a (city, state) column along with an information frame that contains the following columns: (city, state); lat; long. If I want to add the lat long columns into my primary frame, I would merge the two frames on the (city, state) column.

Here is an example:

    let primaryFrame =
            [(0, "Job Name", box "Job 1")
             (0, "City, State", box "Reno, NV")
             (1, "Job Name", box "Job 2")
             (1, "City, State", box "Portland, OR")
             (2, "Job Name", box "Job 3")
             (2, "City, State", box "Portland, OR")
             (3, "Job Name", box "Job 4")
             (3, "City, State", box "Sacramento, CA")] |> Frame.ofValues

    let infoFrame =
            [(0, "City, State", box "Reno, NV")
             (0, "Lat", box "Reno_NV_Lat")
             (0, "Long", box "Reno_NV_Long")
             (1, "City, State", box "Portland, OR")
             (1, "Lat", box "Portland_OR_Lat")
             (1, "Long", box "Portland_OR_Long")] |> Frame.ofValues

    // see code for merge_on below.
    let mergedFrame = primaryFrame
                      |> merge_On infoFrame "City, State" null

Which would result in 'mergedFrame' looking like this:

> mergedFrame.Format();;
val it : string =
  "     Job Name City, State    Lat             Long             
0 -> Job 1    Reno, NV       Reno_NV_Lat     Reno_NV_Long     
1 -> Job 2    Portland, OR   Portland_OR_Lat Portland_OR_Long 
2 -> Job 3    Portland, OR   Portland_OR_Lat Portland_OR_Long 
3 -> Job 4    Sacramento, CA <missing>       <missing>   

I have come up with a way of doing this (the 'merge_on' function used in the example above), but being a Sales Engineer who is new to F#, I imagine there is a more idiomatic/efficient way of doing this. Below is my functions for doing this along with a 'removeDuplicateRows' which does what you would expect and was needed for the 'merge_on' function; if you want to comment on a better way of doing this as well, please do.

    let removeDuplicateRows column (frame : Frame<'a, 'b>) =
             let nonDupKeys = frame.GroupRowsBy(column).RowKeys
                              |> Seq.distinctBy (fun (a, b) -> a) 
                              |> Seq.map (fun (a, b) -> b)  
             frame.Rows.[nonDupKeys]


    let merge_On (infoFrame : Frame<'c, 'b>) mergeOnCol missingReplacement 
                  (primaryFrame : Frame<'a,'b>) =
          let frame = primaryFrame.Clone() 
          let infoFrame =  infoFrame                           
                           |> removeDuplicateRows mergeOnCol 
                           |> Frame.indexRows mergeOnCol
          let initialSeries = frame.GetColumn(mergeOnCol)
          let infoFrameRows = infoFrame.RowKeys
          for colKey in infoFrame.ColumnKeys do
              let newSeries =
                  [for v in initialSeries.ValuesAll do
                        if Seq.contains v infoFrameRows then  
                            let key = infoFrame.GetRow(v)
                            yield key.[colKey]
                        else
                            yield box missingReplacement ]
              frame.AddColumn(colKey, newSeries)
          frame

Thanks for your help!

UPDATE:

Switched Frame.indexRowsString to Frame.indexRows to handle cases where the types in the 'mergOnCol' are not strings.

Got rid of infoFrame.Clone() as suggested by Tomas

1

There are 1 best solutions below

4
Tomas Petricek On BEST ANSWER

The way Deedle does joining of frames (only in row/column keys) sadly means that it does not have a nice built-in function to do joining of frames over a non-key column.

As far as I can see, your approach looks very good to me. You do not need Clone on the infoFrame (because you are not mutating the frame) and I think you can replace infoFrame.GetRow with infoFrame.TryGetRow (and then you won't need to get the keys in advance), but other than that, your code looks fine!

I came up with an alternative and a bit shorter way of doing this, which looks as follows:

// Index the info frame by city/state, so that we can do lookup
let infoByCity = infoFrame |> Frame.indexRowsString "City, State"

// Create a new frame with the same row indices as 'primaryFrame' 
// containing the additional information from infoFrame.
let infoMatched = 
  primaryFrame.Rows
  |> Series.map (fun k row -> 
      // For every row, we get the "City, State" value of the row and then
      // find the corresponding row with additional information in infoFrame. Using 
      // 'ValueOrDefault' will automatically give missing when the key does not exist
      infoByCity.Rows.TryGet(row.GetAs<string>("City, State")).ValueOrDefault)
  // Now turn the series of rows into a frame
  |> Frame.ofRows

// Now we have two frames with matching keys, so we can join!
primaryFrame.Join(infoMatched)

This is a bit shorter and maybe more self-explanatory, but I have not done any tests to check which is faster. Unless performance is a primary concern, I think going with the more readable version is a good default choice though!

Related Questions in F#

Related Questions in F#-DATA