Drop duplicates except for the first occurrence with Deedle


I have a table whose key column contains duplicate values. I would like to drop all rows with a duplicate key, keeping only the first row for each key.

open System.IO
open Deedle

let data = "A;B\na;1\nb;\nb;2\nc;3"

let bytes = System.Text.Encoding.UTF8.GetBytes data
let stream = new MemoryStream(bytes)

let df =
    Frame.ReadCsv(
        stream = stream,
        separators = ";",
        hasHeaders = true
    )

df.Print()
     A B         
0 -> a 1         
1 -> b <missing> 
2 -> b 2         
3 -> c 3              

The result should be

     A B         
0 -> a 1         
1 -> b <missing>       
2 -> c 3       

I have tried applyLevel, but it returns the first non-missing value in each column rather than the first row of each group:

let df1 =
    df
    |> Frame.groupRowsByString "A"
    |> Frame.applyLevel fst (fun s -> s |> Series.firstValue)

df1.Print()
     A B 
a -> a 1 
b -> b 2 <- wrong
c -> c 3 

Brian Berns (best answer)

This is essentially a duplicate of a previous SO question. The short answer is:

let df1 =
    df
    |> Frame.groupRowsByString "A"
    |> Frame.nest                        // convert to a series of frames
    |> Series.mapValues (Frame.take 1)   // take the first row from each frame
    |> Frame.unnest                      // convert back to a single frame
    |> Frame.mapRowKeys snd              // drop the group key, keep the original row index

df1.Print()

The output is:

     A B
0 -> a 1
1 -> b <missing>
3 -> c 3

I've added a call to Frame.mapRowKeys at the end to match your desired output as closely as possible. Note that the actual output differs slightly from your expected output, because row 3 -> c 3 has original index 3 instead of 2. I think this is more correct, but you can renumber the rows if necessary.
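If you do want consecutive row numbers, here is a minimal sketch of the renumbering step. It assumes Deedle is loaded from NuGet in an F# script, and it uses a small hand-built frame (the hypothetical `deduped`) as a stand-in for `df1`, so the column contents are illustrative only:

```fsharp
#r "nuget: Deedle"
open Deedle

// Hypothetical stand-in for the de-duplicated frame: row keys 0, 1, 3 (key 2 was dropped).
let deduped =
    frame [ "A" => series [ 0 => "a"; 1 => "b"; 3 => "c" ] ]

// Replace the existing row keys with ordinal numbers 0 .. n-1.
let renumbered = deduped |> Frame.indexRowsOrdinally

renumbered.Print()
```

In the pipeline above, the same call would presumably go after `Frame.mapRowKeys snd`.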

The referenced question has more details.

jim108dev

Using Frame.nest/Frame.unnest is a reasonable solution, but I have noticed that it is a little slow.

My solution records the first row index seen for each key in a Map, then keeps only the rows whose index matches:

let dropDuplicates (df:Frame<_,_>) =
    // Map each group key to the row index of its first occurrence.
    let selectedMap = 
        df.RowKeys
        |> Seq.fold (fun (m:Map<'A,'B>) (a,b) -> 
            if m.ContainsKey a then m else m |> Map.add a b) Map.empty

    // Keep a row only if its index is the one recorded for its group key.
    df
    |> Frame.filterRows(fun (a,b) _ -> 
        match selectedMap.TryFind a with
        | Some entry -> entry = b
        | _ -> false)

let df1 =
    df
    |> Frame.groupRowsByString "A"
    |> dropDuplicates

df1.Print()
       A B         
a 0 -> a 1         
b 1 -> b <missing> 
c 3 -> c 3   
