pyspark Anti-join 2 dataframes


I have two DataFrames, df and df1. I want to filter out of df1 the records that are present in df, and I was thinking an anti-join could achieve this. But the id column has a different name in the two tables, and I want to join on multiple columns. Is there a neat way to do this?

df1

user_id name
1 AAA
2 ABC
3 XYZ

df

id name
1 AAA
2 ABC

Result_dataframe

id name
3 XYZ

1 Answer

Answer from notNull:

Try standardizing the column names across the two DataFrames, then use the shared column names in your join condition.

Example:

df=spark.createDataFrame([(1,'AAA'),(2,'ABC')],['id','name'])
df1=spark.createDataFrame([(1,'AAA'),(2,'ABC'),(3,'XYZ')],['user_id','name'])

#standardize the df1 column names based on df's columns
df2 = spark.createDataFrame(df1.rdd, df.columns)

#anti-join on the full list of shared column names
df2.join(df, df.columns, 'left_anti').show(100, False)
#+---+----+
#|id |name|
#+---+----+
#|3  |XYZ |
#+---+----+