I have a dataframe 'merged_tstbr' with over 17000 rows and a column 'ID'. In this column multiple entries appear as one long string, separated by ';' and \n.
Here an example of what a cell in this column looks like:
XDX_AbCd_XY_2415;\
XDX_AbCd_XY_3335;\
XDX_AbCd_XY_3351;\
XDX_AbCd_XY_3354
I have used the below script to check for all rows in 'ID' column and add to a new list 'tstbr_anf' if the ID is in my_list:
my_list = ['XDX_AbCd_XY_3351', 'XDX_AbCd_XY_3335'] # (and so on, 250 elements in total)
for index, row in merged_tstbr.iterrows():
row_values = [value.strip() for value in row['ID'].split('\n')]
matching = [value for value in row_values if value in my_list]
tstbr_anf.extend(matching)
Out of 250 elements, 6 are not showing in txtbr_anf.
I checked the presence of these in the column 'ID' in excel, and they are here, but somehow not getting added to txtbr_anf. They are present in my_list as well.
What can be the possible reasons for this.
Follow-up from How to search in a column of a dataframe for any and all value in a column another dataframe in Python using Pandas
Input data I've used to try and reproduce your issue
Each cell in your input dataframe contains a number of IDs together in a string:
Please confirm this matches your input, otherwise provide a dataframe constructor.
Problem:
You seem to believe that
.strip()removes semicolons, but it does not, unless you force it to. You could verify that with aprint(commented out in the code below): therow_valuesyou were testing membership for, still have their semicolon about them:Quick fix:
.split(), then.strip()is not needed:.strip()to care:Both will output
tstbr_anfas:Now why this problem only arises with 6 IDs, we can only know when you provide larger input data. It could be that semicolons are present in only few cells.
If your data looks anything like in your previous question How to search in a column of a dataframe for any and all value in a column another dataframe in Python using Pandas, then indeed not all IDs have semicolons but it could also be other undesired characters.
Alternative suggestion
The two-step process below will extract those IDs and retain, in a list, those only present in my_list. You can then directly assign to
tstbr_anf.Final output:
As you can see, we've gotten rid of the intermediate
matchinglist, which was the temporary product of each row iteration in your code.References