I want to join two dataframes together to add labels to codes. The codes consist of a combination of a letter and a number. Is there a way to join the tables based on the letter and number in one step? It does not have to be native pandas. Note that joining on the letter will not suffice, as some letters show overlap. Lastly, my dataset is huge, therefore doing it in two steps did not work until now (that is, first merging on only the letter and then as a second step filtering to see fit in range).
I created some fake data
def make_df_from_lists(index,**kwargs):
return pd.DataFrame(list(zip(*kwargs.values())),index=index,columns=list(kwargs.keys()))
index = [1,2,3,4,5,6,7,8,9,10]
number = [1,2,3,4,5,6,7,8,9,10]
code= ["A11", "E01", "H95", "B22", "D51", "D11", "C15", "H56", "A15", "E11"]
diags = make_df_from_lists(index,Number= number,Code=code)
index = [1,2,3,4,5,6,7,8]
range = ["A00-B99", "C00-D48", "D50-D90", "E00-E90", "F00-F99", "G00-G99", "H00-H59", "H60-H95"]
label = ["label1", "label2", "label3", "label4", "label5", "label6", "label7", "label8"]
labels = make_df_from_lists(index,Range=range,Label=label)
Wanted outcome:
Number Diag Label
1 A11 label1
2 E01 label4
3 H95 label8
4 B22 label1
5 D51 label3
6 D11 label2
7 C15 label2
8 H56 label7
9 A15 label1
10 E11 label4
Credits to https://www.jcchouinard.com/generate-dummy-data-with-python/ for inspiration to build a fake dataset.
You can basically replace every letter with its corresponding positional index in the alphabet:
i.e.
Ais 1,Bis 2 and so on.This way, ranges are still preserved:
The same goes for Codes:
Then, you can take advantage of a merge_asof operation, which does the following:
That is to say that since ranges are sorted, we could actually merge the new numerical codes on the lowest (or highest) value of a range.
Consider the lowest value of each (numerical) range: 100, 300 ... 800, 869. Now, if we'd like to match A11 (which is 111 now) to its nearest low-threshold value we'd get 100.
Code:
lowand ahighthreshold: