Part 1:
I am trying to build an NLP model that outputs 1 if a keyword is present in the text and 0 otherwise. However, the context of the sentence also has to be considered, to determine whether the keyword is genuinely affirmed rather than, for example, negated.
For example:
    Keywords = ["jaundice", "jaundiced", "portal hypertension"]
    Sentence = "No Jaundice and hypertension detected"
Here the expected output is 0 because, although the keyword is present, the context says the patient doesn't have jaundice.
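My code:

    # Lowercase the keywords so that matching is case-insensitive
    KEYWORDS = [k.lower() for k in KEYWORDS]

    def find_keyword(text, keywords):
        # Return 1 if any keyword occurs in the text, else 0 (context is ignored)
        text = text.lower()
        return int(any(keyword in text for keyword in keywords))

    # Check all keywords in one pass; assigning the column inside a
    # per-keyword loop would keep only the last keyword's result
    df1['output'] = df1['CLINICAL_DOCUMENT_TEXT'].apply(lambda t: find_keyword(t, KEYWORDS))
    df_output = df1[['MCN', 'CLINICAL_DOCUMENT_TEXT', 'output']]
    df_output.head()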
The text is in the "CLINICAL_DOCUMENT_TEXT" column of the dataframe, and I want a 1 or 0 in a new column named output. MCN is the user ID.
My output:

    MCN      CLINICAL_DOCUMENT_TEXT                            output
    2478812  PROGRESS NOTE Service: Hospitalist 6 SU...        0
    2478812  PROGRESS NOTE Service: Hospitalist 3 ...          0
    2478812  Encounter created for infectious disease scre...  0
    2478812  Clinical Note Types: Progress NUTRITION V...      0
    2478812  Facts: pt is 59 yo male admitted with covid p...  0
Because my code cannot capture context, it labels the text as 1 whenever the keyword is present, even when the sentence means "No jaundice".
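One way the negation requirement could be approached is a simple rule in the spirit of the NegEx algorithm: split the text into sentences, find each keyword on word boundaries, and treat it as negated if a negation cue appears a few words before it in the same sentence. The following is only a minimal sketch under stated assumptions: the cue list `NEGATION_CUES`, the function name `has_affirmed_keyword`, and the five-word `window` are all hypothetical choices made for illustration, and real clinical notes would need a much larger cue list and better sentence splitting.

```python
import re

KEYWORDS = ["jaundice", "jaundiced", "portal hypertension"]

# Hypothetical, deliberately short cue list (an assumption, not a clinical standard)
NEGATION_CUES = ["no", "not", "without", "denies", "denied", "negative for", "absence of"]

def _compile(terms):
    # Longest terms first so multi-word terms are preferred;
    # \b keeps "not" from matching inside words such as "Note"
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)

KEYWORD_RE = _compile(KEYWORDS)
NEGATION_RE = _compile(NEGATION_CUES)

def has_affirmed_keyword(text, window=5):
    """Return 1 if a keyword occurs with no negation cue among the
    `window` words preceding it in the same sentence, else 0."""
    # Naive sentence split on punctuation and newlines
    for sentence in re.split(r"[.;!?\n]+", text):
        for match in KEYWORD_RE.finditer(sentence):
            preceding_words = sentence[:match.start()].split()[-window:]
            if not NEGATION_RE.search(" ".join(preceding_words)):
                return 1
    return 0
```

Applied to the dataframe, this could be used as `df1['output'] = df1['CLINICAL_DOCUMENT_TEXT'].apply(has_affirmed_keyword)`. The sketch only looks for cues before the keyword, so phrasing such as "jaundice: absent" would still be labeled 1.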
Part 2:
I am trying to create a pivot table for each user showing which keywords are present in the text.
My code:

    def keywords(row):
        # Uppercase the text once so that the comparison is case-insensitive
        text = row['CLINICAL_DOCUMENT_TEXT'].upper()
        Keywords = ["jaundice", "jaundiced", "portal hypertension"]  # I have 75 keywords
        # Keep every keyword that occurs in the text
        return [key for key in Keywords if key.upper() in text]

    df1['keyword'] = df1.apply(keywords, axis=1)
    df_pivot = df1.explode('keyword').pivot_table(index="MCN", columns="keyword").fillna(0).astype(int).reset_index()
    df_pivot.columns = df_pivot.columns.droplevel(0)
    df_pivot.head()
My output:

    keyword                   SBP        TIPS  ascites  asterixis  black stools     jaundice  jaundiced    lactulose       melena  ...  SBP  TIPS  ascites  asterixis  black stools  jaundice  jaundiced  lactulose  melena  rifaximin
    0        1175880  -2147483648           0        0          0             0            0          0            0            0  ...    0     0        0          0             0         0          0          0       0          0
    1        1376151  -2147483648           0        0          0             0            0          0            0            0  ...    0     0        0          0             0         0          0          0       0          0
    2        1784428  -2147483648           0        0          0             0            0          0            0            0  ...    0     0        0          0             0         0          0          0       0          0
    3        1932574            0  1199667889        0          0             0  -2147483648          0            0            0  ...    0     0        0          0             0         0          0          0       0          0
    4        1977098  -2147483648           0        0          0             0            0          0  -2147483648  -2147483648  ...    0     0        0          0             0         0          0          0       0          0

    5 rows × 41 columns
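The numbers in this output do not look like presence flags. A likely explanation, stated here as an assumption, is that `pivot_table` called without a `values` argument averages every other numeric column of `df1` per keyword, which would also explain the repeated keyword columns. A minimal sketch of an alternative is to count (MCN, keyword) pairs explicitly and cap the counts at 1. The toy dataframe and the names `find_keywords` and `presence` below are invented for illustration only.

```python
import pandas as pd

KEYWORDS = ["jaundice", "jaundiced", "portal hypertension"]

# Toy stand-in for df1 (invented data, for illustration only)
df1 = pd.DataFrame({
    "MCN": [1, 1, 2],
    "CLINICAL_DOCUMENT_TEXT": [
        "Patient is jaundiced",
        "Portal hypertension noted",
        "No complaints",
    ],
})

def find_keywords(text):
    # Case-insensitive substring match, as in the original function
    upper = text.upper()
    return [k for k in KEYWORDS if k.upper() in upper]

# One row per (document, keyword); documents without keywords get NaN
exploded = (
    df1.assign(keyword=df1["CLINICAL_DOCUMENT_TEXT"].apply(find_keywords))
       .explode("keyword")
       .dropna(subset=["keyword"])
)

# Count the pairs, then turn the counts into 0/1 presence flags
counts = exploded.groupby(["MCN", "keyword"]).size().unstack(fill_value=0)
presence = counts.clip(upper=1).reindex(
    index=df1["MCN"].unique(),   # keep users with no keyword at all
    columns=KEYWORDS,            # keep keywords that never occur
    fill_value=0,
)
print(presence)
```

Because this is still plain substring matching, "jaundiced" also counts as a hit for "jaundice"; word-boundary matching would be needed to separate the two.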
I would appreciate any guidance on how to solve this problem better. Thank you!