I have a numpy.ndarray and I want to replace all abbreviations in it using the dictionary below. How can I do this so that the output has the same format as the input? This is what I am currently doing:
import numpy as np
import pandas as pd
X_trying=np.array([[" And my account number His Okay It is Arrow My name with a K Last name Is and another phone numbers That's okay it's just number Yes <unk> at Gmail Dot com that is a lower "],
["Hi Amber I'm relocating so I need a insurance card for my car First name <unk> last name is D key No for brand new isn't"]],
dtype='<U97064')
X_trying
#notice double quotes
array([[" And my account number His Okay It is Arrow My name with a K Last name Is and another phone numbers That's okay it's just number Yes <unk> at Gmail Dot com that is a lower "],
["Hi Amber I'm relocating so I need a insurance card for my car First name <unk> last name is D key No for brand new isn't"]],
dtype='<U97064')
df_for_abbreviations = pd.DataFrame(X_trying, columns = ['text'])#converting to a dataframe
df_for_abbreviations['text_lower']=df_for_abbreviations['text'].apply(lambda x:x.lower())#converting to lowercase so it works with dictionary
df_for_abbreviations["unabbreviated_text"] = df_for_abbreviations["text_lower"].replace(abbreviations_master, regex=True)
#then when I convert back to an ndarray, the format gets messed up: the quotes change from double to single, which causes problems in downstream code
x=df_for_abbreviations['unabbreviated_text'].to_numpy(dtype='<U97064').reshape(df_for_abbreviations.shape[0],1)
x
#notice that quotes change to single quotes
array([[' and my account number his okay it is arrow my name with a k last name is and another phone numbers that is okay it is just number yes <unk> at gmail dot com that is a lower '],
['hi amber i am relocating so i need a insurance card for my car first name <unk> last name is d key no for brand new is not']],
dtype='<U97064')
The single quotes affect the downstream output.
I have a dictionary of words that I would like to replace, as below:
abbreviations_master={}
abbreviations_master["i'm"]="i am"
abbreviations_master["it's"]="it is"
abbreviations_master["that's"]="that is"
abbreviations_master["don't"]="do not"
abbreviations_master["i'll"]="i will"
abbreviations_master["i've"]="i have"
abbreviations_master["we're"]="we are"
abbreviations_master["didn't"]="did not"
abbreviations_master["ma'am"]="madam"
abbreviations_master["you're"]="you are"
abbreviations_master["there's"]="there is "
abbreviations_master["let's"]="let us"
abbreviations_master["they're"]="they are"
abbreviations_master["can't"]="can not"
abbreviations_master["he's"]="he is"
abbreviations_master["doesn't"]="does not"
abbreviations_master["she's"]="she is"
abbreviations_master["what's"]="what is"
abbreviations_master["i'd"]="I would "
abbreviations_master["haven't"]="have not"
abbreviations_master["wasn't"]="was not"
abbreviations_master["we'll"]="we will"
abbreviations_master["won't"]="will not"
abbreviations_master["it'll"]="it will"
abbreviations_master["we've"]="we have"
abbreviations_master["wouldn't"]="would not"
abbreviations_master["that'd"]="that would "
abbreviations_master["you've"]="you have"
abbreviations_master["couldn't"]="could not"
abbreviations_master["that'll"]="that will"
abbreviations_master["y'all"]="you all"
abbreviations_master["isn't"]="is not"
abbreviations_master["it'd"]="it would"
abbreviations_master["would've"]="would have"
abbreviations_master["'cause"]="because"
abbreviations_master["hasn't"]="has not"
abbreviations_master["they've"]="they have"
abbreviations_master["you'll"]="you will"
abbreviations_master["here's"]="here is"
abbreviations_master["name's"]="name is"
abbreviations_master["shouldn't"]="should not"
abbreviations_master["wife's"]="?"
abbreviations_master["driver's"]="?"
abbreviations_master["they'll"]="they will"
abbreviations_master["everything's"]="?"
abbreviations_master["husband's"]="?"
abbreviations_master["there'll"]="there will"
abbreviations_master["should've"]="should have"
abbreviations_master["we'd"]="we would"
abbreviations_master["'bout"]="about"
abbreviations_master["she'll"]="she will"
abbreviations_master["he'll"]="he will"
abbreviations_master["you'd"]="you would"
abbreviations_master["one's"]="?"
abbreviations_master["who's"]="who has"
abbreviations_master["weren't"]="were not"
abbreviations_master["aren't"]="are not"
abbreviations_master["how's"]="how is"
abbreviations_master["how're"]="how are"
abbreviations_master["hadn't"]="had not"
You can use re.split to break the input into words while keeping the separators (since some of your examples start with a space), and check whether each word is in your dictionary; if it is not, just keep the word. The code below is not very elegant because your input is an np.array. If you can make it a simple list of strings, the code can be simplified. The output format is similar to the input:
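A minimal sketch of this approach, assuming a shortened version of the dictionary and inputs from the question. The helper name expand_abbreviations is hypothetical, and lowercasing the text first simply mirrors the question's own pipeline:

```python
import re
import numpy as np

# Shortened subset of the question's dictionary, for illustration only.
abbreviations_master = {
    "i'm": "i am",
    "it's": "it is",
    "that's": "that is",
    "isn't": "is not",
}

# Shortened inputs with the same (n, 1) shape as in the question.
X_trying = np.array([
    [" And my account number That's okay it's just number "],
    ["Hi Amber I'm relocating No for brand new isn't"],
])

def expand_abbreviations(text, mapping):
    # Hypothetical helper: lowercase, split on spaces, replace known words.
    # The capturing group "( )" makes re.split keep every space as its own
    # token, so joining the tokens restores the original spacing.
    tokens = re.split("( )", text.lower())
    # Replace a token if it is a key in the mapping, otherwise keep it.
    return "".join(mapping.get(tok, tok) for tok in tokens)

# Rebuild an array with one string per row, like the input.
result = np.array(
    [[expand_abbreviations(row[0], abbreviations_master)] for row in X_trying]
)
print(result)
```

The result keeps the (n, 1) shape of the input. When printed, the strings may still appear in single quotes: Python's repr chooses double quotes only for strings that contain an apostrophe, and the stored text itself has no surrounding quotes either way.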
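Note that the () in the split pattern are important: the capturing group makes re.split keep the separators in the result. If you have more separators, you can add them to the pattern, for example '( |\.|,)'. But your examples didn't have any other punctuation, so I didn't add it.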
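As a sketch of the extended pattern, assuming periods and commas are the only extra separators (the helper name expand_with_punctuation and the sample sentence are made up for illustration):

```python
import re

# Two entries from the question's dictionary, for illustration only.
abbreviations_master = {"it's": "it is", "don't": "do not"}

def expand_with_punctuation(text, mapping):
    # Hypothetical helper: split on space, period, or comma. The capturing
    # group keeps each separator as its own token in the result.
    tokens = re.split(r"( |\.|,)", text.lower())
    # Separators and unknown words pass through unchanged.
    return "".join(mapping.get(tok, tok) for tok in tokens)

example = expand_with_punctuation("It's fine, don't worry.", abbreviations_master)
print(example)
```

Without the extra separators in the pattern, a token such as "worry." would include the period, so a dictionary key followed directly by punctuation would not be matched.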