I am trying to use the tokenizer from openai-whisper as the tokenizer for spaCy.
I'm doing the following, but it gives errors. What is the correct way to use the Whisper tokenizer as a custom tokenizer in spaCy?
import spacy
import en_core_web_sm
from whisper.tokenizer import Tokenizer, get_tokenizer

nlp = en_core_web_sm.load()
# assigning the Tokenizer class directly; this is what errors once the pipeline runs
nlp.tokenizer = Tokenizer
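From the spaCy docs I understand nlp.tokenizer has to be a callable that takes a string and returns a Doc, not a class, so I assume it needs a wrapper along these lines (untested sketch; the WhisperSpacyTokenizer name and the decode-one-id-at-a-time approach are my own):

import en_core_web_sm
from spacy.tokens import Doc
from whisper.tokenizer import get_tokenizer

class WhisperSpacyTokenizer:
    """Wrap the whisper BPE tokenizer so spaCy can call it."""

    def __init__(self, vocab, whisper_tokenizer):
        self.vocab = vocab
        self.whisper_tokenizer = whisper_tokenizer

    def __call__(self, text):
        # Encode to BPE ids, then decode each id back to its surface string.
        # Caveat: decoding ids one at a time can break on multi-byte characters
        # that whisper splits across several BPE tokens.
        ids = self.whisper_tokenizer.encode(text)
        words = [self.whisper_tokenizer.decode([i]) for i in ids]
        # spaCy reconstructs doc.text from the words plus the spaces flags, so
        # keeping the leading spaces inside the words and setting every spaces
        # flag to False should preserve the original text exactly.
        spaces = [False] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)

nlp = en_core_web_sm.load()
nlp.tokenizer = WhisperSpacyTokenizer(nlp.vocab, get_tokenizer(multilingual=False))

(I realize en_core_web_sm was trained on spaCy's own tokenization, so its accuracy on BPE pieces may suffer, but that is a separate concern from the errors.)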
The issue is that when openai-whisper tokenizes " Toby makes fun", it treats the leading " " (space) as part of a token. The spaCy tokenizer also recognizes the space, but when finding entities it discards any surrounding whitespace. Here the entity output is "Toby", but I need it to be " Toby":
nlp = en_core_web_sm.load()
ref_doc = nlp(" Toby makes fun")
print([(X.text, X.label_) for X in ref_doc.ents])
# ref_doc.text == " Toby makes fun"
# but ref_doc.ents[0].text == "Toby"
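As a workaround I can re-expand each entity span to include the preceding whitespace token, but I would rather the tokenizer handle it. A sketch of that workaround, assuming the default tokenizer emits the leading space as its own whitespace token:

for ent in ref_doc.ents:
    # If the token just before the entity is pure whitespace, extend the
    # span backwards so the surface text keeps the leading space.
    if ent.start > 0 and ref_doc[ent.start - 1].is_space:
        span = ref_doc[ent.start - 1 : ent.end]
    else:
        span = ent
    print(repr(span.text), ent.label_)  # prints something like (' Toby', 'PERSON')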