Goal: First split a single token into two tokens, then use a SpanRuler to label the two re-tokenized tokens as a single span with one label.
Problem: The labeled span's text is the original text (a single token) rather than the two tokens joined by a separating space (i.e., the text after re-tokenization).
What I did:
I added a custom splitter component as the first pipeline stage. It correctly splits the single token into two tokens.
I then match the two (split) tokens with a SpanRuler. Note that the SpanRuler matches a pattern of two separate tokens (i.e., pattern=[{'TEXT': 'abc'}, {'TEXT': 'efg'}]) and correctly matches nothing if the pattern is the original single token (pattern='abcefg').
Note that the custom splitter respects spaCy's non-destructive retokenization.
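To illustrate that last point, here is a minimal sketch (using a blank pipeline instead of a trained model, and a hypothetical component name 'splitter_demo') that checks the document text and the tokens' trailing whitespace after the split:

```python
import spacy
from spacy.language import Language

@Language.component('splitter_demo')  # hypothetical name for this sketch
def splitter_demo(doc):
    with doc.retokenize() as retokenizer:
        # Split the first token into two subtokens (heads as in the
        # original example: both subtokens attach to the first one).
        retokenizer.split(doc[0], ['abc', 'efg'], heads=[doc[0], doc[0]])
    return doc

nlp = spacy.blank('en')  # no trained model needed for this check
nlp.add_pipe('splitter_demo')
doc = nlp('abcefg')

print(doc.text)                      # 'abcefg' -- original text is preserved
print([t.whitespace_ for t in doc])  # ['', ''] -- no space between subtokens
print(doc[0:2].text)                 # 'abcefg' -- span text mirrors doc.text
```

If this reading is right, the span over the two subtokens reconstructs its text from the unchanged doc.text and each token's (empty) trailing whitespace, which seems to be why no space appears.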
Thanks for any help.
Minimal Reproducible Example:
import spacy
from spacy.language import Language
@Language.component('splitter')
def splitter(doc):
    with doc.retokenize() as retokenizer:
        retokenizer.split(doc[0], ['abc', 'efg'], heads=[doc[0], doc[0]])
    return doc
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('splitter', first=True)
sp_ruler = nlp.add_pipe('span_ruler')
sp_ruler.add_patterns([{'label': 'testing', 'pattern': [{'TEXT': 'abc'}, {'TEXT': 'efg'}]}])
doc = nlp('abcefg')
print([(tok.text, i) for i, tok in enumerate(doc)])
print([(type(span), span.text, span.label_) for span in doc.spans["ruler"]])
print(len(doc.spans['ruler']))
Actual Output:
> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abcefg', 'testing')]
> 1
Expected output:
> [('abc', 0), ('efg', 1)]
> [(<class 'spacy.tokens.span.Span'>, 'abc efg', 'testing')] # notice the space in the text, expected due to custom re-tokenization
> 1