Sentence segmentation rule not working as expected


I have created my own simple sentence segmentation rule to sentencize on a new line (and keep the default segmentation rules as well):

import spacy
nlp = spacy.load('en_core_web_sm')
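def set_custom_boundaries(doc):
    # Mark the token that follows any newline token as a sentence start.
    # doc[:-1] skips the last token, which has no following token to mark.
    for token in doc[:-1]:
        # startswith('\n') also covers the exact match '\n'
        if token.text.startswith('\n'):
            doc[token.i+1].is_sent_start = True
    return doc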

nlp.add_pipe(set_custom_boundaries, before='parser')
nlp.pipe_names
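
To make it clearer what this rule does and does not decide, here is a minimal spaCy-free sketch of the same logic on a plain list of token strings. The helper name `newline_sent_starts`, the `True`/`None` flags (meant to mirror a forced start versus an undecided token) and the token list itself are assumptions for illustration only; the real tokenizer may split the line differently.

```python
from typing import List, Optional


def newline_sent_starts(tokens: List[str]) -> List[Optional[bool]]:
    """Return one flag per token: True = forced sentence start, None = undecided."""
    flags: List[Optional[bool]] = [None] * len(tokens)
    # Same loop shape as set_custom_boundaries: the last token is skipped
    # because there is no following token to mark.
    for i, text in enumerate(tokens[:-1]):
        if text.startswith("\n"):
            flags[i + 1] = True
    return flags


# Hypothetical tokenization of part of the problem line (an assumption,
# not the actual output of the spaCy tokenizer).
tokens = ["Mobile", ":", "+44", "(", "0)793", "990", "2594", "\n", "Reception", ":"]
flags = newline_sent_starts(tokens)
print(flags)
```

In this sketch only the token after the newline is forced to start a sentence; every other token stays undecided. The rule never forbids a boundary, so a split between the two halves of the mobile number would have to come from somewhere else in the pipeline, presumably the parser that runs after this component.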

This works fine in most cases, but one input keeps causing trouble:

doc = nlp(u'"Management is doing things right; leadership is doing the right things." -Peter Drucker.\n\n Please M60 6ES!\n Mobile: +44 (0)793 990 2594\nReception: +44 (0)161 296 8956\n\n')

This produces the following output, which I cannot make sense of:

"Management is doing things right; leadership is doing the right things."
-Peter Drucker.


Please M60 6ES!

Mobile: +44 (0)793
990 2594

Reception: +44 (0)161 296 8956 

I would expect the mobile number to stay in a single sentence (like the reception number), like this:

"Management is doing things right; leadership is doing the right things."
-Peter Drucker.


Please M60 6ES!

Mobile: +44 (0)793 990 2594


Reception: +44 (0)161 296 8956

But no matter what I try, "990 2594" won't join up with "+44 (0)793". Is this caused by some default spaCy rule?

Can you please help?
