Description
I want to split a chunk of text into lines so that each line is a meaningful sentence, or at least so that no line gets mixed up with the lines around it. For example, I have this text:
515 Pacific Ave, Los Angeles, CA 90291, United States
(541) 754-3010 · [email protected]
MICHELLE LOPEZ, Fashion Designer
Expert Fashion Designer with 11+ years’ experience in women’s
high-end shoes. Launched product lines for Chanel and Gucci.
Designs showcased in Elle and Vogue. Attained recognition of
top seller fashionista in 2017.
The output should be:
(541) 754-3010 · [email protected]
MICHELLE LOPEZ, Fashion Designer
Expert Fashion Designer with 11+ years’ experience in women’s high-end shoes.
Launched product lines for Chanel and Gucci.
Designs showcased in Elle and Vogue. Attained recognition of top seller fashionista in 2017.
My approach
The text is separated by \n characters, but some of those characters fall in the middle of a sentence, after "women’s" (4th line) and after "of" (6th line), so splitting on \n is not reliable.
I then tried splitting on the period character. With that approach, the first lines are all joined together until the first full stop is found, so this solution is not reliable either.
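The two attempts above can be reproduced with a short sketch. It uses the example text from the description; the variable names `by_newline` and `by_period` and the regular expression are only illustrative choices, not part of any library.

```python
import re

# Example text from the description, with the original line breaks.
text = (
    "515 Pacific Ave, Los Angeles, CA 90291, United States\n"
    "(541) 754-3010 · [email protected]\n"
    "MICHELLE LOPEZ, Fashion Designer\n"
    "Expert Fashion Designer with 11+ years’ experience in women’s\n"
    "high-end shoes. Launched product lines for Chanel and Gucci.\n"
    "Designs showcased in Elle and Vogue. Attained recognition of\n"
    "top seller fashionista in 2017."
)

# Attempt 1: split on newlines. Sentences that were wrapped stay broken,
# e.g. the 4th line ends with "women’s" and the 6th line ends with "of".
by_newline = text.split("\n")

# Attempt 2: flatten the text, then split after every period.
# The address, contact line and name have no period, so they are all
# glued onto the first real sentence.
flat = text.replace("\n", " ")
by_period = [part for part in re.split(r"(?<=\.)\s+", flat) if part]

for part in by_period:
    print(part)
```

The first element of `by_period` starts with the address and only ends at "high-end shoes.", which is the problem described above.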
After that, I tried a machine-learning approach, tokenizing the sentences with nltk. The result had the same problem with the first lines, because this approach behaves much like splitting on the period character.
The code for nltk:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.tree import Tree
# Sample text
text = """1515 Pacific Ave, Los Angeles, CA 90291, United States
(541) 754-3010 · [email protected]
MICHELLE LOPEZ, Fashion Designer
Expert Fashion Designer with 11+ years’ experience in women’s
high-end shoes Launched product lines for Chanel and Gucci
Designs showcased in Elle and Vogue Attained recognition of
top seller fashionista in 2017"""
# Remove new line characters
text = text.replace("\n", " ")
# Tokenize into sentences
sentences = sent_tokenize(text)
# Tokenize into words
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
# Part-of-speech tagging
pos_tagged_sentences = [pos_tag(tokens) for tokens in tokenized_sentences]
# Named Entity Recognition (NER)
ner_tagged_sentences = [ne_chunk(pos_tag(tokens)) for tokens in tokenized_sentences]
# Helper function to convert tagged sentence to tree format
def to_tree(chunked_sentence):
    return Tree(chunked_sentence.label(), [to_tree(c) if isinstance(c, Tree) else c[0] for c in chunked_sentence])
# Print the reconstructed sentences
for sentence, ner_tagged_sentence in zip(sentences, ner_tagged_sentences):
    reconstructed_sentence = " ".join(to_tree(ner_tagged_sentence).leaves())
    print(reconstructed_sentence)
Expectations
So, I want to solve this with NLP. I have looked through all the pipelines on Hugging Face but, unfortunately, could not find one that does this task. I may need to train a model; so far I have tried different NER models, but they can only label the text, so they do not solve this.
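For comparison, a rule-based baseline for the behaviour described in this question might look like the sketch below. It is not a trained model and it rests on two assumptions that will not hold for every document: a line that starts with a lowercase letter is a continuation of the previous line, unless the previous line already ends with sentence-final punctuation. The function names `merge_wrapped_lines` and `split_sentences` are hypothetical helpers, not part of nltk or any Hugging Face pipeline.

```python
import re

# Assumption of this sketch: these characters mark the end of a sentence.
SENTENCE_END = (".", "!", "?")


def merge_wrapped_lines(text):
    """Join a line onto the previous one when it looks like a soft wrap."""
    merged = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        is_continuation = (
            bool(merged)
            and not merged[-1].endswith(SENTENCE_END)
            and line[0].islower()
        )
        if is_continuation:
            merged[-1] = merged[-1] + " " + line
        else:
            merged.append(line)
    return merged


def split_sentences(lines):
    """Split each merged line after sentence-final punctuation."""
    result = []
    for line in lines:
        result.extend(part for part in re.split(r"(?<=[.!?])\s+", line) if part)
    return result


text = (
    "515 Pacific Ave, Los Angeles, CA 90291, United States\n"
    "(541) 754-3010 · [email protected]\n"
    "MICHELLE LOPEZ, Fashion Designer\n"
    "Expert Fashion Designer with 11+ years’ experience in women’s\n"
    "high-end shoes. Launched product lines for Chanel and Gucci.\n"
    "Designs showcased in Elle and Vogue. Attained recognition of\n"
    "top seller fashionista in 2017."
)

merged_lines = merge_wrapped_lines(text)
sentences = split_sentences(merged_lines)
for sentence in sentences:
    print(sentence)
```

On the example text this keeps the address, contact line and name on their own lines and puts every sentence on a separate line, so it differs slightly from the expected output above, where the last two sentences share a line. The heuristic breaks as soon as a wrapped line continues with a capitalized word (a proper noun, for instance), which is why a learned approach is still what I am looking for.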