How to avoid splitting on specific text block in LangChain


I'm using LangChain's RecursiveCharacterTextSplitter to split a string into chunks. Within this string is a substring which I can demarcate. I want this substring not to be split up, whether that means it becomes entirely its own chunk, is appended to the previous chunk, or is prepended to the next chunk. Is there a relatively simple way to do this?

For example, the following code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=2, separators=[' '], keep_separator=False)
nosplit = '<nosplit>Keep all this together, very important! Seriously though it is...<nosplit>'
text = 'Giggity! ' + nosplit + 'Ahh yeah...\nI just buy a jetski.'
chunks = splitter.split_text(text)
print(chunks)

Prints: ['Giggity!', 'Keep', 'all', 'this', 'together,', 'very', 'important!', 'Seriously', 'though', 'it', 'is...Ahh', 'yeah...\nI', 'just', 'buy a', 'jetski.']

Whereas I would like it to print: ['Giggity!', 'Keep all this together, very important! Seriously though it is...', 'Ahh', 'yeah...\nI', 'just', 'buy a', 'jetski.']

The only partial solution I have is to give <nosplit> priority in my separators list and then temporarily replace all other separators within the nosplit text with non-separator placeholders, restoring them afterwards. E.g.:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=2, separators=['<nosplit>', ' '], keep_separator=False)
nosplit = '<nosplit>Keep all this together, very important! Seriously though it is...<nosplit>'
space_word = 'x179lp'
nosplit = nosplit.replace(' ', space_word)
text = 'Giggity!' + nosplit + 'Ahh yeah...\nI just buy a jetski.'
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    chunks[i] = chunk.replace(space_word, ' ')
print(chunks)

The issue with this method is that I can't use character-level splitting, i.e. '', as a possible separator for the rest of the document (I'd like to use the default separators list along with <nosplit>: ['<nosplit>', '\n\n', '\n', ' ', '']).

Thank you!


1 Answer

Answered by Kostas Mouratidis

One-liner with re

>>> import re
>>> [
...     x for y in [
...         [chunk.replace("<nosplit>", "")] if chunk.startswith("<nosplit>")
...         else chunk.split()  # replace this with your splitter of choice
...         for chunk in re.split("(<nosplit>.*)<nosplit>", text)
...     ] for x in y
... ]

['Giggity!', 'Keep all this together, very important! Seriously though it is...', 'Ahh', 'yeah...', 'I', 'just', 'buy', 'a', 'jetski.']

Explanation

Split your text using a simple regex with a capture group, keeping the <nosplit> tag at the beginning of the captured block for easier filtering later:

>>> re.split("(<nosplit>.*)<nosplit>", text)
['Giggity! ', '<nosplit>Keep all this together, very important! Seriously though it is...', 'Ahh yeah...\nI just buy a jetski.']
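One caveat worth noting (an addition to the answer, not part of it): because `.*` is greedy, a text containing more than one <nosplit> block would have everything from the first tag to the last captured as a single block. A non-greedy `.*?` keeps the blocks separate. A quick sketch with a made-up two-block string:

```python
import re

# Hypothetical text with two protected blocks
text = "a <nosplit>X Y<nosplit> b <nosplit>Z W<nosplit> c"

# Greedy: one big capture spanning both blocks (and the text between them)
print(re.split("(<nosplit>.*)<nosplit>", text))
# ['a ', '<nosplit>X Y<nosplit> b <nosplit>Z W', ' c']

# Non-greedy: each block captured separately
print(re.split("(<nosplit>.*?)<nosplit>", text))
# ['a ', '<nosplit>X Y', ' b ', '<nosplit>Z W', ' c']
```

For the single-block text in the question, the greedy version behaves the same way.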

Iterate over every chunk. If the chunk starts with <nosplit>, put it in a list (useful for making unpacking simpler later) and remove the <nosplit> tag, otherwise split it with whatever method you prefer (I used str.split here):

[
    # 1-item list; also remove the <nosplit> tag
    [chunk.replace("<nosplit>", "")] if chunk.startswith("<nosplit>")
    # otherwise, split with LangChain or whatever you prefer
    else chunk.split()  # replace this with your splitter of choice
    for chunk in ...
]

[['Giggity!'], ['Keep all this together, very important! Seriously though it is...'], ['Ahh', 'yeah...', 'I', 'just', 'buy', 'a', 'jetski.']]

Now we have a list of lists, which you can flatten:

>>> [x for y in list_of_lists_from_above for x in y]

['Giggity!', 'Keep all this together, very important! Seriously though it is...', 'Ahh', 'yeah...', 'I', 'just', 'buy', 'a', 'jetski.']

I suggest splitting the above across multiple lines to improve readability. I did it in one because I am lazy.
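For instance, a multi-line sketch might look like the following. Here `split_fn` and `split_keeping_blocks` are hypothetical names; `str.split` stands in for the splitter, but you could pass a LangChain splitter's `split_text` method instead:

```python
import re

def split_keeping_blocks(text, split_fn, tag="<nosplit>"):
    """Split text with split_fn, but keep tag-delimited blocks whole."""
    # Non-greedy .*? so multiple protected blocks each stay separate;
    # DOTALL lets a block contain newlines.
    pattern = f"({re.escape(tag)}.*?){re.escape(tag)}"
    chunks = []
    for part in re.split(pattern, text, flags=re.DOTALL):
        if part.startswith(tag):
            chunks.append(part.replace(tag, ""))  # protected block: one chunk
        elif part:
            chunks.extend(split_fn(part))  # normal splitting
    return chunks

text = ("Giggity! <nosplit>Keep all this together, very important! "
        "Seriously though it is...<nosplit>Ahh yeah...\nI just buy a jetski.")
print(split_keeping_blocks(text, str.split))
```

With `str.split` as the splitter, this reproduces the answer's output above; swapping in `RecursiveCharacterTextSplitter(...).split_text` should give the chunking behavior the question asks for.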