How to avoid splitting on specific text block in LangChain


I'm using LangChain's RecursiveCharacterTextSplitter to split a string into chunks. Within this string is a substring which I can demarcate. I want this substring not to be split up, whether that means it becomes entirely its own chunk, is appended to the previous chunk, or is prepended to the next chunk. Is there a relatively simple way to do this?

For example, the following code:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=2, separators=[' '], keep_separator=False)
nosplit = '<nosplit>Keep all this together, very important! Seriously though it is...<nosplit>'
text = 'Giggity! ' + nosplit + 'Ahh yeah...\nI just buy a jetski.'
chunks = splitter.split_text(text)
print(chunks)

Prints: ['Giggity!', 'Keep', 'all', 'this', 'together,', 'very', 'important!', 'Seriously', 'though', 'it', 'is...Ahh', 'yeah...\nI', 'just', 'buy a', 'jetski.']

Whereas I would like it to print: ['Giggity!', 'Keep all this together, very important! Seriously though it is...', 'Ahh', 'yeah...\nI', 'just', 'buy a', 'jetski.']

The only partial solution I have is to give <nosplit> priority in my separators list and then temporarily replace all other separators within the nosplit text with non-separator placeholders, restoring them afterwards. E.g.:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=2, separators=['<nosplit>', ' '], keep_separator=False)
nosplit = '<nosplit>Keep all this together, very important! Seriously though it is...<nosplit>'
space_word = 'x179lp'
nosplit = nosplit.replace(' ', space_word)
text = 'Giggity!' + nosplit + 'Ahh yeah...\nI just buy a jetski.'
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    chunks[i] = chunk.replace(space_word, ' ')
print(chunks)

The issue with this method is that I can't use character-level splitting, i.e. '', as a possible separator for the rest of the document (I'd like to use the default separators list along with <nosplit>: ['<nosplit>', '\n\n', '\n', ' ', '']).

Thank you!


1 Answer

Answered by Kostas Mouratidis

One-liner with re

>>> import re
>>> [
...     x for y in [
...         [chunk.replace("<nosplit>", "")] if chunk.startswith("<nosplit>")
...         else chunk.split()  # replace this with your splitter of choice
...         for chunk in re.split("(<nosplit>.*)<nosplit>", text)
...     ] for x in y
... ]

['Giggity!', 'Keep all this together, very important! Seriously though it is...', 'Ahh', 'yeah...', 'I', 'just', 'buy', 'a', 'jetski.']

Explanation

Split your text using a simple regex with a capture group, keeping the <nosplit> tag at the beginning of the captured block for easier filtering later:

>>> re.split("(<nosplit>.*)<nosplit>", text)
['Giggity! ', '<nosplit>Keep all this together, very important! Seriously though it is...', 'Ahh yeah...\nI just buy a jetski.']
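One caveat worth noting (an addition to the answer, not part of it): because `.*` is greedy, a text containing more than one <nosplit> block would have everything from the first tag to the last captured as a single block. A non-greedy `.*?` keeps the blocks separate. A quick sketch with a made-up two-block string:

```python
import re

# Hypothetical text with two protected blocks
text = "a <nosplit>X Y<nosplit> b <nosplit>Z W<nosplit> c"

# Greedy: one big capture spanning both blocks (and the text between them)
print(re.split("(<nosplit>.*)<nosplit>", text))
# ['a ', '<nosplit>X Y<nosplit> b <nosplit>Z W', ' c']

# Non-greedy: each block captured separately
print(re.split("(<nosplit>.*?)<nosplit>", text))
# ['a ', '<nosplit>X Y', ' b ', '<nosplit>Z W', ' c']
```

For the single-block text in the question, the greedy version behaves the same way.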

Iterate over every chunk. If the chunk starts with <nosplit>, put it in a list (useful for making unpacking simpler later) and remove the <nosplit> tag, otherwise split it with whatever method you prefer (I used str.split here):

[
    # 1-item list; also remove the <nosplit> tag
    [chunk.replace("<nosplit>", "")] if chunk.startswith("<nosplit>")
    # otherwise, split with LangChain or whatever you prefer
    else chunk.split()  # replace this with your splitter of choice
    for chunk in ...
]

[['Giggity!'], ['Keep all this together, very important! Seriously though it is...'], ['Ahh', 'yeah...', 'I', 'just', 'buy', 'a', 'jetski.']]

Now we have a list of lists, which you can flatten:

>>> [x for y in list_of_lists_from_above for x in y]

['Giggity!', 'Keep all this together, very important! Seriously though it is...', 'Ahh', 'yeah...', 'I', 'just', 'buy', 'a', 'jetski.']

I suggest splitting the above across multiple lines to improve readability. I did it in one because I am lazy.
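For instance, a multi-line sketch might look like the following. Here `split_fn` and `split_keeping_blocks` are hypothetical names; `str.split` stands in for the splitter, but you could pass a LangChain splitter's `split_text` method instead:

```python
import re

def split_keeping_blocks(text, split_fn, tag="<nosplit>"):
    """Split text with split_fn, but keep tag-delimited blocks whole."""
    # Non-greedy .*? so multiple protected blocks each stay separate;
    # DOTALL lets a block contain newlines.
    pattern = f"({re.escape(tag)}.*?){re.escape(tag)}"
    chunks = []
    for part in re.split(pattern, text, flags=re.DOTALL):
        if part.startswith(tag):
            chunks.append(part.replace(tag, ""))  # protected block: one chunk
        elif part:
            chunks.extend(split_fn(part))  # normal splitting
    return chunks

text = ("Giggity! <nosplit>Keep all this together, very important! "
        "Seriously though it is...<nosplit>Ahh yeah...\nI just buy a jetski.")
print(split_keeping_blocks(text, str.split))
```

With `str.split` as the splitter, this reproduces the answer's output above; swapping in `RecursiveCharacterTextSplitter(...).split_text` should give the chunking behavior the question asks for.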