I'm using langchain's RecursiveCharacterTextSplitter to split a string into chunks. Within this string is a substring which I can demarcate. I want this substring to not be split up, whether it ends up entirely as its own chunk, appended to the previous chunk, or prepended to the next chunk. Is there a relatively simple way to do this?
For example, the following code:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=2, separators=[' '], keep_separator=False)
nosplit = '<nosplit>Keep all this together, very important! Seriously though it is...<nosplit>'
text = 'Giggity! ' + nosplit + 'Ahh yeah...\nI just buy a jetski.'
chunks = splitter.split_text(text)
print(chunks)
Prints: ['Giggity!', 'Keep', 'all', 'this', 'together,', 'very', 'important!', 'Seriously', 'though', 'it', 'is...Ahh', 'yeah...\nI', 'just', 'buy a', 'jetski.']
Whereas I would like it to print: ['Giggity!', 'Keep all this together, very important! Seriously though it is...', 'Ahh', 'yeah...\nI', 'just', 'buy a', 'jetski.']
The only partial solution I have is to give <nosplit> priority in my separators list and then temporarily replace all other separators within the nosplit text with non-separator placeholders, and then put them back in. E.g.:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=5, chunk_overlap=2, separators=['<nosplit>', ' '], keep_separator=False)
nosplit = '<nosplit>Keep all this together, very important! Seriously though it is...<nosplit>'
space_word = 'x179lp'
nosplit = nosplit.replace(' ', space_word)
text = 'Giggity!' + nosplit + 'Ahh yeah...\nI just buy a jetski.'
chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
chunks[i] = chunk.replace(space_word, ' ')
print(chunks)
The issue with this method is that I can't use character-level splitting, i.e. '', as a possible separator for the rest of the document. I'd like to use the default separators list along with <nosplit>: ["<nosplit>", "\n\n", "\n", " ", ""].
Thank you!
One-liner with re
Explanation
Split your text using a simple regex including a capture group, and keep the <nosplit> in the beginning for easier filtering later.
Iterate over every chunk. If the chunk starts with <nosplit>, put it in a list (useful for making unpacking simpler later) and remove the <nosplit> tag; otherwise split it with whatever method you prefer (I used str.split here).
Now we have a list of lists, and you can just unpack it.
I suggest splitting the above into multiple lines to increase readability. I did it in one line because I am lazy.
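For reference, here is a multi-line sketch of the three steps described in the explanation. It is a reconstruction under assumptions, not the original one-liner: the function name `split_preserving` and its `split_fn` and `tag` parameters are hypothetical, and it assumes the `<nosplit>` tags always come in pairs.

```python
import re
from itertools import chain

TAG = "<nosplit>"  # hypothetical constant; matches the marker used in the question


def split_preserving(text, split_fn=str.split, tag=TAG):
    """Split text with split_fn, keeping tag-delimited spans as single chunks.

    Assumes tags come in pairs; an unpaired tag is left in the ordinary text.
    """
    # Step 1: the capture group makes re.split return the protected span too.
    # The opening tag is kept inside the group so protected pieces are easy
    # to recognise; the closing tag is consumed by the match.
    pattern = f"({re.escape(tag)}.*?){re.escape(tag)}"
    pieces = re.split(pattern, text, flags=re.DOTALL)

    # Step 2: wrap protected pieces in a one-element list (tag stripped),
    # and split every other piece with the chosen method.
    nested = [
        [piece[len(tag):]] if piece.startswith(tag) else split_fn(piece)
        for piece in pieces
    ]

    # Step 3: flatten the list of lists into a single list of chunks.
    return list(chain.from_iterable(nested))


nosplit = "<nosplit>Keep all this together, very important! Seriously though it is...<nosplit>"
text = "Giggity! " + nosplit + "Ahh yeah...\nI just buy a jetski."
print(split_preserving(text))
```

Note that `str.split` splits on any whitespace, so `'yeah...\nI'` comes out as two chunks here. To get the recursive behaviour with the default separators for the unprotected text, it should be possible to pass the splitter's own method instead, e.g. `split_preserving(text, split_fn=splitter.split_text)`, though I have not checked how it handles empty pieces.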