How to select chunk size of data for embedding with an LLM?


I have structured data (CSV) with a column of semantically rich text of variable length. I could preprocess the data so the CSV has a maximum length per row by using an LLM to summarize the rich text down to a maximum size. I'm using OpenAI GPT-3.5 Turbo.

Is it important to pick a chunk size that accommodates the maximum possible size of a row? Or does it matter very little, so that I can work with variable row sizes, select a median chunk size for my data, and let the LLM deal with some records being split across separate chunks?
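For context, here is a rough sketch of how the row-length distribution could be measured before choosing a chunk size (this assumes the tiktoken library; the file name is a placeholder):

```python
import csv
import statistics

import tiktoken  # tokenizer library used by OpenAI models

# cl100k_base is the encoding used by gpt-3.5-turbo and text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")

with open("data.csv", newline="") as f:  # placeholder file name
    rows = [", ".join(row) for row in csv.reader(f)]

token_counts = [len(enc.encode(row)) for row in rows]
print("max tokens per row:   ", max(token_counts))
print("median tokens per row:", statistics.median(token_counts))
```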


2 Answers

Lance Kind (accepted answer)

For CSV data, it's best to fit each row of data alone within a single chunk. For other kinds of data (non-CSV or non-record-based), this answer may not apply. It generalizes to all row-based data, independent of the CSV format.

Background: because the data is CSV, it's implied that the content within a row has a strong semantic relationship, and that there is little to no semantic relationship with the next or previous row; i.e., row ordering can be random because the rows are independent of each other.

So when generating embeddings for this kind of data, where the LLM is meant to generate responses from the semantic meaning of individual rows, the goal is for each row of the CSV to become one vector. When the LLM is then queried, it generates answers oriented around the semantic content of the relevant rows (which is the goal in this case), because each answer is grounded in whole records rather than fragments of them.
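As a minimal sketch of that strategy (assuming the OpenAI Python client v1 and the text-embedding-ada-002 model; the file name is a placeholder), each row is serialized as one string and embedded as exactly one vector:

```python
import csv

from openai import OpenAI  # openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("data.csv", newline="") as f:  # placeholder file name
    # One row -> one chunk -> one vector, so no record is split across chunks
    chunks = [", ".join(row) for row in csv.reader(f)]

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=chunks,  # the endpoint accepts a batch of strings
)
vectors = [item.embedding for item in response.data]
```

The resulting vectors line up index-for-index with the original rows, which is what you want when the retriever hands whole records back to the LLM.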

For more background, Chunking Strategies for LLM Applications is a good source.

Nick Magnanini (preprocess.co)

It seems like chunk size is not relevant to your task. Chunk size matters for:

  • the maximum input length allowed by your embedding model (this varies a lot; 512 tokens is a common limit), if you are using vector/hybrid search
  • how many records it is optimal to load into the LLM prompt

The latter may be important because of your context window size and how good the LLM is at picking the relevant information out of the context.
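As a rough illustration of that second point (the packing logic below is only a sketch, not from the answer; tiktoken is assumed for token counting), you can greedily load the highest-ranked records into the prompt until a token budget is exhausted:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo

def pack_records(records: list[str], budget: int = 3000) -> list[str]:
    """Keep the top-ranked records that fit within the prompt's token budget."""
    packed, used = [], 0
    for record in records:  # assumed sorted by retrieval score, best first
        cost = len(enc.encode(record))
        if used + cost > budget:
            break  # the next record would overflow the context budget
        packed.append(record)
        used += cost
    return packed
```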