How to select chunk size of data for embedding with an LLM?


I have structured data (CSV) with a column of semantically rich text of variable length. I could preprocess the data so the CSV has a maximum length per row by using an LLM to summarize the rich text down to a maximum size. I'm using OpenAI GPT-3.5 Turbo.

Is it important to pick a chunk size that accommodates the maximum possible size of a row? Or does it matter very little, so that I can work with variable row sizes, select a median chunk size for my data, and let the LLM deal with some records being split across separate chunks?
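For context, here is a rough sketch of how the row-length distribution could be measured before choosing a chunk size (this assumes the tiktoken library; the file name is a placeholder):

```python
import csv
import statistics

import tiktoken  # tokenizer library used by OpenAI models

# cl100k_base is the encoding used by gpt-3.5-turbo and text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")

with open("data.csv", newline="") as f:  # placeholder file name
    rows = [", ".join(row) for row in csv.reader(f)]

token_counts = [len(enc.encode(row)) for row in rows]
print("max tokens per row:   ", max(token_counts))
print("median tokens per row:", statistics.median(token_counts))
```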


2 Answers

Lance Kind (accepted answer)

For CSV data, it's best to fit each row of data alone within a single chunk. For other kinds of data (non-CSV or non-record-based), this answer may not apply. It generalizes to all row-based data, independent of the CSV format.

Background: because the data is CSV, it's implied that the content within a row has a strong semantic relationship, and that there is little to no semantic relationship with the next or previous row; i.e., row ordering can be random because the rows are independent of each other.

So when generating embeddings for this kind of data, where the LLM is meant to generate responses from the semantic meaning of individual rows, the goal is for each row of the CSV to become one vector. When the LLM is then queried, it generates answers oriented around the semantic content of the relevant rows (which is the goal in this case), because each answer is grounded in whole records rather than fragments of them.
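As a minimal sketch of that strategy (assuming the OpenAI Python client v1 and the text-embedding-ada-002 model; the file name is a placeholder), each row is serialized as one string and embedded as exactly one vector:

```python
import csv

from openai import OpenAI  # openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("data.csv", newline="") as f:  # placeholder file name
    # One row -> one chunk -> one vector, so no record is split across chunks
    chunks = [", ".join(row) for row in csv.reader(f)]

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input=chunks,  # the endpoint accepts a batch of strings
)
vectors = [item.embedding for item in response.data]
```

The resulting vectors line up index-for-index with the original rows, which is what you want when the retriever hands whole records back to the LLM.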

For more background, Chunking Strategies for LLM Applications is a good source.

Nick Magnanini (preprocess.co)

It seems like chunk size is not relevant to your task. Chunk size matters for:

  • the maximum input length allowed by your embedding model (this varies a lot; 512 tokens is a common limit), if you are using vector/hybrid search
  • how many records it is optimal to load into the LLM prompt

The latter may be important because of your context window size and how good the LLM is at picking the relevant information out of the context.
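As a rough illustration of that second point (the packing logic below is only a sketch, not from the answer; tiktoken is assumed for token counting), you can greedily load the highest-ranked records into the prompt until a token budget is exhausted:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by gpt-3.5-turbo

def pack_records(records: list[str], budget: int = 3000) -> list[str]:
    """Keep the top-ranked records that fit within the prompt's token budget."""
    packed, used = [], 0
    for record in records:  # assumed sorted by retrieval score, best first
        cost = len(enc.encode(record))
        if used + cost > budget:
            break  # the next record would overflow the context budget
        packed.append(record)
        used += cost
    return packed
```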