Cannot mimic manual document split in Azure programmatically using Azure SplitSkill


I am moving from a manual setup of my RAG solution in Azure to setting everything up programmatically with the Azure Python SDK. I have a container with a single PDF. In the manual setup, the document count under the created index is 401 when the chunk size is set to 256. When I use my custom skillset:

from azure.search.documents.indexes.models import (
    SplitSkill,
    InputFieldMappingEntry,
    OutputFieldMappingEntry,
)

split_skill = SplitSkill(
    name="split",
    description="Split skill to chunk documents",
    context="/document",
    text_split_mode="pages",
    default_language_code="en",
    maximum_page_length=300,  # why can't this be set to 256 if I can do that in a manual setup?
    page_overlap_length=30,
    inputs=[
        InputFieldMappingEntry(name="text", source="/document/content"),
    ],
    outputs=[
        OutputFieldMappingEntry(name="textItems", target_name="pages")
    ],
)

I get a document count of 271. I want to mimic my manual chunking setup as closely as possible, since it already performs well. What am I missing? Alternatively, could somebody point me to the default chunking setup that is used when the index is created manually?
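For reference, the rest of my programmatic setup projects one chunk into one search document, roughly as below. This is only a sketch: the skillset, index, and field names (pdf-chunking-skillset, my-index, parent_id, chunk) are placeholders, and the index-projection class names follow azure-search-documents 11.5 (older preview versions name them slightly differently).

from azure.search.documents.indexes.models import (
    SearchIndexerSkillset,
    SearchIndexerIndexProjection,
    SearchIndexerIndexProjectionSelector,
    SearchIndexerIndexProjectionsParameters,
    IndexProjectionMode,
    InputFieldMappingEntry,
)

# One search document per chunk, as the portal wizard does; this is why the
# index document count ends up equal to the chunk count.
skillset = SearchIndexerSkillset(
    name="pdf-chunking-skillset",  # placeholder name
    skills=[split_skill],
    index_projection=SearchIndexerIndexProjection(
        selectors=[
            SearchIndexerIndexProjectionSelector(
                target_index_name="my-index",  # placeholder index name
                parent_key_field_name="parent_id",
                source_context="/document/pages/*",
                mappings=[
                    InputFieldMappingEntry(name="chunk", source="/document/pages/*"),
                ],
            )
        ],
        parameters=SearchIndexerIndexProjectionsParameters(
            projection_mode=IndexProjectionMode.SKIP_INDEXING_PARENT_DOCUMENTS
        ),
    ),
)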

22 FEB EDIT

Answering @JayashankarGS, who commented:

"According to this doc, the minimum value you need to give is 300: learn.microsoft.com/en-us/azure/search/… Chunking in RAG is not the same as maximumPageLength in the split skill."

To me it looks like maximum_page_length is exactly the chunk size. But you are right: as of today, there is no way to select a chunk size of less than 300 with SplitSkill...



There is 1 answer below

Answer by JayashankarGS:

You can mimic the splitting; however, the text split skill has a minimum length limit of 300, which is not the case in your manual setup.

Since the text split skill doesn't accept a maximum_page_length of less than 300, you can split your documents with the LLM_RAG_CRACK_AND_CHUNK_AND_EMBED built-in component from the Azure ML registry instead, and then create an index on the dataset that component produces.
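A rough sketch of calling that component with the Azure ML Python SDK is below. The component's input and output names used here (input_data, chunk_size, chunk_overlap, embeddings) are assumptions; inspect the component signature in the azureml registry for the version you pick up.

from azure.ai.ml import MLClient, Input, dsl
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# Client for your workspace (runs the job) and for the shared "azureml"
# registry (hosts the built-in component).
ml_client = MLClient.from_config(credential=credential)
registry_client = MLClient(credential=credential, registry_name="azureml")

crack_chunk_embed = registry_client.components.get(
    name="llm_rag_crack_and_chunk_and_embed", label="latest"
)

@dsl.pipeline(default_compute="serverless")
def chunking_pipeline(pdf_folder):
    # Assumed parameter names; the component lets you choose a chunk size
    # below the split skill's floor of 300, e.g. the 256 from the manual setup.
    step = crack_chunk_embed(
        input_data=pdf_folder,
        chunk_size=256,
        chunk_overlap=30,
    )
    return {"embeddings": step.outputs.embeddings}

pdf_input = Input(type="uri_folder", path="<blob folder with your PDF>")  # placeholder path
job = ml_client.jobs.create_or_update(chunking_pipeline(pdf_folder=pdf_input))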

Refer to this Stack Overflow solution regarding LLM_RAG_CRACK_AND_CHUNK_AND_EMBED.