Problem with SDV (Synthetic Data Vault): Getting back identical synthetic datasets


I'm using the following code from the SDV library to create a synthetic dataset with the same shape as my original dataset. Each synthetic dataset differs from the original, but all of the synthetic datasets are identical to each other. I expected the generation process to include some randomness, so that each output would be slightly different. This happens across sessions, even when I set a different random seed. How should I interpret what's happening?

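    from sdv.lite import SingleTablePreset
    from sdv.metadata import SingleTableMetadata
    # input_data is a pandas DataFrame holding the original table
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(data=input_data)
    synthesizer = SingleTablePreset(metadata=metadata, name='FAST_ML')
    synthesizer.fit(data=input_data)
    synthetic_data = synthesizer.sample(num_rows=len(input_data))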

Answer by Neha Patki

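I believe SDV synthesizers set a fixed internal seed when they run, which explains the determinism you're seeing: a freshly fitted synthesizer starts from the same random state each time, so its first sample comes out the same in every session. This is expected behavior.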

If you want different data, you can call the sample method multiple times on the same synthesizer. Each subsequent call should give you different data. In the code below, all 3 samples of synthetic data will be different.

    synthetic_data_1 = synthesizer.sample(num_rows=len(input_data))
    synthetic_data_2 = synthesizer.sample(num_rows=len(input_data))
    synthetic_data_3 = synthesizer.sample(num_rows=len(input_data))

For more info, see the sampling docs, in particular the reset_sampling method, which returns the synthesizer to its initial state.
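
To make the idea concrete, here is a toy sketch of how a fixed internal seed can produce this pattern. It uses only Python's standard library. `ToySynthesizer`, its `_INTERNAL_SEED` value, and its internals are hypothetical names made up for illustration; this is not SDV's actual implementation, only a model of the behavior described above.

```python
import random


class ToySynthesizer:
    """Hypothetical stand-in for a synthesizer with a fixed internal seed."""

    _INTERNAL_SEED = 0  # assumed value, chosen for illustration only

    def __init__(self):
        # A private generator: seeding the global `random` module does not touch it.
        self._rng = random.Random(self._INTERNAL_SEED)

    def sample(self, num_rows):
        # Each call advances the private generator's state.
        return [self._rng.random() for _ in range(num_rows)]

    def reset_sampling(self):
        # Return to the initial state, so the next sample repeats the first one.
        self._rng = random.Random(self._INTERNAL_SEED)


random.seed(123)  # an external seed, ignored by the toy synthesizer
first = ToySynthesizer()
random.seed(456)  # a different external seed, also ignored
second = ToySynthesizer()

a1 = first.sample(3)
a2 = first.sample(3)   # a later call on the same object
b1 = second.sample(3)  # first call on a fresh object

print(a1 == b1)  # True: fresh objects start from the same state
print(a1 == a2)  # False: repeated calls move on to new values

first.reset_sampling()
print(first.sample(3) == a1)  # True: back to the initial state
```

In this model, changing the external seed has no effect because the sampler never reads it, which would be consistent with what you observed across sessions.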

BTW the team is always looking for feedback. If you'd like more randomization options, you can file a feature request directly on GitHub.