I have a CSV dataset with 160MM rows that I can't load directly with Pandas (not enough RAM). How could I draw a random 5% sample from the original dataset (in this case, a sample of roughly 8MM rows)? Any insight is appreciated. Cheers, Marcelo
I have tried using chunks, but it did not work.
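One way to do this without ever holding the full file in memory is to stream it line by line and keep each data row with probability 0.05. This is only a minimal sketch: the function name `sample_csv` and the file names in the usage note are made up for illustration, and it assumes the file has a single header line and no quoted fields containing embedded newlines. The result is roughly 5% of the rows, not exactly 5%.

```python
import random


def sample_csv(src_path, dst_path, fraction=0.05, seed=None):
    """Copy the header plus roughly `fraction` of the data rows to dst_path.

    Returns the number of data rows that were kept.
    """
    rng = random.Random(seed)  # pass a seed to make the sample reproducible
    kept = 0
    # newline="" keeps the original line endings untouched on read and write
    with open(src_path, "r", newline="") as src, \
            open(dst_path, "w", newline="") as dst:
        # Assumption: the first line is the header, so always keep it
        dst.write(src.readline())
        for line in src:
            # Keep each row independently with probability `fraction`
            if rng.random() < fraction:
                dst.write(line)
                kept += 1
    return kept
```

Something like `sample_csv("big.csv", "sample.csv", 0.05)` followed by `pd.read_csv("sample.csv")` should then fit in memory, since only one line is held at a time while sampling.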
If that's not fast enough, here's another approach:
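The original snippet isn't reproduced here, so what follows is only a sketch of how such an approach might look when driven from Python. The function name `sample_csv_with_shuf` and the file names are hypothetical. It assumes GNU coreutils `shuf` is on the PATH, that the CSV has a single header line, and that no quoted field contains a newline. Note that `shuf -n` takes an exact row count rather than a percentage, so for 5% of 160MM rows you would pass 8,000,000.

```python
import shutil
import subprocess


def sample_csv_with_shuf(src_path, dst_path, n_rows):
    """Write the header plus n_rows randomly chosen data rows to dst_path."""
    if shutil.which("shuf") is None:
        raise RuntimeError("shuf (GNU coreutils) was not found on PATH")
    # buffering=0 makes readline() consume exactly the header bytes, so the
    # child process starts reading at the first data row.
    with open(src_path, "rb", buffering=0) as src, \
            open(dst_path, "wb", buffering=0) as dst:
        # Assumption: the first line is the header; copy it through as-is
        dst.write(src.readline())
        # shuf reads the remaining rows from stdin and writes n_rows of them,
        # in random order, after the header in the output file
        subprocess.run(["shuf", "-n", str(n_rows)],
                       stdin=src, stdout=dst, check=True)
```

For the dataset in the question this would be called as something like `sample_csv_with_shuf("big.csv", "sample.csv", 8_000_000)`, after which `sample.csv` can be read with `pd.read_csv` as usual.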
It uses shuf (https://man7.org/linux/man-pages/man1/shuf.1.html), which is probably faster than anything you'd end up with in Python alone.