I have a tar.gz file that I am downloading from this link: http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html
What is the best way to fully integrate this TSV data into kedro, perhaps with an API dataset first, and then a node to extract it?
Tar.gz files are not among the dataset types that Kedro supports by default.
Indeed, trying to read it directly from pandas fails:
Your best bet in this case is to create a custom dataset that decompresses the file, and then use a normal pandas.CSVDataSet to read it from disk. It's very similar to my unfinished attempt at a KaggleDataSet: https://github.com/astrojuanlu/kedro-kaggle-dataset/tree/kaggle-fs

That might be too much work. Alternatively, you can have a separate process that downloads and untars the file, and let the Kedro project take care of the rest. The disadvantage is that Kedro would not cover your whole end-to-end data pipeline, but on the flip side it is simpler to start with.
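
For the simpler route, the separate download-and-untar step could look roughly like the sketch below. It uses only the Python standard library. The helper names download_file and extract_tar_gz are made up for illustration, and the archive URL is not given above (the link points to an HTML page), so you would have to fill in the direct link to the tar.gz yourself.

```python
import tarfile
import urllib.request
from pathlib import Path
from typing import List


def download_file(url: str, dest: Path) -> Path:
    """Download `url` to `dest` (hypothetical helper, not part of Kedro)."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_tar_gz(archive: Path, out_dir: Path) -> List[Path]:
    """Extract a .tar.gz into `out_dir` and return the extracted file paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    root = out_dir.resolve()
    extracted = []
    with tarfile.open(str(archive), "r:gz") as tar:
        for member in tar.getmembers():
            # Skip links and special files; only plain files and directories.
            if not (member.isfile() or member.isdir()):
                continue
            target = (root / member.name).resolve()
            # Refuse members that would end up outside out_dir.
            if target != root and root not in target.parents:
                raise ValueError("unsafe path in archive: " + member.name)
            tar.extract(member, str(root))
            if member.isfile():
                extracted.append(target)
    return extracted


# Example usage (the URL is a placeholder you need to supply):
# archive = download_file("<direct link to the .tar.gz>", Path("data/01_raw/lastfm-1K.tar.gz"))
# files = extract_tar_gz(archive, Path("data/01_raw/lastfm-1K"))
```

Once the TSV files are on disk, you should be able to register them in the catalog as a regular pandas.CSVDataSet, passing the tab separator through load_args (which are forwarded to pandas.read_csv). If you later want Kedro to own this step too, the same two functions could be called from the load method of a custom dataset.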