In Kedro, how to handle tar.gz archives from the web

I have a tar.gz file that I am downloading from this link: http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html

What is the best way to fully integrate this TSV data into Kedro, perhaps with an API dataset first, and then a node to extract it?

Kedro does not support tar.gz files as a built-in dataset type.

Answered by astrojuanlu

Indeed, trying to read it directly with pandas fails:

>>> import pandas as pd
>>> pd.read_csv("http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz", sep="\t")
...
ValueError: Multiple files found in TAR archive. Only one file per TAR archive: ['lastfm-dataset-1K', 'lastfm-dataset-1K/userid-profile.tsv', 'lastfm-dataset-1K/README.txt', 'lastfm-dataset-1K/userid-timestamp-artid-artname-traid-traname.tsv']

Your best bet in this case is to create a custom dataset that decompresses the archive, and then use a normal pandas.CSVDataSet to read the extracted TSV files from disk.

Such a custom dataset would be very similar to my unfinished attempt at a KaggleDataSet: https://github.com/astrojuanlu/kedro-kaggle-dataset/tree/kaggle-fs
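A minimal sketch of the custom dataset idea, assuming Kedro 0.19 or later (where the base class is spelled `AbstractDataset`) and an archive that has already been downloaded to disk. The class name `TarGzDataset`, its `extract_dir` parameter, and the choice to return a dict of extracted paths are hypothetical and not part of Kedro. The fallback to `object` is only there so the snippet can be tried without Kedro installed.

```python
import shutil
import tarfile
from pathlib import Path

try:
    from kedro.io import AbstractDataset  # spelling used by Kedro >= 0.19
except ImportError:
    # Assumption: Kedro is not installed (or uses the older AbstractDataSet
    # name); fall back so the sketch still runs on its own.
    AbstractDataset = object


class TarGzDataset(AbstractDataset):
    """Hypothetical read-only dataset that unpacks a local .tar.gz archive."""

    def __init__(self, filepath: str, extract_dir: str):
        self._filepath = Path(filepath)
        self._extract_dir = Path(extract_dir)

    def _load(self) -> dict:
        """Extract every regular file and return {file name: extracted path}."""
        self._extract_dir.mkdir(parents=True, exist_ok=True)
        extracted = {}
        with tarfile.open(self._filepath, "r:gz") as archive:
            for member in archive:
                if not member.isfile():
                    continue  # skip directories, links, devices
                # Keep only the base name so nothing lands outside extract_dir.
                target = self._extract_dir / Path(member.name).name
                with archive.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                extracted[target.name] = target
        return extracted

    def _save(self, data) -> None:
        raise NotImplementedError("TarGzDataset is read-only in this sketch")

    def _describe(self) -> dict:
        return {
            "filepath": str(self._filepath),
            "extract_dir": str(self._extract_dir),
        }
```

A node receiving the returned dict could pass the path of the TSV it needs to pandas, or the extracted files could be read by ordinary catalog entries in a later run. Flattening members to their base names avoids writing outside `extract_dir`, at the cost of collisions if two members share a file name.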

That might be too much work. Alternatively, you can have a separate process that downloads and untars the file, and let the Kedro project take care of the rest. The disadvantage is that Kedro would not cover your whole end-to-end data pipeline, but on the flip side it is simpler to start with.
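The separate download step could be as small as the following standard-library script. The function name `download_and_untar` and the `data/01_raw` target directory mentioned below are assumptions for illustration; the URL is the one from the pandas call above.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URL taken from the pandas example earlier in this answer.
ARCHIVE_URL = "http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-1K.tar.gz"


def download_and_untar(url: str, raw_dir: str) -> list:
    """Download a .tar.gz archive and extract its .tsv members into raw_dir."""
    raw = Path(raw_dir)
    raw.mkdir(parents=True, exist_ok=True)

    # Save the archive under its original file name.
    archive_path = raw / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(archive_path, "wb") as out:
        shutil.copyfileobj(response, out)

    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not (member.isfile() and member.name.endswith(".tsv")):
                continue  # ignore directories, the README, and anything else
            # Keep only the base name so nothing is written outside raw_dir.
            target = raw / Path(member.name).name
            with archive.extractfile(member) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            extracted.append(target)
    return sorted(extracted)
```

Calling `download_and_untar(ARCHIVE_URL, "data/01_raw")` once before `kedro run` would leave the TSV files on disk, where a `pandas.CSVDataSet` catalog entry with `sep: "\t"` under `load_args` should be able to read them.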