I'm using pandas with read_csv because the dataset is available as TSV.
Dataset link: https://datasets.imdbws.com/
(I'm using name.basics.tsv in the example below; details on each file: https://www.imdb.com/interfaces/)
What's my problem? When I started this work, I decided it would be better to store all my processing in MongoDB.
Now, near the end of the work, I've run into a problem that is blocking me from building some dashboards with Metabase.
I need to make sure that, either when reading the TSV or when building the DataFrame before loading into MongoDB, certain fields are created as arrays, because they contain more than one piece of information separated by commas.
I know this can be done inside MongoDB (manually changing the field type to array), but how can I do it from Python? I made several attempts at converting the fields, without success. Here is how I'm currently reading the file.
Some fields in the dataset contain more than one value, and for the plotting solution I'm using it would help a lot if those fields were arrays of strings (example below).
import pandas as pd
from pymongo import MongoClient
import numpy as np
from ast import literal_eval

mc = MongoClient("mongodb://localhost:27017/")
db = mc["imdbdash"]  # bracket access to the database, not a private method

file = "../IMDBDATASET/name.basics.tsv"
dfname = pd.read_csv(filepath_or_buffer=file,
                     sep='\t',
                     # dtype={'primaryProfession': np.array, 'knownForTitles': 'np.array'},
                     # converters={'primaryProfession': literal_eval, 'knownForTitles': literal_eval},
                     low_memory=False,
                     header=0,
                     na_values=["\\N", "nan", "NaN"],
                     keep_default_na=False,
                     na_filter=True,
                     verbose=False)
knownForTitles
0 tt0072308,tt0045537,tt0053137,tt0050419
1 tt0117057,tt0037382,tt0038355,tt0071877
2 tt0049189,tt0056404,tt0057345,tt0054452
3 tt0072562,tt0077975,tt0078723,tt0080455
4 tt0060827,tt0083922,tt0050976,tt0050986
... ...
12169467 tt11657662,tt2455546,tt14069590
12169468 NaN
12169469 tt8736744
12169470 NaN
12169471 NaN
------------------
primaryProfession
0 soundtrack, actor, miscellaneous
1 actress, sound track
2 actress, soundtrack, music_department
3 actor, soundtrack, writer
4 writer, director, actor
... ...
12169467 animation_department,art_department
12169468
12169469 cinematographer
12169470 cinematographer
12169471
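To clarify the transformation I'm after, here is a minimal sketch on a tiny DataFrame mimicking the two columns above. The column names come from my data; mapping NaN to an empty list (so every document still gets an array field) is just one assumption about how missing values should be handled:

```python
import numpy as np
import pandas as pd

# Small sample shaped like the real columns shown above.
df = pd.DataFrame({
    "knownForTitles": ["tt0072308,tt0045537", np.nan],
    "primaryProfession": ["soundtrack, actor", np.nan],
})

for col in ["knownForTitles", "primaryProfession"]:
    # Split each comma-separated string into a Python list; NaN becomes [].
    df[col] = df[col].fillna("").str.split(",")
    # Trim stray whitespace and drop empty tokens.
    df[col] = df[col].apply(lambda parts: [p.strip() for p in parts if p.strip()])

print(df["knownForTitles"].tolist())   # [['tt0072308', 'tt0045537'], []]
print(df["primaryProfession"].tolist())  # [['soundtrack', 'actor'], []]
```

Once the columns hold Python lists, something like `db.names.insert_many(df.to_dict("records"))` should store them as BSON arrays, since pymongo maps Python lists to arrays.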
I thought about switching databases and moving everything to MySQL, but I was hoping not to have to redo all the work.
MongoDB has the $lookup operator for making relationships between collections, but for that I need some fields to be saved as arrays.
The pandas documentation (https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) says that dtype can be a dict, but the field isn't stored in a format a dtype can express, and these are huge files; this one alone has about 12 million records.
Does anyone have an idea how to do this?
I've already tried:
converters={'primaryProfession': ast.literal_eval,'knownForTitles': ast.literal_eval},
test = np.array(dfname['knownForTitles'].tolist())
"','".join(map(str,dfname['knownForTitles']))
dtype={'primaryProfession': np.array , 'knownForTitles':'np.array'}
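For what it's worth, I believe the literal_eval converter fails because the raw cell values are not Python literals, just bare comma-separated tokens (a quick check, using a value copied from the data above):

```python
from ast import literal_eval

value = "tt0072308,tt0045537"  # raw cell value: no quotes, no brackets

try:
    # Parses as a tuple of bare names, which are not literals.
    literal_eval(value)
    outcome = "ok"
except (ValueError, SyntaxError) as exc:
    outcome = type(exc).__name__

print(outcome)            # the converter raises instead of returning a list
print(value.split(","))   # ['tt0072308', 'tt0045537'] -- a plain split works
```

Similarly, dtype only accepts things like str or numpy dtypes, so np.array is not a valid entry there.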