Using Python 3.7 and TensorFlow 2.9.
My goal is to read data stored as parquet files in AWS S3 in parallel, to reduce the time spent on I/O. As a first step I am trying to read parquet files from local disk, but I keep getting an InvalidArgumentError.
The code is as follows:
def read_pq_file(parquet_path):
    pq_file = pd.ParquetDataset(parquet_path)
    df = pq_file.read().to_pandas()
    X = df[x_features].astype('float32').values
    y = df[label].astype('float32').values
    yield (X, y)

def read_and_preprocess(parquet_path):
    dataset = tf.data.Dataset.from_generator(
        lambda: read_pq_file(parquet_path), (tf.float32, tf.float32))
    return dataset

# train_pq_path = [.... list of parquet file paths ..]
train_dataset_paths = tf.data.Dataset.from_tensor_slices(train_pq_path)
train_dataset = train_dataset_paths.interleave(
    read_and_preprocess,
    num_parallel_calls=2,
    cycle_length=2,
    deterministic=False,
).batch(batch_size)
model.fit(train_dataset)
Any help would be greatly appreciated as I've been stuck on this for a while.
Attempt 1:
I noticed that the tensors in train_dataset_paths hold each path as bytes rather than as a string (e.g. b'/path/to/parquet_file/'), so I tried changing the function to decode it:
def read_pq_file(parquet_path):
    print(parquet_path)
    pq_file = pd.ParquetDataset(parquet_path.numpy().decode())
    df = pq_file.read().to_pandas()
    X = df[x_features].astype('float32').values
    y = df[label].astype('float32').values
    yield (X, y)
but it fails with UnknownError: AttributeError: 'Tensor' object has no attribute 'numpy', even though tf.executing_eagerly() returns True. The print statement I added outputs Tensor("args_0:0", shape=(), dtype=string). Where does "args_0:0" come from, and why is a symbolic Tensor printed rather than one of the actual paths in train_dataset_paths?
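For reference, here is a minimal sketch of what seems to be going on and one possible workaround. As far as I understand, interleave traces its function as a graph, so inside it the path is a symbolic placeholder (which would explain the "args_0:0" name and the missing .numpy()), even though eager execution is on at the top level. Passing the path through the args parameter of from_generator should let TensorFlow evaluate it first and hand the generator plain bytes. The names read_rows, make_dataset, NUM_FEATURES and the file names are made up for illustration, and the generator fabricates rows instead of reading parquet so that the snippet runs on its own.

```python
import numpy as np
import tensorflow as tf

NUM_FEATURES = 3      # hypothetical feature count, for illustration only
eager_in_map_fn = []  # records tf.executing_eagerly() while interleave traces


def read_rows(path):
    # Values passed via `args` are evaluated before the generator is called,
    # so `path` is expected to arrive here as bytes, not a symbolic Tensor.
    path_str = path.decode()
    # Stand-in for the parquet read: fabricate four rows per "file".
    base = float(len(path_str))
    for i in range(4):
        X = np.full((NUM_FEATURES,), base + i, dtype=np.float32)
        y = np.float32(i)
        yield X, y


def make_dataset(path):
    # This function is traced as a graph, so `path` is a placeholder tensor
    # and calling path.numpy() here (or in a lambda closure) cannot work.
    eager_in_map_fn.append(tf.executing_eagerly())
    return tf.data.Dataset.from_generator(
        read_rows,
        args=(path,),  # forwarded to read_rows as an evaluated value
        output_signature=(
            tf.TensorSpec(shape=(NUM_FEATURES,), dtype=tf.float32),
            tf.TensorSpec(shape=(), dtype=tf.float32),
        ),
    )


paths = tf.data.Dataset.from_tensor_slices(["a.parquet", "bb.parquet"])
dataset = paths.interleave(
    make_dataset,
    cycle_length=2,
    num_parallel_calls=2,
    deterministic=False,
).batch(4)

batches = list(dataset.as_numpy_iterator())
print(eager_in_map_fn)  # eager flag as seen inside the traced function
print([X.shape for X, _ in batches])
```

In the real pipeline, read_rows would be replaced by the parquet-reading generator, with the decoded path passed to the parquet reader and the shapes in output_signature adjusted to the actual data.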