tensorflow tf.data.Dataset.interleave() outputs InvalidArgumentError: TypeError: not a path-like object


Using Python 3.7 and TensorFlow 2.9.

My goal is to read data stored as Parquet files in AWS S3 in parallel to reduce I/O time. As a first step I am reading Parquet files from local disk, but I keep hitting an InvalidArgumentError.

The code is as follows:

def read_pq_file(parquet_path):
    pq_file = pd.ParquetDataset(parquet_path)
    df = pq_file.read().to_pandas()
    X = df[x_features].astype('float32').values
    y = df[label].astype('float32').values
    yield (X, y)

def read_and_preprocess(parquet_path):
    dataset = tf.data.Dataset.from_generator(
        lambda: read_pq_file(parquet_path), (tf.float32, tf.float32))
    return dataset  # interleave expects the mapped function to return a Dataset

# train_pq_path = [.... list of parquet file paths ..]
train_dataset_paths = tf.data.Dataset.from_tensor_slices(train_pq_path)
train_dataset = train_dataset_paths.interleave(read_and_preprocess, num_parallel_calls=2, cycle_length=2, deterministic=False).batch(batch_size)

model.fit(train_dataset)

Any help would be greatly appreciated as I've been stuck on this for a while.

Attempt 1:

I noticed that the tensors in train_dataset_paths hold each path as bytes rather than as a string (e.g. b'/path/to/parquet_file/'), so I tried changing the function to decode it:

def read_pq_file(parquet_path):
    print(parquet_path)
    pq_file = pd.ParquetDataset(parquet_path.numpy().decode())
    df = pq_file.read().to_pandas()
    X = df[x_features].astype('float32').values
    y = df[label].astype('float32').values
    yield (X, y)

but it raises UnknownError: AttributeError: 'Tensor' object has no attribute 'numpy', even though tf.executing_eagerly() returns True. The print statement I added outputs Tensor("args_0:0", shape=(), dtype=string). Where does "args_0:0" come from, and why is a Tensor printed rather than one of the paths in train_dataset_paths?
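
For context on the symptom: tf.data traces the function passed to interleave into a graph, even when eager execution is on, so parquet_path inside read_and_preprocess is a symbolic placeholder (hence the name "args_0:0") with no .numpy() method. One possible workaround, sketched below, is to forward the tensor through the args= parameter of from_generator instead of capturing it in a lambda; the generator then receives the path as a plain bytes object. This is only a sketch under assumptions: load_arrays is a hypothetical stand-in that returns dummy arrays so the example runs without any Parquet files, and the feature width of 3 and the file names are made up.

```python
import numpy as np
import tensorflow as tf


def load_arrays(path):
    # Hypothetical stand-in for the Parquet read in the question.
    # Returns dummy (X, y) arrays whose row count depends on the path length.
    n = len(path)
    X = np.full((n, 3), float(n), dtype=np.float32)
    y = np.zeros((n,), dtype=np.float32)
    return X, y


def read_pq_file(parquet_path):
    # Values forwarded via from_generator(args=...) arrive as NumPy values,
    # so a scalar string tensor shows up here as a bytes object.
    path = parquet_path.decode("utf-8")
    X, y = load_arrays(path)
    yield X, y


def read_and_preprocess(parquet_path):
    # parquet_path is a symbolic string tensor at this point; pass it on
    # through args= rather than calling .numpy() on it.
    return tf.data.Dataset.from_generator(
        read_pq_file,
        args=(parquet_path,),
        output_signature=(
            tf.TensorSpec(shape=(None, 3), dtype=tf.float32),
            tf.TensorSpec(shape=(None,), dtype=tf.float32),
        ),
    )


# Made-up file names; nothing is read from disk in this sketch.
paths = ["a.parquet", "bb.parquet"]
dataset = tf.data.Dataset.from_tensor_slices(paths).interleave(
    read_and_preprocess,
    cycle_length=2,
    num_parallel_calls=2,
    deterministic=False,
)

# Collect the feature shapes; order is not guaranteed, so sort them.
shapes = sorted(X.numpy().shape for X, _ in dataset)
```

In the real pipeline, load_arrays would be replaced by the Parquet read from the question, and the shapes in output_signature would need to match the actual feature columns.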
