Tensorflow Data Pipeline: Read text files from sub-directories for Seq2Seq Models


I am trying to create a data pipeline using TensorFlow's text_dataset_from_directory method to train a sequence-to-sequence model. The folder structure is as follows:

BBC News Summary
   |_ News Articles
      |_Business
          |_001.txt, 002.txt
      |_Sports
          |_001.txt, 002.txt
   |_Summaries
      |_Business
          |_001.txt, 002.txt
      |_Sports
          |_ 001.txt, 002.txt

How do I create a TensorFlow pipeline that reads the text files from these folders, where News Articles is the input and Summaries is the target? I have tried the following, but it doesn't preserve the pairing between each input and its target:

from tensorflow.keras import utils

articles_path = "./BBC News Summary/News Articles/"
summary_path = "./BBC News Summary/Summaries/"

batch_size = 32
seed = 42

articles_data = utils.text_dataset_from_directory(
                    articles_path,
                    labels=None,
                    batch_size=batch_size,
                    validation_split=0.2,
                    subset='training',
                    seed=seed)

summary_data = utils.text_dataset_from_directory(
                    summary_path,
                    labels=None,
                    batch_size=batch_size,
                    validation_split=0.2,
                    subset='training',
                    seed=seed)

If I instead pass "./BBC News Summary" as the path, it reads all the files into a single training set and treats the top-level folders as class labels, rather than pairing each article with its summary. Your help is much appreciated. Thank you.
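One way I have considered (a sketch, not tested on the full dataset) is to sidestep text_dataset_from_directory entirely and build the article/summary pairs myself by matching relative paths under the two roots, since the filenames mirror each other. The helper name matched_file_pairs is my own invention; the snippet uses only the standard library and builds a tiny copy of the folder layout above to demonstrate the pairing:

```python
import pathlib
import tempfile

def matched_file_pairs(articles_root, summaries_root):
    """Return (article_path, summary_path) pairs whose relative paths
    match under the two roots, in sorted order so pairing is deterministic."""
    articles_root = pathlib.Path(articles_root)
    summaries_root = pathlib.Path(summaries_root)
    pairs = []
    for article in sorted(articles_root.rglob("*.txt")):
        summary = summaries_root / article.relative_to(articles_root)
        if summary.exists():
            pairs.append((str(article), str(summary)))
    return pairs

# Recreate the folder layout from the question in a temp directory.
root = pathlib.Path(tempfile.mkdtemp())
for split in ("News Articles", "Summaries"):
    for topic in ("Business", "Sports"):
        d = root / split / topic
        d.mkdir(parents=True)
        for name in ("001.txt", "002.txt"):
            (d / name).write_text(f"{split}/{topic}/{name}")

pairs = matched_file_pairs(root / "News Articles", root / "Summaries")
print(len(pairs))  # 4 matched (article, summary) pairs
```

From there, if this approach is sound, the two path lists could presumably be fed to tf.data.Dataset.from_tensor_slices((article_paths, summary_paths)) and mapped through tf.io.read_file to load the text lazily, which would guarantee the input/target indices stay aligned.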
