Is there a way to list the files in a directory using PySpark in a notebook?


I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the filenames, and then pull a file if needed in a different cell. I can access the files fine using Cyberduck, and it shows the names there.

Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both of those functions work, so I know my keys are correct, but they take too long to load.


1 Answer

Answer by Daniel Argüelles:

Considering that the list of files can be retrieved by a single node, you can just list the files in the directory on the driver. Look at this response.
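A minimal sketch of one way to do that, assuming the storage is reachable through an HDFS-compatible connector for the "name://" scheme (the path is a placeholder, and sc._jvm is a private PySpark attribute, so details may vary between versions):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Reach the Hadoop FileSystem API through the JVM gateway; this only
    # touches directory metadata, no file contents are read.
    hadoop_fs = sc._jvm.org.apache.hadoop.fs
    conf = sc._jsc.hadoopConfiguration()

    # Placeholder path: substitute your own scheme and directory.
    path = hadoop_fs.Path("name:///mainfolder/date")
    fs = path.getFileSystem(conf)

    for status in fs.listStatus(path):
        print(status.getPath().toString(), status.isDirectory())

Because listStatus only reads metadata, it returns quickly even when the files themselves are very large.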

wholeTextFiles returns an RDD of (path, content) tuples, but I don't know whether the content is read lazily, so I can't say whether you could use it to fetch only the paths cheaply.
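For illustration, keeping only the path half of each tuple would look something like the sketch below (the path is a placeholder); as noted above, it is unclear whether Spark avoids reading the file bodies when doing this, so it may still be slow:

    # Keep only the path part of each (path, content) tuple.
    # Whether Spark still reads the file contents here depends on how lazily
    # wholeTextFiles materializes them, so this is not guaranteed to be cheap.
    rdd = sc.wholeTextFiles("name:///mainfolder/date/sectionsofdate")
    file_paths = rdd.keys().collect()
    for p in file_paths:
        print(p)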