Data-cleaning file names in the iBooks directory with Python


I'm trying to print a list of all the files in a specified directory that end in .pdf.

Once that is running, I want to expand it to print the number of files that are named "unnamed document" or end in .pdf.pdf.pdf, which is a common problem among the 1,200 or so books I've collected in iBooks.

After it prints the .pdf files, I'm trying to get it to trim off the excess .pdf extensions and somehow prompt me to rename each file, which I'll have to do manually after reviewing the first few pages of each "unnamed document".

While I would love to have all the code spelled out for me, I would appreciate even more some hints or tips on how to go about learning to do this myself.

I found the directory path on this page: https://www.idownloadblog.com/2018/05/24/ibooks-library-location-mac/

The starting script came from here: Get list of pdf files in folder.

Running it currently gives me EOF errors and type errors, so I'm asking for help on how to structure or revise this script as the start of a larger data-cleaning project.

While this can be done with regex (see Remove duplicate filename extensions), I would prefer a plain-Python solution.

Thanks!

First version

#!/usr/bin/env python3

import os

# os.walk will not expand "~" on its own, so expand it first
books_dir = os.path.expanduser(
    "~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents")

all_files = []
for dirpath, dirnames, filenames in os.walk(books_dir):
    for filename in [f for f in filenames if f.endswith(".pdf")]:
        all_files.append(os.path.join(dirpath, filename))

# print (files ending in .pdf.pdf.etc)

# trim file names with duplicate .pdf names

# print(files named "unnamed document")
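To fill in those placeholder comments, here is a minimal sketch (untested against a real iBooks library, and it assumes the problem files literally start with the phrase "unnamed document"; it only prints what it would rename, nothing touches the disk):

doubled = [p for p in all_files if p.endswith(".pdf.pdf")]
unnamed = [p for p in all_files
           if os.path.basename(p).lower().startswith("unnamed document")]
print("%d files with doubled .pdf extensions" % len(doubled))
print("%d files named 'unnamed document'" % len(unnamed))

for path in doubled:
    new_path = path
    # peel off trailing .pdf suffixes until exactly one remains
    while new_path.endswith(".pdf.pdf"):
        new_path = new_path[:-len(".pdf")]
    print("would rename %s -> %s" % (path, new_path))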

End of the First version

Start of the Second version

For the second version: after reading a few other blogs, I've realized this is a relatively well-known and solved problem. The script I've switched to was found online, dates from 2013, and uses hashes to compare file contents, reportedly quite quickly. As shown below, it just needs the name of a subdirectory in the terminal (where you will need to run it), then press enter.

testmachine at testmachine-MacPro in ~sandbox/test
$ python3 dupFinder.py venv $testdirectory

results in

Duplicates Found:
The following files are identical. The name could differ, but the content is identical
___________________
        venv/bin/easy_install
        venv/bin/easy_install-3.6
___________________
        venv/bin/pip
        venv/bin/pip3
        venv/bin/pip3.6
___________________
        venv/bin/python
        venv/bin/python3
___________________
        venv/lib/python3.6/site-packages/six.py
        venv/lib/python3.6/site-packages/pip/_vendor/six.py
___________________
        venv/lib/python3.6/site-packages/wq-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.app-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.core-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.db-1.1.2-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.io-1.1.0-py3.6-nspkg.pth
        venv/lib/python3.6/site-packages/wq.start-1.1.1-py3.6-nspkg.pth

I really like the simplicity of this script, so I'm posting it here while giving credit to [pythoncentral][1] for creating this baller code back in 2013. Six years on, it still runs flawlessly.

# dupFinder.py
import os, sys
import hashlib


def findDup(parentFolder):
    # Dups in format {hash:[names]}
    dups = {}
    for dirName, subdirs, fileList in os.walk(parentFolder):
        print('Scanning %s...' % dirName)
        for filename in fileList:
            # Get the path to the file
            path = os.path.join(dirName, filename)
            # Calculate hash
            file_hash = hashfile(path)
            # Add or append the file path
            if file_hash in dups:
                dups[file_hash].append(path)
            else:
                dups[file_hash] = [path]
    return dups


# Joins two dictionaries
def joinDicts(dict1, dict2):
    for key in dict2.keys():
        if key in dict1:
            dict1[key] = dict1[key] + dict2[key]
        else:
            dict1[key] = dict2[key]


def hashfile(path, blocksize=65536):
    # Hash the file in fixed-size blocks so large PDFs need not fit in memory
    hasher = hashlib.md5()
    with open(path, 'rb') as afile:
        buf = afile.read(blocksize)
        while len(buf) > 0:
            hasher.update(buf)
            buf = afile.read(blocksize)
    return hasher.hexdigest()


def printResults(dict1):
    results = list(filter(lambda x: len(x) > 1, dict1.values()))
    if len(results) > 0:
        print('Duplicates Found:')
        print('The following files are identical. The name could differ, but the content is identical')
        print('___________________')
        for result in results:
            for subresult in result:
                print('\t\t%s' % subresult)
            print('___________________')

    else:
        print('No duplicate files found.')


if __name__ == '__main__':
    if len(sys.argv) > 1:
        dups = {}
        folders = sys.argv[1:]
        for i in folders:
            # Iterate the folders given
            if os.path.exists(i):
                # Find the duplicated files and append them to the dups
                joinDicts(dups, findDup(i))
            else:
                print('%s is not a valid path, please verify' % i)
                sys.exit()
        printResults(dups)
    else:
        print('Usage: python dupFinder.py folder or python dupFinder.py folder1 folder2 folder3')


  [1]: https://www.pythoncentral.io/finding-duplicate-files-with-python/

The third version and onward need to evolve a few things.

There is room for improvement in the UI (a terminal or headless GUI) and in saving the results to a log or CSV file. Eventually moving this to a Flask or Django app could prove beneficial. Of course, since this is a PDF document scrubber, I could create a queue of files named "unnamed document": the machine could log each file's hash, or build and save an index, so that the next run wouldn't have to scan everything again and would instead just show me the "unnamed documents" that need work. "Work" could be defined as scrubbing the name, deduping, finding the cover page, adding keywords, or even creating a queue file so the reader actually reads each document. Maybe there is an API for Goodreads? Any menu or GUI would need error handling, some kind of cron job, somewhere to save results, and the intelligence behind it all so it starts to learn the steps you take over time. A rough sketch of the hash-index/CSV idea follows.
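As one possible starting point, here is a minimal sketch of that idea (my own assumption of how it could work, not part of the original script; it reuses the hashfile function from dupFinder.py above, and the cache file name pdf_hashes.csv is made up):

import csv
import os

CACHE_FILE = 'pdf_hashes.csv'  # hypothetical cache location

def load_cache():
    # Read previously computed hashes, if any, into a {path: hash} dict
    cache = {}
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, newline='') as f:
            for path, file_hash in csv.reader(f):
                cache[path] = file_hash
    return cache

def save_cache(cache):
    # Write the {path: hash} mapping back out as a two-column CSV
    with open(CACHE_FILE, 'w', newline='') as f:
        writer = csv.writer(f)
        for path, file_hash in cache.items():
            writer.writerow([path, file_hash])

# Only hash files we have not seen before
cache = load_cache()
for dirName, subdirs, fileList in os.walk('venv'):
    for filename in fileList:
        path = os.path.join(dirName, filename)
        if path not in cache:
            cache[path] = hashfile(path)  # hashfile() from dupFinder.py above
save_cache(cache)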

Ideas?


abhilb answered:

I would suggest you have a look at the pathlib library.

  • To list out all the files with the extension .pdf (note that pathlib does not expand "~" on its own, so call expanduser()):

from pathlib import Path

books = Path("~/Library/Mobile Documents/iCloud~com~apple~iBooks/Documents").expanduser()
pdf_files = list(map(str, books.glob("**/*.pdf")))

print(pdf_files)

  • To trim the extra .pdf extensions, looping so that .pdf.pdf.pdf also collapses to a single .pdf:

while any(x.endswith('.pdf.pdf') for x in pdf_files):
    pdf_files = [x[:-4] if x.endswith('.pdf.pdf') else x for x in pdf_files]
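To actually apply the trimmed names on disk with the manual-review prompt the question describes, a hedged sketch (my own extension of the answer, untested; note that on POSIX systems rename will silently replace an existing file with the same name):

for original in books.glob('**/*.pdf'):
    name = original.name
    # collapse repeated .pdf suffixes in the file name only
    while name.endswith('.pdf.pdf'):
        name = name[:-4]
    if name != original.name:
        answer = input('Rename %s -> %s? [y/N] ' % (original.name, name))
        if answer.lower() == 'y':
            original.rename(original.with_name(name))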