git filter-branch clear notebook outputs all branches


I have been working in Jupyter notebooks for a while and never bothered to clear the cell outputs before committing to my git repo. Those notebooks are full of embedded graphs, HTML, images, logs, etc., and they are starting to bloat my git repository.

My goal is to purge the cell outputs from all the *.ipynb notebooks across the entire repo AND THEIR HISTORY. Doing it only at the branch HEAD isn't enough, since git would keep the outputs in the history and the repo size would not decrease.

My purge command is the following one:

python -m nbconvert --ClearOutputPreprocessor.enabled=True --inplace *.ipynb
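
For reference, the effect of that command on a single notebook can be approximated with the standard library alone. This is a minimal sketch, assuming notebooks in nbformat 4 (a top-level "cells" list). The names clear_outputs and clear_file are hypothetical, and this is not nbconvert's own implementation:

```python
import json


def clear_outputs(notebook: dict) -> dict:
    """Return the notebook with code-cell outputs and execution counts removed.

    Assumes the nbformat 4 layout: a top-level "cells" list of cell dicts.
    """
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_file(path: str) -> None:
    """Rewrite one .ipynb file in place without its outputs."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(clear_outputs(notebook), fh, indent=1, ensure_ascii=False)
        fh.write("\n")
```

Markdown and raw cells are left untouched, so only the bulky output payloads disappear.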

To apply it to all the files of a branch and its history, I use the following:

git filter-branch -f --prune-empty --tag-name-filter cat --tree-filter "python -m nbconvert --ClearOutputPreprocessor.enabled=True --inplace *.ipynb **/*.ipynb || true" &> notebook_clean.log
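
One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: git filter-branch is a /bin/sh script, and in a shell without globstar support the pattern **/*.ipynb behaves like */*.ipynb, so it reaches only one directory level down. Python's glob module can mimic both behaviours, and the sketch below uses that to show which paths each mode would reach. The directory layout, file names, and function names are hypothetical:

```python
import glob
import os
import tempfile


def matches(root: str, recursive: bool) -> list:
    """Return notebook paths under root matched by *.ipynb and **/*.ipynb.

    With recursive=False, Python treats "**" like "*", which mirrors a
    shell that has no globstar support.
    """
    found = set()
    for pattern in ("*.ipynb", os.path.join("**", "*.ipynb")):
        for path in glob.glob(os.path.join(root, pattern), recursive=recursive):
            found.add(os.path.relpath(path, root).replace(os.sep, "/"))
    return sorted(found)


def make_demo_tree(root: str) -> None:
    """Create empty notebooks at depth 0, 1 and 2 (hypothetical names)."""
    for rel in ("top.ipynb", "a/one.ipynb", "a/b/two.ipynb"):
        path = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_tree(root)
        print("no globstar:", matches(root, recursive=False))
        print("globstar:   ", matches(root, recursive=True))
```

If this applies, notebooks nested two or more directories deep would never be passed to nbconvert, and the trailing || true would hide any resulting error.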

The problem with git filter-branch is that it can mess up the git history. So I run the following git magic commands, which I inherited from someone else.

# Update local repo references: delete the backup refs under refs/original, which still point at the old commits
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin

# Expire the reflog: drop the remaining references to the old commits
git reflog expire --expire=now --all

# Cleanup: force git prune to reduce repo size
git gc --aggressive --prune=now

This seems to work for the branch I am currently on: the total size of the notebooks on this branch goes from 281MB to less than 2MB.

Now I would like to do it on all branches and their history. Based on the manual and forums, I use the form git filter-branch <options> <command> -- --all:

git filter-branch -f --prune-empty --tag-name-filter cat --tree-filter "python -m nbconvert --ClearOutputPreprocessor.enabled=True --inplace *.ipynb **/*.ipynb || true" -- --all &> notebook_clean.log

I see outputs such as

Rewrite 016bf1...5aa874 (616/772) (2937 seconds passed, remaining 743 predicted)
[NbConvertApp] Converting notebook XXX.ipynb to notebook
[NbConvertApp] Writing 2602 bytes to XXX.ipynb

which I take to mean that it found 772 commits and that, for each one, it creates a rewritten commit with the notebooks purged. It ends with

Ref 'refs/heads/master' was rewritten
Ref 'refs/heads/dev' was rewritten
Ref 'refs/remotes/origin/master' was rewritten
Ref 'refs/remotes/origin/some_task001' was rewritten
Ref 'refs/remotes/origin/dev' was rewritten

That sounds good.

I finish with the previous cleanup operations, then run git push --force --all, with nothing special to notice:

089e7r...76df80 master -> master (forced update)
47125j...gh54a3 dev -> dev (forced update)
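
Since git push --all pushes the branches under refs/heads, it may be worth comparing the rewritten refs against the local branches. The sketch below lists remote-tracking branches that have no local counterpart, using the ref names from the filter-branch output above. The function name and the assumption of a single remote called origin are illustrative:

```python
def remote_only_branches(refnames, remote="origin"):
    """Return branch names tracked under refs/remotes/<remote>/ that have
    no matching local branch under refs/heads/."""
    local_prefix = "refs/heads/"
    remote_prefix = f"refs/remotes/{remote}/"
    local = {r[len(local_prefix):] for r in refnames if r.startswith(local_prefix)}
    tracked = {r[len(remote_prefix):] for r in refnames if r.startswith(remote_prefix)}
    tracked.discard("HEAD")  # origin/HEAD is a symbolic ref, not a branch
    return sorted(tracked - local)


rewritten = [
    "refs/heads/master",
    "refs/heads/dev",
    "refs/remotes/origin/master",
    "refs/remotes/origin/some_task001",
    "refs/remotes/origin/dev",
]
print(remote_only_branches(rewritten))  # ['some_task001']
```

In a real repository the list of ref names could come from git for-each-ref --format='%(refname)'. Any branch reported here was rewritten only as a local remote-tracking ref, so a push of local branches would not update it on the server.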

However, when I check out another branch and some older commits from git log, some notebooks, XXX.ipynb in particular (but seemingly not all), are still big and laden with outputs. I would have expected all notebooks across all branches and all historical commits to be clean.

Despite my best efforts, I do not understand why the -- --all option does not seem to work. (FYI, my repository spans about 20 branches in total.) The .git folder is still the same size, if not bigger. I assume that if some notebooks got cleaned but not all, the diff from commit to commit got even bigger.
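
To see which notebook versions are still large anywhere in the reachable history, one option is to feed git rev-list --objects --all into git cat-file --batch-check and filter the result. The following is a rough sketch: the function names and the 100 kB threshold are hypothetical, and the reported sizes are uncompressed blob sizes, not their on-disk footprint:

```python
import subprocess


def large_notebooks(batch_check_output: str, min_bytes: int = 100_000):
    """Parse lines of the form "<type> <size> <path>" and return
    (size, path) pairs for .ipynb blobs of at least min_bytes, largest first."""
    hits = []
    for line in batch_check_output.splitlines():
        parts = line.split(" ", 2)  # the path itself may contain spaces
        if len(parts) != 3:
            continue  # skip lines that do not have all three fields
        objtype, size, path = parts
        if objtype == "blob" and path.endswith(".ipynb") and int(size) >= min_bytes:
            hits.append((int(size), path))
    return sorted(hits, reverse=True)


def scan_repository(min_bytes: int = 100_000):
    """Run the two git commands in the current repository and parse the result."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "cat-file", "--batch-check=%(objecttype) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return large_notebooks(sizes, min_bytes)
```

Calling scan_repository() from inside the repository after the rewrite and cleanup would list any notebook blobs that are still reachable from some ref and still carry heavy outputs.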

Could you please help me? Thank you
