Reducing huge repo size with simple nginx file serving
The problem I faced with my current deployment of this website was that my last post bloated my git repo with large binary files.
Solutions online seemed to say that I should use S3 or another paid service. But I already pay for my VPS, and the files already exist on my VPS in the git repo (multiple times over!).
Then I remembered that it is dead simple to host files with nginx! This is all that is needed (in nginx.conf, or in a file in sites-available that is hardlinked into sites-enabled):
server {
    root /home/james/file_server;
    server_name files.jamesdesmond.org;

    location / {
        autoindex on;
        autoindex_localtime on;
    }
}
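Enabling it is then just a matter of linking the config into sites-enabled and reloading nginx. A minimal sketch, assuming the standard Debian/Ubuntu nginx layout; the config filename here is just an example:

# hard link the config into sites-enabled (filename is an example)
ln /etc/nginx/sites-available/files.jamesdesmond.org /etc/nginx/sites-enabled/files.jamesdesmond.org
# validate the config and reload without dropping connections
nginx -t && systemctl reload nginx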
I then moved all the large files over into the file_server directory.
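Roughly what that move looked like; a sketch, with the destination path taken from the file-server URLs used below and the *.png glob standing in for whatever media the post actually had:

# create a per-post directory on the file server and move the heavy media there
mkdir -p /home/james/file_server/james_blog_content/vqgan-fun
mv content/posts/vqgan-fun/*.png /home/james/file_server/james_blog_content/vqgan-fun/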
The remaining steps are:
- Find and replace any existing media paths in the vqgan-fun post's index.md with the new, correct file_server paths
- Purge the git history of those large binary files forever
- Redeploy website and test
Replacing filepaths with chatGPT and sed
To figure out the sed script required to replace the file paths, I used chatGPT. I asked: “I am looking for a sed script that will replace all instances of links to files, with a slightly different path. For example, "[reach](reach.png)" should become "[reach](https://files.jamesdesmond.org/james_blog_content/vqgan-fun/reach.png)"”
and I received:
sed 's#\[\([^]]*\)\](\([^)]*\))#[\1](https://files.jamesdesmond.org/james_blog_content/vqgan-fun/\2)#g' inputfile > outputfile
That came along with a lengthy explanation of what the command does. It worked on the first try, and I used it to process index.md and correct the paths.
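A quick way to sanity-check the substitution, using the example from the prompt:

echo '[reach](reach.png)' | sed 's#\[\([^]]*\)\](\([^)]*\))#[\1](https://files.jamesdesmond.org/james_blog_content/vqgan-fun/\2)#g'
# prints: [reach](https://files.jamesdesmond.org/james_blog_content/vqgan-fun/reach.png)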
I then pushed my changes to git, and made sure the page still loads correctly.
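To double-check the rewritten links end to end, something like this works (reach.png is just the example file from the prompt above):

# confirm the file server answers for a rewritten media URL
curl -I https://files.jamesdesmond.org/james_blog_content/vqgan-fun/reach.png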
Purging git history with git filter-repo
Next, I need to purge the git history of these large media files.
This is more challenging than you would think. The internet and chatGPT say to use git filter-repo, but I cannot install it with pip, pip3, by copying the source, or with apt.
The server is on Ubuntu 18.04, and its git is not new enough to even run filter-repo (which requires git >= 2.22.0). Instead I upgraded the OS with RELEASE_UPGRADER_ALLOW_THIRD_PARTY=1 do-release-upgrade (the env var is required for DigitalOcean instances that use third-party mirrors, like mine).
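For the record, the upgrade itself boiled down to this, run as root on the droplet; the version check afterwards just confirms git is new enough:

# release upgrade; the env var allows the third-party apt mirrors my droplet uses
RELEASE_UPGRADER_ALLOW_THIRD_PARTY=1 do-release-upgrade
# afterwards, confirm git is >= 2.22.0 so git filter-repo will run
git --version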
Having to do this upgrade on my ‘production’ server, and realizing the steps I would have to replicate to bring it back up, makes me think I should really try to containerize both this website and my 311 alerting/caching system. Right now my GH Actions workflow SSHes into the server, but I really should have a more portable setup, probably using Docker.

It is a tough call. The simplest option is to host the website statically on GitHub Pages. But then my VPS would exist only as a file server for the images / data files used by the brython scripts (which might be broken on GH Pages anyway), so I may as well just use S3 or GH LFS to host the images. Now I don’t even have a VPS at all. That is kind of the goal, never having anything I need to worry about updating… but the control I would hand over is also a bit of the fun for me.
What I really want is the ability to just deploy whatever containers I want, easily. So I should keep my VPS and run some container management software on it. Then I can write spec files that define the containers that will run and the ports / volumes they need. That sounds like I get to keep control, learn, and have a more robust system.
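As a very rough illustration of the direction (hypothetical container name, image tag, and mount point; not something I am running yet):

# hypothetical: serve the site with nginx in a container, publishing port 80
# and mounting the existing file_server directory read-only
docker run -d --name blog \
  -p 80:80 \
  -v /home/james/file_server:/usr/share/nginx/html/files:ro \
  nginx:stable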
I guess I can just try to recover onto a new instance from the DO image backups I already have taking place, and see how it goes. If that works, I don’t really care if I break this system; I’ll just restore from a backup.
One issue I just ran into: I had to re-run pip3 install -r requirements.txt for my 311 alert script, because the upgrade cleared out the twilio dependency or something.
I am still unable to install git-filter-repo with apt or pip, so I am now going to try the bfg tool instead.
I ran bfg --delete-folders vqgan-fun in the git repo, then ran the recommended git reflog commands, and went from a 100+MB repo down to a 5MB repo. (There are still some images in other directories I need to clean out, but first I want to look into having a Hugo shortcode handle linking the image tags.)
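For reference, this is the cleanup BFG recommends after it rewrites history, plus the force push needed so the remote picks up the slimmed history:

# expire old reflog entries and aggressively prune the now-unreferenced blobs
git reflog expire --expire=now --all && git gc --prune=now --aggressive
# the rewritten history has to be force-pushed to the remote
git push --force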
How will I handle large files in the future?
Well, I know I can use the file_server to hold them, but it is a bit annoying to type out that whole URL for each image path. I may instead write a Hugo shortcode that links the images, matching filepaths like vqgan-fun between content/posts and file_server/james_blog_content/.