I was embedding a large dataset, and my home server kept crashing.

What I did:

These two optimizations prevented OOM kills, and to be honest, they should have been there from the beginning. As usual for my ML projects, I rented two Hetzner servers with 32 GB of RAM and 16 cores of RAW POWER haha; they cost me 10 cents an hour each.

I downloaded the dataset once more, which was a breeze thanks to data-center speeds, split it into two halves, and used schollz/croc to move the second half to the second server. I spun up a vector store server on the first box using Docker and configured my script to connect to it.
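The post doesn't say how the split was done, so here's a minimal sketch of halving a JSONL file by line count (file names and the helper are hypothetical; a tool like `split` works just as well). The key point is that the file is streamed in two passes, never held in memory:

```python
# Sketch: split a JSONL dataset into two halves, one per worker server.
# Filenames are hypothetical; the second half is what croc would move over.

def split_jsonl(src: str, out_a: str, out_b: str) -> None:
    # First pass: count records without holding them in memory.
    with open(src) as f:
        total = sum(1 for _ in f)
    half = total // 2
    # Second pass: stream each line straight into one of the two outputs.
    with open(src) as f, open(out_a, "w") as a, open(out_b, "w") as b:
        for i, line in enumerate(f):
            (a if i < half else b).write(line)
```

With an odd record count the second file simply gets one extra line, which doesn't matter for this kind of embarrassingly parallel work.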

Dataset -> Chunk -> For record in JSONL Chunk -> Parse, Embed, Insert into VectorStore Server -> Next Record
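The loop above can be sketched roughly as follows. The post names neither the embedding model nor the vector store, so `embed` and `VectorStoreClient` are stand-ins; the point is that records stream through one at a time instead of being loaded all at once:

```python
import json

def embed(text: str) -> list[float]:
    # Stand-in for the real embedding call (a local model or an API).
    # Toy hash-derived vector so the sketch stays self-contained.
    return [float((hash(text) >> (8 * i)) & 0xFF) for i in range(4)]

class VectorStoreClient:
    # Stand-in for the real vector-store client running in Docker.
    def __init__(self) -> None:
        self.records: list[tuple[list[float], dict]] = []

    def insert(self, vector: list[float], payload: dict) -> None:
        self.records.append((vector, payload))

def process_chunk(path: str, store: VectorStoreClient) -> int:
    inserted = 0
    with open(path) as f:
        for line in f:                      # stream: one record in memory at a time
            record = json.loads(line)       # Parse
            vector = embed(record["text"])  # Embed
            store.insert(vector, record)    # Insert into vector store
            inserted += 1
    return inserted
```

Iterating the file object line by line is what keeps memory flat regardless of dataset size, which is exactly the property that stops the OOM kills.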

Once the processing was done (8 hours later; I slept, and it was almost finished when I woke up), I destroyed the second server and took a snapshot of the first, where the vector store lived. I used docker commit and docker save to snapshot the container so I could spin it back up on my local machine without data loss.

Then I spun up another Hetzner server with low RAM/CPU but high storage. The point is data-center bandwidth: I can move the file off server one incredibly quickly and leave it parked there for a while, because the storage server is much cheaper, and then peacefully download the image over my residential line haha. I installed croc on it and moved the Docker image over. Destroyed the first instance, and we're done!
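The snapshot step was done with the docker CLI; the same thing can be sketched with the docker-py SDK. The container and image names here are hypothetical, and nothing below is executed against a real daemon:

```python
def snapshot_container(container_name: str, out_path: str) -> None:
    # Sketch using the docker-py SDK; imported here so the file can be
    # read without docker-py installed.
    import docker

    client = docker.from_env()
    container = client.containers.get(container_name)
    # Equivalent of `docker commit`: freeze the running container
    # (vector store process + its data) into a new image.
    image = container.commit(repository="vectorstore-snapshot", tag="latest")
    # Equivalent of `docker save`: stream the image to a tarball
    # that croc can then move between machines.
    with open(out_path, "wb") as f:
        for chunk in image.save(named=True):
            f.write(chunk)
```

On the local machine, `docker load` on that tarball brings the container back up with all the embedded vectors intact.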