Steps:
- Read Article
- Implement your AI Idea
- ???
- Profit
I was embedding a large dataset, and my home server kept crashing.
What I did:
- I converted the JSON data file to JSONL format, one array entry per line. This fixed the RAM usage issue that was causing an OOM kill.
- I started a vector server instead of using FAISS, and once a document was parsed from a JSONL line, I embedded and inserted it immediately instead of appending everything to an array and embedding/committing it all at once. Saved RAM, again. (I had to do this because I had started with a file-based vector store instead of a server/client setup.)
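The JSON-to-JSONL conversion above can be sketched like this (file names and the function are mine, not from my actual script). One honest caveat: `json.load` still pulls the whole array into RAM, so for a file that genuinely doesn't fit, a streaming parser like `ijson` would be the real fix:

```python
import json

def json_array_to_jsonl(src_path, dst_path):
    # Read a JSON file containing one big array and write each entry
    # as its own line of JSON. Downstream code can then process the
    # file line by line instead of loading everything at once.
    with open(src_path) as src, open(dst_path, "w") as dst:
        for record in json.load(src):
            dst.write(json.dumps(record) + "\n")
```

The win isn't in this conversion step itself; it's that everything after it only ever has to hold one line in memory.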
These two optimizations prevented OOM kills, and to be honest, they should've been there from the beginning. As usual for my ML projects, I rented two Hetzner servers with 32 GB of RAM and 16 cores of RAW POWER haha; they cost me 10c an hour each.
I downloaded the dataset once more, which was a breeze thanks to data-center speeds, split it into two halves, and used schollz/croc to move the second half to the second server. I spun up a vector-store server on the first machine using Docker and pointed my script at it.
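Splitting the JSONL file and shipping one half looks roughly like this (file names are placeholders; the demo file stands in for the real dataset). The detail that matters: `split -n l/2` splits only at line boundaries, so no JSONL record gets cut in half:

```shell
# Tiny stand-in file; in practice dataset.jsonl is the real export.
printf '{"id":1}\n{"id":2}\n{"id":3}\n{"id":4}\n' > dataset.jsonl

# Two chunks, split only at line boundaries -> half_aa, half_ab
split -n l/2 dataset.jsonl half_

# Move the second half to server two (croc prints a code phrase to share):
#   croc send half_ab        # run on server one
#   croc <code-phrase>       # run on server two
```

`croc` handles the relay/NAT traversal on its own, which is why it's so convenient between two cloud boxes.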
Dataset -> Chunk -> For record in JSONL Chunk -> Parse, Embed, Insert into VectorStore Server -> Next Record
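The pipeline above, as a sketch. `embed()` and the client class here are hypothetical stand-ins, since I'm not naming the actual model or vector store; a real client (Qdrant, Weaviate, Milvus, ...) would speak HTTP/gRPC to the Docker container:

```python
import json

def embed(text):
    # Stand-in for the real embedding model.
    return [float(len(text))]

class VectorStoreClient:
    # Hypothetical client for the Dockerized vector-store server.
    def __init__(self):
        self.inserted = 0

    def insert(self, vector, payload):
        # The server owns the data; nothing accumulates client-side.
        self.inserted += 1

def ingest(jsonl_path, client):
    # One record at a time: parse, embed, insert, move on. Only the
    # current line is ever held in RAM, which is what stopped the OOM kills.
    with open(jsonl_path) as f:
        for line in f:
            record = json.loads(line)
            client.insert(embed(record["text"]), record)
    return client.inserted
```

Each server ran this same loop over its own half of the dataset, both pointed at the one vector-store server.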
Once the processing was done (8 hours later; I slept, and it was almost finished when I woke up), I destroyed the second server and took a snapshot of the first, where the vector store lived. I used docker commit and docker save to capture the container as an image so I could spin it back up locally without data loss. Then I spun up another Hetzner server with low RAM/CPU but high storage (this is for data-center bandwidth: I can move the file from server one incredibly quickly and leave it there for a while, because the storage server is much cheaper, then peacefully download the image over my residential line haha), installed croc on it, and moved the Docker image there. Destroyed the first instance, and we're done!
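For reference, the commit/save/load dance looks like this ("vectorstore" is a stand-in for the actual container name). One caveat worth knowing: `docker commit` captures the container's filesystem but not named volumes, so this only preserves the data if the store writes inside the container itself:

```shell
# Freeze the running container into an image, then flatten it to a
# single tar file that croc can move around.
docker commit vectorstore vectorstore-snapshot
docker save -o vectorstore-snapshot.tar vectorstore-snapshot

# Ship the tar to the cheap storage server, then later to the local box:
#   croc send vectorstore-snapshot.tar

# On the local machine, restore the image and run it again:
docker load -i vectorstore-snapshot.tar
docker run -d vectorstore-snapshot
```

If the store *does* keep its data in a volume, you'd tar up the volume directory separately instead of relying on commit.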