OpenAI Infra - Throttle All The Things

Published at 06:42 PM

August 2024 update

Things appear to be better now. Just use OpenAI. Move along.

What the hell are you talking about?

That’s the typical response I’ve gotten from other engineers whenever I mention the issues with using OpenAI in production. I get looked at like the old man fifteen minutes into every horror movie who tells those wild-ass kids not to party in the abandoned house. But they eventually figure it out; I am the harbinger of death.

429s from OpenAI

Getting throttled by a SaaS company isn’t an experience most developers expect to spend their time on when they build an integration. If someone told me:

Bro, you’re going to at least have some sort of exponential backoff in place when you hit the Stripe API.

I’d laugh at them. That’s what I pay Stripe a portion of my sales to deal with.
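
And yet, with OpenAI, you end up writing exactly that. For the uninitiated: exponential backoff just means catching the 429 and retrying with exponentially growing sleeps. A minimal sketch with the OpenAI Python SDK (the model name and retry budget are illustrative, not a recommendation):

```python
import random
import time

from openai import OpenAI, RateLimitError

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat_with_backoff(messages, max_retries=6):
    """Call chat completions, retrying 429s with exponential backoff."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="gpt-4o-mini",  # illustrative; use whatever you're on
                messages=messages,
            )
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # out of retries; let the caller deal with it
            # Sleep 1s, 2s, 4s, ... plus jitter so concurrent callers
            # don't all stampede back at the same moment.
            time.sleep(delay + random.uniform(0, delay))
            delay *= 2
```

Call `chat_with_backoff([{"role": "user", "content": "hi"}])` and it behaves like a plain completion call, just slower when OpenAI is melting.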

Then, reality hits

In the summer of 2023, I left my role at Vault to chase my startup dreams. It had been a long time since I actually practiced what I preached to the founders I worked with at Vault, and I was stoked (still am, too) about the advances companies had made with Transformers (mostly NLP with LLMs at the time). It was a watershed moment in technology and I was itching to hop on the hype train.

I joined a startup founded by Bill Klein with the mission to solve the cognition problems with LLMs and build insanely cool agents with unlimited utility.

So we built an app on top of OpenAI, and it was good.

Then we got people to use it.

Disaster.

429s all over the place. This was untenable.

Azure to the Rescue (but not really)

The natural next step for any company that outgrows OpenAI is to move to its offering on Azure, the Azure OpenAI Service.

It’s enterprise, after all. The good stuff.

All your scaling problems are over. Papa Satya’s gotchu.

wrong wrong wrong wrong wrong

We switched over to Azure and things looked marginally better. We load-balanced requests across multiple regions, backed off exponentially whenever a region got too hot, and got a nice bump in throughput.
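
A rough sketch of that pattern, assuming one Azure OpenAI deployment per region (the endpoints, keys, API version, and deployment name below are placeholders, not our actual setup):

```python
import itertools
import time

from openai import AzureOpenAI, RateLimitError

# Hypothetical per-region deployments; endpoints and keys are placeholders.
REGIONS = [
    ("https://myapp-eastus.openai.azure.com", "KEY_EASTUS"),
    ("https://myapp-westeurope.openai.azure.com", "KEY_WESTEUROPE"),
    ("https://myapp-francecentral.openai.azure.com", "KEY_FRANCE"),
]

clients = [
    AzureOpenAI(azure_endpoint=url, api_key=key, api_version="2024-02-01")
    for url, key in REGIONS
]
cooldown_until = [0.0] * len(clients)  # per-region "too hot" timestamps

def complete(messages, penalty=2.0):
    """Round-robin across regions, benching any region that 429s."""
    for i in itertools.cycle(range(len(clients))):
        if time.time() < cooldown_until[i]:
            time.sleep(0.05)  # everything might be hot; don't busy-spin
            continue
        try:
            return clients[i].chat.completions.create(
                model="my-gpt4-deployment",  # Azure deployment name, not a model id
                messages=messages,
            )
        except RateLimitError:
            # Bench this region with a growing penalty and try the next one.
            cooldown_until[i] = time.time() + penalty
            penalty *= 2
```

Cycling the index is the load balancing; the growing penalty is the backoff.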

Then the cycle happened all over again and we hit another wall. Regions on Azure started getting hit hard by other companies on the same path, and we had to do some wild stuff to keep running. We even got told by our Microsoft account team, “YOLO, try the France region. Doesn’t seem like a lot of people are using it.”

Time to pay the piper

Azure offers something called “Provisioned Throughput Units” (PTUs) for companies that really need to scale. We figured we fit that category, if very loosely. Even though we hadn’t officially launched yet, we wanted to understand how the pricing worked.

Let’s just say the vast majority of companies, and most startups, would find these offerings cost-prohibitive.

This is still a huge problem

So we pivoted and started Y2, which exposed some of the solutions we’d worked up to deal with the scaling issues we’d encountered. I’ll write about that journey another time.

Do I think you should use Azure if you’re getting throttled by OpenAI?

As of the time of writing…no. Some of the “solutions” Azure has offered up for this problem (even with PTUs, as I understand it) are laughable at best.

Why yes, let’s spin up an entire expensive service on Azure to implement exponential backoffs against your shitty infra.

What’s the solution?

Hell if I know. Bill and Y2 are still doing awesome things and you should definitely reach out to him if you find yourself in the same boat.

Scaling any LLM SaaS is a really hard problem. GPUs are not cheap, and neither are the infrastructure and the talented people that make those tokens go brrrr.

That said, open source models are impressive right now. MoE models like Mistral’s Mixtral are stupid good, and probably good enough for whatever you’re doing.
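
If you want to kick the tires, one way to do it is vLLM, which exposes an OpenAI-compatible server, so pointing your existing client at a self-hosted Mixtral is mostly a base-URL change. A minimal sketch, assuming you have the GPUs to serve it:

```python
from openai import OpenAI

# Assumes you've started vLLM's OpenAI-compatible server first, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model mistralai/Mixtral-8x7B-Instruct-v0.1
# Then the regular OpenAI client works; only the base_url and key change.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    messages=[{"role": "user", "content": "Who throttles you now?"}],
)
print(response.choices[0].message.content)
```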

Git gud and run your own models. At least then all you’ll have to blame is yourself when your app goes down.