toby partners with Baseten

Baseten helps real-time translation tool toby reach the Product Hunt podium

toby founders Vincent Wilmet and Lucas Campa came to Baseten one week away from their startup’s Product Hunt launch. Their product, an AI-powered real-time translation service, allows people to have a live video call while speaking different languages. Vincent and Lucas worked with Baseten engineers to migrate from their development infrastructure to an ultra-low-latency, production-ready deployment on Baseten and reached #3 on Product Hunt on launch day with zero minutes of downtime.

"A week ago we reached out with a hefty goal and within days your team helped us get set up and stable for a launch. It went smoothly, entirely thanks to you guys. 100% couldn’t have gone live without the software and hardware support you guys worked through the weekend to get us. The custom optimized Whisper on Baseten’s autoscaling L4 GPUs saved us." - Vincent Wilmet, Co-founder and CTO @ toby

Challenge

Homegrown inference infrastructure isn’t ready for peak demand.

Vincent and Lucas met in New York and began experimenting with AI tools for eliminating the language barrier in the global workforce. Both founders have lived in many countries, learning at least three languages each. That background helped them realize that talented workers around the world are cut off from high-paying jobs by language barriers. They decided to begin solving this problem by building a real-time translation layer into video calls.

"To run a transcription model for their video chat service, the toby team got started by spinning up an A10G-backed virtual machine on AWS EC2. Our deployment process was literally just SSHing into the box and working on the GPUs directly." - Vincent Wilmet, Co-founder and CTO @ toby

Once they got beta users, they realized the service needed to stay online 24/7, not just when they were actively working on the product. With plenty of spare AWS credits to burn, this was simple enough: they spun up a second VM, called it “prod”, and left it running full-time.

They knew this was a temporary solution at best. The designated production VM had a habit of crashing, and with no monitoring in place, they’d be notified of outages by beta testers. Vincent explained their process for rebooting and hotfixing: “if [the servers] would go down, we’d SSH in and mash Ctrl+C to restart them manually.”

This instability during testing turned into an existential issue when preparing for launch. Vincent and Lucas learned that they could expect as many as 5,000 users from Product Hunt on launch day. This spike in usage would quickly overwhelm their existing VM, and they realized that they needed to quickly upgrade to production-grade model serving infrastructure.

Solution

Serve low-latency models on Baseten’s autoscaling infrastructure to seamlessly handle unpredictable traffic.

The toby team, now five engineers in total, first spent a couple of weeks trying to set up their own Kubernetes cluster in AWS and building an autoscaling model deployment system. They found building their own infrastructure to be unintuitive and frustrating. Realizing they had bigger fish to fry – namely product and model improvements – Vincent and Lucas began looking for a provider to handle production model serving.

"We care more that it's done than done by us. Bonus points if it’s done really well!" -Vincent Wilmet, Co-founder and CTO @ toby

They reached out to Baseten one week ahead of their set launch date and paired with a Baseten forward-deployed engineer to get their infrastructure ready in time for launch. After talking through GPUs, models, application structure, traffic expectations, and timelines, engineering work began on a Thursday.

By the end of that first day, they had migrated their entire setup from AWS to Baseten. By end of day Friday, their deployments were fully ready for load and stress testing ahead of launch; all they had to do was point their existing setup at the new API endpoint deployed on Baseten. After successful tests and tweaks over the weekend, the toby team was ready for a Tuesday launch.
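In practice, that cutover is little more than swapping a URL in the client. Here is a minimal sketch, assuming the endpoint format Baseten documents for deployed models; the model ID, payload shape, and helper function are hypothetical rather than toby’s actual code:

```python
import os

import requests

# Hypothetical model ID; the real value comes from the Baseten model dashboard.
MODEL_ID = "abc123"
API_KEY = os.environ["BASETEN_API_KEY"]


def transcribe(audio_b64: str) -> dict:
    """Send a base64-encoded audio chunk to the Whisper deployment on Baseten."""
    resp = requests.post(
        f"https://model-{MODEL_ID}.api.baseten.co/production/predict",
        headers={"Authorization": f"Api-Key {API_KEY}"},
        json={"audio": audio_b64},  # illustrative payload, not toby's actual schema
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()
```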

"The team at Baseten was absolutely exceptional. They took our panic and made us calm and serene. They worked with us, coded with us, and gave us access to beta services and new models. Above and beyond – elite!" - Vincent Wilmet, Co-founder and CTO @ toby

toby uses a deployment of Whisper for live conversation transcription. Their rearchitected model allows for streamed inputs and ensures that proper nouns and product names are spelled and translated correctly and contextually. For example, when talking about programming in Spanish, the model would transcribe “Python” as “Python,” not “pitón 🐍.”
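toby’s streaming rearchitecture is their own work, but the general technique of steering a transcription model toward domain vocabulary can be sketched with the open-source whisper package, whose transcribe method accepts an initial_prompt that conditions decoding. The audio file and glossary below are made up for illustration:

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("small")

# Seeding the decoder with domain vocabulary helps proper nouns and product
# names survive transcription instead of being "translated" (Python != pitón).
glossary = "Python, Kubernetes, Baseten, Whisper, TensorRT-LLM"

result = model.transcribe(
    "call_audio.wav",                       # hypothetical audio file
    language="es",                          # the speaker's language
    task="translate",                       # Whisper can translate to English directly
    initial_prompt=f"Glossary: {glossary}",
)
print(result["text"])
```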

This transcription model runs on L4 GPUs on Baseten and is optimized for batch inference to handle large numbers of concurrent users from Product Hunt. The model deployment automatically scales up and down in response to traffic, and the toby team has full observability from logs and metrics.
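Batching and scaling happen server-side on Baseten, so the sketch below is only an illustration of the general idea, not Baseten’s or toby’s implementation: a minimal dynamic batcher that groups requests arriving within a short window, so one GPU forward pass serves many concurrent users.

```python
import asyncio

MAX_BATCH = 16
WINDOW_S = 0.01  # collect requests for up to 10 ms after the first arrives

queue: asyncio.Queue = asyncio.Queue()


async def batch_worker(model_fn):
    """Group near-simultaneous requests into one batched model call."""
    while True:
        batch = [await queue.get()]  # block until the first request arrives
        try:
            while len(batch) < MAX_BATCH:
                batch.append(await asyncio.wait_for(queue.get(), WINDOW_S))
        except asyncio.TimeoutError:
            pass  # window closed; run with whatever we have
        audios = [audio for audio, _ in batch]
        for (_, fut), text in zip(batch, model_fn(audios)):  # one pass, many users
            fut.set_result(text)


async def transcribe(audio):
    """What each concurrent caller awaits."""
    fut = asyncio.get_running_loop().create_future()
    await queue.put((audio, fut))
    return await fut


async def main():
    def fake_model(batch):  # stand-in for the real batched transcription model
        return [f"transcript of {a}" for a in batch]

    worker = asyncio.ensure_future(batch_worker(fake_model))
    print(await asyncio.gather(*(transcribe(f"chunk-{i}") for i in range(5))))
    worker.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```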

Results

Key result #1: Zero downtime during and after launch with autoscaling deployments.

Running on Baseten’s model serving infrastructure, toby launched on schedule and reached #3 on Product Hunt with zero minutes of downtime.

But the traffic didn’t stop there. As influential Twitter accounts and newsletters amplified the toby launch over the course of the week, Baseten’s autoscaling infrastructure absorbed the unanticipated traffic spikes, even while Vincent and Lucas were asleep.

"At 6am the day after launch we were dead asleep and got a massive traffic spike off of a newsletter shout-out. We’re so lucky that Baseten has autoscaling and kept our service live, otherwise we would have slept through downtime." - Vincent Wilmet, Co-founder and CTO @ toby

Key result #2: Low latency for real-time translation

toby runs Whisper to power the live transcript that enhances the user experience, so every millisecond of latency matters. Even during traffic spikes, latency can’t be sacrificed.

Ahead of launch, Baseten’s forward-deployed engineers worked with the toby team to optimize the performance of their Whisper model on L4 GPUs using TensorRT-LLM. The optimized deployment beat performance targets, averaging just 200 milliseconds of end-to-end latency.
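“End-to-end” here means client-observed: the time from sending an audio chunk to receiving the transcript back. A minimal harness for verifying a target like that, again with a hypothetical model ID and a placeholder payload:

```python
import statistics
import time

import requests

URL = "https://model-abc123.api.baseten.co/production/predict"  # hypothetical model ID
HEADERS = {"Authorization": "Api-Key YOUR_API_KEY"}             # placeholder key

latencies_ms = []
for _ in range(50):  # warm requests only; the very first call may hit a cold start
    start = time.perf_counter()
    resp = requests.post(URL, headers=HEADERS, json={"audio": "..."}, timeout=10)
    resp.raise_for_status()
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"mean: {statistics.mean(latencies_ms):.0f} ms")
print(f"p95:  {statistics.quantiles(latencies_ms, n=20)[-1]:.0f} ms")
```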

Key result #3: Massive cost savings vs scaling existing infrastructure

The combination of traffic-based autoscaling, optimized model performance, and switching to L4 GPUs — which cost less per hour to operate than A10G GPUs — made the new model serving stack an estimated 95% more cost-efficient at scale versus manually scaling up their original infrastructure deployed on AWS.
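The 95% figure is toby’s estimate. As a back-of-envelope illustration of how those three factors compound (every number below is hypothetical, not toby’s actual pricing or traffic), replacing an always-on fleet sized for peak with a cheaper, better-utilized autoscaled one lands in the same ballpark:

```python
# All numbers HYPOTHETICAL, chosen only to illustrate the arithmetic.
A10G_HOURLY, L4_HOURLY = 1.21, 0.85  # illustrative per-GPU-hour rates
HOURS_PER_MONTH = 730

# Always-on: a fleet of 20 A10Gs sized for peak traffic, running 24/7.
always_on = 20 * A10G_HOURLY * HOURS_PER_MONTH

# Autoscaled: the optimized model needs half as many GPUs at peak, they are
# cheaper L4s, and replicas only bill while active (say 15% average utilization).
autoscaled = 10 * L4_HOURLY * HOURS_PER_MONTH * 0.15

print(f"always-on: ${always_on:,.0f}/mo   autoscaled: ${autoscaled:,.0f}/mo")
print(f"savings: {1 - autoscaled / always_on:.0%}")  # roughly 95%
```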

Now, Vincent, Lucas, and their team can focus on developing the next set of features for toby and continue to break down the language barrier in the global workforce without worrying about infrastructure.

"Baseten went above and beyond. An elite team, an exceptional experience. Deeply grateful. I’m looking forward to the continued partnership." - Lucas Campa, Co-founder and CEO @ toby