Sarvam AI ran into trouble while testing one of its own systems. That test ended up shaping its latest product.
During an internal 24-hour sprint, the Bengaluru-based company asked an AI setup to extract financial and business data from company reports. The task involved pulling around 200 metrics from 27 PDF files covering several years of a listed company’s performance.
The system didn’t get through the job.
The first version relied on a single autonomous agent. It produced no usable output. None of the extracted data could be trusted.
Co-founder Pratyush Kumar said the instructions were clear and the task itself was routine. Even so, the agent stalled within 15 to 20 minutes, having attempted fewer than half of the required fields before giving up.
A routine test that exposed deeper flaws
The team then reworked the setup multiple times. Over six iterations, using Claude Code and internal debugging tools, they changed how the work was split and how results were checked. Each revision improved things slightly. By the final run, the system was able to complete the workflow with results the team felt were usable.
Kumar later shared that the rebuilt system, called Arya and running on GPT-4.1 mini, delivered close to five times higher accuracy while costing about one-tenth as much as a Claude Code agent swarm. Logs from earlier runs showed repeated problems with unit handling, reporting periods, and the way financial metrics were interpreted.
Those problems weren’t fixed by switching models. They were fixed by changing how the system was organised.
The team moved away from a single-agent setup and rebuilt the pipeline with multiple agents working in parallel. Steps were more clearly defined, and basic controls were added to track progress and recover when something went wrong. Once those changes were in place, the system stopped failing at random points.
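Sarvam AI has not published Arya's internals, but the pattern the company describes, parallel workers, explicit progress tracking, and recovery after failures, can be sketched in a few lines of Python. Everything in the sketch below is an assumption for illustration: the extract_metric() call, the checkpoint file, and the worker and retry counts are hypothetical stand-ins, not Arya's actual API.

```python
# Illustrative sketch only: Arya's real design is not public.
# extract_metric(), progress.json, and all parameters are assumed names.
import json
import concurrent.futures
from pathlib import Path

CHECKPOINT = Path("progress.json")

def extract_metric(metric: str) -> str:
    """Placeholder for a single agent call that pulls one metric from the reports."""
    raise NotImplementedError("wire this to the model or agent of your choice")

def run_one(metric: str, retries: int = 3) -> str:
    """Retry a single metric a few times before giving up on it."""
    last_err: Exception | None = None
    for _ in range(retries):
        try:
            return extract_metric(metric)
        except Exception as err:
            last_err = err
    raise last_err

def load_checkpoint() -> dict:
    return json.loads(CHECKPOINT.read_text()) if CHECKPOINT.exists() else {}

def save_checkpoint(done: dict) -> None:
    CHECKPOINT.write_text(json.dumps(done, indent=2))

def run(metrics: list[str], workers: int = 8) -> dict:
    done = load_checkpoint()                      # resume instead of restarting
    pending = [m for m in metrics if m not in done]
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_one, m): m for m in pending}
        for fut in concurrent.futures.as_completed(futures):
            metric = futures[fut]
            try:
                done[metric] = fut.result()
            except Exception:
                # record the failure explicitly rather than losing it silently
                done.setdefault("_failed", []).append(metric)
            save_checkpoint(done)                 # persist progress after every step
    return done
```

The point of the sketch is the shape, not the code: splitting the work into small, independently retryable steps and writing progress to disk is what stops one bad step from sinking a 200-metric run.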
Sarvam AI says this kind of breakdown is common with agent systems today. Many of them work in controlled demos but struggle on real workloads. Execution order can be unclear. State can be lost. Failures often go unnoticed until someone reviews the output.
That makes them difficult to rely on for regulated or high-risk business tasks.
Arya is meant to sit between simple prompt chains and more complex coding agents. Prompt chains handle basic work but don’t scale well. Coding agents can take on harder problems but require close supervision and tend to fail abruptly. Arya is designed to bring more structure without turning workflows into rigid scripts.
The company also pointed to a simple reliability issue. Even when individual steps work most of the time, long workflows often don't. Saving state, retrying failed steps, and keeping execution predictable make a noticeable difference.
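A back-of-the-envelope calculation, not from Sarvam AI but easy to verify, shows why. If each step succeeds 99% of the time, a workflow of 200 steps finishes cleanly only about 13% of the time without checkpoints and retries:

```python
# Illustration of how per-step reliability compounds; the 99% figure is assumed.
steps = 200          # roughly the number of metrics in the test described above
per_step = 0.99      # assume each individual step succeeds 99% of the time

print(f"chance the whole run finishes cleanly: {per_step ** steps:.1%}")
# -> about 13.4%
```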
Sarvam AI plans to release Arya as open source. The release will include a containerised runtime, tools to inspect and replay runs, and support for building and managing agent workflows. The company describes it as shared infrastructure rather than a standalone product.
For companies testing AI agents in areas such as document processing, compliance, and financial analysis, the lesson is practical. These jobs look straightforward until systems start skipping steps or losing context mid-run. Sarvam AI believes reliability will matter more than raw capability as these tools move into everyday use.
The launch comes during an active period for the company. Over the past two weeks, Sarvam AI has announced several other products, including Bulbul 3 for text-to-speech, Saaras for speech-to-text, and vision and ASR models that support 22 Indian languages.