Product Updates


Speeding Up Chandra

November 17, 2025

TL;DR: We made Chandra 3× faster by training an Eagle3 speculative decoding model: p99 latency is down 3×, p50 latency is down 25%, and throughput is up 40%, all with accuracy unchanged.

We recently released Chandra, our SOTA OCR model, on Hugging Face and through our API.

However, we found that serving the base model as-is could not meet the latency and throughput targets we wanted to support.

So we trained an Eagle3 draft model for Chandra. Eagle3 is a variant of speculative decoding, and it reduced our API’s p99 latency by 3× without reducing accuracy.

Speculative Decoding

LLMs generate tokens one by one: generating a full response of T tokens takes T forward passes of the model. Speculative decoding is a method for predicting multiple tokens in a single pass. It uses a smaller model (the draft model) to generate and propose multiple candidate tokens, which are then verified in parallel by the larger target model.

Speculative decoding leverages the fact that the next few tokens are often easy to predict. For example, generating the next few tokens for a sentence like this is easy:

The quick brown fox ...

Whereas a sentence like this has many valid continuations:

My favorite food is ...

When the next few tokens are easy to predict, the cheap draft model generates them and the target model only has to confirm them, so decoding gets faster with no loss of accuracy: any drafted token the target model disagrees with is simply replaced by the target model’s own prediction.
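
To make the draft-and-verify loop concrete, here is a minimal sketch of greedy speculative decoding in PyTorch. It is only an illustration, not Chandra’s implementation: draft_model and target_model are hypothetical callables that return next-token logits for every position, and batch size 1 is assumed.

```python
import torch

def speculative_step(target_model, draft_model, tokens, k=4):
    """One round of greedy speculative decoding (batch size 1).

    tokens: (1, seq_len) tensor of token ids generated so far.
    draft_model / target_model: hypothetical callables mapping a
    (1, seq_len) id tensor to (1, seq_len, vocab) next-token logits.
    """
    # 1) Draft: the small model proposes k candidate tokens, one by one.
    draft = tokens
    for _ in range(k):
        next_tok = draft_model(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, next_tok], dim=-1)

    # 2) Verify: a single forward pass of the target model scores every
    #    drafted position in parallel.
    target_pred = target_model(draft).argmax(-1)

    # 3) Accept the longest prefix of drafted tokens the target agrees
    #    with; at the first disagreement, keep the target's token instead.
    n = tokens.shape[1]
    accepted = tokens
    for i in range(k):
        target_tok = target_pred[:, n + i - 1].unsqueeze(-1)
        accepted = torch.cat([accepted, target_tok], dim=-1)
        if target_tok.item() != draft[:, n + i].item():
            return accepted  # rejected: stop after the corrected token

    # All k drafts accepted, so we also keep the target's "bonus" token.
    return torch.cat([accepted, target_pred[:, -1:]], dim=-1)
```

Each round returns between 1 and k + 1 new tokens for the cost of a single target-model forward pass, which is where the speedup comes from.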

Eagle3

Eagle3 is an improvement over vanilla speculative decoding. Instead of drafting a single linear sequence of tokens, Eagle3 drafts multiple candidate continuations in a tree-like structure. Its draft head takes the outputs of three layers of the target model and uses them to generate multiple potential completions, which are then verified by the target model.
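
As a rough illustration (and only an illustration, not Chandra’s actual architecture), the core of an Eagle3-style draft head might look like the sketch below: hidden states from three target-model layers are fused and passed through one lightweight decoder layer to produce draft logits. All sizes and module choices here are placeholders.

```python
import torch
import torch.nn as nn

class Eagle3StyleDraftHead(nn.Module):
    """Illustrative Eagle3-style draft head (placeholder sizes/modules)."""

    def __init__(self, hidden_size: int = 1024, vocab_size: int = 32000):
        super().__init__()
        # Fuse low-, mid-, and high-layer features from the target model.
        self.fuse = nn.Linear(3 * hidden_size, hidden_size)
        # A single small transformer layer stands in for the draft decoder.
        self.decoder = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=8, batch_first=True
        )
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, low, mid, high):
        # low/mid/high: (batch, seq, hidden) activations taken from three
        # different layers of the target model during its forward pass.
        h = self.fuse(torch.cat([low, mid, high], dim=-1))
        h = self.decoder(h)
        return self.lm_head(h)  # logits used to propose draft tokens

# At draft time, keeping the top-k candidates at each step (rather than
# just the argmax) yields a tree of continuations that the target model
# can verify in one batched pass.
```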

Chandra Improvements

To support faster inference in our API, we trained our own Eagle3 draft model for Chandra. In our benchmarks, Eagle3 increased throughput by 40%, reduced p50 latency by 25%, and reduced p99 latency by 3×.

In short: the Chandra API is faster across both average and tail workloads.

Using Chandra

You can try Chandra in a number of ways:

  • The open-source release on Hugging Face
  • Our hosted API

If you’re over the revenue limit for the open-source version, or you want deeper support, we can help you get started. Reach out to us at [email protected] for more information.
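
For the open-source route, here is a minimal sketch of running the model locally with Hugging Face transformers. The model id, prompt, and processor interface below are illustrative assumptions, so check the model card for the exact usage.

```python
# A minimal sketch, not official usage: the model id, prompt format, and
# processor interface are assumptions; see the Hugging Face model card.
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

MODEL_ID = "datalab-to/chandra"  # hypothetical id; check Hugging Face

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(MODEL_ID, device_map="auto")

image = Image.open("page.png")  # a scanned page to OCR
inputs = processor(
    images=image,
    text="Convert this page to markdown.",  # hypothetical prompt
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```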

What’s Next

We are continuing to improve Chandra across several dimensions:

  • Broader language support
  • Better handling of documents with visually rich components like charts
  • Faster and more efficient inference, building on our Eagle3 work

We'd also like to shout out David Wang, Jason Mancuso, and the Modal team for their help deploying Chandra with Eagle3. Using Modal, we were able to run our training and speculative decoding work without having to worry about scale, performance, or GPU capacity.

Reach out to us at [email protected] with any questions! You can also find me on Twitter for anything ML-related.
