Your tailored LLM Router, leveled up by Data Clustering and Caching

Faster response times, lower inference costs, enhanced data visibility



Instantly decode and route each prompt to the optimal model

Ensuring optimal performance at minimal cost

Cluster-Based LLM Router

Unlike traditional LLM benchmarks that rely on sample data, our Cluster-based LLM Router evaluates performance and cost using your actual prompts in real time. It begins by identifying distinct use cases in your prompts and grouping them into semantically similar clusters. The system then assesses each candidate LLM’s output quality against your current model, cluster by cluster. Once trained on your data, the Router sends each query to the model that best balances performance and cost. This approach gives you full transparency into how our Router evaluates and directs queries.
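As a rough illustration of the routing step described above, here is a minimal Python sketch. It assumes each cluster already has a per-model quality score (measured against the current model's output) and a per-query cost. The model names, cluster names, numbers, and the `min_quality` threshold are all hypothetical and are not Aguru's actual API or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelStats:
    quality: float         # mean quality on this cluster vs. the current model (0..1)
    cost_per_query: float  # average cost of one query, in any consistent unit


# Hypothetical per-cluster evaluation results (illustrative numbers only).
STATS = {
    "billing_questions": {
        "large-model": ModelStats(1.00, 0.030),
        "medium-model": ModelStats(0.95, 0.010),
        "small-model": ModelStats(0.82, 0.002),
    },
    "code_generation": {
        "large-model": ModelStats(1.00, 0.030),
        "medium-model": ModelStats(0.84, 0.010),
        "small-model": ModelStats(0.61, 0.002),
    },
}


def pick_model(cluster_stats, min_quality, baseline):
    """Return the cheapest model whose quality meets the threshold."""
    eligible = {n: s for n, s in cluster_stats.items() if s.quality >= min_quality}
    if not eligible:
        return baseline  # nothing is good enough: keep the current model
    return min(eligible, key=lambda n: eligible[n].cost_per_query)


def route(cluster, min_quality=0.9, baseline="large-model"):
    """Route a prompt (already assigned to a cluster) to a model."""
    stats = STATS.get(cluster)
    if stats is None:  # unknown or noise cluster: stay on the current model
        return baseline
    return pick_model(stats, min_quality, baseline)


print(route("billing_questions"))  # a cheaper model clears the threshold here
print(route("code_generation"))    # only the baseline clears it here
```

Lowering `min_quality` trades output quality for cost: with a threshold of 0.8, the billing cluster in this toy data would go to the cheapest model instead.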

Insightful Data Clustering & Visualization

Our clustering functionality transforms large, unstructured data into semantically meaningful clusters, giving you deep insights into your use cases. These clusters are integral to our LLM Router. It determines the semantic density of your prompts and places semantically non-conforming inputs in a distinct ‘noise’ group. This understanding of your data is crucial for accurate assessments of LLM output quality. Each cluster comes with key metrics such as cost, the performance score of each LLM, cluster quality, and inter-cluster similarity. These insights let you quickly assess LLM performance and cost per cluster and make informed decisions when selecting the best-fit model for your actual prompts.
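To make the idea of density-based clusters with a separate noise group more concrete, here is a minimal DBSCAN-style sketch in Python. It is only an illustration: the source does not say which clustering algorithm Aguru uses, and the toy vectors, `min_similarity`, and `min_points` values are assumptions. In practice the vectors would be prompt embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def cluster_prompts(vectors, min_similarity=0.9, min_points=2):
    """Minimal DBSCAN-style clustering; returns one label per vector, -1 = noise."""
    n = len(vectors)
    # Neighbors of i: every vector (including i itself) that is similar enough.
    neighbors = [
        [j for j in range(n) if cosine(vectors[i], vectors[j]) >= min_similarity]
        for i in range(n)
    ]
    labels = [None] * n
    next_label = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_points:
            labels[i] = -1  # too isolated to start a cluster: mark as noise
            continue
        labels[i] = next_label
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = next_label  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = next_label
            if len(neighbors[j]) >= min_points:
                queue.extend(neighbors[j])  # dense point: keep growing the cluster
        next_label += 1
    return labels


# Toy 3-d "embeddings": two tight groups and one outlier.
toy = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]]
print(cluster_prompts(toy))
```

With these toy vectors, the first two and the next two points form separate clusters, and the last point lands in the noise group because nothing else is similar to it.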


Efficient LLM Caching

Alongside our LLM Router, our LLM Caching solution cuts unnecessary LLM spend for AI applications that frequently handle repetitive queries. It uses semantic search to reuse past LLM outputs for new prompts that are similar, and lets you define how similar a new prompt must be to a cached one, so you control the balance between quality and cost. This also reduces compute usage, speeds up response times, and helps you stay within throughput and requests-per-second (RPS) limits.
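The caching behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not Aguru's implementation: the `SemanticCache` class, the linear scan, and the toy two-dimensional vectors are assumptions, and a real system would use an embedding model and a vector index.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class SemanticCache:
    """Hypothetical cache that reuses a response when a prompt is similar enough."""

    def __init__(self, min_similarity=0.95):
        self.min_similarity = min_similarity  # user-defined similarity threshold
        self._entries = []                    # list of (embedding, response) pairs

    def store(self, embedding, response):
        self._entries.append((embedding, response))

    def lookup(self, embedding):
        """Return the best cached response, or None if nothing is similar enough."""
        best_response, best_score = None, -1.0
        for cached_embedding, response in self._entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_response, best_score = response, score
        return best_response if best_score >= self.min_similarity else None


cache = SemanticCache(min_similarity=0.95)
cache.store([1.0, 0.0], "You can reset your password from the login page.")
print(cache.lookup([0.99, 0.05]))  # near-duplicate prompt: cached answer is reused
print(cache.lookup([0.0, 1.0]))    # unrelated prompt: cache miss
```

A higher `min_similarity` means fewer cache hits but a lower risk of returning an answer that does not quite fit the new prompt.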


See how it works


Curious about how our solution will work for your AI applications?


Frequently Asked Questions

How do I implement Aguru?

Aguru connects to your LLM through an API and is compatible with Python and Node.js environments. It’s built for fast, easy implementation. Once you’ve created your trial account, you’ll find a step-by-step user guide that takes you through the integration in just a few clicks.

Of course. You have 3 options:
1. Set up a demo: We’ll schedule an online demo to give you a thorough walkthrough of Aguru’s LLM Router, Caching, and Clustering, and answer any of your questions.
2. Use observation mode: Deactivating LLM Router, Caching, or both puts the deactivated feature(s) in observation mode. You’ll see how your new queries would be answered from cache, and/or how the answers from other LLMs compare to your original LLM, without the feature(s) affecting your new queries.
3. Use historical data: You can upload your historical dataset into Aguru to see how Aguru’s LLM Router and Caching work, instead of applying the functionality to new prompts. This feature isn’t activated in trial accounts by default, but can be added quickly. If you prefer this option, contact us.

We use BERTScore to measure the output quality for each query. In our LLM Router, you’ll see each LLM’s score for the same query, with 1 being the highest performance. For more details, check out this blog post.
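For readers curious why 1 is the top score, the greedy-matching idea behind BERTScore can be sketched in Python with toy vectors. This is a simplified stand-in, not the BERTScore library: real BERTScore uses contextual token embeddings from a BERT-style model, while the hand-written vectors and the `greedy_match_f1` helper below are assumptions for illustration. Treating your current model's output as the reference is our reading of the Router description, not a confirmed detail.

```python
import math


def _unit(v):
    """Scale a non-zero vector to length 1 so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_match_f1(candidate, reference):
    """F1 from greedy token matching over two lists of token embedding vectors."""
    cand = [_unit(v) for v in candidate]
    ref = [_unit(v) for v in reference]
    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(max(_dot(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(_dot(r, c) for c in cand) for r in ref) / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = [[1.0, 0.0], [0.0, 1.0]]  # toy token vectors for the current model's output
same = [[1.0, 0.0], [0.0, 1.0]]       # another model produced an equivalent output
partial = [[1.0, 0.0]]                # another model covered only half of the content

print(greedy_match_f1(same, reference))
print(greedy_match_f1(partial, reference))
```

An output whose tokens all have a perfect match in the reference, and vice versa, scores 1. Missing or unrelated content lowers recall or precision and pulls the score down.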

When LLM Router and Caching are both activated, Aguru first compares a new prompt to past ones to see if a past response can be reused. If not, it passes the new query to LLM Router, which routes it to a more cost-effective model based on your quality and cost tradeoff threshold. Combining LLM Router and Caching delivers greater cost savings for businesses that receive many semantically repetitive queries and a large volume of queries in general.
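The cache-first, route-second order described above could look roughly like this in Python. The `answer` function and its four callables are hypothetical placeholders rather than Aguru's API, and storing each new response back in the cache is an assumption the source implies but does not state.

```python
def answer(prompt, cache_lookup, route, call_model, cache_store):
    """Return (response, source), where source is 'cache' or the chosen model name."""
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached, "cache"        # a past response is reused, no LLM call
    model = route(prompt)             # otherwise the router picks a model
    response = call_model(model, prompt)
    cache_store(prompt, response)     # assumption: new responses are cached for reuse
    return response, model


# Toy stand-ins: an exact-match dict as the cache and a router that always
# returns the same hypothetical model name.
store = {}
calls = []


def fake_call_model(model, prompt):
    calls.append(model)
    return f"{model} says: {prompt}"


first = answer("hello", store.get, lambda p: "small-model", fake_call_model, store.__setitem__)
second = answer("hello", store.get, lambda p: "small-model", fake_call_model, store.__setitem__)
print(first)
print(second)
```

In this toy run the first call goes through the router and the model, and the repeated prompt is then served from the cache without a second model call.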

Of course you can. All three features are included in the trial, and you decide which ones to experiment with. It only takes a click to activate or deactivate LLM Router or Caching. Clustering is always active, so that you have visibility into how users interact with your app.

We’re flexible with the trial length. Our primary goal is to build a solution that truly adds value to your business, so you can take the time you need to test it and share your feedback on the product and the future features you’ll need.


Our Team

Designed and built by passionate AI engineers

Backed by technology business leaders

Oleg Smirnov

Product & Engineering

Derek O’Carroll

Funding Strategy, Operations & Investment

Nick Shaw

GTM Strategy & Revenue


Jacqui Coombs

Business Operations