With the help of our first customer, we created 14 hard questions about their codebase, covering architecture, algorithms, data flow, state management, and feature implementation. We tested ChromeBird's two question-answering modes against two agentic AI tools, Claude Code and Windsurf, as well as a “classic” search tool, Sourcegraph. We measured the factors that matter: response time, technical depth, and whether developers can actually use the information.
NB: ChromeBird Deep and ChromeBird Stream both run solely on Gemini 2.0 Flash!
Here's what each metric means:
Average Response Time: How long each agent takes to answer (in seconds)
Technical Depth Score (1-5): How specific the answer is - does it cite file paths, method names, and line numbers? (5 = very specific)
Actionability (1-5): How immediately useful the answer is for coding - can you start implementing right away? (5 = ready to code)
High-Level Win Rate: Percentage of times users preferred this agent for quick understanding and decisions
Deep-Dive Win Rate: Percentage of times users preferred this agent for actual implementation work (the arithmetic is sketched right after this list)
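To be transparent about the arithmetic, a win rate here is simply the share of the 14 questions on which users preferred a given agent. A minimal sketch (the vote split among the non-winning tools is hypothetical, for illustration only):

```python
# Illustrative tally: a win rate is preferred-agent votes / total questions.
from collections import Counter

def win_rates(preferences: list[str]) -> dict[str, int]:
    """`preferences` lists which agent users preferred on each question."""
    total = len(preferences)
    return {agent: round(100 * n / total) for agent, n in Counter(preferences).items()}

# ChromeBird Stream was preferred on 10 of 14 high-level questions;
# the 3/1 split among the other tools below is made up for the example.
votes = ["Stream"] * 10 + ["Claude Code"] * 3 + ["Windsurf"]
print(win_rates(votes))  # {'Stream': 71, 'Claude Code': 21, 'Windsurf': 7}
```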
What's the story here?
When asking quick conceptual questions: ChromeBird Stream delivers ~10-second responses and dominates high-level scenarios, winning 71% of quick decision questions (10 out of 14). When it can't handle a complex architectural question, it's honest about its limitations and declines to answer rather than guessing. That's intentional design.
When asking in-depth questions: ChromeBird Deep takes ~146 seconds on average but achieves perfect scores on both technical depth (5.0/5) and actionability (5.0/5). It wins 79% of deep-dive scenarios (11 out of 14) because when you need to implement something, you need comprehensive, specific information.
Why does it work so well?
ChromeBird’s advantage is deep knowledge indexing. We create synthetic documents that act as a map for the LLM, representing huge sections of the codebase without overwhelming the model.
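To make that concrete, here is a minimal sketch of what building one synthetic map document could look like. Everything in it (function names, the Python-only file filter, the summarize callback) is our own illustration of the idea, not ChromeBird's actual pipeline:

```python
# Hypothetical sketch of deep knowledge indexing: one synthetic "map"
# document per module, compact enough to hand to the LLM as context.
from pathlib import Path
from typing import Callable

def build_module_map(module_dir: Path, summarize: Callable[[str], str]) -> str:
    """Condense a whole module into one synthetic document.

    `summarize` stands in for an LLM call that reduces raw source to a few
    bullets (key classes, entry points, data flow).
    """
    sections = []
    for source_file in sorted(module_dir.rglob("*.py")):
        sections.append(f"## {source_file}\n{summarize(source_file.read_text())}")
    # The map names files and responsibilities without shipping the code
    # itself, so a huge module fits in a small token budget.
    return f"# Map of {module_dir.name}\n\n" + "\n\n".join(sections)
```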
Since large codebases evolve slowly, the index can be cached and reused across queries, giving us two direct benefits: a massive drop in token costs and a substantial boost in performance.
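A sketch of how that plays out, assuming a simple content-hash cache (the storage format and names are hypothetical): only files that actually changed since the last run get re-summarized, so a stable codebase costs almost nothing to keep indexed.

```python
# Slow-moving code is cheap to index: cache each file's summary by content
# hash and re-summarize only the files that actually changed.
import hashlib
import json
from pathlib import Path
from typing import Callable

def refresh_index(repo: Path, cache_path: Path,
                  summarize: Callable[[str], str]) -> dict:
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    index = {}
    for f in sorted(repo.rglob("*.py")):
        digest = hashlib.sha256(f.read_bytes()).hexdigest()
        cached = cache.get(str(f))
        if cached and cached["hash"] == digest:
            index[str(f)] = cached  # unchanged file: reuse summary, spend no tokens
        else:
            index[str(f)] = {"hash": digest, "summary": summarize(f.read_text())}
    cache_path.write_text(json.dumps(index))
    return index
```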
The result? ChromeBird running on Gemini 2.0 Flash already outperforms Anthropic's Claude 3.7 Sonnet, despite the large capability gap between the two base models.