Check out my report, the website, and the GitHub repo.
Knowledge-enhanced retrieval (and NLP in general) has come a long way, but it is still mostly used for professional use cases like coding and customer service. That’s why Mitra Labs developed Coco, an agentic RAG system with access to recordings of conversations captured by a small wearable. It can act as a friend, a “social performance tracker”, a personal assistant, or whatever else you want to use it for.
I joined the team as part of my IDP at TUM; my main job was figuring out how to build the information retrieval and how to benchmark it.
Normal retrieve-then-generate RAG was out of the question, since it lacks a few capabilities we needed (a sketch of the kind of search tool this calls for follows the list):
- Perform multi-hop retrieval to answer complex questions (“What were my New Year’s resolutions and am I sticking to them?”).
- Filter chunks by metadata like time (“What tasks have I left from last week?”) or speakers (“What did I discuss with Alice?”).
- Retrieve chunks based on language emotion (“In which situations did I sound nervous?”).
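To make these requirements concrete, here is a minimal sketch of the kind of filtered chunk search they imply. The data model and function are my own illustration, assuming precomputed embeddings and per-chunk metadata; this is not Coco’s actual implementation:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Chunk:
    text: str
    start: datetime         # when this part of the conversation was recorded
    speakers: list[str]     # diarized speaker labels
    emotion: str            # e.g. "neutral", "nervous", "excited"
    embedding: list[float]  # precomputed text embedding


def search_chunks(
    chunks: list[Chunk],
    query_embedding: list[float],
    after: datetime | None = None,
    before: datetime | None = None,
    speaker: str | None = None,
    emotion: str | None = None,
    top_k: int = 5,
) -> list[Chunk]:
    """Vector search restricted by the metadata filters plain RAG cannot express."""
    def score(c: Chunk) -> float:
        # Dot product as a stand-in for whatever similarity the vector store uses.
        return sum(x * y for x, y in zip(c.embedding, query_embedding))

    candidates = [
        c for c in chunks
        if (after is None or c.start >= after)
        and (before is None or c.start <= before)
        and (speaker is None or speaker in c.speakers)
        and (emotion is None or c.emotion == emotion)
    ]
    return sorted(candidates, key=score, reverse=True)[:top_k]
```

A plain retrieve-then-generate pipeline would call something like this exactly once, with the user’s question and no filters; answering the questions above requires choosing filters and chaining multiple such calls.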
So instead, we built a tool-calling agent that can dynamically query the chunk database as it sees fit.
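The agent loop can look roughly like the following, assuming an OpenAI-compatible chat endpoint (which also covers locally served Llama models). The tool schema, system prompt, model name, and the `my_search_backend` stub are illustrative placeholders, not the real Coco code:

```python
import json

from openai import OpenAI

# Works with any OpenAI-compatible endpoint, e.g. a locally served Llama model.
client = OpenAI()

TOOLS = [{
    "type": "function",
    "function": {
        "name": "search_chunks",
        "description": "Search recorded conversation chunks, optionally filtered "
                       "by time window, speaker, or emotion.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "after": {"type": "string", "description": "ISO timestamp lower bound"},
                "before": {"type": "string", "description": "ISO timestamp upper bound"},
                "speaker": {"type": "string"},
                "emotion": {"type": "string"},
            },
            "required": ["query"],
        },
    },
}]


def my_search_backend(**filters) -> list[str]:
    """Placeholder: wire this up to search_chunks from the previous sketch."""
    return [f"(no chunks indexed yet; got filters {filters})"]


def run_agent(question: str, max_steps: int = 5) -> str:
    messages = [
        {"role": "system", "content": "You answer questions about the user's recorded "
                                      "conversations. Call search_chunks as often as needed."},
        {"role": "user", "content": question},
    ]
    for _ in range(max_steps):
        response = client.chat.completions.create(
            model="llama-70b",  # model name depends on your serving setup
            messages=messages,
            tools=TOOLS,
        )
        message = response.choices[0].message
        if not message.tool_calls:  # no more retrieval requested: this is the final answer
            return message.content
        messages.append(message)
        for call in message.tool_calls:  # run each requested search and feed the results back
            args = json.loads(call.function.arguments)
            results = my_search_backend(**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(results)})
    return "Could not answer within the step budget."
```

The key difference to the baseline is that the model, not the pipeline, decides how many searches to run and with which filters, which is what makes multi-hop questions possible.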
To measure how well Coco works, I created the Mitra Dataset, a synthetic German conversational QA benchmark focusing on these capabilities.
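For a sense of what such a benchmark item can look like, here is a purely hypothetical example; the field names, question, and IDs are invented for illustration and are not taken from the actual dataset:

```python
# Hypothetical benchmark item; fields, question, and IDs are invented for illustration.
example_item = {
    "question": "Was habe ich letzte Woche mit Alice besprochen?",  # "What did I discuss with Alice last week?"
    "reference_answer": "Ihr habt die Planung eures Wanderurlaubs besprochen.",
    "capability": "metadata_filtering",  # or "multi_hop", "emotion_retrieval"
    "gold_filters": {"speaker": "Alice", "after": "2024-05-06", "before": "2024-05-12"},
    "gold_chunk_ids": ["conv_0042_chunk_3", "conv_0042_chunk_4"],
}
```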
Our experiments showed that the Coco agent outperforms a RAG baseline even with small, locally runnable tool-calling models like Llama 70B, and the agent’s margin grows as you move to larger, smarter models. In other words, Coco’s intelligence is currently capped mainly by the tool-calling capabilities of the small models we can run locally.