Efficient Latent Communication in Multi-Agent Systems
Lawrence Liu
Final project for 6.7960, MIT
Outline

Summary

Introduction

Methodology

Results

Discussion

Code

Summary

Multi-agent systems (MAS) typically communicate through text, forcing the extra step of decoding latent representations into tokens before passing them to the next agent. The LatentMAS framework, proposed by Zou et al. 2025 [1], instead shares the transformer's key-value (KV) caches directly, providing significant speedups, accuracy gains, and up to 80% less token usage. However, this framework introduces a new challenge: the KV cache grows linearly with the number of agents in the system. This work explores k-nearest-neighbor (kNN) retrieval over cached keys, from the Memorizing Transformers paper by Wu et al. 2022 [6], as a potential mechanism to limit KV cache size. In the end, we are able to trim KV cache memory by 40% and speed up answer generation by 29% while maintaining near-full LatentMAS accuracy. We walk through the full process of experimentation and provide intuition for our results. Ultimately, these findings suggest that latent communication contains structured, layer-dependent information that can be selectively compressed without significantly compromising performance, providing an avenue for further development of efficient latent MAS design.

Introduction

The "complaint tablet to Ea-nāṣir" is considered to be the oldest written complaint, delivered from a customer to a merchant in 1750 BCE [5]. Nearly 4000 years later, humans and machines are still communicating through text; the fact that you are learning through reading this blog is a prime example. Even in the realm of multi agent systems (MAS), systems of AI agents that work together on problems, recent research and innovations still utilize tokens (short pieces of text) as the primary form of communication between one machine and another [2][3][4]. Research is often focused on agent behavior and structure, with far less attention paid to the underlying mode of communication between agents.

The LatentMAS framework introduced in Zou et al. 2025 [1] proposes an alternative means of communication between agents in a MAS. Instead of token-level communication, agents interact by directly sharing activations, stored and transferred through KV caches. Without the need to map the final activations to the constrained text space, the design achieves lossless information exchange while skipping the decoding step entirely [1]. The process requires no extra training, allowing it to sit on top of essentially any LLM. Compared to a baseline text-based MAS on common benchmarks, LatentMAS can achieve 4x speedups and 80% reductions in token usage while maintaining, and often increasing, accuracy across benchmarks. There is clearly high expressive power in the latent states.

While the performance is certainly promising, the authors did not address a potentially important issue: as more agents are added, in sequence or in parallel, the memory needed to store the KV caches grows linearly. These caches grow with each step, increasing memory demands and potentially degrading performance in resource-constrained environments. Interestingly, a conceptually parallel paper addressing memory efficiency was proposed three years earlier: the Memorizing Transformers framework [6]. The method gives transformers the ability to memorize information by adding a non-differentiable external memory that stores KV representations. The information is then retrieved through an efficient approximate kNN search, enabling scalable memory usage and strong long-context performance.

Bridging these two paradigms may give a natural answer to the issue of memory management in the LatentMAS framework. Specifically, might kNN-based retrieval offer a principled method for trimming LatentMAS's KV cache, preserving only the most relevant information? This blog investigates this intersection, integrating kNN memory lookup into the latent collaboration framework with the goal of significantly reducing memory and computational overhead while preserving accuracy.

Methodology

Theoretical Framework

This section first summarizes the architecture of the LatentMAS system, then describes the novel kNN process implemented on top of it.

Figure 1: Single LatentMAS agent architecture
In a traditional transformer, let $f_\theta(\cdot)$ represent the function the transformer applies to its inputs, parameterized by $\theta$. For a given input $X = (x_1, x_2, \dots, x_a)$, the model first encodes each input token into $\mathbb{R}^{d_h}$, where $d_h$ is the model's hidden dimension. Denote the resulting embedded input as $E = (e_1, e_2, \dots, e_a)$. After passing $E$ through $f_\theta(\cdot)$'s $L$ hidden layers, we obtain a sequence of hidden states $H_1 = (h_1, h_2, \dots, h_a)$ with each $h_i \in \mathbb{R}^{d_h}$. To determine the next token,

$$f_\theta(x_{a+1} \mid X) = \mathrm{softmax}(h_a W_{\text{out}})$$

where $W_{\text{out}}$ is the matrix mapping from $\mathbb{R}^{d_h}$ to the vocabulary space.

In the LatentMAS method, seen in Figure 1, $X$ is still encoded and $H_1$ is generated in full. However, instead of using $h_a$ to generate $x_{a+1}$, $h_a \in \mathbb{R}^{d_h}$ is fed directly back into the model as if it were an encoded input token. This generates $h_{a+1}$, which can be fed back into the model again. This autoregressive process is repeated for $p$ "latent steps", producing $H_2 = (h_{a+1}, h_{a+2}, \dots, h_{a+p})$. $H_2$ can be thought of as the model's latent hidden thoughts.
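To make the latent-step loop concrete, here is a minimal sketch using a Hugging Face causal LM. It is illustrative rather than the authors' implementation: the prompt, variable names, and the exact way the last hidden state is re-injected (via `inputs_embeds`) are assumptions on my part.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"                      # model used in these experiments
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

prompt = "Draft a step-by-step plan for the problem below."   # placeholder prompt
enc = tok(prompt, return_tensors="pt")

latent_steps = 10                                  # p latent steps (paper default)
embeds = model.get_input_embeddings()(enc.input_ids)   # E = (e_1, ..., e_a)
past = None
latent_thoughts = []                               # will hold H_2 = (h_{a+1}, ..., h_{a+p})

with torch.no_grad():
    for _ in range(latent_steps):
        out = model(inputs_embeds=embeds,
                    past_key_values=past,
                    use_cache=True,
                    output_hidden_states=True)
        past = out.past_key_values                 # KV cache grows by one position per step
        h_last = out.hidden_states[-1][:, -1:, :]  # final-layer state at the last position
        latent_thoughts.append(h_last)
        embeds = h_last                            # feed h back in as if it were a token

# `past` now covers the prompt plus the latent thoughts and plays the role of S_i;
# ordinary decoding would instead apply softmax over h_a @ W_out (the lm_head).
```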

Figure 2: Multi Agent LatentMAS model
Now let us treat the aforementioned process as a single agent $A_i$ in a series of agents $A_1, A_2, \dots, A_d$, as seen in Figure 2. This agent could have been tasked with generating a plan, critiquing the plan, refining the plan, and so on. Regardless of its role, the agent's hidden thoughts $H_2$ contain valuable information that can be used by the next agent. In a more traditional text-based MAS, the text output of $A_i$ would be included in the input $X$ of $A_{i+1}$. Working in the latent hidden representation space, LatentMAS instead prepends the key and value caches of $A_i$ to those of $A_{i+1}$. In more formal terms, we have a latent hidden KV memory of

$$S_i = \{ (K_l^{A_i}, V_l^{A_i}) \mid l \in \{1, 2, \dots, L\} \}$$

where

$$K_l^{A_i} = [K_{l,1}^{A_i}, K_{l,2}^{A_i}, \dots, K_{l,a+p}^{A_i}]$$

and

$$V_l^{A_i} = [V_{l,1}^{A_i}, V_{l,2}^{A_i}, \dots, V_{l,a+p}^{A_i}]$$

The latent hidden memory of $A_i$ contains KV caches from all layers, covering every position from the input text through the last latent hidden token. Agent $A_i$ gives agent $A_{i+1}$ the entirety of $S_i$ by prepending the key and value caches to the corresponding layers of $A_{i+1}$ before any of $A_{i+1}$'s own computation. In a series of many agents, this process is repeated for every $i$ up to the last agent, stacking KV caches layer by layer. For benchmark verification and human interpretability, the final agent decodes directly to text instead of producing a latent hidden state, and it does so with full latent context from all agents that came before it.
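A minimal sketch of the layer-wise prepending, assuming both caches are stored as one (K, V) pair per layer (the legacy tuple format); the function name and format are my assumptions, not the LatentMAS codebase.

```python
import torch

def prepend_kv(memory, cache):
    """Layer-wise prepend agent A_i's latent memory S_i onto agent A_{i+1}'s KV cache.

    Both arguments are assumed to be tuples with one (K, V) pair per layer,
    each tensor of shape (batch, num_heads, seq_len, head_dim).
    """
    merged = []
    for (k_mem, v_mem), (k_new, v_new) in zip(memory, cache):
        merged.append((torch.cat([k_mem, k_new], dim=2),   # concatenate along seq_len
                       torch.cat([v_mem, v_new], dim=2)))
    return tuple(merged)
```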

Clearly, LatentMAS transfers the KV cache of all layers and all tokens from agent to agent, causing the cache to grow linearly with the number of agents. Inspired by the Memorizing Transformers paper, we introduce k-nearest-neighbor retrieval to limit the size of the KV cache. After $A_i$ generates all of its latent steps, and before the KV cache is transferred to $A_{i+1}$, we add a kNN filtering step. The input query text of $A_{i+1}$ is embedded and averaged across all input tokens to produce a single vector $e_{\text{avg}} \in \mathbb{R}^{d_h}$. We then compute the cosine similarity between $e_{\text{avg}}$ and each key in the middlemost layer of $A_i$. The most similar keys (how many is discussed in the Implementation section), together with their corresponding values, are preserved in their original order and prepended layer-wise to $A_{i+1}$'s KV cache.
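A sketch of the filtering step is below. It assumes the same per-layer (K, V) format as above, flattens keys across heads so they can be compared against the averaged prompt embedding, and keeps a fixed number k of positions; the alignment between key space and embedding space in the real implementation may differ, and the function name is hypothetical.

```python
import torch
import torch.nn.functional as F

def knn_filter_memory(memory, e_avg, k):
    """Keep the k cache positions whose middle-layer keys are most similar to e_avg.

    memory: tuple of per-layer (K, V) pairs, each (1, num_heads, seq_len, head_dim)
    e_avg:  averaged prompt embedding, assumed comparable with a head-flattened key
    """
    mid = len(memory) // 2                                    # middlemost layer
    k_mid, _ = memory[mid]
    b, h, s, d = k_mid.shape
    flat_keys = k_mid.permute(0, 2, 1, 3).reshape(b, s, h * d)            # (1, seq, heads*dim)
    sims = F.cosine_similarity(flat_keys, e_avg.view(1, 1, -1), dim=-1)   # (1, seq)
    keep = sims[0].topk(k).indices.sort().values              # most similar, original order
    return tuple((K[:, :, keep, :], V[:, :, keep, :]) for K, V in memory)
```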

Figure 3: kNN filtering process in LatentMAS
Implementation

To test the efficacy of kNN-based KV cache filtering in LatentMAS, we use the GPQA-Diamond dataset from the LatentMAS paper [7]. GPQA-Diamond is the most difficult of the GPQA benchmarks, featuring 198 graduate-level multiple-choice questions written by experts in physics, chemistry, and biology. The dataset requires strong cross-disciplinary reasoning, multi-step reasoning, and depth of knowledge in the sciences. While the original LatentMAS paper also tests on several simpler benchmarks, a difficult reasoning benchmark provides the clearest feedback on whether a strategy is effective: small mistakes lead to wrong answers, requiring both the model and the KV cache pruning strategy to be prudent.

There were a number of factors we kept consistent as controls from the original LatentMAS paper. First, we kept the same sequential MAS agent roles: a planner agent generates a plan, a critic agent critiques the plan and generates feedback, a refiner refines the plan based on the critic's comments, and finally the solver takes the updated plan and produces a final answer. This framework was first introduced in Zhang et al. 2024 [8]. Next, we ran the same baseline models as benchmarks:

  1. "Baseline": Qwen3-4B model evaluating the problem solo, with no sequential MAS
  2. "TextMAS": Sequential system of four Qwen3-4B models playing the roles of planner, critic, refiner, and solver. Communication from one model to another is done through text.
  3. "LatentMAS": Sequential system of four Qwen3-4B models playing the roles of planner, critic, refiner, and solver. Communication between one model to another is done latently, as described in the theoretical framework section.

There were also a number of hyperparameters kept constant across all runs of the baselines and the novel kNN-pruned LatentMAS: max samples = 100, max new tokens = 8096, latent steps = 10, temperature = 0.6, top-p = 0.95, and the agent-specific prompts. These were the defaults used in the original LatentMAS paper. Computation was done on NVIDIA RTX 4090 GPUs with the Hugging Face implementation of the Qwen3-4B model.
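For reference, here is that shared configuration restated as a single Python dict; the key names are illustrative and not taken from the LatentMAS codebase.

```python
# Shared run configuration (defaults from the LatentMAS paper); key names are illustrative.
CONFIG = {
    "model_name": "Qwen/Qwen3-4B",
    "max_samples": 100,
    "max_new_tokens": 8096,
    "latent_steps": 10,
    "temperature": 0.6,
    "top_p": 0.95,
    "agent_roles": ["planner", "critic", "refiner", "solver"],
}
```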

For the kNN, while traditional kNN implies a constant number of neighbors k, the variability in input text length makes it difficult to choose one constant k for all agents. The final implementation therefore uses a percentage-based k, selecting the top k% most similar KV pairs.
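A tiny helper illustrating the percentage-based selection on top of the filtering sketch above; the names are hypothetical.

```python
def k_from_percentage(seq_len: int, keep_pct: float) -> int:
    """Convert a keep-percentage (e.g. 0.9 for "Top 90%") into a neighbor count."""
    return max(1, round(seq_len * keep_pct))

# e.g. keep 90% of a 443-position cache:
# knn_filter_memory(memory, e_avg, k=k_from_percentage(443, 0.9))
```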

Results

Initial Results

Figure 4 reports the accuracy and time per response for each of the baseline models, as well as for variations of the kNN-LatentMAS model. The Baseline model sits at 21% accuracy, while TextMAS has the highest accuracy at 46%. However, TextMAS takes significantly longer due to the inference needed to generate text between each pair of agents. The LatentMAS model sits at 44% accuracy. Among the kNN-LatentMAS models, the model keeping the top 90% of KV pairs had the best accuracy at 45%, beating the Baseline and ending up just shy of TextMAS. The rest of the kNN-LatentMAS models did not fare as well: accuracy falls off very quickly, dropping below the expected random-guess accuracy of 25% by the k = 80% mark. Within the kNN-LatentMAS models, it is interesting that time per question initially falls quickly as more of the cache is pruned, but then rises again near k = 50%. Checking the logs, one potential explanation is that for "middle" k values (80%, 70%), the final solver agent thinks minimally before answering due to some perceived confidence, whereas as k continues to decrease the model starts to ramble incoherently, taking a long time to decide it is done and ending with the wrong answer. Based on these initial experiments, it seems that cutting large amounts of the KV cache is not possible without significant losses in accuracy.

Figure 4: Initial results showing accuracy of the Baseline, TextMAS, LatentMAS, and kNN-LatentMAS models. "Top xx%" refers to kNN-LatentMAS models keeping only the top xx% most similar KV pairs.
Bottom and Random K Ablation Tests

While top-k is what Memorizing Transformers (Wu et al. 2022) [6] suggests, it is not the only choice. In fact, without baselines within the kNN-LatentMAS family, the accuracy numbers do not mean much on their own. We repeat the kNN-LatentMAS evaluation, this time introducing two new controls:

  1. "Bottom k": k% of KV pairs that are the LEAST similar to the query vector
  2. "Random k": k% of KV pairs selected at random

Similar to the Top-k models, we maintain the order of the selected KV pairs when prepending them to the next agent's cache. Figure 5 reports the results. The Bottom-k model outperforms the Top-k model at k = 90% by 2%, while the Random-k model does not do significantly worse. As k decreases, the Bottom-k and Random-k models continue to outperform Top-k. Furthermore, Bottom-k outperforms both Top-k and Random-k at every level of k. This may seem counterintuitive at first, but it may stem from the prompting format. The prompts the LatentMAS paper uses are all short, formal prompts that first give context and then ask the agent to perform a specific task. Given that they share similar formatting, tone, and content, the keys least similar to the query text may carry the most "novel" information. In the Top-k model, we may have been inadvertently selecting KV pairs that were mostly about the prompt, cutting out the more useful latent-space content.
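The three strategies differ only in which cache positions survive; a small sketch of the selection modes (function name hypothetical):

```python
import torch

def select_indices(sims: torch.Tensor, k: int, mode: str) -> torch.Tensor:
    """Pick k cache positions from a (seq_len,) vector of cosine similarities."""
    if mode == "top":          # most similar to the query embedding
        idx = sims.topk(k).indices
    elif mode == "bottom":     # least similar: tends to favor the latent thoughts
        idx = (-sims).topk(k).indices
    elif mode == "random":     # uniform random control
        idx = torch.randperm(sims.shape[0])[:k]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return idx.sort().values   # keep surviving KV pairs in their original order
```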

Figure 5: Bottom-k vs Top-k vs Random-k ablation comparison
Latent Space Visualization

Let's look at two visualizations of the latent space. First, Figure 6 below shows the cosine similarity between a subsequent agent's input text embedding and the prior agent's KV cache. Each row represents a layer of the transformer, and each column a cached key. Naturally, as the KV caches are additively transferred from one model to the next, the number of cached keys increases: in this particular case, the planner passes 201 keys to the critic, the critic adds 242 new keys and passes 443 to the refiner, and the refiner adds 228 new keys and makes the final pass to the solver. It is also interesting that, in general, the similarity between the input text embedding and a cached key varies more across layers than within any particular layer. Most layers are either almost entirely positive cosine similarity (green) or almost entirely negative (orange); very few layers contain both. Additionally, although the trend is weaker, there is generally a higher concentration of orange in the top right of each plot (higher transformer layers, more recent cached keys). This is logical, as higher layers in transformers move away from the universal, general-purpose features present in lower layers and toward task-specific specialization, while the more recent caches likely move beyond reasoning about the prompt and toward the reasoning itself. Combining these intuitions may explain why Bottom-k outperformed Top-k and Random-k: Bottom-k was capturing the novel, synthesized information that deviated furthest from the input text.
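A sketch of how such a heatmap could be produced, under the same assumption as before that head-flattened keys can be compared against the averaged prompt embedding:

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

def layer_similarity_heatmap(memory, e_avg):
    """Cosine similarity between e_avg and every cached key, one row per layer."""
    rows = []
    for K, _ in memory:                               # per-layer keys: (1, heads, seq, dim)
        b, h, s, d = K.shape
        flat = K.permute(0, 2, 1, 3).reshape(s, h * d)
        rows.append(F.cosine_similarity(flat, e_avg.view(1, -1), dim=-1))
    sims = torch.stack(rows).detach().float()         # (num_layers, seq_len)
    plt.imshow(sims.numpy(), aspect="auto", cmap="PiYG", vmin=-1, vmax=1)
    plt.xlabel("cached key position")
    plt.ylabel("layer")
    plt.colorbar(label="cosine similarity")
    plt.show()
    return sims
```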

Figure 6: Cosine similarity heatmaps across all layers
We can also take a deeper dive and look at the cosine similarities in the exact layer the method uses, rather than all layers, as seen in Figure 7 below. To highlight the differences in similarity, we min-max normalize the data. Two observations stand out. First, the cached keys at the end of each sequence tend to have low similarity; these are precisely the keys associated with the newly generated latent hidden thoughts. Second, most apparent in the final pass from the refiner to the solver, there are low-similarity regions that correspond to the latent hidden thoughts generated at the end of each agent's reasoning. Top-k is unlikely to reach these cached KVs because there are so many high-similarity text input tokens; Bottom-k ends up prioritizing the latent hidden thoughts.
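The min-max normalization is the standard rescaling to [0, 1]; a one-liner for completeness:

```python
def min_max_normalize(x):
    """Rescale similarities to [0, 1] so relative differences become visible."""
    return (x - x.min()) / (x.max() - x.min() + 1e-8)
```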

Figure 7: Min-max normalized similarity in the middlemost layer
Prompt Engineering Adjustment

As it stands, the prompts of the planner, critic, refiner, and solver do not share any instructions. Since prompts are embedded and used to compute similarities with the prior agent's keys, it may be beneficial, given the insights from the heatmaps, to maximize the similarity between prompts. This should maximize the relative difference between prompt-focused key caches and novel reasoning caches, so that applying bottom-k to the resulting similarities retrieves a subset of KV caches richer in reasoning and novel ideas.

We test this by stacking the input prompts in the same fashion as the KV caches: the entire prompt of the planner is included in the prompt of the critic, the entire prompt of the critic in the prompt of the refiner, and the entire prompt of the refiner in the prompt of the solver. We again test with the same hyperparameters and vary the percentage k. In Figure 8, performance of the Top-k and Random-k models is roughly unchanged across all k. However, the Bottom-k model now preserves accuracy remarkably well down to k = 60%: its accuracy at k = 60% is 35% with the new prompts, versus only 11% with the old prompts. Assuming a base accuracy of 42% for the full LatentMAS model over n = 100 questions, an accuracy of 35% has a p-value of 12%, which is not significant at the 5% level. The model achieves this with an average time per question of 86 seconds, less than the single-model Baseline, while using only 60% of the memory of full LatentMAS and running 29% faster. This is a very promising development, indicating that latent-state pruning can achieve significant efficiency gains at the cost of minimal accuracy changes.
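A sketch of the stacked-prompt construction, using placeholder role prompts (the real prompts are the agent-specific prompts from the LatentMAS paper):

```python
# Placeholder role prompts; the real prompts are the agent-specific prompts from the paper.
ROLE_PROMPTS = {
    "planner": "You are the planner. Draft a step-by-step plan for the problem.",
    "critic":  "You are the critic. Point out flaws in the plan.",
    "refiner": "You are the refiner. Revise the plan using the critique.",
    "solver":  "You are the solver. Produce the final answer.",
}

def stacked_prompts(order=("planner", "critic", "refiner", "solver")):
    """Each agent's prompt contains the full prompts of every agent before it,
    mirroring how the KV caches are stacked from agent to agent."""
    stacked, prefix = {}, ""
    for role in order:
        prefix = (prefix + "\n" + ROLE_PROMPTS[role]).strip()
        stacked[role] = prefix
    return stacked
```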
Stacking prompts increases redundancy, making the novel reasoning KVs more distinct.
Figure 8: Performance with stacked prompts across different k values

Discussion

Limitations and Further Research

Although kNN-based pruning is promising for reducing memory consumption in LatentMAS, there are limitations to the methodology we employed.

  1. The pruning mechanism relies on cosine similarity with a single averaged embedding of the input prompt. Averaging collapses the semantic structure of the prompt, removing nuance about the task at hand. Future research could explore a more expressive query, such as a set of query tokens, to provide more reliable relevance signals for KV selection.
  2. Currently, the kNN only evaluates keys from the middlemost layer of the transformer when determining similarity. As seen in Figure 6, the heatmaps suggest strong layer-dependent structure, with lower layers encoding general language structure and higher layers encoding more abstract features. An arbitrary middle layer may or may not provide the most reliable signal; a selection process that spans the full layer range may give a stronger indication of what is important to keep.
  3. We did not vary the base model (Qwen3-4B) or the benchmark (GPQA-Diamond). It is unclear how these results generalize to easier or harder benchmarks, larger or smaller models, and other reasoning tasks, and it is possible that the advantage we saw for the Bottom-k strategy with the new prompts is an artifact of the model, agent roles, or domain. Further research should expand the model selection and datasets to explore this strategy more comprehensively.

Conclusion

Despite these limitations, this research demonstrates that there are nuanced ways to prune the KV cache in latent MAS. Inspired by Memorizing Transformers (Wu et al., 2022) [6], we applied kNN-based KV cache pruning to the LatentMAS model to decrease its memory usage. A straightforward top-k selection performs poorly, losing significant accuracy even with 80% of the KV cache preserved. However, noticing that the Bottom-k kNN-LatentMAS model consistently outperformed the Top-k and Random-k models, we showed that the cached keys least similar to the input text query are the ones associated with the latent hidden-state generations. Designing new prompts to emphasize this difference yielded strong results, maintaining near-full LatentMAS accuracy while reducing memory usage by 40% and time per problem by 29%. Overall, these results lay a foundation for further research on latent communication, emphasizing the importance of understanding what the latent hidden states actually encode.
Future work could explore multi-token query representations.
Figure 9: Summary of key findings and future directions

Code

All code and instructions for running it can be found on GitHub here.

References:

[1] Zou, J., Yang, X., Qiu, R., Li, G., Tieu, K., Lu, P., ... & Yang, L. (2025). Latent Collaboration in Multi-Agent Systems. arXiv preprint arXiv:2511.20639.

[2] Zhou, H., Geng, H., Xue, X., Kang, L., Qin, Y., Wang, Z., ... & Bai, L. (2025). Reso: A reward-driven self-organizing llm-based multi-agent system for reasoning tasks. arXiv preprint arXiv:2503.02390.

[3] Jin, B., Zeng, H., Yue, Z., Yoon, J., Arik, S., Wang, D., ... & Han, J. (2025). Search-r1: Training llms to reason and leverage search engines with reinforcement learning. arXiv preprint arXiv:2503.09516.

[4] Pezeshkpour, P., Kandogan, E., Bhutani, N., Rahman, S., Mitchell, T., & Hruschka, E. (2024). Reasoning capacity in multi-agent systems: Limitations, challenges and human-centered solutions. arXiv preprint arXiv:2402.01108.

[5] Kalinauskas, N. (2015, March 10). At the British Museum, oldest recorded customer-service complaint on display. Yahoo News. https://ca.news.yahoo.com/blogs/daily-buzz/at-the-british-museum-oldest-recorded-184633671.html

[6] Wu, Y., Rabe, M. N., Hutchins, D., & Szegedy, C. (2022). Memorizing Transformers. arXiv preprint arXiv:2203.08913.

[7] Rein, D., Hou, B. L., Stickland, A. C., Petty, J., Pang, R. Y., Dirani, J., ... & Bowman, S. R. (2024, July). Gpqa: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling.

[8] Zhang, Y., Sun, R., Chen, Y., Pfister, T., Zhang, R., & Arik, S. (2024). Chain of agents: Large language models collaborating on long-context tasks. Advances in Neural Information Processing Systems, 37, 132208-132237.