Generative AI Paper Reading - Context Rot: How Increasing Input Tokens Impacts LLM Performance


Details
Join us for a paper discussion of “Context Rot: How Increasing Input Tokens Impacts LLM Performance”, which examines how input length alone degrades reliability across simple retrieval, conversational QA, and text replication tasks.
https://research.trychroma.com/context-rot
Featured Paper:
“Context Rot: How Increasing Input Tokens Impacts LLM Performance” (Hong, Troynikov, Huber, 2025)
Discussion Topics:
Unifying Findings
- Evaluation across 18 models shows non-uniform performance decline as input length grows, even with fixed task complexity.
- The standard needle-in-a-haystack (NIAH) test overstates robustness; semantically oriented tasks and distractors expose sharper drops.
Needle-in-a-Haystack Extensions
- Varying needle–question similarity from lexical overlap to purely semantic matches reveals faster degradation at lower similarity (see the sketch after this list).
- Distractors reduce accuracy; the impact is non-uniform and model-family dependent.
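To make the similarity axis concrete, here is a minimal sketch of scoring needle–question pairs. The paper quantifies similarity with embedding-based measures; as a self-contained stand-in, this uses a bag-of-words cosine, and the question/needle texts are illustrative, not the paper's actual data.

```python
import re
from collections import Counter

import numpy as np

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def bow_vector(text: str, vocab: list[str]) -> np.ndarray:
    """Bag-of-words count vector for `text` over a shared vocabulary."""
    counts = Counter(tokens(text))
    return np.array([counts[w] for w in vocab], dtype=float)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    return float(a @ b) / denom if denom else 0.0

question = "What was the best writing advice I got from my college classmate?"
needles = [
    # High lexical overlap with the question (the easy end of the axis):
    "The best writing advice I got from my college classmate was to write every week.",
    # Same meaning, little lexical overlap (the harder, semantic end):
    "A peer from university once suggested that steady weekly writing pays off.",
]
vocab = sorted(set(tokens(question)) | {w for n in needles for w in tokens(n)})
q = bow_vector(question, vocab)
for needle in needles:
    print(f"{cosine(q, bow_vector(needle, vocab)):.3f}  {needle}")
```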
Haystack Effects
- Needle–haystack topical similarity shows mixed effects; results depend on the domain pairing.
- Shuffled haystacks outperform coherent ones, indicating that input structure, not just content, influences attention under long context (a construction sketch follows below).
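The coherent-versus-shuffled comparison comes down to how the haystack is assembled. Below is a minimal sketch of building both conditions with a needle inserted at a fixed depth; the filler sentences and the needle are placeholders, not the corpora the paper actually draws on.

```python
import random

def build_haystack(sentences: list[str], needle: str, depth: float = 0.5,
                   shuffled: bool = False, seed: int = 0) -> str:
    """Insert `needle` at fractional `depth` into coherent or shuffled text."""
    body = list(sentences)
    if shuffled:
        # Same content, same length; only the local structure is destroyed.
        random.Random(seed).shuffle(body)
    idx = int(len(body) * depth)
    return " ".join(body[:idx] + [needle] + body[idx:])

sentences = [f"This is filler sentence number {i} of a long essay." for i in range(200)]
needle = "The best writing advice was to write every single week."
coherent_input = build_haystack(sentences, needle, shuffled=False)
shuffled_input = build_haystack(sentences, needle, shuffled=True)
```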
LongMemEval (Conversational QA)
- Models perform well on focused excerpts but degrade on the full ~113k-token histories, which add a retrieval burden on top of the question itself (see the sketch after this list).
- Thinking modes help but do not close the gap between focused and full inputs.
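A rough sketch of the two conditions, with hypothetical stand-in session data (the field names and the question are invented for illustration; real LongMemEval histories average around 113k tokens):

```python
def build_prompt(sessions: list[dict], question: str,
                 relevant_ids: set[int] | None = None) -> str:
    """Concatenate chat sessions (optionally only the relevant ones) plus the question."""
    if relevant_ids is not None:
        sessions = [s for s in sessions if s["id"] in relevant_ids]
    history = "\n\n".join(f"[session {s['id']}]\n{s['text']}" for s in sessions)
    return f"{history}\n\nQuestion: {question}"

sessions = [{"id": i, "text": f"... transcript of chat session {i} ..."} for i in range(300)]
question = "Which airline did I say I prefer for long-haul flights?"

# Focused condition: retrieval has been done for the model.
focused_prompt = build_prompt(sessions, question, relevant_ids={42, 137})
# Full condition: the model must retrieve the relevant turns *and* answer.
full_prompt = build_prompt(sessions, question)
```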
Repeated Words (Replication)
- Simple exact-copy tasks degrade with longer sequences; failure modes include refusals, random tokens, and off-by-position errors.
- Positional accuracy declines as the unique token appears later, and word-count under/over-generation grows with length (a task sketch follows below).
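The task itself is easy to reproduce. A minimal sketch of building and scoring one instance follows; the common/unique word pair and the lengths are illustrative defaults, not necessarily the paper's exact settings.

```python
def make_task(common: str = "apple", unique: str = "apples",
              length: int = 500, position: int = 250) -> str:
    """Prompt body: `common` repeated `length` times, with `unique` at one index."""
    words = [common] * length
    words[position] = unique
    return " ".join(words)

def score(output: str, unique: str = "apples",
          length: int = 500, position: int = 250) -> tuple[int, int | None]:
    """Return (word-count error, unique-word position error; None if it was dropped)."""
    words = output.split()
    length_error = len(words) - length  # negative: under-generation; positive: over-generation
    position_error = (words.index(unique) - position) if unique in words else None
    return length_error, position_error

prompt_body = make_task()
# Send `prompt_body` to a model with a copy-it-exactly instruction, then:
# length_err, pos_err = score(model_output)
```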
Implementation Notes
- Inputs were tested across 8 lengths and 11 needle positions at temperature 0, with budgeted reasoning where applicable (a sweep sketch follows below).
- LLM-judge alignment was tuned to >0.99 agreement with human labels for the reported tasks.
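For orientation, here is what that sweep looks like as code. The specific token lengths and the model client are assumptions for illustration; only the 8 x 11 grid shape and temperature 0 come from the paper.

```python
from itertools import product

LENGTHS = [1_000, 2_000, 4_000, 8_000, 16_000, 32_000, 64_000, 128_000]  # illustrative
DEPTHS = [i / 10 for i in range(11)]  # 11 needle positions: 0.0, 0.1, ..., 1.0

def call_model(prompt: str, temperature: float = 0.0) -> str:
    # Placeholder: swap in any chat-completion client here.
    return "<model output>"

def run_grid(build_input) -> dict:
    """Run all 8 x 11 = 88 (length, depth) conditions at temperature 0."""
    results = {}
    for length, depth in product(LENGTHS, DEPTHS):
        results[(length, depth)] = call_model(build_input(length, depth), temperature=0.0)
    return results
```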
Key Points of Interest:
- Performance declines with input length even when difficulty is held constant.
- Distractors and structural coherence amplify degradation; content and order matter.
- Benchmark design should isolate input length from task complexity; context engineering is critical.
Silicon Valley Generative AI has two meeting formats:
1. Paper Reading - Every second week we meet to discuss machine learning papers. This is a collaboration between Silicon Valley Generative AI and Boulder Data Science.
2. Talks - Once a month we meet for a presentation on a topic related to generative AI. Speakers range from industry leaders, researchers, startup founders, and subject matter experts to anyone with an interest in a topic they would like to share. Topics vary from technical to business focused: how the latest generative models work and how they can be used, applications and adoption of generative AI, demos of projects, startup pitches, or legal and ethical topics. The talks are meant to be inclusive and aimed at a more general audience than the paper readings.
If you would like to be a speaker or suggest a paper, email us at svb.ai.paper.suggestions@gmail.com or join our new Discord!
---

Every 2 weeks on Monday