Needle placements in the full sweep (top) vs. the last-2K-tokens sweep (bottom): in the last-2K setup, placement positions are aligned across different context lengths, unlike the proportion-based positioning in the full sweep.

Source publication
Preprint
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fac...

Context in source publication

Context 1
... further investigate this issue, we devise an alternative setup that focuses on analyzing the last 2K tokens instead of sweeping across the full context. We therefore align the placement positions within the last 2K tokens for all context lengths (see Figure 4). This ensures that, for a given token depth, the only changing factor in each plotline is the context length, which in turn means the model has more tokens it can attend to. ...
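The difference between the two placement schemes can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and the choice of five evenly spaced depths are assumptions made here for clarity.

```python
# Hypothetical sketch of the two needle-placement schemes.
# In the full sweep, a placement depth is a *proportion* of the whole
# context, so absolute positions drift as the context grows. In the
# last-2K setup, placements live inside the final `window` tokens, so
# the offset from the end is identical at every context length.

def full_sweep_positions(context_length, num_depths=5):
    """Proportion-based placements spread over the whole context."""
    return [round(context_length * d / (num_depths - 1))
            for d in range(num_depths)]

def last2k_positions(context_length, window=2048, num_depths=5):
    """Placements aligned within the last `window` tokens."""
    start = context_length - window
    return [start + round(window * d / (num_depths - 1))
            for d in range(num_depths)]
```

Under this sketch, a placement's distance from the end of the context is constant across context lengths in the last-2K setup, so the only varying factor at a given depth is how many preceding tokens the model can attend to; in the full sweep, that distance changes with the context length as well.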
