A Ladder Of Reasoning: Testing The Power Of Imagination In LLMs
Microsoft, Wednesday, July 23rd, 2025
Reasoning systems have emerged as a focus of research on language models (LMs), as the field moves beyond surface-level language ability to target deeper cognitive skills. Reasoning, in this context, can be defined as the ability to follow a coherent sequence of steps in order to draw logical inferences, synthesize information, and construct solutions, rather than merely recalling facts or patterns.
The distinction between a coherent reasoning process and 'mere recall' raises a core question: Given a language model, can we tell whether it is truly reasoning, or whether its performance on math, logic, and coding benchmarks still reflects only strong pattern recognition and memorization?1
Part of what makes this question difficult is the way reasoning skills are typically measured. Most contemporary methods for testing reasoning skills in LMs evaluate only the final answer, not the process by which solutions are derived. This creates an evaluation gap, allowing reasoning skills to appear stronger than they truly are. T