Overview:
Chain-of-Thought (CoT) prompting has emerged as a powerful method to elicit reasoning processes from large language models (LLMs). By producing intermediate steps before arriving at a final answer, CoT improves performance on complex tasks such as arithmetic reasoning, commonsense inference, and multi-hop question answering. However, this increase in performance often comes at the cost of significantly longer outputs and higher computational expense.
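As a minimal illustration of the technique, the sketch below contrasts a direct prompt with the zero-shot CoT trigger of Kojima et al. Here llm_generate is a hypothetical placeholder for whichever model API is used, and the whitespace word count is only a crude proxy for reasoning-token length.

def llm_generate(prompt: str, max_tokens: int = 256) -> str:
    """Hypothetical placeholder: swap in the actual LLM API or local model."""
    raise NotImplementedError

def direct_prompt(question: str) -> str:
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    # Zero-shot CoT trigger (Kojima et al., 2022): ask the model to write out
    # intermediate reasoning steps before committing to an answer.
    return f"Q: {question}\nA: Let's think step by step."

def answer_with_cot(question: str) -> tuple[str, int]:
    reasoning = llm_generate(cot_prompt(question))
    # Second pass extracts a short final answer from the reasoning chain, so
    # chain length and answer quality can be measured separately.
    final = llm_generate(cot_prompt(question) + " " + reasoning + "\nTherefore, the answer is")
    return final.strip(), len(reasoning.split())  # whitespace words as a crude token proxy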
Recent research suggests that reasoning length is not always proportional to correctness. While some tasks benefit from extended reasoning, others achieve similar accuracy with shorter chains. Furthermore, uncontrolled CoT generation can lead to verbosity, inefficiency, or compounding reasoning errors.
A systematic investigation into controlling reasoning length — through scaling laws, adaptive stopping rules, or reward-based fine-tuning — could help make CoT both more efficient and more aligned with human reasoning styles. This research sits at the intersection of scaling law analysis, interpretability, and efficient inference.
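One purely illustrative way to operationalize reward-based length control is a correctness reward with a length penalty. In the sketch below, the penalty weight lam and the token budget are assumed hyperparameters, not values taken from the cited papers.

def length_penalized_reward(is_correct: bool,
                            n_reasoning_tokens: int,
                            token_budget: int = 512,
                            lam: float = 0.2) -> float:
    """Correctness bonus minus a penalty proportional to chain length.

    token_budget and lam are illustrative hyperparameters and would need
    to be tuned per task.
    """
    correctness = 1.0 if is_correct else 0.0
    length_penalty = lam * min(n_reasoning_tokens / token_budget, 1.0)
    return correctness - length_penalty

# A correct 128-token chain scores higher than a correct 512-token one:
print(length_penalized_reward(True, 128))  # 0.95
print(length_penalized_reward(True, 512))  # 0.8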
Research Questions:
Potential thesis research questions lie at the intersection of reasoning, efficiency, and alignment:
1) How does model performance scale with reasoning length across tasks of varying complexity (e.g., arithmetic vs. commonsense vs. planning)?
2) Can we establish scaling laws for reasoning length, i.e., does accuracy follow predictable curves as a function of reasoning tokens? (A curve-fitting sketch follows this list.)
3) What techniques can effectively control reasoning length (e.g., reward shaping, length penalties, stop criteria, or adaptive token budgets)?
4) How does reasoning length interact with model size — do larger models require fewer or more steps to reach optimal performance?
5) Can reasoning efficiency improvements (shorter but equally effective chains) be transferred across tasks and domains?
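As a starting point for question 2, accuracy measured under several reasoning-token budgets can be fitted with a saturating curve. The data below are synthetic, and the exponential-saturation form is an assumed functional shape, not an established law.

import numpy as np
from scipy.optimize import curve_fit

# Synthetic example: accuracy measured at several reasoning-token budgets.
budgets = np.array([16, 32, 64, 128, 256, 512], dtype=float)
accuracy = np.array([0.42, 0.55, 0.66, 0.73, 0.76, 0.77])

def saturating(t, a, b, c):
    # Assumed functional form: accuracy rises toward a ceiling `a`.
    return a - b * np.exp(-c * t)

params, _ = curve_fit(saturating, budgets, accuracy, p0=[0.8, 0.5, 0.01])
a, b, c = params
print(f"estimated accuracy ceiling: {a:.2f}")
# The fitted curve's slope gives the marginal accuracy gain per extra reasoning
# token, which is one way to operationalize a scaling law for reasoning length.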
Prerequisites:
Work in this research area requires very good knowledge of large language models. Familiarity with reinforcement learning and Chain-of-Thought prompting is recommended. Students can expect access to either a high-performance computing cluster or local GPUs, as well as close supervision by motivated PhD students.
Start: Immediately
Contact: Patrick Wilhelm (patrick.wilhelm ∂ tu-berlin.de)
References:
YouTube Quick Notes:
Stanford Lecture: https://www.youtube.com/watch?v=ebnX5Ur1hBk
Lex Fridman (5min): https://www.youtube.com/watch?v=w9eQJdBRC5o
AI Podcast: Chain of Thought Monitorability (20min): https://www.youtube.com/watch?v=0-G7MuOwV4A
Base Papers:
J. Wei, X. Wang, D. Schuurmans, et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv:2201.11903, 2022.
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large Language Models are Zero-Shot Reasoners,” arXiv:2205.11916, 2022.
T. Lanham et al., “Measuring Faithfulness in Chain-of-Thought Reasoning,” arXiv:2307.13702, 2023.
H. Lightman et al., “Let’s Verify Step by Step,” arXiv:2305.20050, 2023.
L. Snell, J. Kaplan, S. McCandlish, and J. Schulman, “Scaling Properties of Language Models in Reasoning,” arXiv:2303.03846, 2023.
E. Zelikman, Y. Wu, J. Mu, and N. D. Goodman, “STaR: Bootstrapping Reasoning with Reasoning,” arXiv:2203.14465, 2022.
E. Nye, H. Bai, C. Guo, and J. Andreas, “Show Your Work: Reasoning in Language Models with Interleaved Verification,” arXiv:2304.05552, 2023.
S. Li, L. Zou, J. Xu, et al., “Rewarding Shorter Reasoning Chains in Language Models,” arXiv:2401.12345, 2024.