Emergent Reasoning in Large Language Models: A Case Study on STEM Problem Solving
Journal of Artificial Intelligence Research • Vol. 13, No. 1
Abstract
We present a detailed case study examining emergent reasoning capabilities in GPT-4-class models on multi-step STEM problems. Using a novel evaluation framework comprising 2,400 problems across physics, chemistry, and mathematics, we characterize the boundaries of in-context learning and chain-of-thought prompting. We find that structured prompting outperforms direct prompting by 31% on multi-step calculus problems.