Statistics and Data Science Seminar Series Yuejie Chi
Transformers Learn Generalizable Chain-of-Thought Reasoning via Gradient Descent
Abstract: Transformers have demonstrated remarkable chain-of-thought reasoning capabilities, yet, the underlying mechanisms by which they acquire and extrapolate these capabilities remain limited. This talk presents a theoretical analysis of transformers trained via gradient descent for symbolic reasoning and state tracking tasks with increasing problem complexity. Our analysis reveals the coordination of multi-head attention to solve…