Calendar of Events

Upcoming in the Statistics and Data Science Seminar Series: Yuejie Chi, Weijie Su, Navid Azizan, and Vardan Papyan.
Transformers Learn Generalizable Chain-of-Thought Reasoning via Gradient Descent

Yuejie Chi (Yale University)
E18-304

Abstract: Transformers have demonstrated remarkable chain-of-thought reasoning capabilities, yet the underlying mechanisms by which they acquire and extrapolate these capabilities remain poorly understood. This talk presents a theoretical analysis of transformers trained via gradient descent on symbolic reasoning and state-tracking tasks of increasing problem complexity. Our analysis reveals the coordination of multi-head attention to solve…
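As a minimal sketch of the kind of state-tracking task the abstract refers to (a toy construction of our own, not the setup analyzed in the talk), the snippet below builds training targets for running-parity tracking in two forms: a direct answer and a chain-of-thought-style target that exposes every intermediate state.

```python
# Toy illustration (not the speaker's construction): chain-of-thought
# supervision on a state-tracking task. The model must track the running
# parity of a bit string; the CoT-style target exposes every intermediate
# state, while the direct target exposes only the final answer.
import random

def make_example(n_bits: int):
    bits = [random.randint(0, 1) for _ in range(n_bits)]
    states, state = [], 0
    for b in bits:
        state ^= b              # state update: running parity so far
        states.append(state)
    direct_target = [states[-1]]    # answer only
    cot_target = states             # every intermediate reasoning step
    return bits, direct_target, cot_target

random.seed(0)
bits, direct, cot = make_example(6)
print("input bits:   ", bits)
print("direct target:", direct)   # harder to learn as inputs grow longer
print("CoT target:   ", cot)      # supervises each step of the computation
```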


Do Large Language Models (Really) Need Statistical Foundations?

Weijie Su (University of Pennsylvania)
E18-304

Abstract: In this talk, we advocate for developing statistical foundations for large language models (LLMs). We begin by examining two key characteristics that necessitate statistical perspectives for LLMs: (1) the probabilistic, autoregressive nature of next-token prediction, and (2) the inherent complexity and black-box nature of Transformer architectures. To demonstrate how statistical insights can advance…
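A minimal sketch of the first characteristic, the probabilistic, autoregressive nature of next-token prediction: each token is drawn from a distribution conditioned on the tokens generated so far. The model below is a hypothetical stand-in, not any real LLM.

```python
# Minimal sketch of autoregressive sampling: generation is a sequence of
# draws from conditional next-token distributions, not a deterministic map.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def next_token_logits(context: list[int]) -> np.ndarray:
    """Hypothetical stand-in for a trained model's logits."""
    h = (sum(context) + len(context)) % VOCAB
    logits = np.zeros(VOCAB)
    logits[h] = 2.0          # make one token more likely, others uniform
    return logits

def sample(context: list[int], n_new: int) -> list[int]:
    out = list(context)
    for _ in range(n_new):
        logits = next_token_logits(out)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()              # softmax -> next-token distribution
        out.append(int(rng.choice(VOCAB, p=probs)))
    return out

print(sample([1, 2, 3], 5))  # stochastic: rerunning gives a different output
```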


Hard-Constrained Neural Networks

Navid Azizan (MIT)
E18-304

Abstract: Incorporating prior knowledge and domain-specific input-output requirements, such as safety or stability, as hard constraints into neural networks is a key enabler for their deployment in high-stakes applications. However, existing methods often rely on soft penalties, which are insufficient, especially on out-of-distribution samples. In this talk, I will introduce hard-constrained neural networks (HardNet), a…
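A minimal sketch of the hard-versus-soft distinction, assuming an affine constraint set {x : Ax = b}; this is a generic closed-form projection of our own, not the HardNet construction presented in the talk. Appending such a projection to a network's output makes the constraint hold exactly by construction, rather than being merely encouraged by a penalty term.

```python
# Generic illustration of a hard constraint: project the raw network
# output y onto the affine set {x : A x = b}. The projected output
# satisfies the constraint exactly, for every input, by construction.
import numpy as np

def project_affine(y: np.ndarray, A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Euclidean projection of y onto {x : A x = b} (A full row rank)."""
    # Closed form: y - A^T (A A^T)^{-1} (A y - b)
    correction = A.T @ np.linalg.solve(A @ A.T, A @ y - b)
    return y - correction

# Toy "network output" that violates the constraint x0 + x1 = 1.
y = np.array([0.9, 0.7])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

x = project_affine(y, A, b)
print(x)            # [0.6 0.4]
print(A @ x - b)    # ~0: constraint satisfied exactly
```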


Attention Sinks: A ‘Catch, Tag, Release’ Mechanism for Embeddings

Vardan Papyan (University of Toronto)
E18-304

Abstract: Large language models (LLMs) often concentrate their attention on a small set of tokens, referred to as attention sinks. Common examples include the first token, a prompt-independent sink, and punctuation tokens, which are prompt-dependent. Although these tokens often lack inherent semantic meaning, their presence is critical for model performance, particularly under model compression and KV-caching…
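A toy illustration of an attention sink (our own construction, not the talk's analysis): bias the attention scores so that one key effectively aligns with every query, and the softmax mass concentrates on that token even though it carries no content.

```python
# Toy illustration of an attention sink: when one key (here, token 0)
# scores highly against every query, softmax attention piles its mass there.
import numpy as np

rng = np.random.default_rng(0)
T, d = 8, 16                      # sequence length, head dimension

Q = rng.normal(size=(T, d))
K = rng.normal(size=(T, d))

scores = Q @ K.T / np.sqrt(d)     # scaled dot-product attention scores
scores[:, 0] += 4.0               # emulate a key that aligns with every query

weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax

print("avg attention on token 0:", weights[:, 0].mean())   # close to 1
print("avg attention elsewhere: ", weights[:, 1:].mean())  # close to 0
```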



MIT Institute for Data, Systems, and Society
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, MA 02139-4307
617-253-1764