## December 2021

## SES & IDPS Dissertation Defense – Paolo Bertolotti

Paolo Bertolotti (IDSS)

E18-304

Inference and Diffusion in Networks ABSTRACT Networks provide a powerful and unified framework to study complex systems. By abstracting systems down to entities and their connections, network models provide insight into the structure and dynamics of critical systems across multiple domains. In this thesis, we study diffusion in social networks. Diffusion through networked systems corresponds to numerous consequential processes, and we focus on epidemic spread and information diffusion. We study these processes by applying and extending ideas from statistical inference.…

Find out more »## The Optimality of Coarse Menues

Dirk Bergemann (Yale University)

E18-304

Please join us on Monday, December 6, 2021 at 4:00pm for the Distinguished Speaker Seminar with Dirk Bergemann (Yale University).

Find out more »## The Geometry of Particle Collisions: Hidden in Plain Sight

Jesse Thaler (MIT)

E18-304

Abstract: Since the 1960s, particle physicists have developed a variety of data analysis strategies for the goal of comparing experimental measurements to theoretical predictions. Despite their numerous successes, these techniques can seem esoteric and ad hoc, even to practitioners in the field. In this talk, I explain how many particle physics analysis tools have a natural geometric interpretation in an emergent "space" of collider events induced by the Wasserstein metric. This in turn suggests new analysis strategies to interpret generic…

Find out more »## November 2021

## What Happens Next: Hacking the Hackathon for Long-Term Impact

E18-304

Speakers: Nina Peluso (MIT PolHack 2020, chair), Freddy Nguyen (MIT COVID-19 Hackathon, MIT Hacking Racism, organizer), Ragini Sreenath (MIT PolHack 2020, MIT EnergyHack, organizer) Whether related to the environment, Internet + cybersecurity, health, future of education + work, housing + planning, or another area, hackathons have become a popular way to engage with students, researchers, and practitioners alike on complex issues. Usually a weekend event, they involve taking real-life data and/or situations and brainstorming solutions. But how impactful (and in…

Find out more »## Precise high-dimensional asymptotics for AdaBoost via max-margins & min-norm interpolants

Pragya Sur (Harvard University)

E18-304

Abstract: This talk will introduce a precise high-dimensional asymptotic theory for AdaBoost on separable data, taking both statistical and computational perspectives. We will consider the common modern setting where the number of features p and the sample size n are both large and comparable, and in particular, look at scenarios where the data is asymptotically separable. Under a class of statistical models, we will provide an (asymptotically) exact analysis of the max-min-L1-margin and the min-L1-norm interpolant. In turn, this will characterize the…

Find out more »## Characterizing the Type 1-Type 2 Error Trade-off for SLOPE

Cynthia Rush (Columbia University)

E18-304

Abstract: Sorted L1 regularization has been incorporated into many methods for solving high-dimensional statistical estimation problems, including the SLOPE estimator in linear regression. In this talk, we study how this relatively new regularization technique improves variable selection by characterizing the optimal SLOPE trade-off between the false discovery proportion (FDP) and true positive proportion (TPP) or, equivalently, between measures of type I and type II error. Additionally, we show that on any problem instance, SLOPE with a certain regularization sequence outperforms the Lasso,…

Find out more »## Asymptotics of learning on dependent and structured random objects

Morgane Austern (Harvard University)

E18-304

Abstract: Classical statistical inference relies on numerous tools from probability theory to study the properties of estimators. However, these same tools are often inadequate to study modern machine problems that frequently involve structured data (e.g networks) or complicated dependence structures (e.g dependent random matrices). In this talk, we extend universal limit theorems beyond the classical setting. Firstly, we consider distributionally \structured" and dependent random object i.e random objects whose distribution are invariant under the action of an amenable group. We…

Find out more »## Designing AI for Racial Equity: Translating Ethics into Practice

S. Craig Watkins (MLK Visiting Professor - MIT)

E18-304

Please join us on Monday, November 1, 2021 at 4:00pm for the Distinguished Speaker Seminar with S. Craig Watkins (MLK Visiting Professor).

Find out more »## October 2021

## Revealing the simplicity of high-dimensional objects via pathwise analysis

Ronen Eldan (Weizmann Inst. of Science and Princeton)

E18-304

Abstract: One of the main reasons behind the success of high-dimensional statistics and modern machine learning in taming the curse of dimensionality is that many classes of high-dimensional distributions are surprisingly well-behaved and, when viewed correctly, exhibit a simple structure. This emergent simplicity is in the center of the theory of “high-dimensional phenomena”, and is manifested in principles such as “Gaussian-like behavior” (objects of interest often inherit the properties of the Gaussian measure), “dimension-free behavior” (expressed in inequalities which do…

Find out more »## Instance Dependent PAC Bounds for Bandits and Reinforcement Learning

Kevin Jamieson (University of Washington)

E18-304

Abstract: The sample complexity of an interactive learning problem, such as multi-armed bandits or reinforcement learning, is the number of interactions with nature required to output an answer (e.g., a recommended arm or policy) that is approximately close to optimal with high probability. While minimax guarantees can be useful rules of thumb to gauge the difficulty of a problem class, algorithms optimized for this worst-case metric often fail to adapt to “easy” instances where fewer samples suffice. In this talk, I…

Find out more »## Breaking the Sample Size Barrier in Reinforcement Learning

Yuting Wei (Wharton School at UPenn )

E18-304

Abstract: Reinforcement learning (RL), which is frequently modeled as sequential learning and decision making in the face of uncertainty, is garnering growing interest in recent years due to its remarkable success in practice. In contemporary RL applications, it is increasingly more common to encounter environments with prohibitively large state and action space, thus imposing stringent requirements on the sample efficiency of the RL algorithms in use. Despite the empirical success, however, the theoretical underpinnings for many popular RL algorithms remain…

Find out more »## Recent results in planted assignment problems

Yihong Wu (Yale University)

E18-304

Abstract: Motivated by applications such as particle tracking, network de-anonymization, and computer vision, a recent thread of research is devoted to statistical models of assignment problems, in which the data are random weight graphs correlated with the latent permutation. In contrast to problems such as planted clique or stochastic block model, the major difference here is the lack of low-rank structures, which brings forth new challenges in both statistical analysis and algorithm design. In the first half of the…

Find out more »## Causal Matrix Completion

Devavrat Shah (MIT)

E18-304

Matrix completion is the study of recovering an underlying matrix from a sparse subset of noisy observations. Traditionally, it is assumed that the entries of the matrix are “missing completely atrandom” (MCAR), i.e., each entry is revealed at random, independent of everything else, with uniform probability. This is likely unrealistic due to the presence of “latent confounders”, i.e., unobserved factors that determine both the entries of the underlying matrix and the missingness pattern in the observed matrix. In general, these…

Find out more »## September 2021

## Representation and generalization

Boaz Barak (Harvard University)

E18-304

Abstract: Self-supervised learning is an increasingly popular approach for learning representations of data that can be used for downstream representation tasks. A practical advantage of self-supervised learning is that it can be used on unlabeled data. However, even when labels are available, self-supervised learning can be competitive with the more "traditional" approach of supervised learning. In this talk we consider "self supervised + simple classifier (SSS)" algorithms, which are obtained by first learning a self-supervised classifier on data, and…

Find out more »## Interpolation and learning with scale dependent kernels

Lorenzo Rosasco (MIT/Universita' di Genova)

E18-304

Speaker: Lorenzo Rosasco (MIT/Universita' di Genova) Title: Interpolation and learning with scale dependent kernels Abstract: We study the learning properties of nonparametric ridge-less least squares. In particular, we consider the common case of estimators defined by scale dependent (Matern) kernels, and focus on the role scale and smoothness. These estimators interpolate the data and the scale can be shown to control their stability to noise and sampling. Larger scales, corresponding to smoother functions, improve stability with respect to sampling. However, smaller…

Find out more »## Designing Equitable Algorithms for Criminal Justice and Beyond

Sharad Goel (Harvard University)

E18-304

Please join us on Tuesday, September 14, 2021 at 4:00pm for the Distinguished Speaker Seminar with Sharad Goel (Harvard University).

Find out more »## April 2020

## [POSTPONED] The Blessings of Multiple Causes

David Blei (Columbia University)

E18-304

*Please note: this event has been POSTPONED until Fall 2020* See MIT’s COVID-19 policies for more details. Title: The Blessings of Multiple Causes Abstract: Causal inference from observational data is a vital problem, but it comes with strong assumptions. Most methods require that we observe all confounders, variables that affect both the causal variables and the outcome variables. But whether we have observed all confounders is a famously untestable assumption. We describe the deconfounder, a way to do causal…

Find out more »## [POSTPONED] Guido Imbens – The Applied Econometrics Professor and Professor of Economics, Graduate School of Business, Stanford University

E18-304

IDSS will host Prof. Guido Imbens as part of the Distinguished Speaker Seminar series. Prof. Guido Imbens’ primary field of interest is Econometrics. Research topics in which he is interested include: causality, program evaluation, identification, Bayesian methods, semi-parametric methods, instrumental variables.

Find out more »## March 2020

## Safe Deep Learning in the Loop: Challenges, Methods, and Future Directions

Mahyar Fazlyab (University of Pennsylvania)

E18-304

Abstract: Despite high-profile advances in various decision-making and classification tasks, Deep Neural Networks (DNNs) have found limited application in safety-critical domains such as self-driving cars and automated healthcare. In particular, DNNs can be vulnerable to adversarial attacks and input uncertainties. This issue becomes even more complicated when DNNs are used in closed-loop systems, where a small perturbation (caused by noisy measurements, uncertain initial conditions, disturbances, etc.) can substantially impact the system being controlled. Therefore, it is of utmost importance…

Find out more »## Foundations of Resilient Collaborative Autonomy: From Combinatorial Optimization to Control and Learning

Vasileios Tzoumas (MIT)

E18-304

Abstract: Collaborative autonomous robots promise to revolutionize transportation, disaster response, and environmental monitoring. Already, micro-aerial vehicles have become a multi-billion-dollar industry; and in this new decade, teams of semi-autonomous ships, cars, and underwater exploration vehicles are being launched. A future of ubiquitous autonomy is becoming a reality, where robots can autonomously split into teams, and navigate and learn the world. However, this future is threatened by attacks and failures that can compromise the robots’ teams, control, and learning capabilities,…

Find out more »## Does Revolution Work? Evidence from Nepal

Rohini Pande (Yale University)

E18-304

The last half century has seen the adoption of democratic institutions in much of the developing world. However, the conditions under which de jure democratization leads to the representation of historically disadvantaged groups remains debated as do the implications of descriptive representation for policy inclusion. Using detailed administrative and survey data from Nepal, we examine political selection in a new democracy, the implications for policy inclusion and the role of conflict in affecting political transformation. I situate these findings in the context of…

Find out more »## February 2020

## Tales of Random Projections

Kavita Ramanan (Brown University)

E18-304

Abstract: Properties of random projections of high-dimensional probability measures are of interest in a variety of fields, including asymptotic convex geometry, and potential applications to high-dimensional statistics and data analysis. A particular question of interest is to identify what properties of the high-dimensional measure are captured by its lower-dimensional projections. While fluctuations of these projections have been well studied over the past decade, we describe more recent work on the tail behavior of such projections, and various implications. This talk…

Find out more »## Predictive Inference with the Jackknife+

Rina Foygel Barber (University of Chicago)

E18-304

Abstract: We introduce the jackknife+, a novel method for constructing predictive confidence intervals that is robust to the distribution of the data. The jackknife+ modifies the well-known jackknife (leaveoneout cross-validation) to account for the variability in the fitted regression function when we subsample the training data. Assuming exchangeable training samples, we prove that the jackknife+ permits rigorous coverage guarantees regardless of the distribution of the data points, for any algorithm that treats the training points symmetrically (in contrast, such guarantees…

Find out more »## Diffusion K-means Clustering on Manifolds: provable exact recovery via semidefinite relaxations

Xiaohui Chen (University of Illinois at Urbana-Champaign)

E18-304

Abstract: We introduce the diffusion K-means clustering method on Riemannian submanifolds, which maximizes the within-cluster connectedness based on the diffusion distance. The diffusion K-means constructs a random walk on the similarity graph with vertices as data points randomly sampled on the manifolds and edges as similarities given by a kernel that captures the local geometry of manifolds. Thus the diffusion K-means is a multi-scale clustering tool that is suitable for data with non-linear and non-Euclidean geometric features in mixed dimensions. Given…

Find out more »## Risk-Sensitive Safety Analysis and Control for Trustworthy Autonomy

Margaret Chapman (UC Berkeley)

E18-304

Abstract: Methods for managing dynamic systems typically invoke one of two perspectives. In the worst-case perspective, the system is assumed to behave in the worst possible way; this perspective is used to provide formal safety guarantees. In the risk-neutral perspective, the system is assumed to behave as expected; this perspective is invoked in reinforcement learning and stochastic optimal control. While the worst-case perspective is useful for safety analysis, it can lead to unnecessarily conservative decisions, especially in settings where uncertainties…

Find out more »## Gaussian Differential Privacy, with Applications to Deep Learning

Weijie Su (University of Pennsylvania)

E18-304

Abstract: Privacy-preserving data analysis has been put on a firm mathematical foundation since the introduction of differential privacy (DP) in 2006. This privacy definition, however, has some well-known weaknesses: notably, it does not tightly handle composition. This weakness has inspired several recent relaxations of differential privacy based on the Renyi divergences. We propose an alternative relaxation we term “f-DP”, which has a number of nice properties and avoids some of the difficulties associated with divergence based relaxations. First, f-DP preserves…

Find out more »## December 2019

## The Statistical Finite Element Method

Mark Girolami (University of Cambridge)

E18-304

Abstract: The finite element method (FEM) is one of the great triumphs of modern day applied mathematics, numerical analysis and software development. Every area of the sciences and engineering has been positively impacted by the ability to model and study complex physical and natural systems described by systems of partial differential equations (PDE) via the FEM . In parallel the recent developments in sensor, measurement, and signalling technologies enables the phenomenological study of systems as diverse as protein signalling in the…

Find out more »## Inferring the Evolutionary History of Tumors

Simon Tavaré (Columbia University)

E18-304

Abstract: Bulk sequencing of tumor DNA is a popular strategy for uncovering information about the spectrum of mutations arising in the tumor, and is often supplemented by multi-region sequencing, which provides a view of tumor heterogeneity. The statistical issues arise from the fact that bulk sequencing makes the determination of sub-clonal frequencies, and other quantities of interest, difficult. In this talk I will discuss this problem, beginning with its setting in population genetics. The data provide an estimate of the…

Find out more »## SES Dissertation Defense – Ian Schneider

Ian Schneider (MIT)

E18-304

Market Design Opportunities for an Evolving Power System

Find out more »## Flexible Perturbation Models for Robustness to Misspecification

Jeffrey Miller (Harvard University)

E18-304

Abstract: In many applications, there are natural statistical models with interpretable parameters that provide insight into questions of interest. While useful, these models are almost always wrong in the sense that they only approximate the true data generating process. In some cases, it is important to account for this model error when quantifying uncertainty in the parameters. We propose to model the distribution of the observed data as a perturbation of an idealized model of interest by using a nonparametric…

Find out more »## Automating the Digitization of Historical Data on a Large Scale

Melissa Dell (Harvard University)

E18-304

https://youtu.be/mnM7ePr6xqM Over the past two centuries, we have transitioned from an overwhelmingly agricultural world to one with vastly different patterns of economic organization. This transition has been remarkably uneven across space and time, and has important implications for some of the most central challenges facing societies today. Deepening our understanding of the determinants of economic transformation requires data on the long-run trajectories of individuals and firms. However, these data overwhelmingly remain trapped in hard copy, with cost estimates for manual…

Find out more »## November 2019

## Automated Data Summarization for Scalability in Bayesian Inference

Tamara Broderick (MIT)

E18-304

Abstract: Many algorithms take prohibitively long to run on modern, large data sets. But even in complex data sets, many data points may be at least partially redundant for some task of interest. So one might instead construct and use a weighted subset of the data (called a “coreset”) that is much smaller than the original dataset. Typically running algorithms on a much smaller data set will take much less computing time, but it remains to understand whether the output…

Find out more »## A Causal Exposure Response Function with Local Adjustment for Confounding: A study of the health effects of long-term exposure to low levels of fine particulate matter

Francesca Dominici (Harvard University)

E18-304

Abstract: In the last two decades, ambient levels of air pollution have declined substantially. Yet, as mandated by the Clean Air Act, we must continue to address the following question: is exposure to levels of air pollution that are well below the National Ambient Air Quality Standards (NAAQS) harmful to human health? Furthermore, the highly contentious nature surrounding environmental regulations necessitates casting this question within a causal inference framework. Several parametric and semi-parametric regression modeling approaches have been used to…

Find out more »## Stability of a Fluid Model for Fair Bandwidth Sharing with General File Size Distributions

Ruth J Williams (University of California, San Diego)

E18-304

Abstract: Massoulie and Roberts introduced a stochastic model for a data communication network where file sizes are generally distributed and the network operates under a fair bandwidth sharing policy. It has been a standing problem to prove stability of this general model when the average load on the system is less than the network’s capacity. A crucial step in an approach to this problem is to prove stability of an associated measure-valued fluid model. We shall describe prior work on this question done under various strong assumptions and…

Find out more »## Understanding machine learning with statistical physics

Lenka Zdeborová (Institute of Theoretical Physics, CNRS)

E18-304

Abstract: The affinity between statistical physics and machine learning has long history, this is reflected even in the machine learning terminology that is in part adopted from physics. Current theoretical challenges and open questions about deep learning and statistical learning call for unified account of the following three ingredients: (a) the dynamics of the learning algorithm, (b) the architecture of the neural networks, and (c) the structure of the data. Most existing theories are not taking in account all of those…

Find out more »## Artificial Bayesian Monte Carlo Integration: A Practical Resolution to the Bayesian (Normalizing Constant) Paradox

Xiao-Li Meng (Harvard University)

E18-304

Abstract: Advances in Markov chain Monte Carlo in the past 30 years have made Bayesian analysis a routine practice. However, there is virtually no practice of performing Monte Carlo integration from the Bayesian perspective; indeed,this problem has earned the “paradox” label in the context of computing normalizing constants (Wasserman, 2013). We first use the modeling-what-we-ignore idea of Kong et al. (2003) to explain that the crux of the paradox is not with the likelihood theory, which is essentially the same…

Find out more »## SES PhD Admissions Info Session

Ali Jadbabaie (MIT)

E18-304

Learn about admission to the Social and Engineering Systems Doctoral Program. Info session is hosted by a member of the IDSS faculty and an SES student who introduce the program and answer your questions. Please register in advance.

Find out more »## SDP Relaxation for Learning Discrete Structures: Optimal Rates, Hidden Integrality, and Semirandom Robustness

Yudong Chen (Cornell University)

E18-304

Abstract: We consider the problems of learning discrete structures from network data under statistical settings. Popular examples include various block models, Z2 synchronization and mixture models. Semidefinite programming (SDP) relaxation has emerged as a versatile and robust approach to these problems. We show that despite being a relaxation, SDP achieves the optimal Bayes error rate in terms of distance to the target solution. Moreover, SDP relaxation is provably robust under the so-called semirandom model, which frustrates many existing algorithms. Our…

Find out more »## One-shot Information Theory via Poisson Processes

Cheuk Ting Li (UC Berkeley)

E18-304

Abstract: In information theory, coding theorems are usually proved in the asymptotic regime where the blocklength tends to infinity. While there are techniques for finite blocklength analysis, they are often more complex than their asymptotic counterparts. In this talk, we study the use of Poisson processes in proving coding theorems, which not only gives sharp one-shot and finite blocklength results, but also gives significantly shorter proofs than conventional asymptotic techniques in some settings. Instead of using fixed-size random codebooks, we…

Find out more »## Causal Inference in the Age of Big Data

Jasjeet Sekhon (UC Berkeley)

E18-304

The rise of massive data sets that provide fine-grained information about human beings and their behavior offers unprecedented opportunities for evaluating the effectiveness of social, behavioral, and medical treatments. With the availability of fine-grained data, researchers and policymakers are increasingly unsatisfied with estimates of average treatment effects based on experimental samples that are unrepresentative of populations of interest. Instead, they seek to target treatments to particular populations and subgroups. Because of these inferential challenges, Machine Learning (ML) is now being…

Find out more »## October 2019

## FinTech in China and the extension of new organizational firm boundary

Zixia Sheng (New Hope Financial Services)

E18-304

Speaker: Zixia Sheng, CEO, New Hope Financial Services Abstract: Recent new technologies (Fintech and 5G) have had a profound impact on extending the boundaries of firms into more complicated financial ecology system. Nowadays in China, a typical traditional loan underwriting procedure within a bank has been fulfilled by different external parties (e.g. online portals, marketing agencies, data vendors, risk modelers, trusts, funds, invest bankers, debt collectors). How new technology could improve information sharing, reduce transaction costs/contractual costs and therefore change…

Find out more »## Accurate Simulation-Based Parametric Inference in High Dimensional Settings

Maria-Pia Victoria-Feser (University of Geneva)

E18-304

Abstract: Accurate estimation and inference in finite sample is important for decision making in many experimental and social fields, especially when the available data are complex, like when they include mixed types of measurements, they are dependent in several ways, there are missing data, outliers, etc. Indeed, the more complex the data (hence the models), the less accurate are asymptotic theory results in finite samples. This is in particular the case, for example, with logistic regression, with possibly also random effects…

Find out more »## Esther Williams in the Harold Holt Memorial Swimming Pool: Some Thoughts on Complexity

Daniel Simpson (University of Toronto)

E18-304

IDS.190 – Topics in Bayesian Modeling and Computation Speaker: Daniel Simpson (University of Toronto) Abstract: Abstract: As data becomes more complex and computational modelling becomes more powerful, we rapidly find ourselves beyond the scope of traditional statistical theory. As we venture beyond the traditional thunderdome, we need to think about how to cope with this additional complexity in our model building. In this talk, I will talk about a few techniques that are useful when specifying prior distributions and building Bayesian…

Find out more »## Towards Robust Statistical Learning Theory

Stanislav Minsker (University of Southern California)

E18-304

Abstract: Real-world data typically do not fit statistical models or satisfy assumptions underlying the theory exactly, hence reducing the number and strictness of these assumptions helps to lessen the gap between the “mathematical” world and the “real” world. The concept of robustness, in particular, robustness to outliers, plays the central role in understanding this gap. The goal of the talk is to introduce the principles and robust algorithms based on these principles that can be applied in the general framework of statistical…

Find out more »## Markov Chain Monte Carlo Methods and Some Attempts at Parallelizing Them

Pierre E. Jacob (Harvard University)

E18-304

IDS.190 – Topics in Bayesian Modeling and Computation Abstract: MCMC methods yield approximations that converge to quantities of interest in the limit of the number of iterations. This iterative asymptotic justification is not ideal: it stands at odds with current trends in computing hardware. Namely, it would often be computationally preferable to run many short chains in parallel, but such an approach is flawed because of the so-called “burn-in” bias. This talk will first describe that issue and some known…

Find out more »## The Planted Matching Problem

Cristopher Moore (Santa Fe Institute)

E18-304

Abstract: What happens when an optimization problem has a good solution built into it, but which is partly obscured by randomness? Here we revisit a classic polynomial-time problem, the minimum perfect matching problem on bipartite graphs. If the edges have random weights in , Mézard and Parisi — and then Aldous, rigorously — showed that the minimum matching has expected weight zeta(2) = pi^2/6. We consider a “planted” version where a particular matching has weights drawn from an exponential distribution…

Find out more »## Probabilistic Programming and Artificial Intelligence

Vikash Mansinghka (MIT)

E18-304

IDS.190 – Topics in Bayesian Modeling and Computation Abstract: Probabilistic programming is an emerging field at the intersection of programming languages, probability theory, and artificial intelligence. This talk will show how to use recently developed probabilistic programming languages to build systems for robust 3D computer vision, without requiring any labeled training data; for automatic modeling of complex real-world time series; and for machine-assisted analysis of experimental data that is too small and/or messy for standard approaches from machine learning and…

Find out more »## Theoretical Foundations of Active Machine Learning

Rob Nowak (University of Wisconsin - Madison)

E18-304

Title: Theoretical Foundations of Active Machine Learning Abstract: The field of Machine Learning (ML) has advanced considerably in recent years, but mostly in well-defined domains using huge amounts of human-labeled training data. Machines can recognize objects in images and translate text, but they must be trained with more images and text than a person can see in nearly a lifetime. The computational complexity of training has been offset by recent technological advances, but the cost of training data is measured in…

Find out more »## Behavior of the Gibbs Sampler in the Imbalanced Case/Bias Correction from Daily Min and Max Temperature Measurements

Natesh Pillai (Harvard University)

E18-304

IDS.190 Topics in Bayesian Modeling and Computation *Note: The speaker this week will give two shorter talks within the usual session Title: Behavior of the Gibbs sampler in the imbalanced case Abstract: Many modern applications collect highly imbalanced categorical data, with some categories relatively rare. Bayesian hierarchical models combat data sparsity by borrowing information, while also quantifying uncertainty. However, posterior computation presents a fundamental barrier to routine use; a single class of algorithms does not work well in all settings and…

Find out more »## September 2019

## Selection and Endogenous Bias in Studies of Health Behaviors

Emily Oster (Brown University)

E18-304

Abstract: Studies of health behaviors using observational data are prone to bias from selection in behavior choices. How important are these biases? Are they dynamic - that is, are they influenced by the recommendations we make? Are there formal assumptions under which we can use information we have about selection on observed variables to learn about the possible bias from unobserved selection? About the Speaker: Emily Oster is a professor of economics. Prior to coming to Brown she was an…

Find out more »## Frontiers of Efficient Neural-Network Learnability

Adam Klivans (University of Texas at Austin)

E18-304

Abstract: What are the most expressive classes of neural networks that can be learned, provably, in polynomial-time in a distribution-free setting? In this talk we give the first efficient algorithm for learning neural networks with two nonlinear layers using tools for solving isotonic regression, a nonconvex (but tractable) optimization problem. If we further assume the distribution is symmetric, we obtain the first efficient algorithm for recovering the parameters of a one-layer convolutional network. These results implicitly make use of a…

Find out more »## Power of Experimental Design and Active Learning

Aarti Singh (Carnegie Mellon University)

E18-304

Classical supervised machine learning algorithms focus on the setting where the algorithm has access to a fixed labeled dataset obtained prior to any analysis. In most applications, however, we have control over the data collection process such as which image labels to obtain, which drug-gene interactions to record, which network routes to probe, which movies to rate, etc. Furthermore, most applications face budget limitations on the amount of labels that can be collected. Experimental design and active learning are two…

Find out more »## Some New Insights On Transfer Learning

Samory Kpotufe (Columbia University)

E18-304

Abstract: The problem of transfer and domain adaptation is ubiquitous in machine learning and concerns situations where predictive technologies, trained on a given source dataset, have to be transferred to a new target domain that is somewhat related. For example, transferring voice recognition trained on American English accents to apply to Scottish accents, with minimal retraining. A first challenge is to understand how to properly model the ‘distance’ between source and target domains, viewed as probability distributions over a feature…

Find out more »## Probabilistic Modeling meets Deep Learning using TensorFlow Probability

Brian Patton (Google AI)

E18-304

IDS.190 - Topics in Bayesian Modeling and Computation Speaker: Brian Patton (Google AI) Abstract: TensorFlow Probability provides a toolkit to enable researchers and practitioners to integrate uncertainty with gradient-based deep learning on modern accelerators. In this talk we'll walk through some practical problems addressed using TFP; discuss the high-level interfaces, goals, and principles of the library; and touch on some recent innovations in describing probabilistic graphical models. Time-permitting, we may touch on a couple areas of research interest for the…

Find out more »## Automated Data Summarization for Scalability in Bayesian Inference

Tamara Broderick (MIT)

E18-304

IDS.190 - Topics in Bayesian Modeling and Computation Abstract: Many algorithms take prohibitively long to run on modern, large datasets. But even in complex data sets, many data points may be at least partially redundant for some task of interest. So one might instead construct and use a weighted subset of the data (called a "coreset") that is much smaller than the original dataset. Typically running algorithms on a much smaller data set will take much less computing time, but…

Find out more »## GANs, Optimal Transport, and Implicit Density Estimation

Tengyuan Liang (University of Chicago)

E18-304

Abstract: We first study the rate of convergence for learning distributions with the adversarial framework and Generative Adversarial Networks (GANs), which subsumes Wasserstein, Sobolev, and MMD GANs as special cases. We study a wide range of parametric and nonparametric target distributions, under a collection of objective evaluation metrics. On the nonparametric end, we investigate the minimax optimal rates and fundamental difficulty of the implicit density estimation under the adversarial framework. On the parametric end, we establish a theory for general…

Find out more »## May 2019

## Conference on Synthetic Controls and Related Methods

E18-304

Organizers are Alberto Abadie (MIT), Victor Chernozhukov (MIT), and Guido Imbens (Stanford University). The program is posted here. Participation by invitation only.

Find out more »## Counting and sampling at low temperatures

Will Perkins (University of Illinois at Chicago)

E18-304

Abstract: We consider the problem of efficient sampling from the hard-core and Potts models from statistical physics. On certain families of graphs, phase transitions in the underlying physics model are linked to changes in the performance of some sampling algorithms, including Markov chains. We develop new sampling and counting algorithms that exploit the phase transition phenomenon and work efficiently on lattices (and bipartite expander graphs) at sufficiently low temperatures in the phase coexistence regime. Our algorithms are based on Pirogov-Sinai…

Find out more »## Design and Analysis of Two-Stage Randomized Experiments

Kosuke Imai (Harvard University)

E18-304

Abstract: In many social science experiments, subjects often interact with each other and as a result, one unit's treatment can influence the outcome of another unit. Over the last decade, a significant progress has been made towards causal inference in the presence of such interference between units. In this talk, we will discuss two-stage randomized experiments, which enable the identification of the average spillover effects as well as that of the average direct effect of one's own treatment. In particular,…

Find out more »## Stochastics and Statistics Seminar Series

Tracy Ke (Harvard University)

E18-304

## April 2019

## Robust Estimation: Optimal Rates, Computation and Adaptation

Chao Gao (University of Chicago)

E18-304

Abstract: Chao Gao will discuss the problem of statistical estimation with contaminated data. In the first part of the talk, I will discuss depth-based approaches that achieve minimax rates in various problems. In general, the minimax rate of a given problem with contamination consists of two terms: the statistical complexity without contamination, and the contamination effect in the form of modulus of continuity. In the second part of the talk, I will discuss computational challenges of these depth-based estimators. An…

Find out more »## Stochastics and Statistics Seminar Series

Dylan Foster (MIT)

E18-304

Logistic regression is a fundamental task in machine learning and statistics. For the simple case of linear models, Hazan et al. (2014) showed that any logistic regression algorithm that estimates model weights from samples must exhibit exponential dependence on the weight magnitude. As an alternative, we explore a counterintuitive technique called improper learning, whereby one estimates a linear model by fitting a non-linear model. Past success stories for improper learning have focused on cases where it can improve computational complexity.…

Find out more »## Exponential line-crossing inequalities

Aaditya Ramdas (Carnegie Mellon University)

E18-304

Abstract: This talk will present a class of exponential bounds for the probability that a martingale sequence crosses a time-dependent linear threshold. Our key insight is that it is both natural and fruitful to formulate exponential concentration inequalities in this way. We will illustrate this point by presenting a single assumption and a single theorem that together strengthen many tail bounds for martingales, including classical inequalities (1960-80) by Bernstein, Bennett, Hoeffding, and Freedman; contemporary inequalities (1980-2000) by Shorack and Wellner,…

Find out more »## A Particulate Solution: Data Science in the Fight to Stop Air Pollution and Climate Change | IDSS Distinguished Speaker Seminar

Francesca Dominici (Harvard University)

E18-304

Abstract: What if I told you I had evidence of a serious threat to American national security – a terrorist attack in which a jumbo jet will be hijacked and crashed every 12 days. Thousands will continue to die unless we act now. This is the question before us today – but the threat doesn’t come from terrorists. The threat comes from climate change and air pollution. We have developed an artificial neural network model that uses on-the-ground air-monitoring data…

Find out more »## March 2019

## Optimization of random polynomials on the sphere in the full-RSB regime

Eliran Subag (New York University)

E18-304

Abstract: The talk will focus on optimization on the high-dimensional sphere when the objective function is a linear combination of homogeneous polynomials with standard Gaussian coefficients. Such random processes are called spherical spin glasses in physics, and have been extensively studied since the 80s. I will describe certain geometric properties of spherical spin glasses unique to the full-RSB case, and explain how they can be used to design a polynomial time algorithm that finds points within small multiplicative error from…

Find out more »## Subvector Inference in Partially Identified Models with Many Moment Inequalities

Alex Belloni (Duke University)

E18-304

Abstract: In this work we consider bootstrap-based inference methods for functions of the parameter vector in the presence of many moment inequalities where the number of moment inequalities, denoted by p, is possibly much larger than the sample size n. In particular this covers the case of subvector inference, such as the inference on a single component associated with a treatment/policy variable of interest. We consider a min-max of (centered and non-centered) Studentized statistics and study the properties of the…

Find out more »## Univariate total variation denoising, trend filtering and multivariate Hardy-Krause variation denoising

Aditya Guntuboyina (UC Berkley)

E18-304

Abstract: Total variation denoising (TVD) is a popular technique for nonparametric function estimation. I will first present a theoretical optimality result for univariate TVD for estimating piecewise constant functions. I will then present related results for various extensions of univariate TVD including adaptive risk bounds for higher-order TVD (also known as trend filtering) as well as a multivariate extension via the Hardy-Krause Variation which avoids the curse of dimensionality to some extent. I will also mention connections to shape restricted…

Find out more »## Univariate total variation denoising, trend filtering and multivariate Hardy-Krause variation denoising

Aditya Guntuboyina (UC Berkley)

E18-304

Abstract: Total variation denoising (TVD) is a popular technique for nonparametric function estimation. I will first present a theoretical optimality result for univariate TVD for estimating piecewise constant functions. I will then present related results for various extensions of univariate TVD including adaptive risk bounds for higher-order TVD (also known as trend filtering) as well as a multivariate extension via the Hardy-Krause Variation which avoids the curse of dimensionality to some extent. I will also mention connections to shape restricted…

Find out more »## Why Aren’t Network Statistics Accompanied By Uncertainty Statements?

Eric Kolaczyk (Boston University)

E18-304

Abstract: Over 500K scientific articles have been published since 1999 with the word “network” in the title. And the vast majority of these report network summary statistics of one type or another. However, these numbers are rarely accompanied by any quantification of uncertainty. Yet any error inherent in the measurements underlying the construction of the network, or in the network construction procedure itself, necessarily must propagate to any summary statistics reported. Perhaps surprisingly, there is little in the way of…

Find out more »## February 2019

## Capacity lower bound for the Ising perceptron

Nike Sun (MIT)

E18-304

Abstract: The perceptron is a toy model of a simple neural network that stores a collection of given patterns. Its analysis reduces to a simple problem in high-dimensional geometry, namely, understanding the intersection of the cube (or sphere) with a collection of random half-spaces. Despite the simplicity of this model, its high-dimensional asymptotics are not well understood. I will describe what is known and present recent results. This is joint work with Jian Ding. Biography: Nike Sun is a faculty…

Find out more »## TAP free energy, spin glasses, and variational inference

Zhou Fan (Yale University)

E18-304

Abstract: We consider the Sherrington-Kirkpatrick model of spin glasses with ferromagnetically biased couplings. For a specific choice of the couplings mean, the resulting Gibbs measure is equivalent to the Bayesian posterior for a high-dimensional estimation problem known as "Z2 synchronization". Statistical physics suggests to compute the expectation with respect to this Gibbs measure (the posterior mean in the synchronization problem), by minimizing the so-called Thouless-Anderson-Palmer (TAP) free energy, instead of the mean field (MF) free energy. We prove that this identification…

Find out more »## Medical Image Imputation

Polina Golland (MIT CSAIL)

E18-304

Abstract: We present an algorithm for creating high resolution anatomically plausible images that are consistent with acquired clinical brain MRI scans with large inter-slice spacing. Although large databases of clinical images contain a wealth of information, medical acquisition constraints result in sparse scans that miss much of the anatomy. These characteristics often render computational analysis impractical as standard processing algorithms tend to fail when applied to such images. Our goal is to enable application of existing algorithms that were originally…

Find out more »## Optimization of the Sherrington-Kirkpatrick Hamiltonian

Andrea Montanari (Stanford University )

E18-304

Andrea Montanari Professor, Department of Electrical Engineering, Department of Statistics Stanford University This lecture is in conjunction with the LIDS Student Conference. Abstract: Let A be n × n symmetric random matrix with independent and identically distributed Gaussian entries above the diagonal. We consider the problem of maximizing xT Ax over binary vectors with ±1 entries. In the language of statistical physics, this amounts to finding the ground state of the Sherrington-Kirkpatrick model of spin glasses. The asymptotic value of this…

Find out more »## December 2018

## Large girth approximate Steiner triple systems

Lutz Warnke (Georgia Institute of Technology)

E18-304

Abstract: In 1973 Erdos asked whether there are n-vertex partial Steiner triple systems with arbitrary high girth and quadratically many triples. (Here girth is defined as the smallest integer g \ge 4 for which some g-element vertex-set contains at least g-2 triples.) We answer this question, by showing existence of approximate Steiner triple systems with arbitrary high girth. More concretely, for any fixed \ell \ge 4 we show that a natural constrained random process typically produces a partial Steiner triple…

Find out more »## Reducibility and Computational Lower Bounds for Some High-dimensional Statistics Problems

Guy Bresler (MIT)

E18-304

Abstract: The prototypical high-dimensional statistics problem entails finding a structured signal in noise. Many of these problems exhibit an intriguing phenomenon: the amount of data needed by all known computationally efficient algorithms far exceeds what is needed for inefficient algorithms that search over all possible structures. A line of work initiated by Berthet and Rigollet in 2013 has aimed to explain these gaps by reducing from conjecturally hard problems in computer science. However, the delicate nature of average-case reductions has…

Find out more »## November 2018

## Bias Reduction and Asymptotic Eﬃciency in Estimation of Smooth Functionals of High-Dimensional Covariance

Vladimir Koltchinskii (Georgia Institute of Technology)

E18-304

Abstract: We discuss a recent approach to bias reduction in a problem of estimation of smooth functionals of high-dimensional parameters of statistical models. In particular, this approach has been developed in the case of estimation of functionals of covariance operator Σ : Rd → Rd of the form f(Σ), B based on n i.i.d. observations X1, . . . , Xn sampled from the normal distribution with mean zero and covariance Σ, f : R → R being a suﬃciently…

Find out more »## IDSS Science Speed Dating Event

E18-304

Join IDSS faculty, postdocs, and graduate students for the first IDSS Science Speed Dating Event on Thursday, November 29. The purpose of this event is to help the participants to expand their network, find new research partners, and strengthen the IDSS community. The event includes lunch. To register for the IDSS Science Speed Dating Event, please sign up here. Please contact Doro Unger-Lee at dor@mit.edu with any questions you may have about this event.

Find out more »## Model-X knockoffs for controlled variable selection in high dimensional nonlinear regression

Lucas Janson (Harvard University )

E18-304

Abstract: Many contemporary large-scale applications, from genomics to advertising, involve linking a response of interest to a large set of potential explanatory variables in a nonlinear fashion, such as when the response is binary. Although this modeling problem has been extensively studied, it remains unclear how to effectively select important variables while controlling the fraction of false discoveries, even in high-dimensional logistic regression, not to mention general high-dimensional nonlinear models. To address such a practical problem, we propose a new…

Find out more »## The Regression Discontinuity Design: Methods and Applications

Rocio Titiunik (University of Michigan)

E18-304

Abstract: The Regression Discontinuity (RD) design is one of the most widely used non-experimental strategies for the study of treatment effects in the social, behavioral, biomedical, and statistical sciences. In this design, units are assigned a score and a treatment is offered if the value of that score exceeds a known threshold---and withheld otherwise. In this talk, I will discuss the assumptions under which the RD design can be used to learn about treatment effects, and how to make valid…

Find out more »## Joint estimation of parameters in Ising Model

Sumit Mukherjee (Columbia University)

E18-304

Abstract: Inference in the framework of Ising models has received significant attention in Statistics and Machine Learning in recent years. In this talk we study joint estimation of the inverse temperature parameter β, and the magnetization parameter B, given one realization from the Ising model, under the assumption that the underlying graph of the Ising model is completely specified. We show that if the graph is either irregular or sparse, then both the parameters can be estimated at rate n−1/2…

Find out more »## September 2018

## Jingbo Liu

MIT

E18-304

Abstract Concentration of measure refers to a collection of tools and results from analysis and probability theory that have been used in many areas of pure and applied mathematics. Arguably, the first data science application of measure concentration (under the name ‘‘blowing-up lemma’’) is the proof of strong converses in multiuser information theory by Ahlswede, G'acs and K"orner in 1976. Since then, measure concentration has found applications in many other information theoretic problems, most notably the converse (impossibility) results in…

Find out more »## August 2018

## Resource-efficient ML in 2 KB RAM for the Internet of Things

Prateek Jain (Microsoft Research)

E18-304

Abstract: We propose an alternative paradigm for the Internet of Things (IoT) where machine learning algorithms run locally on severely resource-constrained edge and endpoint devices without necessarily needing cloud connectivity. This enables many scenarios beyond the pale of the traditional paradigm including low-latency brain implants, precision agriculture on disconnected farms, privacy-preserving smart spectacles, etc. Towards this end, we develop novel tree and kNN based algorithm, called Bonsai and ProtoNN, for efficient prediction on IoT devices — such as those based…

Find out more »## May 2018

## Fitting a putative manifold to noisy data

Hariharan Narayanan (Tata Institute of Fundamental Research, Mumbai)

E18-304

Abstract: We give a solution to the following question from manifold learning. Suppose data belonging to a high dimensional Euclidean space is drawn independently, identically distributed from a measure supported on a low dimensional twice differentiable embedded compact manifold M, and is corrupted by a small amount of i.i.d gaussian noise. How can we produce a manifold $M_o$ whose Hausdorff distance to M is small and whose reach (normal injectivity radius) is not much smaller than the reach of M?…

Find out more »## April 2018

## Optimality of Spectral Methods for Ranking, Community Detections and Beyond

Jianqing Fan (Princeton University)

E18-304

Abstract: Spectral methods have been widely used for a large class of challenging problems, ranging from top-K ranking via pairwise comparisons, community detection, factor analysis, among others. Analyses of these spectral methods require super-norm perturbation analysis of top eigenvectors. This allows us to UNIFORMLY approximate elements in eigenvectors by linear functions of the observed random matrix that can be analyzed further. We first establish such an infinity-norm pertubation bound for top eigenvectors and apply the idea to several challenging problems…

Find out more »## May 2017

## Fast Rates for Bandit Optimization with Upper-Confidence Frank-Wolfe

Vianney Perchet (ENS Paris-Saclay)

E18-304

The Stochastics and Statistics Seminar is a weekly meeting organized by the Statistics and Data Science Center (SDSC). It consists of a series of one-hour presentations by worldwide leaders making cutting edge contributions to methodological and theoretical advances in data science. These fields include probability, statistics, optimization, and applied mathematics. The seminar also regularly features experts in applications domains such as biology or engineering. This intellectual diversity has contributed to the organic assembly of a dynamic and diverse audience articulated…

Find out more »## Invariance and Causality

Jonas Peters (University of Copenhagen)

E18-304

The Stochastics and Statistics Seminar is a weekly meeting organized by the Statistics and Data Science Center (SDSC). It consists of a series of one-hour presentations by worldwide leaders making cutting edge contributions to methodological and theoretical advances in data science. These fields include probability, statistics, optimization, and applied mathematics. The seminar also regularly features experts in applications domains such as biology or engineering. This intellectual diversity has contributed to the organic assembly of a dynamic and diverse audience articulated…

Find out more »## Some related phase transitions in phylogenetics and social network analysis

Sebastien Roch (University of Wisconsin-Madison)

E18-304

The Stochastics and Statistics Seminar is a weekly meeting organized by the Statistics and Data Science Center (SDSC). It consists of a series of one-hour presentations by worldwide leaders making cutting edge contributions to methodological and theoretical advances in data science. These fields include probability, statistics, optimization, and applied mathematics. The seminar also regularly features experts in applications domains such as biology or engineering. This intellectual diversity has contributed to the organic assembly of a dynamic and diverse audience articulated…

Find out more »## April 2017

## The Landscape of Some Statistical Problems

Andrew Montanari (Stanford University)

E18-304

The LIDS Seminar Series features distinguished speakers in the information and decision sciences who provide an overview of a research area, as well as exciting recent progress in that area. Intended for a broad audience, seminar topics span the areas of communications, computation, control, learning, networks, probability and statistics, optimization, and signal processing.

Find out more »## Active learning with seed examples and search queries

Daniel Hsu (Columbia University)

E18-304

## Sample-optimal inference, computational thresholds, and the methods of moments

David Steurer (Cornell University)

E18-304

## March 2017

## Jagers-Nerman stable age distribution theory, change point detection and power of two choices in evolving networks

Shankar Bhamidi (UNC)

E18-304

## Robust Statistics, Revisited

Ankur Moitra (MIT)

E18-304

## Computing partition functions by interpolation

Alexander Barvinok (University of Michigan)

E18-304

## February 2017

## Estimating the number of connected components of large graphs based on subgraph sampling

Yihong Wu (Yale University)

E18-304

## Dynamic Learning in Strategic Pricing Games

John Birge (University of Chicago)

E18-304

The LIDS Seminar Series features distinguished speakers in the information and decision sciences who provide an overview of a research area, as well as exciting recent progress in that area. Intended for a broad audience, seminar topics span the areas of communications, computation, control, learning, networks, probability and statistics, optimization, and signal processing.

Find out more »## Causal Discovery in Systems with Feedback Cycles

Frederick Eberhardt (CalTech)

E18-304

## Slope meets Lasso in sparse linear regression

Pierre Bellec (Rutgers University)

E18-304

## Non-classical Berry-Esseen inequality and accuracy of the weighted bootstrap

Mayya Zhilova (Georgia Tech)

E18-304

## November 2016

## Stochastics and Statistics Seminar Series

Po-Ling Loh (University of Pennsylvania)

E18-304

## October 2016

## Matrix estimation by Universal Singular Value Thresholding

Sourav Chatterjee (Stanford University)

E18-304