IDSS Distinguished Seminar Series

Automating the Digitization of Historical Data on a Large Scale

December 2, 2019 @ 4:00 pm - 5:00 pm

Melissa Dell (Harvard University)


Over the past two centuries, we have transitioned from an overwhelmingly agricultural world to one with vastly different patterns of economic organization. This transition has been remarkably uneven across space and time, and has important implications for some of the most central challenges facing societies today. Deepening our understanding of the determinants of economic transformation requires data on the long-run trajectories of individuals and firms. However, these data overwhelmingly remain trapped in hard copy, with cost estimates for manual digitization totaling millions of dollars for even relatively modestly sized datasets. Automation has the potential to massively scale up the extraction of historical quantitative data from hard copy documents, significantly expanding and democratizing access. However, the synthesis of methodology required to digitize and catalog most historical data is not available off-the-shelf through commercial OCR software, which performs poorly at recognizing irregular document layouts. Off-the-shelf tools for assembling raw unstructured output into structured databases likewise do not exist.

We develop methods for automating the digitization and classification of historical data on a large scale, illustrating their application to a rich corpus of historical Japanese documents about firms and individuals. An array of methods from computer vision, natural language processing, and machine learning are used to detect complex document layouts and assemble a rich structured dataset that tracks the evolution of network relationships between Japanese managers, government officials, and firms across the 20th century.

About the Speaker: Melissa Dell is a professor in the Economics Department and a faculty research fellow at the National Bureau of Economic Research. Her research focuses on long-run economic development, primarily in Latin America and Asia. She has examined the impacts of weather on economic growth and is currently conducting research on the long-run effects of agrarian reform and agricultural technology investments in Mexico and East Asia. She received a PhD in Economics from MIT, a master's degree in Economics from Oxford, and a BA from Harvard College.

Reception to follow.

© MIT Institute for Data, Systems, and Society | 77 Massachusetts Avenue | Cambridge, MA 02139-4307 | 617-253-1764