Econometrics meets Natural Language Processing: from Topic Analysis to Large Language Models

Perugia, 22-26 July 2024



Juri Marcucci
Bank of Italy
Via Nazionale 91, 00184 Rome, Italy



  • José Luis Montiel Olea (Cornell University)
  • Jordan Lee Boyd-Graber (University of Maryland)



Prior knowledge of statistics and econometrics, along with some programming skills, is required.


Course description

This intensive summer school course bridges the gap between econometrics and natural language processing (NLP), guiding participants from the foundational concepts of topic analysis to the cutting-edge advancements in large language models (LLMs). Designed for researchers, PhD students, and professionals with an interest in data science, econometrics, and computational linguistics, the program offers a unique opportunity to explore how NLP tasks can enhance econometric methods and vice versa. Through a combination of lectures and hands-on workshops, participants will gain a deep understanding of how to apply econometric techniques in topic modeling and leverage the power of NLP and LLMs for economic data analysis and beyond.

a) Natural Language Processing

  • Basics of NLP: text preprocessing and feature extraction
  • The evolution of language models: from RNNs to Transformers
  • Understanding BERT, GPT, and other Transformer Architectures
  • Fine-Tuning LLMs for custom applications. Hands-on with Transformers and Hugging Face
  • Future Directions in LLMs: mitigating hallucinations, explainability

b) Econometrics

  • Introduction to Topic models: LDA and Beyond
  • Identification of Topic models
  • Econometric applications with Topic models
  • Anchor words
  • Hands-on with topic models in econometrics and anchor words

Lecture notes, slides, code, and data will be provided.


Schedule of the course:

Day 1 Morning/Afternoon

  • How we got here: a brief history of modern natural language processing and what made large language models possible
  • Distributed and distributional representations
    • tf-idf
    • word2vec
  • Exercise: Nearest neighbor search


Day 2 Morning/Afternoon

  • Review of Logistic Regression and SGD
  • Learning word2vec representations
  • Language Modeling “Big Picture”
  • What’s an LLM
  • History of LLMs: DAN and ELMo
  • Transformers


Day 3 Morning

  • Topic Models for Text Analysis: Identification
    • Partial Identification and the Nonnegative Matrix Factorization Problem
    • Identification via “Anchor Words”
    • Identification via the “Sufficiently Scattered Condition”

Day 3 Afternoon

  • Topic Models for Text Analysis: Estimation
    • Optimal Estimation of sparse Topic Models with Anchor Words
    • Likelihood estimation of sparse topic distributions in topic models with Anchor Words
    • Likelihood estimation of Topic Models satisfying the sufficiently scattered condition


Day 4 Morning

  • Fine-tuning and RLHF
  • Workshop: Question Answering

Day 4 Afternoon

  • Topic Models for Text Analysis: Testability of Identifying Assumptions
    • Testability of the anchor words assumption in topic models
    • Checking the Sufficiently Scattered Condition using Non-Convex Optimization

Day 5 Morning

  • A menagerie of NLP tasks and datasets
    • Machine translation
    • Entailment
    • Retrieval
    • Fact Checking
    • Parsing
    • Summarization
  • Evaluating NLP models
    • Perplexity
    • Precision / Recall
    • Sequence to Sequence
    • Structured Prediction
    • Interpretability

Day 5 Afternoon

  • Topic Models for Text Analysis and LLMs
    • Are LLMs related to latent variable models?
    • Topic distributions in LLMs


Venue and timetables

The Module will be held in the Bank of Italy's Scuola di Automazione per Dirigenti Bancari (S.A.Di.Ba.), via San Marco n. 54, Perugia. Participants will be accommodated at S.A.Di.Ba.; if room availability at the Centre is limited, they will be accommodated in local hotels.
Lectures and tutorials will be in English, with the following schedule:

  • Monday to Friday: lectures 9:00-12:30, 14:30-18:00