Textual analysis and machine learning with applications to economics and finance

Perugia, 11-15 July 2022



Juri Marcucci
Bank of Italy
Via Nazionale 91, 00184 Rome, Italy
Email: juri.marcucci@bancaditalia.it, juri.marcucci@gmail.com




The in-person course will run only if at least 15 participants enrol.

The maximum number of in-person participants is 30.

Prerequisites: participants should have a basic knowledge of statistics and a basic understanding of computer programming. The tutorial available at https://www.learnpython.org/ can be used to learn or review the basics of programming in Python. Participants must install Anaconda (https://www.anaconda.com/products/individual) before the course begins so that they have a working programming environment.


Reference textbooks for the course:

  • Altig, D., Baker, S., Barrero, J. M., Bloom, N., Bunn, P., Chen, S., ... & Thwaites, G. (2020). Economic uncertainty before and during the COVID-19 pandemic. Journal of Public Economics, 191, 104274.
  • Kearney, C., & Liu, S. (2014). Textual sentiment in finance: A survey of methods and models. International Review of Financial Analysis, 33, 171-185.
  • Picault, M., Pinter, J., & Renault, T. (2022). Media sentiment on monetary policy: determinants and relevance for inflation expectations. Journal of International Money and Finance, Forthcoming.
  • Picault, M., & Renault, T. (2017). Words are not all created equal: A new measure of ECB communication. Journal of International Money and Finance, 79, 136-156.
  • Loughran, T., & McDonald, B. (2016). Textual analysis in accounting and finance: A survey. Journal of Accounting Research, 54(4), 1187-1230.
  • Renault, T. (2020). Sentiment analysis and machine learning in finance: a comparison of methods and models on one million messages. Digital Finance, 2(1), 1-13.
  • Renault, T. (2017). Intraday online investor sentiment and return patterns in the US stock market. Journal of Banking & Finance, 84, 25-40.
  • Thorsrud, L. A. (2020). Words are the new numbers: A newsy coincident index of the business cycle. Journal of Business & Economic Statistics, 38(2), 393-409.
  • Mitchell, R. (2018). Web scraping with Python: Collecting more data from the modern web. O'Reilly Media, Inc.
  • Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied text analysis with Python: Enabling language-aware data products with machine learning. O'Reilly Media, Inc.
  • Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with Python: Analyzing text with the natural language toolkit. O'Reilly Media, Inc.


Course description

The objective of this course is to study how the millions of texts published on the Internet and social media every day can be used to improve our understanding of various economic and financial phenomena. After an introduction to the Python programming language, we will see how to extract online content using existing APIs or by implementing web scraping tools. We will create an application to collect articles from a major media site, and we will use an API to extract tweets from a social network dedicated to finance. Next, we will see how to analyse a text using Natural Language Processing (NLP) methods. We will apply these methods to speeches by the European Central Bank to show how structure can be given to unstructured data. The next session will be dedicated to sentiment analysis and will present the different methods (dictionary approach and machine learning). We will analyse Twitter data to build a sentiment indicator capturing the well-being of individuals in a country. Then, we will introduce unsupervised methods of textual analysis, with a particular focus on topic modelling. We will apply Latent Dirichlet Allocation to a large corpus of Wikipedia articles. Finally, the last session will be devoted to advanced methods of textual analysis, broadening the range of possibilities by introducing different methods of machine learning, word embedding and data structuring.
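As a rough illustration of the dictionary approach to sentiment analysis mentioned in the description, the sketch below counts positive and negative words in a text and normalises by text length. It is not taken from the course materials: the word lists, the function names `tokenize` and `sentiment_score`, and the example sentence are all hypothetical and chosen only to keep the example self-contained. Applications in the literature rely on published dictionaries rather than toy lists like these.

```python
import re
from collections import Counter

# Hypothetical toy word lists, for illustration only (an assumption of this
# sketch); real applications use published sentiment dictionaries.
POSITIVE = {"growth", "improve", "strong", "gain"}
NEGATIVE = {"risk", "decline", "weak", "loss"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sentiment_score(text):
    """Return (positive count - negative count) / number of tokens.

    Returns 0.0 for a text with no tokens.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    pos = sum(counts[word] for word in POSITIVE)
    neg = sum(counts[word] for word in NEGATIVE)
    return (pos - neg) / len(tokens)


example = "Strong growth and further gain despite some risk."
print(tokenize(example))
print(sentiment_score(example))  # 3 positive, 1 negative, 8 tokens -> 0.25
```

The score lies between -1 and 1. Dividing by the number of tokens makes texts of different lengths comparable, which is one common normalisation among several possible choices.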

In each session, we will first present the related theories and methods - in a language accessible to non-mathematicians - together with their latest applications in the economic and financial literature. We will then study and share with participants the scripts and code needed to carry out different tasks in Python. We will also offer participants the opportunity to present their research and/or projects and, where possible, we will assist them with their projects - both on the data collection side and on the data analysis side.


Schedule of the course:

Mon 11 Jul    9:00 - 12:30   Introduction to Python
             14:30 - 18:00   Application: How to get data from APIs and websites

Tue 12 Jul    9:00 - 12:30   Natural Language Processing
             14:30 - 18:00   Application: NLP to analyse central bank

Wed 13 Jul    9:00 - 12:30   Sentiment Analysis
             14:30 - 18:00   Application: Measuring well-being on Twitter

Thu 14 Jul    9:00 - 12:30   Unsupervised methods for textual analysis
             14:30 - 18:00   Application: Latent Dirichlet Allocation on Wikipedia

Fri 15 Jul    9:00 - 12:30   Paper presentation
             14:30 - 18:00   Advanced methods in text mining


Venue and timetables

The module will be held at the Bank of Italy's Scuola di Automazione per Dirigenti Bancari (S.A.Di.Ba.), via San Marco n. 54, Perugia. Participants will be accommodated at S.A.Di.Ba.; if room availability at the Centre is limited, they will be accommodated in local hotels.
Lectures and tutorials will be in English, with the following schedule:

  • Monday to Friday: lectures 9:00-12:30, 14:30-18:00