An Introduction to Machine Learning and Text Mining

for economists using Stata, Python and R

THE COURSE IS DELIVERED IN ONLINE MODE.

July 26th - 30th, 2021

 

 

Coordinator:

Juri Marcucci
Bank of Italy
Via Nazionale 91, 00184 Rome, Italy
Email: juri.marcucci@bancaditalia.it, juri.marcucci@gmail.com

 

Lecturers

Programs are conditional on the enrollment of a minimum of 15 participants.

 

Basic Requirements

Knowledge of basic statistics and econometrics including:

  • notion of conditional expectation and related properties
  • point and interval estimation
  • regression model and related properties
  • probit, logit and multinomial regression

A working knowledge of Stata and R is preferable.

 

Course description:

Machine Learning (ML) is a relatively new approach to data analytics that sits at the intersection of statistics, computer science, and artificial intelligence. The primary objective of ML is to turn information into knowledge and value by “letting the data speak”. To this end, ML limits prior assumptions on the data structure and relies on a model-free philosophy that favors algorithm development, computational procedures, and graphical inspection over tight assumptions, algebraic development, and analytical solutions. Computationally unfeasible only a few years ago, ML is a product of the computer era: of the computing power and learning ability of today's machines, of hardware development, and of continuous software upgrading.

This course is a primer on ML techniques for economists and social scientists using three popular software platforms: Stata, Python, and R. These platforms offer many built-in packages for running ML algorithms easily. The course aims to make participants familiar with the potential of these packages to draw knowledge and value from raw, large, and possibly noisy data. The teaching approach will rely mainly on graphical language and intuition rather than on algebra. The training will use both instructional and real-world examples, and will evenly balance theory and practical sessions. The course will also offer one day of training on recent advances in Text Mining and Sentiment Analysis, covering useful methodologies for quantitatively analyzing texts, discovering significant patterns, and identifying useful information in textual data.

After the course, participants are expected to have a better understanding of the potential of ML and text mining, and thus to be able to master research tasks including, among others, factor-importance detection, signal-from-noise extraction, correct model specification, and model-free classification, from both a data-mining and a predictive perspective.

 

Course Schedule:

 

Monday July 26, 2021

10:00AM-11:30AM Lecture (first part)

Fundamentals of Machine Learning: definition, rationale, and usefulness

11:30AM-11:45AM Coffee break

11:45AM-1:15PM Lecture (second part)

Test-error estimation: information criteria and resampling techniques

1:15PM-1:30PM Q&A

1:30PM-2:30PM Lunch break

2:30PM-3:30PM Lecture (third part)

Regularized regression: Lasso, Ridge, and Elastic Net regressions

3:30PM-3:45PM Coffee break

3:45PM-4:45PM Computer Lab

Regularized regression with Stata and R

4:45PM-5:30PM Q&A

 

Tuesday July 27, 2021

10:00AM-11:30AM Lecture (first part)

Optimal model selection: exhaustive, forward, and backward methods

11:30AM-11:45AM Coffee break

11:45AM-1:15PM Lecture (second part)

Classification: discriminant analysis and the nearest-neighbor algorithm

1:15PM-1:30PM Q&A

1:30PM-2:30PM Lunch break

2:30PM-3:30PM Computer Lab

Discriminant analysis using Stata

3:30PM-3:45PM Coffee break

3:45PM-4:45PM Computer Lab

Nearest neighbor classification and regression with Stata and R

4:45PM-5:30PM Q&A

 

Wednesday July 28, 2021

10:00AM-11:30AM Lecture (first part)

Tree-based regression and classification

11:30AM-11:45AM Coffee break

11:45AM-1:15PM Lecture (second part)

Bagging, Random forests, and Boosting

1:15PM-1:30PM Q&A

1:30PM-2:30PM Lunch break

2:30PM-3:30PM Computer Lab

Tree-based methods using R

3:30PM-3:45PM Coffee break

3:45PM-4:45PM Computer Lab

Bagging, Random forests, and Boosting with R

4:45PM-5:30PM Q&A

 

Thursday July 29, 2021

10:00AM-11:30AM Lecture (first part)

Neural networks and meta-learning

11:30AM-11:45AM Coffee break

11:45AM-1:15PM Lecture (second part)

An introduction to the Python Scikit-learn platform

1:15PM-1:30PM Q&A

1:30PM-2:30PM Lunch break

2:30PM-3:30PM Computer Lab

Neural networks in Python

3:30PM-3:45PM Coffee break

3:45PM-4:45PM Computer Lab

Super-learning machine using Scikit-learn

4:45PM-5:30PM Q&A

 

Friday July 30, 2021

10:00AM-11:30AM Lecture (first part)

Text mining for economic analysis  

11:30AM-11:45AM Coffee break

11:45AM-1:15PM Lecture (second part)

Sentiment analysis for economic applications

1:15PM-1:30PM Q&A

1:30PM-2:30PM Lunch break

2:30PM-3:30PM Computer Lab

Text mining using R

3:30PM-3:45PM Coffee break

3:45PM-4:45PM Computer Lab

Sentiment analysis using R 

4:45PM-5:30PM Q&A

 

Software and Tutorials

The course will use R, Python, and Stata (version 16 preferable). Attendees also need to have R, RStudio, and Python 3.7 installed on their laptops. The Anaconda distribution of Python is appropriate.

 

References

Machine Learning

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York: Springer.

Hastie, T., Tibshirani, R., and Friedman, J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd edition. Springer.

Raschka, S., and Mirjalili, V. (2019). Python Machine Learning. 3rd edition. Packt Publishing.

Cerulli, G. (2020). Improving econometric prediction by machine learning. Applied Economics Letters, forthcoming. doi: https://doi.org/10.1080/13504851.2020.1820939.

Text Mining

Jo, T. (2019). Text mining. Studies in Big Data. Cham: Springer International Publishing.

Sentiment Analysis

Liu, B. (2020). Sentiment analysis: Mining opinions, sentiments, and emotions. Cambridge University Press.
 

Venue and timetables

The Module will last one week and will be delivered ONLINE.

Lectures and tutorials will be in English, with the following schedule:

Monday to Friday: lectures and tutorials 10:00-13:30, 14:30-17:30.

 

Fees and Enrollment

  •  Students, PhD students and temporary university staff: 390€
  •  University staff: 490€ 
  •  Others: 1500€

Participants who enroll in two or more courses (up to a maximum of three) are entitled to a discount on each course: 100 euros for Student and Staff participants, and 300 euros for other participants.

 

Withdrawal and refund:

To submit a withdrawal request, please send an email to admin@side-iea.it.

You may withdraw immediately after the notification of acceptance or at a later time.
After payment, you can submit your withdrawal up to one week (7 days) before the beginning of the course (within the terms for refund) and ask for a refund, stating your reasons (documented health reasons, study purposes, or personal reasons). We will refund your fee with a deduction for administrative and organization costs: 150 euros for in-person courses, 100 euros for online courses.
After the refund deadline (less than 7 days before the beginning of the course), you need to give reasons for your request (as indicated above), which will be submitted to the SIdE President.

 

Important dates:

Application Deadline: May 30th, 2021

Notification of Acceptance: by June 15th, 2021

Fee Payment Deadline: June 30th, 2021

 

Contacts