Machine Learning for Greenhorns – Winter 2019/20

Machine learning is achieving notable success in solving complex tasks in many fields. This course serves as an introduction to basic machine learning concepts and techniques, focusing both on the theoretical foundations and on the implementation and use of machine learning algorithms in the Python programming language. Considerable attention is paid to applying machine learning techniques to practical tasks, in which students try to devise the best-performing solution.

Python programming skills are required, together with basic probability theory knowledge.

About

SIS code: NPFL129
Semester: winter
E-credits: 5
Examination: 2/2 C+Ex
Guarantor: Milan Straka

Timespace Coordinates

  • lecture: the lecture is held on Monday 14:00 in S3; first lecture is on Oct 07
  • practicals: there are two parallel practicals, on Tuesday 9:00 in SU1 and on Tuesday 12:20 in SU2; first practicals are on Oct 08

Lectures

1. Introduction to Machine Learning Slides PDF Slides linear_regression_manual linear_regression_l2 linear_regression_competition

2. Linear Regression II, SGD, Perceptron Slides PDF Slides feature_engineering linear_regression_sgd perceptron binary_classification_competition


Requirements

To pass the practicals, you need to obtain at least 80 points (excluding the bonus points), which are awarded for home assignments. Note that up to 40 points above 80 will be transferred to the exam.

To pass the exam, you need to obtain at least 60, 75 and 90 out of 100 points for the written exam (plus up to 40 points from the practicals), to obtain grades 3, 2 and 1, respectively.
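The point arithmetic above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the sketch assumes that the points transferred from the practicals are simply added to the written exam score before the grade thresholds are applied.

```python
def transferred_points(practical_points, pass_limit=80, max_transfer=40):
    """Points carried over from the practicals to the exam (hypothetical helper).

    Assumption: only points above the pass limit count, capped at max_transfer.
    """
    return min(max_transfer, max(0, practical_points - pass_limit))


def exam_grade(written_points, practical_points):
    """Return grade 1, 2 or 3, or None when the exam is not passed."""
    total = written_points + transferred_points(practical_points)
    # Thresholds from the requirements: 90 -> grade 1, 75 -> grade 2, 60 -> grade 3.
    for threshold, grade in ((90, 1), (75, 2), (60, 3)):
        if total >= threshold:
            return grade
    return None
```

For example, under these assumptions a student with 95 points from the practicals carries 15 points over, so a written exam score of 62 counts as 77 and yields grade 2.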

The lecture content, including references to study materials, is listed below. The main study material is Pattern Recognition and Machine Learning by Christopher Bishop, referred to as PRML.

References to study materials cover all theory required at the exam, and sometimes even more – the references in italics cover topics not required for the exam.

1. Introduction to Machine Learning

 Oct 07 Slides PDF Slides linear_regression_manual linear_regression_l2 linear_regression_competition

2. Linear Regression II, SGD, Perceptron

 Oct 14 Slides PDF Slides feature_engineering linear_regression_sgd perceptron binary_classification_competition

The tasks are evaluated automatically using the ReCodEx Code Examiner. The evaluation is performed using Python 3.6, scikit-learn 0.21.3, pandas 0.25.1 and NumPy 1.17.2.

You can install all required packages either to user packages using pip3 install --user scikit-learn==0.21.3 pandas==0.25.1, or create a virtual environment using python3 -m venv VENV_DIR and then installing the packages inside it by running VENV_DIR/bin/pip3 install scikit-learn==0.21.3 pandas==0.25.1.

Teamwork

Working in teams of size 2 (or at most 3) is encouraged. All members of the team must submit in ReCodEx individually, but can have exactly the same sources/models/results. However, each such solution must explicitly list all members of the team, using this template, to allow plagiarism detection.

linear_regression_manual

 Deadline: Oct 20, 23:59  3 points

Starting with the linear_regression_manual.py template, solve a linear regression problem using the algorithm from the lecture, which explicitly computes the matrix inverse. Then compute the root mean square error on the test set.
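A minimal sketch of this kind of solution is shown below. It is not the course template: the synthetic data, the 80/20 split and the helper names are assumptions made for illustration. The weights are obtained from the normal equations, w = (XᵀX)⁻¹Xᵀt, with a constant feature appended for the bias.

```python
import numpy as np


def add_bias(X):
    """Append a constant 1 feature so the bias is learned as a weight."""
    return np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)


def fit_linear_regression(X, t):
    """Solve the normal equations by explicitly inverting X^T X."""
    X = add_bias(X)
    return np.linalg.inv(X.T @ X) @ X.T @ t


def rmse(X, t, w):
    """Root mean square error of the predictions add_bias(X) @ w."""
    return np.sqrt(np.mean((add_bias(X) @ w - t) ** 2))


# Synthetic data (an assumption; the assignment uses its own dataset).
rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5, 3.0])  # the last entry is the bias
t = X @ true_w[:-1] + true_w[-1] + rng.normal(scale=0.1, size=100)

# Hold out the last 20 examples as a test set.
X_train, X_test, t_train, t_test = X[:80], X[80:], t[:80], t[80:]

w = fit_linear_regression(X_train, t_train)
test_rmse = rmse(X_test, t_test, w)
print("Test RMSE: {:.3f}".format(test_rmse))
```

Explicit inversion matches the assignment text; in other settings, np.linalg.solve or np.linalg.lstsq is usually the numerically safer choice.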

linear_regression_l2

 Deadline: Oct 20, 23:59  3 points

Starting with the linear_regression_l2.py template, use scikit-learn to train regularized linear regression models and print the results of the best of them.
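One way such an experiment could look is sketched below, using sklearn.linear_model.Ridge. The synthetic data, the grid of regularization strengths and the selection on a single held-out split are all assumptions; the actual template may prescribe a different dataset and procedure.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption made for illustration).
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 10))
t = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)
X_train, X_test, t_train, t_test = train_test_split(
    X, t, test_size=0.5, random_state=0)

# Try several L2 regularization strengths and record the held-out RMSE of each.
lambdas = np.geomspace(0.01, 100, num=20)
rmses = []
for lam in lambdas:
    model = Ridge(alpha=lam).fit(X_train, t_train)
    rmses.append(np.sqrt(mean_squared_error(t_test, model.predict(X_test))))

best = int(np.argmin(rmses))
best_lambda, best_rmse = lambdas[best], rmses[best]
print("Best lambda: {:.2f}, RMSE: {:.3f}".format(best_lambda, best_rmse))
```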

linear_regression_competition

 Deadline: Oct 20, 23:59  3 points+5 bonus

This assignment is a competition task. Your goal is to perform linear regression on the data from a rental shop. The train set contains 1000 instances, each instance consists of 12 features, both integral and real.

The linear_regression_competition.py template shows how to load the linear_regression_competition.train.npz dataset available in the repository. It also shows how to save a trained estimator, how to load it, and the recodex_predict method, which is called during ReCodEx evaluation.

The performance of your system is measured using root mean squared error and your goal is to achieve RMSE less than 130. Note that you can use any sklearn algorithm to solve this exercise.

feature_engineering

 Deadline: Nov 03, 23:59  4 points

Starting with the feature_engineering.py template, learn how to perform basic feature engineering using scikit-learn.

linear_regression_sgd

 Deadline: Nov 03, 23:59  5 points

Starting with the linear_regression_sgd.py template, implement minibatch SGD for linear regression. Evaluate it using cross-validation and compare the results to an explicit linear regression solver.
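The sketch below illustrates the idea on synthetic data. The hyperparameters (batch size, learning rate, number of epochs), the hand-written 5-fold cross-validation and all helper names are assumptions, not the values or structure used by the template. The explicit solver here is np.linalg.lstsq.

```python
import numpy as np


def add_bias(X):
    """Append a constant 1 feature so the bias is learned as a weight."""
    return np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)


def minibatch_sgd(X, t, batch_size=10, epochs=200, learning_rate=0.05, seed=0):
    """Fit linear regression weights with minibatch SGD on the squared error."""
    rng = np.random.RandomState(seed)
    X = add_bias(X)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        permutation = rng.permutation(X.shape[0])  # reshuffle every epoch
        for start in range(0, len(permutation), batch_size):
            batch = permutation[start:start + batch_size]
            errors = X[batch] @ w - t[batch]
            # Average gradient of half the squared error over the batch.
            w -= learning_rate * (X[batch].T @ errors) / len(batch)
    return w


def explicit_solution(X, t):
    """Least-squares weights computed by an explicit solver."""
    return np.linalg.lstsq(add_bias(X), t, rcond=None)[0]


def rmse(X, t, w):
    return np.sqrt(np.mean((add_bias(X) @ w - t) ** 2))


def cross_validate(fit, X, t, folds=5, seed=0):
    """Mean test RMSE of `fit` over a simple k-fold split."""
    indices = np.random.RandomState(seed).permutation(X.shape[0])
    scores = []
    for test_idx in np.array_split(indices, folds):
        train_idx = np.setdiff1d(indices, test_idx)
        w = fit(X[train_idx], t[train_idx])
        scores.append(rmse(X[test_idx], t[test_idx], w))
    return np.mean(scores)


# Synthetic data (an assumption made for illustration).
rng = np.random.RandomState(42)
X = rng.uniform(-1, 1, size=(100, 3))
t = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.1, size=100)

sgd_rmse = cross_validate(minibatch_sgd, X, t)
explicit_rmse = cross_validate(explicit_solution, X, t)
print("SGD RMSE: {:.3f}, explicit RMSE: {:.3f}".format(sgd_rmse, explicit_rmse))
```

With enough epochs and a suitably small learning rate, the two RMSE values should be close; a large gap usually points to a learning rate that is too high or too few passes over the data.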

perceptron

 Deadline: Nov 03, 23:59  3 points

Starting with the perceptron.py template, implement the perceptron algorithm.
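For orientation, a textbook version of the perceptron algorithm might look like the sketch below. It assumes targets in {-1, +1}, a bias handled via a constant feature, and linearly separable synthetic data; the template may differ in data and interface.

```python
import numpy as np


def add_bias(X):
    """Append a constant 1 feature so the bias is learned as a weight."""
    return np.concatenate([X, np.ones((X.shape[0], 1))], axis=1)


def perceptron(X, t, max_epochs=1000):
    """Train a perceptron; targets t are expected to be -1 or +1."""
    X = add_bias(X)
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, t_i in zip(X, t):
            # Update only on misclassified (or zero-margin) examples.
            if t_i * (x_i @ w) <= 0:
                w += t_i * x_i
                mistakes += 1
        if mistakes == 0:  # a full pass without updates: training data separated
            break
    return w


def predict(X, w):
    return np.where(add_bias(X) @ w > 0, 1, -1)


# Two well-separated clusters (an assumption made for illustration).
rng = np.random.RandomState(0)
positives = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
negatives = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
X = np.concatenate([positives, negatives])
t = np.concatenate([np.ones(50), -np.ones(50)])

w = perceptron(X, t)
accuracy = np.mean(predict(X, w) == t)
print("Training accuracy: {:.2f}".format(accuracy))
```

On data that is not linearly separable the loop never reaches a mistake-free pass, which is why the sketch caps the number of epochs.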

binary_classification_competition

 Deadline: Nov 03, 23:59  4 points+5 bonus

This assignment is a competition task. Your goal is to perform binary classification on the data from contract approval. The train set contains 500 instances, each instance consists of 15 features, both integral and real.

The rest of the details will appear later.

In the competitions, your goal is to train a model and then predict target values on the test set available only in ReCodEx.

Submitting to ReCodEx

When submitting a competition solution to ReCodEx, you can include any number of files of any kind. However, there must be exactly one Python source (.py) containing a top-level method recodex_predict. This method is called with the test input data as a NumPy array and should return the predictions, again as a NumPy array.

If your submission contains one or more trained models, you should also submit the Python source you used to train them.
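A submission following these rules could be organized roughly as in the sketch below. Only the name recodex_predict and its array-in, array-out contract come from the description above; the file name competition_model.pickle, the train_and_save helper and the plain least-squares model are hypothetical placeholders.

```python
import pickle

import numpy as np

# Hypothetical file name; the trained model is submitted alongside this source.
MODEL_PATH = "competition_model.pickle"


def add_bias(data):
    """Append a constant 1 feature so the bias is learned as a weight."""
    return np.concatenate([data, np.ones((data.shape[0], 1))], axis=1)


def train_and_save(train_data, train_target, path=MODEL_PATH):
    """Fit a least-squares model locally and store its weights (placeholder model)."""
    weights = np.linalg.lstsq(add_bias(train_data), train_target, rcond=None)[0]
    with open(path, "wb") as model_file:
        pickle.dump(weights, model_file)


def recodex_predict(data):
    """Top-level entry point: takes a NumPy array, returns a NumPy array."""
    with open(MODEL_PATH, "rb") as model_file:
        weights = pickle.load(model_file)
    return add_bias(data) @ weights
```

In this layout, train_and_save is run locally before submitting, and both the source and the pickled model file are uploaded, so the training code is included as requested.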

Evaluation in ReCodEx

ReCodEx starts the evaluation by importing all Python sources and checking whether they export a recodex_predict method. It then executes the method, evaluates the predictions, and returns one of the following results:

  • Failed, 0%: Either there was not exactly one Python source with recodex_predict, or it crashed during prediction, or it generated an output with incorrect size.
  • OK, 1-99%: Output of the correct size was returned by recodex_predict, but it did not achieve the required performance. The percentage returned is either achieved/required or required/achieved (depending on whether the goal is to get over/under the requirement). No points are awarded.
  • OK, 100%: The required level of performance was reached; however, the exact performance is unknown.
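The reported percentage can be illustrated with the sketch below. The function name and the capping at 100% are assumptions, and the way ReCodEx rounds the value is not specified here, so the sketch returns an unrounded fraction.

```python
def performance_ratio(achieved, required, lower_is_better):
    """Fraction of the required performance that was reached, capped at 1.0.

    Hypothetical helper: required/achieved when the goal is to get under
    the requirement (e.g. RMSE), achieved/required when the goal is to get
    over it (e.g. accuracy).
    """
    if lower_is_better:
        if achieved <= 0:  # a perfect score; avoids division by zero
            return 1.0
        ratio = required / achieved
    else:
        ratio = achieved / required
    return min(1.0, ratio)
```

For instance, with a required RMSE of 130, a submission reaching an RMSE of 162.5 would correspond to 130/162.5 = 0.8, that is 80%.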

After the deadline, the exact performance becomes visible for all submissions.

Note that in any case, the exit code of your solution is reported as 0.

Points for Competition Submission

Everyone surpassing the required performance immediately gets the regular points for the assignment.

Furthermore, after the deadline, the latest submission of every user passing the required baseline is collected, and bonus points are awarded depending on the relative ordering of the performance of the selected submissions.

What Is Allowed

  • You can use only the given annotated data for training. However, you can use any unannotated or manually created data.
  • Do not use test set annotations in any way.
  • You can use generally any algorithm present in sklearn, numpy or scipy, or anything you implement yourself. Do not use deep network frameworks like TensorFlow or PyTorch.