Faculty of Mathematics and Physics

**Machine learning** is achieving notable success in solving complex tasks across
many fields. This course serves as an **introduction** to **basic machine learning**
concepts and techniques, focusing both on the **theoretical foundations** and on
the **implementation and utilization** of machine learning algorithms in the **Python**
programming language. Considerable attention is paid to applying machine learning
techniques to practical tasks, in which the students try to devise a solution
with the highest possible performance.

**Python** programming skills are required, together with basic
**probability theory** knowledge.

SIS code: NPFL129

Semester: winter

E-credits: 5

Examination: 2/2 C+Ex

Guarantor: Milan Straka

**lecture**: the lecture is held on Monday 14:00 in S3; the first lecture is on **Oct 07**

**practicals**: there are two parallel practicals, on Tuesday 9:00 in SU1 and on Tuesday 12:20 in SU2; the first practicals are on **Oct 08**

1. Introduction to Machine Learning Slides PDF Slides linear_regression_manual linear_regression_l2 linear_regression_competition

2. Linear Regression II, SGD, Perceptron Slides PDF Slides feature_engineering linear_regression_sgd perceptron binary_classification_competition

To pass the practicals, you need to obtain at least **80** points (excluding
the bonus points), which are awarded for home assignments. Note that up to
**40** points above 80 will be transferred to the exam.

To pass the exam, you need to obtain at least 60, 75, or 90 out of 100 points on the written exam (plus up to 40 points from the practicals) to obtain grades 3, 2, or 1, respectively.

The lecture content is listed below, including references to study materials. The main study material is *Pattern Recognition and Machine Learning* by Christopher Bishop, referred to as PRML.

References to study materials cover **all theory required** at the exam,
and sometimes even more – the references in *italics* cover topics
**not required** for the exam.

Oct 07 Slides PDF Slides linear_regression_manual linear_regression_l2 linear_regression_competition

Oct 14 Slides PDF Slides feature_engineering linear_regression_sgd perceptron binary_classification_competition

The tasks are evaluated automatically using the ReCodEx Code Examiner. The evaluation is performed using Python 3.6, scikit-learn 0.21.3, pandas 0.25.1 and NumPy 1.17.2.

You can install all required packages either to user packages using
`pip3 install --user scikit-learn==0.21.3 pandas==0.25.1`,
or create a virtual environment using `python3 -m venv VENV_DIR`
and then install the packages inside it by running
`VENV_DIR/bin/pip3 install scikit-learn==0.21.3 pandas==0.25.1`.

Working in teams of size 2 (or at most 3) is encouraged. All members of the team
must submit in ReCodEx individually, but can have exactly the same
sources/models/results. **However, each such solution must explicitly list all
members of the team to allow plagiarism detection using
this template.**

Deadline: Oct 20, 23:59 3 points

Starting with the linear_regression_manual.py template, solve a linear regression problem using the algorithm from the lecture which explicitly computes the matrix inversion. Then compute the root mean square error on the test set.
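The explicit solution can be sketched as follows (a minimal illustration on toy data; the actual template's interface and dataset differ). The normal equation computes the weights via matrix inversion, with a column of ones appended for the bias:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy regression data standing in for the assignment's dataset.
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
t = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.randn(100)

X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.5, random_state=42)

# Append a column of ones to represent the bias term.
X_train_b = np.concatenate([X_train, np.ones((len(X_train), 1))], axis=1)
X_test_b = np.concatenate([X_test, np.ones((len(X_test), 1))], axis=1)

# Normal equation w = (X^T X)^{-1} X^T t, via explicit matrix inversion.
w = np.linalg.inv(X_train_b.T @ X_train_b) @ X_train_b.T @ t_train

# Root mean square error on the test set.
rmse = np.sqrt(np.mean((X_test_b @ w - t_test) ** 2))
print(rmse)
```

In practice `np.linalg.lstsq` or a pseudo-inverse is numerically safer, but the assignment asks specifically for the explicit inversion.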

Deadline: Oct 20, 23:59 3 points

Starting with the linear_regression_l2.py template, use `scikit-learn` to train regularized linear regression models and print the results of the best of them.
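A possible shape of such a search (a sketch on toy data; the template's data, metric, and grid of regularization strengths may differ) is to fit `sklearn.linear_model.Ridge` for several values of `alpha` and keep the best:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Toy regression data standing in for the assignment's dataset.
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
t = X @ rng.randn(5) + 0.1 * rng.randn(200)
X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.5, random_state=42)

best_rmse, best_alpha = None, None
for alpha in np.geomspace(0.01, 100, num=9):
    # L2-regularized linear regression with strength alpha.
    model = Ridge(alpha=alpha).fit(X_train, t_train)
    rmse = np.sqrt(mean_squared_error(t_test, model.predict(X_test)))
    if best_rmse is None or rmse < best_rmse:
        best_rmse, best_alpha = rmse, alpha

print(best_alpha, best_rmse)
```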

Deadline: Oct 20, 23:59 3 points+5 bonus

This assignment is a competition task. Your goal is to perform linear regression on the data from a rental shop. The train set contains 1000 instances, each consisting of 12 features, both integral and real-valued.

The linear_regression_competition.py
template shows how to load the
linear_regression_competition.train.npz
data available in the repository. Furthermore, it shows how to save a trained
estimator, how to load it, and it demonstrates the `recodex_predict` method which
is called during ReCodEx evaluation.
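The save/load workflow can be sketched like this (toy data stands in for the `.npz` file, which the real template loads via `np.load`; the actual serialization format used by the template may differ from plain `pickle`):

```python
import pickle

import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data standing in for linear_regression_competition.train.npz.
rng = np.random.RandomState(42)
X, t = rng.rand(100, 12), rng.rand(100)
model = LinearRegression().fit(X, t)

# Save the trained estimator during local training...
with open("model.pickle", "wb") as model_file:
    pickle.dump(model, model_file)

# ...and load it again, e.g. when the submission is evaluated.
with open("model.pickle", "rb") as model_file:
    loaded = pickle.load(model_file)

def recodex_predict(data):
    # `data` is a NumPy array with one test instance per row.
    return loaded.predict(data)
```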

The performance of your system is measured using *root mean squared error*
and your goal is to achieve RMSE less than 130. Note that you can use
**any sklearn algorithm** to solve this exercise.

Deadline: Nov 03, 23:59 4 points

Starting with the feature_engineering.py template, learn how to perform basic feature engineering using scikit-learn.
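Typical building blocks are one-hot encoding of integral features, standardization of real-valued features, and polynomial feature expansion; a minimal sketch (the toy data and chosen transformers are illustrative, not the template's actual setup):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, PolynomialFeatures, StandardScaler

# Toy data: the first column is categorical (integral), the second real-valued.
X = np.array([[0, 1.5], [1, -0.5], [2, 3.0], [1, 0.0]])

pipeline = Pipeline([
    # One-hot encode column 0, standardize column 1; densify the result.
    ("preprocess", ColumnTransformer([
        ("onehot", OneHotEncoder(), [0]),
        ("scale", StandardScaler(), [1]),
    ], sparse_threshold=0)),
    # Expand to all degree-2 polynomial combinations of the features.
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])
features = pipeline.fit_transform(X)
print(features.shape)
```

With 3 categories one-hot encoded plus 1 scaled column, the degree-2 expansion of 4 features yields 4 + 10 = 14 columns.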

Deadline: Nov 03, 23:59 5 points

Starting with the linear_regression_sgd.py template, implement minibatch SGD for linear regression. Evaluate it using cross-validation and compare the results to an explicit linear regression solver.
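The minibatch SGD loop itself can be sketched as follows (toy data; the hyperparameters, cross-validation, and comparison to the explicit solver required by the assignment are omitted here):

```python
import numpy as np

# Toy regression data with a bias column of ones appended.
rng = np.random.RandomState(42)
X = np.concatenate([rng.rand(100, 3), np.ones((100, 1))], axis=1)
t = X @ np.array([2.0, -1.0, 0.5, 0.3]) + 0.01 * rng.randn(100)

w = np.zeros(X.shape[1])
batch_size, learning_rate = 10, 0.1
for epoch in range(200):
    # Process the data in a random order, one minibatch at a time.
    permutation = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = permutation[start:start + batch_size]
        # Gradient of the mean squared error on the minibatch.
        gradient = X[batch].T @ (X[batch] @ w - t[batch]) / len(batch)
        w -= learning_rate * gradient

rmse = np.sqrt(np.mean((X @ w - t) ** 2))
print(rmse)
```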

Deadline: Nov 03, 23:59 3 points

Starting with the perceptron.py template, implement the perceptron algorithm.
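The classic perceptron update can be sketched as follows (linearly separable toy data with targets in {-1, +1}; the template's data and interface differ):

```python
import numpy as np

# Linearly separable toy data with targets in {-1, +1}.
rng = np.random.RandomState(42)
X = rng.randn(100, 2)
t = np.sign(X @ np.array([1.0, -2.0]) + 0.5)
X = np.concatenate([X, np.ones((100, 1))], axis=1)  # bias column

w = np.zeros(X.shape[1])
done = False
while not done:
    done = True
    for i in rng.permutation(len(X)):
        # Update the weights on every misclassified example.
        if t[i] * (X[i] @ w) <= 0:
            w += t[i] * X[i]
            done = False

accuracy = np.mean(np.sign(X @ w) == t)
print(accuracy)
```

On separable data the loop is guaranteed to terminate (perceptron convergence theorem), ending with every training example classified correctly.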

Deadline: Nov 03, 23:59 4 points+5 bonus

This assignment is a competition task. Your goal is to perform binary classification on the data from contract approval. The train set contains 500 instances, each consisting of 15 features, both integral and real-valued.

The rest of the details will appear later.

In the competitions, your goal is to train a **model** and then **predict**
target values on the **test set** available only in ReCodEx.

When submitting a competition solution to ReCodEx, you can include any
number of files of any kind. However, among these should be exactly one
Python source (`.py`) containing a top-level method `recodex_predict`.
This method is called with the test input data in a NumPy array
and should return the predictions again as a NumPy array.

If your submission contains trained model(s), you should also submit
the **Python source you used to train** them.
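The minimal contract a submission has to satisfy looks like this (the placeholder body is illustrative; a real submission would load and apply a trained model instead):

```python
import numpy as np

def recodex_predict(data):
    # `data` is a NumPy array with one test instance per row.
    # A real submission would typically load a trained model here
    # and return its predictions; this placeholder just predicts
    # a constant for every instance.
    return np.zeros(len(data))
```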

ReCodEx starts the evaluation by importing **all** Python sources and checking
if they export a `recodex_predict` method. Then it executes it, evaluates the
prediction, and returns one of the following results:

- **Failed, 0%**: Either there was not exactly one Python source with `recodex_predict`, or it crashed during prediction, or it generated an output with incorrect size.
- **OK, 1-99%**: Output of correct size was returned by `recodex_predict`, but it did not achieve the required performance. The percentage returned is either `achieved/required` or `required/achieved` (depending on whether the goal is to get over/under the requirement). **No points** are awarded.
- **OK, 100%**: The required level of performance was reached; however, the exact performance is unknown.

After the deadline, the exact performance becomes visible for all submissions.

Note that in any case, the exit code of your solution is reported as 0.

Everyone surpassing the required performance immediately gets the **regular points**
for the assignment.

Furthermore, after the deadline, the **latest submission** of every user
**passing the required baseline** is collected, and **bonus points** are awarded
depending on relative ordering of performance of the selected submissions.

- You can use **only the given annotated data** for training. However, you can use any unannotated or manually created data. **Do not use test set** annotations in any way.
- You can generally use **any algorithm** present in `sklearn`, `numpy` or `scipy`, or anything you implement yourself. Do **not** use deep network frameworks like TensorFlow or PyTorch.