Data Intensive Computing – Summer 2016

In summer 2016, the course is held on Mondays at 15:40 in SU1.

Pass Conditions

To pass the course, you need to obtain at least 30 points. Points are awarded for

  • home assignments
  • talk (contact me if you are interested)
  • optional project (depending on complexity, up to 30 points can be awarded)

Content of Lectures

Date     Lecture Content

Feb 29   Introduction to MapReduce framework, discussing previous experiences and interesting advanced topics.
Mar 07   Finished introduction to MapReduce framework. Introduction to SGE – normal and array jobs, dependency between jobs.
Mar 14   Solving tasks from Handout 1.
Mar 21   Solving tasks from Handout 1.
Apr 04   Introduction to Apache Spark.
Apr 11   Solving tasks from Handout 2.
Apr 18   Still solving tasks from Handout 2.
Apr 25   Introducing DataFrame API.
May 02   Graph calculations using Spark API.
May 09   Graph calculations using Spark API, machine learning using pyspark.ml API.

Home Assignments

For each task, the number of points and the due date are given in parentheses, followed by the task description.

unique_words (3 points, due Apr 04 15:39)

Create a list of unique words used in the articles. Convert them to lowercase to ignore case.

Because the article data is not tokenized, use the provided /dlrc_share/data/wiki/tokenizer/{cs,en}_tokenizer, which reads untokenized UTF-8 text from standard input and produces tokenized UTF-8 text on standard output. It preserves line breaks and separates tokens on each line by exactly one space.
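
The following is a minimal single-machine Python sketch of the required logic (it is not a solution in the course's MapReduce framework; the tokenizer path comes from the assignment, while the function and file names are made up for illustration):

  import subprocess

  TOKENIZER = "/dlrc_share/data/wiki/tokenizer/cs_tokenizer"  # path from the assignment

  def unique_words(input_path, output_path):
      # Pipe the untokenized articles through the provided tokenizer.
      with open(input_path, "rb") as f:
          tokenized = subprocess.run([TOKENIZER], stdin=f,
                                     stdout=subprocess.PIPE, check=True).stdout
      # Tokens on each line are separated by exactly one space.
      words = set()
      for line in tokenized.decode("utf-8").splitlines():
          for token in line.split(" "):
              if token:
                  words.add(token.lower())
      with open(output_path, "w", encoding="utf-8") as out:
          for word in sorted(words):
              print(word, file=out)

  # Hypothetical usage: unique_words("articles.txt", "unique_words.txt")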

inverted_index (4 points, due Apr 04 15:39)

Compute an inverted index – for every lowercased word from the articles, compute ascending (article id, ascending positions of occurrences as word indices) pairs. In order to do so, number the articles using consecutive integers and also produce a list of articles representing this mapping (the article on line i is the article with id i; you can use the example articles.sh).

The output should be a file with the list of articles ordered by article id, and a file with one word on a line in this format:

word \t article_id \t space separated occurrences \t article_id \t space separated occurrences ...

Once again, use the provided tokenizer.
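
A single-machine Python sketch of the index construction (again not using the course's MapReduce framework; it assumes the articles are already available as (article id, tokenized text) pairs, which is an illustrative assumption):

  from collections import defaultdict

  def inverted_index(articles):
      # articles: iterable of (article_id, tokenized_text) pairs, where the text
      # has already been run through the provided tokenizer (assumption).
      index = defaultdict(list)              # word -> [(article_id, [positions]), ...]
      for article_id, text in articles:
          positions = defaultdict(list)
          for i, token in enumerate(text.split()):
              positions[token.lower()].append(i)
          for word, occurrences in positions.items():
              index[word].append((article_id, occurrences))
      # One word per line in the required tab-separated format.
      lines = []
      for word in sorted(index):
          fields = [word]
          for article_id, occurrences in sorted(index[word]):
              fields.append(str(article_id))
              fields.append(" ".join(str(p) for p in occurrences))
          lines.append("\t".join(fields))
      return lines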

wordsim_index (4 points, due Apr 04 15:39)

In order to implement word similarity search, compute for each lemma with at least three occurrences all contexts in which it occurs, including their number of occurrences.

Given N (either 1, 2 or 3) on the command line, the context of a lemma occurrence is N lemmas preceding it and N lemmas following it (ignore sentence boundaries).

To compute the lemmas for a given article, use the provided /dlrc_share/data/wiki/lemmatizer/{cs,en}_lemmatizer, which works just like the tokenizer – it reads untokenized UTF-8 text from standard input and produces tokenized and lemmatized UTF-8 text on standard output, each lemma separated by exactly one space.

The output should be a file with one lemma on a line in the following format:

lemma \t context \t counts \t context \t counts ...
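
A single-machine Python sketch of the context extraction (representing each context as the space-joined neighbouring lemmas is an assumption of this sketch, not a requirement of the assignment):

  from collections import Counter, defaultdict

  def wordsim_index(lemmatized_lines, n):
      # lemmatized_lines: lines already produced by the provided lemmatizer;
      # n: context width (1, 2 or 3).
      lemmas = [l for line in lemmatized_lines for l in line.split(" ") if l]
      total = Counter(lemmas)
      contexts = defaultdict(Counter)        # lemma -> Counter of contexts
      for i, lemma in enumerate(lemmas):
          if total[lemma] < 3:
              continue                       # keep lemmas with at least three occurrences
          window = lemmas[max(0, i - n):i] + lemmas[i + 1:i + 1 + n]
          contexts[lemma][" ".join(window)] += 1
      return contexts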

wordsim_find (3 points, due Apr 04 15:39)

Let S be a natural number given on the command line. Using the index created in wordsim_index, find for each lemma the S most similar lemmas. The similarity of two lemmas A and B is computed using cosine similarity as (C_A · C_B) / (|C_A| · |C_B|), where C_L is the vector of context occurrence counts of lemma L.

The output should be a file with one lemma on a line in the following format:

lemma \t most similar lemma \t cosine similarity \t 2. most similar lemma \t cosine similarity ...
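
A Python sketch of the similarity computation on top of the {lemma: context counts} structure from the previous sketch (brute force over all lemma pairs, for illustration only):

  import math

  def cosine(ca, cb):
      # ca, cb: {context: count} dictionaries, i.e. sparse context vectors.
      dot = sum(count * cb.get(context, 0) for context, count in ca.items())
      norm_a = math.sqrt(sum(c * c for c in ca.values()))
      norm_b = math.sqrt(sum(c * c for c in cb.values()))
      return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

  def most_similar(contexts, s):
      # contexts: {lemma: {context: count}}; returns the s most similar lemmas
      # (with their similarities) for every lemma.
      result = {}
      for lemma, vector in contexts.items():
          scored = [(other, cosine(vector, other_vector))
                    for other, other_vector in contexts.items() if other != lemma]
          scored.sort(key=lambda pair: pair[1], reverse=True)
          result[lemma] = scored[:s]
      return result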

spark_unique_words (2 points, due May 02 15:39)

Create a list of unique words used in the Wiki articles using Spark. Convert them to lowercase to ignore case.

Use HDFS /data/wiki/{cs,en} as input and either /dlrc_share/data/wiki/tokenizer/{cs,en}_tokenizer or nltk.tokenize.wordpunct_tokenize as a tokenizer.
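
A possible PySpark sketch using the nltk tokenizer mentioned above (the output path is made up; the exact line format of the wiki dump is not assumed, every line is simply tokenized):

  from nltk.tokenize import wordpunct_tokenize
  from pyspark import SparkContext

  sc = SparkContext(appName="spark_unique_words")

  (sc.textFile("/data/wiki/cs")               # HDFS input named in the assignment
     .flatMap(wordpunct_tokenize)             # tokenize every line
     .map(lambda word: word.lower())          # ignore case
     .distinct()
     .saveAsTextFile("unique_words_output"))  # illustrative output path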

anagrams (2 points, due May 02 15:39)

Two words are anagrams if one is a letter permutation of the other (ignoring case).

For a given input, find all anagram classes that contain at least A words. Output each anagram class on a separate line.

Use HDFS /data/wiki/{cs,en} as input and either /dlrc_share/data/wiki/tokenizer/{cs,en}_tokenizer or nltk.tokenize.wordpunct_tokenize as a tokenizer.
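
A PySpark sketch of the anagram grouping: the class key of a word is its lowercased letters in sorted order, so all mutual anagrams share the same key (A and the paths are placeholders; sc is an existing SparkContext):

  from nltk.tokenize import wordpunct_tokenize

  A = 5   # minimum class size; in a real solution this would come from the command line

  def anagram_classes(sc, input_path, output_path):
      (sc.textFile(input_path)
         .flatMap(wordpunct_tokenize)
         .map(lambda word: word.lower())
         .distinct()
         .map(lambda word: ("".join(sorted(word)), word))   # key = sorted letters
         .groupByKey()
         .map(lambda key_words: sorted(key_words[1]))
         .filter(lambda words: len(words) >= A)
         .map(lambda words: " ".join(words))                # one class per line
         .saveAsTextFile(output_path))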

sort (3 points, due May 02 15:39)

You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format:

  • /dlrc_share/data/numbers-txt/numbers-small: 3MB
  • /dlrc_share/data/numbers-txt/numbers-medium: 184MB
  • /dlrc_share/data/numbers-txt/numbers-large: 916MB

You can assume that the integers are uniformly distributed.

Your task is to sort these data, comparing the keys numerically and not lexicographically. The lines in the output must be the same as in the input, only in a different order.

Your solution should work for TBs of data. For that reason, you must use multiple machines. If your job is executed using m machines, the output consists of m files, which when concatenated would produce sorted (key, value) pairs. In other words, each of the output files contains sorted (integer, data) pairs and all keys in one file are either smaller or larger than the keys in any other file. Your solution should work for any specified number of machines.

Obviously, do not use the sort or sortByKey methods in your solution.
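
A PySpark sketch of one common approach: because the keys are uniformly distributed over the 31-bit range, a key can be mapped directly to one of M ranges, and each range is then sorted locally. The line format "<integer>\t<data>" and the reading that only Spark's distributed sort methods are off-limits (a local per-partition sort is used below) are interpretations made by this sketch, not part of the assignment:

  M = 8                 # number of machines / output files (example value)
  MAX_KEY = 2 ** 31

  def parse(line):
      # Assumed line format: "<31-bit integer>\t<string data>".
      key, _, value = line.partition("\t")
      return int(key), value

  def uniform_sort(sc, input_path, output_path):
      pairs = sc.textFile(input_path).map(parse)
      # Uniform keys: key * M // MAX_KEY spreads them evenly over M partitions,
      # and every key in partition i is smaller than every key in partition i + 1.
      partitioned = pairs.partitionBy(M, lambda key: key * M // MAX_KEY)
      # Sort each partition locally; concatenating the output files then yields
      # a globally sorted sequence.
      (partitioned
         .mapPartitions(lambda it: iter(sorted(it)), preservesPartitioning=True)
         .map(lambda kv: "{}\t{}".format(kv[0], kv[1]))
         .saveAsTextFile(output_path))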

nonuniform_sort (4 points, due May 02 15:39)

Improve the sort to handle nonuniform data. You can use the following exponentially distributed data:

  • /dlrc_share/data/numbers-txt/nonuniform-small: 3MB
  • /dlrc_share/data/numbers-txt/nonuniform-medium: 160MB
  • /dlrc_share/data/numbers-txt/nonuniform-large: 797MB

Assume we want to produce m output files. One of the solutions is the following (a sketch in code follows the list):

  • Go through the data and sample only a small fraction of the keys.
  • Find the best m-1 separators using the sampled data.
  • Run the second pass using the computed separators.
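
A PySpark sketch of this sampling approach (the line format is assumed to be "<integer>\t<data>" as in the previous sketch; the sampling fraction is a placeholder):

  import bisect

  def parse(line):
      key, _, value = line.partition("\t")
      return int(key), value

  def nonuniform_sort(sc, input_path, output_path, m, fraction=0.001):
      pairs = sc.textFile(input_path).map(parse)
      # Pass 1: sample a small fraction of the keys and take the (i/m)-quantiles
      # of the sample as the m-1 separators.
      sample = sorted(pairs.map(lambda kv: kv[0]).sample(False, fraction).collect())
      separators = [sample[len(sample) * i // m] for i in range(1, m)]
      # Pass 2: partition by the computed separators and sort each partition locally.
      (pairs.partitionBy(m, lambda key: bisect.bisect_right(separators, key))
            .mapPartitions(lambda it: iter(sorted(it)), preservesPartitioning=True)
            .map(lambda kv: "{}\t{}".format(kv[0], kv[1]))
            .saveAsTextFile(output_path))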

spark_inverted_index (2 points, due May 02 15:39)

Compute an inverted index in Spark – for every lowercased word from the articles, compute (article name, ascending positions of occurrences as word indices) pairs.

The output should be a file with one word on a line in the following format:

word \t article name \t space separated occurrences \t article name \t space separated occurrences ...

You will get 2 additional points if the articles are numbered using consecutive integers. In that case, the output is ascending (article id, ascending positions of occurrences as word indices) pairs, together with a file containing the list of articles representing this mapping (the article on line i is the article with id i).

Use HDFS /data/wiki/{cs,en} as input and either /dlrc_share/data/wiki/tokenizer/{cs,en}_tokenizer or nltk.tokenize.wordpunct_tokenize as a tokenizer.
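
A PySpark sketch of the basic variant (it assumes, purely for illustration, that every input line has the form "<article name>\t<article text>"; the real layout of the wiki dump may differ):

  from nltk.tokenize import wordpunct_tokenize

  def spark_inverted_index(sc, input_path, output_path):
      articles = (sc.textFile(input_path)
                    .map(lambda line: line.split("\t", 1))
                    .filter(lambda parts: len(parts) == 2))

      def word_positions(article):
          name, text = article
          for position, token in enumerate(wordpunct_tokenize(text)):
              yield (token.lower(), (name, position))

      def format_word(word_occurrences):
          word, occurrences = word_occurrences
          by_article = {}
          for name, position in occurrences:
              by_article.setdefault(name, []).append(position)
          fields = [word]
          for name in sorted(by_article):
              fields.append(name)
              fields.append(" ".join(str(p) for p in sorted(by_article[name])))
          return "\t".join(fields)

      (articles.flatMap(word_positions)
               .groupByKey()
               .map(format_word)
               .saveAsTextFile(output_path))

For the bonus variant, zipWithIndex on the RDD of article names is one way to obtain the consecutive numbering and the mapping file.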

no_references (3 points, due May 02 15:39)

An article A is said to reference article B if it contains B as a token (ignoring case).

Run a Spark job which, for each article B, counts how many references to article B exist in the whole wiki (summing the references within each single article).

Use HDFS /data/wiki/{cs,en} as input and either /dlrc_share/data/wiki/tokenizer/{cs,en}_tokenizer or nltk.tokenize.wordpunct_tokenize as a tokenizer.
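
A PySpark sketch (using the same illustrative "<article name>\t<article text>" line format as above; the set of article names is broadcast so every token occurrence can be checked against it):

  from nltk.tokenize import wordpunct_tokenize

  def count_references(sc, input_path, output_path):
      articles = (sc.textFile(input_path)
                    .map(lambda line: line.split("\t", 1))
                    .filter(lambda parts: len(parts) == 2))
      names = sc.broadcast(set(articles.map(lambda a: a[0].lower()).collect()))

      def references(article):
          _, text = article
          for token in wordpunct_tokenize(text):
              token = token.lower()
              if token in names.value:
                  yield (token, 1)        # repeated references in one article all count

      (articles.flatMap(references)
               .reduceByKey(lambda a, b: a + b)
               .map(lambda kv: "{}\t{}".format(kv[0], kv[1]))
               .saveAsTextFile(output_path))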

spark_wordsim_index (3 points, due May 02 15:39)

In order to implement word similarity search, compute for each form with at least three occurrences all contexts in which it occurs, including their number of occurrences. List the contexts in ascending order.

Given N (either 1, 2 or 3) on the command line, the context of a form occurrence is N forms preceding it and N forms following it (ignore sentence boundaries, use empty words when article boundaries are reached).

The output should be a file with one form on a line in the following format:

form \t context \t counts \t context \t counts ...

Use HDFS /data/wiki/{cs,en} as input and either /dlrc_share/data/wiki/tokenizer/{cs,en}_tokenizer or nltk.tokenize.wordpunct_tokenize as a tokenizer.
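
A PySpark sketch that pads every article with empty words at its boundaries and counts (form, context) pairs with reduceByKey; treating each input line as one article and representing a context as a space-joined string are assumptions of this sketch:

  from nltk.tokenize import wordpunct_tokenize

  def spark_wordsim_index(sc, input_path, output_path, n):
      def contexts_in_article(line):
          forms = [w.lower() for w in wordpunct_tokenize(line)]
          padded = [""] * n + forms + [""] * n     # empty words at article boundaries
          for i, form in enumerate(forms):
              window = padded[i:i + n] + padded[i + n + 1:i + 2 * n + 1]
              yield ((form, " ".join(window)), 1)

      def format_form(form_contexts):
          form, context_counts = form_contexts
          context_counts = sorted(context_counts)  # contexts in ascending order
          if sum(count for _, count in context_counts) < 3:
              return None                          # fewer than three occurrences
          fields = [form]
          for context, count in context_counts:
              fields.extend([context, str(count)])
          return "\t".join(fields)

      (sc.textFile(input_path)
         .flatMap(contexts_in_article)
         .reduceByKey(lambda a, b: a + b)                   # ((form, context), count)
         .map(lambda kv: (kv[0][0], (kv[0][1], kv[1])))     # (form, (context, count))
         .groupByKey()
         .map(format_form)
         .filter(lambda line: line is not None)
         .saveAsTextFile(output_path))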

spark_wordsim_find (3 points, due May 02 15:39)

Let S be a natural number given on the command line. Using the index created in spark_wordsim_index, find for each form the S most similar forms. List only forms with non-zero similarity. The similarity of two forms A and B is computed using cosine similarity as (C_A · C_B) / (|C_A| · |C_B|), where C_F is the vector of context occurrence counts of form F.

The output should be a file with one form on a line in the following format:

form \t most similar form \t cosine similarity \t 2. most similar form \t cosine similarity ...

kmeans (6 points, due May 02 15:39)

Implement the K-means clustering algorithm as described at http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm.

The user specifies the number of iterations and the program runs the specified number of K-means clustering algorithm iterations.

You can use the following training data. Each line contains space-separated coordinates of one point. The coordinates in one input naturally have the same dimension.

Path in /dlrc_share       Points   Dimension   Clusters
/data/points-txt/small     10000          50         50
/data/points-txt/medium   100000         100        100
/data/points-txt/large    500000         200        200

You will get 2 additional points if the algorithm stops when none of the centroid positions changes by more than a given ε.
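
A PySpark sketch of the standard (Lloyd) iteration with the ε stopping bonus; initial centroids are a random sample of the points, and NumPy is assumed to be available on the workers:

  import numpy as np

  def kmeans(sc, input_path, k, iterations, epsilon=1e-4):
      points = (sc.textFile(input_path)
                  .map(lambda line: np.array([float(x) for x in line.split()]))
                  .cache())
      centroids = points.takeSample(False, k, seed=1)

      for _ in range(iterations):
          centroids_bc = sc.broadcast(centroids)

          def closest(point):
              # Assignment step: index of the nearest centroid.
              return int(np.argmin([np.sum((point - c) ** 2) for c in centroids_bc.value]))

          # Update step: average the points assigned to each centroid.
          sums = (points.map(lambda p: (closest(p), (p, 1)))
                        .reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
                        .collectAsMap())
          new_centroids = [sums[i][0] / sums[i][1] if i in sums else centroids[i]
                           for i in range(k)]
          shift = max(float(np.linalg.norm(new_centroids[i] - centroids[i]))
                      for i in range(k))
          centroids = new_centroids
          if shift <= epsilon:     # bonus: stop when no centroid moves by more than ε
              break
      return centroids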

4 points, due May 30 15:39

For two given Wikipedia pages, find out the shortest path of page links between them, if it exists.
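
A PySpark sketch of a breadth-first search over the link graph; building the (page, linked page) edge RDD from the wiki dump is not shown, so edges is an assumed input:

  def shortest_path(sc, edges, source, target):
      # edges: RDD of (page, linked page) pairs; returns a shortest list of pages
      # from source to target, or None if target is unreachable.
      links = edges.distinct().groupByKey().mapValues(list).cache()
      paths = sc.parallelize([(source, [source])])    # best known path to each page
      while True:
          hit = paths.lookup(target)
          if hit:
              return hit[0]
          # Extend every known path by one link and keep one shortest path per page.
          longer = (paths.join(links)
                         .flatMap(lambda kv: [(nxt, kv[1][0] + [nxt]) for nxt in kv[1][1]]))
          new_paths = (paths.union(longer)
                            .reduceByKey(lambda a, b: a if len(a) <= len(b) else b)
                            .cache())
          if new_paths.count() == paths.count():
              return None                             # nothing new was reached
          paths = new_paths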

transitive_closure (4 points, due May 30 15:39)

Compute transitive closure of the Wikipedia link graph. In other words, compute for each page all pages reachable from it.
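
A PySpark sketch using the classic iterative join: repeatedly add all two-step paths to the reachability relation until it stops growing (edges is again an assumed (page, linked page) RDD):

  def transitive_closure(sc, edges):
      # Returns an RDD of (page, reachable page) pairs.
      closure = edges.distinct().cache()
      while True:
          # (b, a) joined with (b, c) means a reaches c through b.
          swapped = closure.map(lambda edge: (edge[1], edge[0]))
          new_edges = swapped.join(closure).map(lambda kv: (kv[1][0], kv[1][1]))
          new_closure = closure.union(new_edges).distinct().cache()
          if new_closure.count() == closure.count():
              return new_closure
          closure = new_closure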

page_rank (5 points, due May 30 15:39)

Compute the PageRank of Wikipedia pages and output the pages sorted by rank. Use a given number of iterations of the iterative algorithm (http://en.wikipedia.org/wiki/PageRank#Iterative) with damping factor 0.85.
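
A PySpark sketch of the iterative computation, close to the standard Spark PageRank example (it ignores dangling pages and assumes the edge RDD is given):

  def page_rank(sc, edges, iterations, d=0.85):
      # edges: RDD of (page, linked page) pairs; returns (page, rank) pairs
      # sorted by rank in descending order.
      links = edges.distinct().groupByKey().mapValues(list).cache()
      ranks = links.mapValues(lambda _: 1.0)
      for _ in range(iterations):
          contributions = (links.join(ranks)
                                .flatMap(lambda kv: [(target, kv[1][1] / len(kv[1][0]))
                                                     for target in kv[1][0]]))
          ranks = (contributions.reduceByKey(lambda a, b: a + b)
                                .mapValues(lambda total: (1 - d) + d * total))
      return ranks.sortBy(lambda kv: kv[1], ascending=False)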

english_tagging (6 points, due Jun 06 15:39)

Implement a machine learning pipeline which trains to perform POS tagging using the train.json file and computes accuracy on the test.json dataset. Try using a MultilayerPerceptronClassifier with either no hidden layer or one hidden layer. Note that the current state-of-the-art accuracy on the test set is about 97.7%.
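
A pyspark.ml sketch of such a pipeline. The column names ("context" for the input text of one token, "tag" for the gold label) and the feature size are assumptions, since the schema of train.json/test.json is not described here:

  from pyspark.ml import Pipeline
  from pyspark.ml.classification import MultilayerPerceptronClassifier
  from pyspark.ml.feature import HashingTF, StringIndexer, Tokenizer
  from pyspark.sql.functions import col

  def train_and_evaluate(sqlContext, num_features=1000):
      train = sqlContext.read.json("train.json")
      test = sqlContext.read.json("test.json")

      # Fit the label indexer first so the output layer size matches the tag set;
      # tags unseen in training are skipped during evaluation.
      indexer = StringIndexer(inputCol="tag", outputCol="label",
                              handleInvalid="skip").fit(train)
      num_tags = len(indexer.labels)

      pipeline = Pipeline(stages=[
          Tokenizer(inputCol="context", outputCol="words"),
          HashingTF(inputCol="words", outputCol="features", numFeatures=num_features),
          indexer,
          # No hidden layer: [input, output]; with one hidden layer this would be
          # e.g. [num_features, 100, num_tags].
          MultilayerPerceptronClassifier(layers=[num_features, num_tags],
                                         maxIter=100, seed=1),
      ])
      model = pipeline.fit(train)
      predictions = model.transform(test)
      correct = predictions.filter(col("label") == col("prediction")).count()
      return correct / float(predictions.count())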