Data Intensive Computing – Summer 2014

In summer 2014, the course is held on Mondays at 17:20 in SU2.

Pass Conditions

To pass the course, you need to obtain at least 30 points. Points are awarded for

  • home assignments
  • talk (contact me if you are interested)
  • optional project (depending on complexity, up to 30 points can be awarded)

Content of Lectures

Date and lecture content
Feb 24

Introduction to MapReduce framework, discussing previous experiences and interesting advanced topics, setting up accounts.

Mar 03

Introduction to SGE – normal and array jobs, dependencies between jobs. Solving simple tasks: given Wikipedia articles (consider Czech with 450k articles of size 2GB, and English with 4.5M articles of size 44GB), find a list of unique words used in the articles. Also assign unique IDs to the articles and then compute an inverted index – for every word, create an ascending sequence of (document, occurrences within the document as ascending word indices) pairs.

Mar 10

Setting up accounts on the SGE cluster, individual work on the tasks listed in the handout.

Mar 17

Continuing to work on the tasks using the SGE cluster.

Mar 24

Starting with Hadoop – the simple Python API, working with the web interface and HDFS.

Mar 31

Getting more Hadoop experience – solving and implementing example tasks.

Apr 07

Becoming Dumbo professionals.

Apr 14

Becoming Dumbo masters.

Apr 21

Easter Monday!

Apr 28

Programming Hadoop in Java.

May 05

No more Hadoop :-)

Introduction to Spark, both in Python and Scala.

May 12

Becoming Spark maniacs.

  • Handout 10 (nearly identical to Handout 9 for Hadoop, but the reference is sorted and the task description slightly modified)
May 19

Last lecture :-(

MLlib and machine learning in Spark; new features in Spark 1.0.0.

Home Assignments

Task (points, due date) and task description

unique_words (3 points, due Sep 30 23:59)

Create a list of unique words used in the articles. Convert them to lowercase to ignore case.

Because the article data is not tokenized, use the provided /dlrc_share/data/wiki/tokenizer/{cs,en}_tokenizer, which reads untokenized UTF-8 text from standard input and produces tokenized UTF-8 text on standard output. It preserves line breaks and separates tokens on each line by exactly one space.
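
For illustration, a minimal local Python sketch of this pipeline (the real solution should of course run on the cluster; the input path is a hypothetical example):

    # Sketch: pipe raw article text through the provided tokenizer and collect
    # the set of lowercased tokens.
    import subprocess

    TOKENIZER = "/dlrc_share/data/wiki/tokenizer/cs_tokenizer"

    def unique_words(input_path):
        unique = set()
        with open(input_path, "rb") as raw:
            tokenizer = subprocess.Popen([TOKENIZER], stdin=raw, stdout=subprocess.PIPE)
            for line in tokenizer.stdout:
                for token in line.decode("utf-8").split(" "):
                    token = token.strip().lower()
                    if token:
                        unique.add(token)
            tokenizer.wait()
        return sorted(unique)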

inverted_index (4 points, due Sep 30 23:59)

Compute an inverted index – for every lowercased word from the articles, compute ascending (article id, ascending positions of occurrences as word indices) pairs. In order to do so, number the articles using consecutive integers and also produce a list of articles representing this mapping (the article on line i is the article with id i).

The output should be a file with the list of articles ordered by article id, and a file with one word per line in the following format:

word \t article_id \t space separated occurrences \t article_id \t space separated occurrences ...

Once again, use the provided tokenizer.
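
As a non-distributed illustration of the per-article step (the helper name is hypothetical):

    # Sketch: ascending positions of every lowercased word within one tokenized article.
    from collections import defaultdict

    def word_positions(tokenized_article):
        positions = defaultdict(list)
        for index, word in enumerate(tokenized_article.lower().split(" ")):
            positions[word].append(index)
        return positions

The distributed job then merges these per-article results into word – (article id, positions) sequences with ascending article ids.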

wordsim_index (4 points, due Sep 30 23:59)

In order to implement word similarity search, compute for each lemma with at least three occurrences all contexts in which it occurs, including their number of occurrences. List the contexts in ascending order.

Given N (either 1, 2, 3 or 4), the context of a lemma occurrence is N lemmas preceding this occurrence and N lemmas following this occurrence (ignore sentence boundaries, use empty words when article boundaries are reached).

To compute the lemmas for a given article, use the provided /dlrc_share/data/wiki/lemmatizer/{cs,en}_lemmatizer, which works just like the tokenizer – it reads untokenized UTF-8 text from standard input and produces tokenized and lemmatized UTF-8 text on standard output, each lemma separated by exactly one space.

The output should be a file with one lemma on a line in the following format:

lemma \t context \t counts \t context \t counts ...
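
The context extraction itself is straightforward; a minimal plain-Python sketch for one lemmatized article, with N and the list of lemmas given (the function name is illustrative):

    # Sketch: count (lemma, context) occurrences in one article, where the context
    # is the N lemmas before and the N lemmas after an occurrence, padded with
    # empty words at article boundaries.
    from collections import Counter

    def context_counts(lemmas, n):
        padded = [""] * n + lemmas + [""] * n
        counts = Counter()
        for i, lemma in enumerate(lemmas):
            j = i + n  # position of this occurrence in the padded list
            context = tuple(padded[j - n:j] + padded[j + 1:j + n + 1])
            counts[(lemma, context)] += 1
        return counts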

wordsim_find (4 points, due Sep 30 23:59)

Let S be a given natural number. Using the index created in wordsim_index, find for each lemma the S most similar lemmas. The similarity of two lemmas is computed using cosine similarity as (C_A · C_B) / (|C_A| · |C_B|), where C_L is the vector of context occurrence counts of lemma L.

The output should be a file with one lemma on a line in the following format:

lemma \t most similar lemma \t cosine similarity \t second most similar lemma \t cosine similarity ...
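
A sketch of the similarity itself, with each vector C_L represented as a dictionary mapping a context to its number of occurrences (the representation is an assumption):

    # Sketch: cosine similarity of two sparse context-count vectors.
    import math

    def cosine(c_a, c_b):
        dot = sum(count * c_b.get(context, 0) for context, count in c_a.items())
        norm_a = math.sqrt(sum(c * c for c in c_a.values()))
        norm_b = math.sqrt(sum(c * c for c in c_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0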

dumbo_unique_words (2 points, due Sep 30 23:59)

Create a list of unique words used in the articles using Dumbo. Convert them to lowercase to ignore case.

Use the nltk.tokenize.wordpunct_tokenize as a tokenizer.
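
A minimal Dumbo sketch, assuming the job receives (article name, article text) pairs (adjust the mapper if your input format differs):

    # Sketch: emit every lowercased token, deduplicate in the reducer.
    import dumbo
    from nltk.tokenize import wordpunct_tokenize

    def mapper(name, text):
        for word in wordpunct_tokenize(text):
            yield word.lower(), 1

    def reducer(word, counts):
        yield word, ""  # the key alone is the unique word

    if __name__ == "__main__":
        dumbo.run(mapper, reducer)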

article_initials (2 points, due Sep 30 23:59)

Run a Dumbo job which uses Hadoop counters to count the number of articles according to their first letter, ignoring case and merging all non-Czech initials.
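
Dumbo runs on top of Hadoop Streaming, so one way to increment a counter is the standard streaming protocol of writing reporter:counter:... lines to standard error. A sketch under the same (article name, article text) input assumption as above; the letter set and the counter group name are illustrative:

    # Sketch: bump a Hadoop counter per (merged) initial letter of each article name.
    import sys
    import dumbo

    CZECH_LETTERS = set(u"aábcčdďeéěfghiíjklmnňoópqrřsštťuúůvwxyýzž")

    def mapper(name, text):
        # 'name' is assumed to arrive as Unicode text; on Python 2 the stderr
        # line may need explicit encoding to UTF-8.
        initial = name[:1].lower()
        if initial not in CZECH_LETTERS:
            initial = "other"  # merge all non-Czech initials
        # Hadoop Streaming counter protocol: the framework parses these stderr lines.
        sys.stderr.write("reporter:counter:initials,%s,1\n" % initial)
        yield initial, 1  # regular output is not strictly required, but keeps the job simple

    def reducer(initial, counts):
        yield initial, sum(counts)

    if __name__ == "__main__":
        dumbo.run(mapper, reducer)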

dumbo_inverted_index (2 points, due Sep 30 23:59)

Compute an inverted index in Dumbo – for every lowercased word from the articles, compute (article name, ascending positions of occurrences as word indices) pairs.

Use the nltk.tokenize.wordpunct_tokenize as a tokenizer.

The output should be a file with one word on a line in the following format:

word \t article name \t space separated occurrences \t article name \t space separated occurrences ...

You will get 2 additional points if the articles are numbered using consecutive integers. In that case, the output consists of ascending (article id, ascending positions of occurrences as word indices) pairs, together with a file containing the list of articles representing this mapping (the article on line i is the article with id i).

no_references (3 points, due Sep 30 23:59)

An article A is said to reference article B if it contains B as a token (ignoring case).

Run a Dumbo job which for each article counts how many references to it there exist in the whole wiki (summing all references in a single article).

You will get one extra point if the result is sorted by the number of references (you are allowed to use 1 reducer in the sorting phase).

Use the nltk.tokenize.wordpunct_tokenize as a tokenizer.

dumbo_wordsim_index (4 points, due Sep 30 23:59)

In order to implement word similarity search, compute for each form with at least three occurrences all contexts in which it occurs, including their number of occurrences. List the contexts in ascending order.

Given N (either 1, 2, 3 or 4), the context of a form occurrence is N forms preceding this occurrence and N forms following this occurrence (ignore sentence boundaries, use empty words when article boundaries are reached).

Use the nltk.tokenize.wordpunct_tokenize as a tokenizer.

The output should be a file with one form on a line in the following format:

form \t context \t counts \t context \t counts ...

dumbo_wordsim_find (4 points, due Sep 30 23:59)

Let S be a given natural number. Using the index created in dumbo_wordsim_index, find for each form the S most similar forms. The similarity of two forms is computed using cosine similarity as (C_A · C_B) / (|C_A| · |C_B|), where C_F is the vector of context occurrence counts of form F.

The output should be a file with one form on a line in the following format:

form \t most similar form \t cosine similarity \t second most similar form \t cosine similarity ...

dumbo_anagrams (2 points, due Sep 30 23:59)

Two words are anagrams if one is a letter permutation of the other (ignoring case).

For a given input, find all anagram classes that contain at least A words. Output each anagram class on a separate line.

Use the wordpunct_tokenize as a tokenizer.
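
A minimal Dumbo sketch (again assuming (article name, article text) input pairs; the threshold A is a hypothetical constant here):

    # Sketch: the sorted letters of a word identify its anagram class.
    import dumbo
    from nltk.tokenize import wordpunct_tokenize

    A = 2  # minimal size of an anagram class to output

    def mapper(name, text):
        for word in wordpunct_tokenize(text):
            word = word.lower()
            yield "".join(sorted(word)), word

    def reducer(signature, words):
        anagram_class = sorted(set(words))
        if len(anagram_class) >= A:
            yield " ".join(anagram_class), ""

    if __name__ == "__main__":
        dumbo.run(mapper, reducer)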

dumbo_kmeans (6 points, due Sep 30 23:59)

Implement the K-means clustering algorithm as described at http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm.

The user specifies the number of iterations and the program runs the specified number of K-means clustering iterations.

You can use the following training data. The key of each record is the point ID, the value is a list of integer coordinates. All points in one input naturally have the same dimension.

HDFS path                Points    Dimension    Clusters
/data/points-tb/small    10000     50           50
/data/points-tb/medium   100000    100          100
/data/points-tb/large    500000    200          200

You will get 2 additional points if the algorithm stops when none of the centroid positions changes by more than a given ε.
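
For reference, one K-means iteration in plain (non-distributed) Python; in the MapReduce version the nearest-centroid assignment naturally becomes the mapper and the per-centroid averaging the reducer:

    # Sketch of a single K-means iteration: assign each point to its nearest
    # centroid, then move each centroid to the mean of its assigned points.
    def nearest(point, centroids):
        return min(range(len(centroids)),
                   key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

    def kmeans_iteration(points, centroids):
        dimension = len(centroids[0])
        sums = [[0.0] * dimension for _ in centroids]
        counts = [0] * len(centroids)
        for point in points:
            c = nearest(point, centroids)
            counts[c] += 1
            sums[c] = [s + p for s, p in zip(sums[c], point)]
        return [[s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
                for c in range(len(centroids))]

The ε-based stopping criterion then compares each new centroid with the previous one and stops once every change is at most ε.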

java_unique_words (2 points, due Sep 30 23:59)

Create a list of unique words used in the articles using Java Hadoop. Convert them to lowercase to ignore case.

java_anagrams (2 points, due Sep 30 23:59)

Two words are anagrams if one is a letter permutation of the other (ignoring case).

For a given input, find all anagram classes that contain at least A words. Output each anagram class on a separate line.

java_sort (3 points, due Sep 30 23:59)

You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format:

  • /data/numbers-txt/numbers-small: 3MB
  • /data/numbers-txt/numbers-medium: 184MB
  • /data/numbers-txt/numbers-large: 916MB

You can assume that the integers are uniformly distributed.

Your task is to sort these data, comparing the keys numerically and not lexicographically. The lines in the output must be the same as in the input, only in a different order.

Your solution should work for TBs of data. For that reason, you must use multiple reducers. If your job is executed using r reducers, the output consists of r files, which when concatenated would produce sorted (key, value) pairs. In other words, each of the output files contains sorted (integer, data) pairs and all keys in one file are either smaller or larger than all keys in any other file. Your solution should work for any number of reducers specified.

You can use the Sort.java example template.

java_nonuniform_sort (4 points, due Sep 30 23:59)

Improve the java_sort to handle nonuniform data. You can use the following exponentially distributed data:

  • /data/numbers-txt/nonuniform-small: 3MB
  • /data/numbers-txt/nonuniform-medium: 160MB
  • /data/numbers-txt/nonuniform-large: 797MB

Assume we want to produce r output files. One of the solutions is to perform two Hadoop jobs:

  • Go through the data and sample only a small fraction of the keys. As there are not so many of them, they can fit in one reducer.
  • Find best r-1 separators using the sampled data.
  • Run the second pass using the computed separators.

java_inverted_index (2 points, due Sep 30 23:59)

Compute an inverted index in Java Hadoop – for every lowercased word from the articles, compute (article name, ascending positions of occurrences as word indices) pairs.

The output should be a file with one word on a line in the following format:

word \t article name \t space separated occurrences \t article name \t space separated occurrences ...

You will get 2 additional points if the articles are numbered using consecutive integers. In that case, the output consists of ascending (article id, ascending positions of occurrences as word indices) pairs, together with a file containing the list of articles representing this mapping (the article on line i is the article with id i).

java_no_references (3 points, due Sep 30 23:59)

An article A is said to reference article B if it contains B as a token (ignoring case).

Run a Java Hadoop job which for each article B counts how many references to B there exist in the whole wiki (summing all references in a single article).

You will get one extra point if the result is sorted by the number of references (you are allowed to use 1 reducer in the sorting phase).

java_wordsim_index (4 points, due Sep 30 23:59)

In order to implement word similarity search, compute for each form with at least three occurrences all contexts in which it occurs, including their number of occurrences. List the contexts in ascending order.

Given N (either 1, 2, 3 or 4), the context of a form occurrence is N forms preceding this occurrence and N forms following this occurrence (ignore sentence boundaries, use empty words when article boundaries are reached).

The output should be a file with one form on a line in the following format:

form \t context \t counts \t context \t counts ...

java_wordsim_find (4 points, due Sep 30 23:59)

Let S be a given natural number. Using the index created in java_wordsim_index, find for each form the S most similar forms. The similarity of two forms is computed using cosine similarity as (C_A · C_B) / (|C_A| · |C_B|), where C_F is the vector of context occurrence counts of form F.

The output should be a file with one form on a line in the following format:

form \t most similar form \t cosine similarity \t second most similar form \t cosine similarity ...

java_kmeans (6 points, due Sep 30 23:59)

Implement the K-means clustering algorithm as described at http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm.

The user specifies the number of iterations and the program runs the specified number of K-means clustering iterations.

You can use the following training data. Each line contains the space-separated coordinates of one point. All points in one input naturally have the same dimension.

HDFS path                 Points    Dimension    Clusters
/data/points-txt/small    10000     50           50
/data/points-txt/medium   100000    100          100
/data/points-txt/large    500000    200          200

You will get 2 additional points if the algorithm stops when none of the centroid positions changes by more than a given ε.

spark_unique_words (2 points, due Sep 30 23:59)

Create a list of unique words used in the articles using Spark. Convert them to lowercase to ignore case.
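
A PySpark sketch, assuming the articles can be read as a text file and tokenized with nltk's wordpunct_tokenize as in the Dumbo tasks (both the input path and the tokenizer choice are assumptions):

    # Sketch: unique lowercased words with Spark's Python API.
    from pyspark import SparkContext
    from nltk.tokenize import wordpunct_tokenize

    sc = SparkContext(appName="spark_unique_words")
    words = (sc.textFile("/data/wiki-txt/cs")      # hypothetical input path
               .flatMap(wordpunct_tokenize)
               .map(lambda word: word.lower())
               .distinct())
    words.saveAsTextFile("spark_unique_words_output")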

spark_anagrams (2 points, due Sep 30 23:59)

Two words are anagrams if one is a letter permutation of the other (ignoring case).

For a given input, find all anagram classes that contain at least A words. Output each anagram class on a separate line.

spark_sort (3 points, due Sep 30 23:59)

You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format:

  • /data/numbers-txt/numbers-small: 3MB
  • /data/numbers-txt/numbers-medium: 184MB
  • /data/numbers-txt/numbers-large: 916MB

You can assume that the integers are uniformly distributed.

Your task is to sort these data, comparing the keys numerically and not lexicographically. The lines in the output must be the same as in the input, only in a different order.

Your solution should work for TBs of data. For that reason, you must use multiple reducers. If your job is executed using r reducers, the output consists of r files, which when concatenated would produce sorted (key, value) pairs. In other words, each of the output files contains sorted (integer, data) pairs and all keys in one file are either smaller or larger than all keys in any other file. Your solution should work for any number of reducers specified.

Obviously, do not use the sortByKey method in your solution.
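
One possible approach is to range-partition by key and sort each partition locally, as in this PySpark sketch (the tab-separated line format, the input path and the number of partitions r are assumptions):

    # Sketch: keys are uniform 31-bit integers, so partition i receives keys from
    # the i-th slice of [0, 2^31); concatenating the partition outputs in order
    # then yields a globally sorted result.
    from pyspark import SparkContext

    sc = SparkContext(appName="spark_sort")
    r = 16               # number of output partitions
    MAX_KEY = 2 ** 31

    def parse(line):
        key, _, _ = line.partition("\t")
        return int(key), line

    pairs = sc.textFile("/data/numbers-txt/numbers-small").map(parse)
    result = (pairs
              .partitionBy(r, lambda key: min(key * r // MAX_KEY, r - 1))
              .mapPartitions(lambda part: (line for key, line in sorted(part))))
    result.saveAsTextFile("spark_sort_output")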

spark_nonuniform_sort (4 points, due Sep 30 23:59)

Improve the spark_sort to handle nonuniform data. You can use the following exponentially distributed data:

  • /data/numbers-txt/nonuniform-small: 3MB
  • /data/numbers-txt/nonuniform-medium: 160MB
  • /data/numbers-txt/nonuniform-large: 797MB

Assume we want to produce r output files. One of the solutions is the following:

  • Go through the data and sample only a small fraction of the keys.
  • Find best r-1 separators using the sampled data.
  • Run the second pass using the computed separators.

Obviously, do not use the sortByKey method in your solution.
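
A PySpark sketch of the sampling approach described above (the tab-separated line format, the sampling fraction, the input path and the number of partitions r are assumptions; an empty or tiny sample would need extra care):

    # Sketch: sample keys, derive r-1 separators, and range-partition with bisect.
    from bisect import bisect_right
    from pyspark import SparkContext

    sc = SparkContext(appName="spark_nonuniform_sort")
    r = 16               # number of output partitions

    def parse(line):
        key, _, _ = line.partition("\t")
        return int(key), line

    pairs = sc.textFile("/data/numbers-txt/nonuniform-small").map(parse)

    sample = sorted(pairs.map(lambda kv: kv[0]).sample(False, 0.01).collect())
    separators = [sample[len(sample) * i // r] for i in range(1, r)]

    result = (pairs
              .partitionBy(r, lambda key: bisect_right(separators, key))
              .mapPartitions(lambda part: (line for key, line in sorted(part))))
    result.saveAsTextFile("spark_nonuniform_sort_output")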

spark_inverted_index (2 points, due Sep 30 23:59)

Compute an inverted index in Spark – for every lowercased word from the articles, compute (article name, ascending positions of occurrences as word indices) pairs.

The output should be a file with one word on a line in the following format:

word \t article name \t space separated occurrences \t article name \t space separated occurrences ...

You will get 2 additional points if the articles are numbered using consecutive integers. In that case, the output consists of ascending (article id, ascending positions of occurrences as word indices) pairs, together with a file containing the list of articles representing this mapping (the article on line i is the article with id i).
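
A PySpark sketch of the basic (article name, positions) variant, assuming an RDD of (article name, article text) pairs and the same wordpunct_tokenize tokenizer as in the Dumbo tasks (the input loading, the toy data and the tokenizer choice are assumptions):

    # Sketch: emit (word, (article, position)) pairs, group them by word, and
    # format each word's occurrences as the required tab-separated line.
    from pyspark import SparkContext
    from nltk.tokenize import wordpunct_tokenize

    sc = SparkContext(appName="spark_inverted_index")
    # 'articles' is assumed to be an RDD of (article name, article text) pairs;
    # the toy data below only stands in for the real input loading.
    articles = sc.parallelize([("Article A", "lorem ipsum lorem"), ("Article B", "ipsum")])

    def index_article(article):
        name, text = article
        for position, word in enumerate(wordpunct_tokenize(text)):
            yield word.lower(), (name, position)

    def format_word(word_occurrences):
        word, occurrences = word_occurrences
        by_article = {}
        for name, position in sorted(occurrences):
            by_article.setdefault(name, []).append(position)
        fields = [word]
        for name in sorted(by_article):
            fields.append(name)
            fields.append(" ".join(str(p) for p in by_article[name]))
        return "\t".join(fields)

    inverted = articles.flatMap(index_article).groupByKey().map(format_word)
    inverted.saveAsTextFile("spark_inverted_index_output")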

spark_no_references (3 points, due Sep 30 23:59)

An article A is said to reference article B if it contains B as a token (ignoring case).

Run a Spark job which for each article B counts how many references to B there exist in the whole wiki (summing all references in a single article).

You will get one extra point if the result is sorted by the number of references.

spark_wordsim_index (4 points, due Sep 30 23:59)

In order to implement word similarity search, compute for each form with at least three occurrences all contexts in which it occurs, including their number of occurrences. List the contexts in ascending order.

Given N (either 1, 2, 3 or 4), the context of a form occurrence is N forms preceding this occurrence and N forms following this occurrence (ignore sentence boundaries, use empty words when article boundaries are reached).

The output should be a file with one form on a line in the following format:

form \t context \t counts \t context \t counts ...

spark_wordsim_find (4 points, due Sep 30 23:59)

Let S be a given natural number. Using the index created in spark_wordsim_index, find for each form the S most similar forms. The similarity of two forms is computed using cosine similarity as (C_A · C_B) / (|C_A| · |C_B|), where C_F is the vector of context occurrence counts of form F.

The output should be a file with one form on a line in the following format:

form \t most similar form \t cosine similarity \t second most similar form \t cosine similarity ...

spark_kmeans (6 points, due Sep 30 23:59)

Implement the K-means clustering algorithm as described at http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm.

The user specifies the number of iterations and the program runs the specified number of K-means clustering iterations.

You can use the following training data. Each line contains the space-separated coordinates of one point. All points in one input naturally have the same dimension.

HDFS path                 Points    Dimension    Clusters
/data/points-txt/small    10000     50           50
/data/points-txt/medium   100000    100          100
/data/points-txt/large    500000    200          200

You will get 2 additional points if the algorithm stops when none of the centroid positions changes by more than a given ε.