## NPFL070 – Language Data Resources

SIS code: NPFL070
Semester: summer
E-credits: 5
Examination: 1/2 MC (KZ)
Instructors: Martin Popel, Zdeněk Žabokrtský

• the classes combine lectures and practicals

### Timespace Coordinates

Wednesday 14:00–15:30 in SU1 + 15:40–16:25 in SW1

### Covid-19-related update:

• in the 2020/2021 winter semester, the course will be taught online via Zoom at the scheduled time
• the course is designed as highly interactive, typically with short slide sequences interleaved with discussions and hands-on exercises
• it would be really great if you could keep your cameras switched on throughout the classes, as we would like to keep the atmosphere lively and friendly
• we kindly ask all students to be ready to share their screens on request, in order to make the working environment as efficient as in an ordinary UNIX lab (or even more efficient, let's see)
• in order to feel fully comfortable with screen sharing, you might consider tiny preparations on your side such as removing some personally sensitive bits from your working environment (e.g. if you use a very private nickname as your username)
• the classes will not be recorded...
• ... but if you miss a class, whether for a serious reason or a silly one, or if you simply don't get something important during a class, then we encourage you to ask us for an individual consultation
• a link to a regular consultation Zoom room will be available to enrolled students in SIS

### Course prerequisites

• only an informal one: NPFL092 – Technology for NLP (unless, of course, you gained your knowledge of bash, Python, XML and the like elsewhere)

### Course passing requirements

To pass the course you will need to submit homework assignments and do a written test. See Grading for more details.

### 1. Introduction

Overview of language data types  Oct 07, 2020

• Course overview
• Prerequisites:
• Make sure you have a valid account for accessing the Czech National Corpus. If not, see the CNC registration page.
• Make sure you understand the topics taught in Technology for NLP, which is an informal prerequisite of this course
• Make sure you have a valid account for accessing computers in the Linux labs. If not, consult the student service in the main lab hall ('rotunda').

### 2. More on corpora and a case study: the Czech National Corpus

• Individual preparation before the class (duration: 45 minutes):

• During the online zoom class:

• we'll explore the most important corpus for Czech (containing many other languages too, though) at www.korpus.cz

• we'll work with the Kontext search tool

• we'll explore the tagset using MorphoDiTa

• first, try to assemble POS tags for all tokens in the following sentence (from today's newspapers): "V úterý byl na nejméně dva týdny poslední den, kdy mohly mít restaurace otevřeno do 20 hodin."
• when finished, compare your solution with that of MorphoDiTa, the on-line morphological analyser of Czech
• once again, POS tagset documentation
• construct Kontext queries for the following examples

1. occurrences of word form "kousnout"; occurrences of all forms of lemma "kousnout"; occurrences of verbs derived from "kousnout" by prefixation (and make frequency list of their lemmas) and occurrences of adjectives derived from such prefixed verbs (and their frequency list too),
2. name 5 verbs whose infinitive does not end with '-t'; find them in the corpus and make their frequency list
3. find adjectives with 'illusory negation', such as "nekalý", "neohrabaný", "nevrlý"...
5. find beginnings of subordinating conditional clauses,
6. find beginnings of subordinating relative clauses,
7. find examples of names of (state) presidents (first name + surname), order them according to frequency of occurrences,
8. find all occurrences of phraseme "mráz někomu běhá po zádech"
9. find nouns that are typical objects of the verb "kousnout" (and the same for subjects)
10. find adverbs with locational or directional meaning (this is a bit tricky)
11. find nouns with temporal meaning
12. find some nouns created by compounding (such as "autoopravna")

### 3. Czech National Corpus cont.

October 21 hw_our_annotation

• Individual preparation before the class (duration: 45 minutes): practise CQL querying in Kontext by constructing at least three queries that find disambiguation mistakes (=incorrectly tagged and/or lemmatized tokens). Choose any corpus available in Kontext, any language. Ideally, each query should identify at least 100 corpus positions, out of which at least two thirds should be really erroneous (a very rough estimate based on a smaller sample is sufficient for this exercise). For instance, for English you can try to detect incorrectly tagged "that", or "der" in German, or "vino" in Spanish. If you choose Czech, then you can use the following list, either directly or just for your inspiration:
1. word form "se" - search for corpus positions, where "se" is tagged as a vocalized preposition, but in fact it is a reflexive pronoun (or vice versa)
2. word form "jí" - conjugated form of the verb "jíst" (to eat) wrongly tagged as a pronoun, or vice versa
3. surnames derived from verbs (such as "Pospíšil") - such surnames might be incorrectly tagged as verbs (or vice versa)
4. forms "a" and "A" - find corpus positions where "a" is wrongly tagged as a coordinating conjunction (it could be the English article, a physical unit, an itemizer, etc.)
5. "weird imperatives" - search for tokens incorrectly tagged as imperatives (such as "leč", which is more likely to be a conjunction)
6. search for errors caused by homonymy between some verbs and adjectives (e.g., word form "zelená" could be an adjective or a verb)
7. search for tokens incorrectly tagged as vocalized prepositions (e.g. in cases in which the following word does not require any vocalization of the preceding preposition)
8. search for tokens whose tags indicate the locative case (6th case); hint: this case can appear only in prepositional groups in Czech
9. search for errors based on the fact that for each preposition there should be a word form somewhere behind the preposition which 'saturates' the preposition and indicates the same morphological case
10. word form "ty" - search for places in which "ty" is tagged as a personal pronoun, but in fact it is a demonstrative pronoun (or vice versa)
11. word form "ti" - analogously to the previous item
12. swap of nominative and accusative - search for nouns (or other parts of speech) with accusative indicated in the POS tag, even if they should be tagged as nominatives (or vice versa)
13. "weird vocatives" - search for tokens incorrectly tagged as vocative forms of nouns
14. two finite verbs close to each other - search for wrongly tagged tokens using the fact that in Czech there should not be two or more finite verb forms in a single clause (but there can be complex verb forms)
15. foreign words - search for foreign words incorrectly tagged as forms of obviously unrelated Czech words (such as "line" in "on-line" tagged as a present-tense form of the verb "linout", or a German article tagged as a form of the Czech verb "drát")
16. wrong clitics - search for tagging errors using the fact that Czech clitics (several short words such as "by", "ti", "mi" etc.) should appear in the so-called second position (Wackernagel's position) in a sentence
17. confusion of prepositions and other parts of speech - find tokens wrongly tagged as prepositions which are in fact nouns or adverbs (homonymous forms such as kolem/kolem/kolem, místo/místo)
18. search for corpus spots with incorrectly segmented sentences
19. search for corpus spots with incorrect tokenization (such as "... sejí ..." instead of "... se jí ...")
• Online practicals:
• searching in Intercorp using Kontext
• a tour through other CNK-related tools: Treq, SyD, Morfio

### 4. Using annotated data for evaluation

Evaluation in NLP

• Warm-up exercise: find synonyms (or near synonyms) in English
• you can use a simple probabilistic English-Czech translation dictionary derived from CzEng czeng-reduced-dict.tsv.gz
• you can rely on the hint that synonymous words are likely to have the same translation equivalents
• you should find some reliability metrics for the detected synonymous pairs and sort the pairs, starting from the most reliable synonyms
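One possible way to approach the warm-up is sketched below. It assumes the dictionary can be read as (English word, Czech word, probability) triples; that three-column layout is an assumption, not the documented format of czeng-reduced-dict.tsv.gz. The reliability score (sum of the smaller of the two translation probabilities over shared translations) is just one illustrative choice, and the sample entries are made up.

```python
from collections import defaultdict
from itertools import combinations


def synonym_candidates(entries):
    """Score English word pairs by the overlap of their Czech translations.

    entries: iterable of (english, czech, probability) triples.
    The three-column layout is an assumption made for this sketch.
    """
    translations = defaultdict(dict)
    for en, cs, prob in entries:
        translations[en][cs] = prob

    scored = []
    for a, b in combinations(sorted(translations), 2):
        shared = set(translations[a]) & set(translations[b])
        if not shared:
            continue
        # Hypothetical reliability metric: for each shared translation,
        # take the smaller of the two probabilities and sum them up.
        score = sum(min(translations[a][cs], translations[b][cs]) for cs in shared)
        scored.append((score, a, b))
    # Most reliable candidate pairs first
    return sorted(scored, reverse=True)


# Made-up sample entries, for illustration only
sample = [
    ("big", "velký", 0.7), ("big", "silný", 0.1),
    ("large", "velký", 0.8), ("large", "rozsáhlý", 0.2),
    ("strong", "silný", 0.9),
]
ranked = synonym_candidates(sample)
for score, a, b in ranked:
    print(f"{score:.2f}\t{a}\t{b}")
```

Comparing all word pairs is quadratic in the vocabulary size; for the full dictionary it would probably be better to index the entries by the Czech word and only compare English words that share at least one translation.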

### 5. Treebanking

October 30 Slides: Treebanks Slides: PDT

### 6. Lexical databases (a guided tour)

Slides: Derinet

• Morphological properties of lexical units
• inflection
• Example: morfflex.cz (as used in MorphoDiTa)
• Exercise: choose a verb in your native language and list all its inflected forms
• Exercise: try to find a word with as many inflected wordforms as possible
• derivation
• morpheme segmentation
• Example: a few of such resources compiled in a MorphoChallenge dataset (TODO:add a link)
• Exercise: choose a past-tense word form of some prefixed verb in your native language and segment it into morphemes
• Syntactic/semantic combinatorial potential of lexical units
• fine-grained role inventories
• coarse-grained role inventories
• Sense inventories
• Multilingual lexical resources
• translation databases
• multilingual wordnets
• Example: EuroWordNet
• cognate databases
• Example: CogNet
• Other lexical resources
• terminological databases
• named entity lists (such as that of geographical names)
• etymological dictionaries
• an overview of lexical resources by Christian M. Meyer and Hatem Mousselly Sergieh

### 8. Udapi cont. (by Martin Popel)

• warm-up: Where (and why) do we use commas in Czech and English?

• exercise1: Implement bhead – a tool like Unix head but instead of first n lines, it prints first n blocks of lines, where blocks are separated by an empty line. It will be useful for sampling conllu files.

• exercise2: Write a Udapi block which changes prepositions to postpositions (moves them after their parent's subtree).

• What do zone and bundle mean in Udapi? How to compare two conllu files (don't forget you should use train or sample, but not dev for this):

udapy -TN < gold.conllu > gold.txt # N means no colors
cat without.conllu | udapy -s tutorial.AddCommas write.TextModeTrees files=pred.txt > pred.conllu
vimdiff gold.txt pred.txt # exit vimdiff with ":qa" or "ZZZZ"
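For the bhead exercise above, a minimal sketch of the core logic might look as follows. The function name and the choice to keep the separating empty line after each printed block (so that the output stays valid for CoNLL-U-like files) are assumptions of this sketch, not requirements taken from the exercise.

```python
def bhead(lines, n=10):
    """Yield the lines of the first n blocks.

    Blocks are runs of non-empty lines separated by empty lines.
    The empty line closing each block is yielded as well.
    """
    if n <= 0:
        return
    seen = 0
    in_block = False
    for line in lines:
        if line.strip():
            in_block = True
            yield line
        elif in_block:
            # An empty line closes the current block
            in_block = False
            seen += 1
            yield line
            if seen == n:
                return
        # Additional empty lines between blocks are skipped


# Tiny made-up input with three blocks
text = "1\ta\n2\tb\n\n1\tc\n\n1\td\n\n"
lines = text.splitlines(keepends=True)
first_two = "".join(bhead(lines, 2))
print(first_two, end="")
```

To use it as a command-line filter, the generator could be connected to standard input and output, e.g. `sys.stdout.writelines(bhead(sys.stdin, n))` with `n` taken from the command line.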


hw_parse

### 10. Licensing

Slides: Intro to authors' rights and licensing

• Licensing, LDC resources

### 11. Dialogue data

Dialogue data

• types of data for developing dialogue systems

A general remark: please note that all your homework solutions should be submitted exclusively using the faculty GitLab server. Detailed information on creating and using your GitLab repository is available within the course NPFL092. For our course, the instructions are to be modified as expected:

• Your project name should be "NPFL070"; the identifier should be "npfl070".
• Access to your repository should be given to both instructors of NPFL070, i.e. to Zdeněk Žabokrtský and Martin Popel.

### 1. hw_my_corpus

Deadline: 23:59 October 28, 2020. 100 points. Create a sequence of tools for building a very simple 1MW (one-million-word) corpus.

• choose a language different from Czech and English and also from your native language
• find on-line sources of texts for the language, containing altogether more than 1 million words, and download them
• convert the material into one large plain-text utf8 file
• tokenize the file on word boundaries and print 50 most frequent tokens
• organize all these steps into a Makefile so that the whole procedure is executed after running make all
• commit the Makefile into hw/my-corpus in your git repository for this course
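The tokenization and frequency-list step can be illustrated with a short Python sketch. The regular expression below (runs of word characters, or single punctuation characters) is only one simple assumption about what "word boundaries" means; real languages may need something more careful. The sample sentence is made up.

```python
import re
from collections import Counter

# Assumed tokenization rule: a token is either a run of word characters
# or a single non-space, non-word character (punctuation).
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def most_frequent_tokens(text, k=50):
    """Return the k most frequent tokens with their counts."""
    tokens = TOKEN_RE.findall(text)
    return Counter(tokens).most_common(k)


# Made-up sample text, for illustration only
sample = "La casa es grande. La casa es roja, y la puerta es azul."
top = most_frequent_tokens(sample, k=3)
for token, count in top:
    print(f"{count}\t{token}")
```

In a Makefile-driven pipeline, such a script would typically read the plain-text corpus file produced by the previous step instead of an in-memory string.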

### 2. hw_our_annotation

Deadline: 23:59 November 9, 2020. 100 points. Design your own annotation project for a linguistic phenomenon of your choice.

• work in pairs
• minimal requirements: annotation in a plain-text format, two annotations by two independently working annotators, at least 50 annotated instances, evaluated inter-annotator agreement, experiment documentation
• commit the annotated data and experiment documentation into hw/our-annotation/ in your git repository for this course; in each pair, only one student commits the solution, while the second student is only mentioned in the documentation
• the annotation tasks will be briefly presented by the students during one of the subsequent online practicals
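Inter-annotator agreement for two annotators is often reported as Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below is a minimal illustration with made-up labels; whether kappa is the right measure depends on the annotation task, so treat it as one option rather than the required method.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of instances with identical labels
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label distributions
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere
        return 1.0
    return (observed - expected) / (1 - expected)


# Made-up annotations of ten instances by two annotators
ann_a = ["N", "N", "V", "V", "N", "V", "N", "N", "V", "N"]
ann_b = ["N", "V", "V", "V", "N", "V", "N", "N", "N", "N"]
kappa = cohen_kappa(ann_a, ann_b)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 instances (observed agreement 0.80), the chance agreement is 0.52, and kappa is therefore (0.80 - 0.52) / (1 - 0.52).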

• Commit blocks' source codes and results to hw/adpos-and-wordorder.
• Complete tutorial.Adpositions (see the usage hint) and detect which of the UD2.0 treebanks (based on the */sample.conllu files) use postpositions.
• Write a new Udapi block to detect word order type – for each language (and treebank, i.e. each sample file), compute the percentage of each of the six possible word order types. Hint: Verbs can be detected by upos. Subjects and objects can be detected by deprel, they are Core dependents of clausal predicates.
• Bonus: Detect which languages are pro-drop (again write a new Udapi block). For a language of your choice, write a block which inserts a node for each dropped pronoun (fill form, lemma, gender, number and person, whenever applicable).
• Points+Feedback in SIS since November 16

• commit your block to hw/add-commas/addcommas.py. Write a Udapi block which heuristically inserts commas into a conllu file (where all commas were deleted). Choose Czech, German, French or English (the final evaluation will be done on all, with the language parameter set to "cs", "de", "fr" or "en"). Use the UDv2.0 sample data: you can use the train.conllu and sample.conllu files for training and debugging your code. For evaluating with the F1 measure use the dev.conllu file, but don't look at the errors you made on this dev data (so you don't overfit). The final evaluation will be done on a secret test set (where the commas will be deleted also from root.text and node.misc['SpaceAfter'] using tutorial.RemoveCommas). To get all points for this hw, you need to achieve at least the LY-MEDIAN (see the results below) F1 score for any of the four languages or at least 45% F1 average on all four languages (on the secret test sets).

• Hints: See the tutorial.AddCommas template block. You can hardlink it to your hw directory: ln ~/udapi-python/udapi/block/tutorial/addcommas.py ~/where/my/git/is/npfl070/hw/add-commas/addcommas.py. For Czech and German (and partially for English) it is useful to detect (finite) clauses first (and finite verbs). It may be useful to first add commas according to a general rule and then delete extra commas (e.g. if neighboring a punctuation token or start/end of sentence).

cd sample
cp UD_English/dev.conllu gold.conllu
cat gold.conllu | udapy -s \
util.Eval node='if node.form==",": node.remove(children="rehang")' \
> without.conllu

# substitute the next line with your solution
cat without.conllu | udapy -s tutorial.AddCommas language=en > pred.conllu

# evaluate
udapy \
eval.F1 gold_zone=en_gold focus=,

# You should see an output similar to this
Comparing predicted trees (zone=en_pred) with gold trees (zone=en_gold), sentences=2002
=== Details ===
token       pred  gold  corr   prec     rec      F1
,            176   800    40  22.73%   5.00%   8.20%
=== Totals ===
predicted =     176
gold      =     800
correct   =      40
precision =  22.73%
recall    =   5.00%
F1        =   8.20%
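The totals in the sample output above follow directly from the three counts (predicted, gold, correct). The sketch below recomputes them with the standard definitions of precision, recall and F1; it illustrates the arithmetic only and is not the implementation of eval.F1.

```python
def precision_recall_f1(predicted, gold, correct):
    """Standard precision, recall and F1 computed from raw counts."""
    precision = correct / predicted if predicted else 0.0
    recall = correct / gold if gold else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the sample output above
p, r, f = precision_recall_f1(predicted=176, gold=800, correct=40)
print(f"precision = {p:.2%}")
print(f"recall    = {r:.2%}")
print(f"F1        = {f:.2%}")
```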


#### Results (F1) as of 2019-12-04

SLOC means source lines of code excluding comments and docstrings. It is reported just for information and plays no role in the evaluation. The homework is not code golf; the code should be nice to read.

Dev set

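| NICK | SLOC | CS | DE | EN | FR | AVG | Points |
|---|---|---|---|---|---|---|---|
| LY-BEST | | 80.71 | 72.79 | 37.65 | 39.61 | 52.41 | |
| LY-MEDIAN | | 80.32 | 53.67 | 33.17 | 32.70 | 51.78 | |
| BASE | 18 | 3.62 | 2.93 | 8.20 | 5.19 | 4.99 | |
| Pyth | 172 | 1.61 | 3.25 | 7.31 | 13.80 | 6.49 | |
| xy123 | 71 | 5.43 | 4.28 | 7.25 | 33.15 | 12.53 | |
| Vilda | 93 | 78.41 | 8.04 | 7.41 | 8.72 | 25.64 | |
| sammy | 46 | 47.53 | 49.57 | 25.15 | 23.94 | 36.55 | |
| 1a3e | 72 | 48.01 | 45.51 | 28.47 | 25.04 | 36.76 | |
| mp | 53 | 88.40 | 68.16 | 54.42 | 46.18 | 64.29 | |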

Test set

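| NICK | SLOC | CS | DE | EN | FR | AVG | Points |
|---|---|---|---|---|---|---|---|
| LY-BEST | | 81.16 | 73.18 | 36.65 | 41.93 | 53.33 | |
| LY-MEDIAN | | 79.41 | 51.92 | 35.69 | 39.06 | 52.24 | |
| Pyth | 172 | 1.95 | 0.90 | 6.96 | 11.44 | 5.31 | 29 (FR) |
| BASE | 18 | 3.21 | 5.90 | 8.80 | 6.10 | 6.00 | |
| xy123 | 71 | 4.73 | 9.39 | 8.00 | 35.34 | 14.37 | 90 (FR) |
| Vilda | 93 | 78.99 | 7.01 | 7.20 | 8.39 | 25.40 | 99 (CS) |
| sammy | 46 | 47.77 | 41.42 | 26.42 | 25.15 | 35.19 | 79 (DE) |
| 1a3e | 72 | 47.59 | 45.13 | 28.24 | 27.03 | 37.00 | 86 (DE) |
| mp | 53 | 88.47 | 62.74 | 54.32 | 49.42 | 63.74 | |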

• Commit your block to hw/add-articles/addarticles.py.
• Write a Udapi block tutorial.AddArticles which heuristically inserts English definite and indefinite articles (the, a, an) into a conllu file (where all articles were deleted). As in the previous homework, the F1 score will be used for the evaluation, just with focus='(?i)an?|the' (note that only the form is evaluated, but it is case sensitive). For removing articles use util.Eval node='if node.upos=="DET" and node.lemma in {"a", "the"}: node.remove(children="rehang")'. Everything else is the same. To get all points for this hw, you need at least 30% F1 (on the secret test set).

#### Results (F1) as of 2019-12-07

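| NICK | SLOC | DEV | TEST | Points |
|---|---|---|---|---|
| LY-BEST | 23 | 34.81 | 40.28 | |
| LY-MEDIAN | 32 | 33.14 | 37.83 | |
| BASE | 6 | 15.31 | 17.64 | |
| pyth | 204 | 25.22 | 24.33 | 81 |
| lucas | 98 | 31.84 | 32.25 | 100 |
| 1a3e | 19 | 30.25 | 33.22 | 100 |
| sammy | 20 | 28.97 | 33.38 | 100 |
| mp | 17 | 35.41 | 41.19 | |
| Vilda | 78 | 38.31 | 42.49 | 100 |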

### 6. hw_parse

• commit your block to hw/parse/parse.py.
• Write a Udapi block tutorial.Parse, which does dependency parsing (labelled, i.e. including deprel assignment) for English, Czech, French and German. A simple rule-based approach is expected, but machine learning is not forbidden (using the provided {train,dev}.conllu). Your goal is to achieve the highest LAS (you can ignore the language-specific part of deprel, so "LAS (udeprel)" reported by eval.Parsing is the evaluation measure to be optimized). To get all points for this hw, you need at least 40% LAS on at least one of the four languages or at least 30% LAS average on all four languages (on the secret test sets).
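To make the evaluation measure concrete, the sketch below computes UAS and LAS on a toy sentence, ignoring the language-specific deprel subtype (the part after the colon). It assumes each token is represented as a (head index, deprel) pair and that gold and predicted trees share the same tokenization; it is an illustration of the metric, not the code of eval.Parsing.

```python
def udeprel(deprel):
    """Strip the language-specific subtype, e.g. 'nsubj:pass' -> 'nsubj'."""
    return deprel.split(":", 1)[0]


def attachment_scores(gold, pred):
    """Return (UAS, LAS) for two parallel lists of (head, deprel) pairs."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    # UAS: the predicted head matches the gold head
    uas_hits = sum(g_head == p_head for (g_head, _), (p_head, _) in zip(gold, pred))
    # LAS: the head and the universal part of the deprel both match
    las_hits = sum(
        g_head == p_head and udeprel(g_rel) == udeprel(p_rel)
        for (g_head, g_rel), (p_head, p_rel) in zip(gold, pred)
    )
    return uas_hits / len(gold), las_hits / len(gold)


# Toy example: "The dog chased cats" (head 0 denotes the root)
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "nsubj:pass"), (0, "root"), (3, "obl")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2%}, LAS (udeprel) = {las:.2%}")
```

In this toy example all heads are correct, the subtype difference on "dog" is ignored, and only the label of "cats" counts as an error.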

#### Results (LAS) as of 2019-12-18

Dev set

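| NICK | SLOC | CS | DE | EN | FR | AVG | Points |
|---|---|---|---|---|---|---|---|
| LY-BEST | | 38.53 | 45.17 | 36.50 | 42.67 | 39.79 | |
| LY-MEDIAN | | 34.52 | 43.95 | 31.93 | 41.10 | 37.48 | |
| BASE | 10 | 0.15 | 0.02 | 0.66 | 0.00 | 0.21 | |
| xy123 | 52 | 21.30 | 26.12 | 22.50 | 38.01 | 26.98 | |
| lucas | 88 | N/A | 35.98 | 30.74 | 41.42 | 27.04 | |
| Vilda | 170 | 43.29 | 36.69 | 24.59 | 33.60 | 34.54 | |
| xy123 | 50 | 32.01 | 42.17 | 37.84 | 41.10 | 38.28 | |
| sammy | 141 | 37.78 | 42.88 | 33.97 | 43.83 | 39.61 | |
| mp | 93 | 39.70 | 45.35 | 59.29 | 56.43 | 50.19 | |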

Test set

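| NICK | SLOC | CS | DE | EN | FR | AVG | Points |
|---|---|---|---|---|---|---|---|
| LY-BEST | | 38.85 | 43.04 | 34.08 | 42.06 | 38.93 | |
| LY-MEDIAN | | 34.64 | 41.90 | 31.55 | 39.79 | 36.88 | |
| BASE | 10 | 0.14 | 0.00 | 0.36 | 0.00 | 0.12 | |
| lucas | 88 | N/A | 33.78 | 29.11 | 40.65 | 25.88 | 100 (FR) |
| xy123 | 52 | 20.58 | 27.18 | 21.60 | 37.58 | 26.73 | 93 (FR) |
| Vilda | 170 | 43.00 | 37.14 | 23.10 | 33.19 | 34.11 | 100 (AV) |
| xy123 | 50 | 32.44 | 38.51 | 37.69 | 40.66 | 37.32 | 100 (AV) |
| sammy | 141 | 37.18 | 38.53 | 34.03 | 43.45 | 38.30 | 100 (AV) |
| mp | 93 | 38.93 | 43.68 | 58.55 | 55.89 | 49.26 | |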

## The pool of final written test questions

### Questions on basic types of corpora

• What is a corpus?
• How can you classify corpora? Give at least three criteria.
• What is an annotation? What kinds of annotation do you know?
• Explain the terms sentence segmentation and tokenization. Give examples of non-trivial situations.
• Explain what lemmatization is and why it is used.
• Explain what a balanced corpus is. Why is this notion problematic?
• Explain what POS tagging is and give examples of tag sets. Give examples of situations in which tagging is non-trivial even for a human.
• Explain the main sources of variability of POS tag sets across different corpora.
• Explain the main property of positional tag sets. Give examples of positional and non-positional tag sets.
• Give examples of at least three corpora (of any type). What is their size? (very roughly, order of magnitude is enough; do not forget to mention units)

### Questions on parallel corpora

• What is a parallel corpus?
• What types (levels) of alignment can be present in parallel corpora?
• Give examples of situations in which document alignment can be problematic.
• Give examples of situations in which sentence alignment can be problematic.
• Give examples of situations in which node alignment can be problematic.
• Give at least three examples of possible sources of parallel data, and for each source describe expected advantages and disadvantages.

### Questions on treebanking

• Either assign Penn Treebank POS tags to words in a given English sentence (short tagset documentation of Penn Treebank tags will be available to you), or assign CNK-style morphological tags to words in a given Czech sentence (short tagset documentation will be available to you). You can choose the language.
• Draw a dependency tree for a given Czech or English sentence.
• Draw a phrase-structure tree for a given Czech or English sentence.
• Name at least four treebanks and describe their main properties.
• Describe two main types of syntactic trees used in treebanks.
• What is a trace (in phrase-structure trees)?
• How do we recognize presence/absence of a dependency relation between two words (in dependency treebanking)?
• Give at least two examples of situations in which the "treeness assumption" on intra-sentence dependency relations is clearly violated.
• Give at least two examples of situations (e.g. syntactic constructions) for which annotation conventions for dependency analysis must be chosen since there are multiple solutions possible that are similarly good from the common sense view.
• Why is coordination difficult to capture in dependency trees (compared to e.g. predicate-argument structure)?

### Universal Dependencies

• How are Universal Dependencies different from other treebanks?
• Describe the CoNLL-U format used in Universal Dependencies.
• When working with Universal Dependencies which tools are suitable for automatic parsing, manual annotation, querying, automatic transformations and validity checking? Name at least one tool for each task.

### Other phenomena for which annotated corpora exist

• Explain what coreference is and how it can be annotated.
• Explain what named entities are and how they can be annotated.
• Explain what sentiment (in the context of NLP) is and how it can be annotated.

### Lexical data resources

• What is WordNet? What do its nodes and edges represent?
• What is EuroWordNet? How does the interlinking through the hub language work?
• What is a synset?
• What is polysemy? Give examples.
• Explain the difference between the notions of polysemy and homography. Why is this distinction non-trivial to make?
• Give an example of an NLP tool/lexicon that captures inflectional morphology, explain what it can be used for and describe its main properties.
• Give an example of an NLP tool/lexicon that captures derivational morphology, explain what it can be used for and describe its main properties.
• What is valency? Give an example of a data resource that captures valency and describe its main properties.

### Evaluation

• Give at least two examples of situations in which measuring a percentage accuracy is not adequate.
• Explain: precision, recall
• What is F-measure, what is it useful for?
• What is k-fold cross-validation?
• Explain BLEU (the exact formula not needed, just the main principles).
• Explain the purpose of brevity penalty in BLEU.
• What is Labeled Attachment Score (in parsing)?
• What is Word Error Rate (in speech recognition)?
• What is inter-annotator agreement? How can it be measured?
• What is Cohen's kappa?

### Questions on licensing

• In the Czech legal system, if you create an artifact, who/what protects your author's rights?
• In the Czech legal system, if you create an artifact, what should you do in order to allow an efficient protection of your author's rights?
• In the Czech legal system, if you create an artifact and you want to make it usable by anyone for free, what should you do?
• In the Czech legal system, what are the implications of attaching a copyright notice (e.g. "(C)opyright Josef Novák, 2018") compared to simply mentioning the author's name?
• What is the difference between moral and economic authors' rights? How can you transfer them to some other person/entity?
• Explain main features of GNU GPL.
• Explain main features of Creative Commons.
• There are four on-off elements defined in the Creative Commons license family (by, nc, sa, nd). Why does it not lead to 2^4 = 16 possible licenses?

### Homework assignments

• There will be 8–12 homework assignments.
• For most assignments, you will get points, up to a given maximum (the maximum is specified with each assignment).
• If your submission is especially good, you can get extra points (up to +10% of the maximum).
• Most assignments will have a fixed deadline (usually in 1 week).
• If you submit the assignment after the deadline, you will get:
• up to 50% of the maximum points if it is less than 2 weeks after the deadline;
• 0 points if it is more than 2 weeks after the deadline.
• Once we check the submitted assignments, you will see the points you got and the comments from us in:

### Test

Your grade is based on the average of your performance; the test and the homework assignments are weighted 1:1.

1. ≥ 90%
2. ≥ 70%
3. ≥ 50%
4. < 50%

For example, if you get 600 out of 1000 points for homework assignments (60%) and 36 out of 40 points for the test (90%), your total performance is 75% and you get a 2.
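The grading rule can be written down as a few lines of Python. The function name and signature below are hypothetical; the thresholds and the 1:1 weighting are taken from the text above.

```python
def final_grade(hw_points, hw_max, test_points, test_max):
    """Return (total performance, grade) with homework and test weighted 1:1."""
    hw = hw_points / hw_max
    test = test_points / test_max
    total = (hw + test) / 2
    if total >= 0.9:
        grade = 1
    elif total >= 0.7:
        grade = 2
    elif total >= 0.5:
        grade = 3
    else:
        grade = 4
    return total, grade


# The example from the text: 600/1000 homework points, 36/40 test points
total, grade = final_grade(600, 1000, 36, 40)
print(f"total = {total:.0%}, grade = {grade}")
```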

### No cheating

• Cheating is strictly prohibited and any student found cheating will be punished. The punishment can involve failing the whole course, or, in grave cases, being expelled from the faculty.
• Discussing homework assignments with your classmates is OK. Sharing code is not OK (unless explicitly allowed); by default, you must complete the assignments yourself.
• All students involved in cheating will be punished. E.g. if you share your assignment with a friend, both you and your friend will be punished.