Status: released
OS: Linux, Windows, OS X

UDPipe

1. Introduction

UDPipe is a trainable pipeline for tokenization, tagging, lemmatization and dependency parsing of CoNLL-U files. It is language-agnostic and can be trained on annotated data in CoNLL-U format. Trained models are provided for nearly all UD treebanks. UDPipe is available as a binary for Linux, Windows and OS X, as a library for C++, Python, Perl, Java and C#, and as a web service. A third-party R package is also available on CRAN.
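Since CoNLL-U is both the input and the output format of the pipeline, a minimal sketch of reading it may help. The reader below assumes the standard ten tab-separated CoNLL-U columns; the names `FIELDS`, `read_conllu` and `SAMPLE` are illustrative, and the sample sentence is hand-written, not actual UDPipe output.

```python
from typing import Dict, Iterator, List

# The ten CoNLL-U columns, in order (lower-cased here for use as dict keys).
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def read_conllu(text: str) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line terminates the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            # Comment lines (e.g. "# text = ...") are skipped in this sketch.
            continue
        else:
            sentence.append(dict(zip(FIELDS, line.split("\t"))))
    if sentence:
        yield sentence


# Hand-written illustrative input, not produced by UDPipe.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for tokens in read_conllu(SAMPLE):
    print([(t["form"], t["upos"], t["head"]) for t in tokens])
```

The sketch keeps every field as a string and ignores comment lines; multiword-token ranges and empty nodes would appear as ordinary rows with IDs such as `1-2` or `1.1`.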

UDPipe is free software distributed under the Mozilla Public License 2.0. The linguistic models are free for non-commercial use and are distributed under the CC BY-NC-SA license, although for some models the original data used to create the model may impose additional licensing conditions. UDPipe is versioned using Semantic Versioning.

Copyright 2017 by Institute of Formal and Applied Linguistics, Faculty of Mathematics and Physics, Charles University, Czech Republic.

2. Online Web Application and Web Service

The UDPipe Web Application is available at http://lindat.mff.cuni.cz/services/udpipe/ using the LINDAT/CLARIN infrastructure.

The UDPipe REST Web Service is also available; its API documentation is at http://lindat.mff.cuni.cz/services/udpipe/api-reference.php.
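As a rough illustration of calling the web service from a script, the sketch below builds and sends a request using only the Python standard library. The `api/process` path, the `tokenizer`, `tagger`, `parser`, `data` and `model` parameters, and the `result` field of the JSON response are assumptions that should be checked against the API reference; the helper names `build_process_request` and `process` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the REST methods live under this base URL.
API_BASE = "http://lindat.mff.cuni.cz/services/udpipe/api"


def build_process_request(text, model=None):
    """Return (url, body) for a form-encoded POST to the assumed `process` method."""
    # Assumption: empty `tokenizer`, `tagger` and `parser` parameters enable
    # the corresponding steps, and `data` carries the raw input text.
    params = {"tokenizer": "", "tagger": "", "parser": "", "data": text}
    if model is not None:
        params["model"] = model  # Assumption: selects a specific model by name.
    body = urllib.parse.urlencode(params).encode("utf-8")
    return API_BASE + "/process", body


def process(text, model=None, timeout=30):
    """Send the text to the service and return the processed output."""
    url, body = build_process_request(text, model)
    request = urllib.request.Request(url, data=body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: the processed data is returned in the "result" field.
    return payload["result"]


if __name__ == "__main__":
    print(process("Hello world."))
```

Separating request construction from the network call makes it possible to inspect the exact URL and body before anything is sent.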

3. Release

3.1. Download

UDPipe releases are available on GitHub, both as source code and as a pre-compiled binary package. The binary package contains Linux, Windows and OS X binaries, the Java bindings binary, the C# bindings binary, and the source code of UDPipe and all language bindings. While the binary package does not contain compiled Python or Perl bindings, packages for those languages are available in standard package repositories, namely PyPI and CPAN.
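For the Python bindings, a minimal usage sketch follows. It assumes the `ufal.udpipe` package from PyPI is installed and that a trained model file has already been downloaded; the helper name `annotate` and the path `path/to/model.udpipe` are placeholders, not part of UDPipe.

```python
def annotate(text, model_path):
    """Tokenize, tag and parse raw text; return the result as CoNLL-U."""
    # Imported lazily so the sketch can be loaded without the package installed.
    from ufal.udpipe import Model, Pipeline, ProcessingError

    model = Model.load(model_path)
    if not model:
        raise ValueError("Cannot load UDPipe model from " + model_path)

    # Input format "tokenize" runs the model's tokenizer on raw text;
    # DEFAULT uses the tagger and parser options stored in the model.
    pipeline = Pipeline(model, "tokenize", Pipeline.DEFAULT,
                        Pipeline.DEFAULT, "conllu")
    error = ProcessingError()
    output = pipeline.process(text, error)
    if error.occurred():
        raise RuntimeError(error.message)
    return output


if __name__ == "__main__":
    # Placeholder path: replace with a real downloaded model file.
    print(annotate("Hello world.", "path/to/model.udpipe"))
```

The UDPipe API Reference remains the authoritative description of these classes and their options.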

3.1.1. Language Models

To use UDPipe, a language model is needed. The language models are available from the LINDAT/CLARIN infrastructure and are described further in the UDPipe User's Manual. Currently, the following language models are available:

3.2. License

UDPipe is an open-source project and is freely available for non-commercial purposes. The library is distributed under the Mozilla Public License 2.0, and the associated models and data are distributed under CC BY-NC-SA, although for some models the original data used to create the model may impose additional licensing conditions.

If you use this tool for scientific work, please give credit to us by referencing Straka et al. 2016 and the UDPipe website.

4. UDPipe Installation

UDPipe Installation is described on a separate page.

5. UDPipe User's Manual

The UDPipe User's Manual is on a separate page.

6. UDPipe API Reference

The UDPipe API Reference is on a separate page.

7. Contact

Authors:

UDPipe website.

UDPipe LINDAT/CLARIN entry.

8. Acknowledgements

This work has been using language resources developed and/or stored and/or distributed by the LINDAT/CLARIN project of the Ministry of Education of the Czech Republic (project LM2010013).

Acknowledgements for individual language models are listed in the UDPipe User's Manual page.

8.1. Publications

8.2. Bibtex for Referencing

@InProceedings{udpipe:2017,
  author    = {Straka, Milan  and  Strakov\'{a}, Jana},
  title     = {Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe},
  booktitle = {Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies},
  month     = {August},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {88--99},
  url       = {http://www.aclweb.org/anthology/K/K17/K17-3009.pdf}
}

8.3. Persistent Identifier

If you prefer to reference UDPipe by a persistent identifier (PID), you can use http://hdl.handle.net/11234/1-1702.
