The NameTag API is defined in header nametag.h and resides in ufal::nametag namespace.

The strings used in the NameTag API are always UTF-8 encoded (except for file paths, whose encoding is system dependent).

1. NameTag Versioning

NameTag is versioned using Semantic Versioning. Therefore, a version consists of three numbers major.minor.patch, optionally followed by a hyphen and pre-release version info, with the following semantics:

  • Stable versions have no pre-release version info, development versions have non-empty pre-release version info.
  • Two versions with the same major.minor have the same API with the same behaviour, apart from bugs. Therefore, if only patch is increased, the new version is only a bug-fix release.
  • If two versions v and u have the same major, but minor(v) is greater than minor(u), version v contains only additions to the API. In other words, the API of u is all present in v with the same behaviour (once again apart from bugs). It is therefore safe to upgrade to a newer NameTag version with the same major.
  • If two versions differ in major, their API may differ in any way.

Models created by NameTag have the same behaviour in all NameTag versions with the same major, apart from obvious bug fixes. On the other hand, models created from the same data by NameTag versions that differ in major.minor may have different behaviour.
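
The compatibility rules above can be sketched as a small check. This is only an illustration under stated assumptions, not part of the NameTag API: the semver struct and the api_compatible and is_stable helpers are hypothetical names, and the fields are called major_num and minor_num because some system headers define macros named major and minor.

```cpp
#include <string>

// Hypothetical stand-in for a parsed version; not the library's version class.
struct semver {
  unsigned major_num;
  unsigned minor_num;
  unsigned patch_num;
  std::string prerelease;  // empty for stable versions
};

// True when code written against `required` can use `available`:
// the major numbers match and `available` has the same or a newer minor.
// Patch releases only fix bugs, so the patch number is ignored.
bool api_compatible(const semver& required, const semver& available) {
  return required.major_num == available.major_num &&
         available.minor_num >= required.minor_num;
}

// A version is stable when it carries no pre-release version info.
bool is_stable(const semver& v) {
  return v.prerelease.empty();
}
```

With these helpers, code written against 1.0.0 is accepted on 1.2.3, while the reverse direction and any change of major are rejected.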

2. Struct string_piece

struct string_piece {
  const char* str;
  size_t len;

  string_piece();
  string_piece(const char* str);
  string_piece(const char* str, size_t len);
  string_piece(const std::string& str);
};

The string_piece is used for efficient string passing. The string referenced in string_piece is not owned by it, so users have to make sure the referenced string exists as long as the string_piece.
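
The ownership rule can be illustrated with a local mirror of the struct, so that the sketch compiles without nametag.h. The constructor bodies below are assumptions about a plausible implementation, and to_owned is a hypothetical helper that is not part of the API.

```cpp
#include <stddef.h>
#include <cstring>
#include <string>

// Local mirror of the string_piece declared above; constructor bodies are assumed.
struct string_piece {
  const char* str;
  size_t len;

  string_piece() : str(nullptr), len(0) {}
  string_piece(const char* str) : str(str), len(std::strlen(str)) {}
  string_piece(const char* str, size_t len) : str(str), len(len) {}
  string_piece(const std::string& str) : str(str.c_str()), len(str.size()) {}
};

// Hypothetical helper: copy the referenced bytes into an owning std::string.
// Use it when the data has to outlive the buffer the piece points into.
std::string to_owned(string_piece piece) {
  return piece.len ? std::string(piece.str, piece.len) : std::string();
}
```

A string_piece built from a std::string points into that string's buffer, so it becomes invalid as soon as the string is destroyed or modified; the copy returned by to_owned does not.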

3. Struct token_range

struct token_range {
  size_t start;
  size_t length;
};

The token_range represents the range of a token as returned by a tokenizer. The start and length fields specify the token position in Unicode characters, not in bytes of UTF-8 encoding.
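
Because the range counts Unicode characters, it cannot be used directly as an index into a UTF-8 buffer. The following sketch shows one way to convert it; byte_offset and token_text are hypothetical helpers, the struct is a local mirror of the declaration above, and the input is assumed to be valid UTF-8.

```cpp
#include <stddef.h>
#include <string>

struct token_range {  // local mirror of the declaration above
  size_t start;
  size_t length;
};

// Byte offset of the code point with the given index in UTF-8 `text`.
// Returns text.size() when the index is at or past the end.
size_t byte_offset(const std::string& text, size_t index) {
  size_t pos = 0;
  while (index > 0 && pos < text.size()) {
    ++pos;  // step over the lead byte
    // Skip continuation bytes, which have the bit pattern 10xxxxxx.
    while (pos < text.size() &&
           (static_cast<unsigned char>(text[pos]) & 0xC0) == 0x80)
      ++pos;
    --index;
  }
  return pos;
}

// Extract the token text for a range given in Unicode characters.
std::string token_text(const std::string& text, const token_range& range) {
  size_t begin = byte_offset(text, range.start);
  size_t end = byte_offset(text, range.start + range.length);
  return text.substr(begin, end - begin);
}
```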

4. Struct named_entity

struct named_entity {
  size_t start;
  size_t length;
  std::string type;

  named_entity();
  named_entity(size_t start, size_t length, const std::string& type);
};

The named_entity is used to represent a named entity. The start and length fields represent the entity range in either tokens, Unicode characters or bytes, depending on the usage. The type represents the entity type.
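
Since the unit of start and length depends on the usage, it is sometimes necessary to convert between units. The sketch below re-expresses a token-based entity in Unicode characters using the token ranges of the same sentence. The structs are local mirrors of the declarations above, the constructor bodies are assumed, and to_character_range is a hypothetical helper.

```cpp
#include <cassert>
#include <stddef.h>
#include <string>
#include <vector>

struct token_range {  // local mirror
  size_t start;
  size_t length;
};

struct named_entity {  // local mirror; constructor bodies are assumed
  size_t start;
  size_t length;
  std::string type;

  named_entity() : start(0), length(0) {}
  named_entity(size_t start, size_t length, const std::string& type)
      : start(start), length(length), type(type) {}
};

// Convert an entity whose range counts tokens into one whose range counts
// Unicode characters, using the per-token ranges of the same sentence.
named_entity to_character_range(const named_entity& entity,
                                const std::vector<token_range>& tokens) {
  assert(entity.length > 0 && entity.start + entity.length <= tokens.size());
  const token_range& first = tokens[entity.start];
  const token_range& last = tokens[entity.start + entity.length - 1];
  size_t end = last.start + last.length;  // one past the last character
  return named_entity(first.start, end - first.start, entity.type);
}
```

The entity type used in any call (for example "PER") is an illustrative placeholder; the actual type labels depend on the model.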

5. Class version

class version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;

  static version current();
};

The version class represents the NameTag version. See NameTag Versioning for more information.

5.1. version::current

static version current();

Returns the current NameTag version.

6. Class tokenizer

class tokenizer {
 public:
  virtual ~tokenizer() {}

  virtual void set_text(string_piece text, bool make_copy = false) = 0;
  virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

  static tokenizer* new_vertical_tokenizer();
};

The tokenizer class performs segmentation and tokenization of the given text. The class is not thread-safe.

The tokenizer instances can be obtained either directly using the static method new_vertical_tokenizer or through instances of ner.

6.1. tokenizer::set_text

virtual void set_text(string_piece text, bool make_copy = false) = 0;

Set the text which is to be tokenized.

If make_copy is false, only a reference to the given text is stored and the user has to make sure it exists until the tokenizer is released or set_text is called again. If make_copy is true, a copy of the given text is made and retained until the tokenizer is released or set_text is called again.

6.2. tokenizer::next_sentence

virtual bool next_sentence(std::vector<string_piece>* forms, std::vector<token_range>* tokens) = 0;

Locate and return the next sentence of the given text. Returns true when successful and false when there are no more sentences in the given text. Each argument that is not NULL is filled with the tokens found. The forms contain token ranges in bytes of UTF-8 encoding, the tokens contain token ranges in Unicode characters.

6.3. tokenizer::new_vertical_tokenizer

static tokenizer* new_vertical_tokenizer();

Returns a new instance of a vertical tokenizer, which considers every line to be one token, with empty line denoting end of sentence. The user should delete the instance after use.
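
The vertical format can be illustrated by a small stand-alone function. This is a re-implementation for illustration only, not the library code: split_vertical is a hypothetical name, and the handling of edge cases (consecutive empty lines, a missing final empty line) is an assumption, since the text above does not specify it.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split vertical-format text into sentences of tokens:
// every line is one token and an empty line ends the current sentence.
std::vector<std::vector<std::string>> split_vertical(const std::string& text) {
  std::vector<std::vector<std::string>> sentences;
  std::vector<std::string> current;
  std::istringstream lines(text);
  std::string line;
  while (std::getline(lines, line)) {
    if (line.empty()) {
      // Assumption: repeated empty lines do not produce empty sentences.
      if (!current.empty()) sentences.push_back(current);
      current.clear();
    } else {
      current.push_back(line);
    }
  }
  // Assumption: input without a final empty line still yields its last sentence.
  if (!current.empty()) sentences.push_back(current);
  return sentences;
}
```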

7. Class ner

class ner {
 public:
  virtual ~ner() {}

  static ner* load(const char* fname);
  static ner* load(std::istream& is);

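  virtual void recognize(const std::vector<string_piece>& forms, std::vector<named_entity>& entities) const = 0;
  void tokenize_and_recognize(string_piece text, std::vector<named_entity>& entities, bool unicode_offsets = false) const;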

  virtual tokenizer* new_tokenizer() const = 0;
};

A ner instance represents a named entity recognizer. All methods are thread-safe.

7.1. ner::load(const char*)

static ner* load(const char* fname);

Factory method constructor. Accepts C string with a file name of the model. Returns a pointer to an instance of ner which the user should delete after use.

7.2. ner::load(istream&)

static ner* load(std::istream& is);

Factory method constructor. Accepts an input stream with the model. Returns a pointer to an instance of ner which the user should delete after use.

7.3. ner::recognize

virtual void recognize(const std::vector<string_piece>& forms, std::vector<named_entity>& entities) const = 0;

Perform named entity recognition on a tokenized sentence given in the forms argument. The found entities are returned in the entities argument. The range of the returned named_entity is represented using form indices.
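
Because the returned ranges are form indices, the text of an entity is obtained by joining the forms it covers. The sketch below uses minimal local mirrors of string_piece and named_entity (constructors omitted) so that it compiles without nametag.h; entity_text is a hypothetical helper, and joining with a single space is an assumption that may not match the original spacing.

```cpp
#include <cassert>
#include <stddef.h>
#include <string>
#include <vector>

struct string_piece {  // minimal local mirror, constructors omitted
  const char* str;
  size_t len;
};

struct named_entity {  // minimal local mirror, constructors omitted
  size_t start;
  size_t length;
  std::string type;
};

// Join the forms covered by an entity whose range is given in form indices.
std::string entity_text(const std::vector<string_piece>& forms,
                        const named_entity& entity) {
  assert(entity.start + entity.length <= forms.size());
  std::string text;
  for (size_t i = entity.start; i < entity.start + entity.length; ++i) {
    if (!text.empty()) text += ' ';  // assumed separator between forms
    text.append(forms[i].str, forms[i].len);
  }
  return text;
}
```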

7.4. ner::tokenize_and_recognize

void tokenize_and_recognize(string_piece text, std::vector<named_entity>& entities, bool unicode_offsets = false) const;

Perform named entity recognition on an untokenized text given in the text argument. The found entities are returned in the entities argument. The range of the returned named_entity is represented either in Unicode characters (when unicode_offsets == true), or in UTF-8 bytes (when unicode_offsets == false).
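
With byte offsets, an entity range can be used directly to slice the UTF-8 input. The following sketch assumes the entities were produced with unicode_offsets == false; entity_slices is a hypothetical helper and named_entity is a minimal local mirror of the struct.

```cpp
#include <stddef.h>
#include <string>
#include <utility>
#include <vector>

struct named_entity {  // minimal local mirror, constructors omitted
  size_t start;
  size_t length;
  std::string type;
};

// Pair each entity type with the slice of `text` it covers, treating the
// ranges as UTF-8 byte offsets.
std::vector<std::pair<std::string, std::string>> entity_slices(
    const std::string& text, const std::vector<named_entity>& entities) {
  std::vector<std::pair<std::string, std::string>> slices;
  for (const named_entity& entity : entities) {
    if (entity.start > text.size()) continue;  // skip ranges outside the text
    slices.emplace_back(entity.type, text.substr(entity.start, entity.length));
  }
  return slices;
}
```

The same slicing applied to ranges in Unicode characters would cut multi-byte characters at the wrong place, so such ranges have to be converted to byte offsets first.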

7.5. ner::new_tokenizer

virtual tokenizer* new_tokenizer() const = 0;

Returns a new instance of a suitable tokenizer or NULL if no such tokenizer exists. The user should delete it after use.

8. C++ Bindings API

Bindings for languages other than C++ are created using SWIG from the C++ bindings API, which is a slightly modified version of the native C++ API. The main changes are the replacement of the string_piece type by native strings and the removal of methods using istream. Here is the C++ bindings API declaration:

8.1. Helper Structures

typedef vector<string> Forms;

struct TokenRange {
  size_t start;
  size_t length;
};
typedef vector<TokenRange> TokenRanges;

struct NamedEntity {
  size_t start;
  size_t length;
  string type;

  NamedEntity();
  NamedEntity(size_t start, size_t length, const string& type);
};
typedef vector<NamedEntity> NamedEntities;

8.2. Main Classes

class Version {
 public:
  unsigned major;
  unsigned minor;
  unsigned patch;
  string prerelease;

  static Version current();
};

class Tokenizer {
 public:
  virtual void setText(const char* text);
  virtual bool nextSentence(Forms* forms, TokenRanges* tokens);

  static Tokenizer* newVerticalTokenizer();
};

class Ner {
 public:
  static Ner* load(const char* fname);

  virtual void recognize(Forms& forms, NamedEntities& entities) const;

  virtual Tokenizer* newTokenizer() const;
};

9. C# Bindings

NameTag library bindings are available in the Ufal.NameTag namespace.

The bindings are a straightforward conversion of the C++ bindings API. The bindings require the native C++ library libnametag_csharp (called nametag_csharp on Windows).

10. Java Bindings

NameTag library bindings are available in the cz.cuni.mff.ufal.nametag package.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Java interface, see the cz.cuni.mff.ufal.nametag.Forms class for reference. Also, class members are accessible and modifiable using getField and setField wrappers.

The bindings require the native C++ library libnametag_java (called nametag_java on Windows). If the library is found in the current directory, it is used, otherwise the standard library search process is used.

11. Perl Bindings

NameTag library bindings are available in the Ufal::NameTag package. The classes can be imported into the current namespace using the :all export tag.

The bindings are a straightforward conversion of the C++ bindings API. Vectors do not have a native Perl interface, see Ufal::NameTag::Forms for reference. Static methods and enumerations are available only through the module, not through object instances.

12. Python Bindings

NameTag library bindings are available in the ufal.nametag module.

The bindings are a straightforward conversion of the C++ bindings API. In Python 2, strings can be either unicode or UTF-8 encoded str, and the library always produces unicode. In Python 3, strings must be str.