Adyghe corpus


Welcome to the start page of the Adyghe language corpus.


This is the main page of the website where a linguistic corpus of the Adyghe language is located. The search in annotated Adyghe texts is performed via a web interface. When you search for something, you get individual sentences as search hits; there is no access to entire texts. You can search by character sequences, morphs and their combinations, grammatical tags, Russian translations of words, and position of the word in the sentence. The search can be narrowed down to a subset of texts, selected e.g. by genre. Morphological annotation of words was carried out automatically and was not checked manually; the authors of the corpus take no responsibility for its correctness. The table below outlines the main characteristics of the corpus. More detailed information about the texts and the tagsets used in the corpus can be found further down the page.

Parameter Value
Size 10.68 million words
Texts
  • contemporary press — 65%
  • fiction — 20.6%
  • folklore — 6%
  • religious texts — 3.9%
  • other — 4.5%
Annotation
  • automatic morphological annotation (lemmatization, parts of speech, all inflectional categories); 83.5% of words have at least one analysis (only words that do not contain digits or Latin characters were taken into account)
  • no disambiguation
  • glossing
  • Russian translations of lemmata
  • parallel Russian translation of some texts (4.6% of the corpus volume)
Metadata
  • title of the text
  • author or name of the newspaper
  • year of creation
  • exact day (for newspapers)
  • birthplace of the author
  • year of birth of the author
  • dialect
  • genre
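
The coverage figure in the Annotation row (83.5% of words with at least one analysis, counted only over words without digits or Latin characters) can be read as a simple ratio. The sketch below shows one way to compute such a figure. It is an illustration, not the procedure actually used for the corpus: the function names, the `has_analysis` callback and the sample tokens are all hypothetical.

```python
import unicodedata


def is_counted(word):
    """Return True if the word enters the statistic.

    Assumption: a word is left out as soon as it contains a digit
    or any character whose Unicode name mentions LATIN.
    """
    return not any(
        ch.isdigit() or "LATIN" in unicodedata.name(ch, "")
        for ch in word
    )


def coverage(words, has_analysis):
    """Share of counted words for which has_analysis(word) is true."""
    counted = [w for w in words if is_counted(w)]
    if not counted:
        return 0.0
    analyzed = sum(1 for w in counted if has_analysis(w))
    return analyzed / len(counted)


# Placeholder tokens: three Cyrillic strings, one number, one Latin word.
tokens = ["абв", "где", "жзи", "2015", "web"]
analyzed_words = {"абв", "где"}  # pretend the analyzer recognized these two
share = coverage(tokens, lambda w: w in analyzed_words)
```

Here `share` is computed over the three Cyrillic tokens only, because the number and the Latin word are excluded before counting.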

What is a corpus?

A language corpus is a collection of texts in that language which has been enriched with additional linguistic information, called annotation, and, preferably, equipped with a search engine. Here you will find a short list of frequently asked questions about the Adyghe corpus.

— Who needs corpora?

First of all, corpora are used by linguists. The search engine and annotation of a corpus are designed in such a way that you can make linguistic queries such as “find all nouns in the genitive case” or “find all forms of the word цӏыф followed by a verb”. Apart from linguists, a corpus can be a useful tool for language teachers, language learners, and even native speakers.
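
To make the two example queries more concrete, here is a minimal sketch of how such searches could work over annotated sentences. It does not describe the search engine of this corpus. The data layout, the function names, the English placeholder words and the tags (`N`, `V`, `gen` and so on) are all assumptions made for illustration. Because the corpus has no disambiguation, a token is treated as a match if at least one of its analyses fits.

```python
# Each token is (wordform, analyses); each analysis is (lemma, set of tags).
# Words and tags are hypothetical placeholders, not the corpus tagset.
SENTENCES = [
    [
        ("the", [("the", {"DET"})]),
        ("king's", [("king", {"N", "gen"})]),
        ("men", [("man", {"N", "pl"})]),
        ("walk", [("walk", {"V", "prs"}), ("walk", {"N", "sg"})]),
    ],
    [
        ("men", [("man", {"N", "pl"})]),
        ("of", [("of", {"ADP"})]),
        ("honour", [("honour", {"N", "sg"})]),
    ],
]


def matches(token, lemma=None, tags=frozenset()):
    """True if at least one analysis has the lemma and all requested tags."""
    _, analyses = token
    return any(
        (lemma is None or a_lemma == lemma) and set(tags) <= a_tags
        for a_lemma, a_tags in analyses
    )


def find_tokens(sentences, lemma=None, tags=frozenset()):
    """Indices of sentences that contain at least one matching token."""
    return [
        i for i, sent in enumerate(sentences)
        if any(matches(tok, lemma, tags) for tok in sent)
    ]


def find_followed_by(sentences, first, second):
    """Indices of sentences where a `first` token directly precedes a `second` token."""
    hits = []
    for i, sent in enumerate(sentences):
        for left, right in zip(sent, sent[1:]):
            if matches(left, **first) and matches(right, **second):
                hits.append(i)
                break
    return hits


# "Find all nouns in the genitive case"
nouns_in_genitive = find_tokens(SENTENCES, tags={"N", "gen"})
# "Find all forms of the word X followed by a verb"
man_then_verb = find_followed_by(SENTENCES, {"lemma": "man"}, {"tags": {"V"}})
```

Both queries return sentences rather than texts, which mirrors the way search hits are presented in the corpus.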

— Can I use the corpus as a library?

No, this corpus is not designed for that. When you work with a corpus, you make a query, i.e. search for a particular word, phrase or construction, and get back all sentences that contain what you searched for. By default, the sentences are shown in random order. You can expand the context of each of the sentences you get, i.e. look at their neighboring sentences. However, you may do so only a limited number of times for each sentence. Therefore, it is impossible to read an entire text in the corpus. This restriction is in place for copyright protection.

— Can I use the corpus as a dictionary?

Each Adyghe word in the corpus has a Russian translation (no English translations are available at the moment). However, these translations are only provided as auxiliary information for users who do not speak Adyghe. They are kept short and simple by design: they do not list all senses and do not provide usage examples the way real dictionaries do. If you want to know how to translate a word, the right way to do so is to consult a dictionary.

— What is morphological annotation and how do you get it?

The corpus located here is lemmatized and morphologically annotated. Lemmatization means that each word in the texts is annotated with its lemma, i.e. dictionary/citation form. Morphological annotation means that each word is annotated for its grammatical features, such as part of speech, number, case, tense, etc. Since the corpus in question is too large for manual annotation to be feasible, it was annotated automatically with a program called a morphological analyzer. The analyzer uses a manually compiled grammatical dictionary and a formalized description of Adyghe inflection. Automatic annotation unfortunately means that, first, out-of-vocabulary words are not annotated, and, second, some words receive several competing analyses and remain ambiguous.
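
The interplay of a grammatical dictionary and a description of inflection can be illustrated with a toy analyzer. This is only a sketch under strong assumptions: it uses English placeholder words and a made-up suffix table, and it says nothing about how the actual Adyghe analyzer is implemented. All names in it (`DICTIONARY`, `INFLECTION`, `analyze`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    lemma: str       # dictionary (citation) form
    pos: str         # part of speech
    features: tuple  # grammatical tags


# Toy "grammatical dictionary": lemma -> possible parts of speech.
DICTIONARY = {"walk": ["N", "V"], "dog": ["N"]}

# Toy "description of inflection": part of speech -> (suffix, features).
INFLECTION = {
    "N": [("", ("sg",)), ("s", ("pl",))],
    "V": [("", ("prs",)), ("s", ("prs", "3sg")), ("ed", ("pst",))],
}


def analyze(word):
    """Return every analysis licensed by the dictionary and the suffix table."""
    analyses = []
    for lemma, parts_of_speech in DICTIONARY.items():
        for pos in parts_of_speech:
            for suffix, features in INFLECTION[pos]:
                if word == lemma + suffix:
                    analyses.append(Analysis(lemma, pos, features))
    return analyses


ambiguous = analyze("walks")      # two analyses: plural noun or 3sg verb
unambiguous = analyze("dogs")     # one analysis
out_of_vocabulary = analyze("cats")  # empty list: "cat" is not in the dictionary
```

The three calls correspond to the cases described above: a word with several analyses, a word with exactly one, and an out-of-vocabulary word that stays unannotated.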

Tagsets

(Section under construction)

Authors

The main participants of the project are:

The first version of the corpus was developed in the course of the research project “Digital documentation of a polysynthetic language”, supported by the Russian Foundation for Basic Research (project No. 15-06-07434).

Some technical work was also done by A. Deynekina, V. Lavrentyev, G. Moroz, I. Naumov, and E. Pasalskaya.

Contacts


If you have questions, would like to propose collaboration, or noticed an error in the corpus (except ambiguous analyses, which are not corrected manually), please contact Yury Lander.

yulander@yandex.ru