Nayiri Developers: Corpus of Western Armenian

Corpus of Western Armenian — Document Data Store

Overview

The Document Data Store is a folder-based repository that stores Document objects, with each persisted in its own text file.

Text files are UTF-8 encoded and are named <Document ID>.txt, where <Document ID> is the 6-character unique identifier of the Document, as described in the Metadata Properties Reference section below.

Here we explore the file format of these text files.
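
As a rough illustration of the naming convention, the sketch below validates an ID, builds the corresponding file path, and lists the IDs found in a store folder. The helper names (is_document_id, document_path, list_document_ids, id_to_int) are hypothetical and not part of any Nayiri API. The reference section describes the ID as the base64url encoding of a 36-bit identifier, so the sketch assumes the standard base64url alphabet with 6 bits per character; the most-significant-character-first order used in id_to_int is an assumption.

```python
import re
from pathlib import Path

# base64url alphabet (RFC 4648, section 5): 64 symbols, 6 bits each,
# so 6 characters carry 6 * 6 = 36 bits.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{6}")


def is_document_id(text: str) -> bool:
    """Return True if text looks like a 6-character Document ID."""
    return ID_PATTERN.fullmatch(text) is not None


def document_path(store_dir: Path, document_id: str) -> Path:
    """Build the path <store_dir>/<Document ID>.txt for a given ID."""
    if not is_document_id(document_id):
        raise ValueError(f"not a valid Document ID: {document_id!r}")
    return store_dir / f"{document_id}.txt"


def list_document_ids(store_dir: Path) -> list[str]:
    """List the IDs of all Document files found directly in store_dir."""
    return sorted(
        p.stem for p in store_dir.glob("*.txt") if is_document_id(p.stem)
    )


def id_to_int(document_id: str) -> int:
    """Decode an ID to an integer.

    Assumption: the first character holds the most significant 6 bits.
    """
    if not is_document_id(document_id):
        raise ValueError(f"not a valid Document ID: {document_id!r}")
    value = 0
    for ch in document_id:
        value = value * 64 + ALPHABET.index(ch)
    return value
```

For example, document_path(Path("corpus"), "JXPzhQ") yields corpus/JXPzhQ.txt, where "corpus" stands in for wherever the store folder lives on disk.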

Document File Format

Document files consist of two parts separated by the special string BEGIN_DOCUMENT_CONTENT, which occupies a line of its own.

The first part of the file contains the Document's metadata and is similar to a Java properties file, with key-value pairs separated by an equals sign (=), one on each line.

To illustrate this, consider the contents of the file named JXPzhQ.txt. Its first 17 lines specify metadata properties as key-value pairs. These are followed by a blank line, the BEGIN_DOCUMENT_CONTENT delimiter on a line of its own, and finally the textual content of the Document (for conciseness, only the first three paragraphs of the sample Document are shown here).

For reference, this example Document is the essay by Մուշեղ Իշխան (Mushegh Ishkhan) titled «Լուսնէջքէն Ետք Մարդկութիւնը Նոյնն Է» ("After the Moon Landing, Humanity is the Same") and published in Aztag Daily in Beirut, Lebanon, on August 7, 1969, shortly after the Apollo 11 Moon landing.

id = JXPzhQ
publicationId = aztag.daily
authorId = mushegh.ishkhan
yearPublished = 1969
monthPublished = 8
dayPublished = 7
title = Լուսնէջքէն Ետք Մարդկութիւնը Նոյնն Է
subTitle = 
category = 
url = https://www.aztagdaily.com/archives/448974
scanUrl = https://tert.nla.am/archive/NLA%20TERT/Azdak1927/1969/131_ocr.pdf
writtenLanguageVariant = WA
isManuallyStemmedAndTokenized = false
usernameCreatedBy = Serouj
creationTime = 1707693051965
usernameLastModifiedBy = Serouj
lastModifiedTime = 1771998726856

BEGIN_DOCUMENT_CONTENT
[[Լուսնէջքի >>> լուսնէջք]] ստեղծած [[առաջին >>> առաջին@ADJECTIVE]] [[մեծ >>> մեծ@ADJECTIVE]] [[խանդավառութենէն >>> խանդավառութիւն]] [[ետք >>> ետք@PREPOSITION]] զգացումի փոթորիկները [[սկսած են >>> սկսիլ]] տակաւ հանդարտիլ։ Մեր երկրագունդի կեանքին [[հետ >>> հետ@PREP]] կապ ունեցող ընթացիկ մտահոգութիւններ խորհրդածութեան նիւթ [[կը դառնան >>> դառնալ]] [[կրկին >>> կրկին@ADVERB]]։ [[Լուսնի >>> լուսին]] [[գրաւման >>> գրաւումն]] [[մեծ >>> մեծ@ADJECTIVE]] խնդիրը [[լուծուեցաւ >>> լուծուիլ]] մարդկային մտքի եւ քաջութեան յաղթանակով, սակայն [[այս >>> այս@DET]] աշխարհի զանազան [[հողամասերուն >>> հողամաս]] [[վրայ >>> վրայ@PREP]] հազարումէկ [[ծանր >>> ծանր@ADJ]] խնդիրներ կան, որոնք ոչ [[միայն >>> միայն@ADV]] [[չեն լուծուիր >>> լուծուիլ]], [[այլ >>> այլ@CON]] օրէ օր [[կը բարդանան >>> բարդանալ]] եւ արիւնոտ [[հանգոյցներու >>> հանգոյց]] [[կը վերածուին >>> վերածուիլ]]։

Քաղաքական տագնապներ, [[պատերազմի >>> պատերազմ]] բռնկած օճախներ, սովահար բազմութիւններ, ոճրածին [[ծրագիրներ >>> ծրագիր]], ընչաքաղց ախորժակներ, քէն եւ ատելութիւն։ [[Այս >>> այս@DET]] բոլորը [[կան >>> կանանալ]] ու [[կը շարունակուին >>> շարունակուիլ]] [[նախկին >>> նախկին@ADJ]] ձեւով, [[նախկին >>> նախկին@ADJECTIVE]] զօրութեամբ, [[հակառակ >>> հակառակ@ADP]] միջոցին [[մէջ >>> մէջ@PREP]] [[իրագործուած >>> իրագործուիլ]] աննախընթաց խոյանքին։

[[Աստղերու >>> աստղ]] [[փոշիով >>> փոշի]] եւ լոյսով թաթաւուն թեւաւոր մարդը [[կեցած է >>> կենալ]] դարձեալ աշխարհի [[կոտրած >>> կոտրիլ]] [[տաշտին >>> տաշտ]] [[առջեւ >>> առջեւ@PREPOSITION]] եւ [[ստիպուած է >>> ստիպուիլ]] հոն [[լուալու >>> լուալ]] կեանքի աղտոտ [[լաթերը >>> լաթ]]։

Note that there is one key-value pair per line in the metadata part of the file, where each line has the pattern <key> = <value>.
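
The two-part layout described above can be read with a few lines of code. The following is a minimal sketch, not an official parser: the function name parse_document is hypothetical, and it assumes that blank lines in the metadata part can be ignored and that only the first equals sign on a line separates key from value (so values such as URLs may themselves contain equals signs).

```python
DELIMITER = "BEGIN_DOCUMENT_CONTENT"


def parse_document(text: str) -> tuple[dict[str, str], str]:
    """Split the text of a Document file into (metadata, content)."""
    lines = text.replace("\r\n", "\n").split("\n")

    # Find the line that consists solely of the delimiter.
    split_at = None
    for index, line in enumerate(lines):
        if line.strip() == DELIMITER:
            split_at = index
            break
    if split_at is None:
        raise ValueError(f"missing {DELIMITER} line")

    metadata: dict[str, str] = {}
    for line in lines[:split_at]:
        if not line.strip():
            continue  # e.g. the blank line just before the delimiter
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"malformed metadata line: {line!r}")
        # Empty values (as in "subTitle = ") become empty strings.
        metadata[key.strip()] = value.strip()

    # Everything after the delimiter line is the textual content.
    content = "\n".join(lines[split_at + 1:])
    return metadata, content
```

A file would typically be read with Path("JXPzhQ.txt").read_text(encoding="utf-8") and the resulting string passed to parse_document; properties left empty in the file, such as subTitle and category in the sample, come back as empty strings.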

In the file's textual portion, some of the text has been annotated with explicit tokenization, explicit lemmatization, and part-of-speech tagging using the Nayiri Markup Language.

Not all annotations are necessary, and some have been added for illustration purposes.
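
Judging only from the sample above, annotations take the form [[surface text >>> lemma]] or [[surface text >>> lemma@TAG]]. The sketch below extracts them with a regular expression and can also strip them to recover plain text. It is inferred from the sample alone and is not a full implementation of the Nayiri Markup Language, which may define constructs not shown here; the names Annotation, find_annotations, and strip_annotations are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# [[surface >>> lemma]] or [[surface >>> lemma@TAG]], as seen in the sample.
ANNOTATION = re.compile(
    r"\[\[(?P<surface>[^\[\]]+?)\s*>>>\s*(?P<target>[^\[\]]+?)\]\]"
)


class Annotation(NamedTuple):
    surface: str        # text as it appears in the Document
    lemma: str          # lemma given after ">>>"
    pos: Optional[str]  # tag after "@", or None when absent


def find_annotations(content: str) -> list[Annotation]:
    """Return every annotation found in the content, in order."""
    found = []
    for match in ANNOTATION.finditer(content):
        lemma, _, pos = match.group("target").strip().partition("@")
        found.append(
            Annotation(match.group("surface").strip(), lemma, pos or None)
        )
    return found


def strip_annotations(content: str) -> str:
    """Replace each annotation with its surface text."""
    return ANNOTATION.sub(lambda m: m.group("surface").strip(), content)
```

Note that the tags in the sample vary in spelling (for example ADJ and ADJECTIVE, PREP and PREPOSITION), so an application that relies on them would likely need to normalize the values it reads.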

Metadata Properties Reference

The metadata section supports the properties described in this section.

All properties besides id and writtenLanguageVariant are optional, but in practice almost all Documents have at least a title and an authorId specified.

Note that some of the properties contain revision data internal to the Nayiri revision system and are included for tracking file versions.

Use the properties you need for your application.

Attribute Key	Type	Description
id	String	The 6-character identifier that uniquely identifies this Document in the Nayiri Text Corpus. It is the base64url encoding of the underlying 36-bit unique identifier. It is the same identifier used in the filename.
publicationId	String	(Optional) A reference to the unique identifier of the Publication object, if this Document is associated with a publication such as a newspaper or journal.
authorId	String	(Optional) A reference to the unique identifier of the Author object, if this Document is associated with an author.
yearPublished	Integer	(Optional) The year this Document was published.
monthPublished	Integer	(Optional) The month (1-12) that this Document was published.
dayPublished	Integer	(Optional) The day of month (1-31) that this Document was published.
title	String	(Optional) The title of this Document (if known).
subTitle	String	(Optional) The subtitle of this Document (if known). This is any subordinate title of the Document, giving additional information about its content.
category	String	(Optional) This is any category that the Document could be filed under (for example, Խմբագրական "Editorial" in an article that appeared in a newspaper).
url	String	(Optional) A URL pointing to the source of the text.
scanUrl	String	(Optional) A URL pointing to the scan (image) of the original document represented by the text.
writtenLanguageVariant	String	(Required) The Written Language Variant in which this Document is primarily written. One of: `WA` for Western Armenian, `EA` for Eastern Armenian, and `EA_RO` for Eastern Armenian in Reformed Orthography. Since the current release of the Corpus is for Western Armenian, this property will almost always be `WA`. However, given that the Nayiri Armenian Corpus will include other written language variants of Armenian in the future, it is best to filter the Documents your application consumes by the written language variant(s) you need.
isManuallyStemmedAndTokenized	Boolean	(Optional) This is an administrative property, specifying whether the Document has been fully disambiguated via explicit tokenization and stemming (including part-of-speech tagging). Most articles have this set to `false`.
usernameCreatedBy	String	(Optional) The name of the user who created the Document.
creationTime	Long	(Optional) The time at which this Document was added to the Corpus, in milliseconds since the Unix epoch, stored as a 64-bit integer (a Java long).
usernameLastModifiedBy	String	(Optional) The name of the user who last modified the Document.
lastModifiedTime	Long	(Optional) The time at which this Document was last modified, in milliseconds since the Unix epoch, stored as a 64-bit integer (a Java long).
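
Since every value in the file is stored as text, an application will usually convert the properties it needs to the types listed in the table. The sketch below shows one way to do that for the Integer, Boolean, and Long properties, and to filter by written language variant as recommended above. It assumes metadata is a plain dictionary of strings keyed by property name, that empty or missing values mean "not specified", and that the timestamps count from the Unix epoch in UTC; all function names are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone
from typing import Optional

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_int(metadata: dict[str, str], key: str) -> Optional[int]:
    """Return an Integer or Long property as int, or None if unspecified."""
    raw = metadata.get(key, "").strip()
    return int(raw) if raw else None


def as_bool(metadata: dict[str, str], key: str) -> bool:
    """Return a Boolean property; anything other than 'true' is False."""
    return metadata.get(key, "").strip().lower() == "true"


def as_datetime(metadata: dict[str, str], key: str) -> Optional[datetime]:
    """Interpret a milliseconds-since-epoch property as a UTC datetime."""
    millis = as_int(metadata, key)
    if millis is None:
        return None
    return EPOCH + timedelta(milliseconds=millis)


def publication_date(metadata: dict[str, str]) -> Optional[date]:
    """Combine yearPublished, monthPublished, and dayPublished into a date.

    Returns None unless all three are specified.
    """
    year = as_int(metadata, "yearPublished")
    month = as_int(metadata, "monthPublished")
    day = as_int(metadata, "dayPublished")
    if year is None or month is None or day is None:
        return None
    return date(year, month, day)


def has_variant(metadata: dict[str, str], wanted=("WA",)) -> bool:
    """Return True if the Document is written in one of the wanted variants."""
    return metadata.get("writtenLanguageVariant", "").strip() in wanted
```

For the sample Document, publication_date returns 7 August 1969 and has_variant returns True with the default argument. Passing wanted=("EA", "EA_RO") would instead select Eastern Armenian Documents once they are included.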
