OLiA Annotation Model for morphosyntactic annotation of Arabic, following Khoja et al. (2001)
Arabic grammar has been studied for centuries, and the principles of describing the language already exist. Since so much knowledge is readily available, it is logical to derive our tagset from this wealth of information. The alternative to this is to base the Arabic tagset on an Indo-European one, but by doing this we may lose a lot of the information that an Arabic tagset would give us. Also, by moulding Arabic to fit an Indo-European language, we might distort the way Arabic is perceived by its native speakers. (Khoja et al., 2001)
The prototype tagger reported in (Khoja 2003) was based on a lexicon of under 10,000 word-types, extracted from a corpus of about 50,000 word-tokens. The initial 50,000-word training corpus was extracted from the Saudi Al-Jazirah newspaper (date 03/03/1999); initial tagging experiments were done on other newspaper texts, and a social science paper. (Atwell 2007)
Unless specified otherwise, all comments are quotes from Khoja et al. (2001).
References:
Khoja, S, Garside, R, and Knowles, G (2001) A tagset for the morphosyntactic tagging of Arabic. Paper given at the Corpus Linguistics 2001 conference, Lancaster, http://zeus.cs.pacificu.edu/shereen/CL2001.pdf
Atwell, E (2007) Development of tag sets for part-of-speech tagging. Corpus Linguistics Conference 2007, Birmingham, http://www.comp.leeds.ac.uk/eric/atwell07clih.pdf
Adjectives are nouns that describe the aspects of an object. Adjectives inherit the properties of nouns, so they take "nunation" when in the indefinite and can take the definite article when definite. For example, alwld sghyr "The small boy" contains the adjective sghyr "small". This adjective can take the definite article as in "darasa alwaladu alsaghyr" "the small boy studied", and it can also have "nunation" as in "hasanu saghyrun" "Hassan is small".
The article is not described in Arabic as a category at all, and definiteness is just a linguistic feature that is realised by the definite article.
The way that definiteness is handled in Arabic is quite different to the way it is handled in EAGLES, and this difference is apparent when comparing the tagsets. In Arabic nouns are marked for definiteness by the prefix that is the definite article, unlike in English where the article itself can be definite or indefinite. In Arabic there is no indefinite article, and instead "nunation" is used.
Nunation is the doubling of the vowels at the end of nouns and all subclasses of nouns when they are indefinite. This doubling has the effect of adding a final "n" to the pronunciation, so that kitabu becomes kitabun "book".
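The two ways of marking definiteness described above can be sketched on transliterated forms. This is only an illustration, not part of the tagset: the function name mark_definiteness is hypothetical, the input is assumed to be a transliterated noun or adjective that already ends in its case vowel (as in kitabu), and the sketch ignores the assimilation of the article before certain consonants and any spelling conventions of the Arabic script.

```python
def mark_definiteness(noun, definite):
    """Mark a transliterated noun that ends in its case vowel.

    Definite forms take the article "al" as a prefix; indefinite forms
    take nunation, modelled here as a final "n" after the case vowel.
    """
    if definite:
        return "al" + noun
    return noun + "n"


print(mark_definiteness("kitabu", definite=False))
print(mark_definiteness("waladu", definite=True))
```

With the forms quoted in these notes, kitabu gives kitabun in the indefinite and waladu gives alwaladu in the definite.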
Arabic grammatical tradition already recognizes a subclass of Particle translated as Exceptions by Khoja, to cover some idiosyncratic words which do not fit other patterns: "... These include the Arabic words that are equivalent to the word except and the prefixes non-, un-, and im-."
(Atwell 2007)
The current Arabic tagset can be extended to include many other features such as transitivity and voice for verbs, and derivation for nouns.
Arabic has two voices, the active and the passive.
Nouns in Arabic can be derived from other nouns or they can be derived from verbs.
In Arabic there is no indefinite article, and instead "nunation" is used. Nunation is the doubling of the vowels at the end of nouns and all subclasses of nouns when they are indefinite. This doubling has the effect of adding a final "n" to the pronunciation, so that kitabu becomes kitabun "book".
The jussive is needed to express a command in the first and third person. This mood is realised in Arabic by rejecting the final vowel and is sometimes called the apocopated imperfect.
The purpose of the jussive is to express a command in the first or third person as in "heena yahduru yalbas thiyaban nazyfatan" which means "when he attends, let him (he must) wear clean clothes", where the jussive is the word "yalbas" "wear". Also, there is no negative imperative in Arabic, so the negative particle followed by the jussive is used in its place, such as "la taktub" "do not write", where the jussive is the word "taktub" "write".
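The formation of the jussive and of the negative command can be sketched on transliterated forms. The sketch assumes simple singular imperfect forms that end in a short vowel; the indicative inputs yalbasu and taktubu are supplied here for illustration and are not quoted from Khoja et al. (2001), and the function names are hypothetical. Forms with other endings would need further rules that are not covered.

```python
SHORT_VOWELS = ("a", "i", "u")


def jussive(indicative):
    """Drop the final short vowel of a transliterated imperfect form."""
    if indicative.endswith(SHORT_VOWELS):
        return indicative[:-1]
    return indicative


def negative_imperative(indicative):
    """Build a negative command: negative particle "la" plus the jussive."""
    return "la " + jussive(indicative)


print(jussive("yalbasu"))
print(negative_imperative("taktubu"))
```

Under these assumptions the two calls yield yalbas and la taktub, the forms cited in the examples above.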
All subclasses of the noun inherit the tnwyn "nunation" when in the indefinite, which is one of the main properties of the noun: Nunation is the doubling of the vowels at the end of nouns and all subclasses of nouns when they are indefinite. This doubling has the effect of adding a final "n" to the pronunciation, so that kitabu becomes kitabun "book".
Examples of these subcategories include:
- Singular, masculine, accusative, common noun such as ktab "book" in the sentence "akhadha alwaladu kitaban" "the boy took a book".
- Singular, masculine, genitive, common noun such as ktab "book" in the sentence "darastu min kitabin" "I studied from a book".
- Singular, feminine, nominative, common noun such as madrasa "school" in the sentence "hadhihi madrasatun" "this is a school".
The linguistic attributes of nouns that have been used in this tagset are:
(i) Gender: M [masculine] F [feminine] N [neuter]
(ii) Number: Sg [singular] Pl [plural] Du [dual]
(iii) Person: 1 [first] 2 [second] 3 [third]
(iv) Case: N [nominative] A [accusative] G [genitive]
(v) Definiteness: D [definite] I [indefinite]
Examples of numerals include:
- Singular, masculine, nominative, indefinite cardinal number such as "arba'atun" "four".
- Singular, masculine, nominative, indefinite ordinal number such as "rabi'un" "fourth".
- Singular, masculine, numerical adjective such as "ruba'iyun" "of four".
Numerical adjectives describe the number of sides to a shape, for example thmany "octagonal". They also indicate a pair or couple when describing two people as in thna`y.
Tag       Description                               Transliteration  Gloss
NNuNaSgM  Singular, masculine, numerical adjective  rubaa'y          Of four
NNuNaSgF  Singular, feminine, numerical adjective   rubaa'iya
Particles are sometimes affixes; for example the definite article "al" is well-known as a prefix in Arabic loan-words in other languages, e.g. algebra, Algarve. These are handled by a compound tag, reminiscent of the Brown tagging scheme: "... For morphologically complex words a combination of tags is used. For example, the word walktab 'and the book' is given the tag PC+NCSgMND, where PC indicates a particle that is a conjunction, and NCSgMND indicates a singular, masculine, nominative, definite noun." (Atwell 2007)
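A compound tag of this kind can be taken apart mechanically. The following is a minimal sketch, assuming that components are joined by "+" and that a common-noun tag is laid out as NC followed by number, gender, case and definiteness codes, which is inferred from the single example NCSgMND rather than stated as a rule. The function names are hypothetical, and only the two tag shapes from the example are covered.

```python
import re

NUMBER = {"Sg": "singular", "Pl": "plural", "Du": "dual"}
GENDER = {"M": "masculine", "F": "feminine", "N": "neuter"}
CASE = {"N": "nominative", "A": "accusative", "G": "genitive"}
DEFINITENESS = {"D": "definite", "I": "indefinite"}

# Assumed layout of a common-noun tag: NC + number + gender + case + definiteness.
# "N" is read as neuter or nominative depending on its position.
COMMON_NOUN = re.compile(r"NC(Sg|Pl|Du)([MFN])([NAG])([DI])")


def decode_simple_tag(tag):
    """Describe one simple tag; only PC and NC... tags are covered here."""
    if tag == "PC":
        return "particle, conjunction"
    match = COMMON_NOUN.fullmatch(tag)
    if match is None:
        raise ValueError(f"tag not covered by this sketch: {tag}")
    number, gender, case, definiteness = match.groups()
    return ", ".join([NUMBER[number], GENDER[gender], CASE[case],
                      DEFINITENESS[definiteness], "common noun"])


def decode_compound_tag(tag):
    """Split a compound tag on '+' and describe each component in order."""
    return [decode_simple_tag(part) for part in tag.split("+")]


print(decode_compound_tag("PC+NCSgMND"))
```

Decoding by position matters because the code N is used both for neuter gender and for nominative case.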
Tag       Description                                           Transliteration             Gloss
NPrPSg1   First person, singular, neuter, personal pronoun      ana - kitaabee - darabanee  Me - my book - he hit me
NPrPSg2M  Second person, singular, masculine, personal pronoun  anta - kitaabuka            You - your book
NPrPSg2F  Second person, singular, feminine, personal pronoun   anti - kitaabuki            You - your book
NPrPSg3M  Third person, singular, masculine, personal pronoun   kitaabahu - huwa            His book - him
NPrPSg3F  Third person, singular, feminine, personal pronoun    kitaabuhaa - hiya           Her book - her
NPrPDu2   Second person, dual, neuter, personal pronoun         antumaa - kitaabakumaa      You two - your book
NPrPDu3   Third person, dual, neuter, personal pronoun          humaa - kitaabahumaa        Those two - their book
NPrPPl1   First person, plural, neuter, personal pronoun        nahnu - kitaabunaa          Us - our book
NPrPPl2M  Second person, plural, masculine, personal pronoun    antum - kitaabakum          You - your book
NPrPPl2F  Second person, plural, feminine, personal pronoun     antunna - kitaabakunna      You - your book
NPrPPl3M  Third person, plural, masculine, personal pronoun     hum - kitaabahum            Them - their book
NPrPPl3F  Third person, plural, feminine, personal pronoun      kitaabahunna - hunna
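The personal pronoun tags listed above follow a regular pattern that can be read off mechanically. This is a sketch under the assumption that every such tag is NPrP followed by a number code, a person digit and an optional gender letter, with a missing gender letter standing for neuter; that pattern is inferred from the listed tags, and the function name describe_pronoun_tag is hypothetical.

```python
import re

PERSON = {"1": "first", "2": "second", "3": "third"}
NUMBER = {"Sg": "singular", "Du": "dual", "Pl": "plural"}
GENDER = {"M": "masculine", "F": "feminine", "": "neuter"}

# Assumed layout: NPrP + number + person + optional gender letter.
PERSONAL_PRONOUN = re.compile(r"NPrP(Sg|Du|Pl)([123])([MF]?)")


def describe_pronoun_tag(tag):
    """Expand a personal pronoun tag into its description."""
    match = PERSONAL_PRONOUN.fullmatch(tag)
    if match is None:
        raise ValueError(f"not a personal pronoun tag: {tag}")
    number, person, gender = match.groups()
    text = (f"{PERSON[person]} person, {NUMBER[number]}, "
            f"{GENDER[gender]}, personal pronoun")
    return text.capitalize()


for tag in ("NPrPSg1", "NPrPDu3", "NPrPPl2F"):
    print(tag, "=", describe_pronoun_tag(tag))
```

The pattern accepts any combination of the three fields, including combinations that are not listed above, so it should be read as a decoder for well-formed tags rather than as a definition of the tagset.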
Since Arabic contains three different kinds of plurals, these could also be included in the tagset. The three types of plurals are: the masculine sound plural, the feminine sound plural, and the broken plural. The first two are recognised in Arabic morphologically by suffixes, while the last is derived by following fixed patterns.
The personal pronouns can be detached words such as hwa "he", or attached to a word in the form of a clitic. The attached pronouns can be attached to nouns to indicate possession, to verbs as direct object, or attached to prepositions such as fyh "in it".
Some examples of pronouns include:
- Third person, singular, masculine, personal pronoun such as hwa "him".
- Singular, feminine, demonstrative pronoun such as hdhh "this".
Examples of relative pronouns include:
- Dual, feminine, specific, relative pronoun such as alltan "who".
- Plural, masculine, specific, relative pronoun such as alldhyn "who".
- Common, relative pronoun such as "men" "who".
Verbs are defined in ancient Arabic grammar as being perfect, imperfect or imperative. This classification is an important part of Arabic, and trying to mould Arabic verbs to fit the traditional past, present and future tenses of Indo-European languages would be very unnatural.
Examples of verbs include:
- First person, singular, neuter, perfect verb "kasartu" "I broke".
- First person, singular, neuter, indicative, imperfect verb "aksiru" "I break".
- Second person, singular, masculine, imperative verb "aksir" "Break!".
The verbal attributes that have been used in our tagset are:
(i) Gender: M [masculine] F [feminine] N [neuter]
(ii) Number: Sg [singular] Pl [plural] Du [dual]
(iii) Person: 1 [first] 2 [second] 3 [third]
(iv) Mood: I [indicative] S [subjunctive] J [jussive]
The two most notable verbal attributes that are fundamental to Arabic but do not normally appear in Indo-European tagsets are the dual number, and the jussive mood.
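The verbal attributes can be combined into descriptions of the kind used in the verb examples. The sketch below is illustrative only: the function name describe_verb and its parameters are hypothetical, and it assumes that mood is given for imperfect verbs and omitted for perfect and imperative verbs, as in the three examples, which is not stated as a rule in the tagset description.

```python
PERSON = {1: "first", 2: "second", 3: "third"}
NUMBER = {"Sg": "singular", "Pl": "plural", "Du": "dual"}
GENDER = {"M": "masculine", "F": "feminine", "N": "neuter"}
MOOD = {"I": "indicative", "S": "subjunctive", "J": "jussive"}
ASPECTS = ("perfect", "imperfect", "imperative")


def describe_verb(aspect, person, number, gender, mood=None):
    """Build a description from attribute codes, e.g. (1, "Sg", "N", "I")."""
    if aspect not in ASPECTS:
        raise ValueError(f"unknown verb class: {aspect}")
    # Assumption: mood is marked on imperfect verbs and on no others.
    if (mood is not None) != (aspect == "imperfect"):
        raise ValueError("mood is expected for imperfect verbs only")
    parts = [f"{PERSON[person]} person", NUMBER[number], GENDER[gender]]
    if mood is not None:
        parts.append(MOOD[mood])
    parts.append(f"{aspect} verb")
    return ", ".join(parts).capitalize()


print(describe_verb("imperfect", 1, "Sg", "N", mood="I"))
print(describe_verb("imperfect", 3, "Du", "M", mood="J"))
```

The second call combines the dual number and the jussive mood, the two attributes singled out above as unusual for Indo-European tagsets.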