emille

Ontology - emille

Abstract: OLiA Annotation Model for the morphosyntactic annotation of the Urdu section of the EMILLE corpus (Hardie 2003, 2004). Unless marked otherwise, all comments are quotes from Hardie (2004), Chapter 3. The tagset discussed here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of corpora. Although these guidelines were written to cover the languages of the European Union, they can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo-European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies presented by Urdu grammar. (Hardie 2003) The first stage of the work was to develop a tagset for use in Urdu texts and corpora, an area which has not been researched extensively heretofore. The next stage, now underway, is to test the tagset’s usability in manual tagging, and build up a set of tagged texts to serve as training data for the final phase of this part of the project. This will be to automate the tagging and subsequently tag the whole of the EMILLE Urdu corpus. (Hardie 2003) References: Hardie, A. (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In: Corpus Linguistics 2003, 2003-03-01, Lancaster. http://eprints.lancs.ac.uk/103/ Hardie, A. (2004) The computational analysis of morphosyntactic categories in Urdu. Other thesis, Lancaster University. http://eprints.lancs.ac.uk/106/ Schmidt, R. L. (1999) Urdu, an essential grammar, Routledge, London.
Latest Version: http://purl.org/olia/emille.owl#

Imports

http://purl.org/olia/system.owl olia_system

Classes

Abbreviation
SubClass Of	Residual
Acronym
SubClass Of	Residual
AdjectivalOccupationalParticle
Abstract	Adjectival / occupational particle (v?l?). This element is the source of the English word / suffix ‘wallah’ (Kachru 1990: 70), which may help the reader to gain some grasp on its meaning.
SubClass Of	Unique
AdjectivalParticle
Abstract	Adjectival particle (s?)
SubClass Of	Unique
Adjective
Abstract	use ... refers to whether an adjective may be used in attributive or predicative positions only. The default value for this is naturally both. In the absence of a specification in the EAGLES guidelines, I represent this with 0. There are a number of common Perso-Arabic adjectives in Urdu that can only be used in predicative position (Schmidt 1999: 37), for which this attribute can take the value 2. This is the rationale for including this attribute, which is however a prime candidate to be underspecified in a practical subtagset. It is anticipated that it will be difficult for a POS tagger to detect predicate-only adjectives. Since the predicate-only adjectives are Perso-Arabic, it ought to follow that they are all unmarked adjectives. However, this is a point on which Schmidt (1999) is silent. For this reason, tags have been included for predicate-only adjectives that are marked for gender/number/case. These may need to be removed if it turns out from the data that they do indeed describe nonexistent categories, as I suspect
SubClass Of	PartOfSpeech
Sub-Classes	AttributiveOrPredicativeAdjective PredicativeAdjective
Adposition
Abstract	It should be noted at the outset that I treat as adpositions those elements of Urdu that some writers (e.g. Kellogg 1875, Butt 1995) describe as case suffixes or clitics. This is firstly because Schmidt (1999), the model of the language being used, does so. Secondly, however, treating n? (among other markers) as adpositions allows theoretical neutrality to be maintained on the question of whether Urdu displays ergativity (see also the discussion of the ergativity controversy in 1.1.5.4 and the discussion of noun cases and the etymology of postpositions in 3.1.3). The EAGLES guidelines give only one attribute for adpositions, Type, which has a range of recommended and optional values: preposition, fused preposition-article, postposition, and circumposition. The second and fourth of these do not apply to Urdu, which lacks articles and circumpositions. The vast majority of Urdu adpositions are postpositions, but there are some prepositions borrowed from Persian and Arabic (Schmidt 1999: 68, 250, 267), so this attribute is relevant. There are two other issues. The first is that of iz?fat (Bhatia and Koul 2000: 339; Schmidt 1999: 246-247). The iz?fat is a Persian enclitic (pronounced as a shorter form of ???) which in some circumstances can be considered a preposition: it links two nouns in a possessive relationship, although the phrase thus produced may often have a different meaning to a phrase produced with the native Urdu postposition k?. However, the iz?fat may also join a noun to an adjective, in which case it is not so clearly accurate to describe it as a preposition parallel to the prepositions in European languages for which the EAGLES guidelines were compiled. A better way to treat iz?fat is in the context of the Unique category of miscellaneous one-member wordclasses, discussed below. The second issue is that in Urdu, the postposition k? can be marked for number/gender/case agreement (Schmidt 1999: 68-69). It does not agree with the noun it governs, but with the head noun of the noun phrase that contains its postposition phrase. This is not a phenomenon allowed for by the EAGLES guidelines as they now stand. k? takes the same inflectional endings as marked adjectives (having the forms k?, k?, and k?). Therefore, it is necessary for the same number/gender/case categories to be distinguished by the tagset for postpositions as for adjectives. This means that the intermediate tagset contains three more attributes than are suggested in the EAGLES guidelines.
SubClass Of	PartOfSpeech
Sub-Classes	Postposition Preposition
Adverb
Abstract	As with verbs, there are lexical and non-lexical adverbs, which will be considered in turn. In the EAGLES guidelines, the recommended attribute for adverbs is degree, which is not relevant morphologically to Urdu (as discussed with reference to adjectives: see 3.3 above). This use of ‘degree’ (i.e. inflected superlative or comparative) should be clearly distinguished from the use of ‘degree adverb’ below (i.e. words with meanings such as ‘very’, ‘more’). However, the remaining three features are relevant, and have been included. These are adverb-type, which distinguishes general and degree adverbs, and polarity and wh-type, which distinguish interrogative and relative pronouns. The following summarises the features used in the intermediate tagset. There are a total of 13 adverb tags.
SubClass Of	PartOfSpeech
Sub-Classes	GeneralAdverb NonLexicalAdverb
Article
Abstract	Urdu lacks articles. However, some phrases borrowed from Arabic contain the clitic Arabic definite article, which receives the single tag AL (the spelling of the Arabic article). I have not included a C in this tag, as I have done for other clitics (see section 3.12), because this would make the tag less transparent. The use of the AT intermediate tag could be queried here, because the use of the Arabic definite article in Urdu does not parallel that of, for example, the in English or le/la/les in French. For example, the Arabic definite article is only found with Arabic loanwords, whereas of course the can appear with the vast majority of nouns in English. However, on balance it seems that this disadvantage is outweighed by the advantage of indicating that the Arabic definite article in Urdu does do pretty much what other languages’ articles do. Khoja et al.’s (2001) Arabic tagset does not have a separate tag for the article, but considers definiteness a feature of nouns: this would not be an appropriate approach for Urdu because non-Arabic nouns cannot be made definite by use of the Arabic definite form.
SubClass Of	PartOfSpeech
Aspect
SubClass Of	system:Feature
Sub-Classes	ImperfectiveAspect PerfectiveAspect
AttributiveOrPredicativeAdjective
Abstract	Adjective/Use=both
SubClass Of	Adjective
AuxiliaryVerb
Abstract	It should be noted that, whereas I have in this category treated all auxiliary elements as verbs, in the terms of the EAGLES guidelines for intermediate tagsets some could easily be characterised as unique or unassigned words (see below). The EAGLES guidelines treat the English infinitive marker to in this manner, for example. However, treating them as verbs in the intermediate is firstly in keeping with the structure of the Urdu tagset, and secondly allows verbal attributes such as gender and number to be used (the EAGLES unique intermediate tags include no such attributes).
SubClass Of	Verb
Sub-Classes	CahieAuxiliary GaAuxiliary HonaAuxiliary RahaAuxiliary
CahieAuxiliary
Abstract	The word c?hi? is used in combination with the infinitive of a lexical verb to express advisability. It is also used (as described by Bhatia and Koul 2000: 60) as a polite form of the verb c?hn?, ‘want’. It is derived from an old morphologically marked passive form (Schmidt 1999: 137) of c?hn?; however, c?hn? is a lexical verb and other than this use of c?hi?, it does not deviate from the pattern of other lexical verbs. Therefore the best approach would seem to be to give c?hi? its own tags (it requires two tags because it agrees with the number of the object of the preceding infinitive in certain circumstances). This is the approach taken in many English tagsets for modal auxiliary verbs, which are, like c?hi?, anomalous forms. The intermediate tags given to c?hi? and its plural form c?hi?~ list them as being without person or gender, without finiteness (since it can be used with or without a following tense-bearing auxiliary), indicative, present tense and without aspect. In the descriptions, these words are defined as ‘c?hi?-type’, rather than attempt to find an English word to accurately summarise the range of meanings associated with desirability and/or advisability that these words can convey.
SubClass Of	AuxiliaryVerb
CardinalNumber
Abstract	Numeral/type=cardinal. Cardinal numbers function as grammatically unmarked determiner-like adjectives (Schmidt 1999: 228). However, they can appear in the oblique plural, with the same suffix as an unmarked noun, to express totality (Schmidt 1999: 10-11). There is therefore an additional tag for this (indicated only by O, since there is no oblique singular to make a contrast). In the intermediate tagset I have given their function as determiner, in line with the determiners that are in the pronoun category above. Numerals are to be tagged as below, even if written as figures rather than words (and whatever set of figures are used: Urdu uses both the Western European and the Arabic-Indic digits).
SubClass Of	Numeral
Case
Abstract	In the model of the language given by Schmidt, Urdu has three cases, nominative, oblique and vocative. McGregor (1972: 1-2) uses a different classification, treating the vocative as a special form of the oblique case. However, since the special form would still need to be tagged separately, it makes sense to treat it as a vocative case, a phenomenon for which the EAGLES guidelines already allow. As Schmidt (1999: 7) points out, some grammarians have treated Urdu postpositions as being either suffixes or clitics indicating cases, in which case Urdu would possess many more than three cases. However, this is a minority view amongst writers of general grammars: Schmidt (1999), Barz (1977), Bhatia and Koul (2000), McGregor (1972), Bailey et al. (1956) all do not treat postpositions as marking cases. There is an etymological basis for this view. Kellogg (1875: 128-133) reports that the postpositions do not derive from Sanskrit case markers, but rather from independent words (e.g. k?, ‘to’, from Sanskrit k?kshe, ‘armpit, side’; m?~, ‘in’, from Sanskrit madhye, ‘middle’, both locative nouns; tak, ‘until’, from the Sanskrit past participle tarita, ‘passed to’, plus a dative affix ku). Furthermore, the suffix/clitic approach would require case to be determined across multi-token units, which would breach the design principle of including no multiword tags. It would also have implications for the principle of theoretical neutrality, since it would be necessary to take some standpoint on the subject of whether or not Urdu has ergative case marking, a theoretically controversial point (see 1.1.5.4). Thus I use the nominative-oblique-vocative distinction as exemplified below: laRk?, laRk? ‘boy(s)’ (nominative singular/plural); laRk?, laRk?~ (oblique singular/plural); laRk?, laRk? (vocative singular/plural) (example from Schmidt 1999: 10-12). There is something of an issue with the names of the cases. Vocative is straightforward enough, and is one of the values given for the case attribute in the EAGLES guidelines. Nominative, however, is usually given meaning by its contrast with accusative, a case that does not exist in Urdu. The nominative may in Urdu be used for either, neither or both of the subject and the direct object. Thus it is not certain whether the nominative in Urdu really corresponds with the nominative that is value 1 in the EAGLES guidelines. Certainly it does not correspond with the nominative as it exists in, for example, German or Latin. However, I have used value 1 in the intermediate tagset for the Urdu case, on the basis that no Urdu case resembles the nominative in the European languages for which the EAGLES guidelines were devised any more closely than the Urdu nominative. There is no value in the EAGLES guidelines for oblique. Nor is there one for postpositional, locative or instrumental (alternative names used by Bailey et al. 1956 for this case). Rather than invent an extra value (undesirable for reasons given with regard to markedness above), I have used the value for dative to represent oblique, on the grounds that in some European languages (e.g. German) prepositions frequently govern the dative, and in Urdu postpositions govern the oblique.
SubClass Of	system:Feature
Sub-Classes	NominativeCase ObliqueCase VocativeCase
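
The case decisions described in the Case abstract can be summarised as a small lookup. The following is a minimal sketch only: the dictionary and function names are hypothetical, and the EAGLES values are given by name rather than by number, since the abstract states a number only for nominative (value 1).

```python
# Hypothetical sketch: these names are illustrative and not part of the ontology.
# Each Urdu case is paired with the EAGLES case value that stands in for it
# in the intermediate tagset, as described in the Case abstract.
URDU_TO_EAGLES_CASE = {
    "nominative": "nominative",  # EAGLES value 1, used despite the imperfect match
    "oblique": "dative",         # EAGLES has no oblique value; dative is reused
    "vocative": "vocative",      # already provided by EAGLES
}


def eagles_case(urdu_case: str) -> str:
    """Return the EAGLES case label that represents the given Urdu case."""
    try:
        return URDU_TO_EAGLES_CASE[urdu_case.lower()]
    except KeyError:
        raise ValueError(f"not an Urdu case in this model: {urdu_case!r}") from None
```

Reusing an existing value for oblique, rather than inventing a new one, follows the same reasoning the abstract gives for markedness: software reading the intermediate tags never meets a value it does not recognise.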
CliticExclusiveEmphaticParticle
Abstract	Clitic exclusive emphatic particle ((h)?(~))
SubClass Of	Unique
CliticPostposition
Abstract	?, ?~, h?~ A form of k? added to a pronoun.
SubClass Of	Unique
CloseParenthesis
SubClass Of	Punctuation
CloseQuotationMark
SubClass Of	Punctuation
CloseSquareBracket
SubClass Of	Punctuation
Colon
SubClass Of	Punctuation
Comma
SubClass Of	Punctuation
CommonNoun
Abstract	Noun/Type=common
SubClass Of	Noun
CompoundFormingConjunction
Abstract	Persian compound-forming conjunction (?)
SubClass Of	Unique
Conjunction
Abstract	The EAGLES guidelines suggest that conjunctions be classified firstly for whether they are coordinating or subordinating, and then secondly as one of four coordinating types or one of three subordinating types. I have disregarded the attribute for subordinate-type, since it was developed for German and does not seem relevant to Urdu subordinating conjunction as described by Schmidt (1999: 223-227). Urdu correlative conjunctions (such as bh??bh?, y??y?) do not have initial and non-initial forms, so those features are also not needed. This gives three types of conjunctions: simple coordinating, correlative coordinating, and subordinate. Note that phrases involving the relative j-set of pronouns, adjectives and adverbs are often translated by conjunctions, but are not to be tagged as such.
SubClass Of	PartOfSpeech
Sub-Classes	CoordinatingConjunction CorrelativeCoordinatingConjunction SubordinatingConjunction
ContrastiveEmphaticParticle
Abstract	Contrastive emphatic particle t?
SubClass Of	Unique
CoordinatingConjunction
SubClass Of	Conjunction
CorrelativeCoordinatingConjunction
Abstract	The EAGLES guidelines (Leech and Wilson 1999: 68) specify that a conjunction is correlative when it is at the start of the first of a pair of correlated clauses. The conjunction at the start of the second half of the pair is then a simple coordinating conjunction (CC). This practice will be followed to ensure compliance with the EAGLES guidelines.
SubClass Of	Conjunction
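
The pairing rule in the CorrelativeCoordinatingConjunction abstract can be sketched as follows. This is an illustration under assumptions: the function name and the placeholder label `CORRELATIVE` are hypothetical, since the abstract names only the simple coordinating tag, CC.

```python
# Hypothetical sketch of the EAGLES pairing rule quoted in the abstract.
# "CC" comes from the abstract; "CORRELATIVE" is a placeholder label, not a real tag.
def tag_correlative_pair(first_conjunction, second_conjunction,
                         correlative_tag="CORRELATIVE"):
    """Tag the first conjunction of a correlated pair as correlative, the second as CC."""
    return [
        (first_conjunction, correlative_tag),  # start of the first correlated clause
        (second_conjunction, "CC"),            # start of the second clause: simple coordinating
    ]
```

Because Urdu correlative conjunctions have no separate initial and non-initial forms, the two tokens passed in will normally be the same word; only their position in the pair decides the tag.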
DeadjectivalAdverb
Abstract	In Urdu these are of two sorts: adverbs which are derived from adjectives by inflecting them to their masculine oblique form or adding a Persian or Arabic loaned derivational suffix (RRJ), and adverbs which are not (RR). While this unfortunately violates the principle of not including derivational information, this distinction has been included in the tagset for two reasons. Firstly, it helps avoid ambiguity, since an adverb derived from an adjective has the same form as that adjective in its masculine singular oblique form (see Schmidt 1999: 57). If adjectival adverbs were marked RR, this would lead to a wide ambiguity between RR and JJM1O, which would make non-adjectival adverbs ambiguous as well! Using a separate tag, there is only an RRJ~JJM1O ambiguity, which significantly reduces the scope of the ambiguity. Although this is a pragmatic consideration which should probably be included at the subtagset level, it involves creating a distinction rather than collapsing one, and must thus exist in the top level tagset. However, there is another motivation for the RRJ tag, which is that it is necessary to maintain theoretical neutrality. It is possible that some analyst might wish to treat the RRJ adverbs as if they were actually adjectives, that is, identify them with JJ? categories instead of RR. Indeed Bailey et al. (1956: 18) come close to saying this. The principle of theoretical neutrality must here override the principle of excluding derivational information.
SubClass Of	GeneralAdverb
DegreeAdverb
SubClass Of	NonLexicalAdverb
DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
Abstract	Third person pronouns/demonstratives, interrogative and relative pronouns and determiners. This class of pronouns consists of all those pronouns that fall into the parallel classes of what Schmidt (1999: 39) calls ‘symmetrical y-v-k-j word sets’. These classes contain a variety of pronouns and adjectives that are of similar form, the first letter indicating what set they belong to, thus: y or a vowel indicates the set of proximal demonstratives (this, now, etc.); v or t indicates the set of distal demonstratives (that, then, etc.); k indicates the set of interrogatives (who, what, how, etc.); j indicates the set of relative words (who, where, whither, etc.). Thus, in Urdu there is 1) a significant distinction between proximal and distal words, for which there is no distinction in the EAGLES guidelines; 2) a significant distinction between interrogatives and relatives, which is only made by the EAGLES guidelines at the secondary optional level (the recommended features include only int./rel., presumably on the basis that these have similar forms in many European languages: the so-called wh-words). This means that the intermediate tags for these pronouns are not as elegant as they might be, and the tags for the y-set and the v-set are the same. However, I will make this distinction in the Urdu tags, which begin with P followed by the letter of the relevant y-v-k-j set. The proximal and distal demonstratives have not been distinguished for any other language that I am aware of. For example, no English tagset I know of distinguishes here/hither from there/thither. However, most distinguish where/whither from the non-interrogative/relative words. In Urdu, the ‘near~far’ phonological pattern is much more consistent (there are no odd pairs such as English this~that) and is formally of an equal degree to the ‘demonstrative~interrogative’ distinction. Furthermore, there is a difference of usage between the proximal and distal sets: the latter are used in correlative clauses where the former are not (there is one minor exception to this; see Schmidt 1999: 206). For this reason I tag the four-way distinction, since it would be odd to arbitrarily merge two of what are on a language-internal basis clearly different categories. The pronouns in the y-v-k-j sets are used as demonstrative pronouns and third person personal pronouns (so yah and vah mean both ‘this’ and ‘that’ and ‘he/she/it’). (These two words are almost always transcribed as y? and v?, which is how they are pronounced. However, the spellings with h are closer to the Perso-Arabic; Bhatia and Koul 2000: 36.) They can also act as determiners within a noun phrase. I have not tagged these uses differently, because this would fall under the heading of syntactic information, which this tagset does not include. See also section 3.4.1.1. I do not, as Schmidt (1999: 38-41) does, characterise the determiner-usage as adjectival, since these pronouns do not display gender agreement, as adjectives (including other members of the y-v-k-j sets) do. They are however marked for case and number (although in the nominative case the singular and plural forms are identical). They also have the peculiarity that their plurals have a third case-like form, which appears solely before the postposition n? (which indicates the subject of an ergative-type clause). This is tagged separately (and, like the proximal/distal distinction, not distinguished in the intermediate tagset, since it is difficult to see how this could be achieved). There are two interrogative pronouns, both beginning in k; one means ‘what’ and one means ‘who’. They both receive the same tags, since tagging an animacy distinction would be odd when this is done nowhere else in the tagset. In the intermediate tagset, following what is done for such pronouns in the example English tagset given in the EAGLES guidelines I give person as zero, and for the k-set words the wh-type is ?2, since ky? may also be exclamatory. The category attribute is both, because these words are both pronouns and determiners. There are also in the y-v-k-j sets a number of words that are more like determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone as pronouns. However they behave in some respects more like adjectives, e.g. they can be predicative rather than attributive. In terms of the EAGLES guidelines they are best characterised within the pronoun/determiner category. They correspond to English words like ‘such’, ‘this/that much/many’ and so on. In terms of the Urdu tagset, I have classified them as JD: determiner-like adjectives.
SubClass Of	PronounOrDeterminer
Sub-Classes	DistalDemonstrativeAdjective DistalDemonstrativePronoun InterrogativeAdjective InterrogativePronoun ProximalDemonstrativeAdjective ProximalDemonstrativePronoun RelativeAdjective RelativePronoun
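
The first-letter pattern of the y-v-k-j word sets can be sketched as a lookup on a transliterated word. This is an illustration under assumptions: the function name is hypothetical, the upper-case tag prefixes are a guess at how "P followed by the letter of the relevant set" is spelled, and the check is only meaningful for words already known to belong to one of the sets.

```python
import unicodedata

# The abstract says the Urdu tags begin with P plus the letter of the set;
# the upper-case spelling of the second letter is an assumption of this sketch.
SET_PREFIX = {
    "proximal": "PY",
    "distal": "PV",
    "interrogative": "PK",
    "relative": "PJ",
}


def yvkj_set(word: str) -> str:
    """Name the y-v-k-j set suggested by the first letter of a transliterated word."""
    if not word:
        raise ValueError("empty word")
    # Decompose the first character so a vowel carrying a diacritic
    # (for example a macron) is still recognised as a vowel.
    first = unicodedata.normalize("NFD", word)[0].lower()
    if first == "y" or first in "aeiou":
        return "proximal"
    if first in "vt":
        return "distal"
    if first == "k":
        return "interrogative"
    if first == "j":
        return "relative"
    raise ValueError(f"{word!r} does not start like a y-v-k-j word")
```

For the two pronouns cited in the abstract, `yvkj_set("yah")` gives the proximal set and `yvkj_set("vah")` the distal set. The intermediate (EAGLES) tags would not reflect this difference, since the abstract notes that the y-set and v-set share the same intermediate tags.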
DistalDemonstrativeAdjective
Abstract	There are also in the y-v-k-j sets a number of words that are more like determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone as pronouns. However they behave in some respects more like adjectives, e.g. they can be predicative rather than attributive. In terms of the EAGLES guidelines they are best characterised within the pronoun/determiner category. They correspond to English words like ‘such’, ‘this/that much/many’ and so on. In terms of the Urdu tagset, I have classified them as JD: determiner-like adjectives.
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
DistalDemonstrativeAdverb
SubClass Of	PronominalAdverb
DistalDemonstrativeDeadjectivalAdverb
SubClass Of	PronominalAdverb
DistalDemonstrativePronoun
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
ExclamationMark
SubClass Of	Punctuation
ExclusiveEmphaticParticle
Abstract	Exclusive emphatic particle (h?)
SubClass Of	Unique
FeminineGender
Abstract	Gender=feminine
SubClass Of	Gender
Finite
Abstract	Finiteness=finite
SubClass Of	Finiteness
Finiteness
Abstract	The last two attributes, finiteness and mood, are problematic. Firstly, inherent in the EAGLES guidelines is the problem that the mood attribute contains values relevant to both finite and non-finite forms, so that the finiteness attribute becomes redundant. Secondly, the finite/non-finite distinction may be hard to draw in Urdu. The forms described below as participles would traditionally be considered non-finite in European languages. However, in Urdu they have certain features which make them seem more like finite forms. For example, they can occur as the only verb in a main clause, and can agree with a subject or object, which is not a property prototypically associated with non-finite forms. These properties are illustrated by the following example from Schmidt (1999: 126): unh?~ n? an paRh k? b?t nah m?n? (3-PLRL-OBL ERG un educated of-FEM speech not accept-PERF.PART-FEM-SING) ‘They did not accept what the uneducated person said.’ The verb form m?n? is a participle, but it is the only verb form in the sentence, and it is marked for agreement (with the object, since this clause is of the ergative type). It, like the postposition k?, agrees with the feminine singular noun b?t.
SubClass Of	system:Feature
Sub-Classes	Finite NonFinite
FirstPerson
Abstract	Person=first
SubClass Of	Person
ForeignWord
SubClass Of	Residual
Formula
SubClass Of	Residual
Fraction
Abstract	Urdu has a fairly wide range of words for fractions (there are for example words for ‘plus one quarter’ (sav?), ‘less one quarter’ (paun, paun?), ‘one half’ (?dh, ?dh?), ‘one and a half’ (D?Rh), ‘plus one half’ (s?Rh?)), which can modify cardinal numerals as well as nouns. They are therefore tagged separately (although the intermediate tags are not all distinct). Most are unmarked, but two are marked. Two others can also function as nouns, in which case they should receive standard noun tagging.
SubClass Of	Numeral
FullStop
SubClass Of	Punctuation
FutureTense
Abstract	Tense=future
SubClass Of	Tense
GaAuxiliary
Abstract	The form g? indicates future tense when it follows a verb in the subjunctive form. It may also follow the polite imperative as a marker of additional politeness (Bhatia and Koul 2000: 332). It is considered by Schmidt (1999) to be a suffix, although one that is written as a separate word; Bhatia and Koul (2000) go so far as to write the inflected verb and the g? as a single word. However, given that the orthography must lead g? to be treated by the tagging system as a separate token (see 2.2.6.1), and given that the form of the future is otherwise identical to the subjunctive, it makes sense to tag g? separately from the lexical verb. Since g? is marked for gender and number and the subjunctive is marked for person and number, the future would, if treated as a simple rather than a compound tense, be marked for all three of these features, which is not true of any other simple tense in Urdu. Furthermore, as Schmidt (1999: 94) explains, g? derives from a contraction of the perfective participle of the verb j?n?, ‘go’. Therefore, g? is tagged independently. In the intermediate tagset it is considered to be finite, indicative, future, and with zero aspect.
SubClass Of	AuxiliaryVerb
Gender
Abstract	Urdu has two genders, masculine and feminine. Some nouns are marked for gender, whereas others are not. This means that there is in effect a four-way distinction among nouns: masculine marked, masculine unmarked, feminine marked and feminine unmarked. For example: r?payah ‘money’ (marked masculine); ghar ‘house’ (unmarked masculine); bacc? ‘female child’ (marked feminine); kit?b ‘book’ (unmarked feminine) (examples from Schmidt 1999: 1-2). Note that since some unmarked nouns coincidentally display the suffixes typical of marked nouns, the diagnostic feature of a marked noun is that its plural inflection follows that of the marked nouns (e.g. masculine ?? changing to ??, feminine ?? to ?iy?~, and so on). This four-way split could be encoded into a tagset in two ways: by creating two new values for the gender attribute (the EAGLES guidelines have only masculine, feminine, neuter, and common) or by creating a new markedness attribute with two values, 1 = marked for gender and 2 = not marked for gender. The latter approach has been followed since it will almost certainly be easier for software processing the intermediate tagset to ignore an entire attribute than to work out what to do about values it does not recognise in existing attributes. This is especially the case if the extra attribute is added at the end of the tag, as I have done.
SubClass Of	system:Feature
Sub-Classes	FeminineGender MasculineGender
GenderMarking
Abstract	Urdu has two genders, masculine and feminine. Some nouns are marked for gender, whereas others are not. This means that there is in effect a four-way distinction among nouns: masculine marked, masculine unmarked, feminine marked and feminine unmarked. For example: r?payah ‘money’ (marked masculine); ghar ‘house’ (unmarked masculine); bacc? ‘female child’ (marked feminine); kit?b ‘book’ (unmarked feminine) (examples from Schmidt 1999: 1-2). Note that since some unmarked nouns coincidentally display the suffixes typical of marked nouns, the diagnostic feature of a marked noun is that its plural inflection follows that of the marked nouns (e.g. masculine ?? changing to ??, feminine ?? to ?iy?~, and so on). This four-way split could be encoded into a tagset in two ways: by creating two new values for the gender attribute (the EAGLES guidelines have only masculine, feminine, neuter, and common) or by creating a new markedness attribute with two values, 1 = marked for gender and 2 = not marked for gender. The latter approach has been followed since it will almost certainly be easier for software processing the intermediate tagset to ignore an entire attribute than to work out what to do about values it does not recognise in existing attributes. This is especially the case if the extra attribute is added at the end of the tag, as I have done.
SubClass Of	system:Feature
Sub-Classes	MarkedForGender UnmarkedForGender
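
The design argument in the GenderMarking abstract (an extra trailing attribute is easier for software to ignore than an unknown value inside an existing attribute) can be sketched as below. This is a sketch under assumptions: the attribute names, their order, and the tuple representation are hypothetical stand-ins, not the actual layout of the intermediate tags. Only the markedness values 1 and 2 come from the abstract.

```python
# Attribute names and order are assumed for illustration only; the real
# intermediate tag layout is defined by the EAGLES guidelines and Hardie (2004).
KNOWN_NOUN_ATTRIBUTES = ("type", "gender", "number", "case")

# Values given in the abstract for the added markedness attribute.
MARKED, UNMARKED = 1, 2


def read_known_attributes(values, known=KNOWN_NOUN_ATTRIBUTES):
    """Pair values with the attributes a consumer knows, ignoring trailing extras."""
    if len(values) < len(known):
        raise ValueError("tag has fewer attributes than expected")
    # zip stops at the shorter sequence, so a trailing markedness value is skipped.
    return dict(zip(known, values))


def read_with_markedness(values, known=KNOWN_NOUN_ATTRIBUTES):
    """Like read_known_attributes, but also expose a trailing markedness value."""
    attributes = read_known_attributes(values, known)
    extras = values[len(known):]
    attributes["markedness"] = extras[0] if extras else None
    return attributes
```

A consumer unaware of markedness could read an illustrative tag for ghar ‘house’ (unmarked masculine), `("common", "masculine", "singular", "nominative", UNMARKED)`, with `read_known_attributes` and see only the four attributes it understands, while an Urdu-aware consumer recovers the fifth with `read_with_markedness`.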
GeneralAdverb
Abstract	Adverb/Adverb-Type=general ("lexical adverb"). In Urdu these are of two sorts: adverbs which are derived from adjectives by inflecting them to their masculine oblique form or adding a Persian or Arabic loaned derivational suffix (RRJ), and adverbs which are not (RR). While this unfortunately violates the principle of not including derivational information, this distinction has been included in the tagset for two reasons. Firstly, it helps avoid ambiguity, since an adverb derived from an adjective has the same form as that adjective in its masculine singular oblique form (see Schmidt 1999: 57). If adjectival adverbs were marked RR, this would lead to a wide ambiguity between RR and JJM1O, which would make non-adjectival adverbs ambiguous as well! Using a separate tag, there is only an RRJ~JJM1O ambiguity, which significantly reduces the scope of the ambiguity. Although this is a pragmatic consideration which should probably be included at the subtagset level, it involves creating a distinction rather than collapsing one, and must thus exist in the top level tagset. However, there is another motivation for the RRJ tag, which is that it is necessary to maintain theoretical neutrality. It is possible that some analyst might wish to treat the RRJ adverbs as if they were actually adjectives, that is, identify them with JJ? categories instead of RR. Indeed Bailey et al. (1956: 18) come close to saying this. The principle of theoretical neutrality must here override the principle of excluding derivational information. The EAGLES intermediate tags for RR and RRJ are the same.
SubClass Of	Adverb
Sub-Classes	DeadjectivalAdverb
GeneralAuxiliary
SubClass Of	Verb
HonaAuxiliary
Abstract	The verb hōnā, ‘be’, is the auxiliary with the greatest range of application: the Urdu compound tenses are formed with it, and it has other uses, such as the copula. It can also be the sole verb of a main clause, but as explained above (section 3.2) it will be tagged the same whether it is a main verb or an auxiliary. The following examples from Schmidt (1999: 94, 120, 126) demonstrate the range of hōnā:
āj mai~ daftar mē~ nahī~ hū~
today 1-SING-NOM office in not be-PRES-1-SING
Today I am not in the office (hōnā as copula with postpositional phrase)
kal mausam acchā thā
yesterday weather good-MASC-SING-NOM be-PAST-MASC-SING
Yesterday the weather was fine (hōnā as copula with adjective)
ham farš par sōtē hai~
1-PLRL-NOM floor on sleep-IMPERF.PART-MASC-PLRL be-PRES-1-PLRL
We sleep on the floor (hōnā as auxiliary marking the habitual present with imperfective participle)
bāriš hūī hai
rain be-PERF.PART-FEM-SING be-PRES-3-SING
It has rained (hōnā as auxiliary marking immediate past with perfective participle of hōnā as main verb; more literal translation would be ‘There has been rain’)
Some of the parts of hōnā are equivalent to the parts of lexical verbs; this being so, their tags are the same as those of lexical verbs, except that they commence with VH instead of VV. In the intermediate tagset, this difference is expressed by the verbs being marked as auxiliary instead of main. Unfortunately, Schmidt (1999) does not give a full listing of all the forms of hōnā, and I was forced to use other methods as outlined in 2.3. The first recourse was to refer to other works – in this case Bailey et al. (1956). However, there were still gaps in the listing of forms of hōnā. When initially composing the tagset, I was forced by the underspecification in the literature to infer the existence and shape of some forms of the infinitive and imperative.
In the case of an irregular verb like hōnā, inferring its forms on the basis of regular verbal inflections involves making unwarranted assumptions. Therefore, these forms were treated as highly provisional in nature until the stage of manual tagging was undertaken (as described in the next chapter). At this point, it was possible to find examples in tagged texts for most of the forms. The polite imperative was a very notable exception to this. It did not occur in any of the manually tagged texts, and of two native speaker informants consulted on the issue, one concluded that the form hōiyē was not possible. However, the other informant suggested that it was possible. This being the case, the VHIA tag stands, since there can be no harm in maintaining the parallelism with other verbs even if this form is rare to vanishing point. The past participle of hōnā, as with that of other verbs, can be used alone as a simple past tense. The participial tags above would be used in this case. However, there is also an irregular inflected simple past tense, which, as might be expected, differs slightly in its meaning (Bailey et al. 1956: 109; Barz 1977: 48-49 considers this to be an instance of two separate verbs with the same infinitive). There is, in addition, an irregular inflected simple present tense (the only one in the whole language). These inflected forms are the basis of the compound tense system and both require separate tags, as follows. Like the regular inflected subjunctive mood, the present indicative of hōnā is marked for person and number but not gender. The intermediate tags for the present tense are the same as those of the subjunctive except that the mood is indicative. In the mnemonic tags I use H to indicate the present tense, since this tense is entirely characteristic of hōnā. The irregular past tense is marked for gender and number in the same way as a perfective participle, but it is a finite form.
The intermediate tags are the same as those for the present tense, except that 1) gender is not zero, 2) person is zero, and 3) tense is past rather than present.
SubClass Of	AuxiliaryVerb
HonorificSecondPerson
Abstract	The existence of a second person honorific form does not undermine the general principle, stated above, that the āp pronoun takes a third person verb form since, in the imperative, there is no third person, and the subject is not expressed anyway. For the purposes of the intermediate tagset the tense is considered to be present, and the number of the honorific form is considered to be ( 1 | 2 ), since both singular and plural ‘subjects’ are possible. This also serves to distinguish the VVIA tag in the intermediate tagset. The mnemonic ‘A’ is the same as that used for the āp pronoun, and thus refers to politeness.
SubClass Of	Person
Imperative
Abstract	There are three simple imperative forms: second person singular (which is identical to the ‘root’ form), second person plural (which is identical to the second person plural subjunctive form) and second person honorific. Each of these receives a separate tag. The existence of a second person honorific form does not undermine the general principle, stated above, that the āp pronoun takes a third person verb form since, in the imperative, there is no third person, and the subject is not expressed anyway. For the purposes of the intermediate tagset the tense is considered to be present, and the number of the honorific form is considered to be ( 1 | 2 ), since both singular and plural ‘subjects’ are possible. This also serves to distinguish the VVIA tag in the intermediate tagset. The mnemonic ‘A’ is the same as that used for the āp pronoun, and thus refers to politeness.
SubClass Of	LexicalVerb
ImperativeMood
Abstract	Mood=imperative
SubClass Of	Mood
ImperfectiveAspect
Abstract	Aspect=imperfective
SubClass Of	Aspect
ImperfectiveParticiple
Abstract	Urdu has two participles, the imperfective and the perfective. However, unlike participles in many European languages, they can be used as the sole verb of a main clause. This creates the tenses referred to as the irrealis and the simple past respectively. However, the presence or absence of an auxiliary makes no difference to the form of the participle. It would therefore be misleading to use two tags for a single form of the verb. These tags are thus used for both finite and non-finite uses, and the notions of irrealis and simple past are not referred to in the precise definitions of the tags. The dual finite and non-finite nature of the tags is indicated in the intermediate tagset using the OR operator, | . There is a value in the EAGLES tagset for past tense, but there is not one for irrealis. The closest approximation to an irrealis in the EAGLES guidelines is subjunctive past (see the discussion of this point in 3.2 above). This is not a perfect solution, but without adding extra values to the intermediate tagset it is the best that can be managed. Thus, the imperfective is finite subjunctive past with zero aspect or non-finite participle imperfective with zero tense. The perfective is finite indicative past with zero aspect or non-finite participle perfective with zero tense. (It is hard to justify this use of ‘indicative’, since Urdu lexical verbs do not possess any indicative form as such. Therefore the notion of the indicative is not used in the definitions of the tags themselves, but only in the intermediate tagset, where something is needed to distinguish the finite use of the perfective participle from the finite use of the imperfective participle.) The participles are not marked for person, but are marked for gender and number. Their inflection is the same as that of adjectives, except that in some circumstances a distinction is made between feminine singular and plural which is not made by adjectives.
Participles can also function as adjectives (see discussion of adjectives in 3.3 below), in which case this extra feminine singular/feminine plural distinction is not made (though this does not affect the tagging). That is to say, an adjective which agrees with a feminine plural noun or pronoun will always receive an F2 tag, regardless of whether it has the plural ending -ī~ or the more general feminine ending -ī. When participles are used as adjectives, it would in theory be possible to tag them as if they were adjectives. However, this has not been done, since even when being used attributively, participles appear in structures that normal adjectives do not. For example, they frequently occur in participial phrases with the perfective participle of the auxiliary verb hōnā (see below). When used adjectivally rather than verbally, participles may be marked for case as well as number and gender. This feature is also included in the tagset. Of course, the feature case only applies to the non-finite usage of the participle; this is reflected in the intermediate tagset by the use of ( 0 | 1 ) for the nominative or finite form. As with adjectives (see below), the ‘oblique’ case is ( 3 | 5 ) in the intermediate tagset. The characters Y and T have been used for the perfective and imperfective participles respectively, since these are the consonants that indicate the suffixes for these forms.
SubClass Of	Participle
InclusiveEmphaticParticle
Abstract	Inclusive emphatic particle (bhī)
SubClass Of	Unique
IndefiniteDeterminer
Abstract	There is also a tag for indefinite determiners. Two words in this class are zyādah ‘more’ and kāfī ‘enough’. Following Schmidt (1999) these are classed broadly as adjectives for two reasons: to keep them in line with the possessive adjectives, which are determiners; and because they can also function as adverbs (see section 3.6 below), which is characteristic of adjectives. These are not marked for gender, number or case.
SubClass Of	OtherPronounOrDeterminer
IndefinitePronoun
Abstract	In this miscellaneous group of pronouns are included two indefinite pronouns, kōī and kuch, which may function as pronouns or determiners (just as yah and vah do). Also included in the PN* category is sab, ‘all’, which has an inflected oblique plural (like numerals; see section 3.9) which is tagged as PNO.
SubClass Of	OtherPronounOrDeterminer
IndicativeMood
Abstract	Mood=indicative
SubClass Of	Mood
Infinitive
Abstract	The infinitive of the verb is regularly formed. Mostly it is used as a verbal noun or as part of a complex verb phrase. It is also used as a neutral request form, in which case it is the main verb of its clause; however, I do not think that this usage is sufficient to justify separate tagging; this is better treated as an example of a secondary usage of the same word, rather than a separate word (which giving it a separate tag would imply). The ‘default’ ending of the infinitive is -nā, which is a masculine singular ending. When used as a noun it may occur in the oblique case; when it occurs in a verb phrase it may display gender and number agreement (in a similar way to an adjective). However these conditions cannot both occur; therefore there is no feminine oblique or plural oblique, which reduces the number of tags necessary. There is a problem creating the intermediate tagset, inasmuch as there is no attribute for ‘case’ in the EAGLES guidelines for verbs (presumably non-finite verb forms in European languages do not display case inflection). An attribute, case, has therefore been added to the end of the intermediate tags. Otherwise this set of intermediate tags is fairly unproblematic. The ‘N’ in the mnemonic tags is derived from the ‘n’ suffix that indicates the infinitive.
SubClass Of	LexicalVerb
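The restriction described in the Infinitive abstract (no feminine oblique, no plural oblique) can be checked by enumeration. The following is only a minimal sketch: the feature names are illustrative assumptions, and the actual mnemonic tags of the tagset are not reproduced here.

```python
from itertools import product

# Hypothetical feature labels, chosen for illustration only.
GENDERS = ("masculine", "feminine")
NUMBERS = ("singular", "plural")
CASES = ("nominative", "oblique")


def infinitive_feature_sets():
    """Enumerate gender/number/case combinations for the infinitive.

    Oblique case (nominal use) and gender/number agreement (verb-phrase
    use) cannot both occur, so oblique is kept only with the default
    masculine singular ending.
    """
    combos = []
    for gender, number, case in product(GENDERS, NUMBERS, CASES):
        if case == "oblique" and (gender, number) != ("masculine", "singular"):
            continue  # no feminine oblique, no plural oblique
        combos.append((gender, number, case))
    return combos


if __name__ == "__main__":
    for combo in infinitive_feature_sets():
        print(combo)
```

Under these assumptions, five of the eight logically possible combinations remain, which is the reduction in the number of tags that the abstract refers to.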
InfinitiveMood
Abstract	Mood=infinitive
SubClass Of	Mood
Interjection
Abstract	The EAGLES guidelines do not recommend any additional attributes for the class of interjections. Nor have I introduced any of my own. There is thus one tag. The mnemonic tag represents the spelling of ? (Schmidt 1999: 217), which has been selected as a representative interjection.
SubClass Of	PartOfSpeech
InterrogativeAdjective
Abstract	There are also in the y-v-k-j sets a number of words that are more like determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone as pronouns. However they behave in some respects more like adjectives, e.g. they can be predicative rather than attributive. In terms of the EAGLES guidelines they are best characterised within the pronoun/determiner category. They correspond to English words like ‘such’, ‘this/that much/many’ and so on. In terms of the Urdu tagset, I have classified them as JD: determiner-like adjectives.
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
InterrogativeAdverb
SubClass Of	PronominalAdverb
InterrogativeDeadjectivalAdverb
SubClass Of	PronominalAdverb
InterrogativePronoun
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
Izafat
Abstract	The izāfat is a Persian enclitic (pronounced as a shorter form of ‘ē’) which in some circumstances can be considered a preposition: it links two nouns in a possessive relationship, although the phrase thus produced may often have a different meaning to a phrase produced with the native Urdu postposition kā. However, the izāfat may also join a noun to an adjective, in which case it is not so clearly accurate to describe it as a preposition parallel to the prepositions in European languages for which the EAGLES guidelines were compiled. A better way to treat izāfat is in the context of the Unique category of miscellaneous one-member wordclasses, discussed below.
SubClass Of	Unique
Letter
SubClass Of	Residual
LexicalVerb
Abstract	The EAGLES guidelines do not consider lexical and auxiliary verbs to be separate major parts of speech, although this is a view that some have held (e.g. the ICE tagset; Greenbaum and Yibin 1996). However, in Urdu this distinction is very significant, since auxiliary forms pattern differently to the forms of lexical verbs. Therefore, this tagset will employ a high-level (but not top-level) distinction between lexical verbal elements (whose tags will commence with VV) and non-lexical or auxiliary verbal elements (whose tags will commence with V and one other letter: either one indicating what word it is, for auxiliary verbs whose inflectional behaviour is anomalous, or X for a general auxiliary). Thus both the EAGLES guidelines and the demands of Urdu morphology are complied with. There exist in Urdu two widely applicable derivational suffixes which attach to the root of a lexical verb and increase its valence, making it transitive or causative in sense. This has been highlighted as a significant feature of the language (e.g. by Kachru 1990: 63) and is described in some detail by Schmidt (1999: 87, 157-175). It might be possible to distinguish such derived verbs from non-derived verbs in the tagset, but I do not, because of the design principle that no derivational information should be included. Furthermore, such a distinction would be difficult to automate, and also probably difficult for humans to annotate. Lexical verbs occur in a number of inflected forms. The names of these forms are perhaps not very useful, since each of them has a variety of uses hard to capture by one of the traditional grammatical category names. However, rather than resort to letters or numbers which would be unlinkable to any previous writing on the Urdu verb, I use the same names for the forms as Schmidt (1999), as I have been doing thus far in this thesis.
SubClass Of	Verb
Sub-Classes	Imperative Infinitive Root Subjunctive
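The prefix convention described in the LexicalVerb abstract (VV for lexical verbs, V plus X for a general auxiliary, V plus another letter for a specific auxiliary such as VH for the 'be' verb) lends itself to a simple lookup. This is a hedged sketch only: the function name and the returned labels are assumptions made for illustration, not part of the tagset.

```python
def verb_tag_kind(tag: str) -> str:
    """Classify a mnemonic tag by its first two letters.

    Hypothetical helper: the returned labels are illustrative and do
    not correspond to names used in the tagset documentation.
    """
    if len(tag) < 2 or tag[0] != "V":
        return "not a verb tag"
    if tag[1] == "V":
        return "lexical"
    if tag[1] == "X":
        return "general auxiliary"
    # Any other second letter names a specific auxiliary word.
    return "specific auxiliary"


if __name__ == "__main__":
    for tag in ("VVIA", "VHIA", "JJU"):
        print(tag, "->", verb_tag_kind(tag))
```

The tags VVIA, VHIA and JJU used in the usage lines are the ones mentioned elsewhere in this documentation; no other tag names are assumed.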
MarkedForGender
Abstract	Markedness=1
SubClass Of	GenderMarking
MasculineGender
Abstract	Gender=masculine
SubClass Of	Gender
ModalAdverb
SubClass Of	NonLexicalAdverb
Sub-Classes	NegativeModalAdverb
Mood
Abstract	The last two attributes, finiteness and mood, are problematic. Firstly, inherent in the EAGLES guidelines is the problem that the mood attribute contains values relevant to both finite and non-finite forms, so that the finiteness attribute becomes redundant. Secondly, the finite/non-finite distinction may be hard to draw in Urdu. The forms described below as participles would traditionally be considered non-finite in European languages. However, in Urdu they have certain features which make them seem more like finite forms. For example, they can occur as the only verb in a main clause, and can agree with a subject or object, which is not a property prototypically associated with non-finite forms. These properties are illustrated by the following example from Schmidt (1999: 126):
unhō~ nē an paRh kī bāt nah mānī
3-PLRL-OBL ERG un educated of-FEM speech not accept-PERF.PART-FEM-SING
They did not accept what the uneducated person said.
(Schmidt does not give word-by-word glosses, only whole-sentence translations. I have added the glosses using Schmidt (1999) and Haq (2001) as guides. See also Appendix 2.)
The verb form mānī is a participle, but it is the only verb form in the sentence, and it is marked for agreement (with the object, since this clause is of the ergative type). It, like the postposition kī, agrees with the feminine singular noun bāt. A third problem with the mood distinctions made in the EAGLES guidelines is that they are not necessarily those made by Urdu. For example, Urdu has forms which may be described as subjunctive and imperative moods, but it would seem to lack an indicative (except for the auxiliary hōnā). Because of these difficulties, the concepts of finiteness and mood will not be used to structure the tagset itself, although they are of course inevitable as attributes in the intermediate tagset.
This means that in some cases, the intermediate tagset values used to characterise some Urdu verb forms are somewhat arbitrary, since I have had to simply pick the values that seem closest to describing Urdu. For example, considering the ‘irrealis tense’ (the term used by Schmidt 1999 for the finite use of the imperfective participle) to be a past tense subjunctive is not warranted by the Urdu verbal system. It was picked as the ‘least bad’ way to characterise it simply because the Urdu irrealis has a usage similar to that of the past subjunctive in languages included in EAGLES such as German and (vestigially) English (e.g. ich wäre, I were). For example, Schmidt (1999) translates a sentence from the poet Ghalib as follows:
agar aur jītē rahtē
if and alive-MASC-PLRL stay-IMPERF.PART-MASC-PLRL
yahī intizār hōtā
this-very waiting be-IMPERF.PART-MASC-SING
If I were to live longer it would only be to wait like this
The presence in the translation of the past tense subjunctive (‘I were’) in the first, but not the second, of two clauses containing the finite imperfective participle demonstrates the partial parallelism between an Urdu irrealis and an English past subjunctive.
SubClass Of	system:Feature
Sub-Classes	ImperativeMood IndicativeMood InfinitiveMood ParticipleMood SubjunctiveMood
MultiplicativeMarker
Abstract	Multiplicative marker (gunā)
SubClass Of	Unique
NegativeModalAdverb
Abstract	particles that mark tense, aspect and negation, cf. Schmidt (1999, p.69f.)
SubClass Of	ModalAdverb
NeutralQuotation
SubClass Of	Punctuation
NominativeCase
Abstract	Case=nominative Nominative is usually given meaning by its contrast with accusative, a case that does not exist in Urdu. The nominative may in Urdu be used for either, neither or both of the subject and the direct object. Thus it is not certain whether the nominative in Urdu really corresponds with the nominative that is value 1 in the EAGLES guidelines. (Barz (1977) and McGregor (1972) actually call the nominative case the direct case.) Certainly it does not correspond with the nominative as it exists in, for example, German or Latin. However, I have used value 1 in the intermediate tagset for the Urdu case, on the basis that no Urdu case resembles the nominative in the European languages for which the EAGLES guidelines were devised any more closely than the Urdu nominative.
SubClass Of	Case
NonFinite
Abstract	Finiteness=non-finite
SubClass Of	Finiteness
NongrammaticalLexicalElement
Abstract	Nongrammatical lexical element. Words that contain an orthographic space which does not actually represent a word break, principally Persian loans such as zimmah dār, ‘responsible’, xūb tarīn, ‘best’, and ham zāt, ‘of the same caste’ (all examples from Schmidt 1999: 248-256), cause a problem for tokenisation as described in 2.2.6.1. This was solved by the decision to treat every orthographic space as a word break, so that zimmah dār, etc., are treated as two tokens. However, this leads to another problem, greater if anything, concerned with tagging. How are the two elements to be tagged? (This is referred to as the zimmah dār problem because it was first encountered during an attempt to manually tag a sentence from Schmidt (1999) containing the word zimmah dār using an early trial version of the tagset.) As it happens, zimmah, xūb and zāt are independent words (‘duty’, ‘good’ and ‘caste’ respectively) and could be given the appropriate tags, nominal and adjectival. The problem then becomes, what to do with dār, tarīn and ham? The former two could be given some tag to indicate that they were adjective forming clitics or affixes, and the prefix ham could be marked up as an adverb (according to Haq 2001 the part of speech of ham when it occurs independently). However, this has two drawbacks. Firstly, it breaks with the design principle that no derivational information will be included in the tagset by analysing the component morphemes of complex words, for zimmah dār etc. are words, not phrases. The word zimmah dārī, ‘responsibility’, is clear evidence of this: it has been created by a morphological process (suffixation of -ī) and morphological processes apply to words, not to syntactic phrases. Also, the single word zimmah dār has been given two tags in this approach, a contravention of the ‘one word, one tag’ principle. Secondly, it introduces inconsistency into the tagging.
The derivational information would be present for some words formed with the relevant Persian derivational morphemes, but not for all, because not all words formed with them contain the superfluous orthographic token break. Examples of single-token derived words include samajhdār, ‘sensible’, kamtarīn, ‘least’, and hamdardī, ‘sympathy’. If zimmah and dār are to be tagged separately, then for consistency samajh would also have to be tagged separately, opening up whole vistas of morphological analysis that are utterly irrelevant to part-of-speech tagging. Indeed, going down this road subverts the entire enterprise: we would find ourselves engaged in derivational analysis instead of morphosyntactic analysis. To take the opposite approach to tagging zimmah dār, we might mark a single tag for the whole word (JJU in this case); however, this also breaks the ‘one word, one tag’ principle as there is now an untagged token and a multiword tag. The best solution to the problem (although far from ideal) would seem to be to use some kind of special tag on the first part of the two-token word to indicate that this is a case of the zimmah dār problem, and put the tag we would like to give to the whole thing on the second token. (Since dār, tarīn and other affixes involved in the zimmah dār problem are derivational suffixes, it is they that determine the part of speech; thus it makes sense for them to carry the actual tag.) This tag will be LL, the ‘nongrammatical lexical element’ listed in the previous section, and it will be applied thus:
zimmah_LL dār_JJU    samajhdār_JJU
xūb_LL tarīn_JJU    kamtarīn_JJU
ham_LL zāt_JJU    hamdardī_NNUF1N
(I use an underscore format to link the words and their tags for clarity in the examples given here; in practice an XML/SGML markup would be used.) The first element is described as a nongrammatical lexical element because while it does not contribute to the morphosyntax of the two-token word, it does contribute to its meaning. Therefore it is entirely lexical in nature. It is to be hoped that the usage of the LL tag can be restricted to one context: alongside a relatively small number of affixes such as dār.
SubClass Of	Unique
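The LL convention described in the NongrammaticalLexicalElement abstract implies that a consumer of the tagged text can rebuild the orthographic word by joining an LL token to the token that follows it, keeping the second token's tag. The sketch below assumes the underscore format used in the examples (the abstract notes that XML/SGML markup would be used in practice); the function names are hypothetical.

```python
def parse_tagged(text):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # rpartition splits on the last underscore, so the tag is
        # always the part after it.
        word, _, tag = item.rpartition("_")
        pairs.append((word, tag))
    return pairs


def merge_ll(pairs):
    """Join each LL token with the token after it.

    The merged word carries the tag of the second token, as the
    abstract describes. An LL token with nothing after it is kept
    unchanged rather than dropped.
    """
    merged = []
    pending = []
    for word, tag in pairs:
        if tag == "LL":
            pending.append(word)
            continue
        merged.append((" ".join(pending + [word]), tag))
        pending = []
    if pending:
        merged.append((" ".join(pending), "LL"))
    return merged


if __name__ == "__main__":
    sample = "zimmah_LL dār_JJU samajhdār_JJU"
    print(merge_ll(parse_tagged(sample)))
```

Merging in this direction keeps the "one word, one tag" reading available to downstream tools without changing the tokenisation stored in the corpus.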
NonLexicalAdverb
SubClass Of	Adverb
Sub-Classes	DegreeAdverb ModalAdverb PronominalAdverb
NonPersoArabicString
SubClass Of	Residual
Noun
Abstract	The EAGLES guidelines give four recommended attributes for nouns: type, gender, number and case. There are also two optional attributes, countability and definiteness. Type refers to whether a noun is common (denotes one or more members of a class of things) or proper (is the name of one or more particular things). This attribute is an example of one which is marginal to morphosyntax, but should be included since the distinction between common and proper might well prove useful to some future linguistic investigation of the text. It has been included in the tagset for now, but with the reservation that it might have to be collapsed in any subtagset for automatic tagging. This is because there may well not be any way for the tagger to make this distinction. Unlike the Roman, Greek and Cyrillic alphabets, the Urdu alphabet has no uppercase letters. In the European languages for which the EAGLES guidelines were designed, which use one of the former alphabets, uppercase letters are often used to identify proper nouns. It is clear that no such simple rule could be employed in Urdu. Furthermore there are no articles in Urdu (Bhatia and Koul 2000: 318), the absence and presence of an article being typical of proper and common nouns respectively in English and similar languages.
SubClass Of	PartOfSpeech
Sub-Classes	CommonNoun ProperNoun
Number
Abstract	Urdu has two numbers, singular and plural. This is well agreed on (Schmidt 1999: 1; Bhatia and Koul 2000: 314; Barz 1977: 36; Bailey et al. 1956: 1, 5). The EAGLES guidelines on noun number allow for exactly this possibility, and thus have been implemented unproblematically.
SubClass Of	system:Feature
Sub-Classes	PluralNumber SingularNumber
Numeral
Abstract	The EAGLES guidelines give numerals as a separate major part-of-speech, but say that ‘In some languages (e.g. Portuguese) this category is not normally considered to be a separate part of speech, because it can be subsumed under others… We recognise that in some tagsets Numeral may therefore occur as subcategory within other parts of speech’ (Leech and Wilson 1999: 65). This approach seems sensible for Urdu, where numerals display very much the behaviour of adjectives. However, for purposes of the intermediate tagset, the numeral class has been used, since it contains the very useful attribute type. In fact, all the EAGLES attributes have been used (though of course, not all of their values). For case, the oblique / vocative value ( 3 | 5 ) is used, as with adjectives.
SubClass Of	PartOfSpeech
Sub-Classes	CardinalNumber Fraction OrdinalNumber
ObliqueCase
Abstract	Case=dative There is no value in the EAGLES guidelines for oblique. Nor is there one for postpositional, locative or instrumental (alternative names used by Bailey et al. 1956 for this case). Rather than invent an extra value (undesirable for reasons given with regard to markedness above), I have used the value for dative to represent oblique, on the grounds that in some European languages (e.g. German) prepositions frequently govern the dative, and in Urdu postpositions govern the oblique.
SubClass Of	Case
ObliqueOrVocativeCase
Abstract	As far as marked adjectives are concerned, there is again the problem of tag-to-meaning many-to-one and one-to-many mapping, but with adjectives it is, if anything, an even greater problem than it was with nouns. There is no oblique-vocative distinction at all (Schmidt 1999: 36 goes so far as to say that ‘An adjective modifying a vocative noun is in the oblique case’) ... Thus the tagset does not distinguish vocative adjectives from oblique adjectives (or participle forms of verbs: see above). In the intermediate tagset, this is represented using the OR and bracket operators, as described in the EAGLES guidelines (Leech and Wilson 1999: 71), as ( 3 | 5 ).
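The OR and bracket notation used for intermediate tagset values, such as ( 3 | 5 ) for oblique or vocative and ( 0 | 1 ) for the nominative or finite form, can be read as a set of permitted alternatives. The following is a minimal sketch under that reading; the function names are hypothetical, and only single attribute values are handled, not complete intermediate tags.

```python
def parse_value(value: str) -> frozenset:
    """Turn '3' or '( 3 | 5 )' into the set of alternatives it allows."""
    inner = value.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    return frozenset(part.strip() for part in inner.split("|") if part.strip())


def allows(value: str, candidate: str) -> bool:
    """True if a concrete attribute value is one of the alternatives."""
    return candidate in parse_value(value)


if __name__ == "__main__":
    print(sorted(parse_value("( 3 | 5 )")))
    print(allows("( 0 | 1 )", "1"))
    print(allows("( 3 | 5 )", "1"))
```

Treating the value as a set makes the underspecification explicit: a tool comparing annotations can test membership instead of string equality.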
OpenParenthesis
SubClass Of	Punctuation
OpenQuotationMark
SubClass Of	Punctuation
OpenSquareBracket
SubClass Of	Punctuation
OrdinalNumber
Abstract	Numeral/Type=ordinal
SubClass Of	Numeral
OtherPronounOrDeterminer
Abstract	Other pronouns and determiners. In this miscellaneous group of pronouns are included two indefinite pronouns, kōī and kuch, which may function as pronouns or determiners (just as yah and vah do). Also included in the PN* category is sab, ‘all’, which has an inflected oblique plural (like numerals; see section 3.9) which is tagged as PNO. There is also a tag for indefinite determiners. Two words in this class are zyādah ‘more’ and kāfī ‘enough’. Following Schmidt (1999) these are classed broadly as adjectives for two reasons: to keep them in line with the possessive adjectives, which are determiners; and because they can also function as adverbs (see section 3.6 below), which is characteristic of adjectives. These are not marked for gender, number or case.
SubClass Of	PronounOrDeterminer
Sub-Classes	IndefiniteDeterminer IndefinitePronoun
OtherSymbol
SubClass Of	Residual
OtherUnclassifiableNonUrduElement
SubClass Of	Residual
Participle
Abstract	Urdu has two participles, the imperfective and the perfective. However, unlike participles in many European languages, they can be used as the sole verb of a main clause. This creates the tenses referred to as the irrealis and the simple past respectively. However, the presence or absence of an auxiliary makes no difference to the form of the participle. It would therefore be misleading to use two tags for a single form of the verb. These tags are thus used for both finite and non-finite uses, and the notions of irrealis and simple past are not referred to in the precise definitions of the tags. The dual finite and non-finite nature of the tags is indicated in the intermediate tagset using the OR operator, | . There is a value in the EAGLES tagset for past tense, but there is not one for irrealis. The closest approximation to an irrealis in the EAGLES guidelines is subjunctive past (see the discussion of this point in 3.2 above). This is not a perfect solution, but without adding extra values to the intermediate tagset it is the best that can be managed. Thus, the imperfective is finite subjunctive past with zero aspect or non-finite participle imperfective with zero tense. The perfective is finite indicative past with zero aspect or non-finite participle perfective with zero tense. (It is hard to justify this use of ‘indicative’, since Urdu lexical verbs do not possess any indicative form as such. Therefore the notion of the indicative is not used in the definitions of the tags themselves, but only in the intermediate tagset, where something is needed to distinguish the finite use of the perfective participle from the finite use of the imperfective participle.) The participles are not marked for person, but are marked for gender and number. Their inflection is the same as that of adjectives, except that in some circumstances a distinction is made between feminine singular and plural which is not made by adjectives.
Participles can also function as adjectives (see discussion of adjectives in 3.3 below), in which case this extra feminine singular/feminine plural distinction is not made (though this does not affect the tagging). That is to say, an adjective which agrees with a feminine plural noun or pronoun will always receive an F2 tag, regardless of whether it has the plural ending -ī~ or the more general feminine ending -ī. When participles are used as adjectives, it would in theory be possible to tag them as if they were adjectives. However, this has not been done, since even when being used attributively, participles appear in structures that normal adjectives do not. For example, they frequently occur in participial phrases with the perfective participle of the auxiliary verb hōnā (see below). When used adjectivally rather than verbally, participles may be marked for case as well as number and gender. This feature is also included in the tagset. Of course, the feature case only applies to the non-finite usage of the participle; this is reflected in the intermediate tagset by the use of ( 0 | 1 ) for the nominative or finite form. As with adjectives (see below), the ‘oblique’ case is ( 3 | 5 ) in the intermediate tagset. The characters Y and T have been used for the perfective and imperfective participles respectively, since these are the consonants that indicate the suffixes for these forms.
Sub-Classes	ImperfectiveParticiple PerfectiveParticiple
ParticipleMood
Abstract	Mood=participle
SubClass Of	Mood
PartOfSpeech
SubClass Of	system:UnitOfAnnotation
Sub-Classes	Adjective Adposition Adverb Article Conjunction Interjection Noun Numeral PronounOrDeterminer Punctuation Residual Unique Verb
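The SubClass Of and Sub-Classes lines in this listing define a tree that can be walked programmatically. The sketch below hard-codes a handful of the relations stated in this part of the listing as a plain dictionary; it is an illustration only, not a replacement for loading the OWL file with an RDF library, and the variable and function names are assumptions.

```python
# Child -> parent relations copied from the SubClass Of / Sub-Classes
# lines of this listing (a small excerpt, not the full ontology).
PARENT = {
    "HonaAuxiliary": "AuxiliaryVerb",
    "GeneralAuxiliary": "Verb",
    "Imperative": "LexicalVerb",
    "Infinitive": "LexicalVerb",
    "LexicalVerb": "Verb",
    "ImperfectiveParticiple": "Participle",
    "PerfectiveParticiple": "Participle",
    "OrdinalNumber": "Numeral",
    "Numeral": "PartOfSpeech",
    "Noun": "PartOfSpeech",
    "Verb": "PartOfSpeech",
}


def ancestors(name):
    """Return the chain of superclasses, nearest first."""
    chain = []
    while name in PARENT:
        name = PARENT[name]
        chain.append(name)
    return chain


if __name__ == "__main__":
    print(ancestors("Infinitive"))
```

Classes whose parent is not included in the excerpt simply end the chain early, so the helper never raises on an unknown name.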
PastTense
Abstract	Tense=past
SubClass Of	Tense
PerfectiveAspect
Abstract	Aspect=perfective
SubClass Of	Aspect
PerfectiveParticiple
Abstract	Urdu has two participles, the imperfective and the perfective. However, unlike participles in many European languages, they can be used as the sole verb of a main clause. This creates the tenses referred to as the irrealis and the simple past respectively. However, the presence or absence of an auxiliary makes no difference to the form of the participle. It would therefore be misleading to use two tags for a single form of the verb. These tags are thus used for both finite and non-finite, and the notions of irrealis and simple past are not referred to in the precise definitions of the tags. The dual finite and non-finite nature of the tags is indicated in the intermediate tagset using the OR operator, \| . There is a value in the EAGLES tagset for past tense, but there is not one for irrealis. The closest approximation to an irrealis in the EAGLES guidelines is subjunctive past (see the discussion of this point in 3.2 above). This is not a perfect solution, but without adding extra values to the intermediate tagset it is the best that can be managed. Thus, the imperfective is finite subjunctive past with zero aspect or non-finite participle imperfective with zero tense. The perfective is finite indicative18 past with zero aspect or non-finite participle perfective with zero tense. The participles are not marked for person, but are marked for gender and 18 It is hard to justify this use of ?indicative?, since Urdu lexical verbs do not possess any indicative form as such. Therefore the notion of the indicative is not used in the definitions of the tags themselves, but only in the intermediate tagset (where something is needed to distinguish the finite use of the perfective participle from the finite use of the imperfective participle). 123 number. Their inflection is the same as that of adjectives, except that in some circumstances a distinction is made between feminine singular and plural which is not made by adjectives. 
Participles can also function as adjectives (see discussion of adjectives in 3.3 below), in which case this extra feminine singular/feminine plural distinction is not made (though this does not affect the tagging). That is to say, an adjective which agrees with a feminine plural noun or pronoun will always receive an F2 tag, regardless of whether it has the plural ending -ī~ or the more general feminine ending -ī. When participles are used as adjectives, it would in theory be possible to tag them as if they were adjectives. However, this has not been done, since even when being used attributively, participles appear in structures that normal adjectives do not. For example, they frequently occur in participial phrases with the perfective participle of the auxiliary verb hōnā (see below). When used adjectivally rather than verbally, participles may be marked for case as well as number and gender. This feature is also included in the tagset. Of course, the feature case only applies to the non-finite usage of the participle; this is reflected in the intermediate tagset by the use of ( 0 | 1 ) for the nominative or finite form. As with adjectives (see below), the ‘oblique’ case is ( 3 | 5 ) in the intermediate tagset. The characters Y and T have been used for the perfective and imperfective participles respectively, since these are the consonants that indicate the suffixes for these forms.
SubClass Of	Participle
Person
Abstract	Urdu has the three normal persons given in the EAGLES guidelines, each in singular and plural forms. Schmidt (1999: 97) suggests that Urdu verbs also have an additional polite or honorific form, which although second person in meaning (it agrees with a pronoun āp that refers to one or more interlocutors) is identical to the third person plural form of the verb. In this case I have deviated from the model described by Schmidt, for reasons discussed in my treatment of the āp pronoun in section 3.4.1.2. There will be no tags for honorific verbal forms, and verb forms which agree with āp will be tagged as third person forms. The exception to this is the imperative, discussed in the next section.
SubClass Of	system:Feature
Sub-Classes	FirstPerson HonorificSecondPerson SecondPerson ThirdPerson
PersonalPronoun
Abstract	The issue of what exactly constitutes a personal pronoun is not an easy one in the context of the grammar of Urdu as presented by Schmidt (1999). Therefore, in this section, before discussing the tags of the personal pronouns I elaborate on how I drew the boundary of this category, justifying the minor claim that the pronouns vah and yah (and their various inflected forms) are not personal pronouns, as stated by Schmidt (1999). I first consider these third person pronouns (3.4.1.1), and subsequently the problematic honorific pronoun āp (3.4.1.2). In 3.4.1.3 I deal with the tagging of mai~ and tū, the remaining words in the category of personal pronouns.
3.4.1.1 The non-existence of third person personal pronouns
Urdu has no third person personal pronouns. The demonstrative pronouns/determiners are used in their place. This is claimed contrary to Schmidt, who states (1999: 15) that ‘The demonstrative pronouns ye and vo are identical in form to the personal pronouns ye and vo (meaning “he”, “she”, “it”)’. However the differences in behaviour between these pronouns and the first and second person pronouns that I list below, also drawn from Schmidt, make it clear that the statement that began this section is justified.
• There are absolutely no differences in case / number inflection between the third person pronouns and the demonstratives (Schmidt 1999: 16)
• In a perfective transitive sentence (the type that some, such as Dixon 1994, would class as ‘ergative’), a third person pronoun subject appears in the oblique case (like a noun); but a first or second person subject pronoun is in the nominative case at all times (Schmidt 1999: 22)
• The third person pronouns take special plural oblique forms before the postposition nē (Schmidt 1999: 22), whereas the first and second do not
• There are no possessive adjectives corresponding to the third person pronouns, whereas there are such adjectives corresponding to the first and second person pronouns (Schmidt 1999: 24)
On these grounds, I exclude the third person pronouns from consideration as personal pronouns, and deal with them as demonstratives/determiners, etc. (see section 3.4.2). Thus, the subcategory of first and second person personal pronouns contains only the pronouns mai~ and tū, and inflectionally related forms such as their plurals and possessive forms. All tags in this subcategory begin PP- (or PG- for possessives). Personal pronouns are not marked for gender: as with verbs, that which is marked for person is not marked for gender. (The ‘M’ in the tags below signifies ‘first person’, not ‘masculine’.) They are marked for number and case. As noted in the preceding section, the intermediate tagset for pronouns contains an attribute of politeness. All pronouns in this section are given as familiar, to distinguish their intermediate tags from that for āp. In practice, the singular/plural distinction is often also used to indicate formality in the second person pronouns (Bhatia and Koul 2000: 35-36); tum may apply to one or more than one person. However, the EAGLES guidelines suggest that such a pragmatic usage of the number distinction may still be encoded as a number distinction. This is what I have done, tagging tum as plural, on the basis that for purposes of inflection it is the number of the pronoun, not the number of its referent, that counts. There are possessive adjectives corresponding to the personal pronouns above. While the intermediate tagset must treat these as pronouns, within the Urdu tagset they could have been treated as adjectives (as has been done with some other determiner-like pronouns; see below). However, this has not been done, since the possessive adjectives have person. This is not true for any adjectival form, and thus the possessive adjectives are better classed as personal pronouns. As they are adjectival, they may be marked for gender, number and case. The case and gender attributes indicate the features that are in agreement with the head noun rather than inherent features of the pronoun. The number attribute is also for agreement; the inherent number of the possessive adjective itself is shown by the attribute possessive.
SubClass Of	PronounOrDeterminer
Sub-Classes	SecondPersonHonorificPronoun
PluralNumber
Abstract	Number=plural
SubClass Of	Number
PossessiveAdjective
Abstract	There are possessive adjectives corresponding to the personal pronouns above. While the intermediate tagset must treat these as pronouns, within the Urdu tagset they could have been treated as adjectives (as has been done with some other determiner-like pronouns; see below). However, this has not been done, since the possessive adjectives have person. This is not true for any adjectival form, and thus the possessive adjectives are better classed as personal pronouns.
SubClass Of	PronounOrDeterminer
Sub-Classes	ReflexivePossessiveAdjective
Postposition
SubClass Of	Adposition
PredicativeAdjective
Abstract	Adjective/Use=predicative
SubClass Of	Adjective
PremultiplicativeCliticNumeral
Abstract	Pre-multiplicative clitic cardinal number du-, ti-, cau-
SubClass Of	Unique
Preposition
SubClass Of	Adposition
PresentTense
Abstract	Tense=present
SubClass Of	Tense
PronominalAdverb
SubClass Of	NonLexicalAdverb
Sub-Classes	DistalDemonstrativeAdverb DistalDemonstrativeDeadjectivalAdverb InterrogativeAdverb InterrogativeDeadjectivalAdverb ProximalDemonstrativeAdverb ProximalDemonstrativeDeadjectivalAdverb RelativeAdverb RelativeDeadjectivalAdverb
PronounOrDeterminer
Abstract	The EAGLES guidelines treat pronouns and determiners together as a single category, although one of the recommended attributes, category, distinguishes between them. Since in Urdu the distinction is not clear (particularly in the area of third person pronouns), I also treat this category as being single at the most fundamental level. The difference between what is considered a determiner and what is considered a pronoun is not made in the EAGLES guidelines, which say ‘different analyses for different languages entail separating [these parts of speech] out in different ways’ (Leech and Wilson 1999: 63). For Urdu, I have mostly followed Schmidt (who does not have a separate ‘determiner’ category) in the divisions I make. However, I have classed together all third person pronouns/demonstratives, interrogative and relative pronouns/determiners, because these form sets of words displaying morphological symmetry (see 3.4.2). Schmidt counts pronouns such as yah, vah, as both personal pronouns and determiners. However, for the purposes of the tagset, the division should be sharp; therefore I have limited the ‘personal pronouns’ category to the first and second persons. The justification for this is given in section 3.4.1.1. I have also diverged from Schmidt in classing together a number of her minor categories of pronoun under the covering title ‘other’ for the purposes of this tagset definition. This gives the following groups of pronoun/determiner-like words:
• first and second person personal pronouns
• third person pronouns/demonstratives, interrogative and relative pronouns and determiners
• reflexive pronouns
• other pronouns and determiners
There is one pronoun, āp (a kind of honorific personal pronoun) which does not fit unproblematically into any of these categories. Discussion is devoted to this pronoun in section 3.4.1.2 below.
SubClass Of	PartOfSpeech
Sub-Classes	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer OtherPronounOrDeterminer PersonalPronoun PossessiveAdjective ReciprocalPronoun ReflexivePronoun
ProperNoun
Abstract	Noun/Type=proper
SubClass Of	Noun
ProximalDemonstrativeAdjective
Abstract	There are also in the y-v-k-j sets a number of words that are more like determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone as pronouns. However they behave in some respects more like adjectives, e.g. they can be predicative rather than attributive. In terms of the EAGLES guidelines they are best characterised within the pronoun/determiner category. They correspond to English words like ‘such’, ‘this/that much/many’ and so on. In terms of the Urdu tagset, I have classified them as JD, determiner-like adjectives.
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
ProximalDemonstrativeAdverb
SubClass Of	PronominalAdverb
ProximalDemonstrativeDeadjectivalAdverb
SubClass Of	PronominalAdverb
ProximalDemonstrativePronoun
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
Punctuation
Abstract	The EAGLES guidelines allow three options for the markup of word-external punctuation: firstly, to use a single tag for all punctuation marks (the obligatory-attribute-only approach); secondly, to give each punctuation mark its own separate tag; and thirdly, to group punctuation marks into a smaller number of tags according to how they may position in a sentence. The first approach I rejected on the grounds that it needlessly excluded potentially useful information. The third approach, likewise, tags different punctuation marks in the same way. Since punctuation marks can be tagged utterly unambiguously (a comma is always a comma), this is needless. The decision was therefore taken to give each punctuation mark a unique tag. This tag is, in fact, the same as the punctuation mark itself (a practice also adhered to in, for example, the C7 tagset: see 2.1.2.1). However, since the tagset is designed to operate in Unicode texts, more forms of punctuation can be distinguished (for example, opening and closing quotation marks). Some of these distinctions may be finer than is necessary (e.g. that between square and normal brackets is useless if one simply wishes to search for brackets in general) but it would be trivial to design search software that could treat the two tags as alike, or to map to a subtagset that collapsed these to a single ‘bracket’ category. There are 13 tags in this section. The EAGLES guidelines underspecify the value of the one attribute, stating values only for the full stop, comma, and question mark, so I have inferred it (using letters when the available digits ran out). For all punctuation marks, the Unicode of the Perso-Arabic tag is the same as that of the punctuation mark being tagged. The Roman tags for full stop, comma, question mark, and semi-colon consist of a different Unicode character to the punctuation mark being tagged, but otherwise likewise use the same Unicode.
With regard to paired punctuation (the quotation marks and brackets), there is a point to be made as regards directionality. The Unicode Standard specifies (Unicode 1996: 6-4) that in bi-directional text the same character, i.e. the same Unicode value, should represent the opening member of the pair whatever its appearance, and the same with the closing member of the pair. That is, the code U+0028 (OPENING PARENTHESIS) ought always to be the first of the pair, and be rendered as ‘(’ in left-to-right text, such as English, and as ‘)’ in right-to-left text, such as Urdu. Other paired punctuation marks should function similarly. Therefore for each of these marks, the Roman and Perso-Arabic tags are mirror images of one another, though they are encoded by the same numeric value. This could potentially create confusion when an analyst tags text by hand, inasmuch as the (Roman) tag will have the opposite appearance to the (Perso-Arabic) symbol in the actual text. However, this will not be problematic when tagging is automated, ‘right’ and ‘left’ meaning nothing to a computerised tagger. There remain some problematic points, for example, the ellipsis (…), angle bracket speech marks, and braces. These have not been given tags for now, on the basis that no Urdu text I have yet seen contains these symbols. However, nor does any work on Urdu rule out their use, so extra punctuation tags may prove necessary.
SubClass Of	PartOfSpeech
Sub-Classes	CloseParenthesis CloseQuotationMark CloseSquareBracket Colon Comma ExclamationMark FullStop NeutralQuotation OpenParenthesis OpenQuotationMark OpenSquareBracket QuestionMark SemiColon
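The abstract above notes that search software could treat the parenthesis and square-bracket tags alike, and that paired marks are mirrored in right-to-left text. A minimal Python sketch of both points follows. The character-to-class pairing, the `Bracket` label and the function names are assumptions made for illustration; only the class names come from the Sub-Classes list, and only Roman-script characters are covered.

```python
import unicodedata
from typing import Optional

# Roman-script punctuation characters paired with Punctuation sub-classes
# named in this ontology. The pairing itself is an assumption.
PUNCTUATION_CLASS = {
    "(": "OpenParenthesis",
    ")": "CloseParenthesis",
    "[": "OpenSquareBracket",
    "]": "CloseSquareBracket",
    ":": "Colon",
    ",": "Comma",
    "!": "ExclamationMark",
    ".": "FullStop",
    "?": "QuestionMark",
    ";": "SemiColon",
}

# Classes collapsed into one coarse category (hypothetical sub-tagset).
BRACKET_CLASSES = {
    "OpenParenthesis",
    "CloseParenthesis",
    "OpenSquareBracket",
    "CloseSquareBracket",
}


def coarse_category(char: str) -> Optional[str]:
    """Return the class for a character, collapsing all brackets to 'Bracket'."""
    cls = PUNCTUATION_CLASS.get(char)
    if cls is None:
        return None
    return "Bracket" if cls in BRACKET_CLASSES else cls


def is_mirrored(char: str) -> bool:
    """True if Unicode marks the character as mirrored in bidirectional text."""
    return bool(unicodedata.mirrored(char))
```

Here `unicodedata.mirrored` is the standard-library lookup for the Unicode mirrored property, which is what makes U+0028 render as a left or right parenthesis depending on text direction.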
QuestionMark
SubClass Of	Punctuation
QuestionMarker
Abstract	Question marker kyā
SubClass Of	Unique
RahaAuxiliary
Abstract	rahā. This auxiliary element is used in the formation of tenses in the durative aspect. It is itself the perfective participle of the lexical verb rahnā, ‘remain’, but as Schmidt (1999: 111) reports, this form ‘has been delexicalised’. It is marked for gender and number. It may seem that treating rahā as auxiliary and rahnā as lexical goes against the principle laid down in 3.2 that the distinction between lexical and auxiliary should be inherent to the verb and not dependent on context, and conflicts, for example, with the treatment of hōnā (see 3.2.2.4 below). However, this is not the case. The verb hōnā may be main but it is never lexical; rahnā is lexical when it is main, and cannot act as an auxiliary at all except for the one, very particular, delexicalised form rahā. There is a problem in the intermediate tagset, in that the EAGLES guidelines contain no value for durative aspect. Therefore, the aspect attribute is given the value zero, since the aspect is neither perfective nor imperfective. This is not a very good solution but it is preferable to adding a value, and there is no satisfactory way to mark durative in the intermediate tagset by adding an attribute. This solution also ensures that each form of auxiliary rahā has a unique value in the intermediate tagset, since every other participial element is either imperfective or perfective. Otherwise in the intermediate tagset, rahā is considered to be a non-finite participle with zero tense. When used lexically, rahā receives the tag VVYM1N, rahī receives VVYF1N or VVYF2N, and so on.
SubClass Of	AuxiliaryVerb
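The lexical-use tags quoted at the end of the abstract above (VVYM1N, VVYF1N, VVYF2N) suggest a positional layout: VV for a lexical verb, then Y or T for the perfective or imperfective participle, then gender, number and case. The Python sketch below decodes tags under that reading. The six-character layout, the case letters N and O, and all names in the code are assumptions inferred from those examples, not a definition taken from the tagset.

```python
from typing import NamedTuple


class ParticipleTag(NamedTuple):
    form: str
    gender: str
    number: str
    case: str


# Letter values assumed from the example tags and the participle description.
FORMS = {"Y": "perfective participle", "T": "imperfective participle"}
GENDERS = {"M": "masculine", "F": "feminine"}
NUMBERS = {"1": "singular", "2": "plural"}
CASES = {"N": "nominative", "O": "oblique"}


def decode_participle_tag(tag: str) -> ParticipleTag:
    """Split a tag such as 'VVYF2N' into its (assumed) positional features."""
    if len(tag) != 6 or not tag.startswith("VV"):
        raise ValueError(f"not a lexical-verb participle tag: {tag!r}")
    try:
        return ParticipleTag(
            form=FORMS[tag[2]],
            gender=GENDERS[tag[3]],
            number=NUMBERS[tag[4]],
            case=CASES[tag[5]],
        )
    except KeyError as exc:
        raise ValueError(f"unrecognised character {exc} in tag {tag!r}") from None
```

For example, `decode_participle_tag("VVYF2N")` yields a perfective participle that is feminine, plural and nominative under these assumptions.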
ReciprocalPronoun
SubClass Of	PronounOrDeterminer
ReflexivePossessiveAdjective
SubClass Of	PossessiveAdjective
ReflexivePronoun
Abstract	Unlike many European languages, Urdu reflexive pronouns are not personal. That is, they have the same form regardless of the person of the pronoun they are reflexing back to. There are two reflexive pronouns, both tagged the same, a reciprocal pronoun (which only appears within a postpositional phrase) and a reflexive possessive adjective. The reflexive possessive adjective is classed with the other possessive adjectives in the hierarchy given in 3.14. See also the discussion of the honorific usage of āp in section 3.4.1.2 above.
SubClass Of	PronounOrDeterminer
RelativeAdjective
Abstract	There are also in the y-v-k-j sets a number of words that are more like determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone as pronouns. However they behave in some respects more like adjectives, e.g. they can be predicative rather than attributive. In terms of the EAGLES guidelines they are best characterised within the pronoun/determiner category. They correspond to English words like ‘such’, ‘this/that much/many’ and so on. In terms of the Urdu tagset, I have classified them as JD, determiner-like adjectives.
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
RelativeAdverb
Abstract	A relative adverb locates an event or an object in one place or time. (Schmidt 1999, p. 218)
SubClass Of	PronominalAdverb
RelativeDeadjectivalAdverb
SubClass Of	PronominalAdverb
RelativePronoun
SubClass Of	DemonstrativeOrInterrogativeOrRelativePronounOrDeterminer
Residual
Abstract	The remaining categories (called ‘residual’ in the EAGLES guidelines) cover, quite simply, everything else. This comprises various semi-linguistic and non-Urdu elements. There are 8 such tags. Although the EAGLES guidelines allows for these elements having number and gender, I have not included this: if such an element is inflected as a verb, noun or adjective, then it may be considered sufficiently a part of that category to be tagged as such. This particularly applies to acronyms and abbreviations. Thus, the second and third EAGLES attributes, number and gender, are zero in the intermediate tags below. Every value from the first EAGLES attribute, type, has been used; with the exception of FX and FS, each tag bears the name of the value in the intermediate tagset it is mapped onto. The tag for ‘foreign words’ is meant to cover words from other languages written in the Urdu alphabet. It is not meant to cover the large number of Persian, Arabic and English loanwords that exist in Urdu, although it remains to be seen how sharp this distinction can be made in actual tagging. The tag for ‘non-Perso-Arabic string’ is for foreign words in other alphabets, or for other non-Perso-Arabic incursions into the text. FU is a catch-all ‘Unclassified’ category, although it is to be hoped that the vast majority of tokens will be catered for by at least one of the other tags outlined in this chapter.
SubClass Of	PartOfSpeech
Sub-Classes	Abbreviation Acronym ForeignWord Formula Letter NonPersoArabicString OtherSymbol OtherUnclassifiableNonUrduElement
Root
Abstract	The root consists, as its name suggests, of the root of the verb unadorned by affixation. It is not marked for person, number or gender and cannot occur as the sole verb of a main clause; it is, therefore, non-finite (untensed and also neither imperfective nor perfective in aspect). The exception to this is when it is used as an imperative form (discussed below). However, it does not fit neatly into any of the non-finite values for mood (the choices being infinitive, participle, gerund and supine). Therefore, in the intermediate tagset it is given a 0 for mood. Since this only has one form, there is only one tag. It should be noted that in the intermediate tags for this and all the following forms of lexical verb, all the tags give the status attribute the value main, since by definition a lexical verb is not an auxiliary (see the discussion of the status attribute in 3.2 above).
SubClass Of	LexicalVerb
SecondPerson
Abstract	Person=second
SubClass Of	Person
SecondPersonHonorificPronoun
Abstract	The problematic honorific pronoun āp. The case of āp, the second person honorific pronoun, is by no means as clear as that of the third person pronouns. While the fact of its identical appearance with the reflexive pronoun (also āp: see 3.4.3) suggests that, like the third person pronouns, it may be best classified elsewhere, there are two very good reasons for regarding āp as a personal pronoun like mai~ and tū. [Footnote 30: Kellogg (1875: 180-181) gives the common etymology of (what he sees as) these two pronouns in a single Sanskrit word.] The first is semantic. Semantically and pragmatically, āp has a very similar meaning to tū and its plural form tum: they both mean ‘you’. The second reason is syntactic. From the examples of āp given by Schmidt (1999), it would appear that āp has a very similar distribution to mai~ and tū. It is used, for example, as the subject of a sentence; the reflexive pronoun āp, by contrast, can never be the subject of a sentence for obvious reasons. There are, on the other hand, a number of reasons to regard āp as unlike mai~ and tū and either identical or at least more akin to the cognate reflexive pronoun (also āp). All are morphological. Firstly, āp (both the honorific and reflexive pronoun) does not have separate nominative and oblique cases, whereas mai~ and tū do. Secondly, as noted above, mai~ and tū have associated possessive adjectives. āp also has such a possessive adjective, apnā, but this is only used reflexively (see 3.4.3). When the usage is honorific, possession is expressed phrasally with the postposition kā, ‘of’. Thirdly, while mai~ and tū agree with verbal forms distinct from those used with nouns or third person pronouns, āp does not, always taking identical verbal inflections to the third person. This is what we would expect if it were simply a special usage of a reflexive pronoun. So then, is āp a second person personal pronoun or is it a special usage of the reflexive pronoun? Either position is tenable.
The syntax and semantics of the case supports the former approach while the morphology backs up the latter approach. The EAGLES guidelines cannot help in choosing between them, since this problem is an idiosyncrasy of Urdu: we would therefore not expect it to be covered by a standard drawn up for a set of languages which do not include Urdu. Ultimately, this is a case where an arbitrary decision must be taken: the decision I took was not to treat āp as a personal pronoun along with mai~ and tū. However, although arbitrary, this decision is consistent: āp will always be treated separately in this way. In fact the non-reflexive āp will be given the tag PA, so that in terms of the hierarchy of the tagset, it is categorised neither with the personal nor the reflexive pronouns, but in a separate subdivision of the pronoun category. This is, to an extent, another arbitrary decision: PPA could have been an equally reasonable tag, emphasising the similarity of syntactic function with mai~ and tū, or PRA, emphasising the similarity of its case inflections to those of the reflexive pronouns, which likewise show no difference between the nominative and oblique cases. However, to impose either of these interpretations might prove theoretically controversial, in breach of a stated design principle. Note however that in terms of the intermediate tagset, āp is still treated as a personal pronoun, because the things that it will map onto in other languages will be personal pronouns. Its number is ( 1 | 2 ), on the grounds that it may refer to one person or to more than one. Note that the intermediate tagset for pronouns contains a value, politeness; āp has been listed as polite, whereas the intermediate tags for tū as given in the next section contain the value for familiar.
SubClass Of	PersonalPronoun
SemiColon
SubClass Of	Punctuation
SentenceTagWord
Abstract	Sentence tagword (e.g. s?h?) This category is rather more open than the other ‘unique’ categories, and may in certain circumstances be ambiguous with adverbs.
SubClass Of	Unique
SingularNumber
Abstract	Number=singular
SubClass Of	Number
Subjunctive
Abstract	The subjunctive is the only form that is marked for person in Urdu lexical verbs. It is not, however, marked for gender. Therefore the intermediate tagset forms give gender as zero, mood as subjunctive and tense as present.
SubClass Of	LexicalVerb
SubjunctiveMood
Abstract	Mood=subjunctive
SubClass Of	Mood
SubordinatingConjunction
SubClass Of	Conjunction
system:Feature
Namespace	http://purl.org/olia/system.owl#
Sub-Classes	Aspect Case Finiteness Gender GenderMarking Mood Number Person Tense
system:UnitOfAnnotation
Namespace	http://purl.org/olia/system.owl#
Sub-Classes	PartOfSpeech
Tense
SubClass Of	system:Feature
Sub-Classes	FutureTense PastTense PresentTense
ThirdPerson
Abstract	Person=third
SubClass Of	Person
Unique
Abstract	Unique/unassigned (including particles, clitics and tags). The Unique category in the EAGLES guidelines is meant to contain words that are members of a one-word category; for example, the infinitive marker ‘to’ or the existential ‘there’ in English. I will first outline the general nature of the tags defined in this part of the tagset (3.12.1), before going into some depth on the problem that motivated the creation of one particular unique category, that of nongrammatical lexical element: the zimmah dār problem (3.12.2).
SubClass Of	PartOfSpeech
Sub-Classes	AdjectivalOccupationalParticle AdjectivalParticle CliticExclusiveEmphaticParticle CliticPostposition CompoundFormingConjunction ContrastiveEmphaticParticle ExclusiveEmphaticParticle InclusiveEmphaticParticle Izafat MultiplicativeMarker NongrammaticalLexicalElement PremultiplicativeCliticNumeral QuestionMarker SentenceTagWord
UnmarkedForGender
Abstract	Markedness=2
SubClass Of	GenderMarking
Verb
Abstract	There are a considerable number of factors to be taken into account in a description and categorisation of the Urdu verbal system. There are a number of inflected forms, and with the use of one or more auxiliary elements, 15 compound tenses are built up. Furthermore, any part of the compound verb-phrase may be marked for number, person or gender agreement. There are two conceivable approaches to the markup of such a compound verb-phrase. Firstly, each word could be tagged separately, regardless of its context. So for example the form that Schmidt (1999) refers to as the ‘perfective participle’ would be tagged the same regardless of what compound tense it was being used in. Secondly, compound verbs could be treated as multi-word units, each such unit receiving a single tag. The latter approach was not followed, for three reasons. In the first place, it goes against the principle that every word should have its own tag, using no multiword tags. Secondly, it goes against the suggestion made by the EAGLES guidelines that ‘In general, compound tenses are not dealt with at the morphosyntactic level, since they involve the combination of more than one verb in a larger construction’ (Leech and Wilson 1999: 63). Thirdly, it would result in the tagset being much more complicated than need be. For example, each of the 15 compound tenses would need to be distinguished. By contrast the other approach would require a relatively smaller number of distinctions to be made, between the elements of which the compound tenses are built. The over-complicated tagset design that multi-word tagging of compound verbs would necessitate would also have the drawback of going far beyond the EAGLES guidelines on verbal tags. By treating each word of the compound verb as separate, it is possible to stick fairly closely to the guidelines. The agreement attributes number, gender, and person are clearly relevant to the Urdu verbal system.
Some writers consider that Urdu displays what has been described as split ergativity (as described in section 1.1.5.4). That is, the verb agrees sometimes with the subject, and sometimes with the direct object. It may also under some circumstances agree with neither (Schmidt 1999: 125). As explained in 1.1.5.4, however, some writers (e.g. Butt 1995) disagree with this analysis. However, for the purposes of defining verbal tags the matter of ergativity is more or less irrelevant. The agreement suffixes which occur on verbs (and, therefore, the morphosyntactic categories displayed by verbs) are exactly the same regardless of which argument of the verb is being agreed with. A single morphosyntactic phenomena receives a single tag; so for example when I give a verb a tag VVYF1N (see 3.2.1.3), it is not specified whether the feminine agreement is with a subject or object. [Footnote 13: Except for one marginal case (see discussion of cāhiē in section 3.2.2.3 below).] Thus, the principle of theoretical neutrality is upheld: this analysis is as compatible with a theory in which Urdu displays split ergativity as with a theory in which it does not. Status (i.e. whether a verb is main or auxiliary) is relevant throughout. However, the way in which it has been used is a little different to that given in the EAGLES recommendations. The EAGLES guidelines suggest a main/auxiliary distinction which is context dependent. This can be seen by Leech and Wilson’s example tagset for English (1999: 72-74), in which it is made clear that the verb be can be either a main verb or an auxiliary verb. However, the distinction I have used is between lexical verbs and non-lexical auxiliary verbs. This is not context-dependent; English be would be considered an auxiliary regardless of context. The motivation for this is the decidedly irregular morphology of Urdu auxiliary verbs, most particularly hōnā, ‘be’ (see also 3.2.2.4).
This goes far beyond the inflectional oddities found in English non-lexical verbs: hōnā possesses two tenses that no other verb has, and it possesses them regardless of whether it is a main verb or not. To mark up hōnā as a main verb, there would have to be a tag, for example, for a present-tense main verb. But to include such a tag would be to vastly misrepresent the majority of Urdu verbs, which have no inflected present tense. There are similar problems with such non-lexical verbal forms as cāhiē and gā. Thus it makes sense to use the status attribute to distinguish (mostly regular) lexical verbs and (irregular) auxiliary verbs, so that the unique marking on the latter can be tagged exclusively on the latter. The optional third value of the status attribute, semi-auxiliary, has been used as described below.
SubClass Of	PartOfSpeech
Sub-Classes	AuxiliaryVerb GeneralAuxiliary LexicalVerb
VocativeCase
Abstract	Case=vocative
SubClass Of	Case
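The SubClass Of fields in the class entries above form a hierarchy that can be walked upwards from any class. The Python sketch below does this for a handful of links copied from those entries. The dictionary is only an excerpt, and the function names and the single-parent representation are assumptions made for illustration (OWL itself allows several superclasses).

```python
# Excerpt of the SubClass Of links listed in the class entries above.
SUBCLASS_OF = {
    "SecondPersonHonorificPronoun": "PersonalPronoun",
    "PersonalPronoun": "PronounOrDeterminer",
    "ReflexivePossessiveAdjective": "PossessiveAdjective",
    "PossessiveAdjective": "PronounOrDeterminer",
    "PronounOrDeterminer": "PartOfSpeech",
    "RahaAuxiliary": "AuxiliaryVerb",
    "AuxiliaryVerb": "Verb",
    "Verb": "PartOfSpeech",
    "PartOfSpeech": "system:UnitOfAnnotation",
    "VocativeCase": "Case",
    "Case": "system:Feature",
}


def ancestors(cls: str) -> list:
    """Return the superclasses of cls, nearest first."""
    chain = []
    seen = {cls}
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        if cls in seen:  # guard against a cyclic table
            raise ValueError(f"cycle detected at {cls!r}")
        seen.add(cls)
        chain.append(cls)
    return chain


def is_a(cls: str, ancestor: str) -> bool:
    """True if cls equals ancestor or inherits from it."""
    return cls == ancestor or ancestor in ancestors(cls)
```

With this excerpt, `ancestors("SecondPersonHonorificPronoun")` climbs through PersonalPronoun, PronounOrDeterminer and PartOfSpeech to system:UnitOfAnnotation, which reflects the decision recorded above to keep āp under the personal pronouns in this model.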

Object Properties

hasAspect
Range	Aspect
hasCase
Range	Case
hasFiniteness
Range	Finiteness
hasGender
Range	Gender
hasGenderMarking
Range	GenderMarking
hasInherentNumber
Range	Number
hasMood
Range	Mood
hasNumber
Range	Number
hasPerson
Range	Person
hasTense
Range	Tense
system:hasFeature
Namespace	http://purl.org/olia/system.owl#
Sub-Properties	hasAspect hasCase system:hasFeature hasFiniteness hasGender hasGenderMarking hasInherentNumber hasMood hasNumber hasPerson hasTense
Domain	system:UnitOfAnnotation
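Each object property above pairs a feature property with a range class. The Python sketch below uses those pairs to check a feature bundle before it is attached to a token. The property-to-range table is copied from the list above; the value-to-parent table is an excerpt from the class entries, and the function name and message format are hypothetical. Treating a range as a validation constraint is a simplification, since in OWL a range licenses an inference rather than rejecting data.

```python
# Range of each feature property, as listed under Object Properties.
PROPERTY_RANGE = {
    "hasAspect": "Aspect",
    "hasCase": "Case",
    "hasFiniteness": "Finiteness",
    "hasGender": "Gender",
    "hasGenderMarking": "GenderMarking",
    "hasInherentNumber": "Number",
    "hasMood": "Mood",
    "hasNumber": "Number",
    "hasPerson": "Person",
    "hasTense": "Tense",
}

# Excerpt: direct superclass of some feature-value classes.
VALUE_PARENT = {
    "SingularNumber": "Number",
    "PluralNumber": "Number",
    "VocativeCase": "Case",
    "PresentTense": "Tense",
    "SecondPerson": "Person",
    "ThirdPerson": "Person",
    "SubjunctiveMood": "Mood",
    "UnmarkedForGender": "GenderMarking",
}


def range_violations(annotation: dict) -> list:
    """List the property/value pairs whose value lies outside the property's range."""
    problems = []
    for prop, value in annotation.items():
        expected = PROPERTY_RANGE.get(prop)
        if expected is None:
            problems.append(f"{prop}: unknown property")
        elif VALUE_PARENT.get(value) != expected:
            problems.append(f"{prop}: {value} is not a known {expected}")
    return problems
```

Note that hasNumber and hasInherentNumber share the range Number, which matches the distinction drawn for possessive adjectives between agreement number and the inherent number of the possessor.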

Individuals

AL
Class	Article
AU
Class	Interjection
CC
Class	CoordinatingConjunction
CCC
Class	CorrelativeCoordinatingConjunction
CS
Class	SubordinatingConjunction
FA
Class	Acronym
FB
Class	Abbreviation
FF
Class	ForeignWord
FO
Class	Formula
FS
Class	OtherSymbol
FU
Class	OtherUnclassifiableNonUrduElement
FX
Class	NonPersoArabicString
FZ
Class	Letter
IB
Class	Postposition
II
Class	Preposition
II1
Class	SingularNumber
II2
Class	PluralNumber
IIC
Class	CliticPostposition
IIF
Class	FeminineGender
IIM
Class	MasculineGender
II_
Class	UnmarkedForGender
II_N
Class	NominativeCase
II_O
Class	ObliqueOrVocativeCase
II_gendermarked
Class	MarkedForGender
J1
Class	SingularNumber
J2
Class	PluralNumber
JD
Class	IndefiniteDeterminer
JDF
Class	Fraction
JDJ
Class	RelativeAdjective
JDK
Class	InterrogativeAdjective
JDNU
Class	CardinalNumber
JDNUC
Class	PremultiplicativeCliticNumeral
JDNUO
Class	ObliqueCase
JDN_O
Class	ObliqueOrVocativeCase
JDN_ordinal
Class	OrdinalNumber
JDV
Class	DistalDemonstrativeAdjective
JDY
Class	ProximalDemonstrativeAdjective
JD_F
Class	FeminineGender
JD_M
Class	MasculineGender
JD_O
Class	ObliqueCase
JD_U
Class	UnmarkedForGender
JD_gendermarked
Class	MarkedForGender
JJ
Class	AttributiveOrPredicativeAdjective
JP
Class	PredicativeAdjective
JXG
Class	MultiplicativeMarker
JXS
Class	AdjectivalParticle
JXV
Class	AdjectivalOccupationalParticle
JX_F
Class	FeminineGender
JX_M
Class	MasculineGender
J_F
Class	FeminineGender MarkedForGender
J_M
Class	MarkedForGender MasculineGender
J_N
Class	NominativeCase
J_O
Class	ObliqueOrVocativeCase
J_U
Class	UnmarkedForGender
LL
Class	NongrammaticalLexicalElement
NN
Class	CommonNoun
NP
Class	ProperNoun
N_F
Class	FeminineGender
N_M
Class	MasculineGender
N__M
Class	MarkedForGender
N__U
Class	UnmarkedForGender
N___1
Class	SingularNumber
N___2
Class	PluralNumber
N____N
Class	NominativeCase
N____O
Class	ObliqueCase
N____V
Class	VocativeCase
OO
Class	CompoundFormingConjunction
PA
Class	HonorificSecondPerson SecondPersonHonorificPronoun
PG
Class	PossessiveAdjective
PGR
Class	ReflexivePossessiveAdjective
PGRF
Class	FeminineGender
PGRM
Class	MasculineGender
PG_1
Class	SingularNumber
PG_2
Class	PluralNumber
PG_F
Class	FeminineGender
PG_M
Class	MasculineGender
PJ
Class	RelativePronoun
PK
Class	InterrogativePronoun
PN
Class	IndefinitePronoun
PP
Class	PersonalPronoun
PPM
Class	FirstPerson
PPT
Class	SecondPerson
PP_1
Class	SingularNumber
PP_2
Class	PluralNumber
PP_N
Class	NominativeCase
PP_O
Class	ObliqueCase
PRC
Class	ReciprocalPronoun
PRF
Class	ReflexivePronoun
PU1
Class	FullStop
PU2
Class	Comma
PU3
Class	QuestionMark
PU4
Class	ExclamationMark
PU5
Class	Colon
PU6
Class	SemiColon
PU7
Class	NeutralQuotation
PU8
Class	OpenQuotationMark
PU9
Class	CloseQuotationMark
PUA
Class	OpenParenthesis
PUB
Class	CloseParenthesis
PUC
Class	OpenSquareBracket
PUD
Class	CloseSquareBracket
PV
Class	DistalDemonstrativePronoun
PY
Class	ProximalDemonstrativePronoun
P_1
Class	SingularNumber
P_2
Class	PluralNumber
P_E
Class	ObliqueCase
QQ
Class	QuestionMarker
RD
Class	DegreeAdverb
RJ
Class	RelativeAdverb
RJJ
Class	RelativeDeadjectivalAdverb
RK
Class	InterrogativeAdverb
RKJ
Class	InterrogativeDeadjectivalAdverb
RM
Class	ModalAdverb
RMN
Class	NegativeModalAdverb
RR
Class	GeneralAdverb
RRJ
Class	DeadjectivalAdverb
RV
Class	DistalDemonstrativeAdverb
RVJ
Class	DistalDemonstrativeDeadjectivalAdverb
RY
Class	ProximalDemonstrativeAdverb
RYJ
Class	ProximalDemonstrativeDeadjectivalAdverb
TT
Class	SentenceTagWord
V1
Class	SingularNumber
V2
Class	PluralNumber
VC
Class	CahieAuxiliary
VG
Class	GaAuxiliary
VGF
Class	FeminineGender
VGM
Class	MasculineGender
VH
Class	HonaAuxiliary
VHH
Class	IndicativeMood PresentTense
VHN
Class	InfinitiveMood
VHP
Class	IndicativeMood PastTense
VR
Class	RahaAuxiliary
VV
Class	LexicalVerb
VV0
Class	Root
VVI
Class	Imperative ImperativeMood
VVIA
Class	HonorificSecondPerson ImperativeMood
VVN
Class	Infinitive InfinitiveMood
VVNF
Class	FeminineGender
VVNM
Class	MasculineGender
VVS
Class	Subjunctive SubjunctiveMood
VVSM
Class	FirstPerson
VVST
Class	SecondPerson
VVSV
Class	ThirdPerson
VVT
Class	ImperfectiveAspect ImperfectiveParticiple ParticipleMood
VVY
Class	ParticipleMood PerfectiveAspect PerfectiveParticiple
VX
Class	GeneralAuxiliary
V_N
Class	NominativeCase
V_O
Class	ObliqueCase
XB
Class	InclusiveEmphaticParticle
XH
Class	ExclusiveEmphaticParticle
XHC
Class	CliticExclusiveEmphaticParticle
XT
Class	ContrastiveEmphaticParticle
ZZ
Class	Izafat
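
Several of the individuals above (for example N_F, N__M, N___1) look like positional patterns over full tags. The following Python sketch shows one way such patterns could be resolved for the noun individuals, assuming that an underscore stands for any single character and that a pattern only has to match the start of a tag. Both assumptions, the function names and the sample tag NNFU2O are illustrative and are not confirmed by the ontology; the pattern-to-class pairs are copied from the list above.

```python
# Noun individuals and their classes, copied from the Individuals list.
NOUN_PATTERNS = {
    "NN": "CommonNoun",
    "NP": "ProperNoun",
    "N_F": "FeminineGender",
    "N_M": "MasculineGender",
    "N__M": "MarkedForGender",
    "N__U": "UnmarkedForGender",
    "N___1": "SingularNumber",
    "N___2": "PluralNumber",
    "N____N": "NominativeCase",
    "N____O": "ObliqueCase",
    "N____V": "VocativeCase",
}


def matches(pattern: str, tag: str) -> bool:
    """Assumed semantics: '_' matches any one character, pattern is a prefix."""
    if len(tag) < len(pattern):
        return False
    return all(p == "_" or p == t for p, t in zip(pattern, tag))


def classes_for(tag: str) -> set[str]:
    """Collect the classes of every noun pattern that matches the tag."""
    return {cls for pattern, cls in NOUN_PATTERNS.items() if matches(pattern, tag)}


# Hypothetical tag: common noun, feminine, unmarked, plural, oblique.
print(sorted(classes_for("NNFU2O")))
```

The prefix assumption does not carry over to every individual in the list: names such as II_gendermarked or JDN_ordinal are plainly labels rather than patterns, so a full implementation would need the matching rules stored in the ontology itself.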