OLiA Annotation Model for the morphosyntactic annotation of the Urdu section of the EMILLE corpus (Hardie 2003, 2004). Unless marked otherwise, all comments are quotes from Hardie (2004), Chapter 3.
The tagset discussed
here was created in accordance with the EAGLES guidelines for morphosyntactic annotation of
corpora. Although these guidelines were written to cover the languages of the European Union, they
can be applied fairly easily to Urdu, which, coming as it does from another branch of the Indo-
European family, is structurally quite similar. They can also be extended to deal with the idiosyncrasies
presented by Urdu grammar. (Hardie 2003)
The first stage of the work was to develop a tagset for use in Urdu texts and corpora, an area
which has not been researched extensively heretofore. The next stage, now underway, is to test the
tagset’s usability in manual tagging, and build up a set of tagged texts to serve as training data for the
final phase of this part of the project. This will be to automate the tagging and subsequently tag the
whole of the EMILLE Urdu corpus. (Hardie 2003)
References
Hardie, A (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In: Corpus Linguistics 2003, 2003-03-01, Lancaster. http://eprints.lancs.ac.uk/103/
Hardie, Andrew (2004) The computational analysis of morphosyntactic categories in Urdu. Other thesis, Lancaster University. http://eprints.lancs.ac.uk/106/
Ruth Laila Schmidt (1999) Urdu, an essential grammar, Routledge, London.
Adjectival / occupational particle (vālā)
This element is the source of the English word / suffix ‘wallah’ (Kachru 1990: 70), which may help
the reader to gain some grasp on its meaning.
use ... refers to whether an adjective may be used in
attributive or predicative positions only. The default value for this is naturally both. In
the absence of a specification in the EAGLES guidelines, I represent this with 0.
There are a number of common Perso-Arabic adjectives in Urdu that can only be used
in predicative position (Schmidt 1999: 37), for which this attribute can take the value
2. This is the rationale for including this attribute, which is however a prime candidate
to be underspecified in a practical subtagset. It is anticipated that it will be difficult for
a POS tagger to detect predicate-only adjectives. Since the predicate-only adjectives
are Perso-Arabic, it ought to follow that they are all unmarked adjectives. However,
this is a point on which Schmidt (1999) is silent. For this reason, tags have been
included for predicate-only adjectives that are marked for gender/number/case. These
may need to be removed if it turns out from the data that they do indeed describe nonexistent
categories, as I suspect.
It should be noted at the outset that I treat as adpositions those elements of
Urdu that some writers (e.g. Kellogg 1875, Butt 1995) describe as case suffixes or
clitics. This is firstly because Schmidt (1999), the model of the language being used,
does so. Secondly, however, treating nē (among other markers) as adpositions allows
theoretical neutrality to be maintained on the question of whether Urdu displays
ergativity [48].
[48] See also the discussion of the ergativity controversy in 1.1.5.4 and the discussion of noun cases and
the etymology of postpositions in 3.1.3.
The EAGLES guidelines give only one attribute for adpositions, Type, which
has a range of recommended and optional values: preposition, fused preposition-article,
postposition, and circumposition. The second and fourth of these do not apply
to Urdu, which lacks articles and circumpositions. The vast majority of Urdu
adpositions are postpositions, but there are some prepositions borrowed from Persian
and Arabic (Schmidt 1999: 68, 250, 267), so this attribute is relevant.
There are two other issues. The first is that of izāfat (Bhatia and Koul 2000:
339; Schmidt 1999: 246-247). The izāfat is a Persian enclitic (pronounced as a shorter
form of ???) which in some circumstances can be considered a preposition: it links
two nouns in a possessive relationship, although the phrase thus produced may often
have a different meaning to a phrase produced with the native Urdu postposition kā.
However, the izāfat may also join a noun to an adjective, in which case it is not so
clearly accurate to describe it as a preposition parallel to the prepositions in European
languages for which the EAGLES guidelines were compiled. A better way to treat
izāfat is in the context of the Unique category of miscellaneous one-member wordclasses,
discussed below.
The second issue is that in Urdu, the postposition kā can be marked for
number/gender/case agreement (Schmidt 1999: 68-69). It does not agree with the
noun it governs, but with the head noun of the noun phrase that contains its
postposition phrase. This is not a phenomenon allowed for by the EAGLES guidelines
as they now stand. kā takes the same inflectional endings as marked adjectives
(having the forms kā, kē, and kī). Therefore, it is necessary for the same
number/gender/case categories to be distinguished by the tagset for postpositions as
for adjectives. This means that the intermediate tagset contains three more attributes
than are suggested in the EAGLES guidelines.
As with verbs, there are lexical and non-lexical adverbs, which will be
considered in turn.
In the EAGLES guidelines, the recommended attribute for adverbs is degree [44],
which is not relevant morphologically to Urdu (as discussed with reference to
adjectives: see 3.3 above). However, the remaining three features are relevant, and
have been included. These are adverb-type, which distinguishes general and degree
adverbs, and polarity and wh-type, which distinguish interrogative and relative
pronouns. The following summarises the features used in the intermediate tagset.
There are a total of 13 adverb tags.
Articles
Urdu lacks articles. However, some phrases borrowed from Arabic contain the
clitic Arabic definite article, which receives the single tag AL (the spelling of the
Arabic article). I have not included a C in this tag, as I have done for other clitics (see
section 3.12), because this would make the tag less transparent. The use of the AT
intermediate tag could be queried here, because the use of the Arabic definite article
in Urdu does not parallel that of, for example, the in English or le/la/les in French. For
example, the Arabic definite article is only found with Arabic loanwords, whereas of
course the can appear with the vast majority of nouns in English. However, on
balance it seems that this disadvantage is outweighed by the advantage of indicating
that the Arabic definite article in Urdu does do pretty much what other languages’
articles do. Khoja et al.’s (2001) Arabic tagset does not have a separate tag for the
article, but considers definiteness a feature of nouns: this would not be an appropriate
approach for Urdu because non-Arabic nouns cannot be made definite by use of the
Arabic definite form.
It should be noted that, whereas I have in this category treated all auxiliary
elements as verbs, in the terms of the EAGLES guidelines for intermediate tagsets
some could easily be characterised as unique or unassigned words (see below). The
EAGLES guidelines treat the English infinitive marker to in this manner, for example.
However, treating them as verbs in the intermediate tagset is firstly in keeping with the
structure of the Urdu tagset, and secondly allows verbal attributes such as gender and
number to be used (the EAGLES unique intermediate tags include no such attributes).
The word cāhiē is used in combination with the infinitive of a lexical verb to
express advisability. It is also used (as described by Bhatia and Koul 2000: 60) as a
polite form of the verb cāhnā, ‘want’. It is derived from an old morphologically
marked passive form (Schmidt 1999: 137) of cāhnā; however, cāhnā is a lexical
verb and other than this use of cāhiē, it does not deviate from the pattern of other
lexical verbs. Therefore the best approach would seem to be to give cāhiē its own tags
(it requires two tags because it agrees with the number of the object of the preceding
infinitive in certain circumstances). This is the approach taken in many English
tagsets for modal auxiliary verbs, which are, like cāhiē, anomalous forms. The
intermediate tags given to cāhiē and its plural form cāhiē~ list them as being without
person or gender, without finiteness (since it can be used with or without a following
tense-bearing auxiliary), indicative, present tense and without aspect. In the
descriptions, these words are defined as ‘cāhiē-type’, rather than attempting to find an
English word to accurately summarise the range of meanings associated with
desirability and/or advisability that these words can convey.
Numeral/type=cardinal
Cardinal numbers function as grammatically unmarked determiner-like
adjectives (Schmidt 1999: 228). However, they can appear in the oblique plural (with
the same suffix as an unmarked noun) to express totality (Schmidt 1999: 10-11).
There is therefore an additional tag for this (indicated only by O, since there is no
oblique singular to make a contrast). In the intermediate tagset I have given their
function as determiner, in line with the determiners that are in the pronoun category
above. Numerals are to be tagged as below, even if written as figures rather than
words (and whatever set of figures is used: Urdu uses both the Western European
and the Arabic-Indic digits).
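As a hedged illustration of the point about figures: a tagger front end has to recognise a numeral token whichever digit set it is written in. The minimal Python sketch below relies only on the standard behaviour of `str.isdecimal()` and `int()`, both of which accept Unicode decimal digits. The function names and sample tokens are hypothetical and are not part of the tagset or of any EMILLE tooling.

```python
def is_figure_numeral(token: str) -> bool:
    """Return True if every character is a decimal digit in any script."""
    # str.isdecimal() is True for Western digits and Arabic-Indic digits alike;
    # it is False for the empty string and for ordinary words.
    return token.isdecimal()


def numeral_value(token: str) -> int:
    """Convert a figure-written numeral to an int, whatever digit set is used."""
    return int(token)


# Hypothetical sample tokens: Western digits, Arabic-Indic digits,
# Extended Arabic-Indic digits, and an ordinary word.
samples = ["1947", "\u0661\u0669\u0664\u0667", "\u06f1\u06f9\u06f4\u06f7", "kal"]
figure_tokens = [tok for tok in samples if is_figure_numeral(tok)]
values = [numeral_value(tok) for tok in figure_tokens]
```

All three figure-written samples resolve to the same integer, so a single numeral tag can be assigned without regard to the digit set.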
In the model of the language given by Schmidt, Urdu has three cases,
nominative, oblique and vocative. McGregor (1972: 1-2) uses a different
classification, treating the vocative as a special form of the oblique case. However,
since the special form would still need to be tagged separately, it makes sense to treat
it as a vocative case, a phenomenon for which the EAGLES guidelines already allow.
As Schmidt (1999: 7) points out, some grammarians have treated Urdu
postpositions as being either suffixes or clitics indicating cases, in which case Urdu
would possess many more than three cases. However, this is a minority view amongst
writers of general grammars: Schmidt (1999), Barz (1977), Bhatia and Koul (2000),
McGregor (1972), Bailey et al. (1956) all do not treat postpositions as marking cases.
There is an etymological basis for this view. Kellogg (1875: 128-133) reports that the
postpositions do not derive from Sanskrit case markers, but rather from independent
words (e.g. kō, ‘to’, from Sanskrit k?kshe, ‘armpit, side’; mē~, ‘in’, from Sanskrit
madhye, ‘middle’, both locative nouns; tak, ‘until’, from the Sanskrit past participle
tarita, ‘passed to’, plus a dative affix ku.). Furthermore, the suffix/clitic approach
would require case to be determined across multi-token units, which would breach the
design principle of including no multiword tags. It would also have implications for
the principle of theoretical neutrality, since it would be necessary to take some
standpoint on the subject of whether or not Urdu has ergative case marking, a
theoretically controversial point (see 1.1.5.4). Thus I use the nominative-oblique-vocative
distinction as exemplified below:
laRkā, laRkē ‘boy(s)’ (nominative singular/plural)
laRkē, laRkō~ (oblique singular/plural)
laRkē, laRkō (vocative singular/plural)
(example from Schmidt 1999: 10-12)
There is something of an issue with the names of the cases. Vocative is
straightforward enough, and is one of the values given for the case attribute in the
EAGLES guidelines. Nominative, however, is usually given meaning by its contrast
with accusative, a case that does not exist in Urdu. The nominative may in Urdu be
used for either, neither or both of the subject and the direct object. Thus it is not
certain whether the nominative in Urdu really corresponds with the nominative that is
value 1 in the EAGLES guidelines. Certainly it does not correspond with the
nominative as it exists in, for example, German or Latin. However, I have used value
1 in the intermediate tagset for the Urdu case, on the basis that no Urdu case
resembles the nominative in the European languages for which the EAGLES
guidelines were devised any more closely than the Urdu nominative.
There is no value in the EAGLES guidelines for oblique. Nor is there one for
postpositional, locative or instrumental (alternative names used by Bailey et al. 1956
for this case). Rather than invent an extra value (undesirable for reasons given with
regard to markedness above), I have used the value for dative to represent oblique, on
the grounds that in some European languages (e.g. German) prepositions frequently
govern the dative, and in Urdu postpositions govern the oblique.
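The case correspondences described above can be summarised as a small lookup. This is only a sketch: the dictionary and function names are hypothetical, and the EAGLES side is given by case name rather than by numeric value, since the text confirms only that nominative is value 1.

```python
# Mapping from the three Urdu cases to the EAGLES case each one is
# represented by in the intermediate tagset, as described in the text.
URDU_CASE_TO_EAGLES = {
    "nominative": "nominative",  # EAGLES value 1; no closer match exists
    "oblique": "dative",         # EAGLES has no oblique; dative chosen by analogy
    "vocative": "vocative",      # already provided for by EAGLES
}


def eagles_case(urdu_case: str) -> str:
    """Return the EAGLES case name used to represent an Urdu case."""
    try:
        return URDU_CASE_TO_EAGLES[urdu_case.lower()]
    except KeyError:
        raise ValueError(f"not an Urdu case in this model: {urdu_case!r}") from None
```

Because no extra value is invented, software that knows only the EAGLES cases can still read the Urdu intermediate tags, at the cost of the dative label being a stand-in for oblique.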
The EAGLES guidelines suggest that conjunctions be classified firstly for
whether they are coordinating or subordinating, and then secondly as one of four coordinating
types or one of three subordinating types. I have disregarded the attribute
for subordinate-type, since it was developed for German and does not seem relevant
to Urdu subordinating conjunctions as described by Schmidt (1999: 223-227). Urdu
correlative conjunctions (such as bhī … bhī, yā … yā) do not have initial and non-initial
forms, so those features are also not needed. This gives three types of conjunctions:
simple coordinating, correlative coordinating, and subordinate. Note that phrases
involving the relative j-set of pronouns, adjectives and adverbs are often translated by
conjunctions, but are not to be tagged as such.
The EAGLES guidelines (Leech and Wilson 1999: 68) specify that a
conjunction is correlative when it is at the start of the first of a pair of correlated
clauses. The conjunction at the start of the second half of the pair is then a simple
coordinating conjunction (CC). This practice will be followed to ensure compliance
with the EAGLES guidelines.
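The pairing rule just described can be sketched as follows. This is an illustration under simplifying assumptions, not the tagging procedure used for the corpus: the forms are ASCII stand-ins for the two correlative conjunctions mentioned above, the label `CORRELATIVE` is a placeholder (the actual tag name is not given in this section), and only conjunctions that occur at least twice in the token list are treated as correlated.

```python
from collections import Counter

# ASCII stand-ins for the correlative forms mentioned in the text (assumption).
CORRELATIVE_FORMS = {"bhI", "yA"}


def tag_correlative_pairs(tokens):
    """Label paired correlative conjunctions; every other token gets None."""
    counts = Counter(tokens)
    seen = set()
    labels = []
    for tok in tokens:
        if tok in CORRELATIVE_FORMS and counts[tok] >= 2:
            # The first member of the pair is correlative;
            # the member opening the second half is a simple CC.
            labels.append("CC" if tok in seen else "CORRELATIVE")
            seen.add(tok)
        else:
            labels.append(None)
    return labels


# Placeholder clause contents X and Y stand for arbitrary words.
labels = tag_correlative_pairs(["yA", "X", "yA", "Y"])
```

An unpaired occurrence is deliberately left unlabelled here, since deciding its tag needs information the sketch does not model.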
Third person pronouns/demonstratives, interrogative and relative
pronouns and determiners
This class of pronouns consists of all those pronouns that fall into the parallel
classes of what Schmidt (1999: 39) calls ‘symmetrical y-v-k-j word sets’. These
classes contain a variety of pronouns and adjectives that are of similar form, the first
letter indicating what set they belong to, thus:
- y or a vowel indicates the set of proximal demonstratives (this, now, etc.)
- v or t indicates the set of distal demonstratives (that, then, etc.)
- k indicates the set of interrogatives (who, what, how, etc.)
- j indicates the set of relative words (who, where, whither, etc.)
Thus, in Urdu there is 1) a significant distinction between proximal and distal
words, for which there is no distinction in the EAGLES guidelines; 2) a significant
distinction between interrogatives and relatives, which is only made by the EAGLES
guidelines at the secondary optional level (the recommended features include only
int./rel., presumably on the basis that these have similar forms in many European
languages: the so-called wh-words). This means that the intermediate tags for these
pronouns are not as elegant as they might be, and the tags for the y-set and the v-set
are the same. However, I will make this distinction in the Urdu tags, which begin
with P followed by the letter of the relevant y-v-k-j set.
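The initial-letter pattern listed above can be sketched in a few lines of Python. The sketch assumes romanised input and is only meaningful for words already known to belong to the y-v-k-j sets (an arbitrary k-initial word is of course not an interrogative). The function names are hypothetical, and the words ab, tab, kaun and jo in the usage note are illustrative additions that do not come from the text.

```python
import unicodedata

VOWELS = set("aeiou")


def word_set(word: str):
    """Return 'y', 'v', 'k' or 'j' for a member of the y-v-k-j sets, else None."""
    if not word:
        return None
    # Decompose so that an initial vowel carrying a macron still counts as a vowel.
    first = unicodedata.normalize("NFD", word)[0].lower()
    if first == "y" or first in VOWELS:
        return "y"  # proximal demonstratives
    if first in ("v", "t"):
        return "v"  # distal demonstratives
    if first in ("k", "j"):
        return first  # interrogatives / relatives
    return None


def pronoun_tag_prefix(word: str):
    """Return the first two letters of the Urdu pronoun tag: P plus the set letter."""
    s = word_set(word)
    return "P" + s.upper() if s else None
```

For example, `pronoun_tag_prefix("yah")` gives `PY` and `pronoun_tag_prefix("vah")` gives `PV`, while `word_set("ab")` and `word_set("tab")` fall into the proximal and distal sets respectively.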
The proximal and distal demonstratives have not been distinguished for any
other language that I am aware of. For example, no English tagset I know of
distinguishes here/hither from there/thither. However, most distinguish
where/whither from the non-interrogative/relative words. In Urdu, the ‘near~far’
phonological pattern is much more consistent (there are no odd pairs such as English
this~that) and is formally of an equal degree to the ‘demonstrative~interrogative’
distinction. Furthermore, there is a difference of usage between the proximal and
distal sets: the latter are used in correlative clauses where the former are not [37]. For
this reason I tag the four-way distinction, since it would be odd to arbitrarily merge
two of what are on a language-internal basis clearly different categories.
The pronouns in the y-v-k-j sets are used as demonstrative pronouns and third
person personal pronouns (so yah and vah [38] mean both ‘this’ and ‘that’ and
‘he/she/it’). They can also act as determiners within a noun phrase. I have not tagged
these uses differently, because this would fall under the heading of syntactic
information, which this tagset does not include. See also section 3.4.1.1.
I do not, as Schmidt (1999: 38-41) does, characterise the determiner-usage as
adjectival, since these pronouns do not display gender agreement, as adjectives
(including other members of the y-v-k-j sets) do. They are however marked for case
and number [39]. They also have the peculiarity that their plurals have a third case-like
form, which appears solely before the postposition nē (which indicates the subject of
an ergative-type clause). This is tagged separately (and, like the proximal/distal
distinction, not distinguished in the intermediate tagset, since it is difficult to see how
this could be achieved).
There are two interrogative pronouns, both beginning in k; one means ‘what’
and one means ‘who’. They both receive the same tags, since tagging an animacy
distinction would be odd when this is done nowhere else in the tagset.
[37] There is one minor exception to this (Schmidt 1999: 206).
[38] These two words are almost always transcribed as yē and vō, which is how they are pronounced.
However, the spellings with h are closer to the Perso-Arabic (Bhatia and Koul 2000: 36).
[39] However, in the nominative case the singular and plural forms are identical.
In the intermediate tagset, following what is done for such pronouns in the
example English tagset given in the EAGLES guidelines, I give person as zero, and
for the k-set words the wh-type is ?240, since kyā may also be exclamatory. The
category attribute is both, because these words are both pronouns and determiners.
There are also in the y-v-k-j sets a number of words that are more like
determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone
as pronouns. However they behave in some respects more like adjectives, e.g. they
can be predicative rather than attributive. In terms of the EAGLES guidelines they are
best characterised within the pronoun/determiner category. They correspond to
English words like ‘such’, ‘this/that much/many’ and so on. In terms of the Urdu
tagset, I have classified them as JD: determiner-like adjectives.
The last two attributes, finiteness and mood, are problematic. Firstly, inherent
in the EAGLES guidelines is the problem that the mood attribute contains values
relevant to both finite and non-finite forms, so that the finiteness attribute becomes
redundant. Secondly, the finite/non-finite distinction may be hard to draw in Urdu.
The forms described below as participles would traditionally be considered non-finite
in European languages. However, in Urdu they have certain features which make
them seem more like finite forms. For example, they can occur as the only verb in a
main clause, and can agree with a subject or object, which is not a property prototypically
associated with non-finite forms. These properties are illustrated by the following
example from Schmidt (1999: 126):
unhō~       nē   an paRh      kī      bāt     nahī~  mānī
3-PLRL-OBL  ERG  un educated  of-FEM  speech  not    accept-PERF.PART-FEM-SING
‘They did not accept what the uneducated person said.’
The verb form mānī is a participle, but it is the only verb form in the sentence,
and it is marked for agreement (with the object, since this clause is of the ergative
type). It, like the postposition kī, agrees with the feminine singular noun bāt.
Urdu has a fairly wide range of words for fractions (there are for example
words for ‘plus one quarter’ (savā), ‘less one quarter’ (paun, paunē), ‘one half’
(ādh, ādhā), ‘one and a half’ (DēRh), ‘plus one half’ (sāRhē)), which can modify
cardinal numerals as well as nouns. They are therefore tagged separately (although the
intermediate tags are not all distinct). Most are unmarked, but two are marked. Two
others can also function as nouns, in which case they should receive standard noun
tagging.
The form gā indicates future tense when it follows a verb in the subjunctive
form. It may also follow the polite imperative as a marker of additional politeness
(Bhatia and Koul 2000: 332). It is considered by Schmidt (1999) to be a suffix,
although one that is written as a separate word; Bhatia and Koul (2000) go so far as to
write the inflected verb and the gā as a single word. However, given that the
orthography must lead gā to be treated by the tagging system as a separate token (see
2.2.6.1), and given that the form of the future is otherwise identical to the subjunctive,
it makes sense to tag gā separately from the lexical verb. Since gā is marked for
gender and number and the subjunctive is marked for person and number, the future
would, if treated as a simple rather than a compound tense, be marked for all three of
these features, which is not true of any other simple tense in Urdu. Furthermore, as
Schmidt (1999: 94) explains, gā derives from a contraction of the perfective participle
of the verb jānā, ‘go’. Therefore, gā is tagged independently.
In the intermediate tagset it is considered to be finite, indicative, future, and
with zero aspect.
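The feature argument above can be made concrete with a small sketch. It assumes a hypothetical dictionary representation of morphological features (not a format defined by the tagset): the subjunctive verb contributes person and number, the gā token contributes gender and number, and only their combination carries all three.

```python
def compound_future(subjunctive: dict, ga: dict) -> dict:
    """Combine the features of a subjunctive verb and a following gā token."""
    # Number is marked on both tokens, so the two values must agree.
    if subjunctive["number"] != ga["number"]:
        raise ValueError("subjunctive verb and gā disagree in number")
    return {
        "person": subjunctive["person"],  # from the subjunctive only
        "number": subjunctive["number"],  # marked on both tokens
        "gender": ga["gender"],           # from gā only
        "tense": "future",
    }


# Hypothetical feature values for a third person singular masculine future.
verb = {"person": 3, "number": "singular"}
ga_token = {"gender": "masculine", "number": "singular"}
combined = compound_future(verb, ga_token)
```

Tagging the two tokens separately keeps each tag within the feature inventory of its own form, and the combined reading remains recoverable as shown.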
Urdu has two genders, masculine and feminine. Some nouns are marked for
gender, whereas others are not. This means that there is in effect a four-way
distinction among nouns: masculine marked, masculine unmarked, feminine marked
and feminine unmarked. For example:
rūpayah ‘money’ (marked masculine)
ghar ‘house’ (unmarked masculine)
baccī ‘female child’ (marked feminine)
kitāb ‘book’ (unmarked feminine)
(examples from Schmidt 1999: 1-2.)
Note that since some unmarked nouns coincidentally display the suffixes
typical of marked nouns, the diagnostic feature of a marked noun is that its plural
inflection follows that of the marked nouns (e.g. masculine -ā changing to -ē,
feminine -ī to -iyā~, and so on).
This four-way split could be encoded into a tagset in two ways: by creating
two new values for the gender attribute (the EAGLES guidelines have only
masculine, feminine, neuter, and common) or by creating a new markedness attribute
with two values, 1 = marked for gender and 2 = not marked for gender. The latter
approach has been followed since it will almost certainly be easier for software
processing the intermediate tagset to ignore an entire attribute than to work out what
to do about values it does not recognise in existing attributes. This is especially the
case if the extra attribute is added at the end of the tag, as I have done.
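The design choice can be illustrated with a minimal sketch of a consumer reading positional tags. Everything concrete here is an assumption made for illustration: the one-digit-per-attribute string format, the attribute order, and the sample values are hypothetical. Only the markedness values (1 = marked, 2 = not marked) come from the text.

```python
# Attribute order assumed for illustration only.
EAGLES_NOUN_ATTRS = ["type", "gender", "number", "case"]
URDU_NOUN_ATTRS = EAGLES_NOUN_ATTRS + ["markedness"]  # extra attribute at the end


def parse_tag(tag: str, attrs):
    """Parse a positional tag: one category letter, then one digit per attribute."""
    category, values = tag[0], tag[1:]
    parsed = {"category": category}
    # zip() stops at the shorter sequence, so trailing positions that this
    # consumer has no name for are silently ignored.
    parsed.update({name: int(v) for name, v in zip(attrs, values)})
    return parsed


tag = "N11212"  # hypothetical tag; the final 2 means not marked for gender
as_eagles = parse_tag(tag, EAGLES_NOUN_ATTRS)
as_urdu = parse_tag(tag, URDU_NOUN_ATTRS)
```

A consumer that knows only the EAGLES attributes reads the first four positions unchanged and never sees the markedness digit, whereas a new value inside the gender attribute would have reached it as an unrecognised gender.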
Adverb/Adverb-Type=general
"lexical adverb"
Lexical adverbs
In Urdu these are of two sorts: adverbs which are derived from adjectives by
inflecting them to their masculine oblique form or adding a Persian or Arabic loaned
derivational suffix (RRJ), and adverbs which are not (RR). While this unfortunately
violates the principle of not including derivational information, this distinction has
been included in the tagset for two reasons.
[44] This use of ‘degree’ (i.e. inflected superlative or comparative) should be clearly distinguished from
the use of ‘degree adverb’ below (i.e. words with meanings such as ‘very’, ‘more’).
Firstly, it helps avoid ambiguity, since an adverb derived from an adjective has
the same form as that adjective in its masculine singular oblique form (see Schmidt
1999: 57). If adjectival adverbs were marked RR, this would lead to a wide ambiguity
between RR and JJM1O, which would make non-adjectival adverbs ambiguous as
well! Using a separate tag, there is only an RRJ~JJM1O ambiguity, which
significantly reduces the scope of the ambiguity. Although this is a pragmatic
consideration which should probably be included at the subtagset level, it involves
creating a distinction rather than collapsing one, and must thus exist in the top level
tagset.
However, there is another motivation for the RRJ tag, which is that it is
necessary to maintain theoretical neutrality. It is possible that some analyst might
wish to treat the RRJ adverbs as if they were actually adjectives, that is, identify
them with JJ... categories instead of RR. Indeed Bailey et al. (1956: 18) come close to
saying this. The principle of theoretical neutrality must here override the principle of
excluding derivational information.
The EAGLES intermediate tags for RR and RRJ are the same.
The verb hōnā, ‘be’, is the auxiliary with the greatest range of application: the
Urdu compound tenses are formed with it, and it has other uses, such as the copula. It
can also be the sole verb of a main clause, but as explained above (section 3.2) it will
be tagged the same whether it is a main verb or an auxiliary. The following examples
from Schmidt (1999: 94, 120, 126) demonstrate the range of hōnā:
āj     mai~        daftar  mē~  nahī~  hū~
today  1-SING-NOM  office  in   not    be-PRES-1-SING
‘Today I am not in the office’ (hōnā as copula with postpositional phrase)
kal        mausam   acchā               thā
yesterday  weather  good-MASC-SING-NOM  be-PAST-MASC-SING
‘Yesterday the weather was fine’ (hōnā as copula with adjective)
ham         farš   par  sōtē                         hai~
1-PLRL-NOM  floor  on   sleep-IMPERF.PART-MASC-PLRL  be-PRES-1-PLRL
‘We sleep on the floor’ (hōnā as auxiliary marking the habitual present with
imperfective participle)
bāriš  hūī                    hai
rain   be-PERF.PART-FEM-SING  be-PRES-3-SING
‘It has rained’ (hōnā as auxiliary marking immediate past with perfective participle of
hōnā as main verb; a more literal translation would be ‘There has been rain’)
Some of the parts of hōnā are equivalent to the parts of lexical verbs; this
being so, their tags are the same as those of lexical verbs, except that they commence
in VH... instead of VV.... In the intermediate tagset, this difference is expressed by the
verbs being marked as auxiliary instead of main. Unfortunately, Schmidt (1999) does
not give a full listing of all the forms of hōnā, and I was forced to use other methods
as outlined in 2.3. The first recourse was to refer to other works, in this case Bailey
et al. (1956). However, there were still gaps in the listing of forms of hōnā. When
initially composing the tagset, I was forced by the underspecification in the literature
to infer the existence and shape of some forms of the infinitive and imperative. In the
case of an irregular verb like hōnā, inferring its forms on the basis of regular verbal
inflections involves making unwarranted assumptions. Therefore, these forms were
treated as highly provisional in nature until the stage of manual tagging was
undertaken (as described in the next chapter). At this point, it was possible to find
examples in tagged texts for most of the forms. The polite imperative was a very
notable exception to this. It did not occur in any of the manually tagged texts, and of
two native speaker informants consulted on the issue, one concluded that the form
hōiyē was not possible. However, the other informant suggested that it was possible.
This being the case, the VHIA tag stands, since there can be no harm in maintaining
the parallelism with other verbs even if this form is rare to vanishing point.
The past participle of hōnā, as with that of other verbs, can be used alone as a
simple past tense. The participial tags above would be used in this case. However,
there is also an irregular inflected simple past tense, which, as might be expected,
differs slightly in its meaning (Bailey et al. 1956: 109; Barz 1977: 48-49 considers
this to be an instance of two separate verbs with the same infinitive). There is, in
addition, an irregular inflected simple present tense (the only one in the whole
language). These inflected forms are the basis of the compound tense system and both
require separate tags, as follows. Like the regular inflected subjunctive mood, the
present indicative of hōnā is marked for person and number but not gender.
The intermediate tags for the present tense are the same as those of the
subjunctive except that the mood is indicative. In the mnemonic tags I use H to
indicate the present tense, since this tense is entirely characteristic of hōnā.
The irregular past tense is marked for gender and number in the same way as a
perfective participle, but it is a finite form. The intermediate tags are the same as
those for the present tense, except that 1) gender is not zero, 2) person is zero, and 3)
tense is past rather than present.
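The three differences listed above can be read as a small transformation on a feature bundle. The following is only a sketch under stated assumptions: the dictionary keys, the string values and the function name `past_from_present` are illustrative and are not the numeric attribute codes of the intermediate tagset.

```python
def past_from_present(present, gender):
    """Derive a feature bundle for the irregular past tense from a present tense one.

    `present` is a hypothetical dict of attribute values in which 0 stands for
    'zero' (attribute not applicable), as in the intermediate tagset.
    """
    if gender == 0:
        raise ValueError("the irregular past tense is marked for gender")
    past = dict(present)      # copy, so the present tense bundle is left untouched
    past["gender"] = gender   # 1) gender is not zero
    past["person"] = 0        # 2) person is zero
    past["tense"] = "past"    # 3) tense is past rather than present
    return past


# Illustrative values only (assumed, not taken from the tagset tables).
present_3sg = {"mood": "indicative", "tense": "present",
               "person": 3, "number": "singular", "gender": 0}
past_fem_sg = past_from_present(present_3sg, "feminine")
```

Number and mood are carried over unchanged, which matches the statement that the two sets of intermediate tags are otherwise the same.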
There are three simple imperative forms: second person singular (which is
identical to the 'root' form), second person plural (which is identical to the second
person plural subjunctive form) and second person honorific. Each of these receives a
separate tag. The existence of a second person honorific form does not undermine the
general principle, stated above, that the āp pronoun takes a third person verb form
since, in the imperative, there is no third person, and the subject is not expressed
anyway. For the purposes of the intermediate tagset the tense is considered to be
present, and the number of the honorific form is considered to be ( 1 | 2 ), since both
singular and plural 'subjects' are possible. This also serves to distinguish the VVIA
tag in the intermediate tagset. The mnemonic 'A' is the same as that used for the āp
pronoun, and thus refers to politeness.
Urdu has two participles, the imperfective and the perfective. However, unlike
participles in many European languages, they can be used as the sole verb of a main
clause. This creates the tenses referred to as the irrealis and the simple past
respectively. However, the presence or absence of an auxiliary makes no difference to
the form of the participle. It would therefore be misleading to use two tags for a single
form of the verb. These tags are thus used for both finite and non-finite, and the
notions of irrealis and simple past are not referred to in the precise definitions of the
tags. The dual finite and non-finite nature of the tags is indicated in the intermediate
tagset using the OR operator, | . There is a value in the EAGLES tagset for past tense,
but there is not one for irrealis. The closest approximation to an irrealis in the
EAGLES guidelines is subjunctive past (see the discussion of this point in 3.2 above).
This is not a perfect solution, but without adding extra values to the intermediate
tagset it is the best that can be managed. Thus, the imperfective is finite subjunctive
past with zero aspect or non-finite participle imperfective with zero tense. The
perfective is finite indicative18 past with zero aspect or non-finite participle perfective
with zero tense.
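The dual finite / non-finite characterisation just described can be written out as data. This is a minimal sketch: the class name `Reading`, the field names and the string values are assumptions chosen for readability, and the real intermediate tags use positional numeric codes instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One alternative characterisation of a participle tag (hypothetical layout)."""
    finiteness: str
    mood: str
    tense: str
    aspect: str


# Each participle has two readings, joined by the OR operator in the intermediate tagset.
PARTICIPLES = {
    "imperfective": (
        Reading("finite", "subjunctive", "past", "zero"),             # irrealis use
        Reading("non-finite", "participle", "zero", "imperfective"),
    ),
    "perfective": (
        Reading("finite", "indicative", "past", "zero"),              # simple past use
        Reading("non-finite", "participle", "zero", "perfective"),
    ),
}


def finite_moods():
    """Return the mood that distinguishes the finite reading of each participle."""
    return {name: readings[0].mood for name, readings in PARTICIPLES.items()}
```

In this layout the finite readings differ only in mood, which is the point made in the footnote on the use of 'indicative'.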
18 It is hard to justify this use of 'indicative', since Urdu lexical verbs do not possess any indicative
form as such. Therefore the notion of the indicative is not used in the definitions of the tags themselves,
but only in the intermediate tagset (where something is needed to distinguish the finite use of the
perfective participle from the finite use of the imperfective participle).
The participles are not marked for person, but are marked for gender and
number. Their inflection is the same as that of adjectives, except that in some
circumstances a distinction is made between feminine singular and plural which is not
made by adjectives. Participles can also function as adjectives (see discussion of
adjectives in 3.3 below), in which case this extra feminine singular/feminine plural
distinction is not made (though this does not affect the tagging). That is to say, an
adjective which agrees with a feminine plural noun or pronoun will always receive an
F2 tag, regardless of whether it has the plural ending -ī~ or the more general feminine
ending -ī.
When participles are used as adjectives, it would in theory be possible to tag
them as if they were adjectives. However, this has not been done, since even when
being used attributively, participles appear in structures that normal adjectives do not.
For example, they frequently occur in participial phrases with the perfective participle
of the auxiliary verb hōnā (see below). When used adjectivally rather than verbally,
participles may be marked for case as well as number and gender. This feature is also
included in the tagset. Of course, the feature case only applies to the non-finite usage
of the participle; this is reflected in the intermediate tagset by the use of ( 0 | 1 ) for
the nominative or finite form. As with adjectives (see below), the 'oblique' case is
( 3 | 5 ) in the intermediate tagset.
The characters Y and T have been used for the perfective and imperfective
participles respectively, since these are the consonants that indicate the suffixes for
these forms.
The infinitive of the verb is regularly formed. Mostly it is used as a verbal
noun or as part of a complex verb phrase. It is also used as a neutral request form, in
which case it is the main verb of its clause; however, I do not think that this usage is
sufficient to justify separate tagging; this is better treated as an example of a secondary
usage of the same word, rather than a separate word (which giving it a separate tag
would imply). The 'default' ending of the infinitive is -nā, which is a masculine
singular ending. When used as a noun it may occur in the oblique case; when it occurs
in a verb phrase it may display gender and number agreement (in a similar way to an
adjective). However, these conditions cannot both occur; therefore there is no
feminine oblique or plural oblique, which reduces the number of tags necessary.
There is a problem in creating the intermediate tagset, inasmuch as there is no
attribute for 'case' in the EAGLES guidelines for verbs (presumably non-finite verb
forms in European languages do not display case inflection). An attribute, case, has
therefore been added to the end of the intermediate tags. Otherwise this set of
intermediate tags is fairly unproblematic.
The 'N' in the mnemonic tags is derived from the -nā suffix that indicates the
infinitive.
The EAGLES guidelines do not recommend any additional attributes for the
class of interjections. Nor have I introduced any of my own. There is thus one tag.
The mnemonic tag represents the spelling of ? (Schmidt 1999: 217), which has been
selected as a representative interjection.
There are also in the y-v-k-j sets a number of words that are more like
determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone
as pronouns. However, they behave in some respects more like adjectives, e.g. they
can be predicative rather than attributive. In terms of the EAGLES guidelines they are
best characterised within the pronoun/determiner category. They correspond to
English words like 'such', 'this/that much/many' and so on. In terms of the Urdu
tagset, I have classified them as JD, determiner-like adjectives.
The izāfat is a Persian enclitic (pronounced as a shorter
form of ???) which in some circumstances can be considered a preposition: it links
two nouns in a possessive relationship, although the phrase thus produced may often
have a different meaning to a phrase produced with the native Urdu postposition kā.
However, the izāfat may also join a noun to an adjective, in which case it is not so
clearly accurate to describe it as a preposition parallel to the prepositions in European
languages for which the EAGLES guidelines were compiled. A better way to treat
izāfat is in the context of the Unique category of miscellaneous one-member word classes,
discussed below.
The EAGLES guidelines do not consider lexical and auxiliary verbs to be
separate major parts of speech, although this is a view that some have held (e.g. the
ICE tagset; Greenbaum and Yibin 1996). However, in Urdu this distinction is very
significant, since auxiliary forms pattern differently to the forms of lexical verbs.
Therefore, this tagset will employ a high-level (but not top-level) distinction between
lexical verbal elements (whose tags will commence with VV) and non-lexical or
auxiliary verbal elements (whose tags will commence with V and one other letter:
either one indicating what word it is, for auxiliary verbs whose inflectional behaviour
is anomalous, or X for a general auxiliary). Thus both the EAGLES guidelines and the
demands of Urdu morphology are complied with.
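The prefix convention described here lends itself to a simple lookup on the first two characters of a tag. The sketch below is illustrative only: the function name `verb_class` and the returned labels are assumptions, and only the VV / V+letter / VX convention itself comes from the text.

```python
def verb_class(tag):
    """Classify a verbal tag by its two-letter prefix (hypothetical helper)."""
    if len(tag) < 2 or not tag.startswith("V"):
        raise ValueError(f"not a verbal tag: {tag!r}")
    second = tag[1]
    if second == "V":
        return "lexical"                  # VV...: lexical verbal element
    if second == "X":
        return "general auxiliary"        # VX...: general auxiliary
    return "word-specific auxiliary"      # V + letter naming an anomalous auxiliary


# VVIA and VHIA are the polite imperative tags mentioned in this section.
examples = {tag: verb_class(tag) for tag in ("VVIA", "VHIA")}
```

Because the distinction is carried by the second character alone, it stays below the top-level part of speech (V), in line with the EAGLES view that lexical and auxiliary verbs are not separate major categories.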
There exist in Urdu two widely applicable derivational suffixes which attach
to the root of a lexical verb and increase its valence, making it transitive or causative
in sense. This has been highlighted as a significant feature of the language (e.g. by
Kachru 1990: 63) and is described in some detail by Schmidt (1999: 87, 157-175). It
might be possible to distinguish such derived verbs from non-derived verbs in the
tagset, but I do not, because of the design principle that no derivational information
should be included. Furthermore, such a distinction would be difficult to automate,
and also probably difficult for humans to annotate.
Lexical verbs occur in a number of inflected forms. The names of these forms
are perhaps not very useful, since each of them has a variety of uses hard to capture
by one of the traditional grammatical category names. However, rather than resort to
letters or numbers which would be unlinkable to any previous writing on the Urdu
verb, I use the same names for the forms as Schmidt (1999), as I have been doing thus
far in this thesis.
The last two attributes, finiteness and mood, are problematic. Firstly, inherent
in the EAGLES guidelines is the problem that the mood attribute contains values
relevant to both finite and non-finite forms, so that the finiteness attribute becomes
redundant. Secondly, the finite/non-finite distinction may be hard to draw in Urdu.
The forms described below as participles would traditionally be considered non-finite
in European languages. However, in Urdu they have certain features which make
them seem more like finite forms. For example, they can occur as the only verb in a
main clause, and can agree with a subject or object, which is not a property prototypically
associated with non-finite forms. These properties are illustrated by the following
example from Schmidt (1999: 126; see note 15):
unhō~       nē   an paRh      kī      bāt     nah
3-PLRL-OBL  ERG  un educated  of-FEM  speech  not
mānī
accept-PERF.PART-FEM-SING
They did not accept what the uneducated person said.
The verb form mānī is a participle, but it is the only verb form in the sentence,
and it is marked for agreement (with the object, since this clause is of the ergative
type). It, like the postposition kī, agrees with the feminine singular noun bāt.
15 Schmidt does not give word-by-word glosses, only whole-sentence translations. I have added the
glosses using Schmidt (1999) and Haq (2001) as guides. See also Appendix 2.
A third problem with the mood distinctions made in the EAGLES guidelines is
that they are not necessarily those made by Urdu. For example, Urdu has forms which
may be described as subjunctive and imperative moods, but it would seem to lack an
indicative (except for the auxiliary hōnā). Because of these difficulties, the concepts
of finiteness and mood will not be used to structure the tagset itself, although they are
of course inevitable as attributes in the intermediate tagset. This means that in some
cases, the intermediate tagset values used to characterise some Urdu verb forms are
somewhat arbitrary, since I have had to simply pick the values that seem closest to
describing Urdu. For example, considering the 'irrealis tense' (the term used by
Schmidt 1999 for the finite use of the imperfective participle) to be a past tense
subjunctive is not warranted by the Urdu verbal system. It was picked as the 'least
bad' way to characterise it simply because the Urdu irrealis has a usage similar to that
of the past subjunctive in languages included in EAGLES such as German and
(vestigially) English (e.g. ich wäre, I were). For example, Schmidt (1999) translates a
sentence from the poet Ghalib as follows:
agar  aur  jītē             rahtē
if    and  alive-MASC-PLRL  stay-IMPERF.PART-MASC-PLRL
yahī       intizār  hōtā
this-very  waiting  be-IMPERF.PART-MASC-SING
If I were to live longer it would only be to wait like this
The presence in the translation of the past tense subjunctive ('I were') in the
first, but not the second, of two clauses containing the finite imperfective participle
demonstrates the partial parallelism between an Urdu irrealis and an English past
subjunctive.
Case=nominative
Nominative is usually given meaning by its contrast
with accusative, a case that does not exist in Urdu. The nominative may in Urdu be
used for either, neither or both of the subject and the direct object. Thus it is not
certain whether the nominative in Urdu really corresponds with the nominative that is
value 1 in the EAGLES guidelines. (Barz (1977) and McGregor (1972) actually call the nominative case the direct case.) Certainly it does not correspond with the
nominative as it exists in, for example, German or Latin. However, I have used value
1 in the intermediate tagset for the Urdu case, on the basis that no Urdu case
resembles the nominative in the European languages for which the EAGLES
guidelines were devised any more closely than the Urdu nominative.
Nongrammatical lexical element
Words that contain an orthographic space which does not actually represent a
word break (principally Persian loans such as zimmah dār, 'responsible', xūb tarīn,
'best', and ham zāt, 'of the same caste'; see note 63) cause a problem for tokenisation as
described in 2.2.6.1. This was solved by the decision to treat every orthographic space
as a word break, so that zimmah dār, etc., are treated as two tokens. However, this
leads to another problem, greater if anything, concerned with tagging. How are the
two elements to be tagged?
62 This problem is referred to as such because it was first encountered during an attempt to manually
tag a sentence from Schmidt (1999) containing the word zimmah dār using an early trial version of the
tagset.
63 All examples from Schmidt (1999: 248-256).
As it happens, zimmah, xūb and zāt are independent words ('duty', 'good'
and 'caste' respectively) and could be given the appropriate tags, nominal and
adjectival. The problem then becomes what to do with dār, tarīn and ham. The
former two could be given some tag to indicate that they were adjective-forming
clitics or affixes, and the prefix ham could be marked up as an adverb (according to
Haq 2001 the part of speech of ham when it occurs independently). However, this has
two drawbacks. Firstly, it breaks with the design principle that no derivational
information will be included in the tagset by analysing the component morphemes of
complex words, for zimmah dār etc. are words, not phrases. The word zimmah dārī,
'responsibility', is clear evidence of this: it has been created by a morphological
process (suffixation of -ī) and morphological processes apply to words, not to
syntactic phrases. Also, the single word zimmah dār has been given two tags in this
approach, a contravention of the 'one word, one tag' principle.
Secondly, it introduces inconsistency into the tagging. The derivational
information would be present for some words formed with the relevant Persian
derivational morphemes, but not for all, because not all words formed with them
contain the superfluous orthographic token break. Examples of single-token derived
words include samajhdār, 'sensible', kamtarīn, 'least', and hamdardī, 'sympathy'.
If zimmah and dār are to be tagged separately, then for consistency samajh would also
have to be tagged separately, opening up whole vistas of morphological analysis that
are utterly irrelevant to part-of-speech tagging. Indeed, going down this road subverts
the entire enterprise: we would find ourselves engaged in derivational analysis instead
of morphosyntactic analysis.
To take the opposite approach to tagging zimmah dār, we might mark a single
tag for the whole word (JJU in this case); however, this also breaks the 'one word,
one tag' principle as there is now an untagged token and a multiword tag. The best
solution to the problem (although far from ideal) would seem to be to use some kind
of special tag on the first part of the two-token word to indicate that this is a case of
the zimmah dār problem, and put the tag we would like to give to the whole thing on
the second token (see note 66).
This tag will be LL, the 'nongrammatical lexical element' listed in the
previous section, and it will be applied thus (see note 67):
zimmah_LL dār_JJU
samajhdār_JJU
xūb_LL tarīn_JJU
kamtarīn_JJU
ham_LL zāt_JJU
hamdardī_NNUF1N
The first element is described as a nongrammatical lexical element because
while it does not contribute to the morphosyntax of the two-token word, it does
contribute to its meaning. Therefore it is entirely lexical in nature. It is to be hoped
that the usage of the LL tag can be restricted to one context: alongside a relatively
small number of affixes such as dār.
66 Since dār, tarīn and other affixes involved in the zimmah dār problem are derivational suffixes, it is
they that determine the part of speech; thus it makes sense for them to carry the actual tag.
67 I use an underscore format to link the words and their tags for clarity in the examples given here; in
practice an XML/SGML markup would be used.
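A consumer of the tagged text can recover the logical words by attaching each LL token to the token that follows it, which carries the real tag. The following is a minimal sketch assuming the underscore format used in the examples above; the function name `merge_ll` is hypothetical, and diacritics are omitted from the sample strings for simplicity.

```python
def merge_ll(tagged_text):
    """Group underscore-format tokens into (word, tag) pairs, folding LL tokens
    into the following token, which carries the tag for the whole word."""
    words = []
    pending = []  # LL tokens waiting for the token that carries the real tag
    for item in tagged_text.split():
        token, tag = item.rsplit("_", 1)
        if tag == "LL":
            pending.append(token)
        else:
            words.append((" ".join(pending + [token]), tag))
            pending = []
    # An LL token with nothing after it is kept as-is so that no input is lost.
    words.extend((token, "LL") for token in pending)
    return words


sample = "zimmah_LL dar_JJU samajhdar_JJU ham_LL zat_JJU"
merged = merge_ll(sample)
```

Under this scheme the two-token and single-token spellings of a derived word end up with the same tag, which is the consistency the LL tag is meant to preserve.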
The EAGLES guidelines give four recommended attributes for nouns: type,
gender, number and case. There are also two optional attributes, countability and
definiteness. Type refers to whether a noun is common (denotes one or more members
of a class of things) or proper (is the name of one or more particular things). This
attribute is an example of one which is marginal to morphosyntax, but should be
included since the distinction between common and proper might well prove useful to
some future linguistic investigation of the text. It has been included in the tagset for
now, but with the reservation that it might have to be collapsed in any subtagset for
automatic tagging. This is because there may well not be any way for the tagger to
make this distinction. Unlike the Roman, Greek and Cyrillic alphabets, the Urdu
alphabet has no uppercase letters. In the European languages for which the EAGLES
guidelines were designed, which use one of the former alphabets, uppercase letters are
often used to identify proper nouns. It is clear that no such simple rule could be
employed in Urdu. Furthermore there are no articles in Urdu (Bhatia and Koul 2000:
318), the absence and presence of an article being typical of proper and common
nouns respectively in English and similar languages.
Urdu has two numbers, singular and plural. This is well agreed on (Schmidt
1999: 1; Bhatia and Koul 2000: 314; Barz 1977: 36; Bailey et al. 1956: 1, 5). The
EAGLES guidelines on noun number allow for exactly this possibility, and thus have
been implemented unproblematically.
51 In fact the EAGLES guidelines on this point are significantly more complicated. However, the
remainder of the recommendations are concerned with handling phenomena that do not occur in Urdu.
The EAGLES guidelines give numerals as a separate major part-of-speech, but
say that 'In some languages (e.g. Portuguese) this category is not normally considered
to be a separate part of speech, because it can be subsumed under others ... We
recognise that in some tagsets Numeral may therefore occur as subcategory within
other parts of speech' (Leech and Wilson 1999: 65). This approach seems sensible for
Urdu, where numerals display very much the behaviour of adjectives. However, for
purposes of the intermediate tagset, the numeral class has been used, since it contains
the very useful attribute type. In fact, all the EAGLES attributes have been used
(though of course, not all of their values). For case, the oblique / vocative value
( 3 | 5 ) is used, as with adjectives.
Case=dative
There is no value in the EAGLES guidelines for oblique. Nor is there one for
postpositional, locative or instrumental (alternative names used by Bailey et al. 1956
for this case). Rather than invent an extra value (undesirable for reasons given with
regard to markedness above), I have used the value for dative to represent oblique, on
the grounds that in some European languages (e.g. German) prepositions frequently
govern the dative, and in Urdu postpositions govern the oblique.
As far as marked adjectives are concerned, there is again the problem of
tag-to-meaning many-to-one and one-to-many mapping, but with adjectives it is, if
anything, an even greater problem than it was with nouns. There is no oblique-vocative
distinction at all (Schmidt 1999: 36 goes so far as to say that 'An adjective modifying
a vocative noun is in the oblique case') ...
Thus the tagset does not distinguish vocative adjectives from oblique
adjectives (or participle forms of verbs: see above). In the intermediate tagset, this is
represented using the OR and bracket operators, as described in the EAGLES
guidelines (Leech and Wilson 1999: 71), as ( 3 | 5 ).
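A value written with the OR and bracket operators can be treated as a set of alternatives. The sketch below assumes the notation seen in this section, either a bare number or a bracketed list such as ( 3 | 5 ); the function names `parse_value` and `compatible` are hypothetical.

```python
def parse_value(text):
    """Return the set of alternatives in an intermediate-tag value
    written either as '3' or as '( 3 | 5 )'."""
    inner = text.strip()
    if inner.startswith("(") and inner.endswith(")"):
        inner = inner[1:-1]
    return {int(part) for part in inner.split("|")}


def compatible(tag_value, observed):
    """True if a single observed value is one of the alternatives in tag_value."""
    return observed in parse_value(tag_value)


oblique_or_vocative = parse_value("( 3 | 5 )")  # the adjective case value used above
```

With this reading, an adjective tagged ( 3 | 5 ) matches a query for either of the two case values, so the tagset does not need to decide between oblique and vocative.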
Other pronouns and determiners
In this miscellaneous group of pronouns are included two indefinite pronouns,
kōī and kuch, which may function as pronouns or determiners (just as yah and vah
do). Also included in the PN* category is sab, 'all', which has an inflected oblique
plural (like numerals; see section 3.9) which is tagged as PNO.
There is also a tag for indefinite determiners. Two words in this class are
zyādah 'more' and kāfī 'enough'. Following Schmidt (1999) these are classed
broadly as adjectives for two reasons: to keep them in line with the possessive
adjectives, which are determiners; and because they can also function as adverbs (see
section 3.6 below), which is characteristic of adjectives. These are not marked for
gender, number or case.
Urdu has the three normal persons given in the EAGLES guidelines, each in
singular and plural forms. Schmidt (1999: 97) suggests that Urdu verbs also have an
additional polite or honorific form, which although second person in meaning (it
agrees with a pronoun āp that refers to one or more interlocutors) is identical to the
third person plural form of the verb. In this case I have deviated from the model
described by Schmidt, for reasons discussed in my treatment of the āp pronoun in
section 3.4.1.2. There will be no tags for honorific verbal forms, and verb forms
which agree with āp will be tagged as third person forms. The exception to this is the
imperative, discussed in the next section.
The issue of what exactly constitutes a personal pronoun is not an easy one in
the context of the grammar of Urdu as presented by Schmidt (1999). Therefore, in this
section, before discussing the tags of the personal pronouns I elaborate on how I drew
the boundary of this category, justifying the minor claim that the pronouns vah and
yah (and their various inflected forms) are not personal pronouns, contrary to what is
stated by Schmidt (1999). I first consider these third person pronouns (3.4.1.1), and
subsequently the problematic honorific pronoun āp (3.4.1.2). In 3.4.1.3 I deal with the
tagging of mai~ and tū, the remaining words in the category of personal pronouns.
3.4.1.1 The non-existence of third person personal pronouns
Urdu has no third person personal pronouns. The demonstrative
pronouns/determiners are used in their place. This is claimed contrary to Schmidt,
who states (1999: 15) that 'The demonstrative pronouns ye and vo are identical in
form to the personal pronouns ye and vo (meaning "he", "she", "it")'. However, the
differences in behaviour between these pronouns and the first and second person
pronouns that I list below, also drawn from Schmidt, make it clear that the statement
that began this section is justified.
• There are absolutely no differences in case / number inflection between the third
person pronouns and the demonstratives (Schmidt 1999: 16)
• In a perfective transitive sentence (the type that some, such as Dixon 1994, would
class as 'ergative'), a third person pronoun subject appears in the oblique case
(like a noun); but a first or second person subject pronoun is in the nominative
case at all times (Schmidt 1999: 22)
• The third person pronouns take special plural oblique forms before the
postposition nē (Schmidt 1999: 22), whereas the first and second do not
• There are no possessive adjectives corresponding to the third person pronouns,
whereas there are such adjectives corresponding to the first and second person
pronouns (Schmidt 1999: 24)
On these grounds, I exclude the third person pronouns from consideration as
personal pronouns, and deal with them as demonstratives/determiners, etc. (see
section 3.4.2).
Thus, the subcategory of first and second person personal pronouns contains
only the pronouns mai~ and tū, and inflectionally related forms such as their plurals
and possessive forms. All tags in this subcategory begin with PP (or PG for possessives).
Personal pronouns are not marked for gender: as with verbs, that which is
marked for person is not marked for gender. (The ‘M’ in the tags below signifies
‘first person’, not ‘masculine’.) They are marked for number and case.
As noted in the preceding section, the intermediate tagset for pronouns
contains an attribute of politeness. All pronouns in this section are given as familiar,
to distinguish their intermediate tags from that for āp. In practice, the singular/plural
distinction is often also used to indicate formality in the second person pronouns
(Bhatia and Koul 2000: 35-36); tum may apply to one or more than one person.
However, the EAGLES guidelines suggest that such a pragmatic usage of the
number distinction may still be encoded as a number distinction. This is what I have
done, tagging tum as plural, on the basis that for purposes of inflection it is the
number of the pronoun, not the number of its referent, that counts.
There are possessive adjectives corresponding to the personal pronouns above.
While the intermediate tagset must treat these as pronouns, within the Urdu tagset
they could have been treated as adjectives (as has been done with some other
determiner-like pronouns; see below). However, this has not been done, since the
possessive adjectives have person. This is not true for any adjectival form, and thus
the possessive adjectives are better classed as personal pronouns.
As they are adjectival, they may be marked for gender, number and case. The
case and gender attributes indicate the features that are in agreement with the head
noun rather than inherent features of the pronoun. The number attribute is also for
agreement; the inherent number of the possessive adjective itself is shown by the
attribute possessive.
The EAGLES guidelines treat pronouns and determiners together as a single
category, although one of the recommended attributes, category, distinguishes
between them. Since in Urdu the distinction is not clear (particularly in the area of
third person pronouns), I also treat this category as being single at the most
fundamental level. The distinction between what is considered a determiner and what
is considered a pronoun is not drawn in the EAGLES guidelines, which say ‘different
analyses for different languages entail separating [these parts of speech] out in
different ways’ (Leech and Wilson 1999: 63). For Urdu, I have mostly followed
Schmidt (who does not have a separate ‘determiner’ category) in the divisions I
make. However, I have classed together all third person pronouns/demonstratives,
interrogative and relative pronouns/determiners, because these form sets of words
displaying morphological symmetry (see 3.4.2).
Schmidt counts pronouns such as yah, vah, as both personal pronouns and
determiners. However, for the purposes of the tagset, the division should be sharp;
therefore I have limited the ‘personal pronouns’ category to the first and second
persons. The justification for this is given in section 3.4.1.1. I have also diverged from
Schmidt in classing together a number of her minor categories of pronoun under the
covering title ‘other’ for the purposes of this tagset definition.
This gives the following groups of pronoun/determiner-like words:
• first and second person personal pronouns
• third person pronouns/demonstratives, interrogative and relative pronouns and
determiners
• reflexive pronouns
• other pronouns and determiners
There is one pronoun, āp (a kind of honorific personal pronoun) which does
not fit unproblematically into any of these categories. Discussion is devoted to this
pronoun in section 3.4.1.2 below.
There are also in the y-v-k-j sets a number of words that are more like
determiners than pronouns, i.e. they take adjectival inflection and cannot stand alone
as pronouns. However, they behave in some respects more like adjectives, e.g. they
can be predicative rather than attributive. In terms of the EAGLES guidelines they are
best characterised within the pronoun/determiner category. They correspond to
English words like ‘such’, ‘this/that much/many’ and so on. In terms of the Urdu
tagset, I have classified them as JD (determiner-like adjectives).
The EAGLES guidelines allow three options for the markup of word-external
punctuation: firstly, to use a single tag for all punctuation marks (the
obligatory-attribute-only approach); secondly, to give each punctuation mark its own
separate tag; and thirdly, to group punctuation marks into a smaller number of tags
according to the positions they may occupy in a sentence. The first approach I rejected
on the grounds that it needlessly excluded potentially useful information. The third
approach, likewise, tags different punctuation marks in the same way. Since punctuation
marks can be tagged utterly unambiguously (a comma is always a comma), this is needless.
The decision was therefore taken to give each punctuation mark a unique tag. This tag
is, in fact, the same as the punctuation mark itself (a practice also adhered to in, for
example, the C7 tagset: see 2.1.2.1). However, since the tagset is designed to operate
in Unicode texts, more forms of punctuation can be distinguished (for example,
opening and closing quotation marks). Some of these distinctions may be finer than is
necessary (e.g. that between square and normal brackets is useless if one simply
wishes to search for brackets in general) but it would be trivial to design search
software that could treat the two tags as alike, or to map to a subtagset that collapsed
these to a single ‘bracket’ category. There are 13 tags in this section. The EAGLES
guidelines underspecify the value of the one attribute, stating values only for the full
stop, comma, and question mark, so I have inferred it (using letters when the available
digits ran out).
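The one-tag-per-mark scheme, and the mapping to a collapsed sub-tagset, can be sketched as follows. This is a minimal illustration only, not the tagger described in the thesis: the function names and the COLLAPSED table are hypothetical, and accepting any character whose Unicode general category begins with P is an assumption that admits more marks than the 13 tags actually defined.

```python
import unicodedata

# Hypothetical mapping from individual punctuation tags to a coarser sub-tagset.
COLLAPSED = {"(": "bracket", ")": "bracket", "[": "bracket", "]": "bracket"}


def tag_punctuation(token):
    """Return the token itself as its tag if it is one punctuation character.

    Returns None for anything else, leaving it to the rest of the tagger.
    """
    if len(token) == 1 and unicodedata.category(token).startswith("P"):
        return token
    return None


def collapse(tag):
    """Map a fine-grained punctuation tag onto the coarser sub-tagset."""
    return COLLAPSED.get(tag, tag)


if __name__ == "__main__":
    for token in [",", "[", "(", "a"]:
        tag = tag_punctuation(token)
        print(repr(token), "->", tag, "->", collapse(tag) if tag else None)
```

Keeping the fine-grained tags in the corpus and collapsing them only at search time means no information is discarded during tagging.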
For all punctuation marks, the Unicode value of the Perso-Arabic tag is the same as
that of the punctuation mark being tagged. The Roman tags for full stop, comma,
question mark, and semi-colon consist of a different Unicode character to the
punctuation mark being tagged, but the remaining Roman tags likewise use the same
Unicode value.
With regard to paired punctuation (the quotation marks and brackets) there
is a point to be made as regards directionality. The Unicode Standard specifies
(Unicode 1996: 6-4) that in bi-directional text the same character, i.e. the same
Unicode value, should represent the opening member of the pair whatever its
appearance, and the same with the closing member of the pair. That is, the code
U+0028 (OPENING PARENTHESIS) ought always to be the first of the pair, and be
rendered as ‘(’ in left-to-right text, such as English, and as ‘)’ in right-to-left text,
such as Urdu. Other paired punctuation marks should function similarly. Therefore
for each of these marks, the Roman and Perso-Arabic tags are mirror images of one
another, though they are encoded by the same numeric value.
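The point about directionality can be checked against the Unicode character database. The sketch below uses Python's standard unicodedata module; the helper name describe is hypothetical. Current versions of the standard name U+0028 LEFT PARENTHESIS, whereas OPENING PARENTHESIS is the older name quoted above.

```python
import unicodedata


def describe(ch):
    """Return (code point, character name, mirrored-in-bidi flag) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        bool(unicodedata.mirrored(ch)),
    )


if __name__ == "__main__":
    # The code point stays the same in either text direction; only the
    # rendered glyph is mirrored when the flag is True.
    for ch in "()[],":
        print(describe(ch))
```

The mirrored flag shows that the same code point is displayed with the opposite glyph in right-to-left text, which is why the Roman and Perso-Arabic tags can share one numeric value.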
This could potentially create confusion when an analyst tags text by hand,
inasmuch as the (Roman) tag will have the opposite appearance to the (Perso-Arabic)
symbol in the actual text. However, this will not be problematic when tagging is
automated, ‘right’ and ‘left’ meaning nothing to a computerised tagger.
There remain some problematic points, for example, the ellipsis (…), angle
bracket speech marks, and braces. These have not been given tags for now, on the
basis that no Urdu text I have yet seen contains these symbols. However, nor does any
work on Urdu rule out their use, so extra punctuation tags may prove necessary.
rahā
This auxiliary element is used in the formation of tenses in the durative aspect.
It is itself the perfective participle of the lexical verb rahnā, ‘remain’, but as Schmidt
(1999: 111) reports, this form ‘has been delexicalised’. It is marked for gender and
number. It may seem that treating rahā as auxiliary and rahnā as lexical goes against
the principle laid down in 3.2 that the distinction between lexical and auxiliary should
be inherent to the verb and not dependent on context, and conflicts, for example, with
the treatment of hōnā (see 3.2.2.4 below). However, this is not the case. The verb
hōnā may be main but it is never lexical; rahnā is lexical when it is main, and cannot
act as an auxiliary at all except for the one, very particular, delexicalised form rahā.
There is a problem in the intermediate tagset, in that the EAGLES guidelines
contain no value for durative aspect. Therefore, the aspect attribute is given the value
zero, since the aspect is neither perfective nor imperfective. This is not a very good
solution but it is preferable to adding a value, and there is no satisfactory way to mark
durative in the intermediate tagset by adding an attribute. This solution also ensures
that each form of auxiliary rahā has a unique value in the intermediate tagset, since
every other participial element is either imperfective or perfective. Otherwise in the
intermediate tagset, rahā is considered to be a non-finite participle with zero tense.
When used lexically, rahā receives the tag VVYM1N, rahī receives VVYF1N
or VVYF2N, and so on.
Unlike those of many European languages, Urdu reflexive pronouns are not personal.
That is, they have the same form regardless of the person of the pronoun they
refer back to. There are two reflexive pronouns (both tagged the same), a
reciprocal pronoun (which only appears within a postpositional phrase) and a
reflexive possessive adjective. The reflexive possessive adjective is classed with the
other possessive adjectives in the hierarchy given in 3.14. See also the discussion of
the honorific usage of āp in section 3.4.1.2 above.
The remaining categories (called ‘residual’ in the EAGLES guidelines) cover,
quite simply, everything else. This comprises various semi-linguistic and non-Urdu
elements. There are 8 such tags. Although the EAGLES guidelines allow for these
elements having number and gender, I have not included this: if such an element is
inflected as a verb, noun or adjective, then it may be considered sufficiently a part of
that category to be tagged as such. This particularly applies to acronyms and
abbreviations. Thus, the second and third EAGLES attributes, number and gender, are
zero in the intermediate tags below. Every value from the first EAGLES attribute,
type, has been used; with the exception of FX and FS, each tag bears the name of the
value in the intermediate tagset it is mapped onto.
The tag for ‘foreign words’ is meant to cover words from other languages
written in the Urdu alphabet. It is not meant to cover the large number of Persian,
Arabic and English loanwords that exist in Urdu, although it remains to be seen how
sharply this distinction can be made in actual tagging. The tag for ‘non-Perso-Arabic
string’ is for foreign words in other alphabets, or for other non-Perso-Arabic
incursions into the text. FU is a catch-all ‘Unclassified’ category, although it is to be
hoped that the vast majority of tokens will be catered for by at least one of the other
tags outlined in this chapter.
The root consists, as its name suggests, of the root of the verb unadorned by
affixation. It is not marked for person, number or gender and cannot occur as the sole
verb of a main clause; it is, therefore, non-finite (untensed and also neither
imperfective nor perfective in aspect). The exception to this is when it is used as an
imperative form (discussed below). However, it does not fit neatly into any of the
non-finite values for mood (the choices being infinitive, participle, gerund and
supine). Therefore, in the intermediate tagset it is given a 0 for mood. Since this only
has one form, there is only one tag. It should be noted that in the intermediate tags for
this and all the following forms of lexical verb, all the tags give the status attribute the
value main, since by definition a lexical verb is not an auxiliary (see the discussion of
the status attribute in 3.2 above).
3.4.1.2 The problematic honorific pronoun āp
The case of āp, the second person honorific pronoun, is by no means as clear
as that of the third person pronouns. While the fact of its identical appearance with the
reflexive pronoun (also āp: see 3.4.3) suggests that, like the third person pronouns, it
may be best classified elsewhere, there are two very good reasons for regarding āp as
a personal pronoun like mai~ and tū. (Kellogg (1875: 180-181) gives the common
etymology of what he sees as these two pronouns in a single Sanskrit word.)
The first is semantic. Semantically and pragmatically, āp has a very similar
meaning to tū and its plural form tum; they both mean ‘you’. The second reason is
syntactic. From the examples of āp given by Schmidt (1999), it would appear that āp
has a very similar distribution to mai~ and tū. It is used, for example, as the subject of
a sentence; the reflexive pronoun āp, by contrast, can never be the subject of a
sentence for obvious reasons.
There are, on the other hand, a number of reasons to regard āp as unlike mai~
and tū and either identical or at least more akin to the cognate reflexive pronoun (also
āp). All are morphological. Firstly, āp (both the honorific and reflexive pronoun) does
not have separate nominative and oblique cases, whereas mai~ and tū do. Secondly, as
noted above, mai~ and tū have associated possessive adjectives. āp also has such a
possessive adjective, apnā, but this is only used reflexively (see 3.4.3). When the
usage is honorific, possession is expressed phrasally with the postposition kā, ‘of’.
Thirdly, while mai~ and tū agree with verbal forms distinct from those used with
nouns or third person pronouns, āp does not, always taking identical verbal inflections
to the third person. This is what we would expect if it were simply a special usage of a
reflexive pronoun.
So then, is āp a second person personal pronoun or is it a special usage of the
reflexive pronoun? Either position is tenable. The syntax and semantics of the case
support the former approach while the morphology backs up the latter approach. The
EAGLES guidelines cannot help in choosing between them, since this problem is an
idiosyncrasy of Urdu: we would therefore not expect it to be covered by a standard
drawn up for a set of languages which do not include Urdu. Ultimately, this is a case
where an arbitrary decision must be taken: the decision I took was not to treat āp as a
personal pronoun along with mai~ and tū. However, although arbitrary, this decision
is consistent: āp will always be treated separately in this way.
In fact the non-reflexive āp will be given the tag PA, so that in terms of the
hierarchy of the tagset, it is categorised neither with the personal nor the reflexive
pronouns, but in a separate subdivision of the pronoun category. This is, to an extent,
another arbitrary decision: PPA could have been an equally reasonable tag,
emphasising the similarity of syntactic function with mai~ and tū, or PRA,
emphasising the similarity of its case inflections to those of the reflexive pronouns,
which likewise show no difference between the nominative and oblique cases.
However, to impose either of these interpretations might prove theoretically
controversial, in breach of a stated design principle.
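The effect of this choice on the tagset hierarchy can be illustrated with a simple prefix lookup. This is a sketch under stated assumptions: only the prefixes PP, PG and PA come from the text, while the function name, the descriptive labels and the fallback value are hypothetical.

```python
# Two-letter prefixes named in the text; the labels are illustrative only.
SUBDIVISION_BY_PREFIX = {
    "PP": "personal pronoun (first/second person)",
    "PG": "possessive form of a personal pronoun",
    "PA": "honorific pronoun (non-reflexive āp)",
}


def subdivision(tag):
    """Look up the pronoun subdivision implied by the first two letters of a tag."""
    return SUBDIVISION_BY_PREFIX.get(tag[:2], "not covered by this sketch")


if __name__ == "__main__":
    print(subdivision("PA"))   # the tag actually chosen
    print(subdivision("PPA"))  # the rejected alternative
```

Under a prefix reading of the hierarchy, the rejected alternative PPA would have placed the honorific pronoun inside the personal pronoun subdivision, which is the interpretation the chosen tag PA avoids imposing.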
Note however that in terms of the intermediate tagset, āp is still treated as a
personal pronoun, because the things that it will map onto in other languages will be
personal pronouns. Its number is ( 1 | 2 ), on the grounds that it may refer to one
person or to more than one. Note that the intermediate tagset for pronouns contains an
attribute, politeness; āp has been listed as polite, whereas the intermediate tags for tū as
given in the next section contain the value for familiar.
Sentence tagword
(e.g. s?h?)
This category is rather more open than the other ‘unique’ categories, and may in certain
circumstances be ambiguous with adverbs.
The subjunctive is the only form that is marked for person in Urdu lexical
verbs. It is not, however, marked for gender. Therefore the intermediate tagset forms
give gender as zero, mood as subjunctive and tense as present.
Unique/unassigned (including particles, clitics and tags)
The Unique category in the EAGLES guidelines is meant to contain words
that are members of a one-word category; for example, the infinitive marker to or the
existential there in English. I will first outline the general nature of the tags defined in
this part of the tagset (3.12.1), before going into some depth on the problem that
motivated the creation of one particular unique category, that of nongrammatical
lexical element: the zimmah dār problem (3.12.2).
There are a considerable number of factors to be taken into account in a
description and categorisation of the Urdu verbal system. There are a number of
inflected forms, and with the use of one or more auxiliary elements, 15 compound
tenses are built up. Furthermore, any part of the compound verb-phrase may be
marked for number, person or gender agreement. There are two conceivable
approaches to the markup of such a compound verb-phrase. Firstly, each word could
be tagged separately, regardless of its context. So for example the form that Schmidt
(1999) refers to as the ‘perfective participle’ would be tagged the same regardless of
what compound tense it was being used in. Secondly, compound verbs could be
treated as multi-word units, each such unit receiving a single tag.
The latter approach was not followed, for three reasons. In the first place, it
goes against the principle that every word should have its own tag, using no
multiword tags. Secondly, it goes against the suggestion made by the EAGLES
guidelines that ‘In general, compound tenses are not dealt with at the morphosyntactic
level, since they involve the combination of more than one verb in a larger
construction’ (Leech and Wilson 1999: 63). Thirdly, it would result in the tagset
being much more complicated than need be. For example, each of the 15 compound
tenses would need to be distinguished. By contrast the other approach would require a
relatively small number of distinctions to be made, between the elements of which
the compound tenses are built. The over-complicated tagset design that multi-word
tagging of compound verbs would necessitate would also have the drawback of going
far beyond the EAGLES guidelines on verbal tags. By treating each word of the
compound verb as separate, it is possible to stick fairly closely to the guidelines.
The agreement attributes number,
gender, and person are clearly relevant to the Urdu verbal system. Some writers
consider that Urdu displays what has been described as split ergativity (as described
in section 1.1.5.4). That is, the verb agrees sometimes with the subject, and sometimes
with the direct object. It may also under some circumstances agree with neither
(Schmidt 1999: 125). As explained in 1.1.5.4, however, some writers (e.g. Butt 1995)
disagree with this analysis. However, for the purposes of defining verbal tags the
matter of ergativity is more or less irrelevant. The agreement suffixes which occur on
verbs (and, therefore, the morphosyntactic categories displayed by verbs) are
exactly the same regardless of which argument of the verb is being agreed with. A
single morphosyntactic phenomenon receives a single tag; so for example when I give
a verb a tag VVYF1N (see 3.2.1.3), it is not specified whether the feminine
agreement is with a subject or object. Thus, the principle of theoretical neutrality is
upheld: this analysis is as compatible with a theory in which Urdu displays split
ergativity as with a theory in which it does not.
[Footnote 13: Except for one marginal case (see discussion of cāhiē in section 3.2.2.3 below).]
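The neutrality of a tag such as VVYF1N can be made concrete by decoding it. The sketch below is hypothetical throughout: the positional reading (fourth character gender, fifth character number with 1 = singular and 2 = plural, sixth character case with N = nominative) is an assumption inferred from the examples VVYM1N, VVYF1N and VVYF2N, and the class and function names are invented for illustration. The relevant point is that the decoded record has no field naming the argument agreed with.

```python
from dataclasses import dataclass, fields

# Assumed value tables, inferred from the example tags only.
GENDER = {"M": "masculine", "F": "feminine"}
NUMBER = {"1": "singular", "2": "plural"}
CASE = {"N": "nominative"}


@dataclass(frozen=True)
class Agreement:
    """Agreement features carried by a tag; note there is no subject/object field."""
    gender: str
    number: str
    case: str


def decode_agreement(tag):
    """Decode the last three characters of a six-character VVY tag (assumed layout)."""
    if len(tag) != 6 or not tag.startswith("VVY"):
        raise ValueError(f"tag outside the scope of this sketch: {tag!r}")
    return Agreement(GENDER[tag[3]], NUMBER[tag[4]], CASE[tag[5]])


if __name__ == "__main__":
    print(decode_agreement("VVYF1N"))
    print([f.name for f in fields(Agreement)])
```

Because the record stores only what the suffix expresses, the same tag is assigned whether an analysis treats the controlling argument as subject or as object.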
Status (i.e. whether a verb is main or auxiliary) is relevant throughout. However, the
way in which it has been used is a little different to that given in the EAGLES
recommendations. The EAGLES guidelines suggest a main/auxiliary distinction
which is context dependent. This can be seen in Leech and Wilson’s example tagset
for English (1999: 72-74), in which it is made clear that the verb be can be either a
main verb or an auxiliary verb. However, the distinction I have used is between
lexical verbs and non-lexical auxiliary verbs. This is not context-dependent; English
be would be considered an auxiliary regardless of context. The motivation for this is
the decidedly irregular morphology of Urdu auxiliary verbs, most particularly hōnā,
‘be’ (see also 3.2.2.4). This goes far beyond the inflectional oddities found in English
non-lexical verbs: hōnā possesses two tenses that no other verb has, and it possesses
them regardless of whether it is a main verb or not. To mark up hōnā as a main verb,
there would have to be a tag, for example, for a present-tense main verb. But to
include such a tag would be to vastly misrepresent the majority of Urdu verbs, which
have no inflected present tense. There are similar problems with such non-lexical
verbal forms as cāhiē and gā. Thus it makes sense to use the status attribute to
distinguish (mostly regular) lexical verbs and (irregular) auxiliary verbs, so that the
marking unique to the latter can be tagged on them alone. The optional
third value of the status attribute, semi-auxiliary, has been used as described below.