Annotation Model for AnnCorra : Guidelines For POS And Chunk Annotation For Indian Languages as described by Bharati et al. (2006). Unless marked otherwise, all comments here are quotes from this document.
Bharati et al. (2006) claim to provide a tagset applicable to all Indian languages. They explicitly mention Hindi, Bangla, Marathi, Telugu, and Tamil, and the tagset can be assumed to be applicable to at least these languages. The document is, however, mostly working with examples from Hindi and Bangla.
Akshar Bharati, Dipti Misra Sharma, Lakshmi Bai, Rajeev Sangal (2006), AnnCorra : Annotating Corpora. Guidelines For POS And Chunk Annotation For Indian Languages, Tech. Rep., Language Technologies Research Centre IIIT, Hyderabad, version of 15-12-2006, http://ltrc.iiit.ac.in/tr031/posguidelines.pdf
JJ Adjective
This tag is also taken from Penn tags. Penn tag set also makes a distinction between comparative and superlative adjectives. This has not been considered here. Therefore, in the current scheme for Indian languages, the tag JJ includes the 'tara' (comparative) and the 'tama' (superlative) forms of adjectives as well.
For example, Hindi adhikatara (more times), sarvottama (best), etc. will also be marked as JJ.
RB Adverb
For the adverbs also, the tag RB has been borrowed from Penn tags. Similar to the adjectives, Penn tags make a distinction between comparative and superlative adverbs as well. This distinction is not made in this tagger. This is in accordance with our philosophy of coarseness in linguistic analysis.
Another important decision for the use of RB for adverbs in the current scheme is that :-
(a) The tag RB will be used ONLY for 'manner adverbs' . Example,
h21. vaha jaldI jaldI khA rahA thA
'he' 'hurriedly' 'eat' 'PROG' 'was'
(b) The tag RB will NOT be used for time and place expressions, unlike English where time and place expressions are also marked as RB. In our scheme, time and place expressions such as 'yahAz - vahAz', 'aba - waba' etc. will be marked as PRP.
VAUX Verb Auxiliary
All auxiliary verbs will be marked as VAUX. This tag has been adopted as such from the Penn tags. (For examples, see h14 - h16 above). [CC: see MainVerb for examples]
QC Cardinals
Any word denoting a cardinal number will be tagged as QC. Penn tag set has a tag CD for cardinal numbers and they have not talked of ordinals. For example,
h35. vahAz tIna_QC loga bEThe the
'there' 'three' 'people' 'sitting' 'were'
"Three people were sitting there"
"A minimal (non recursive) phrase (partial structure) consisting of correlated, inseparable words/entities, such that the intra-chunk dependencies are not distorted" ... A chunk would contain a 'head' and its modifiers.
CL Classifiers
The tag CL has been included to mark classifiers. Many Indian languages have a rich classifier system. "A classifier, in linguistics, is a word or morpheme used in some languages to classify a noun according to its meaning" (http://en.wikipedia.org/wiki/Classifier_%28linguistics%29).
For example,
Telugu : (t2) padi mandi pillalu
'ten' 'persons' 'children'
Tamil : (tm1) pattu pEr mANavarakaLa
'ten' 'person' 'students'
The words 'mandi' (Telugu) and 'pEr' (Tamil) are classifiers which occur with numerals with human nouns. Such expressions when occurring separately (not suffixed with the noun) will be marked as CL. Therefore :
Telugu : (t2) padi mandi_CL pillalu
'ten' 'persons' 'children'
Tamil : (tm1) pattu pEr_CL mANavarakaLa
'ten' 'person' 'students'
*C Compounds (Make it XC - where X is a variable of the type of the compound of which the current word is a member)
The issue of including a tag for marking compounds was discussed extensively. Results of algorithms using IIIT-H tag set which included NNC (part of compound nouns) and NNPC (part of proper nouns) showed that these two tags contributed substantially to the low accuracy of the tagger. Since most elements which occur as NNC or NNPC can also occur as NN and NNP, it affected the learning by the machine. So, the question was, why to include tags which contributed more to the errors ? The other aspect, however, was that while human annotators are annotating the data, they know from the context when a certain element is NNC or NN, NNPC or NNP and if marked, this information can be useful for certain applications. The argument is same as the one in favor of including a tag for proper nouns.
Another point which was discussed was that any word class can have compound forms in Indian languages (including adjectives and adverbs).
Therefore, if we decide to have a tag for showing compounds of each type, the number of tags will go very high. The final decision on this was to include a *C tag which will be realised as catC tag of the type of compound that the element is a part of. For example, if a certain word is part of a compound noun, it will be marked as NNC, if it is part of a compound adjective, it will be marked as JJC and so on and so forth.
Some examples are given below :
Hindi compound noun keMdra sarakAra (Central government) will be tagged as keMdra_NNC sarakAra_NN.
In this example, 'keMdra' and 'sarakAra' are both nouns which are forming a compound noun. All words of a compound word, except the last one, will be marked as NNC. Thus any NNC will be always followed by another NNC or an NN. This strategy helps identify these words as one unit although they are not conjoined by a hyphen. Similarly, all words of a compound proper noun, excluding the last one, will be marked as NNPC, e.g. aTala_NNPC bihArI_NNPC vAjapeyI_NNP
The first two words, in the above example, will be tagged as NNPC and the last one will be tagged as NNP. Similar to the NNC tag for common nouns, NNPC tag helps in marking parts of a proper noun.
h41. rAma, mohana aur shyAma ghara gaye.
'Ram', 'Mohan' 'and' 'Shyam' 'home' 'went'
"Ram, Mohan and Shyam went home".
h42. bagIce meM ranga_JJC biraMge_JJ phUla khile the
'garden' 'in' 'colourful' 'flowers' 'flowered' 'were'
"The garden had colorful flowers"
Titles such as Dr., Col., Lt. etc. which may occur before a proper noun will be tagged as NNC. All such titles will always be followed by a Proper Noun. In order to indicate that these are parts of proper nouns but are nonetheless nouns themselves, they will be tagged as NNC, eg. Col._NNC Ranjit_NNPC Deshmukh_NNP
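The adjacency constraints described in this section can be checked mechanically. The following is a minimal sketch in Python and is not part of the guidelines: the function names and the ALLOWED_AFTER table are hypothetical, and the table is one interpretation of the rules above (a compound-part tag is followed by the same tag or by its base tag, and a title tagged NNC may precede a proper noun).

```python
# Hypothetical table: for each compound-part tag, the tags that may follow it.
# NNC may also be followed by NNPC/NNP to cover titles such as Col._NNC.
ALLOWED_AFTER = {
    "NNC": {"NNC", "NN", "NNPC", "NNP"},
    "NNPC": {"NNPC", "NNP"},
    "JJC": {"JJC", "JJ"},
}


def parse_tagged(sentence):
    """Split 'word_TAG' tokens into (word, tag) pairs; untagged tokens get tag None."""
    pairs = []
    for token in sentence.split():
        word, sep, tag = token.rpartition("_")
        if sep and tag.isupper():
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


def compound_errors(pairs):
    """Return (index, word, tag, next_tag) for every compound-part tag
    that is not followed by one of the tags allowed after it."""
    errors = []
    for i, (word, tag) in enumerate(pairs):
        if tag not in ALLOWED_AFTER:
            continue
        next_tag = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if next_tag not in ALLOWED_AFTER[tag]:
            errors.append((i, word, tag, next_tag))
    return errors
```

Calling compound_errors(parse_tagged("Col._NNC Ranjit_NNPC Deshmukh_NNP")) returns an empty list, while a compound-part tag at the end of a sentence or before a verb is reported. An explicit table is used rather than a rule such as "any tag ending in C", because QC (cardinals) and CC (conjuncts) also end in C but are not compound tags.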
CC Conjuncts(co-ordinating and subordinating)
The tag CC will be used for both, co-ordinating and subordinating conjuncts. The Penn tag set has used IN tag for prepositions and subordinating conjuncts. Their rationale behind this is that subordinating conjuncts and prepositions can be distinguished because subordinating conjuncts are followed by a clause and prepositions by a noun phrase.
But in the current tagger all connectives, other than prepositions, will be marked as CC.
h28. mohana bAzAra jA rahA hE Ora_CC ravi skUla jA rahA hE
'Mohan' 'market' 'go' 'PROG' 'is' 'and' 'Ravi' 'school' 'go' 'PROG' 'is'
"Mohan is going to the market and Ravi is going to the school"
h29. mohana ne mujhe batAyA ki_CC Aja bAzAra banda hE
'Mohan' 'erg' 'to me' 'told' 'that' 'today' 'market' 'close' 'is'
"Mohan told me that the market is closed today."
DEM Demonstratives
The tag 'DEM' has been included to mark demonstratives. The necessity of including a tag for demonstratives was felt to cover the distinction between a pronoun and a demonstrative. For example,
h12. vaha ladakA merA bhAI hE (demonstrative)
'that' 'boy' 'my' 'brother' 'is'
h13. vaha merA bhAI hE (pronoun)
'he' 'my' 'brother' 'is'
Many Indian languages have different words for demonstrative adjectives and pronouns. A better evidence for including a separate tag for demonstratives is from the following Telugu examples,
t1. A abbAyi nA tammudu
'that' 'boy' 'my' 'brother'
t2. atanu nA tammudu
'he' 'my' 'brother'
(Telugu does not have a copula 'be' in the present tense)
ECH Echo words
Indian languages have a highly productive usage of echo words such as Hindi 'cAya-vAya' ('tea' 'echo'), where 'cAya' is a regular lexical item of Hindi vocabulary and 'vAya' is an echo word indicating the sense "etc". These words, on their own, are 'nonsense' words and do not find a place in any dictionary.
Thus, the gloss for 'cAya-vAya' would be 'tea etc'. It is proposed to add the tag ECH for such words.
VGNN Gerunds
A verb chunk having a gerund will be annotated as VGNN. For example,
h18a. sharAba ((pInA_VM))_VGNN sehata ke liye hAnikAraka hE.
'liquor' 'drinking' 'health' 'for' 'harmful' 'is'
"Drinking (liquor) is bad for health"
h19a. mujhe rAta meM ((khAnA_VM))_VGNN acchA lagatA hai
'to me' 'night' 'in' 'eating' 'good' 'appeals'
"I like eating at night"
h20a. ((sunane_VM meM_PSP))_VGNN saba kuccha acchA lagatA hE
'listening' 'in' 'all' 'things' 'good' 'appeal' 'is'
VGINF Infinitival Verb Chunk
This tag is to mark the infinitival verb form. In Hindi, both, gerunds and infinitive forms of the verb end with a -nA suffix. Since both behave functionally in a similar manner, the distinction is not very clear. However, languages such as Bangla etc have two different forms for the two types. Examples from Bangla are given below.
b8. Borabela ((snAna karA))_VGNN SorIrera pokze BAlo
'Morning' 'bath' 'do-verbal noun' 'health-gen' 'for' 'good'
"Taking bath in the early morning is good for health"
b9. bindu Borabela ((snAna karawe))_VGINF BAlobAse
'Bindu' 'morning' 'bath' 'take-inf' 'love-3pr'
"Bindu likes to take bath in the early morning"
In Bangla, the gerund form takes the suffix -A / -Ano, while the infinitive marker is -we. The syntactic distribution of these two forms of verbs is different. For example, the gerund form is allowed in the context of the word darakAra "necessary" while the infinitive form is not, as exemplified below:
b10 Borabela ((snAna karA))_VGNN darakAra
'Morning' 'bath' 'do-verbal noun' 'necessary'
"It is necessary to take bath in the early morning"
b11. *Borabela ((snAna karawe))_VGINF darakAra
Based on the above evidence from Bangla, the tag VGINF has been included to mark a verb chunk.
INTF Intensifier
This tag is not present in Penn tag set. Words like 'bahuta', 'kama', etc. when intensifying adjectives or adverbs will be annotated as INTF. Example,
h37. hEdarAbAda meM aMgUra bahuta_INTF acche milate hEM
'HyderabAd' 'in' 'grapes' 'very' 'good' 'available' 'are'
"Very good grapes are available in Hyderabad".
INJ Interjection
The interjections will be marked as INJ. Apart from the interjections, the affirmatives such as Hindi 'HAz' ('yes') will also be tagged as INJ. Since this is the only example of such a word, it has been clubbed under Interjections.
h38. arre_INJ, tuma A gaye !
'oh' 'you' 'come' 'have'
"Oh! you have come"
h39. hAz_INJ, mEM A gayA
'yes', 'I' 'come' 'have'
"Yes, I have come".
VM Verb Main
Verbal constructions in languages may be composed of more than one word sequences. Typically, a verb group sequence contains a main verb and one or more auxiliaries (V AUX AUX ... ... ). In the current tagging scheme the support verbs (such as dAlanA in kara dAlAtA hE, uThanA in cOMka uThA thA etc) are also tagged as VAUX. The group can be finite or non-finite. The main verb need not be marked for finiteness. Normally, one of the auxiliaries carries the finiteness feature.
The necessity of marking the finiteness or non-finiteness in a verb was discussed extensively and everybody agreed that it was crucial to mark the distinction. However, languages such as Hindi, which have auxiliaries for marking tense, aspect and modalities pose a problem. The finiteness of a verbal expression is known only when we reach the last auxiliary of a verb group. Main verb of a finite verb group (leaving out the single word verbal expressions of the finite type, e.g. vaha dillI gayA) does not contain finiteness information. For example,
h14. laDZakA seba khAtA raHA wA
'boy' 'apple' 'eating' 'PROG' 'was'
The boy had kept eating.
h15. seba khAtA huA laDZakA jA rahA thA
'apple' 'eating' 'PROG' 'boy' ' go' 'PROG' 'was'
The boy eating the apple was going.
The expression khAtA raHA in (h14) above is finite and khAtA huA in (h15) is non-finite. However, the main verb 'khAtA' is non-finite in both the cases.
So, the issue is whether to (1a) mark finiteness in "khAtA rahA thA (had kept eating)" at the lexical level on the main verb (khA) or (1b) on the auxiliary containing finiteness (wA) or (2) not mark it at the lexical level at all. All the three possibilities were discussed:
1) Mark the finiteness at the lexical level.
If we mark it at the lexical level, following possibilities are available :
1a) Mark the finiteness on the main verb, even though we know that the lexical item itself is not finite.
In this case, the annotator interprets the finiteness from the context. (The POS tags VF, VNF and VNN were earlier decided based on this approach). The main verb, therefore, is marked as finite consciously with a view that the group contains a 'verb root' and its auxiliaries (as TAM etc) is finite even though the main verb does not carry the finiteness at the lexical level. Although, this approach facilitates annotation of both the main verb and the finiteness (of the group) by a single tag, it allows tagging a lexical item (main verb) with the finiteness feature which it does not actually carry. So, this is not a neat solution.
1b) The second possibility is, mark the finiteness on the last auxiliary of the sequence. Here again the decision has to be taken from the context. This possibility was not considered since this also involves marking the verb finiteness at the lexical level.
2) Don't mark the finiteness at the lexical level. Instead mark it as indicated in (2a) or (2b) below.
2a) Introduce a new layer which groups the verb group and mark the verb group as finite or non-finite. This approach proposes the following :
(i) Annotate the main verb as VM (introduce a new tag). Thus,
h14a. laDZakA seba khAtA_VM raHA thA
'boy' 'apple' 'eating' 'PROG' 'was'
h15a. seba khAtA_VM huA laDZakA jA rahA thA
'apple' 'eating' 'PROG' 'boy' ' go' 'PROG' 'was'
(ii) Annotate the auxiliaries as VAUX,
h14a. laDZakA seba khAtA_VM raHA_VAUX thA_VAUX
'boy' 'apple' 'eating' 'PROG' 'was'
h15a. seba khAtA_VM huA_VAUX laDZakA jA rahA thA
'apple' 'eating' ' PROG' 'boy' ' go' 'PROG' 'was'
(iii) Group the verb group (before chunking) and annotate it as finite or non-finite as the case may be,
h14a. laDZakA seba [khAtA_VM raHA_VAUX wA_VAUX]_VF
'boy' 'apple' 'eating' 'PROG' 'was'
h15a. seba [khAtA_VM huA_VAUX]_VNF laDZakA jA rahA thA
'apple' 'eating' 'PROG' 'boy' ' go' 'PROG' 'was'
This approach is more faithful to the available linguistic information. However, it requires introducing another layer. So, this was not considered useful.
2b) Mark the finiteness at the chunk level,
In this approach, the lexical items are marked as in (2). No new layer is introduced. Instead, the decision is postponed to the chunk level. Since the finiteness is in the group, it is marked at the chunk level. This offers the best solution as it facilitates marking the linguistic information as it is without having to introduce a new layer.
h14a. laDZakA seba ((khAtA_VM raHA_VAUX wA_VAUX))_VGF
'boy' 'apple' 'eating' 'PROG' 'was'
h15a. seba ((khAtA_VM huA_VAUX))_VGNF laDZakA jA rahA thA
'apple' 'eating' 'PROG' 'boy' ' go' 'PROG' 'was'
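Approach (2b) can be pictured as a two-level structure: the lexical tags (VM, VAUX) say nothing about finiteness, and the chunk tag (VGF or VGNF) carries it. The following is a small sketch in Python; the names FINITENESS and describe_verb_chunk and the tuple representation are hypothetical and chosen only for illustration.

```python
# Finiteness is read from the chunk tag only, never from VM or VAUX.
FINITENESS = {"VGF": "finite", "VGNF": "non-finite"}


def describe_verb_chunk(chunk_tag, tokens):
    """Summarise a verb chunk given its chunk tag and its (word, tag) tokens."""
    return {
        "main_verbs": [word for word, tag in tokens if tag == "VM"],
        "auxiliaries": [word for word, tag in tokens if tag == "VAUX"],
        # None for chunk tags that are not in the FINITENESS table.
        "finiteness": FINITENESS.get(chunk_tag),
    }


# The two verb chunks from (h14a) and (h15a).
h14a = describe_verb_chunk("VGF", [("khAtA", "VM"), ("raHA", "VAUX"), ("wA", "VAUX")])
h15a = describe_verb_chunk("VGNF", [("khAtA", "VM"), ("huA", "VAUX")])
```

In both results the main verb is the same item, khAtA tagged VM; only the chunk tag distinguishes the finite group from the non-finite one.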
In this case also the decision is made by looking at the entire group. (2b) was most preferred as it facilitates marking the linguistic information correctly, at the same time no new layer needs to be introduced. Therefore, the current tagging scheme has adopted this approach. Thus, the main verbs in a given verb group will be marked as VM, irrespective of whether the total verb group is finite or non-finite. Given underneath are some examples of other verb group types :
1) Non finite verb groups - Non-finite verb groups can have two functions :
a) Adverbial participial, for example : khAte-khAte in the following Hindi sentence,
h16. mEMne khAte - khAte ghode ko dekhA
'I erg' 'while eating' 'horse' 'acc' 'saw'
"I saw a horse while eating".
The main verb in (h16) would be annotated as follows :
h16a. mEMne khAte - khAte_VM ghode ko dekhA
b) Adjectival participial, for example : 'khAte Hue' in the following Hindi sentence ,
h17. mEMne ghAsa khAte_VM hue ghoDe ko dekhA *
'I erg' 'grass' 'eating' 'PROG' 'horse' 'acc' 'saw'
I saw the horse eating grass.
(* (h17) is ambiguous in Hindi. The other sense that it can have is, I saw the horse while (I was) eating grass. In such cases, the annotator would disambiguate the sentence depending on the context and mark accordingly.)
2) Gerunds
Functionally, gerunds are nominals. However, even though they function like nouns, they are capable of taking their own arguments, e.g. pInA in the following Hindi sentence can occur on its own or take an argument (given in parenthesis):
h18. (sharAba) pInA_VM sehata ke liye hAnikAraka hE.
'liquor' 'drinking' 'health' 'for' 'harmful' 'is'
"Drinking (liquor) is bad for health"
h19. mujhe khAnA_VM acchA lagatA hai
'to me' 'eating' 'good' 'appeals'
"I like eating"
h20. sunane meM saba kuccha acchA lagatA hE
'listening' 'in' 'all' 'things' 'good' 'appeal' 'is'
As mentioned above, noun 'sharAba' in (h18) is an object of the verb 'pInA' and has no relation to the main verb (hE). In order to be able to show the exact verb-argument structure in the sentence, it is essential that the crucial information of a noun derived from a verb is preserved. Therefore, even gerunds have to be marked as verbs. It is proposed that in keeping with the approach adopted for non-finite verbs, mark gerunds also as VM at the lexical level. For capturing the information that they are gerunds, such verbs will be marked as VGNN (see the section on Chunk tags for details) at the chunk level to capture their gerundial nature. The verbs having 'vAlA' vibhakti will also be marked as VM. For example, 'khonevAlA' (one who loses).
NEG Negative
Negatives like Hindi 'nahIM' (not), 'nA' (no, not), etc. will be marked as NEG.
For example,
h40. vaha Aja nahIM_NEG A pAyegA
'he' 'today' 'not' 'come' 'will be able'
Also, see examples (b2) and (h25) given above.
Indian languages have reiteration of NEG in certain constructions. For example,
b5. tumi chobitA dekhbe ?
'you' 'picture-def' 'will see' ?
"Will you see the picture?"
b6. nA_NEG, xekhabo nA_NEG
'no' 'will see (I)' 'not'
"No, I will not see (it)"
The first occurrence of 'nA' in such constructions will also be marked as NEG.
The tag NN for nouns has been adopted from Penn tags as such. The Penn tag set makes a distinction between noun singular (NN) and noun plural (NNS). As mentioned earlier, distinct tags based on grammatical information are avoided in IL tagging scheme. Any information that can be obtained from any other source is not incorporated in the POS tag. Plurality, for example, can be obtained from a morph analyzer. Moreover, as mentioned earlier, if a particular information is considered crucial at the POS tagging level itself, it can be incorporated at a later date with the help of heuristics and linguistic rules. This approach brings the number of tags down, and helps achieve simplicity, consistency, better machine learning with a small corpus etc. Therefore, the current scheme has only one tag (NN) for common nouns without getting into any distinction based on the grammatical information contained in a given noun word. (Bharati et al. 2006, §5.1.1)
NP Noun Chunk
Noun Chunks will be given the tag NP and include non-recursive noun phrases and postpositional phrases. The head of a noun chunk would be a noun. Specifiers will form the left side boundary for a noun chunk and the vibhakti or head noun will mark the right hand boundary for it. Descriptive adjective/s modifying the noun will be part of the noun chunk. The particle which anchors to the head noun in a noun chunk will also be grouped within the chunk. If it occurs after the noun or vibhakti, it will make the right boundary of the chunk.
Some example noun chunks are :
((bacce_NN))_NP, ((kucha_QF bacce_NN))_NP,
'children' 'some' 'children'
((kucha_QF acche_JJ bacce_NN))_NP, ((Dibbe_NN meM_PSP))_NP,
'some' 'good' 'children' 'box' 'in'
(( eka_QC kAlA_JJ ghoDZA_NN))_NP ,
'one' 'black' 'horse'
((yaha_DEM nayI_JJ kitAba_NN))_NP,
'this' 'new' 'book'
(( isa_DEM nayI_JJ kitAba_NN meM_PSP))_NP,
'this' 'new' 'book' 'in'
(( isa_DEM nayI_JJ kitAba_NN meM_PSP bhI_RP))_NP
'this' 'new' 'book' 'in' 'also'
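The bracketed chunk notation used in these examples, ((word_TAG ...))_CHUNKTAG, is regular enough to be read with a short script. The following is a minimal sketch in Python; the names CHUNK_RE, split_token and parse_chunks are hypothetical, and the sketch assumes flat (non-recursive) chunks written exactly as in the examples above.

```python
import re

# Matches one flat chunk: "((" body "))_" chunk tag. The body is matched lazily
# so that several chunks on one line are found separately.
CHUNK_RE = re.compile(r"\(\(\s*(.*?)\s*\)\)_(\w+)")


def split_token(token):
    """Split 'word_TAG' into (word, tag); a token without '_' gets tag None."""
    word, sep, tag = token.rpartition("_")
    return (word, tag) if sep else (token, None)


def parse_chunks(text):
    """Return a list of (chunk_tag, [(word, tag), ...]) for every chunk in text."""
    chunks = []
    for body, chunk_tag in CHUNK_RE.findall(text):
        tokens = [split_token(token) for token in body.split()]
        chunks.append((chunk_tag, tokens))
    return chunks
```

For instance, parse_chunks("((Dibbe_NN meM_PSP))_NP") yields one NP chunk whose last token is the postposition meM, which is the right-hand boundary described above. Text outside any chunk is ignored by this sketch.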
The issue of genitive marker and its grouping with the nouns that it relates to was discussed in detail. For example, the noun phrase 'rAma kA beTA' contains two nouns 'rAma' and 'beTA'. The two nouns are related to each other by the vibhakti 'kA'. The issue is whether to chunk the two nouns separately or together? Linguistically, 'beTA' is the head of the phrase "rAma kA beTA".
'rAma' is related to 'beTA' by a genitive relation which is expressed through the vibhakti 'kA'. Going by our definition of a 'chunk' we should break 'rAma kA beTA' into two chunks ( ((rAma kA))_NP, ((beTA))_NP ) by breaking 'rAma kA' at 'kA' vibhakti . Moreover, if we chunk 'rAma kA beTA' as one chunk, linguistically, we will end up with a recursive noun phrase as a single chunk ((((rAma kA)) beTA)) which also is against our definition of a chunk.
Therefore, it was decided that the genitive markers will be chunked along with the preceding noun. Thus, the noun group 'rAma kA beTA' would be chunked into two chunks.
h54. ((rAma kA))_NP ((beTA))_NP acchA hE "Ram's son is good"
h55. ((kitAba))_NP ((rAma kI))_NP hE "The book belongs to Ram"
For the noun groups such as "usakA beTA" it was decided that they should be chunked together.
QO Ordinals
Expressions denoting ordinals will be marked as QO.
h36. mEMne kitAba tIsare_QO laDake ko dI thI
'I' 'book' 'third' 'boy' 'to' 'give' 'was'
"I gave the book to the third boy"
BLK Miscellaneous entities
Entities such as interjections and discourse markers that cannot fall into any of the above mentioned chunks will be kept within a separate chunk.
eg. ((oh_INJ))_BLK, ((arre_INJ))_BLK
8.3 Some Special Cases
Apart from the above, some special cases related to certain lexical types are discussed below.
RP Particle
Expressions such as bhI, to, jI, sA, hI, nA, etc in Hindi would be marked as RP.
The nA in the above list is different from the negative nA. Hindi and some other Indian languages have an ambiguous 'nA' which is used both for negation (NEG) and for reaffirmation (RP). Similarly, the particle wo is different from CC wo. For example in Bangla and Hindi:
Bangla : (b1) tumi nA_RP khub dushtu
'you' 'particle' 'very' 'naughty'
"You are very naughty" (comment)
Hindi : (h24) tuma nA_RP, bahuta dushta ho
'you' 'particle' 'very' 'naughty'
"You are very naughty" (comment)
Bangla : (b2) cheleta dushtu nA_NEG
'the boy' 'naughty' 'not'
"The boy is not naughty"
Hindi : (h25) mEM nA_NEG jA sakUMgA
'I' 'not' 'go' 'will able'
"I will not be able to go"
Bangla : (b3) binu yYoxi khAya to_CC Ami khAba
'Binu' 'if' 'eats' 'then' 'I' 'will eat'
"If Binu eats then I will eat (too)"
Hindi : (h26) yadi binu khAyegA wo_CC mEM khAUMgI
'if' 'Binu' 'eats' 'then' 'I' 'will eat'
"Only if Binu eats, I will eat (too)"
Bangla : (b4) Ami to_RP jAni nA
'I' 'particle' 'know' 'not'
"I don't know"
Hindi : (h27) mujhako to_RP nahIM patA
'I' 'particle' 'not' 'know'
"I don't know"
(Bharati et al. 2006, §5.9)
Hindi (and some other Indian languages) has particles such as 'jI' or 'sAHaba' etc. after proper nouns or personal pronouns. These particles are added to denote respect to the referred person. Such honorific words will be treated like particles and will be tagged RP like other particles.
h53. mantrI_NN jI_RP sabhA meM dera se pahuMce .
'minister' 'hon' 'meeting' 'in' 'late' 'part' 'reached'
"The minister reached late for the meeting".
(Bharati et al. 2006, §6.2)
"The Penn tags are most commonly used tags for English. Many tag sets designed subsequently have been a variant of this tag set (eg. Lancaster tag set). So, while deciding the tags for this tagger, the Penn tags have been used as a benchmark. Since the Penn tag set is an established tag set for English, we have used the same tags as the Penn tags for common lexical types. However, new tags have been introduced wherever Penn tags have been found inadequate for Indian language descriptions. For example, for verbs none of the Penn tags have been used. Instead, AnnCorra has only two tags for annotating verbs, VM (main verb) and VAUX (auxiliary verb)." (Bharati et al. 2006)
PSP Postposition
All Indian languages have the phenomenon of postpositions. Postpositions express certain grammatical functions such as case etc. The postposition will be marked as PSP in the current tagging scheme. For example,
h22. mohana kheta meM khAda dAla rahA thA
'Mohan' 'field' 'in' 'fertilizer' 'put ' 'PROG' 'was'
meM in the above example is a postposition and will be tagged as PSP.
A postposition will be annotated as PSP ONLY if it is written separately. In case it is conjoined with the preceding word it will not be marked separately. For example, in Hindi pronouns the postpositions are conjoined with the pronoun,
h23. mEne usako bAzAra meM dekhA
'I' 'him' 'market' 'in' 'saw'
(h23) above has three instances of 'postposition' (in bold) usage. The postpositions 'ne' and 'ko' are conjoined with the pronouns mEM and usa respectively. The third postposition 'meM' is written separately. In the first two instances, the postposition will not be annotated. Such words will be annotated with the category of the head word. Therefore, the three instances mentioned above will be annotated as shown in (h23a) below :
h23a. mEne_PRP usako_PRP bAzAra_NN meM_PSP dekhA
PRP Pronoun
Penn tags make a distinction between personal pronouns and possessive pronouns. This distinction is avoided here. All pronouns are marked as PRP. In Indian languages all pronouns inflect for all cases (accusative, dative, possessive etc.). In case we have a separate tag for possessive pronouns, new tags will have to be designed for all the other cases as well. This will increase the number of tags which is unnecessary. So only one tag is used for all the pronouns. The necessity for keeping a separate tag for pronouns was also discussed, as linguistically, a pronoun is a variable and functionally it is a noun. However, it was decided that the tag for pronouns will be helpful for anaphora resolution tasks and should be retained.
NNP Proper Nouns
The need for a separate tag for proper nouns and its usability was discussed. Following points were raised against the inclusion of a separate tag for proper nouns :
a) Indian languages, unlike English, do not have any specific marker for proper nouns in orthographic conventions. English proper nouns begin with a capital letter which distinguishes them from common nouns.
b) All the words which occur as proper nouns in Indian languages can also occur as common nouns denoting a lexical meaning. For example,
English : John, Harry, Mary occur only as proper nouns whereas
Hindi : aTala bihArI, saritA, aravinda etc are used as 'names' and they also belong to grammatical categories of words with various senses . For example given below is a list of Hindi words with their grammatical class and sense.
aTala adj immovable
bihArI adj from Bihar
saritA noun river
aravinda noun lotus
Any of the above words can occur in texts as common lexical items or as proper names. (h9) - (h11) below show their occurrences as proper nouns,
h9. atala bihAri bAjapaI bhArata ke pradhAna mantrI the.
'Atal' 'Bihari' 'Vajpayee' 'India' 'of' 'prime' 'minister' 'was'
"Atal Behari Vajpayee was the Prime Minister of India".
h10. merI mitra saritA tAIvAna jA rahI hE.
'my' 'friend' 'Sarita' 'Taiwan' 'go' 'PROG' 'is'
"My friend Sarita is going to Taiwan"
h11. aravinda ne mohana ko kitAba dI.
'Aravind' 'erg' 'Mohan' 'to' 'book' 'gave'
"Aravind gave the book to Mohan".
Therefore, in the Indian languages' context, annotating proper nouns with a separate tag will not be very fruitful from machine learning point of view. In fact, the identification of proper nouns can be better achieved by named entity filters.
Another point that was considered in this context was the effort involved in manual tagging of proper nouns in a given text. It is felt that not much extra effort is required in manual tagging of proper nouns. However, the data annotated with proper nouns can be useful for certain applications. Therefore, there is no harm in marking the information if it does not require much effort.
Finally, it was decided to have a separate tag for proper nouns for manual annotation and ignore it for machine learning algorithms. Following this decision, the tag NNP is included in the tag set. This tag is the same as the Penn tag for proper nouns. However, in this case also AnnCorra has only one tag for both singular and plural proper nouns unlike Penn tags where a distinction is made between proper noun singular and proper noun plural by having two tags NNP and NNPS respectively.
QF Quantifiers
All quantifiers like Hindi kama (less), jyAdA (more), bahuwa (lots), etc. will be marked as QF.
h34. vahAz bahuta_QF loga Aye the
'there' 'many' 'people' 'came' 'was'
"Many people came there".
In case these words are used in constructions like 'baHutoM ne jAne se inkAra kiyA' ('many' 'by' 'to go' 'refused'; Many refused to go) where it is functioning like a noun, it will be marked as NN (noun). Quantifiers of number will be marked as below.
WQ Question Words
The Penn tag set makes a distinction between various uses of 'wh-' words and marks them accordingly (WDT, WRB, WP, WQ etc). The 'wh-' words in English can act as questions, as relative pronouns and as determiners. However, for Indian languages we need not keep this distinction. Therefore, we tag the question words as WQ.
h30. kOna AyA hE ?
'who' 'come' 'has'
"Who has come?"
h31. tuma kala kyA kara rahe ho ?
'you' 'tomorrow' 'what' 'doing' 'are'
"What are you doing tomorrow?"
h32. tuma kala kahAz jA rahe ho ?
'you' 'tomorrow' 'where' 'going' 'are'
"Where are you going tomorrow?"
h33. kyA tuma kala Aoge ?
'?' 'you' 'tomorrow' 'will come'
"Will you come tomorrow?"
RDP Reduplication
In this phenomenon of Indian languages, the same word is written twice for various purposes such as indicating emphasis, deriving a category from another category etc. eg. choTe choTe ('small' 'small'; very small), lAla lAla ('red' 'red'; red), jaldI jaldI ('quickly' 'quickly' ; very quickly)
There are two ways in which such word sequences may be written. They can be written (a) separated by a space or (b) separated by a hyphen.
The question to be resolved is: in case they are written as two words (separated by space), how should they be tagged? The earlier decision was to use the same tag for both the words. However, in this approach, the morphological character of reduplication is missed out. That is, the reduplicated item will then be treated exactly like two independent words of the same category. For example,
h43. vaha mahaMgI_JJ mahaMgI_JJ cIjZeM kharIda lAyA
'he' 'expensive' 'expensive' 'things' 'buy' 'bring'
"He bought all expensive things".
h44. una catura_JJ buddhimAna_JJ baccoM ne samasyA sulajhA lI
'those' 'smart' 'intelligent' 'children' 'erg' 'problem' 'solved'
"Those smart and intelligent children solved the problem."
Both (h43) and (h44) have a sequence of adjectives - mahaMgI_JJ mahaMgI_JJ and catura_JJ buddhimAna_JJ respectively. In the first case, the sequence of two adjectives is a case of reduplication (same adjective is repeated twice to indicate the intensity of 'expensive') whereas in the second case the two adjectives refer to two different properties attributed to the following noun. Since reduplication is a highly productive process in Indian languages, it is proposed to include a new tag RDP for annotating reduplicatives. The first word in a reduplicative construction will be tagged by its respective lexical category and the second word will be tagged as RDP to indicate that it is a case of reduplication distinguishing it from a normal sequence such as in (h44) above. Some more examples are given underneath to make it more explicit,
h45. vaha dhIre_RB dhIre_RDP cala rahA thA.
'he' 'slowly' 'slowly' 'walk' 'PROG' 'was'
'He was walking (very) slowly.'
h46. usake bAla choTe_JJ choTe_RDP the.
'his' 'hair' 'short' 'short' 'were'
'He had (very) short hair.'
h47. yaha bAta galI_NN galI_RDP meM phEla gayI.
'this' 'talk' 'lane' 'lane' 'in' 'spread' 'went'
'The word was spread in every lane.'
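The RDP rule above can be sketched in code. This is a minimal illustration, not part of the guidelines: it assumes the input is a list of (word, tag) pairs in which both halves of a space-separated reduplication initially carry the same lexical tag (the "earlier decision"), and the function name `mark_reduplication` is hypothetical. Hyphenated reduplication is not handled, and an accidental repetition of the same word would also be retagged, so an annotator would still need to check the output.

```python
def mark_reduplication(tagged):
    """Retag the second token of an adjacent identical pair as RDP.

    `tagged` is a list of (word, tag) pairs. The first word of a
    reduplicated pair keeps its lexical tag; the second becomes RDP.
    """
    result = []
    for word, tag in tagged:
        previous = result[-1] if result else None
        # Same word form and same lexical tag as the preceding token:
        # treat it as the second half of a reduplication.
        if previous is not None and previous == (word, tag):
            result.append((word, "RDP"))
        else:
            result.append((word, tag))
    return result


# (h43): reduplication, the second adjective becomes RDP.
print(mark_reduplication([("mahaMgI", "JJ"), ("mahaMgI", "JJ")]))
# (h44): two different adjectives, both keep JJ.
print(mark_reduplication([("catura", "JJ"), ("buddhimAna", "JJ")]))
```

Comparing against the previously emitted pair means that only the second member of a pair is retagged, which matches the rule that the first word keeps its lexical category.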
NST Noun denoting spatial and temporal expressions
"A tag NST has been included to cover an important phenomenon of Indian languages. Certain expressions such as 'Upara' (above/up), 'nIce' (below), 'pahale' (before), 'Age' (front), etc. are content words denoting time and space. These expressions, however, are used in various ways. For example,
5.1.2.1 These words often occur as temporal or spatial arguments of a verb in a given sentence taking the appropriate vibhakti (case marker):
h3. vaha Upara so rahA thA .
'he' 'upstairs' 'sleep' 'PROG' 'was'
'He was sleeping upstairs.'
h4. vaha pahale se kamare meM bEThA thA .
'he' 'beforehand' 'from' 'room' 'in' 'sitting' 'was'
'He was sitting in the room from beforehand.'
h5. tuma bAhara bETho
'you' 'outside' 'sit'
'You sit outside.'
Apart from functioning like an argument of a verb, these elements also modify another noun taking postposition 'kA'.
h6. usakA baDZA bhAI Upara ke hisse meM rahatA hE
'his' 'elder' 'brother' 'upstairs' 'of' 'portion' 'in' 'live' 'PRES'
'His elder brother lives in the upper portion of the house.'
5.1.2.2 Apart from occurring as a nominal expression, they also occur as a part of a postposition along with 'ke'. For example,
h7. ghaDZe ke Upara thAlI rakhI hE.
'pot' 'of' 'above' 'plate' 'kept' 'is'
'The plate is kept on the pot.'
h8. tuma ghara ke bAhara bETho
'you' 'home' 'of' 'outside' 'sit'
'You sit outside the house.'
'Upara' and 'bAhara' are parts of the complex postpositions 'ke Upara' and 'ke bAhara' in (h7) and (h8) respectively, which can be translated into the English prepositions 'on' and 'outside'.
For tagging such words, one possible option is to tag them according to their syntactic function in the given context. For example, in 5.1.2.2 (h7) above, the word 'Upara' occurs as part of a postposition or a relation marker. It can, therefore, be marked as a postposition. Similarly, in 5.1.2.1 (h3) and (h6) above, it is a noun and would therefore be marked as a noun, and so on. Alternatively, since these words are more like nouns, as is evident from 5.1.2.1 above, they can be tagged as nouns in all their occurrences. The same would apply to 'bAhara' (outside) in examples (h5) and (h8).
However, if we follow any of the above approaches we miss out on the fact that this class of words is slightly different from other nouns. These are nouns which indicate 'location' or 'time'. At the same time, they also function as postpositions in certain contexts. Moreover, such words, if tagged according to their syntactic function, will hamper machine learning. Considering their special status, the introduction of a new tag, NST, for such expressions was considered. The following five possibilities were discussed:
a) Tag both (h5) & (h8) as NN
b) Tag both (h5) & (h8) as NST
c) Tag (h5) as NN & (h8) as NST
d) Tag (h5) as NST & (h8) as PSP
e) Tag (h5) as NN & (h8) as PSP
After considering all the above, the decision was taken in favour of (b). The decision was primarily based on the following observations:
(i) 'bAhara' in both (h5) and (h8) denotes the same expression (place expression 'outside')
(ii) In both (h5) and (h8), 'bAhara' can take a vibhakti like a noun (bAhara ko bETho, ghara ke bAhara ko bETho)
(iii) If a single tag is kept for both the usages, the decision making for annotators would also be easier.
Therefore, a new tag NST is introduced for such expressions. The tag NST will be used for a finite set of such words in any language. For example, Hindi has
Age (front), pIche (behind), Upara (above/upstairs), nIce (below/down), bAda (after), pahale (before), andara (inside), bAhara (outside) etc."
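Because NST applies to a finite set of words, decision (b) amounts to a simple lexicon lookup. The sketch below is only an illustration under stated assumptions: the set contains just the eight Hindi words listed above (the source ends the list with "etc.", so it is not exhaustive), and the names `NST_WORDS_HINDI` and `tag_nst` are hypothetical.

```python
# Hindi NST words named in the guidelines (WX-style transliteration).
# The source list ends with "etc.", so this set is incomplete.
NST_WORDS_HINDI = frozenset({
    "Age", "pIche", "Upara", "nIce",
    "bAda", "pahale", "andara", "bAhara",
})


def tag_nst(words, nst_words=NST_WORDS_HINDI):
    """Return (word, tag) pairs: NST for listed words, None otherwise.

    The tag is assigned regardless of syntactic function, so 'bAhara'
    receives NST both as a bare argument (h5) and inside the complex
    postposition 'ke bAhara' (h8).
    """
    return [(w, "NST" if w in nst_words else None) for w in words]


print(tag_nst("tuma bAhara bETho".split()))          # (h5)
print(tag_nst("tuma ghara ke bAhara bETho".split())) # (h8)
```

Words left as None would be tagged by the rest of the scheme; the lookup only covers the NST decision.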
SYM Special Symbol
All those words which cannot be classified under any of the other tags will be tagged as SYM. This tag is similar to the Penn 'SYM'. Special symbols like $, %, etc. are also treated as SYM. Since the frequency of occurrence of such symbols is very low in Indian languages, no separate tag is used for them.
UNK Unknown
A special tag to indicate unknown words is also included in the tag set. The annotators can use this tag to mark the words whose category they are not aware of. This tag has to be used very cautiously and sparingly, i.e., only if it is absolutely necessary.
(Bharati et al. 2006, §5.20)
The presence of loan words is a fairly common phenomenon in languages. Most Indian languages have a number of loan words from English. One may also come across words from other Indian languages or Sanskrit in a given text. Such foreign words will be tagged as per the syntactic function of the word in the given context. In special cases, such as when the annotator is not sure of the category of a word, it will be tagged as UNK.
(Bharati et al. 2006, §6.3)
The tag NN for nouns has been adopted from Penn tags as such. The Penn tag set makes a distinction between noun singular (NN) and noun plural (NNS). As mentioned earlier, distinct tags based on grammatical information are avoided in the IL tagging scheme. Any information that can be obtained from any other source is not incorporated in the POS tag. Plurality, for example, can be obtained from a morph analyzer. Moreover, as mentioned earlier, if a particular piece of information is considered crucial at the POS tagging level itself, it can be incorporated at a later date with the help of heuristics and linguistic rules. This approach brings the number of tags down, and helps achieve simplicity, consistency, better machine learning with a small corpus, etc. Therefore, the current scheme has only one tag (NN) for common nouns, without any distinction based on the grammatical information contained in a given noun.
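The coarsening relative to the Penn tag set can be summarised as a small mapping. This is an illustrative sketch only: the guidelines do not define such a conversion, the names `PENN_TO_IL` and `coarsen` are hypothetical, and the mapping covers only the number and degree distinctions that the scheme is said to drop for NN, JJ and RB.

```python
# Penn tags whose grammatical distinctions (number, degree) are not
# kept in the IL scheme, mapped to the single coarse tag used instead.
PENN_TO_IL = {
    "NNS": "NN",  # plural noun: plurality is left to a morph analyzer
    "JJR": "JJ",  # comparative adjective ('tara' forms)
    "JJS": "JJ",  # superlative adjective ('tama' forms)
    "RBR": "RB",  # comparative adverb
    "RBS": "RB",  # superlative adverb
}


def coarsen(tag):
    """Map a fine-grained Penn tag to its coarse counterpart, if any."""
    return PENN_TO_IL.get(tag, tag)


print([coarsen(t) for t in ["NN", "NNS", "JJS", "RBR"]])
```

Tags without an entry are returned unchanged, so the mapping makes no claim about other differences between the two tag sets, such as the restriction of RB to manner adverbs.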