Ontology - ancorra

Abstract
Annotation Model for AnnCorra: Guidelines for POS and Chunk Annotation for Indian Languages, as described by Bharati et al. (2006). Unless marked otherwise, all comments here are quotes from this document. Bharati et al. (2006) claim to provide a tagset applicable to all Indian languages. They explicitly mention Hindi, Bangla, Marathi, Telugu, and Tamil, so the tagset can be assumed to be applicable to at least these languages. The document, however, works mostly with examples from Hindi and Bangla. Akshar Bharati, Dipti Misra Sharma, Lakshmi Bai, Rajeev Sangal (2006), AnnCorra: Annotating Corpora. Guidelines for POS and Chunk Annotation for Indian Languages, Tech. Rep., Language Technologies Research Centre, IIIT Hyderabad, version of 15-12-2006, http://ltrc.iiit.ac.in/tr031/posguidelines.pdf
Latest Version
http://purl.org/olia/ancorra.owl#

Imports

Classes - Overview

Class hierarchy (each indented class is a subclass of the class above it):

Chunk
    AdjectiveChunk
    AdverbChunk
    ChunkFragment
    Conjunct
    NegativeChunk
    NounChunk
    OtherEntity
    VerbChunk
        FiniteVerbChunk
        GerundChunk
        InfiniteVerbChunk
        NonFiniteVerbChunk
POSTag
    Adjective
    Adverb
    AuxiliaryVerb
    CommonNoun
        Noun
        SpatiotemporalNoun
    Compound
    Conjunction
    EchoWord
    Intensifier
    Interjection
    MainVerb
    Negative
    Particle
    Postposition
    PronounOrDemonstrative
        Demonstrative
        Pronoun
    ProperNoun
    QuantifierOrClassifier
        CardinalNumber
        Classifier
        OrdinalNumber
        Quantifier
    QuestionWord
    Quotative
    Reduplication
    Symbol
    UnknownWord
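The class hierarchy in the overview can be written down as a small lookup table, which makes subclass questions easy to answer in code. The following is a minimal sketch in Python; the names `SUBCLASSES`, `PARENT`, `ancestors` and `is_a` are illustrative and not part of the ontology.

```python
# Parent class -> direct subclasses, transcribed from the overview.
SUBCLASSES = {
    "Chunk": ["AdjectiveChunk", "AdverbChunk", "ChunkFragment", "Conjunct",
              "NegativeChunk", "NounChunk", "OtherEntity", "VerbChunk"],
    "VerbChunk": ["FiniteVerbChunk", "GerundChunk", "InfiniteVerbChunk",
                  "NonFiniteVerbChunk"],
    "POSTag": ["Adjective", "Adverb", "AuxiliaryVerb", "CommonNoun", "Compound",
               "Conjunction", "EchoWord", "Intensifier", "Interjection",
               "MainVerb", "Negative", "Particle", "Postposition",
               "PronounOrDemonstrative", "ProperNoun", "QuantifierOrClassifier",
               "QuestionWord", "Quotative", "Reduplication", "Symbol",
               "UnknownWord"],
    "CommonNoun": ["Noun", "SpatiotemporalNoun"],
    "PronounOrDemonstrative": ["Demonstrative", "Pronoun"],
    "QuantifierOrClassifier": ["CardinalNumber", "Classifier", "OrdinalNumber",
                               "Quantifier"],
}

# Inverted index: subclass -> its single direct parent.
PARENT = {child: parent
          for parent, children in SUBCLASSES.items()
          for child in children}


def ancestors(cls):
    """Return the superclasses of cls, nearest first."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def is_a(cls, ancestor):
    """True if cls is ancestor itself or one of its (transitive) subclasses."""
    return cls == ancestor or ancestor in ancestors(cls)
```

For instance, `ancestors("CardinalNumber")` walks up through `QuantifierOrClassifier` to `POSTag`, while `is_a("Noun", "Chunk")` is false because POS tags and chunks form two separate trees.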

Classes

Adjective
Abstract JJ Adjective This tag is also taken from Penn tags. The Penn tag set also makes a distinction between comparative and superlative adjectives. This has not been considered here. Therefore, in the current scheme for Indian languages, the tag JJ includes the 'tara' (comparative) and the 'tama' (superlative) forms of adjectives as well. For example, Hindi adhikatara (more times), sarvottama (best), etc. will also be marked as JJ.
SubClass Of POSTag
AdjectiveChunk
Abstract e.g., Hindi: vaha laDZakI hE ((suMdara_JJ sI_RP))_JJP (Bharati et al. 2006, §13.2)
SubClass Of Chunk
Adverb
Abstract RB Adverb For the adverbs also, the tag RB has been borrowed from Penn tags. Similar to the adjectives, Penn tags make a distinction between comparative and superlative adverbs as well. This distinction is not made in this tagger. This is in accordance with our philosophy of coarseness in linguistic analysis. Another important decision for the use of RB for adverbs in the current scheme is that: (a) The tag RB will be used ONLY for 'manner adverbs'. Example, h21. vaha jaldI jaldI khA rahA thA 'he' 'hurriedly' 'eat' 'PROG' 'was' (b) The tag RB will NOT be used for the time and place expressions, unlike English where time and place expressions are also marked as RB. In our scheme, the time and place expressions such as 'yahAz - vahAz, aba - waba' etc. will be marked as PRP.
SubClass Of POSTag
AdverbChunk
Abstract e.g., Hindi: vaha ((dhIre-dhIre_RB))_RBP cala rahA thA (Bharati et al. 2006, §13.2)
SubClass Of Chunk
AuxiliaryVerb
Abstract VAUX Verb Auxiliary All auxiliary verbs will be marked as VAUX. This tag has been adopted as such from the Penn tags. (For examples, see h14 - h16 above). [CC: see MainVerb for examples]
SubClass Of POSTag
CardinalNumber
Abstract QC Cardinals Any word denoting a cardinal number will be tagged as QC. The Penn tag set has a tag CD for cardinal numbers and they have not talked of ordinals. For example, h35. vahAz tIna_QC loga bEThe the 'there' 'three' 'people' 'sitting' 'were' 'Three people were sitting there'
SubClass Of QuantifierOrClassifier
Chunk
Abstract "A minimal (non recursive) phrase (partial structure) consisting of correlated, inseparable words/entities, such that the intra-chunk dependencies are not distorted" ... A chunk would contain a 'head' and its modifiers.
Sub-Classes AdjectiveChunk, AdverbChunk, ChunkFragment, Conjunct, NegativeChunk, NounChunk, OtherEntity, VerbChunk
ChunkFragment
Abstract e.g., Hindi: rAma (jo merA baDZA bhAI hE) ne kahA... (Bharati et al. 2006, §13.2)
SubClass Of Chunk
Classifier
Abstract CL Classifiers The tag CL has been included to mark classifiers. Many Indian languages have a rich classifier system. "A classifier, in linguistics, is a word or morpheme used in some languages to classify a noun according to its meaning" (http://en.wikipedia.org/wiki/Classifier_%28linguistics%29). For example, Telugu: (t2) padi mandi pillalu 'ten' 'persons' 'children' Tamil: (tm1) pattu pEr mANavarakaLa 'ten' 'person' 'students' The words 'mandi' (Telugu) and 'pEr' (Tamil) are classifiers which occur with numerals with human nouns. Such expressions when occurring separately (not suffixed with the noun) will be marked as CL. Therefore: Telugu: (t2) padi mandi_CL pillalu 'ten' 'persons' 'children' Tamil: (tm1) pattu pEr_CL mANavarakaLa 'ten' 'person' 'students'
SubClass Of QuantifierOrClassifier
CommonNoun
SubClass Of POSTag
Sub-Classes Noun, SpatiotemporalNoun
Compound
Abstract *C Compounds (Make it XC, where X is a variable of the type of the compound of which the current word is a member) The issue of including a tag for marking compounds was discussed extensively. Results of algorithms using the IIIT-H tag set, which included NNC (part of compound nouns) and NNPC (part of proper nouns), showed that these two tags contributed substantially to the low accuracy of the tagger. Since most elements which occur as NNC or NNPC can also occur as NN and NNP, it affected the learning by the machine. So, the question was, why include tags which contributed more to the errors? The other aspect, however, was that while human annotators are annotating the data, they know from the context when a certain element is NNC or NN, NNPC or NNP and if marked, this information can be useful for certain applications. The argument is the same as the one in favor of including a tag for proper nouns. Another point which was discussed was that any word class can have compound forms in Indian languages (including adjectives and adverbs). Therefore, if we decide to have a tag for showing compounds of each type, the number of tags will go very high. The final decision on this was to include a *C tag which will be realised as a catC tag of the type of compound that the element is a part of. For example, if a certain word is part of a compound noun, it will be marked as NNC, if it is part of a compound adjective, it will be marked as JJC and so on and so forth. Some examples are given below: Hindi compound noun keMdra sarakAra (Central government) will be tagged as keMdra_NNC sarakAra_NN. In this example, 'keMdra' and 'sarakAra' are both nouns which are forming a compound noun. All words of a compound word except the last one will be marked as NNC. Thus any NNC will always be followed by another NNC or an NN. This strategy helps identify these words as one unit although they are not conjoined by a hyphen.
Similarly, all words of a compound proper noun except the last one will be marked as NNPC, e.g. aTala_NNPC bihArI_NNPC vAjapeyI_NNP. The first two words in the above example will be tagged as NNPC and the last one will be tagged as NNP. Similar to the NNC tag for common nouns, the NNPC tag helps in marking parts of a proper noun. h41. rAma, mohana aur shyAma ghara gaye. 'Ram', 'Mohan' 'and' 'Shyam' 'home' 'went' 'Ram, Mohan and Shyam went home'. h42. bagIce meM ranga_JJC biraMge_JJ phUla khile the 'garden' 'in' 'colourful' 'flowers' 'flowered' 'were' 'The garden had colourful flowers' Titles such as Dr., Col., Lt. etc. which may occur before a proper noun will be tagged as NNC. All such titles will always be followed by a proper noun. In order to indicate that these are parts of proper nouns but are nonetheless nouns themselves, they will be tagged as NNC, e.g. Col._NNC Ranjit_NNPC Deshmukh_NNP
SubClass Of POSTag
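The compound convention described for this class (every word of a compound except the last carries a *C tag) can be used to recover compounds as single units. The following Python sketch illustrates the idea; `split_tagged`, `group_compounds` and the `COMPOUND_TAGS` set are hypothetical names, and the set lists only the three *C tags that appear in the examples quoted here, not the full inventory.

```python
# Assumption: only the *C tags that occur in the quoted examples.
COMPOUND_TAGS = {"NNC", "NNPC", "JJC"}


def split_tagged(text):
    """Split 'word_TAG word_TAG ...' into (word, tag) pairs.

    Tokens without an underscore get the tag None.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        pairs.append((word, tag) if sep else (token, None))
    return pairs


def group_compounds(pairs):
    """Merge each run of *C-tagged words with the word that closes it.

    The merged unit carries the tag of the closing word. A run that is
    never closed is returned with the tag None.
    """
    units, current = [], []
    for word, tag in pairs:
        current.append(word)
        if tag not in COMPOUND_TAGS:
            units.append((" ".join(current), tag))
            current = []
    if current:  # compound parts with no closing word
        units.append((" ".join(current), None))
    return units
```

Applied to `keMdra_NNC sarakAra_NN`, the sketch yields one unit `keMdra sarakAra` tagged NN. The title example `Col._NNC Ranjit_NNPC Deshmukh_NNP` is likewise merged into one NNP unit, because the grouping does not require the *C tags in a run to be of the same type.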
Conjunct
Abstract e.g., Hindi: ((rAma))_NP ((Ora))_CCP ((SyAma))_NP (Bharati et al. 2006, §13.2)
SubClass Of Chunk
Conjunction
Abstract CC Conjuncts (co-ordinating and subordinating) The tag CC will be used for both co-ordinating and subordinating conjuncts. The Penn tag set has used the IN tag for prepositions and subordinating conjuncts. Their rationale behind this is that subordinating conjuncts and prepositions can be distinguished because subordinating conjuncts are followed by a clause and prepositions by a noun phrase. But in the current tagger all connectives, other than prepositions, will be marked as CC. h28. mohana bAzAra jA rahA hE Ora_CC ravi skUla jA rahA hE 'Mohan' 'market' 'go' 'PROG' 'is' 'and' 'Ravi' 'school' 'go' 'PROG' 'is' 'Mohan is going to the market and Ravi is going to the school' h29. mohana ne mujhe batAyA ki_CC Aja bAzAra banda hE 'Mohan' 'erg' 'to me' 'told' 'that' 'today' 'market' 'close' 'is' 'Mohan told me that the market is closed today.'
SubClass Of POSTag
Demonstrative
Abstract DEM Demonstratives The tag 'DEM' has been included to mark demonstratives. The necessity of including a tag for demonstratives was felt to cover the distinction between a pronoun and a demonstrative. For example, h12. vaha ladakA merA bhAI hE (demonstrative) 'that' 'boy' 'my' 'brother' 'is' h13. vaha merA bhAI hE (pronoun) 'he' 'my' 'brother' 'is' Many Indian languages have different words for demonstrative adjectives and pronouns. Better evidence for including a separate tag for demonstratives comes from the following Telugu examples, t1. A abbAyi nA tammudu 'that' 'boy' 'my' 'brother' t2. atanu nA tammudu 'he' 'my' 'brother' (Telugu does not have a copula 'be' in the present tense)
SubClass Of PronounOrDemonstrative
EchoWord
Abstract ECH Echo words Indian languages have a highly productive usage of echo words such as Hindi 'cAya-vAya' ('tea' 'echo'), where 'cAya' is a regular lexical item of Hindi vocabulary and 'vAya' is an echo word indicating the sense 'etc.'. These words, on their own, are 'nonsense' words and do not find a place in any dictionary. Thus, the gloss for 'cAya-vAya' would be 'tea etc'. It is proposed to add the tag ECH for such words.
SubClass Of POSTag
FiniteVerbChunk
Abstract e.g., Hindi: mEMne ghara para khAnA ((khAyA_VM))_VGF (Bharati et al. 2006, §13.2)
SubClass Of VerbChunk
GerundChunk
Abstract VGNN Gerunds A verb chunk having a gerund will be annotated as VGNN. For example, h18a. sharAba ((pInA_VM))_VGNN sehata ke liye hAnikAraka hE. 'liquor' 'drinking' 'health' 'for' 'harmful' 'is' 'Drinking (liquor) is bad for health' h19a. mujhe rAta meM ((khAnA_VM))_VGNN acchA lagatA hai 'to me' 'night' 'in' 'eating' 'good' 'appeals' 'I like eating at night' h20a. ((sunane_VM meM_PSP))_VGNN saba kuccha acchA lagatA hE 'listening' 'in' 'all' 'things' 'good' 'appeal' 'is'
SubClass Of VerbChunk
InfiniteVerbChunk
Abstract VGINF Infinitival Verb Chunk This tag is to mark the infinitival verb form. In Hindi, both gerunds and infinitive forms of the verb end with a -nA suffix. Since both behave functionally in a similar manner, the distinction is not very clear. However, languages such as Bangla have two different forms for the two types. Examples from Bangla are given below. b8. Borabela ((snAna karA))_VGNN SorIrera pokze BAlo 'Morning' 'bath' 'do-verbal noun' 'health-gen' 'for' 'good' 'Taking bath in the early morning is good for health' b9. bindu Borabela ((snAna karawe))_VGINF BAlobAse 'Bindu' 'morning' 'bath' 'take-inf' 'love-3pr' 'Bindu likes to take bath in the early morning' In Bangla, the gerund form takes the suffix -A / -Ano, while the infinitive marker is -we. The syntactic distribution of these two forms of verbs is different. For example, the gerund form is allowed in the context of the word darakAra 'necessary' while the infinitive form is not, as exemplified below: b10. Borabela ((snAna karA))_VGNN darakAra 'Morning' 'bath' 'do-verbal noun' 'necessary' 'It is necessary to take bath in the early morning' b11. *Borabela ((snAna karawe))_VGINF darakAra Based on the above evidence from Bangla, the tag VGINF has been included to mark a verb chunk.
SubClass Of VerbChunk
Intensifier
Abstract INTF Intensifier This tag is not present in the Penn tag set. Words like 'bahuta', 'kama', etc. when intensifying adjectives or adverbs will be annotated as INTF. Example, h37. hEdarAbAda meM aMgUra bahuta_INTF acche milate hEM 'Hyderabad' 'in' 'grapes' 'very' 'good' 'available' 'are' 'Very good grapes are available in Hyderabad'.
SubClass Of POSTag
Interjection
Abstract INJ Interjection The interjections will be marked as INJ. Apart from the interjections, the affirmatives such as Hindi 'hAz' ('yes') will also be tagged as INJ. Since this is the only example of such a word, it has been clubbed under Interjections. h38. arre_INJ, tuma A gaye ! 'oh' 'you' 'come' 'have' 'Oh! you have come' h39. hAz_INJ, mEM A gayA 'yes', 'I' 'come' 'have' 'Yes, I have come'.
SubClass Of POSTag
MainVerb
Abstract VM Verb Main Verbal constructions in languages may be composed of more than one word sequences. Typically, a verb group sequence contains a main verb and one or more auxiliaries (V AUX AUX ... ... ). In the current tagging scheme the support verbs (such as dAlanA in kara dAlAtA hE, uThanA in cOMka uThA thA etc) are also tagged as VAUX. The group can be finite or non-finite. The main verb need not be marked for finiteness. Normally, one of the auxiliaries carries the finiteness feature. The necessity of marking the finiteness or non-finiteness in a verb was discussed extensively and everybody agreed that it was crucial to mark the distinction. However, languages such as Hindi, which have auxiliaries for marking tense, aspect and modalities, pose a problem. The finiteness of a verbal expression is known only when we reach the last auxiliary of a verb group. The main verb of a finite verb group (leaving out the single word verbal expressions of the finite type, e.g. vaha dillI gayA) does not contain finiteness information. For example, h14. laDZakA seba khAtA raHA wA 'boy' 'apple' 'eating' 'PROG' 'was' The boy had kept eating. h15. seba khAtA huA laDZakA jA rahA thA 'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'was' The boy eating the apple was going. The expression khAtA raHA in (h14) above is finite and khAtA huA in (h15) is non-finite. However, the main verb 'khAtA' is non-finite in both the cases. So, the issue is whether to (1a) mark finiteness in 'khAtA rahA thA (had kept eating)' at the lexical level on the main verb (khA) or (1b) on the auxiliary containing finiteness (wA) or (2) not mark it at the lexical level at all. All the three possibilities were discussed; 1) Mark the finiteness at the lexical level. If we mark it at the lexical level, the following possibilities are available: 1a) Mark the finiteness on the main verb, even though we know that the lexical item itself is not finite. In this case, the annotator interprets the finiteness from the context.
(The POS tags VF, VNF and VNN were earlier decided based on this approach). The main verb, therefore, is marked as finite consciously with a view that the group contains a 'verb root' and its auxiliaries (as TAM etc) is finite even though the main verb does not carry the finiteness at the lexical level. Although, this approach facilitates annotation of both the main verb and the finiteness (of the group) by a single tag, it allows tagging a lexical item (main verb) with the finiteness feature which it does not actually carry. So, this is not a neat solution. 1b) The second possibility is, mark the finiteness on the last auxiliary of the sequence. Here again the decision has to be taken from the context. This possibility was not considered since this also involves marking the verb finiteness at the lexical level. 2) Don't mark the finiteness at the lexical level. Instead mark it as indicated in (2a) or (2b) below. 2a) Introduce a new layer which groups the verb group and mark the verb group as finite or non-finite. This approach proposes the following : (i) Annotate the main verb as VM (introduce a new tag). Thus, h14a. laDZakA seba khAtA_VM raHA thA 'boy' 'apple' 'eating' 'PROG' 'was' h15a. seba khAtA_VM huA laDZakA jA rahA thA 'apple' 'eating' 'PROG' 'boy' ' go' 'PROG' 'was' (ii) Annotate the auxiliaries as VAUX, h14a. laDZakA seba khAtA_VM raHA_VAUX thA_VAUX 'boy' 'apple' 'eating' 'PROG' 'was' h15a. seba khAtA_VM huA_VAUX laDZakA jA rahA thA 'apple' 'eating' ' PROG' 'boy' ' go' 'PROG' 'was' (iii) Group the verb group (before chunking) and annotate it as finite or non-finite as the case may be, h14a. laDZakA seba [khAtA_VM raHA_VAUX wA_VAUX]_VF 'boy' 'apple' 'eating' 'PROG' 'was' h15a. seba [khAtA_VM huA_VAUX]_VNF laDZakA jA rahA thA 'apple' 'eating' 'PROG' 'boy' ' go' 'PROG' 'was' This approach is more faithful to the available linguistic information. However, it requires introducing another layer. So, this was not considered useful. 
2b) Mark the finiteness at the chunk level. In this approach, the lexical items are marked as in (2). No new layer is introduced. Instead, the decision is postponed to the chunk level. Since the finiteness is in the group, it is marked at the chunk level. This offers the best solution as it facilitates marking the linguistic information as it is without having to introduce a new layer. h14a. laDZakA seba ((khAtA_VM raHA_VAUX wA_VAUX))_VGF 'boy' 'apple' 'eating' 'PROG' 'was' h15a. seba ((khAtA_VM huA_VAUX))_VGNF laDZakA jA rahA thA 'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'was' In this case also the decision is made by looking at the entire group. (2b) was most preferred as it facilitates marking the linguistic information correctly, at the same time no new layer needs to be introduced. Therefore, the current tagging scheme has adopted this approach. Thus, the main verbs in a given verb group will be marked as VM, irrespective of whether the total verb group is finite or non-finite. Given underneath are some examples of other verb group types: 1) Non-finite verb groups - Non-finite verb groups can have two functions: a) Adverbial participial, for example: khAte-khAte in the following Hindi sentence, h16. mEMne khAte-khAte ghode ko dekhA 'I erg' 'while eating' 'horse' 'acc' 'saw' 'I saw a horse while eating'. The main verb in (h16) would be annotated as follows: h16a. mEMne khAte-khAte_VM ghode ko dekhA b) Adjectival participial, for example: 'khAte Hue' in the following Hindi sentence, h17. mEMne ghAsa khAte_VM hue ghoDe ko dekhA * 'I erg' 'grass' 'eating' 'PROG' 'horse' 'acc' 'saw' I saw the horse eating grass. (* (h17) is ambiguous in Hindi. The other sense that it can have is, I saw the horse while (I was) eating grass. In such cases, the annotator would disambiguate the sentence depending on the context and mark accordingly.) 2) Gerunds Functionally, gerunds are nominals.
However, even though they function like nouns, they are capable of taking their own arguments, e.g. pInA in the following Hindi sentence can occur on its own or take an argument (given in parenthesis): h18. (sharAba) pInA_VM sehata ke liye hAnikAraka hE. 'liquor' 'drinking' 'health' 'for' 'harmful' 'is' 'Drinking (liquor) is bad for health' h19. mujhe khAnA_VM acchA lagatA hai 'to me' 'eating' 'good' 'appeals' 'I like eating' h20. sunane meM saba kuccha acchA lagatA hE 'listening' 'in' 'all' 'things' 'good' 'appeal' 'is' As mentioned above, the noun 'sharAba' in (h18) is an object of the verb 'pInA' and has no relation to the main verb (hE). In order to be able to show the exact verb-argument structure in the sentence, it is essential that the crucial information of a noun derived from a verb is preserved. Therefore, even gerunds have to be marked as verbs. It is proposed that, in keeping with the approach adopted for non-finite verbs, gerunds also be marked as VM at the lexical level. To capture their gerundial nature, such verbs will be marked as VGNN at the chunk level (see the section on Chunk tags for details). The verbs having 'vAlA' vibhakti will also be marked as VM. For example, 'khonevAlA' (one who loses).
SubClass Of POSTag
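Because the adopted approach (2b) keeps finiteness out of the POS tag and records it on the chunk, a consumer of the annotation has to read the chunk tag to learn whether a verb group is finite. The Python sketch below parses the `((word_TAG ...))_CHUNKTAG` notation used in the quoted examples; the function names and the `VERB_CHUNK_KINDS` table are illustrative, and the regular expression is an assumption about the notation (non-nested chunks, upper-case chunk tags), not a format defined by the guidelines.

```python
import re

# Assumption: chunks are written ((...))_TAG, never nested, with upper-case tags.
CHUNK = re.compile(r"\(\(\s*(.*?)\s*\)\)_([A-Z]+)")

# Verb chunk tags named in the class descriptions of this ontology.
VERB_CHUNK_KINDS = {
    "VGF": "finite",
    "VGNF": "non-finite",
    "VGNN": "gerund",
    "VGINF": "infinitive",
}


def parse_chunks(text):
    """Return a list of (chunk_tag, [(word, pos_tag), ...]) for each chunk."""
    chunks = []
    for body, chunk_tag in CHUNK.findall(text):
        tokens = []
        for token in body.split():
            word, sep, tag = token.rpartition("_")
            tokens.append((word, tag) if sep else (token, None))
        chunks.append((chunk_tag, tokens))
    return chunks


def verb_group_kinds(text):
    """Return (main verb, kind) per verb chunk.

    The kind is read from the chunk tag only; the VM tag on the main
    verb carries no finiteness information.
    """
    result = []
    for chunk_tag, tokens in parse_chunks(text):
        if chunk_tag in VERB_CHUNK_KINDS:
            main = next((w for w, t in tokens if t == "VM"), None)
            result.append((main, VERB_CHUNK_KINDS[chunk_tag]))
    return result
```

The same main verb `khAtA_VM` is reported as finite inside a VGF chunk and as non-finite inside a VGNF chunk, which is the point of postponing the decision to the chunk level.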
Negative
Abstract NEG Negative Negatives like Hindi 'nahIM' (not), 'nA' (no, not), etc. will be marked as NEG. For example, h40. vaha Aja nahIM_NEG A pAyegA 'he' 'today' 'not' 'come' 'will be able' Also, see examples (b2) and (h25) given above. Indian languages have reiteration of NEG in certain constructions. For example, b5. tumi chobitA dekhbe ? 'you' 'picture-def' 'will see' '?' 'Will you see the picture?' b6. nA_NEG, xekhabo nA_NEG 'no' 'will see (I)' 'not' 'No, I will not see (it)' The first occurrence of 'nA' in such constructions will also be marked as NEG.
SubClass Of POSTag
NegativeChunk
Abstract e.g., Hindi: ((binA))_NEGP ((kucha))_NP ((bole))_VG ((kAma))_NP ((nahIM calatA))_VG (Bharati et al. 2006, §13.2)
SubClass Of Chunk
NonFiniteVerbChunk
Abstract e.g., Hindi: mEMne ((khAte-khAte_VM))_VGNF ghode ko dekhA (Bharati et al. 2006, §13.2)
SubClass Of VerbChunk
Noun
Abstract The tag NN for nouns has been adopted from Penn tags as such. The Penn tag set makes a distinction between noun singular (NN) and noun plural (NNS). As mentioned earlier, distinct tags based on grammatical information are avoided in the IL tagging scheme. Any information that can be obtained from any other source is not incorporated in the POS tag. Plurality, for example, can be obtained from a morph analyzer. Moreover, as mentioned earlier, if a particular piece of information is considered crucial at the POS tagging level itself, it can be incorporated at a later date with the help of heuristics and linguistic rules. This approach brings the number of tags down, and helps achieve simplicity, consistency, better machine learning with a small corpus etc. Therefore, the current scheme has only one tag (NN) for common nouns without getting into any distinction based on the grammatical information contained in a given noun word. (Bharati et al. 2006, §5.1.1)
SubClass Of CommonNoun
NounChunk
Abstract NP Noun Chunk Noun Chunks will be given the tag NP and include non-recursive noun phrases and postpositional phrases. The head of a noun chunk would be a noun. Specifiers will form the left side boundary for a noun chunk and the vibhakti or head noun will mark the right hand boundary for it. Descriptive adjective/s modifying the noun will be part of the noun chunk. The particle which anchors to the head noun in a noun chunk will also be grouped within the chunk. If it occurs after the noun or vibhakti, it will make the right boundary of the chunk. Some example noun chunks are: ((bacce_NN))_NP 'children', ((kucha_QF bacce_NN))_NP 'some' 'children', ((kucha_QF acche_JJ bacce_NN))_NP 'some' 'good' 'children', ((Dibbe_NN meM_PSP))_NP 'box' 'in', ((eka_QC kAlA_JJ ghoDZA_NN))_NP 'one' 'black' 'horse', ((yaha_DEM nayI_JJ kitAba_NN))_NP 'this' 'new' 'book', ((isa_DEM nayI_JJ kitAba_NN meM_PSP))_NP 'this' 'new' 'book' 'in', ((isa_DEM nayI_JJ kitAba_NN meM_PSP bhI_RP))_NP 'this' 'new' 'book' 'in' 'also' The issue of the genitive marker and its grouping with the nouns that it relates to was discussed in detail. For example, the noun phrase 'rAma kA beTA' contains two nouns 'rAma' and 'beTA'. The two nouns are related to each other by the vibhakti 'kA'. The issue is whether to chunk the two nouns separately or together. Linguistically, 'beTA' is the head of the phrase 'rAma kA beTA'. 'rAma' is related to 'beTA' by a genitive relation which is expressed through the vibhakti 'kA'. Going by our definition of a 'chunk' we should break 'rAma kA beTA' into two chunks ( ((rAma kA))_NP, ((beTA))_NP ) by breaking 'rAma kA' at the 'kA' vibhakti. Moreover, if we chunk 'rAma kA beTA' as one chunk, linguistically, we will end up with a recursive noun phrase as a single chunk ((((rAma kA)) beTA)) which also is against our definition of a chunk. Therefore, it was decided that the genitive markers will be chunked along with the preceding noun.
Thus, the noun group 'rAma kA beTA' would be chunked into two chunks. h54. ((rAma kA))_NP ((beTA))_NP acchA hE 'Ram's son is good' h55. ((kitAba))_NP ((rAma kI))_NP hE 'The book belongs to Ram' For the noun groups such as 'usakA beTA' it was decided that they should be chunked together.
SubClass Of Chunk
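The boundary rules quoted for noun chunks (specifiers and adjectives on the left, head noun in the middle, vibhakti and anchored particle on the right, with the genitive marker staying with the preceding noun) can be approximated by a single left-to-right pass over POS-tagged tokens. The Python sketch below is a deliberate simplification under stated assumptions: the three tag sets and the function name are illustrative choices, pronouns and compound tags are left out, and cases such as 'usakA beTA' are therefore not handled.

```python
# Assumed tag groupings for this sketch (not defined by the guidelines).
PRE_HEAD_TAGS = {"DEM", "QF", "QC", "QO", "INTF", "JJ"}  # specifiers, modifiers
HEAD_TAGS = {"NN", "NNP"}                                # chunk heads
POST_HEAD_TAGS = {"PSP", "RP"}                           # vibhakti, particle


def noun_chunks(pairs):
    """Group (word, tag) pairs into noun chunks, returned as lists of words."""
    chunks, current, has_head = [], [], False
    for word, tag in pairs:
        if not has_head and tag in PRE_HEAD_TAGS:
            current.append(word)          # left-side material before the head
        elif not has_head and tag in HEAD_TAGS:
            current.append(word)          # the head noun
            has_head = True
        elif has_head and tag in POST_HEAD_TAGS:
            current.append(word)          # vibhakti or particle closes the chunk
        else:
            if has_head:
                chunks.append(current)    # emit the finished chunk
            # The current token may itself open a new chunk.
            has_head = tag in HEAD_TAGS
            current = [word] if (has_head or tag in PRE_HEAD_TAGS) else []
    if has_head:
        chunks.append(current)
    return chunks
```

With the POS tags `rAma_NNP kA_PSP beTA_NN acchA_JJ hE_VM` (these POS tags are assumed here, since the quoted example h54 shows only the chunk brackets), the sketch returns the two chunks `rAma kA` and `beTA`, matching the decision to split at the genitive marker.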
OrdinalNumber
Abstract QO Ordinals Expressions denoting ordinals will be marked as QO. h36. mEMne kitAba tIsare_QO laDake ko dI thI 'I' 'book' 'third' 'boy' 'to' 'give' 'was' 'I gave the book to the third boy'
SubClass Of QuantifierOrClassifier
OtherEntity
Abstract BLK Miscellaneous entities Entities such as interjections and discourse markers that cannot fall into any of the above mentioned chunks will be kept within a separate chunk, e.g. ((oh_INJ))_BLK, ((arre_INJ))_BLK
SubClass Of Chunk
Particle
Abstract RP Particle Expressions such as bhI, to, jI, sA, hI, nA, etc. in Hindi would be marked as RP. The nA in the above list is different from the negative nA. Hindi and some other Indian languages have an ambiguous 'nA' which is used both for negation (NEG) and for reaffirmation (RP). Similarly, the particle wo is different from CC wo. For example in Bangla and Hindi: Bangla: (b1) tumi nA_RP khub dushtu 'you' 'particle' 'very' 'naughty' 'You are very naughty' (comment) Hindi: (h24) tuma nA_RP, bahuta dushta ho 'you' 'particle' 'very' 'naughty' 'You are very naughty' (comment) Bangla: (b2) cheleta dushtu nA_NEG 'the boy' 'naughty' 'not' 'The boy is not naughty' Hindi: (h25) mEM nA_NEG jA sakUMgA 'I' 'not' 'go' 'will able' 'I will not be able to go' Bangla: (b3) binu yYoxi khAya to_CC Ami khAba 'Binu' 'if' 'eats' 'then' 'I' 'will eat' 'If Binu eats then I will eat (too)' Hindi: (h26) yadi binu khAyegA wo_CC mEM khAUMgI 'if' 'Binu' 'eats' 'then' 'I' 'will eat' 'Only if Binu eats, I will eat (too)' Bangla: (b4) Ami to_RP jAni nA 'I' 'particle' 'know' 'not' 'I don't know' Hindi: (h27) mujhako to_RP nahIM patA 'I' 'particle' 'not' 'know' 'I don't know' (Bharati et al. 2006, §5.9) Hindi (and some other Indian languages) has particles such as 'jI' or 'sAHaba' etc. after proper nouns or personal pronouns. These particles are added to denote respect to the referred person. Such honorific words will be treated like particles and will be tagged RP like other particles. h53. mantrI_NN jI_RP sabhA meM dera se pahuMce . 'minister' 'hon' 'meeting' 'in' 'late' 'part' 'reached' 'The minister reached late for the meeting'. (Bharati et al. 2006, §6.2)
SubClass Of POSTag
POSTag
Abstract "The Penn tags are most commonly used tags for English. Many tag sets designed subsequently have been a variant of this tag set (eg. Lancaster tag set). So, while deciding the tags for this tagger, the Penn tags have been used as a benchmark. Since the Penn tag set is an established tag set for English, we have used the same tags as the Penn tags for common lexical types. However, new tags have been introduced wherever Penn tags have been found inadequate for Indian language descriptions. For example, for verbs none of the Penn tags have been used. Instead, AnnCorra has only two tags for annotating verbs, VM (main verb) and VAUX (auxiliary verb)." (Bharati et al. 2006)
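Sub-Classes Adjective, Adverb, AuxiliaryVerb, CommonNoun, Compound, Conjunction, EchoWord, Intensifier, Interjection, MainVerb, Negative, Particle, Postposition, PronounOrDemonstrative, ProperNoun, QuantifierOrClassifier, QuestionWord, Quotative, Reduplication, Symbol, UnknownWord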
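The POS tag strings named in the class descriptions of this page can be collected into a lookup table that links each tag to its class. The Python sketch below covers only the tags quoted on this page; the table name and the `class_uri` helper are illustrative, and building the class IRI by appending the class name to the namespace listed under Latest Version is an assumption, not something this page states.

```python
# Namespace as given under "Latest Version".
NAMESPACE = "http://purl.org/olia/ancorra.owl#"

# POS tag -> class name, as paired in the class descriptions on this page.
POS_TAG_TO_CLASS = {
    "JJ": "Adjective",
    "RB": "Adverb",
    "VAUX": "AuxiliaryVerb",
    "QC": "CardinalNumber",
    "CL": "Classifier",
    "CC": "Conjunction",
    "DEM": "Demonstrative",
    "ECH": "EchoWord",
    "INTF": "Intensifier",
    "INJ": "Interjection",
    "VM": "MainVerb",
    "NEG": "Negative",
    "NN": "Noun",
    "QO": "OrdinalNumber",
    "RP": "Particle",
    "PSP": "Postposition",
    "PRP": "Pronoun",
    "NNP": "ProperNoun",
    "QF": "Quantifier",
    "WQ": "QuestionWord",
}


def class_uri(tag):
    """Return the assumed class IRI for a POS tag, or None if the tag is unknown."""
    name = POS_TAG_TO_CLASS.get(tag)
    return NAMESPACE + name if name else None
```

Unknown tags return `None` rather than raising, so a caller can decide how to treat tags that this partial table does not list.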
Postposition
Abstract PSP Postposition All Indian languages have the phenomenon of postpositions. Postpositions express certain grammatical functions such as case etc. The postposition will be marked as PSP in the current tagging scheme. For example, h22. mohana kheta meM khAda dAla rahA thA 'Mohan' 'field' 'in' 'fertilizer' 'put ' 'PROG' 'was' meM in the above example is a postposition and will be tagged as PSP. A postposition will be annotated as PSP ONLY if it is written separately. In case it is conjoined with the preceding word it will not be marked separately. For example, in Hindi pronouns the postpositions are conjoined with the pronoun, h23. mEne usako bAzAra meM dekhA 'I' 'him' 'market' 'in' 'saw' (h23) above has three instances of 'postposition' (in bold) usage. The postpositions 'ne' and 'ko' are conjoined with the pronouns mEM and usa respectively. The third postposition 'meM' is written separately. In the first two instances, the postposition will not be annotated. Such words will be annotated with the category of the head word. Therefore, the three instances mentioned above will be annotated as shown in (h23a) below : h23a. mEne_PRP usako_PRP bAzAra_NN meM_PSP dekhA
SubClass Of POSTag
Pronoun
Abstract PRP Pronoun Penn tags make a distinction between personal pronouns and possessive pronouns. This distinction is avoided here. All pronouns are marked as PRP. In Indian languages all pronouns inflect for all cases (accusative, dative, possessive etc.). In case we have a separate tag for possessive pronouns, new tags will have to be designed for all the other cases as well. This will increase the number of tags which is unnecessary. So only one tag is used for all the pronouns. The necessity for keeping a separate tag for pronouns was also discussed, as linguistically, a pronoun is a variable and functionally it is a noun. However, it was decided that the tag for pronouns will be helpful for anaphora resolution tasks and should be retained.
SubClass Of PronounOrDemonstrative
PronounOrDemonstrative
SubClass Of POSTag
Sub-Classes Demonstrative, Pronoun
ProperNoun
Abstract NNP Proper Nouns The need for a separate tag for proper nouns and its usability was discussed. The following points were raised against the inclusion of a separate tag for proper nouns: a) Indian languages, unlike English, do not have any specific marker for proper nouns in orthographic conventions. English proper nouns begin with a capital letter which distinguishes them from common nouns. b) All the words which occur as proper nouns in Indian languages can also occur as common nouns denoting a lexical meaning. For example, English: John, Harry, Mary occur only as proper nouns whereas Hindi: aTala bihArI, saritA, aravinda etc. are used as 'names' and they also belong to grammatical categories of words with various senses. For example, given below is a list of Hindi words with their grammatical class and sense: aTala (adj, 'immovable'), bihArI (adj, 'from Bihar'), saritA (noun, 'river'), aravinda (noun, 'lotus'). Any of the above words can occur in texts as common lexical items or as proper names. (h9) - (h11) below show their occurrences as proper nouns, h9. atala bihAri bAjapaI bhArata ke pradhAna mantrI the. 'Atal' 'Bihari' 'Vajpayee' 'India' 'of' 'prime' 'minister' 'was' 'Atal Behari Vajpayee was the Prime Minister of India'. h10. merI mitra saritA tAIvAna jA rahI hE. 'my' 'friend' 'Sarita' 'Taiwan' 'go' 'PROG' 'is' 'My friend Sarita is going to Taiwan' h11. aravinda ne mohana ko kitAba dI. 'Aravind' 'erg' 'Mohan' 'to' 'book' 'gave' 'Aravind gave the book to Mohan'. Therefore, in the Indian languages' context, annotating proper nouns with a separate tag will not be very fruitful from a machine learning point of view. In fact, the identification of proper nouns can be better achieved by named entity filters. Another point that was considered in this context was the effort involved in manual tagging of proper nouns in a given text. It is felt that not much extra effort is required in manual tagging of proper nouns.
However, the data annotated with proper nouns can be useful for certain applications. Therefore, there is no harm in marking the information if it does not require much effort. Finally, it was decided to have a separate tag for proper nouns for manual annotation and ignore it for machine learning algorithms. Following this decision, the tag NNP is included in the tag set. This tag is the same as the Penn tag for proper nouns. However, in this case also AnnCorra has only one tag for both singular and plural proper nouns unlike Penn tags where a distinction is made between proper noun singular and proper noun plural by having two tags NNP and NNPS respectively.
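The collapsing of number distinctions described above can be illustrated with a small lookup. This is only a sketch: the table name PENN_TO_ANNCORRA_NOUN and the function collapse_number are hypothetical and not part of the AnnCorra guidelines or of this ontology. The NNP/NNPS row follows the text above; the NN/NNS row follows the abstract of the NN individual later in this document.

```python
# Hypothetical lookup: Penn noun tags that carry number are folded into the
# single AnnCorra tag that ignores number.
PENN_TO_ANNCORRA_NOUN = {
    "NN": "NN",     # common noun, singular
    "NNS": "NN",    # common noun, plural -> same AnnCorra tag
    "NNP": "NNP",   # proper noun, singular
    "NNPS": "NNP",  # proper noun, plural -> same AnnCorra tag
}


def collapse_number(penn_tag):
    """Return the number-neutral noun tag; other tags pass through unchanged."""
    return PENN_TO_ANNCORRA_NOUN.get(penn_tag, penn_tag)
```

Tags outside the table are returned as they are, since the text above only speaks about the noun tags.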
SubClass Of
Quantifier G Quantifier Quantifier
Abstract QF Quantifiers All quantifiers like Hindi kama (less), jyAdA (more), bahuwa (lots), etc. will be marked as QF. h34. vahAz bahuta_QF loga Aye the 'there' 'many' 'people' 'came' 'was' ‘Many people came there’. In case these words are used in constructions like 'baHutoM ne jAne se inkAra kiyA' ('many' 'by' 'to go' 'refused'; Many refused to go) where it is functioning like a noun, it will be marked as NN (noun). Quantifiers of number will be marked as below.
SubClass Of
QuantifierOrClassifier G QuantifierOrClassifier Quantifier Or Classifier
SubClass Of
Sub-Classes
QuestionWord G QuestionWord Question Word
Abstract WQ Question Words The Penn tag set makes a distinction between various uses of 'wh-' words and marks them accordingly (WDT, WRB, WP, WQ etc). The 'wh-' words in English can act as questions, as relative pronouns and as determiners. However, for Indian languages we need not keep this distinction. Therefore, we tag the question words as WQ. h30. kOna AyA hE ? 'who' 'come' 'has' ‘Who has come?’ h31. tuma kala kyA kara rahe ho ? 'you' 'tomorrow' 'what' 'doing' 'are' ‘What are you doing tomorrow?’ h32. tuma kala kahAz jA rahe ho ? 'you' 'tomorrow' 'where' 'going' 'are' ‘Where are you going tomorrow?’ h33. kyA tuma kala Aoge ? '?' 'you' 'tomorrow' 'will come' ‘Will you come tomorrow?’
SubClass Of
Quotative G Quotative Quotative
Abstract e.g., ani (Telugu), endru (Tamil), bole/mAne (Bangla), mhaNaje (Marathi), mAne (Hindi) (Bharati et al. 2006, §13.1)
SubClass Of
Reduplication G Reduplication Reduplication
Abstract RDP Reduplication In this phenomenon of Indian languages, the same word is written twice for various purposes such as indicating emphasis, deriving a category from another category etc. e.g. choTe choTe ('small' 'small'; very small), lAla lAla ('red' 'red'; red), jaldI jaldI ('quickly' 'quickly'; very quickly) There are two ways in which such word sequences may be written. They can be written (a) separated by a space or (b) separated by a hyphen. The question to be resolved is that in case they are written as two words (separated by space), how should they be tagged? Earlier decision was to use the same tag for both the words. However, in this approach, the morphological character of reduplication is missed out. That is, the reduplicated item will then be treated exactly like two independent words of the same category. For example, h43. vaha mahaMgI_JJ mahaMgI_JJ cIjZeM kharIda lAyA 'he' 'expensive' 'expensive' 'things' 'buy' 'bring' ‘He bought all expensive things’. h44. una catura_JJ buddhimAna_JJ baccoM ne samasyA sulajhA lI 'those' 'smart' 'intelligent' 'children' 'erg' 'problem' 'solved' ‘Those smart and intelligent children solved the problem’. Both (h43) and (h44) have a sequence of adjectives - mahaMgI_JJ mahaMgI_JJ and catura_JJ buddhimAna_JJ respectively. In the first case, the sequence of two adjectives is a case of reduplication (same adjective is repeated twice to indicate the intensity of 'expensive') whereas in the second case the two adjectives refer to two different properties attributed to the following noun. Since reduplication is a highly productive process in Indian languages, it is proposed to include a new tag RDP for annotating reduplicatives. The first word in a reduplicative construction will be tagged by its respective lexical category and the second word will be tagged as RDP to indicate that it is a case of reduplication distinguishing it from a normal sequence such as in (h44) above.
Some more examples are given underneath to make it more explicit, h45. vaha dhIre_RB dhIre_RDP cala rahA thA. 'he' 'slowly' 'slowly' 'walk' 'PROG' 'was' ‘He was walking (very) slowly’. h46. usake bAla choTe_JJ choTe_RDP the. 'his' 'hair' 'short' 'short' 'were' ‘He had (very) short hair’. h47. yaha bAta galI_NN galI_RDP meM phEla gayI. 'this' 'talk' 'lane' 'lane' 'in' 'spread' 'went' ‘The word was spread in every lane’.
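The RDP convention described above (first word keeps its lexical tag, second word becomes RDP) can be sketched as a post-processing step over an already tagged sentence. This is an illustration only, not a procedure from the guidelines: the function name mark_reduplication is hypothetical, and the sketch assumes that reduplication is written as two space-separated tokens with identical spelling. Hyphenated forms and partial repetitions are not handled.

```python
def mark_reduplication(tagged):
    """Retag the second token of an identical adjacent pair as RDP.

    `tagged` is a list of (word, tag) pairs. Assumption: reduplication
    shows up as two adjacent tokens with exactly the same spelling.
    """
    result = []
    for i, (word, tag) in enumerate(tagged):
        # Compare with the immediately preceding word form, if there is one.
        if i > 0 and word == tagged[i - 1][0]:
            result.append((word, "RDP"))
        else:
            result.append((word, tag))
    return result


# (h43): the repeated adjective is a reduplication.
h43 = [("vaha", "PRP"), ("mahaMgI", "JJ"), ("mahaMgI", "JJ"), ("cIjZeM", "NN")]
# (h44): two different adjectives, so nothing changes.
h44 = [("una", "DEM"), ("catura", "JJ"), ("buddhimAna", "JJ"), ("baccoM", "NN")]

h43_marked = mark_reduplication(h43)
h44_marked = mark_reduplication(h44)
```

With these inputs the second mahaMgI in (h43) is retagged as RDP, while the sequence catura buddhimAna in (h44) keeps JJ on both words. The tags PRP, DEM and NN on the other tokens are supplied here for illustration and do not appear in the quoted examples.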
SubClass Of
SpatiotemporalNoun G SpatiotemporalNoun Spatiotemporal Noun
Abstract NST Noun denoting spatial and temporal expressions "A tag NST has been included to cover an important phenomenon of Indian languages. Certain expressions such as 'Upara' (above/up), 'nIce' (below), 'pahale' (before), 'Age' (front) etc. are content words denoting time and space. These expressions, however, are used in various ways. For example, 5.1.2.1 These words often occur as temporal or spatial arguments of a verb in a given sentence taking the appropriate vibhakti (case marker): h3. vaha Upara so rahA thA . 'he' 'upstairs' 'sleep' 'PROG' 'was' ‘He was sleeping upstairs’. h4. vaha pahale se kamare meM bEThA thA . 'he' 'beforehand' 'from' 'room' 'in' 'sitting' 'was' ‘He was sitting in the room from beforehand’. h5. tuma bAhara bETho 'you' 'outside' 'sit' ‘You sit outside’. Apart from functioning like an argument of a verb, these elements also modify another noun taking postposition 'kA'. h6. usakA baDZA bhAI Upara ke hisse meM rahatA hE 'his' 'elder' 'brother' 'upstairs' 'of' 'portion' 'in' 'live' 'PRES' ‘His elder brother lives in the upper portion of the house’. 5.1.2.2 Apart from occurring as a nominal expression, they also occur as a part of a postposition along with 'ke'. For example, h7. ghaDZe ke Upara thAlI rakhI hE. 'pot' 'of' 'above' 'plate' 'kept' 'is' ‘The plate is kept on the pot’. h8. tuma ghara ke bAhara bETho 'you' 'home' 'of' 'outside' 'sit' ‘You sit outside the house’. 'Upara' and 'bAhara' are parts of complex postpositions 'ke Upara' and 'ke bAhara' in (h7) and (h8) respectively which can be translated into English prepositions 'on' and 'outside'. For tagging such words, one possible option is to tag them according to their syntactic function in the given context. For example in 5.2.2 (h7) above, the word 'Upara' is occurring as part of a postposition or a relation marker. It can, therefore, be marked as a postposition. Similarly, in 5.2.1. (h3) and (h6) above, it is a noun, therefore, mark it as a noun and so on.
Alternatively, since these words are more like nouns, as is evident from 5.2.1 above, they can be tagged as nouns in all their occurrences. The same would apply to 'bAhara' (outside) in examples (h4), (h5) and (h8). However, if we follow any of the above approaches we miss out on the fact that this class of words is slightly different from other nouns. These are nouns which indicate 'location' or 'time'. At the same time, they also function as postpositions in certain contexts. Moreover, such words, if tagged according to their syntactic function, will hamper machine learning. Considering their special status, it was considered whether to introduce a new tag, NST, for such expressions. The following five possibilities were discussed: a) Tag both (h5) & (h8) as NN b) Tag both (h5) & (h8) as NST c) Tag (h5) as NN & (h8) as NST d) Tag (h5) as NST & (h8) as PSP e) Tag (h5) as NN & (h8) as PSP After considering all the above, the decision was taken in favour of (b). The decision was primarily based on the following observations: (i) 'bAhara' in both (h5) and (h8) denotes the same expression (place expression 'outside') (ii) In both (h5) and (h8), 'bAhara' can take a vibhakti like a noun (bAhara ko bETho, ghara ke bAhara ko bETho) (iii) If a single tag is kept for both the usages, the decision making for annotators would also be easier. Therefore, a new tag NST is introduced for such expressions. The tag NST will be used for a finite set of such words in any language. For example, Hindi has Age (front), pIche (behind), Upara (above/upstairs), nIce (below/down), bAda (after), pahale (before), andara (inside), bAhara (outside) etc."
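Because the decision above assigns NST to a finite, closed set of words regardless of syntactic function, the rule can be sketched as a plain lexicon lookup. The sketch below is an assumption-laden illustration, not part of the guidelines: the names HINDI_NST_WORDS and tag_nst are hypothetical, and the word list contains only the Hindi examples named in the quoted text (which ends in "etc.", so it is not exhaustive).

```python
# Hindi examples listed in the quoted guideline text; the real set is larger.
HINDI_NST_WORDS = {
    "Age", "pIche", "Upara", "nIce", "bAda", "pahale", "andara", "bAhara",
}


def tag_nst(tokens, nst_words=HINDI_NST_WORDS):
    """Pair each token with NST if it is in the closed list, else with None.

    The lookup ignores context, which mirrors option (b): the same tag is
    used whether the word acts as a noun or as part of a postposition.
    """
    return [(tok, "NST" if tok in nst_words else None) for tok in tokens]


h5 = tag_nst("tuma bAhara bETho".split())
h8 = tag_nst("tuma ghara ke bAhara bETho".split())
```

In both (h5) and (h8) the token bAhara receives NST; all other tokens are left untagged (None) for a later tagging step.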
SubClass Of
Symbol G Symbol Symbol
Abstract SYM Special Symbol All those words which cannot be classified in any of the other tags will be tagged as SYM. This tag is similar to the Penn 'SYM'. Also special symbols like $, %, etc. are treated as SYM. Since the frequency of occurrence of such symbols is very low in Indian languages, no separate tag is used for such symbols.
SubClass Of
UnknownWord G UnknownWord Unknown Word
Abstract UNK Unknown A special tag to indicate unknown words is also included in the tag set. The annotators can use this tag to mark the words whose category they are not aware of. This tag has to be used very cautiously and sparsely, i.e., only if it is absolutely necessary. (Bharati et al. 2006, §5.20) Presence of loan words is a fairly common phenomenon in languages. Most Indian languages have a number of loan words from English. One may also come across words from other Indian languages or Sanskrit in a given text. Such foreign words will be tagged as per the syntactic function of the word in the given context. In special cases, such as when the annotator is not sure of the category of a word, it will be tagged as UNK. (Bharati et al. 2006, §6.3)
SubClass Of
VerbChunk G VerbChunk Verb Chunk
SubClass Of
Sub-Classes

Individuals

CC G CC CC Conjunction
Class
CL G CL CL Classifier
Class
DEM G DEM DEM Demonstrative
Class
ECH G ECH ECH EchoWord
Class
INJ G INJ INJ Interjection
Class
INTF G INTF INTF Intensifier
Class
JJ G JJ JJ Adjective
Class
JJC G JJC JJC Adjective Compound
Class
NEG G NEG NEG Negative
Class
NN G NN NN Noun
Abstract The tag NN for nouns has been adopted from Penn tags as such. The Penn tag set makes a distinction between noun singular (NN) and noun plural (NNS). As mentioned earlier, distinct tags based on grammatical information are avoided in IL tagging scheme. Any information that can be obtained from any other source is not incorporated in the POS tag. Plurality, for example, can be obtained from a morph analyzer. Moreover, as mentioned earlier, if a particular piece of information is considered crucial at the POS tagging level itself, it can be incorporated at a later date with the help of heuristics and linguistic rules. This approach brings the number of tags down, and helps achieve simplicity, consistency, better machine learning with a small corpus etc. Therefore, the current scheme has only one tag (NN) for common nouns without getting into any distinction based on the grammatical information contained in a given noun word.
Class
NNC G NNC NNC Compound Noun
Class
NNP G NNP NNP ProperNoun
Class
NNPC G NNPC NNPC Compound ProperNoun
Class
NST G NST NST SpatiotemporalNoun
Class
PRP G PRP PRP Pronoun
Class
PSP G PSP PSP Postposition
Class
QC G QC QC CardinalNumber
Class
QF G QF QF Quantifier
Class
QO G QO QO OrdinalNumber
Class
RB G RB RB Adverb
Class
RDP G RDP RDP Reduplication
Class
RP G RP RP Particle
Class
SYM G SYM SYM Symbol
Class
UNK G UNK UNK UnknownWord
Class
UT G UT UT Quotative
Class
VAUX G VAUX VAUX AuxiliaryVerb
Class
VM G VM VM MainVerb
Class
WQ G WQ WQ QuestionWord
Class
_C G _C C Compound
Class
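The individuals listed above pair each tag with one or more classes of this ontology. As a convenience, the listing can be transcribed into a lookup table. This is a hand-made sketch: the names ANCORRA_NS, TAG_CLASSES and class_iris are hypothetical, and the class IRIs are built on the assumption that each class's local name is the class name as shown in this document appended to the namespace given under "Latest Version".

```python
# Namespace taken from the "Latest Version" entry of this document.
ANCORRA_NS = "http://purl.org/olia/ancorra.owl#"

# Tag -> class names, transcribed from the Individuals listing above.
TAG_CLASSES = {
    "CC": ("Conjunction",), "CL": ("Classifier",), "DEM": ("Demonstrative",),
    "ECH": ("EchoWord",), "INJ": ("Interjection",), "INTF": ("Intensifier",),
    "JJ": ("Adjective",), "JJC": ("Adjective", "Compound"),
    "NEG": ("Negative",), "NN": ("Noun",), "NNC": ("Compound", "Noun"),
    "NNP": ("ProperNoun",), "NNPC": ("Compound", "ProperNoun"),
    "NST": ("SpatiotemporalNoun",), "PRP": ("Pronoun",),
    "PSP": ("Postposition",), "QC": ("CardinalNumber",),
    "QF": ("Quantifier",), "QO": ("OrdinalNumber",), "RB": ("Adverb",),
    "RDP": ("Reduplication",), "RP": ("Particle",), "SYM": ("Symbol",),
    "UNK": ("UnknownWord",), "UT": ("Quotative",),
    "VAUX": ("AuxiliaryVerb",), "VM": ("MainVerb",),
    "WQ": ("QuestionWord",), "_C": ("Compound",),
}


def class_iris(tag):
    """Return the assumed class IRIs for a tag; raises KeyError if unknown."""
    return [ANCORRA_NS + name for name in TAG_CLASSES[tag]]
```

For instance, class_iris("NNC") yields the IRIs for Compound and Noun, reflecting that compound tags are typed with two classes in the listing. For authoritative IRIs, the OWL file itself should be consulted rather than this table.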