Ontology - iiit

Abstract
OLiA Annotation Model for A Part of Speech Tagger for Indian Languages (IIIT 2007). Languages mentioned in the document include Hindi, Marathi, and Telugu. To a certain extent, IIIT (2007) seems to be a revision of http://ltrc.iiit.ac.in/tr031/posguidelines.pdf, which was developed at the same institute. Unless marked otherwise, all comments are quotes from IIIT (2007). Reference: IIIT (2007), A Part of Speech Tagger for Indian Languages (POS tagger). Tagset developed at IIIT Hyderabad after consultations with several institutions through two workshops. Available at http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
Latest Version
http://purl.org/olia/iiit.owl#

Imports

Classes - Overview

POSTag
  LanguageSpecificPOSTag
    CompoundNoun
    CompoundProperNoun
    Intensifier
    LightVerb
    LocationNoun
    Negative
  ModifiedPTBPOSTag
    MainVerb
    NonFiniteVerb
      AdjectivalNonFiniteVerb
      AdverbialNonFiniteVerb
      NominalNonFiniteVerb
    Number
    Postposition
    Quantifier
    QuestionWord
  PTBInspiredPOSTag
    Adjective
    Adverb
    AuxiliaryVerb
    Conjunction
    Interjection
    Noun
    Particle
    Pronoun
    ProperNoun
    Symbol
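The subclass relations in the overview can be written down as a small lookup structure. The following is a minimal sketch in plain Python, not part of the ontology itself; the names `SUBCLASSES`, `PARENT`, and `ancestors` are illustrative assumptions.

```python
# Parent class -> direct subclasses, transcribed from the class overview.
SUBCLASSES = {
    "POSTag": ["LanguageSpecificPOSTag", "ModifiedPTBPOSTag", "PTBInspiredPOSTag"],
    "LanguageSpecificPOSTag": [
        "CompoundNoun", "CompoundProperNoun", "Intensifier",
        "LightVerb", "LocationNoun", "Negative",
    ],
    "ModifiedPTBPOSTag": [
        "MainVerb", "NonFiniteVerb", "Number",
        "Postposition", "Quantifier", "QuestionWord",
    ],
    "NonFiniteVerb": [
        "AdjectivalNonFiniteVerb", "AdverbialNonFiniteVerb", "NominalNonFiniteVerb",
    ],
    "PTBInspiredPOSTag": [
        "Adjective", "Adverb", "AuxiliaryVerb", "Conjunction", "Interjection",
        "Noun", "Particle", "Pronoun", "ProperNoun", "Symbol",
    ],
}

# Inverted view: subclass -> direct parent.
PARENT = {
    child: parent
    for parent, children in SUBCLASSES.items()
    for child in children
}


def ancestors(cls):
    """Return the superclasses of cls, from the direct parent up to the root."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain
```

For example, `ancestors("NominalNonFiniteVerb")` walks up through `NonFiniteVerb` and `ModifiedPTBPOSTag` to `POSTag`.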

Classes

AdjectivalNonFiniteVerb
Abstract VJJ: Verb Non-Finite Adjectival. Unlike in the Penn tagset, all non-finite verbs that are used as adjectives are marked as VJJ. The Penn tagger does not distinguish between gerunds, adjectival participles, and simple 'ing'-type verb forms. For Hindi, constructions like 'khAte Hue' ("eating") are tagged as khAte/VJJ Hue/VAUX. As explained earlier in the paper, this distinction is made in order to preserve the information that the word is a form of a verb. Every verb can take its own arguments in a sentence, even if it is not the main verb, so this information must be preserved in order to show the exact verb-argument structure of the sentence. The tagger therefore marks all non-finite adjectival participles as VJJ, i.e. an adjective formed from a verb. E.g. khAte/VJJ Hue/VAUX ("eating"). Negative form (Telugu): VJJN.
SubClass Of NonFiniteVerb
Adjective
Abstract JJ: Adjective. This tag is again the same as in the Penn tagset. The Penn tagset also distinguishes comparative and superlative adjectives; this has not been considered here. So this tag includes the 'tara' (comparative) and 'tama' (superlative) forms of adjectives in Hindi, e.g. adhikatara ("more times"), sarvottama ("best").
SubClass Of PTBInspiredPOSTag
Adverb
Abstract RB: Adverb. This tag is the same as the RB tag of the Penn tagset. The Penn tagset also distinguishes comparative and superlative adverbs, which is not adopted in this tagger. This is in accordance with our philosophy of coarseness in linguistic analysis. E.g. dhIre/RB dhIre/RB ("slowly slowly"), tejI/RB se/RP ("fast").
SubClass Of PTBInspiredPOSTag
AdverbialNonFiniteVerb
Abstract VRB: Verb Non-Finite Adverbial. Again unlike in the Penn tagset, non-finite forms of verbs that are used as adverbs are given a separate tag, VRB. In Hindi, constructions like 'khAte khAte' ("while eating"), 'khAkara' ("after eating"), etc. are tagged as VRB. The reason for distinguishing non-finite verbs used as adverbs from other verbs is the same as explained for VJJ. E.g. khAkara ("after eating"), pIte/VRB Hue/VAUX ("drinking"). Negative form (Telugu): VRBN.
SubClass Of NonFiniteVerb
AuxiliaryVerb
Abstract VAUX: Verb Auxiliary. All auxiliary verbs are marked as VAUX. This tag has been adopted as such from the Penn tagset. E.g. khA/VFM cukA/VAUX HE/VAUX ("eat en has").
SubClass Of PTBInspiredPOSTag
CompoundNoun
Abstract NNC: Compound Nouns. There is no separate tag for compound nouns in the Penn tagset, but this tagger uses the tag NNC for them. The tag has been introduced in order to identify unhyphenated compound words as one unit. E.g. 'keMdra sarakAra' ("center" "government") is tagged as keMdra/NNC sarakAra/NN. In this example, 'keMdra' and 'sarakAra' are both nouns that together form a compound noun. All words of a compound except the last one are marked as NNC, so any NNC is always followed by another NNC or an NN. This strategy helps identify these words as one unit although they are not joined by a hyphen. NNC Compound Common Nouns: kendra/NNC sarakAra/NN ("center government"), rAma/NNC moHana/NN ("Ram, Mohan"), laDaZke/NNC laDaZkiyAz/NN ("boys girls"), laDaZke/NNC laDaZkiyoM/NN ne khAnA khAyA ("boys girls food ate").
SubClass Of LanguageSpecificPOSTag
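The rule that any NNC is followed by another NNC or an NN can be checked mechanically. The sketch below assumes input in the `word/TAG` notation used in the examples; `parse_tagged` and `nnc_violations` are hypothetical helper names, and the check does not cover titles before proper nouns (NNC followed by NNPC or NNP), which are described under CompoundProperNoun.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last slash so that words containing '/' stay intact.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def nnc_violations(pairs):
    """Return the positions of NNC tokens that are not followed by NNC or NN."""
    bad = []
    for i, (_, tag) in enumerate(pairs):
        if tag != "NNC":
            continue
        next_tag = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if next_tag not in ("NNC", "NN"):
            bad.append(i)
    return bad
```

An empty result for `nnc_violations(parse_tagged("keMdra/NNC sarakAra/NN"))` means the sequence respects the rule; an NNC at the end of a sequence, or one followed by a verb tag, is reported by its index.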
CompoundProperNoun
Abstract NNPC: Compound Proper Nouns. This tag is also an addition. All words in a compound proper noun except the last one are marked as NNPC, e.g. aTala/NNPC biHArI/NNPC vAjapeyI/NNP. Here the first two words are NNPC and the last one is NNP. Just like the NNC tag, this tag helps identify a compound proper noun as one unit and avoids confusing it with a list of proper nouns, e.g. rAma, moHana aur shAma ghara gaye. ("Ram", "Mohan" "and" "Shyam" "home" "went") Any title like Dr., Col., Lt., etc. that occurs before a proper noun is tagged as NNC. All such titles are nouns that are always followed by a proper noun. To indicate that they are part of the proper noun but are themselves nouns, they are tagged as NNC, e.g. Col./NNC Ranjit/NNPC Deshmukh/NNP. NNC Compound Common Nouns: kendra/NNC sarakAra/NN ("center government"), rAma/NNC moHana/NN ("Ram, Mohan"), laDaZke/NNC laDaZkiyAz/NN ("boys girls"), laDaZke/NNC laDaZkiyoM/NN ne khAnA khAyA ("boys girls food ate").
SubClass Of LanguageSpecificPOSTag
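The distinction between one compound proper noun and a list of proper nouns can be illustrated by grouping tokens on their tags. This is a minimal sketch, assuming the input is already a list of `(word, tag)` pairs; `group_proper_names` is a hypothetical name, and titles tagged NNC are ignored for simplicity.

```python
def group_proper_names(pairs):
    """Merge each run of NNPC tokens with the NNP that closes it into one unit."""
    units = []
    buffer = []
    for word, tag in pairs:
        if tag == "NNPC":
            buffer.append(word)
        elif tag == "NNP":
            units.append(" ".join(buffer + [word]))
            buffer = []
        else:
            # Any other tag interrupts the run; an unclosed NNPC run is dropped.
            buffer = []
    return units


compound = [("aTala", "NNPC"), ("biHArI", "NNPC"), ("vAjapeyI", "NNP")]
name_list = [("rAma", "NNP"), ("moHana", "NNP"), ("aur", "CC"), ("shAma", "NNP")]
```

With this grouping, `compound` yields a single three-word name, while `name_list` yields three separate names. The tags in `name_list` are an assumption, since the source gives that sentence untagged.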
Conjunction
Abstract CC: Conjuncts (coordinating and subordinating). The tag CC is used for both coordinating and subordinating conjuncts. The Penn tagset uses the 'IN' tag for prepositions and subordinating conjuncts. Their rationale is that subordinating conjuncts and prepositions can be distinguished because subordinating conjuncts are followed by a clause and prepositions by a noun phrase. But in the current tagger all connectors other than prepositions are marked as CC. E.g. Ora ("and"), yA ("or"), ki ("that").
SubClass Of PTBInspiredPOSTag
Intensifier
Abstract INTF: Intensifier. This tag is not present in the Penn tagset. Words like 'baHuta', 'kama', etc. are covered by it. E.g. "baHuta" jyAdA ("too much"), "Ora" jyAdA ("much more"). But note that in [baHutoM/noun ne] the word is a noun.
SubClass Of LanguageSpecificPOSTag
Interjection
Abstract UH: Interjection. Just as in the Penn tagset, interjections are marked as UH. In addition, the affirmative word 'HAz' ("yes") is also tagged as UH. It is the only example of such a word, so it has been grouped under Interjection. UH Interjection words: HAM and interjections.
SubClass Of PTBInspiredPOSTag
LanguageSpecificPOSTag
Abstract This set consists of new tags designed to cater to some phenomena that are specific to Indian languages.
SubClass Of POSTag
Sub-Classes CompoundNoun, CompoundProperNoun, Intensifier, LightVerb, LocationNoun, Negative
LightVerb
Abstract NVB, JVB, RBVB: Kriyamula (light verbs). This tag has been introduced to account for the concept of kriyamuls in Indian languages. Kriyamuls are verbs formed by combining a noun, an adjective, or an adverb with a (helping) verb. Kriyamuls formed with a noun are NVB, those formed with an adjective are JVB, and those formed with an adverb are RBVB. E.g. snAna/NVB karatA/VFM HE/VAUX ("bath" "does"). In this example 'snAna' is a noun that is joined to the verb 'karanA' to express the sense of the verb 'to bathe'. So 'snAna' is marked as NVB, the main verb is marked as VFM, and 'HE' is its auxiliary. E.g. lAla/JVB HotA/VFM HE/VAUX ("red" "happens"). In this example the adjective 'lAla' is joined with 'HonA' to express the sense of the verb 'to redden'. So 'lAla' is marked as JVB, 'HotA' as VFM, and 'HE' as VAUX. E.g. yaHa to jarUra/RBVB HE/VFM ... In this example the adverb 'jarUra' is joined with 'HonA' to express the sense 'to be sure'. So 'jarUra' is marked as RBVB and 'HE' is the main verb, marked as VFM. Kriyamula: NVB, noun in kriyamula: (snAna/NVB karatA/VFM HE/VAUX), (snAna/NVB karate/VJJ Hue/VAUX), (snAna/NVB karake/VRB), (snAna/NVB karane/VNN para/PREP). JVB, adjective in kriyamula: (lAla/JVB HotA/VFM HE/VAUX), (pUrA/JVB HotA/VFM HE/VAUX), (pUrA/JVB Hote/VRB Hue/VAUX), (pUrA/JVB Hokara/VRB), (pUrA/JVB Hone/VNN para/PREP). RBVB, adverb in kriyamula: in case there is such a usage with xxxx, (xxxx/RBVB HotA/VFM HE/VAUX).
SubClass Of LanguageSpecificPOSTag
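The choice between NVB, JVB, and RBVB depends only on the category of the word that combines with the verb. The following sketch makes that explicit; it assumes the category of the non-verbal element is already known, and `KRIYAMULA_TAG` and `tag_kriyamula` are illustrative names rather than anything defined by the tagset.

```python
# Category of the non-verbal element -> kriyamula tag, as described above.
KRIYAMULA_TAG = {
    "noun": "NVB",
    "adjective": "JVB",
    "adverb": "RBVB",
}


def tag_kriyamula(word, category, verb_group):
    """Tag the non-verbal element and prepend it to an already tagged verb group.

    verb_group is a list of (word, tag) pairs, e.g. a VFM followed by a VAUX.
    """
    return [(word, KRIYAMULA_TAG[category])] + list(verb_group)
```

For instance, `tag_kriyamula("snAna", "noun", [("karatA", "VFM"), ("HE", "VAUX")])` reproduces the tagging snAna/NVB karatA/VFM HE/VAUX from the first example.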
LocationNoun
Abstract NLOC: Noun Location. This is an entirely new tag introduced to cover an important phenomenon of Indian languages. Words like 'Age', 'upara', 'pahele', 'bAda', etc. are used in various ways in Hindi. 1. They act as a postposition along with 'ke', e.g. ghade ke upara thAlI rakhI HE. ("pot" "on" "plate" "kept" "is") Here 'ke upara' is a postposition that is the direct equivalent of the English preposition 'on'. 2. They also act as adverbs, e.g. tuma upara jAo. ("You" "up" "go") Here 'upara' is an adverbial of place. 3. These words also take postpositions themselves and so in some sense behave like nouns, e.g. vaHa upara se AyA. ("He" "above" "from" "came") 4. As pointed out in 3. above, these words take postpositions and act as arguments of the verb in the sentence. They also take a postposition to join with another noun, so in that sense, too, they behave like nouns, e.g. upara kA HissA ("above" "of" "portion"). One option is to tag such words according to the category to which they belong in the given sentence. For example, in 1. above the word occurs as a postposition and could be marked as a postposition; in 2. above it is an adverb and could be marked as an adverb, and so on. But we feel that these words are more like nouns, as is evident from 3. and 4. above. Also, if we consider 'aage', 'upara', etc. as places that are in front, up, etc., then we can tag them as nouns. But these are not pure nouns. They are nouns that indicate a location or time, and they also function as adverbs or prepositions in context. So a new tag, NLOC, is introduced for such words. This tag caters to a finite set of such words: Age, piche, upara, nIce, bAda, pahele ("front", "behind", "above", "below", "after", "before"). Tagging such words according to their syntactic function would hamper machine learning, so a single tag, NLOC, has been devised for words that indicate location and time. E.g. upara, Age, pahele, bAda.
SubClass Of LanguageSpecificPOSTag
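Because NLOC covers a finite set of words and ignores their syntactic function, it can be assigned by simple membership lookup. The sketch below uses the six words listed in the abstract; `NLOC_WORDS` and `mark_nloc` are hypothetical names, and words outside the set are left untagged (`None`) rather than guessed.

```python
# Closed set of location/time nouns listed in the abstract.
NLOC_WORDS = frozenset({"Age", "piche", "upara", "nIce", "bAda", "pahele"})


def mark_nloc(words):
    """Pair each word with 'NLOC' if it belongs to the closed set, else None."""
    return [(w, "NLOC" if w in NLOC_WORDS else None) for w in words]


adverbial_use = mark_nloc("tuma upara jAo".split())
nominal_use = mark_nloc("vaHa upara se AyA".split())
```

In both sentences 'upara' receives the same tag, whether it functions as an adverbial of place or takes a postposition like a noun.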
MainVerb
Abstract VFM: Verb Finite Main. The entire verb category has been dealt with differently in this tagger; the following discussion explains how. The VFM tag is a modification of the VB tag of the Penn tagset. The main verb of a finite verb group in a sentence is considered VFM. Whether the form of the particular word is finite or non-finite, it is tagged as VFM. E.g. laDaZkA seba khA/VFM raHA thA. ("boy" "apple" "eating" "was") E.g. vaHa "pItA" HE ("he drinks"), vaHa laDaZkA "HE" ("he boy is").
SubClass Of ModifiedPTBPOSTag
ModifiedPTBPOSTag
Abstract This group includes those tags that are modifications of tags in the Penn tagset.
SubClass Of POSTag
Sub-Classes MainVerb, NonFiniteVerb, Number, Postposition, Quantifier, QuestionWord
Negative
Abstract NEG: Negative. Negatives like 'nahI', 'na', etc. are marked as NEG. E.g. nA ("no"), naHIM ("not").
SubClass Of LanguageSpecificPOSTag
NominalNonFiniteVerb
Abstract VNN: Verb Non-Finite Nominal. In the Penn tagger, VBG is used for gerunds, participles, and progressive verb forms. This tagger instead marks gerunds as VNN. The distinction is made so that constructions like 'pInA', etc. can be accounted for. E.g. sharAba pInA/VNN seHata ke liye KAnikAraka HE. ("liquor" "drinking" "health" "for" "harmful" "is") E.g. pInA ("drinking"). Negative form (Telugu): VNNN.
SubClass Of NonFiniteVerb
NonFiniteVerb
SubClass Of ModifiedPTBPOSTag
Sub-Classes AdjectivalNonFiniteVerb, AdverbialNonFiniteVerb, NominalNonFiniteVerb
Noun
Abstract NN: Noun. The Penn tagset distinguishes between singular and plural nouns. As mentioned earlier, this distinction is avoided here. This reduces the number of tags and thus helps machine learning. Plurality is not crucial information for dependency-level parsing or any other higher-level analysis of the sentence. As said before, if that information is needed at a later stage, it can be incorporated with the help of heuristics and linguistic rules. E.g. laDaZkA ("boy"), nadI ("river"), vicAra ("thought"), kaThoratA ("hardness").
SubClass Of PTBInspiredPOSTag
Number
Abstract QFNUM: Quantifiers Number. No distinction is made between cardinal and ordinal numbers. Any word denoting a number is tagged as QFNUM. The Penn tagset has a tag CD for cardinal numbers but does not mention ordinals. E.g. tIsarA ("third"), tInoM ("three", oblique), tIna ("three").
SubClass Of ModifiedPTBPOSTag
Particle
Abstract RP: Particle. In Indian languages, words like bhI, sA, etc. (Hindi, for example) are marked as RP. E.g. mIThA sA/RP, taka/RP, HI/RP, to/RP, bhI/RP.
SubClass Of PTBInspiredPOSTag
POSTag
Abstract All the tags used in this tagger are broadly classified into three types. The first group consists of tags that have been adopted from the Penn tagset with some minor changes. The second group consists of tags that are modifications of Penn tags. The last group consists of all those tags that are not present in the Penn tagset; they have been designed to cater to some phenomena that are specific to Indian languages.
Sub-Classes LanguageSpecificPOSTag, ModifiedPTBPOSTag, PTBInspiredPOSTag
Postposition
Abstract PREP: Postposition. All Indian languages have postpositions. Some languages, e.g. Hindi, write postpositions separately from the noun. In such a case, the postposition is marked as PREP. For example, in Hindi kheta/NN meM/PREP ("the field"/NN "in"/PREP), meM is the postposition and is written separately from the noun, so it is tagged as PREP. But in Marathi (another Indian language) mulAne/NN ("boy by"/NN), the postposition is written together with the noun, so it is not tagged separately. This tag is the same as the IN tag used for prepositions in the Penn tagset, but it has been adopted for a parallel concept in this tagger: postpositions in Indian languages have more or less the same functions as prepositions in English. The Penn tagset uses the same tag for subordinating conjuncts as well, on the grounds that subordinating conjuncts and prepositions can be distinguished because subordinating conjuncts are followed by a clause and prepositions by a noun phrase. But as pointed out earlier, in this tagger all conjuncts have been grouped under the tag CC. E.g. ne ("by"), ke/PREP liye/PREP ("for").
SubClass Of ModifiedPTBPOSTag
Pronoun
Abstract PRP: Pronoun. The Penn tagset distinguishes between personal pronouns and possessive pronouns. This distinction is avoided here; all pronouns are marked as PRP. In Indian languages all pronouns inflect for all cases (accusative, dative, possessive, etc.). If there were a separate tag for possessive pronouns, new tags would have to be designed for all the other cases as well. This would increase the number of tags unnecessarily. So only one tag is used for all pronouns. E.g. jo ("who"), vo ("that"), vaHa ("he"), "jisa" laDaZke ne ("the boy who"), jisane ("by whom").
SubClass Of PTBInspiredPOSTag
ProperNoun
Abstract NNP: Proper Nouns. This tag is also similar to the Penn tagset. Here, too, no distinction is made between singular and plural proper nouns as in the Penn tagset. E.g. rAma ("Ram"), bhAjapA ("BJP").
SubClass Of PTBInspiredPOSTag
PTBInspiredPOSTag
Abstract All tags in this group are similar to those of the Penn tagset. The Penn tagset makes finer distinctions between singular and plural or comparative and superlative forms, which are not considered in the current tagger. This is in accordance with our policy about fineness and coarseness.
SubClass Of POSTag
Sub-Classes Adjective, Adverb, AuxiliaryVerb, Conjunction, Interjection, Noun, Particle, Pronoun, ProperNoun, Symbol
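The coarsening described for this group amounts to a many-to-one mapping from Penn tags to the tags used here. The sketch below covers only the distinctions the abstracts explicitly drop (number on nouns, degree on adjectives and adverbs, possessive pronouns); `PENN_TO_IIIT` and `coarsen` are illustrative names, and the table is not a complete conversion between the two tagsets.

```python
# Penn Treebank tag -> tag in this tagset, for the distinctions dropped above.
PENN_TO_IIIT = {
    "NN": "NN", "NNS": "NN",               # singular / plural nouns
    "NNP": "NNP", "NNPS": "NNP",           # singular / plural proper nouns
    "JJ": "JJ", "JJR": "JJ", "JJS": "JJ",  # positive / comparative / superlative
    "RB": "RB", "RBR": "RB", "RBS": "RB",  # same degrees for adverbs
    "PRP": "PRP", "PRP$": "PRP",           # personal / possessive pronouns
}


def coarsen(penn_tag):
    """Return the coarser tag for a Penn tag, or None if it is not covered."""
    return PENN_TO_IIIT.get(penn_tag)
```

Twelve Penn tags collapse into five tags here, which is the reduction in tag count that the abstracts motivate with machine learning.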
Quantifier
Abstract QF: Quantifiers. All quantifiers like kama, jyAdA, baHuta, etc. are marked as QF. When such a word is used as a noun, as in 'baHutoM/NN ne/PREP jAne se inkAra kiyA' ("many" "by" "to go" "refused"), it is marked as a noun. Quantifiers of number are marked as QFNUM. E.g. jyAdA/QF ("more"), thoDA/QF ("little"), saba/QF ("all"), kama/QF ("less"), baHuta/QF ("much").
SubClass Of ModifiedPTBPOSTag
QuestionWord
Abstract QW: Question Words. The Penn tagset distinguishes between wh-words that act as question words, as relative pronouns, and as determiners. But in this tagger all wh-words (ka'kAra's in Hindi) are tagged as QW. The reason is that in Indian languages the category in which 'wh' words act as pronouns or determiners is not present. They all become pronouns like 'jo', 'jisne', etc. in Hindi, e.g. The man who wrote a book ... (vaHa AdamI jisne kItAba likhI ...) ("that" "man" "who" "book" "wrote"). E.g. kyA/QW ("what"), kEsA/QW ("how").
SubClass Of ModifiedPTBPOSTag
Symbol
Abstract SYM: Special Symbol. All words that cannot be classified under any of the other tags are tagged as SYM. This tag is similar to the Penn 'SYM'. Special symbols like $, %, etc. are also treated as SYM. Since such symbols occur very infrequently in Indian languages, no separate tag is used for them. SYM Special: not classified in any of the above.
SubClass Of PTBInspiredPOSTag

Individuals

CC
Class Conjunction
INTF
Class Intensifier
JJ
Class Adjective
JVB
Class LightVerb
NEG
Class Negative
NLOC
Class LocationNoun
NN
Class Noun
NNC
Class CompoundNoun
NNP
Class ProperNoun
NNPC
Class CompoundProperNoun
NVB
Class LightVerb
PREP
Class Postposition
PRP
Class Pronoun
QF
Class Quantifier
QFNUM
Class Number
QW
Class QuestionWord
RB
Class Adverb
RBVB
Class LightVerb
RP
Class Particle
SYM
Class Symbol
UH
Class Interjection
VAUX
Class AuxiliaryVerb
VFM
Class MainVerb
VJJ
Class AdjectivalNonFiniteVerb
VJJN
Class AdjectivalNonFiniteVerb
VNN
Class NominalNonFiniteVerb
VNNN
Class NominalNonFiniteVerb
VRB
Class AdverbialNonFiniteVerb
VRBN
Class AdverbialNonFiniteVerb
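The individuals listed in this section assign each tag to exactly one class, so they can be held in a flat lookup table. The following is a minimal sketch in plain Python that does not read the OWL file; `TAG_TO_CLASS`, `class_of`, and `tags_by_class` are illustrative names.

```python
from collections import defaultdict

# Tag (individual) -> class, transcribed from the Individuals section.
TAG_TO_CLASS = {
    "CC": "Conjunction", "INTF": "Intensifier", "JJ": "Adjective",
    "JVB": "LightVerb", "NEG": "Negative", "NLOC": "LocationNoun",
    "NN": "Noun", "NNC": "CompoundNoun", "NNP": "ProperNoun",
    "NNPC": "CompoundProperNoun", "NVB": "LightVerb", "PREP": "Postposition",
    "PRP": "Pronoun", "QF": "Quantifier", "QFNUM": "Number",
    "QW": "QuestionWord", "RB": "Adverb", "RBVB": "LightVerb",
    "RP": "Particle", "SYM": "Symbol", "UH": "Interjection",
    "VAUX": "AuxiliaryVerb", "VFM": "MainVerb",
    "VJJ": "AdjectivalNonFiniteVerb", "VJJN": "AdjectivalNonFiniteVerb",
    "VNN": "NominalNonFiniteVerb", "VNNN": "NominalNonFiniteVerb",
    "VRB": "AdverbialNonFiniteVerb", "VRBN": "AdverbialNonFiniteVerb",
}


def class_of(tag):
    """Return the ontology class a tag is an instance of."""
    return TAG_TO_CLASS[tag]


def tags_by_class():
    """Group tags by class, e.g. to see which classes have several tags."""
    grouped = defaultdict(list)
    for tag, cls in TAG_TO_CLASS.items():
        grouped[cls].append(tag)
    return dict(grouped)
```

Grouping by class shows that LightVerb has three tags (NVB, JVB, RBVB) and that each non-finite verb class has two, the plain tag and its negative variant for Telugu.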