Ontology - iiit

Abstract: OLiA Annotation Model for a Part of Speech Tagger for Indian Languages (IIIT 2007). Languages mentioned in the document include Hindi, Marathi, and Telugu. To a certain extent, IIIT (2007) seems to be a revision of http://ltrc.iiit.ac.in/tr031/posguidelines.pdf, which was developed at the same institute. Unless marked otherwise, all comments are quotes from IIIT (2007). Reference: IIIT (2007), A Part of Speech Tagger for Indian Languages (POS tagger). Tagset developed at IIIT - Hyderabad after consultations with several institutions through two workshops. Available at http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
Latest Version: http://purl.org/olia/iiit.owl#

Imports

http://purl.org/olia/system.owl olia_system

Classes

AdjectivalNonFiniteVerb
Abstract	VJJ Verb Non-Finite Adjectival Unlike in the Penn tagset, all non-finite verbs which are used as adjectives will be marked as VJJ. The Penn tagger does not make a distinction between gerunds, adjectival participles and simple 'ing' type verb forms. For Hindi, constructions like 'khAte Hue' ("eating") will be tagged as follows: khAte/VJJ Hue/VAUX. As explained earlier in the paper, this distinction is made in order to preserve the information that this word is a form of a verb. Every verb is capable of taking its own arguments in a sentence, even if it is not the main verb. In order to be able to show the exact verb-argument structure in the sentence, it is essential that this crucial information is preserved. So this tagger marks all non-finite adjectival participles as VJJ, i.e. an adjective which is formed out of a verb. e.g., (khAte/VJJ Hue/VAUX) ("eating"). Negative form (Telugu): VJJN.
SubClass Of	NonFiniteVerb
Adjective
Abstract	JJ Adjective This tag is again same as in Penn tagset. Penn tagset also makes a distinction between comparative and superlative adjectives. This has not been considered here. So this tag includes the 'tara'(comparative) and the 'tama' (superlative) forms of adjectives in Hindi. e.g. adhikatara, sarvottama, etc. ("more times", "best") (includes comparative and superlative forms also, adhikatara, sarvottama)
SubClass Of	PTBInspiredPOSTag
Adverb
Abstract	RB Adverb This tag is the same RB tag of Penn tagset. Penn tagset also makes a difference between comparative and superlative adverbs, which is not adopted in this tagger. This is in accordance with our philosophy of coarseness in linguistic analysis. e.g., (dhIre/RB dhIre/RB, tejI/RB se/RP) ("slowly slowly", "fast")
SubClass Of	PTBInspiredPOSTag
AdverbialNonFiniteVerb
Abstract	VRB Verb Non-finite Adverbial Again unlike in the Penn tagset, non-finite forms of verbs which are used as adverbs will be tagged with a different tag, VRB. In Hindi, constructions like 'khAte khAte' ("while eating"), 'khAkara' ("after eating"), etc. will be tagged as VRB. The reason for this distinction between non-finite verbs used as adverbs and other verbs is as explained under VJJ. e.g., (khAkara, pIte/VRB Hue/VAUX) ("after eating", "drinking"). Negative form (Telugu): VRBN.
SubClass Of	NonFiniteVerb
AuxiliaryVerb
Abstract	VAUX Verb Auxiliary All auxiliary verbs will be marked as VAUX. This tag has been adopted as such from the Penn tagset. e.g., (khA/VFM cukA/VAUX HE/VAUX) ("eat en has")
SubClass Of	PTBInspiredPOSTag
CompoundNoun
Abstract	NNC Compound Nouns There is no separate tag for compound nouns in the Penn tagset. But in this tagger, the tag NNC is used for compound nouns. This tag has been introduced in order to identify unhyphenated compound words as one unit. e.g. 'keMdra sarakAra' will be tagged as keMdra/NNC sarakAra/NN. ("center" "government") In this example, 'keMdra' and 'sarakAra' are both nouns which together form a compound noun. All words of a compound except the last one will be marked as NNC. Thus any NNC will always be followed by another NNC or an NN. This strategy helps identify these words as one unit although they are not joined by a hyphen. NNC Compound Common Nouns (kendra/NNC sarakAra/NN ("center government"), rAma/NNC moHana/NN ("Ram, Mohan"), laDaZke/NNC laDaZkiyAz/NN ("boys girls"), laDaZke/NNC laDaZkiyoM/NN ne khAnA khAyA ("boys girls food ate").)
SubClass Of	LanguageSpecificPOSTag
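The adjacency rule stated above (an NNC is always followed by another NNC or an NN) can be checked mechanically. The following is a minimal sketch and not part of IIIT (2007): the function names `parse_tagged` and `nnc_violations` and the `allowed` parameter are hypothetical, and the input is assumed to use the word/TAG notation of the examples.

```python
def parse_tagged(text):
    """Split a 'word/TAG word/TAG' string into (word, tag) pairs.

    Assumes every token carries a tag after its last slash.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def nnc_violations(pairs, allowed=("NNC", "NN")):
    """Return the indices of NNC tokens that are not followed by an allowed tag."""
    bad = []
    for i, (_, tag) in enumerate(pairs):
        if tag != "NNC":
            continue
        # An NNC at the very end of the input has no follower at all.
        next_tag = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if next_tag not in allowed:
            bad.append(i)
    return bad


print(nnc_violations(parse_tagged("keMdra/NNC sarakAra/NN")))  # []
print(nnc_violations(parse_tagged("keMdra/NNC meM/PREP")))     # [0]
```

Titles before proper nouns (e.g. Col./NNC Ranjit/NNPC Deshmukh/NNP, described under CompoundProperNoun) are followed by NNPC or NNP, so a check that covers them would need `allowed` extended accordingly.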
CompoundProperNoun
Abstract	NNPC Compound Proper Nouns This tag is also an addition. All words in a compound proper noun will be marked as NNPC, excluding the last one. e.g. aTala/NNPC biHArI/NNPC vAjapeyI/NNP. Here the first two words are NNPC and the last one will be NNP. Just like the NNC tag, this tag too helps identify a compound proper noun as one unit and not confuse it with a list of proper nouns. e.g. rAma, moHana aur shAma ghara gaye. ("Ram", "Mohan" "and" "Shyam" "home" "went") Any title like Dr., Col., Lt. etc. which occurs before a proper noun will be tagged as NNC. All such titles are nouns which will always be followed by a proper noun. To indicate that these are a part of the proper noun but are nouns, they will be tagged as NNC. e.g. Col./NNC Ranjit/NNPC Deshmukh/NNP NNC Compound Common Nouns (kendra/NNC sarakAra/NN ("center government"), rAma/NNC moHana/NN ("Ram, Mohan"), laDaZke/NNC laDaZkiyAz/NN ("boys girls"), laDaZke/NNC laDaZkiyoM/NN ne khAnA khAyA ("boys girls food ate").)
SubClass Of	LanguageSpecificPOSTag
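Because every non-final word of a compound carries NNC or NNPC, a tagged sequence can be regrouped into units by attaching each such word to the word that follows it. The sketch below illustrates this under stated assumptions: `group_compounds` is a hypothetical helper, not something defined by IIIT (2007), and the input is a list of (word, tag) pairs. The tag on Aye in the usage line is assumed for illustration.

```python
COMPOUND_MARKERS = ("NNC", "NNPC")


def group_compounds(pairs):
    """Merge runs of NNC/NNPC words with the word that closes the compound.

    Each resulting unit carries the tag of its last word. A run left open
    at the end of the input is returned with tag None.
    """
    units = []
    current = []
    for word, tag in pairs:
        current.append(word)
        if tag not in COMPOUND_MARKERS:
            units.append((" ".join(current), tag))
            current = []
    if current:  # compound marker with no closing word
        units.append((" ".join(current), None))
    return units


name = [("aTala", "NNPC"), ("biHArI", "NNPC"), ("vAjapeyI", "NNP")]
print(group_compounds(name))  # [('aTala biHArI vAjapeyI', 'NNP')]
```

With this grouping, Col./NNC Ranjit/NNPC Deshmukh/NNP also comes out as a single NNP unit, whereas a list such as rAma/NNP moHana/NNP stays as two separate units.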
Conjunction
Abstract	CC Conjuncts (coordinating and subordinating) The tag CC will be used for both coordinating and subordinating conjuncts. The Penn tagset uses the 'IN' tag for prepositions and subordinating conjuncts. Their rationale behind this is that subordinating conjuncts and prepositions can be distinguished because subordinating conjuncts are followed by a clause and prepositions by a noun phrase. But in the current tagger all connectors other than prepositions will be marked as CC. e.g. (Ora, yA, ki) ("and", "or", "that")
SubClass Of	PTBInspiredPOSTag
Intensifier
Abstract	INTF Intensifier This tag is not present in Penn tagset. Words like 'baHuta', 'kama', etc. will be covered under this. e.g., ("baHuta" jyAdA, "Ora" jyAdA) # But note that: [baHutoM/noun ne] ("too much", "much more")
SubClass Of	LanguageSpecificPOSTag
Interjection
Abstract	UH Interjection Just as in the Penn tagset, interjections will be marked as UH. In addition, the affirmative word 'HAz' ("yes") will also be tagged as UH. This is the only example of such a word, so it has been clubbed under Interjection. UH Interjection words (HAM and interjections)
SubClass Of	PTBInspiredPOSTag
LanguageSpecificPOSTag
Abstract	This set consists of new tags designed to cater to some phenomena which are specific to Indian languages.
SubClass Of	POSTag
Sub-Classes	CompoundNoun CompoundProperNoun Intensifier LightVerb LocationNoun Negative
LightVerb
Abstract	NVB,JVB,RBVB Kriyamula (light verbs) This tag has been introduced to account for the concept of kriyamuls of Indian languages. Kriyamuls are verbs formed by combining a noun, an adjective or an adverb with a (helping) verb. The kriyamuls formed by joining a noun will be NVB, those formed with an adjective will be JVB and those formed by joining adverbs will be RBVB. e.g. snAna/NVB karatA/VFM HE/VAUX ("bath" "does") In the above example 'snAna' is a noun which is joined to the verb 'karanA' to express the sense of the verb 'to bathe'. So here 'snAna' is marked as NVB, the main verb is marked as VFM and 'HE' is its auxiliary. e.g. lAla/JVB HotA/VFM HE/VAUX ("red" "happens") In this example the adjective 'lAla' is joined with 'HonA' to express the sense of the verb 'to redden'. So 'lAla' is marked as JVB, 'HotA' as VFM and 'HE' as VAUX. e.g. yaHa to jarUra/RBVB HE/VFM........ In this example the adverb 'jarUra' is joined with 'HonA' to express the sense 'to be sure'. So 'jarUra' is marked as RBVB and 'HE' is the main verb marked as VFM. Kriyamula: NVB Noun in kriya mula (snAna/NVB karatA/VFM HE/VAUX) (snAna/NVB karate/VJJ Hue/VAUX) (snAna/NVB karake/VRB) (snAna/NVB karane/VNN para/PREP) JVB Adj in kriya mula (lAla/JVB HotA/VFM HE/VAUX) (pUrA/JVB HotA/VFM HE/VAUX) (pUrA/JVB Hote/VRB Hue/VAUX) (pUrA/JVB Hokara/VRB) (pUrA/JVB Hone/VNN para/PREP) RBVB Adv in kriya mula In case there is such a usage with xxxx (xxxx/RBVB HotA/VFM HE/VAUX)
SubClass Of	LanguageSpecificPOSTag
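The three kriyamula tags pair a non-verbal word with the verb that follows it, so the constructions can be read off a tagged sequence directly. The following is a rough sketch under assumptions: `find_kriyamulas` and the two constants are hypothetical names, and the verb tag set is simply the list of non-negative verb tags that appear in this ontology.

```python
# Kriyamula tag -> category of the word that joins the verb (from the text above).
LIGHT_VERB_TAGS = {"NVB": "noun", "JVB": "adjective", "RBVB": "adverb"}
VERB_TAGS = {"VFM", "VAUX", "VJJ", "VRB", "VNN"}


def find_kriyamulas(pairs):
    """Return (word, verb, category) for each kriyamula word directly followed by a verb."""
    found = []
    for i, (word, tag) in enumerate(pairs[:-1]):
        if tag in LIGHT_VERB_TAGS:
            next_word, next_tag = pairs[i + 1]
            if next_tag in VERB_TAGS:
                found.append((word, next_word, LIGHT_VERB_TAGS[tag]))
    return found


sentence = [("snAna", "NVB"), ("karatA", "VFM"), ("HE", "VAUX")]
print(find_kriyamulas(sentence))  # [('snAna', 'karatA', 'noun')]
```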
LocationNoun
Abstract	NLOC Noun Location This is an entirely new tag introduced to cover an important phenomenon of Indian languages. Words like 'Age', 'upara', 'pahele', 'bAda', etc. are used in various ways in Hindi. 1. They act as a postposition along with 'ke'. e.g. ghade ke upara thAlI rakhI HE. ("pot" "on" "plate" "kept" "is") Here 'ke upara' is a postposition which is the direct equivalent of the English preposition 'on'. 2. They also act as adverbs. e.g. tuma upara jAo. ("You" "up" "go") Here 'upara' is an adverbial of place. 3. These words also take postpositions themselves and so in some sense behave like nouns. e.g. vaHa upara se AyA. ("He" "above" "from" "came") 4. As pointed out in 3. above, these words take postpositions and act as arguments of the verb in the sentence. They also take a postposition to join with another noun. So in that sense too they behave like nouns. e.g. upara kA HissA ("above" "of" "portion") To tag such words, one option is to tag them according to the category to which they belong in the given sentence. For example, in 1. above the word occurs as a postposition, so it can be marked as a postposition. In example 2. above it is an adverb, so it can be marked as an adverb, and so on. But we feel that these words are more like nouns, as is evident from 3. and 4. above; also, if we consider, for example, 'Age', 'upara', etc. as places which are in front, up, etc., then we can tag them as nouns. But these are not pure nouns. They are nouns which indicate a location or time. They also function as adverbs or prepositions in context. So a new tag NLOC is introduced for such words. This tag will cater to a finite set of such words. set: (Age, piche, upara, nIce, bAda, pahele) ("front", "behind", "above", "below", "after", "before") Tagging such words according to their syntactic function would hamper machine learning. So a single tag, NLOC, has been devised for such words, which indicate location and time. e.g., (upara, Age, pahele, bAda)
SubClass Of	LanguageSpecificPOSTag
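Since NLOC covers a finite set of words regardless of their syntactic function, assigning it amounts to a lexicon lookup. The sketch below is only an illustration: `tag_nloc` and the `"UNK"` placeholder are hypothetical, and the word list is just the set quoted above, which may not be exhaustive.

```python
# The closed set quoted in the abstract above (transliterated Hindi).
NLOC_WORDS = frozenset({"Age", "piche", "upara", "nIce", "bAda", "pahele"})


def tag_nloc(words, fallback="UNK"):
    """Tag words from the closed NLOC set; leave all others with a placeholder.

    The placeholder stands for "to be decided by the rest of the tagger".
    """
    return [(w, "NLOC" if w in NLOC_WORDS else fallback) for w in words]


print(tag_nloc(["tuma", "upara", "jAo"]))
# [('tuma', 'UNK'), ('upara', 'NLOC'), ('jAo', 'UNK')]
```

The lookup gives 'upara' the same tag whether it acts as an adverb (tuma upara jAo) or takes a postposition (vaHa upara se AyA), which is the behavior the abstract argues for.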
MainVerb
Abstract	VFM Verb Finite Main The entire verb category has been dealt with differently in this tagger. The following discussions explain how the verbal category has been dealt with. The VFM tag is a modification of the VB tag of Penn tagset. Main verb of a finite verb group of a sentence is considered as VFM. Whether the form of the particular word is finite or non-finite it will be tagged as VFM. E.g. laDZakA seba khA/VFM raHA thA. ("boy" "apple" "eating" "was") e.g. (vaHa "pItA" HE, vaHa laDaZkA "HE") ("he drinks", "he boy is")
SubClass Of	ModifiedPTBPOSTag
ModifiedPTBPOSTag
Abstract	This group includes those tags which are a modification of some tags in the Penn tagset.
SubClass Of	POSTag
Sub-Classes	MainVerb NonFiniteVerb Number Postposition Quantifier QuestionWord
Negative
Abstract	NEG Negative Negatives like 'nahI', 'na', etc. will be marked as NEG. e.g. (nA, naHIM) ("no", "not")
SubClass Of	LanguageSpecificPOSTag
NominalNonFiniteVerb
Abstract	VNN Verb Non-Finite Nominal In the Penn tagger, VBG is used for gerunds, participles and progressive verb forms. But this tagger will mark gerunds as VNN. This distinction is being made in order that constructions like 'pInA', etc. can be accounted for. e.g. sharAba pInA/VNN seHata ke liye HAnikAraka HE. ("liquor" "drinking" "health" "for" "harmful" "is") e.g., (pInA) ("drinking"). Negative form (Telugu): VNNN.
SubClass Of	NonFiniteVerb
NonFiniteVerb
SubClass Of	ModifiedPTBPOSTag
Sub-Classes	AdjectivalNonFiniteVerb AdverbialNonFiniteVerb NominalNonFiniteVerb
Noun
Abstract	NN Noun The Penn tagset makes a distinction between noun singular and noun plural. As mentioned earlier, this distinction is avoided here. This reduces the number of tags and thus enhances machine learning. Plurality is not crucial information with respect to dependency level parsing or any other higher level analysis of the sentence. As said before, if that information is needed at a later stage, it can be incorporated with the help of heuristics and linguistic rules. e.g., (laDaZkA, nadI, vicAra, kaThoratA) ("boy", "river", "thought", "hardness")
SubClass Of	PTBInspiredPOSTag
Number
Abstract	QFNUM Quantifiers Number No distinction will be made between cardinal and ordinal numbers. Any word denoting numbers will be tagged as QFNUM. Penn tagset has a tag CD for cardinal numbers and they have not talked of ordinals! e.g. (tIsarA, tInoM, tIna) ("third", "three"(oblique), "three")
SubClass Of	ModifiedPTBPOSTag
Particle
Abstract	RP Particle In Indian languages words like bhI, sA, etc. (Hindi for example) will be marked as RP. e.g. (mIThA sA/RP, taka/RP, HI/RP, to/RP, bhI/RP)
SubClass Of	PTBInspiredPOSTag
POSTag
Abstract	All the tags used in this tagger are broadly classified into three types. There are some tags which have been adopted from the Penn tagset with some minor changes. They are grouped into one group. The second category of tags is of those which are a modification of tags in the Penn tagset. The last group is of all those tags which are not present in the Penn tagset. They have been designed to cater to some phenomena which are specific to Indian languages.
Sub-Classes	LanguageSpecificPOSTag ModifiedPTBPOSTag PTBInspiredPOSTag
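The three-way classification and the subclass relations listed in this ontology can be held in a small lookup structure. The following sketch transcribes the "Sub-Classes" entries of this page. The names `SUBCLASSES`, `PARENT` and `ancestors` are illustrative and not part of the ontology itself.

```python
# Sub-class listings transcribed from the class entries of this ontology.
SUBCLASSES = {
    "POSTag": ["LanguageSpecificPOSTag", "ModifiedPTBPOSTag", "PTBInspiredPOSTag"],
    "LanguageSpecificPOSTag": ["CompoundNoun", "CompoundProperNoun", "Intensifier",
                               "LightVerb", "LocationNoun", "Negative"],
    "ModifiedPTBPOSTag": ["MainVerb", "NonFiniteVerb", "Number", "Postposition",
                          "Quantifier", "QuestionWord"],
    "NonFiniteVerb": ["AdjectivalNonFiniteVerb", "AdverbialNonFiniteVerb",
                      "NominalNonFiniteVerb"],
    "PTBInspiredPOSTag": ["Adjective", "Adverb", "AuxiliaryVerb", "Conjunction",
                          "Interjection", "Noun", "Particle", "Pronoun",
                          "ProperNoun", "Symbol"],
}

# Invert the listing: each class has exactly one direct superclass here.
PARENT = {child: parent
          for parent, children in SUBCLASSES.items()
          for child in children}


def ancestors(cls):
    """Return the chain of superclasses from the direct parent up to POSTag."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


print(ancestors("AdjectivalNonFiniteVerb"))
# ['NonFiniteVerb', 'ModifiedPTBPOSTag', 'POSTag']
```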
Postposition
Abstract	PREP Postposition All Indian languages have the phenomenon of postpositions. Some languages, e.g. Hindi, separate the postpositions from the noun. In such a case, a postposition will be marked as PREP. For example in Hindi, kheta/NN meM/PREP ("the field"/NN "in"/PREP), here meM is the postposition and is written separately from the noun. So it will be tagged as PREP. But in Marathi (another Indian language), mulAne/NN ("boy by"/NN), here the postposition is written along with the noun. So it will not be tagged separately. This tag is the same as the IN tag used for prepositions in the Penn tagset. But it has been adopted for a parallel concept in this tagger. Postpositions of Indian languages have more or less the same functions as prepositions in English. The same tag is used by the Penn tagset for subordinating conjuncts also. They feel that subordinating conjuncts and prepositions can be distinguished because subordinating conjuncts are followed by a clause and prepositions by a noun phrase. But as pointed out earlier, in this tagger all conjuncts have been clubbed under the tag CC. e.g. (ne, ke/PREP liye/PREP) ("by", "for")
SubClass Of	ModifiedPTBPOSTag
Pronoun
Abstract	PRP Pronoun The Penn tagset makes a distinction between personal pronouns and possessive pronouns. This distinction is avoided here. All pronouns are marked as PRP. In Indian languages all pronouns inflect for all cases (accusative, dative, possessive etc.). In case we have a separate tag for possessive pronouns, new tags will have to be designed for all the cases. This will increase the number of tags, which is unnecessary. So, only one tag is used for all pronouns. e.g. (jo, vo, vaHa, "jisa" laDaZke ne, jisane) ("who", "that", "he", "the boy who", "by whom")
SubClass Of	PTBInspiredPOSTag
ProperNoun
Abstract	NNP Proper Nouns This tag is also similar to the Penn tagset. Here too we have not made a distinction between Proper Noun singular and Proper Noun plural as in the Penn tagset. e.g. (rAma, bhAjapA) (Ram, BJP)
SubClass Of	PTBInspiredPOSTag
PTBInspiredPOSTag
Abstract	All tags in this group are similar to the Penn tagset. Penn tagset makes finer distinction between singular and plural or comparative and superlative forms, which is not considered in the current tagger. This is in accordance with our policy about fineness and coarseness.
SubClass Of	POSTag
Sub-Classes	Adjective Adverb AuxiliaryVerb Conjunction Interjection Noun Particle Pronoun ProperNoun Symbol
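The coarsening described for this group (no singular/plural, comparative/superlative or personal/possessive distinctions) can be pictured as a many-to-one mapping from Penn tags onto the tags of this tagger. The sketch below is an illustration only: the fine-grained tag names (NNS, NNPS, JJR, JJS, RBR, RBS, PRP$) come from the Penn Treebank tagset rather than from IIIT (2007), the names `PENN_TO_IIIT` and `coarsen` are hypothetical, and the table is not a complete conversion between the two tagsets.

```python
# Penn tags whose distinctions are dropped, mapped to the coarser tag kept here.
PENN_TO_IIIT = {
    "NNS": "NN",     # plural common noun
    "NNPS": "NNP",   # plural proper noun
    "JJR": "JJ",     # comparative adjective
    "JJS": "JJ",     # superlative adjective
    "RBR": "RB",     # comparative adverb
    "RBS": "RB",     # superlative adverb
    "PRP$": "PRP",   # possessive pronoun
}


def coarsen(tag):
    """Collapse a fine-grained Penn tag; tags without an entry pass through unchanged."""
    return PENN_TO_IIIT.get(tag, tag)


print([coarsen(t) for t in ["NNS", "JJR", "VB"]])  # ['NN', 'JJ', 'VB']
```

Tags that were modified rather than merely coarsened, such as the verb tags, pass through untouched here because their treatment differs by construction rather than by a simple renaming.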
Quantifier
Abstract	QF Quantifiers All quantifiers like kama, jyAdA, baHuta, etc. will be marked as QF. In case these words are used in constructions like 'baHutoM/NN ne/PREP jAne se inkAra kiyA' ("many" "by" "to go" "refused"), where the word is a noun, it will be marked as a noun. Quantifiers of number will be marked as below. e.g. (jyAdA/QF, thoDA/QF, saba/QF, kama/QF, baHuta/QF) ("more", "little", "all", "less", "much")
SubClass Of	ModifiedPTBPOSTag
QuestionWord
Abstract	QW Question Words The Penn tagset makes distinction between the wh words which act as questions, as relative pronouns and as determiners. But in this tagger all wh words (ka'kAra's in Hindi) will be tagged as QW. The reason being, in Indian languages the category where 'wh' words act as pronouns or determiners is not present. They all become pronouns like 'jo', 'jisne', etc. in Hindi e.g. The man who wrote a book ... (vaHa AdamI jisne kItAba likhI ... ) ("that" "man" "who" "book" "wrote") e.g. (kyA/QW, kEsA/QW) ("what", "how")
SubClass Of	ModifiedPTBPOSTag
Symbol
Abstract	SYM Special Symbol All those words which cannot be classified under any of the other tags will be tagged as SYM. This tag is similar to the Penn 'SYM'. Special symbols like $, %, etc. are also treated as SYM. Since the frequency of occurrence of such symbols is very low in Indian languages, no separate tag is used for such symbols. SYM Special: Not classified in any of the above
SubClass Of	PTBInspiredPOSTag

Individuals

CC
Class	Conjunction
INTF
Class	Intensifier
JJ
Class	Adjective
JVB
Class	LightVerb
NEG
Class	Negative
NLOC
Class	LocationNoun
NN
Class	Noun
NNC
Class	CompoundNoun
NNP
Class	ProperNoun
NNPC
Class	CompoundProperNoun
NVB
Class	LightVerb
PREP
Class	Postposition
PRP
Class	Pronoun
QF
Class	Quantifier
QFNUM
Class	Number
QW
Class	QuestionWord
RB
Class	Adverb
RBVB
Class	LightVerb
RP
Class	Particle
SYM
Class	Symbol
UH
Class	Interjection
VAUX
Class	AuxiliaryVerb
VFM
Class	MainVerb
VJJ
Class	AdjectivalNonFiniteVerb
VJJN
Class	AdjectivalNonFiniteVerb
VNN
Class	NominalNonFiniteVerb
VNNN
Class	NominalNonFiniteVerb
VRB
Class	AdverbialNonFiniteVerb
VRBN
Class	AdverbialNonFiniteVerb
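The individuals listed above amount to a table from tag strings to ontology classes. A minimal sketch of that table as a lookup follows. `TAG_TO_CLASS` and `classes_for` are illustrative names, and tags that are not individuals of this ontology are assumed to map to `None`.

```python
# Tag -> class, transcribed from the Individuals section of this ontology.
TAG_TO_CLASS = {
    "CC": "Conjunction", "INTF": "Intensifier", "JJ": "Adjective",
    "JVB": "LightVerb", "NEG": "Negative", "NLOC": "LocationNoun",
    "NN": "Noun", "NNC": "CompoundNoun", "NNP": "ProperNoun",
    "NNPC": "CompoundProperNoun", "NVB": "LightVerb", "PREP": "Postposition",
    "PRP": "Pronoun", "QF": "Quantifier", "QFNUM": "Number",
    "QW": "QuestionWord", "RB": "Adverb", "RBVB": "LightVerb",
    "RP": "Particle", "SYM": "Symbol", "UH": "Interjection",
    "VAUX": "AuxiliaryVerb", "VFM": "MainVerb",
    "VJJ": "AdjectivalNonFiniteVerb", "VJJN": "AdjectivalNonFiniteVerb",
    "VNN": "NominalNonFiniteVerb", "VNNN": "NominalNonFiniteVerb",
    "VRB": "AdverbialNonFiniteVerb", "VRBN": "AdverbialNonFiniteVerb",
}


def classes_for(pairs):
    """Replace each tag in a (word, tag) sequence by its ontology class name."""
    return [(word, TAG_TO_CLASS.get(tag)) for word, tag in pairs]


print(classes_for([("khAte", "VJJ"), ("Hue", "VAUX")]))
# [('khAte', 'AdjectivalNonFiniteVerb'), ('Hue', 'AuxiliaryVerb')]
```

Several tags share one class (NVB, JVB and RBVB are all LightVerb, and each negative form shares the class of its positive counterpart), so the mapping is many-to-one and cannot be inverted without extra information.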