OLiA Annotation Model for a Part of Speech Tagger for Indian Languages (IIIT 2007). Languages mentioned in the document include Hindi, Marathi, and Telugu. To a certain extent, IIIT (2007) seems to be a revision of http://ltrc.iiit.ac.in/tr031/posguidelines.pdf that was developed at the same institute.
Unless marked otherwise, all comments are quotes from IIIT (2007).
IIIT (2007), A Part of Speech Tagger for Indian Languages (POS tagger), Tagset developed at IIIT - Hyderabad after consultations with several institutions through two workshops. available under http://shiva.iiit.ac.in/SPSAL2007/iiit_tagset_guidelines.pdf
VJJ Verb Non-Finite Adjectival
Unlike Penn tagset all non finite verbs which are used as adjectives
will be marked as VJJ. The Penn tagger does not make a distinction
between the gerunds and adjectival participles or simple 'ing' type
verb forms.
For Hindi, constructions like 'khAte Hue' will be tagged as follows:
khAte/VJJ Hue/VAUX.
("eating")
As explained earlier in the paper, this distinction is made in order to
preserve the information that this word is a form of a verb. Every verb
is capable of taking its own arguments in a sentence, even if it is not
the main verb. In order to be able to show the exact verb-argument
structure in the sentence, it is essential that this crucial
information is preserved. So this tagger marks all non-finite
adjectival participles as VJJ i.e. an adjective which is formed out of
a verb.
e.g., (khAte/VJJ Hue/VAUX) ("eating")
negative VJJN :telugu
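The word/TAG notation used in the examples can be read mechanically. The following is a minimal sketch, not part of IIIT (2007): the helper names `parse_tagged` and `is_verb_form` are hypothetical, and the check relies on the assumption (an observation about this tagset, not a rule stated in the guidelines) that the tags marking verb forms all start with "V" (VFM, VAUX, VJJ, VRB, VNN and the Telugu negatives).

```python
def parse_tagged(text):
    """Split a string like 'khAte/VJJ Hue/VAUX' into (word, tag) pairs.

    Tokens without a slash are kept with tag None, since many examples
    in the guidelines tag only some of the words.
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs


def is_verb_form(tag):
    """Assumption: verb-derived tags in this tagset begin with 'V'."""
    return tag is not None and tag.startswith("V")


pairs = parse_tagged("khAte/VJJ Hue/VAUX")
verb_words = [word for word, tag in pairs if is_verb_form(tag)]
print(pairs)
print(verb_words)
```

Under this assumption, an adjectival participle tagged VJJ stays recognisable as a verb form, which is the information the guidelines want to preserve.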
JJ Adjective
This tag is again same as in Penn tagset. Penn tagset also makes a
distinction between comparative and superlative adjectives. This has
not been considered here.
So this tag includes the 'tara'(comparative) and the 'tama'
(superlative) forms of adjectives in Hindi.
e.g. adhikatara, sarvottama, etc.
("more times", "best")
(includes comparative and superlative forms also,
adhikatara, sarvottama)
RB Adverb
This tag is the same RB tag of Penn tagset. Penn tagset also makes a
difference between comparative and superlative adverbs, which is not
adopted in this tagger. This is in accordance with our philosophy of
coarseness in linguistic analysis.
e.g., (dhIre/RB dhIre/RB, tejI/RB se/RP)
("slowly slowly", "fast")
VRB Verb Non-finite Adverbial
Again unlike Penn tagset, non-finite forms of verbs which are used as
adverbs will be tagged with a different tag VRB.
In Hindi constructions like 'khAte khAte'("while eating"),
'khAkara'("after eating"), etc will be tagged as VRB.
The reason for this distinction between non-finite verbs used as
adverbs and other verbs is as explained in VJJ.
e.g., (khAkara, pIte/VRB Hue/VAUX) ("after eating", "drinking")
negative VRBN :telugu
VAUX Verb Auxiliary
All auxiliary verbs will be marked as VAUX. This tag has been adopted
as such from the Penn tagset.
e.g., (khA/VFM cukA/VAUX HE/VAUX)
("eat en has")
NNC Compound Nouns
There is no separate tag for compound nouns in the Penn tagset. But in
this tagger, the tag NNC is used for compound nouns. This tag has been
introduced in order to identify unhyphenated compound words as one
unit.
e.g. 'keMdra sarakAra' will be tagged as keMdra/NNC sarakAra/NN.
("center" "government")
In this example, 'keMdra' and 'sarakAra' are both nouns which are
forming a compound noun. All words of a compound except the last one
will be marked as NNC. Thus any NNC will always be followed by
another NNC or an NN. This strategy helps identify these words as one
unit although they are not conjoined by a hyphen.
NNC Compound Common Nouns
(kendra/NNC sarakAra/NN ("center government"),
rAma/NNC moHana/NN ("Ram, Mohan"),
laDaZke/NNC laDaZkiyAz/NN ("boys girls"),
laDaZke/NNC laDaZkiyoM/NN ne khAnA khAyA
("boys girls food ate"))
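The rule that an NNC is always followed by another NNC or an NN can be checked and used to merge a compound into one unit. This is a hedged sketch, not taken from IIIT (2007): the function names are hypothetical, and the set of allowed following tags is a parameter because it encodes only the rule as stated for common-noun compounds.

```python
def parse_tagged(text):
    """Split 'word/TAG word/TAG' into (word, tag) pairs (tag None if absent)."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        pairs.append((word, tag) if sep else (token, None))
    return pairs


def check_nnc(pairs, allowed_next=("NNC", "NN")):
    """Return the indices of NNC tokens not followed by an allowed tag."""
    bad = []
    for i, (_, tag) in enumerate(pairs):
        if tag != "NNC":
            continue
        next_tag = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if next_tag not in allowed_next:
            bad.append(i)
    return bad


def group_compounds(pairs):
    """Merge each run 'NNC ... NNC NN' into a single (words, 'NN') unit."""
    units, buffer = [], []
    for word, tag in pairs:
        if tag == "NNC":
            buffer.append(word)
        elif tag == "NN" and buffer:
            units.append((" ".join(buffer + [word]), "NN"))
            buffer = []
        else:
            # An NNC run that does not end in NN is left untouched.
            units.extend((w, "NNC") for w in buffer)
            buffer = []
            units.append((word, tag))
    units.extend((w, "NNC") for w in buffer)
    return units


pairs = parse_tagged("keMdra/NNC sarakAra/NN ne/PREP")
print(check_nnc(pairs))
print(group_compounds(pairs))
```

Running the check before grouping makes tagging errors visible instead of silently producing a malformed unit.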
NNPC Compound Proper Nouns
This tag is also an addition. All words in a compound proper noun will
be marked as NNPC excluding the last one.
e.g. aTala/NNPC biHArI/NNPC vAjapeyI/NNP.
Here the first two words are NNPC and the last one will be NNP. Just as
the NNC tag this tag too helps identify a compound proper noun as one
unit and not confuse it with a list of proper nouns.
e.g. rAma, moHana aur shAma ghara gaye.
("Ram", "Mohan" "and" "Shyam" "home" "went")
Any title like Dr., Col., Lt. etc. which occurs before a proper noun
will be tagged as NNC. All such titles are nouns which will always be
followed by a Proper Noun. To indicate that these are a part of the
proper noun but are nouns they will be tagged as NNC.
e.g. Col./NNC Ranjit/NNPC Deshmukh/NNP
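The difference between a compound proper noun and a list of proper nouns can be illustrated with a small grouping routine. This is a sketch under stated assumptions, not a procedure from IIIT (2007): the function name is hypothetical, a name is taken to be any run of NNC titles and NNPC words closed by an NNP, and the tags on the list example (rAma/NNP moHana/NNP) are assumed, since the guidelines give that sentence untagged.

```python
def group_proper_names(pairs):
    """Merge '[NNC titles] NNPC* NNP' runs into one (name, 'NNP') unit."""
    units, buffer = [], []
    for word, tag in pairs:
        if tag in ("NNC", "NNPC"):
            buffer.append((word, tag))
        elif tag == "NNP":
            name = " ".join([w for w, _ in buffer] + [word])
            units.append((name, "NNP"))
            buffer = []
        else:
            # The run did not end in NNP, so its words are left unchanged.
            units.extend(buffer)
            buffer = []
            units.append((word, tag))
    units.extend(buffer)
    return units


compound = [("aTala", "NNPC"), ("biHArI", "NNPC"), ("vAjapeyI", "NNP")]
titled = [("Col.", "NNC"), ("Ranjit", "NNPC"), ("Deshmukh", "NNP")]
name_list = [("rAma", "NNP"), ("moHana", "NNP")]  # assumed tags

print(group_proper_names(compound))
print(group_proper_names(titled))
print(group_proper_names(name_list))
```

The compound and the titled name each come out as one unit, while the list stays as separate proper nouns.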
CC Conjuncts (coordinating and subordinating)
The tag CC will be used for both coordinating and subordinating
conjuncts. The Penn tagset uses the 'IN' tag for prepositions and
subordinating conjuncts. The rationale behind this is that
subordinating conjuncts and prepositions can be distinguished because
subordinating conjuncts are followed by a clause and prepositions by
a noun phrase.
But in the current tagger all connectors other than prepositions will
be marked as CC.
e.g. (Ora, yA, ki)
("and", "or", "that")
INTF Intensifier
This tag is not present in Penn tagset. Words like 'baHuta', 'kama',
etc. will be covered under this.
e.g., ("baHuta" jyAdA, "Ora" jyAdA)
("too much", "much more")
But note that: [baHutoM/noun ne]
UH Interjection
Just as in Penn tagset, interjections will be marked as UH. In addition
the affirmative word 'HAz'("yes") will also be tagged as UH. This is
the only example of such a word so has been clubbed under Interjection.
UH Interjection words (HAM and interjections)
NVB,JVB,RBVB Kriyamula (light verbs)
This tag has been introduced to account for the concept of kriyamuls of
Indian Languages. Kriyamuls are verbs formed by combining a noun or an
adjective or an adverb with a (helping) verb. The kriyamuls formed by
joining a noun will be NVB, those formed with an adjective will be JVB
and those formed by joining adverbs will be RBVB.
e.g. snAna/NVB karatA/VFM HE/VAUX
("bath" "does")
In the above example 'snAna' is a noun which is joined to the verb
'karanA' to express the sense of the verb 'to bathe'. So here 'snAna'
is marked as NVB and the main verb is marked as VFM and 'HE' is its
auxiliary.
e.g. lAla/JVB HotA/VFM HE/VAUX
("red" "happens")
In this example the adjective 'lAla' is joined with 'HonA' to express
the sense of the verb 'to redden'. So 'lAla' is marked as JVB, 'HotA'
as VFM and 'HE' as VAUX.
e.g. yaHa to jarUra/RBVB HE/VFM........
In this example the adverb 'jarUra' is joined with 'HonA' to express
the sense 'to be sure'. So 'jarUra' is marked as RBVB and 'HE' is the
main verb marked as VFM.
Kriyamula:
NVB Noun in kriya mula
(snAna/NVB karatA/VFM HE/VAUX)
(snAna/NVB karate/VJJ Hue/VAUX)
(snAna/NVB karake/VRB)
(snAna/NVB karane/VNN para/PREP)
JVB Adj in kriya mula
(lAla/JVB HotA/VFM HE/VAUX)
(pUrA/JVB HotA/VFM HE/VAUX)
(pUrA/JVB Hote/VRB Hue/VAUX)
(pUrA/JVB Hokara/VRB)
(pUrA/JVB Hone/VNN para/PREP)
RBVB Adv in kriya mula
In case there is such a usage with xxxx
(xxxx/RBVB HotA/VFM HE/VAUX)
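The kriyamula patterns listed above share one shape: a word tagged NVB, JVB or RBVB directly followed by a verb form. A minimal sketch of detecting that shape follows. It is not taken from IIIT (2007); the names are hypothetical, and the set of verb tags that may follow is an assumption read off the listed examples (VFM, VJJ, VRB, VNN).

```python
KRIYAMULA_TAGS = {"NVB", "JVB", "RBVB"}
# Assumption: verb tags seen after a kriyamula word in the examples.
VERB_TAGS = {"VFM", "VJJ", "VRB", "VNN"}


def parse_tagged(text):
    """Split 'word/TAG word/TAG' into (word, tag) pairs (tag None if absent)."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("/")
        pairs.append((word, tag) if sep else (token, None))
    return pairs


def find_kriyamulas(pairs):
    """Return (kriyamula_word, verb_word) for each adjacent matching pair."""
    found = []
    for (w1, t1), (w2, t2) in zip(pairs, pairs[1:]):
        if t1 in KRIYAMULA_TAGS and t2 in VERB_TAGS:
            found.append((w1, w2))
    return found


print(find_kriyamulas(parse_tagged("snAna/NVB karatA/VFM HE/VAUX")))
print(find_kriyamulas(parse_tagged("pUrA/JVB Hote/VRB Hue/VAUX")))
print(find_kriyamulas(parse_tagged("yaHa to jarUra/RBVB HE/VFM")))
```

Each result pairs the nominal, adjectival or adverbial part with the verb it combines with, so the two can be treated as one predicate later on.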
NLOC Noun Location
This is an entirely new tag introduced to cover an important phenomenon
of Indian Languages. Words like 'Age', 'upara', 'pahele', 'bAda', etc.
are used in various ways in Hindi.
1. They act as a postposition along with 'ke'
e.g. ghade ke upara thAlI rakhI HE.
("pot" "on" "plate" "kept" "is")
Here 'ke upara' is a post position which is the direct equivalent of
the English preposition 'on'.
2. They also act as adverbs.
e.g. tuma upara jAo.
("You" "up" "go")
Here 'upara' is an adverbial of place.
3. These words also take post positions themselves and so in some sense
behave like nouns.
e.g. vaHa upara se AyA.
("He" "above" "from" "came")
4. As pointed out in 3. above, these words take postpositions and act
as arguments of the verb in the sentence. And they also take a post
position to join with another noun. So in that sense also they behave
like nouns.
e.g. upara kA HissA
("above" "of" "portion")
To tag such words one option is to tag them according to the category
to which they belong in the given sentence. For example in 1. above,
the word is occurring as a postposition so can be marked as a
postposition. In example 2. above, it is an adverb so can be marked as
an adverb and so on.
But we feel that these words are more like nouns as is evident from 3.
and 4. above, and also if we consider, for example, 'aage', 'upara',
etc. as places which are in front, up, etc then we can tag them as
nouns.
But these are not pure nouns. They are nouns which indicate a location
or time. These also function as adverbs or prepositions in a context.
So a new tag NLOC is introduced for such words. This tag will cater to
a finite set of such words.
set: (Age, piche, upara, nIce, bAda, pahele)
("front", "behind", "above", "below", "after", "before")
If such words are tagged according to their syntactic function, it will
hamper machine learning. So a single tag, NLOC, has been devised for
such words which indicate location and time.
e.g., (upara, Age, pahele, bAda)
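Because NLOC covers a finite set of words and ignores their syntactic function, it can be assigned by plain lexicon lookup. The sketch below is illustrative only and not part of IIIT (2007): the names are hypothetical, the word list is the set given above in that exact spelling, and words outside the set are simply left untagged (None).

```python
# The finite NLOC set as listed in the guidelines (spelling as given there).
NLOC_WORDS = {"Age", "piche", "upara", "nIce", "bAda", "pahele"}


def tag_nloc(words):
    """Tag listed words as NLOC regardless of context; others get None."""
    return [(w, "NLOC" if w in NLOC_WORDS else None) for w in words]


# The same word gets the same tag in the postposition, adverb and
# noun-like uses described in points 1 to 3.
for sentence in ("ghade ke upara thAlI rakhI HE",
                 "tuma upara jAo",
                 "vaHa upara se AyA"):
    print(tag_nloc(sentence.split()))
```

A context-independent lookup like this is what keeps the tag stable across uses, which is the property the guidelines argue helps machine learning.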
VFM Verb Finite Main
The entire verb category has been dealt with differently in this
tagger. The following discussions explain how the verbal category has
been dealt with.
The VFM tag is a modification of the VB tag of Penn tagset. Main verb
of a finite verb group of a sentence is considered as VFM. Whether the
form of the particular word is finite or non-finite it will be tagged
as VFM.
E.g. laDZakA seba khA/VFM raHA thA.
("boy" "apple" "eating" "was")
e.g. (vaHa "pItA" HE, vaHa laDaZkA "HE")
("he drinks", "he boy is")
VNN Verb Non-Finite Nominal
In the Penn tagger, VBG is used for gerunds, participles and
progressive verb forms. But this tagger will mark gerunds as VNN. This
distinction is being made in order that constructions like 'pInA', etc
can be accounted for.
e.g. sharAba pInA/VNN seHata ke liye KAnikAraka HE.
("liquor" "drinking" "health" "for" "harmful" "is")
e.g., (pInA) ("drinking")
negative VNNN :telugu
NN Noun
Penn tagset makes a distinction between noun singular and noun plural.
As mentioned earlier, this distinction is avoided here. This reduces
the number of tags and thus enhances machine learning. Plurality is not
crucial information with respect to dependency level parsing or any
other higher level analysis of the sentence. As said before if that
information is needed at a later stage it can be incorporated with the
help of heuristics and linguistic rules.
e.g., (laDaZkA, nadI, vicAra, kaThoratA)
("boy", "river", "thought", "hardness")
QFNUM Quantifiers Number
No distinction will be made between cardinal and ordinal numbers. Any
word denoting numbers will be tagged as QFNUM. Penn tagset has a tag CD
for cardinal numbers and they have not talked of ordinals!
e.g. (tIsarA, tInoM, tIna)
("third", "three"(oblique), "three")
All the tags used in this tagger are broadly classified into three
types. Some tags have been adopted from the Penn tagset with minor
changes; they form the first group. The second
category of tags is of those which are a modification over the Penn
tagset. The last group is of all those tags which are not present in
the Penn tagset. They have been designed to cater to some phenomena which
are specific to Indian languages.
PREP Postposition
All Indian languages have the phenomenon of postpositions. Some
languages separate the post positions from the noun e.g. Hindi. In such
a case, a postposition will be marked as PREP.
For example in Hindi, kheta/NN meM/PREP ("the field"/NN "in"/PREP),
here meM is the postposition and is written separately from the noun.
So it will be tagged as PREP.
But in Marathi (another Indian language), mulAne/NN("boy by"/NN), here
the postposition is written along with the noun. So it will not be
tagged separately.
This tag is the same as the IN tag used for prepositions in Penn
tagset. But it has been adopted for a parallel concept in this tagger.
Postpositions of Indian languages have more or less the same functions
as prepositions in English.
The same tag is used by Penn tagset for subordinating conjuncts also.
They feel that subordinating conjuncts and prepositions can be
distinguished because subordinating conjuncts are followed by a clause
and prepositions by a noun phrase. But as pointed out earlier, in this
tagger all conjuncts have been clubbed under the tag CC.
e.g. (ne, ke/PREP liye/PREP)
("by", "for")
PRP Pronoun
Penn tagset makes a distinction between personal pronouns and
possessive pronouns. This distinction is avoided here. All pronouns are
marked as PRP. In Indian languages all pronouns inflect for all cases
(accusative, dative, possessive etc.). If we had a separate tag for
possessive pronouns, new tags would have to be designed for all the
cases. This would increase the number of tags unnecessarily. So,
only one tag is used for all pronouns.
e.g. (jo, vo, vaHa,"jisa" laDaZke ne, jisane)
("who", "that", "he", "the boy who", "by whom")
NNP Proper Nouns
This tag is also similar to the Penn tagset. Here too we have not made
a distinction between Proper Noun singular and Proper Noun plural as in
the Penn tagset.
e.g. (rAma, bhAjapA)
(Ram, BJP)
All tags in this group are similar to the Penn tagset. Penn tagset
makes finer distinction between singular and plural or comparative and
superlative forms, which is not considered in the current tagger. This
is in accordance with our policy about fineness and coarseness.
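The coarsening relative to the Penn tagset can be summarised as a many-to-one mapping. This is a hedged illustration, not a conversion table from IIIT (2007): the names `PENN_TO_IIIT` and `coarsen` are hypothetical, and only the distinctions explicitly described as dropped (number on nouns, degree on adjectives and adverbs, possessive pronouns, cardinal numbers) are included.

```python
# Penn tags whose finer distinction is dropped, mapped to the coarser tag.
PENN_TO_IIIT = {
    "NNS": "NN",     # plural common noun
    "NNPS": "NNP",   # plural proper noun
    "JJR": "JJ",     # comparative adjective
    "JJS": "JJ",     # superlative adjective
    "RBR": "RB",     # comparative adverb
    "RBS": "RB",     # superlative adverb
    "PRP$": "PRP",   # possessive pronoun
    "CD": "QFNUM",   # cardinal number
}


def coarsen(tag):
    """Map a Penn tag to its coarser counterpart; other tags pass through.

    Passing a tag through unchanged does not imply that it is a valid
    tag in this tagset; the table covers only the cases named above.
    """
    return PENN_TO_IIIT.get(tag, tag)


print([coarsen(t) for t in ("NNS", "JJS", "RBR", "PRP$", "NN")])
```

The mapping runs in one direction only: once number or degree is merged away, it has to be recovered by heuristics or rules at a later stage, as the guidelines note.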
QF Quantifiers
All quantifiers like kama, jyAdA, baHuta, etc. will be marked as QF. In
case these words are used in constructions like 'baHutoM/NN ne/PREP
jAne se inkAra kiyA'("many" "by" "to go" "refused") where it is a noun,
it will be marked as noun. Quantifiers of number will be marked as
below.
e.g. (jyAdA/QF, thoDA/QF, saba/QF, kama/QF, baHuta/QF)
("more", "little", "all", "less", "much")
QW Question Words
The Penn tagset makes a distinction between the wh words which act as
questions, as relative pronouns and as determiners. But in this tagger
all wh words (ka'kAra's in Hindi) will be tagged as QW. The reason is
that in Indian languages the category where 'wh' words act as
pronouns or determiners is not present. They all become pronouns like
'jo', 'jisne', etc. in Hindi.
e.g. The man who wrote a book ... (vaHa AdamI jisne kItAba likhI ... )
("that" "man" "who" "book" "wrote")
e.g. (kyA/QW, kEsA/QW)
("what", "how")
SYM Special Symbol
All those words which cannot be classified in any of the other tags
will be tagged as SYM. This tag is similar to the Penn 'SYM'. Also
special symbols like $, %, etc are treated as SYM. Since the frequency
of occurrence of such symbols is very low in Indian languages, no
separate tag is used for such symbols.
SYM Special: Not classified in any of the above