Annotation Model for the IL-POSTS tagset, a pan-Indian annotation scheme (Baskaran et al. 2008), primarily applied to Bangla, Hindi, Kannada, Malayalam, Marathi, Sanskrit, Tamil and Telugu. Unless marked otherwise, all comments refer to Baskaran et al. (2008).
"There are four main language families found in India, viz., Austro-Asiatic, Dravidian, Indo-Aryan and Tibeto-Burman, of which Dravidian and Indo-Aryan (IA) form the largest group of languages spoken in the sub-continent. This framework concentrates on Dravidian and IA language families for two main reasons: (i) practical issues of manageability, (ii) the fact that of the 22 official languages in India a large majority belonged to these two language families. However, the detailed linguistic analysis and discussions that led to the design of this framework leads us to believe that it is broad enough to cover Indian Languages from the other language families as well."
Sankaran Baskaran, Kalika Bali, Tanmoy Bhattacharya, Pushpak Bhattacharyya, Monojit Choudhury, Girish Nath Jha, Rajendran S., Saravanan K., Sobha L., and KVS Subbarao (2008), A Common Parts-of-Speech Tagset Framework for Indian Languages, In Proceedings of LREC 2008, pp. 1331-1337, http://www.lrec-conf.org/proceedings/lrec2008/pdf/337_paper.pdf
The function of a word offers significant clues for the subsequent stages of processing (e.g. parsing) and hence cannot be missed. Here, we adopt a balance between the form and the function of a word in a systematic and consistent way. Based on our analysis, when a word is morphologically derived from other words then we propose to tag them by their function. In all other cases, the words are tagged by their form. Thus, for example, the infinitive form of the Hindi verb 'रहना' [rahanaa] 'to stay', is tagged as verbal noun in the example 'रहने का कमरा चाहिए' [rahane kaa kamaraa chaahiye] '(subj) needs a room to stay' (Literally, 'staying room is needed').
However, in 'मुझे होटल में रहना है' [mujhe hotal me rahanaa hai] 'I want to stay in a hotel' it is marked as a main verb with an infinitive attribute.
Similarly, in Tamil, the same form of a verb 'பாடு' [paadu] 'sing' in 'பாடும் பறவை' [paadum paRavai] 'singing bird', is tagged as relative participle, but in 'பறவை பாடும்' [paRavai paadum] 'bird will sing', is tagged as a verb.
Apparently, "Case-marker" refers to specific cases rather than to the markers themselves: its subcategories are named accordingly, hence "ErgativeCase" instead of "ErgativeCaseMarker". However, this is not explicitly stated in the document.
According to its usage, CaseMarker is a feature of postpositions, but also of nouns. It thus seems to conflate two aspects: the case governed by an adposition and the morphological case of nouns. (Chiarcos)
IL-POSTS has a hierarchical layout of decomposable tags with three levels in the hierarchy viz., categories, types (subcategories) and attributes (features). ...
Attributes are morphosyntactic features of Types. All attributes are optional, though in some cases they may be recommended. Further, Special extensions to attributes provide for features to be specified for future use that are not covered in the currently defined list of attributes. These can be generic attributes that may be needed for a special purpose including those outside the scope of morphosyntax, and language-specific attributes that may be applicable to only a very small group of or even a single language(s).
In the ontology, these special extensions have not been adopted as they were not further discussed in the paper. (Christian Chiarcos)
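The three-level layout described above (category, type, optional attributes) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions: the class name DecomposableTag, the attribute name "Form", and the dotted label format are hypothetical and are not taken from Baskaran et al. (2008) or from the ontology. Only the idea of a main verb carrying an infinitive attribute comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DecomposableTag:
    """Hypothetical container for one IL-POSTS-style tag (all names are illustrative)."""
    category: str                 # level 1: coarse word class
    type_: str                    # level 2: subcategory of the category
    attributes: Dict[str, str] = field(default_factory=dict)  # level 3: optional features

    def levels(self) -> Tuple[str, str, Dict[str, str]]:
        """Decompose the tag into its three hierarchy levels."""
        return (self.category, self.type_, dict(self.attributes))

    def label(self) -> str:
        """Render a flat label; the dotted format is an assumption, not IL-POSTS notation."""
        parts = [self.category, self.type_]
        parts += [f"{name}={value}" for name, value in sorted(self.attributes.items())]
        return ".".join(parts)


# Attributes are optional, so a tag without any is still well-formed.
bare = DecomposableTag("Verb", "Main")
# "Form" / "Infinitive" are placeholder names for the infinitive attribute of a main verb.
marked = DecomposableTag("Verb", "Main", {"Form": "Infinitive"})

print(bare.label())
print(marked.label())
```

Keeping attributes in an optional mapping mirrors the statement that all attributes are optional. The special extensions are deliberately not modelled here, in line with their omission from the ontology.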