OLiA Annotation Model for the morphosyntactic annotations of the Penn Chinese Treebank (PCTB)
"This document is designed for the Penn Chinese Treebank Project [XPX+00]. The goal of the project is the creation of a 100-thousand word corpus of Mandarin Chinese text with syntactic bracketing. The annotation consists of two stages: the first phase is word segmentation and part-of-speech (POS) tagging and the second phase is syntactic bracketing. ... We have chosen syntactic distribution as the main criterion for our POS tagging because it complies with the principles adopted in contemporary linguistics theories, such as the notion of head projections in the X-bar theory and the GB theory." (Xia 2000, p.4f)
Unless specified otherwise, all comments are quotes from Xia (2000).
Fei Xia (2000), The Part-Of-Speech Tagging Guidelines for the Penn Chinese Treebank (3.0), version of October 17, 2000, http://www.cis.upenn.edu/~chinese/posguide.3rd.ch.pdf
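The examples in this document write each token as word[gloss]/TAG, where the gloss and the tag are both optional. The following is a minimal sketch of a reader for that notation. The names `Token` and `parse_example` are hypothetical, and the sketch assumes that tags consist of uppercase letters and that the translation in angle brackets has been removed beforehand.

```python
import re
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    word: str
    gloss: Optional[str]  # text inside [...], if any
    tag: Optional[str]    # text after "/", if any


# word, then an optional [gloss], then an optional /TAG.
# Glosses may contain spaces or slashes, so the string is scanned
# with finditer instead of being split on whitespace.
_TOKEN = re.compile(
    r"(?P<word>[^\s\[\]/]+)"
    r"(?:\[(?P<gloss>[^\]]*)\])?"
    r"(?:/(?P<tag>[A-Z]+))?"
)


def parse_example(text: str) -> List[Token]:
    """Split an example string into (word, gloss, tag) triples."""
    return [
        Token(m.group("word"), m.group("gloss"), m.group("tag"))
        for m in _TOKEN.finditer(text)
    ]
```

Cases where the tag itself is written in brackets (e.g., ?[VC]) come out as glosses in this sketch and would need separate handling.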
Adverb: AD
The adverb is a big class. It includes manner adverbs, frequency adverbs, degree adverbs, conjunctive adverbs, and so on. The behaviors of adverbs differ a lot. The main function of most adverbs is to modify a VP or an S.
Ex: ??[still], ?[very], ?[most], ??[greatly], ?[again], ?[approximately].
ba3 in ba-construction: BA
This only includes ? and ? when they occur in the ba-construction (i.e., NP0 + BA + NP1 + VP). For example, ?[he] ?/BA ?[you] ?[cheat] ?/AS $\langle$He cheated you$\rangle$.\\
Note: ? has other tags: AD and VV (e.g., ?[he] ?[check]/VV ?[AS] ?[I] ?[DEG] ?[king] $\langle$ (In chess) My king is in check by him$\rangle$).
Note: Whether bei4 (?) and ba3 (?) are prepositions or verbs is highly controversial. In the tagging stage, we don't make any commitment to that. That's the reason why we tag them as LB, SB, and BA, respectively, rather than tagging them as P or VV.
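Because BA, LB, and SB are kept apart from P and VV, a user who prefers one of the two analyses can map the tags afterwards. The following is a minimal sketch of such a mapping, assuming tokens are (word, tag) pairs. The function name and the data layout are hypothetical and not part of the guidelines.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]  # (word, PCTB tag)

# Tags that the guidelines keep neutral in the preposition-vs-verb debate.
NEUTRAL_TAGS = {"BA", "LB", "SB"}


def collapse_neutral_tags(tokens: Iterable[Token], target: str) -> List[Token]:
    """Retag BA/LB/SB as `target` ("P" or "VV"); other tokens are unchanged."""
    if target not in {"P", "VV"}:
        raise ValueError("target must be 'P' or 'VV'")
    return [(word, target if tag in NEUTRAL_TAGS else tag) for word, tag in tokens]
```

For instance, `collapse_neutral_tags(sentence, "P")` yields the preposition analysis, while `"VV"` yields the verb analysis.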
Cardinal Number: CD
It includes cardinal numbers (optionally followed by ???[approximate number indicators] such as ?[over/odd], ?[odd], and ??[over]) and words such as ??[some], ??[several], ?[half], ??[many], ??[many] (e.g., ??[many] ??[student]).
Ex: 1245, ??[a hundred].
de5 as a complementizer or a nominalizer: DEC
This only includes ? and ? when they function as a complementizer or a nominalizer (e.g., ?[eat] ?/DEC).
The pattern is: S/VP DEC \{NP\}.
Note: ? also has other tags:
(i) DEG: ?[he] ?/DEG ?[car] $\langle$his car$\rangle$.
(ii) SP: ?[he] ?/VC ??[definitely] ?[must/should] ?[come] ?/SP $\langle$He should definitely come$\rangle$.
(iii) AS: ?[he] ?/VC ?[at] ??[here] ?[get off] ?/AS ?[car] $\langle$It was here that he got off the bus$\rangle$.
Conjunctions: CC, CS
Note: the words that are called ??[connective words] in traditional Chinese grammar books are tagged as CC, CS, or AD according to their syntactic distribution. CC conjoins two equivalent constituents (noun phrases, clauses, etc.) of the same function, whereas CS precedes a subordinating clause. Conjunctive adverbs often appear in the main clause and pair with a subordinating conjunction (e.g., ??[if]/CS ... ?[then]/AD).
'And': ? (also P [with]), ? (also P[with]), ? (also P[with]), ? (also P[with]), ?, ??, ? (also AD), ??, ? (?[big]/VA ?[and]/CC ?[complete]/VA), ??, ?.
'Or': ?, ??, ?? (e.g., ?[go] ??[or]/CC ?[not] ?[go], also AD).
Paired-CCs:
?[both]/CC .. ?[and]/CC, ?[both]/CC .. ?[and]/CC , ??[not only]/CC ... ??[but also]/CC.
(Like similar collocations in English such as ``either ... or'' and ``both ... and'', it is possible that the first words in the pairs are not CCs. These words are consistently marked as CCs.)
Others:
(i) ?[to]: ???[1991] ?[to] ???[1995] \\
(ii) ?[to]: ??[January] ?[to]/CC ??[March] \\
(iii) ?[and]: ??[state affairs] ??[committee member] ?[and]/CC ??[committee of science and technology] ??[director]\\
(It could be argued that these words listed here are prepositions or verbs when they appear in the pattern ``YP X YP''. But the conversion of the POS tags and the corresponding structures are pretty easy.)
Copula: VC
The words ?[be] and ?[be] are tagged as VC. ? is also tagged as VC if it means ?[not] ?[be] and there is no other verb in the sentence.
The word ?[be] has several usages:
(i) Link two NPs/Ss: ?[he] ?[be]/VC ??[student] $\langle$He is a student$\rangle$,
(ii) In cleft-sentences: ?[he] ?/VC ??[yesterday] ?[come] ?/SP $\langle$It was yesterday that he came$\rangle$.
(iii) For emphasis: ?[he] ?[VC] ??[enjoy] ?[see] ?[book]$\langle$He does enjoy reading books$\rangle$.
Currently, in all these cases the word is tagged as VC.
ETC
The tag is used for the word ? and ??. Two patterns are:
(i) XP ? NP: ??[science and technology] ???[culture and education] ?/ETC ??[area].
(ii) XP ?/??: ??[science and technology] ???[culture and education] ??/ETC.
Foreign Word: FW
FW is used to tag foreign words. FW excludes the translations of foreign words. It also excludes the words that have mingled with Chinese words (e.g., ??OK[karaoke]/NN, A?[type A]/NN). It also excludes words whose meaning and POS are clear from the context. We should avoid the tag as much as possible. It is used only when the POS tag is not clear from the context.
de5 as a genitive marker and an associative marker: DEG
This only includes ? and ? when they function as a genitive marker or an associative marker.
The pattern is: NP/PP/JJ/DT DEG \{NP\}. Note: ? has other tags: DEC, SP, and AS.
Localizer: LC
Many nouns alone cannot be the argument of prepositions such as ?[at] and ?[until] or modify VP/S directly. One function of localizers is to attach to the preceding NP/S so that the whole phrase can act as the argument of those prepositions or modify VP/S.
Some localizers can stand alone as the arguments of the prepositions/verbs. Some localizers can be modified by ?[the most]. Localizers cannot be modified by Det+M.
Localizers are of two types:
(1) fan1wei4ci2 (???): this type of localizer denotes direction, location and so on. They come from nouns. Some can stand alone as the arguments of the prepositions/verbs. Some can be modified by ?[the most]. They cannot be modified by Det+M.
(1.a) mono-syllabic localizers: e.g., ?[before], ?[after], ?[in], ?[out], ?[in], ?[north], ?[east], ?[side], ? [side], ?[end/bottom], ?[between], ?[end], ?[next to].
(1.b) bisyllabic localizers: they are formed by
(1.b.i) mono-syllabic localizers plus morphemes such as ?, ? etc., e.g., ??[between], ??[to the north of].
(1.b.ii) two mono-syllabic localizers, e.g. ??[around], ??[around], ??[or so], ??[northeast].
(2) others: we tag the following as LCs. (We could choose to mark some of them as verbs, but this will complicate the bracketing annotation.)
(2.a) ??[until]: ?[at] ??[present] ??[until] $\langle$until now$\rangle$.
(2.b) ??[starting from]: ?[from] ??[April] ??[starting from] $\langle$starting from April$\rangle$.
(2.c) ?[ever since]: 5 ?[year] ?[ever since] $\langle$in the past five years$\rangle$.
(2.d) ??[since]: 1998?[1998] ??[since] $\langle$since 1998$\rangle$.
(2.e) ?[since]: ?????[1993] ?[since] $\langle$since 1993$\rangle$.
(2.f) ??[inside]: ??[include] ?[he] ??[inside] $\langle$including him$\rangle$.
Manner de5: DEV
This only includes ? when it occurs in ``XP ? VP'', where XP modifies the VP. In some old literature, ? is used in this pattern too. In that case, we will tag that ? as DEV.
Ex: ??[happy]/VA ?/DEV ?[speak]/VV $\langle$speak happily$\rangle$.
There are about 130 measure words (Ms) in our corpus:
(i) classifiers: ?, ?, ?, ?, ?, ?, ?, ?, ?, ?.
(ii) unit: ?[ton], ??[kilometer], ????[square kilometer].
(iii) currency: ??[Mark], ??[Australian dollar].
(iv) compound measure word: ??[number of people], ??[number of flights], ??[row].
(v) unit of time: ?[year], ?[date], ?[second], ??[minute] (We tag them as Ms because no measure words can be inserted between them and the preceding CDs. According to the same test, we tag ??[hour] and ?[month] as NNs.)
A noun can be an argument of a predicate or a preposition. In general,
(i) Nouns cannot be modified by degree and negation adverbs such as ?[very] and ?[not].
(ii) Many nouns can be modified by Det+M structure.
(iii) Nouns can modify nouns directly (i.e., without ?/DEG).
If a word is the head of an NP, it is tagged as a noun. (In this Treebank, we assume the head of a NP is a noun, not a classifier or a determiner.)
other noun-modifier: JJ
JJs include the following three types:
[Type 1]``???''(?????): They modify nouns in the pattern JJ+?+{N} or JJ+N, but they cannot be the predicate of a sentence without the help of ?. They cannot be modified by degree adverbs.
The patterns: JJ + ?/DEG + N, JJ+N.
Ex: ??[mutual]/JJ \{?/DEG\} ??[goal]/NN, ?[she] ?[VC] ?[female]/JJ ?/DEG $\langle$She is a woman.$\rangle$
[Type 2] ``hyphenated-compound'': Those words can be seen as shortened forms of relative clauses or preposition phrases. The words normally have two syllables. One (or both) is a shortened form of a longer word. The common POS combinations for this type of JJ are V+N, P+N/LC, AD+VA, and so on.
The pattern: JJ+N.
Ex: ??[having studied in the US]/JJ scholar/NN.
[Type 3] adjectives: ?[new]/JJ ??[news]/NN.
The pattern: JJ+N.
Ex: ?[new]/JJ ??[news]/NN.
Note: when ?/DEC is inserted between the adjective and the noun, the adjective is tagged as VA.
We tag an adjective X in X+N as JJ because:
(i) Sometimes the distinction between adjectives and non-predicative adjectives in X+N is not clear.
(ii) Unlike in the predicative position, the adjectives in the noun modifier position cannot be modified by adverbs (e.g., ?[good] ??[student], *??[very good] ??[student]).
(iii) Many adjectives cannot occur in this position; that is, without ?/DEC they cannot modify nouns (e.g., ?[this] ?[M] ??[student] ?[very] ??[disappointed], * ??[disappointed] ??[student]).
Onomatopoeia: ON
The term ???[onomatopoeia], a word that imitates sounds, has been mentioned in several Chinese grammar books and POS tagsets. However, it is not clear to us whether those words form a unique syntactic category. We have not found any occurrence of ???[onomatopoeia] in our 100K-word Treebank; nevertheless, we reserve the tag ON for this type of word. The following are some patterns in which an ON can occur:
(i) modify VPs in the pattern ``ON ? V'': ?[rain] ??[ON] ?[DEV] ?[fall down] ?[AS] ?[one] ?[night] $\langle$The rain has been pouring down for the whole night$\rangle$.
(ii) modify NPs in the pattern ``ON ? N'': ?[ON] ?/DEG ??[a sound] $\langle$Bang$\rangle$!
(iii) form a sentence by itself: ??[ON]! ??[in the house] ??[spread] ?[two] ?/M ??[gunfire] $\langle$Bang! Bang! Two sounds of gunfire spread out from the house$\rangle$.
(iv) ONs normally cannot be modified by adverbs, etc.
Ex: ???, ??, ?
Ordinal Number: OD
Ordinal numbers (???) are tagged as ODs. We treat ?+CD as one word, and tag it as OD.
Ex: ???[the one hundredth], ??[the first], ?[the first]
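The rule that the ordinal prefix plus a CD counts as one OD word can be sketched as a merge over a token sequence. This is a minimal sketch, assuming tokens are (word, tag) pairs in which the prefix has been segmented separately. The function name is hypothetical, and the prefix is passed in as a parameter rather than hard-coded.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, PCTB tag)


def merge_ordinals(tokens: List[Token], prefix: str) -> List[Token]:
    """Merge `prefix` followed by a CD token into a single OD token."""
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        next_is_cd = i + 1 < len(tokens) and tokens[i + 1][1] == "CD"
        if word == prefix and next_is_cd:
            # prefix + CD is treated as one word and tagged OD
            merged.append((word + tokens[i + 1][0], "OD"))
            i += 2
        else:
            merged.append((word, tag))
            i += 1
    return merged
```

A prefix that is not followed by a CD is left untouched, so the function is safe to apply to text that has already been merged.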
Other Noun: NN
NN includes all other nouns. NNs, except the ones for locations, normally cannot modify VPs with or without ?/DEV.
Ex.:
Phrase-word: ??[one of] (??[purpose]/NN ??[one of]/NN $\langle$one of the purposes$\rangle$)\\
NN with N+LC structure: ??[domestic], ??[oversea].
The following MSPs (other particles) occur in our corpus:
(i) ?: ?/MSP ??[fortify]/VV ??[overall]/JJ ??[competitiveness]/NN ??[strength]/NN $\langle$so as to fortify overall strength of competitiveness$\rangle$\\
(ii) ?: ?[for]/P ??[survive]/VV ??[continue]/VV ?/MSP ???[have no choice but to]/VV ??[take]/VV ?/DEC ?? [action]/NN $\langle$the action which must be taken in order to survive$\rangle$\\
(iii) ?, ?: ?[use] ... ?/? ??[maintain] $\langle$use... to maintain$\rangle$\\
(iv) ?: ?[he] ? ??[need] ?[DEC] $\langle$The thing that he needs...$\rangle$\\
The following are not MSPs: ??[if so]/SP, ??[so that]/AD, ??[so that]/AD.
Other verb: VV
This includes the rest of the verbs, such as modals, raising predicates (e.g., ??[maybe, probably]), control verbs (e.g., ?[want], ?[want to]), action verbs (e.g., ?[walk]), psych-verbs (e.g., ??[like]/??[understand]/??[hate]), and so on.
(i) AD-like VV: ??[whether or not]/VV.
(ii) Phrase-word: ??[be present]/VV, ??[respond with]/VV, ??[scheduled for a duration of time]/VV, ??[be in a certain condition]/VV.
(iii) Words such as ??[this way] and ??[that way] are tagged as VVs when they are followed by AS, or when there is no other verb in that clause. For example, ?[then] ??[do...this way]/VV ?[SP] $\langle$Let's do it this way$\rangle$.
Predicative adjective (VA) roughly corresponds to adjectives in English and stative verbs in the literature on Chinese grammar.
Our VAs include two types:
Type 1: predicates that have no object and can be modified by ?[very].
Type 2: predicates derived from type 1 either through reduplication (e.g., ???[bright red]) or through the pattern N + A meaning ``as A as an N'' (e.g., ??[snow white]). VAs of this type do not take objects either, but some of them cannot be modified by ?[very], because the intensifying meaning is already built in.
There are about 350 VAs in our corpus, e.g., ??[inexpensive], ??[not bad], ??[convenient].
Note: when a word in set(VA) modifies N without ?[DEC], it is tagged as JJ or a noun, rather than as VA. When a word in set(VA) has an object, it is tagged as VV, rather than VA. For example, ?[this] ?/M ??[activity] ??[enrich]/VV ?/AS ?[he] ?/DEG ??[life] $\langle$This activity enriched his life$\rangle$.
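The context-dependent tagging of words in set(VA) can be summarized as a small decision function. This is only a sketch. The function name and the three boolean features are hypothetical, the order in which the two conditions are checked is an assumption, and "N" stands for whichever noun tag applies.

```python
from typing import Set


def tags_for_va_word(has_object: bool, modifies_noun: bool, dec_follows: bool) -> Set[str]:
    """Return the candidate tags for a word that belongs to set(VA).

    has_object    -- the word takes an object in this clause
    modifies_noun -- the word modifies a following noun
    dec_follows   -- a DEC stands between the word and that noun
    """
    if has_object:
        # A VA word with an object is tagged as VV.
        return {"VV"}
    if modifies_noun and not dec_follows:
        # Modifying N without DEC: JJ or a noun, never VA.
        return {"JJ", "N"}
    return {"VA"}
```

The choice between JJ and a noun tag is left open here because it depends on the separate tests for N and JJ.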
Preposition: P
A preposition can take a noun phrase or a clause as its argument.
Ex: ?[from], ?[to/for].
There are about 70 Ps in our corpus.
(i) VV-like prep: ??[through], ??[until], ??[about], ?[from]. \\
(ii) CS-like prep: ??[along with], ??[along], ??[due to], ??[except], ??[in order to] \\
(iii) AD-like prep: ?[on]/P (?[on]/P ??[system/mechanism]/NN ??[question/issue]/NN).
Note: words such as ?/BA and ?/LB-or-SB are not tagged as P. See OtherPOS for detail.
Proper Nouns (NRs) are a subclass of nouns.
An NR is a name of a particular person, politically or geographically defined location (cities, countries, rivers, mountains, etc.), or organization (corporate, governmental, or other organizational entity). A proper noun is usually unique and cannot be modified by a Det+M.
Ex: ???[Argentina], ??[Berlin], ???[Clinton].
The names of the following are NRs:
location: region/country/county/city, mountain/river,
organization: newspaper/journal, organization/company, school/association/foundation,
person: person/family.
The names of the following are NOT NRs:
nationality (e.g., ???[Chinese]), race (e.g., ??[Caucasian]), title (e.g., ??[professor]), disease, occupation, organ (e.g., ?[lung]), instrument (e.g., ??[piano]), game (e.g., ??[soccer]), flower (e.g., ??[rose]), etc.
Punctuation marks are tagged as PU. If they are part of other words, they are not tagged.
Ex: ??/NR ,/PU ??/NR ?[and]/CC ??/NR, \, 123,456/CD.
There are 31 PUs in our corpus: ?? \ ? \ ?\ ? \ ? \ ?
Resultative de5: DER
de5(?) is tagged as DER in potential form V-?-R, and in V-de construction (?[he] ?[run] ?/DER ?[very] ?[fast] $\langle$He runs very fast$\rangle$).
Note: Some collocations ending with ? are not V-de constructions. They are verbs (e.g., ??[remember], ??[gain]).
Sentence-final particle: SP
SP often appears at the end of a sentence. For example, ?[he] ?[good] ?[SP] $\langle$Is he OK?$\rangle$
Some of them can also be used for a pause. For example, ?[he] ?[SP], ?[people] ?[very] ?[good] $\langle$Speaking of him, he is a very nice guy$\rangle$.
There are 6 SPs in our corpus: ?, ?, ?, ?, ?, ?.
bei4 in short bei-construction: SB
This only includes ? and ? (in spoken language) when they occur in the short bei-construction (i.e., NP0 + SB + VP); for example, ?[he] ?/SB ?[scold] ?/AS ?[one] ?/M $\langle$He was scolded$\rangle$.
Note: ? has other tags: LB, VV, and P (e.g., ?[you] ?[to]/P ?[he] ?[write] ?/M ?[letter] $\langle$You should write a letter to him$\rangle$).
Note: Whether bei4 (?) and ba3 (?) are prepositions or verbs is highly controversial. In the tagging stage, we don't make any commitment to that. That's the reason why we tag them as LB, SB, and BA, respectively, rather than tagging them as P or VV.
Temporal Noun: NT
Temporal Nouns can be the objects of prepositions such as ?[at], ?[since], ?[until], or ??[until]. They can be referred to by ????[at this moment], and questioned by ????[when]. They can also modify VP/S directly. Like other nouns, NTs can be arguments of some verbs.
Temporal Nouns are either the names of the time (e.g., 1990?[1990], ?? [January], ??[Han Dynasty]) or formed by PN+LC, N+LC, DT+N.
Ex: ??[January], ??[Han Dynasty], ??[present time], ??[when], ??[from now on], 1990?[year 1990], ??[at last].
N+LC as a NT: ??[after the war]/NT, ??[before the contest]/NT, ??[from now on], ??[the other day]/NT, ??[when]/NT, ??[at present]/NT.\\
PN+LC as a NT: ??[afterwards]/NT
Normally, a verb satisfies the following:
(i) Verbs (except auxiliary verbs etc.) serve as the predicate of a clause (main clause or embedded clause).
(ii) Verbs can be negated by ?[not] or ?[not].
(iii) Aspect markers can be attached to most (but not all) verbs.
(iv) Most verbs can occur in A-not-A.
If a word w in set(V) is the head of an NP, it is tagged as N, not as V.
If w in set(V) is a noun modifier (excluding the case where the V is the head of a relative clause), it is tagged as N or JJ (according to the tests for N and JJ), not as V.
you3 as the main verb: VE
Only ?[have], ?[not have], ??[not have], and ?[not have] are tagged as VE when they are the main verbs (including the possessive you3, existential you3, etc.).
(The main reason why we assign those verbs with a new tag is that the treatment of existential sentences is controversial. Giving these verbs a different tag will make it easy to find the existential sentences in the corpus.)
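The retrieval use mentioned above can be sketched as a filter over tagged sentences. This is a minimal sketch, assuming each sentence is a list of (word, tag) pairs. The function name is hypothetical. Since VE also covers possessive you3, the result is a superset of the existential sentences, not an exact list.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, PCTB tag)
Sentence = List[Token]


def sentences_with_ve(sentences: List[Sentence]) -> List[Sentence]:
    """Return the sentences that contain at least one token tagged VE."""
    return [s for s in sentences if any(tag == "VE" for _, tag in s)]
```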