OLiA Discourse Extensions

Ontologies of Linguistic Annotation. Machine-readable tagsets and annotation schemata for more than 100 languages.

The OLiA Discourse Extensions extend the Ontologies of Linguistic Annotation (OLiA) with respect to discourse features. The OLiA ontologies provide a terminology repository that can be used to facilitate the conceptual (semantic) interoperability of annotations of discourse phenomena as found in important corpora available to the community.

As of Nov 2024, the OWL2/DL Reference Model of the OLiA Discourse Extensions is fully integrated with the OLiA Reference Model. A standalone legacy ontology (using the original URL schema, synchronized and linked with the OLiA Reference Model, but marked as deprecated) can be found at http://purl.org/olia/discourse/olia_discourse.owl

IMPORTANT NOTE: After the integration with the OLiA Reference Model, the ontology http://purl.org/olia/discourse/olia_discourse.owl will remain available and will retain the original URI schema, BUT

  • its concepts are declared deprecated,
  • it will be automatically generated from modularized/discourse.owl,
  • and it may be overwritten at any time without further notice.

DO NOT apply changes to the original olia_discourse.owl, as these will be overwritten, too. (And update your links to http://purl.org/olia/olia.owl as soon as possible.)
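To find places that still point at the deprecated ontology, a small scan along the following lines may help. This is only a sketch: the legacy URL is the one given above, while the sample data, the concept name `SomeConcept` and the function name are hypothetical. It only reports references and does not rewrite them, since this page does not say whether concept names are unchanged in the integrated Reference Model.

```python
import re

# Legacy ontology URL as given in the note above.
LEGACY_URL = "http://purl.org/olia/discourse/olia_discourse.owl"
# Match the bare URL, optionally followed by a '#fragment' naming a concept.
LEGACY_PATTERN = re.compile(re.escape(LEGACY_URL) + r"(#[A-Za-z0-9_.\-]+)?")


def find_legacy_references(text):
    """Return (line_number, matched_uri) pairs for every legacy reference."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in LEGACY_PATTERN.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


# Hypothetical sample data; 'SomeConcept' is a placeholder, not a real concept.
sample = """\
@prefix disc: <http://purl.org/olia/discourse/olia_discourse.owl#> .
@prefix olia: <http://purl.org/olia/olia.owl#> .
<x> a <http://purl.org/olia/discourse/olia_discourse.owl#SomeConcept> .
"""

for line_number, uri in find_legacy_references(sample):
    print(f"line {line_number}: {uri}")
```

Each reported line is a candidate for manual review against http://purl.org/olia/olia.owl.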

Background

The OLiA Discourse Extensions extend the Ontologies of Linguistic Annotation (OLiA) with respect to discourse features. The OLiA ontologies provide a terminology repository that can be used to facilitate the conceptual (semantic) interoperability of annotations of discourse phenomena as found in important corpora available to the community, including the RST Discourse Treebank and the Penn Discourse Treebank. Note that the current ontologies were chosen to represent typical phenomena; they are by no means exhaustive with respect to the available corpora.

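Discourse phenomena considered here include discourse structure and discourse relations, coreference (including bridging), information status, and information structure.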

The OLiA ontologies do not currently cover dialogue structure, Gricean and Post-Gricean pragmatics, speech act theory, or annotation schemes developed on this basis. In a broad sense, these can be regarded as discourse phenomena, as the distinction between discourse and pragmatics is largely underdefined.

Instead, we follow a pragmatic distinction based on the types of available annotations: we restrict ourselves to the annotation of text (hence, no dialogues), with a particular focus on theories of discourse structure and discourse relations (in the sense of Rhetorical Structure Theory or Segmented Discourse Representation Theory) and on the frequently annotated phenomena most often discussed in this context (hence, anaphora, information status and information structure). Further extensions are, however, envisioned.

At the moment, the OLiA ontologies cover 9 annotation schemes for the annotation of coreference, information status, information structure and discourse structure for a broad variety of languages. So far, 8 of these are provided on this site. A full publication is planned for Jan 2014. Each of these annotation schemes is formalized as a self-contained OWL/DL ontology (Annotation Model) and declaratively linked (via a Linking Model) to an ontology that provides a generalized vocabulary for discourse annotation (Reference Model). For the latter, we currently provide two ontologies that will subsequently be integrated with the OLiA Reference Model (cf. the provisional linking with the OLiA Reference Model).
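The division of labour between Annotation Model, Linking Model and Reference Model can be pictured with plain dictionaries. This is a minimal sketch, not the OWL/DL formalization itself: all tags and concept names (`cond`, `ann:Condition`, `ref:ConditionRelation`, and so on) are hypothetical placeholders, and the lookup stands in for what a reasoner would derive from subclass axioms.

```python
# Annotation Model: maps a scheme-specific tag to a scheme-specific concept.
# All names below are hypothetical placeholders, not actual OLiA identifiers.
ANNOTATION_MODEL = {
    "cond": "ann:Condition",
    "expl": "ann:Explanation",
}

# Linking Model: declares which Reference Model concepts each
# Annotation Model concept specializes.
LINKING_MODEL = {
    "ann:Condition": {"ref:ConditionRelation"},
    "ann:Explanation": {"ref:CausalRelation"},
}


def reference_concepts(tag):
    """Resolve a tag to Reference Model concepts via the Linking Model."""
    concept = ANNOTATION_MODEL.get(tag)
    if concept is None:
        return set()  # unknown tag: nothing to resolve
    return set(LINKING_MODEL.get(concept, set()))


print(reference_concepts("cond"))
```

Keeping the linking separate from the Annotation Model means a scheme can be re-linked to a revised Reference Model without touching the scheme's own ontology.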

PS: Note that this site is currently being updated; the publication of further Annotation Models and an update of the Reference Models are in preparation.

Discourse Structure Annotation

Different theories of discourse structure have emerged over the past decades, and different models of annotation have been developed accordingly, as illustrated in the figure below for two alternative annotations of the same sentence.


Fig. 1. Comparing Discourse Structure Annotations (RST Discourse Treebank and Penn Discourse Treebank, file wsj_1365, simplified)

With traditional annotation schemes, these annotations can hardly be related to each other, because different structures are involved: the RST annotation is based on trees, whereas the PDTB annotation is based on relational structures. Yet, by grouping relations on the basis of the utterances they span, pairs of relations can be formed:

| PDTB span | PDTB relation | RST span | RST relation |
|---|---|---|---|
| (4)-(5) | CONTINGENCY.Condition.general | (4)-(5-7) | CONDITION |
| (2)-(4-7) | CONTINGENCY.Cause.result | (1-2)-(3-7) | EXPLANATION |
| (4-5)-(6) | EXPANSION.Alternative.chosen alternative | (5)-(6-7) | CONTRAST |
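One way to picture how such pairs could be formed is to compare the utterance spans of the two arguments of each relation. The sketch below is an assumption for illustration, not the procedure actually used to build the table: it scores each PDTB/RST combination by the summed Jaccard overlap of the two argument spans and keeps the best-scoring RST relation. The spans and labels are taken from the table; the function names are made up.

```python
def units(spec):
    """Expand a span such as '5-7' into the set {5, 6, 7}."""
    start, _, end = spec.partition("-")
    return set(range(int(start), int(end or start) + 1))


def jaccard(a, b):
    """Overlap of two utterance sets, between 0.0 and 1.0."""
    return len(a & b) / len(a | b)


# (first argument span, second argument span, relation label)
PDTB = [
    ("4", "5", "CONTINGENCY.Condition.general"),
    ("2", "4-7", "CONTINGENCY.Cause.result"),
    ("4-5", "6", "EXPANSION.Alternative.chosen alternative"),
]
RST = [
    ("4", "5-7", "CONDITION"),
    ("1-2", "3-7", "EXPLANATION"),
    ("5", "6-7", "CONTRAST"),
]


def best_match(relation, candidates):
    """Pick the candidate whose argument spans overlap most (heuristic)."""
    arg1, arg2, _ = relation

    def score(candidate):
        return (jaccard(units(arg1), units(candidate[0]))
                + jaccard(units(arg2), units(candidate[1])))

    return max(candidates, key=score)


pairs = [(pdtb[2], best_match(pdtb, RST)[2]) for pdtb in PDTB]
for pdtb_label, rst_label in pairs:
    print(f"{pdtb_label}  <->  {rst_label}")
```

For the three relations listed above, this heuristic yields the same pairing as the table; it is not guaranteed to do so for other files.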

Yet, within their annotation schemes, the relations cannot be directly compared: for example, the PDTB CONDITION and the RST CONDITION have different connotations in terms of discourse structure (no restrictions in PDTB, a hierarchical structure in RST). Using the ontologies, these aspects of information can be disentangled (common properties in terms of the Reference Model in bold):


Fig. 2. Comparing CONDITIONs
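The comparison can be thought of as a set operation over Reference Model concepts. The following is a rough sketch under stated assumptions: the concept names `ref:Condition` and `ref:HierarchicalStructure` are hypothetical stand-ins for whatever the Linking Models actually assert, and only the contrast described above (no structural restriction in PDTB, a hierarchical structure in RST) is taken from the text.

```python
def compare(first, second):
    """Split two sets of Reference Model concepts into shared and specific parts."""
    return {
        "shared": first & second,
        "only_first": first - second,
        "only_second": second - first,
    }


# Hypothetical Reference Model concepts for each scheme's CONDITION.
pdtb_condition = frozenset({"ref:Condition"})
rst_condition = frozenset({"ref:Condition", "ref:HierarchicalStructure"})

result = compare(pdtb_condition, rst_condition)
print("shared:", sorted(result["shared"]))
print("RST only:", sorted(result["only_second"]))
```

The shared part corresponds to what both schemes agree on; the remainder makes the scheme-specific assumptions explicit.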

Discourse Ontologies

| Model | Description | Phenomenon | OWL/DL models |
|---|---|---|---|
| Reference Model | Reference Model fragment for discourse structure and discourse relations, to be integrated with the OLiA Reference Model | discourse structure, discourse relations, information structure, information status, coreference | OLiA Discourse Extensions Model, Provisional Reference Model linking |
| RST Annotation Model | Annotation Model for RST (http://www.sfu.ca/rst/, English, French, Portuguese, Spanish) | discourse structure, discourse relations | Annotation Model, Linking Model |
| RSTDTB Annotation Model | Annotation Model for the RST Discourse Treebank (English, Wall Street Journal) | discourse structure, discourse relations | Annotation Model, Linking Model |
| PDTB Annotation Model | Annotation Model for the Penn Discourse Treebank (English, Wall Street Journal), also applicable to PDTB derivatives for Turkish, Hindi, Italian and Chinese | discourse relations | Annotation Model, Linking Model |
| PDGB Annotation Model | Annotation Model for the Penn Discourse Graphbank (English, incl. Wall Street Journal) | discourse relations | Annotation Model, Linking Model |
| Knott Annotation Model | Annotation Model for the Knott (1996) discourse cue taxonomy (not used for corpus annotation but for cue word classification; such discourse cues have been annotated in PDTB, though) | discourse (relation) marker taxonomy | Annotation Model, no Linking Model yet |

## Anaphora, Information Status and Information Structure {#is}

Whereas discourse structure and discourse relations are particularly relevant to the global structure of a discourse, the phenomena considered here are manifested within the utterance and reflect the influence of the surrounding (especially the preceding) discourse.

Information Structure deals with the structure of utterances in terms of what kind of information they provide: the topic is the part of an utterance that it is *about*, the focus provides *new information* about the topic, and the two are often (but not exclusively) seen as a dichotomy. The terminology of information structure has, however, not yet converged to the point where fully compatible definitions are in use.

Information Status refers to the degree to which an entity is familiar ('given') to the hearer, as reflected in the choice of referring expressions: a given referent is often realized as a pronoun (although it does not have to be), whereas an unknown referent is usually introduced by an indefinite NP, its full name, a longish description or a marked construction such as an 'indefinite *this*'. Different realization options are available between these extremes, e.g., definite descriptions and names with different degrees of informativity and complexity. Information Status annotation aims at classifying referring expressions accordingly.

A given referent is typically anaphorically anchored in the preceding text, i.e., it co-refers with another expression that was previously mentioned. Coreference annotations aim at marking these anaphoric links between markables in the text. However, a referent does not have to be explicitly mentioned beforehand to be familiar to the hearer; bridging inferences from a related entity (trigger) in the preceding text may be sufficient, and anaphora annotation has accordingly been extended to bridging annotations.

For both anaphora annotation and information structure/status annotation, different schemes have been developed, some of which are formalized here together with a generalizing Reference Model fragment.

### IS Ontologies {#is.ontology}

| Model | Description | Phenomenon | OWL/DL models |
|---|---|---|---|
| Reference Model | Reference Model fragment, to be integrated with the OLiA Reference Model | discourse structure, discourse relations, information structure, information status, coreference | OLiA Discourse Extensions Model, Provisional Reference Model linking |
| CRC632 | Annotation Model for the corpora of the Collaborative Research Center (SFB) 632, "Information Structure" (Potsdam, Berlin, Germany), applied to various typologically different languages | information structure, information status | Annotation Model, Linking Model |
| DIRNDL | Annotation Model for the DIRNDL corpus (German, spoken language) | information status, coreference | Annotation Model, Linking Model |
| PoCoS | Annotation Model for the Potsdam Coreference Scheme, applied to English, German and Russian | coreference | Annotation Model, Linking Model |
| ARRAU | Annotation Model for the ARRAU corpus (English) | coreference, bridging | Annotation Model, Linking Model |
| TüBa-D/Z | Annotation Model for the TüBa-D/Z corpus (German) | coreference | Annotation Model |