Resource: Penn Discourse Treebank Version 3.0

Date of Submission March 19, 2019, 4:53 p.m.
Status accepted
ISLRN 977-491-842-427-0
Resource Type Primary Text
Media Type Text
Language English
Format/MIME Type text/plain
Size 43056 KB
Access Medium Web Download


Penn Discourse Treebank (PDTB) Version 3.0 is the third release in the Penn Discourse Treebank project, the goal of which is to annotate the Wall Street Journal (WSJ) section of Treebank-2 (LDC95T7) with discourse relations. Penn Discourse Treebank Version 2 (LDC2008T05) contains over 40,600 tokens of annotated relations. In Version 3, an additional 13,000 tokens were annotated, certain pairwise annotations were standardized, new senses were included, and the corpus was subjected to a series of consistency checks. Details concerning the development of PDTB Version 3.0 can be found in the documentation accompanying this release.

Largely because the PDTB project was based on the idea that discourse relations are grounded in an identifiable set of explicit words or phrases (discourse connectives) or simply in the adjacency of two sentences, the PDTB has been used by many researchers in the natural language processing community and, more recently, by researchers in psycholinguistics. It has also stimulated the development of similar resources in other languages and domains.


Annotations are provided in the form of separate text files (standoff annotation) that are byte-indexed into the raw WSJ text files in Treebank-2. The raw WSJ files are also included in this release. All text files are plain text, encoded in UTF-8.
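As a rough illustration of how standoff annotation of this kind can be resolved, the sketch below slices a raw text buffer by byte offsets and decodes the result. It is only a minimal sketch under stated assumptions: the span syntax ("start..end" pairs joined by ";"), the end-exclusive offsets, and the function names parse_spans and extract_text are all hypothetical and chosen for illustration. The documentation accompanying the release is the authority on the actual file layout.

```python
from __future__ import annotations


def parse_spans(span_field: str) -> list[tuple[int, int]]:
    """Parse a span field into (start, end) byte-offset pairs.

    Assumed syntax (hypothetical): 'start..end' pairs separated by ';'.
    """
    spans = []
    for part in span_field.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate empty fields and trailing separators
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return spans


def extract_text(raw: bytes, spans: list[tuple[int, int]]) -> str:
    """Slice the raw bytes at each span and join the decoded pieces.

    Offsets are assumed to be end-exclusive byte positions. Slicing is done
    on bytes, not on a decoded string, so the offsets stay valid even when
    the text contains multi-byte UTF-8 characters.
    """
    pieces = [raw[start:end].decode("utf-8") for start, end in spans]
    return " ".join(pieces)


# Stand-in for the contents of a raw text file read in binary mode.
raw = "Prices fell because demand was weak.".encode("utf-8")

connective = extract_text(raw, parse_spans("12..19"))
argument = extract_text(raw, parse_spans("0..11;20..36"))
print(connective)
print(argument)
```

In practice the buffer would come from opening a raw file in binary mode (for example, open(path, "rb").read()) rather than from a literal string. Reading in text mode and indexing the decoded string would give character offsets, which differ from byte offsets wherever a character occupies more than one byte.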

This release includes two tools: (1) the Annotator, which is used for annotation and adjudication and can also be used for viewing the corpus; and (2) the Conversion Tool, which converts Version 2 annotation files into the Version 3 format.

The documentation directory contains a manual describing what is new in Version 3 and how Version 3 differs from Version 2; the methods and guidelines used in annotating PDTB Version 3; and a range of statistics on the tokens, including the frequency of each connective, its sense labels and its modifiers. More information about the corpus and research carried out by the developers and others using the corpus can be found on the PDTB website.


This work has been funded by the National Science Foundation, under grant NSF IIS 1422186 to the University of Pennsylvania and grant NSF IIS 1421067 to the University of Wisconsin, Milwaukee. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.

Version 1.0
Creator Bonnie Webber, Rashmi Prasad, Alan Lee, Aravind Joshi
Distributor Linguistic Data Consortium
Rights Holder Portions © 1987-1989 Dow Jones & Company, Inc., © 2008, 2012, 2019 The Penn Discourse Treebank Group, © 2008, 2012, 2019 Trustees of the University of Pennsylvania