Venice Italian Treebank (VIT) – version 2

Full Official Name: Venice Italian Treebank (VIT) – version 2
Submission date: Sept. 1, 2022, 1:03 p.m.

The Venice Italian Treebank (VIT) is the result of a collaboration among people working at the Laboratory of Computational Linguistics (LCL) of the University of Venice in the years 1995-2005. It is partly the result of annotation carried out internally with no specific project in mind and no financial support. This work was partly related to the development of a lexicon, a morphological analyzer, a tagger, and a deep parser of Italian. All these resources were finally ready at the beginning of the '90s, when the LCL got involved in the first national projects.

This is a new release of the Venice Italian Treebank (VIT). It consists of the Written and Spoken VIT subsets. Both subsets are also made available in Penn Treebank format using parentheses, and in a slightly modified version using brackets that allows web-based visualization tools to build a tree of the structure.

1) Written VIT: The corpus currently contains 223,292 tokens excluding punctuation, or 280,641 single tokens including enclitics and punctuation. It contains a fully revised constituency-based representation of the corpus as well as three new files:

a. vitorthograph_numb: this file contains the orthographic text of the 10,195 sentences of the written treebank (the one for the spoken treebank is already available and does not need updating). Every sentence is marked by the same identifier found in the other files.

b. vitdepstructs: this file contains the conversion of the original constituency-based VIT into dependency structures. It was more recently used to produce the Universal Dependencies (UD) version, which, however, had to be totally revised in order to comply with the UD tagging schemes. As a result, most important information was deleted from the UD version but remains available in this original version. The deleted information concerns the labeling of all non-canonical structures, which highlights dislocated, discontinuous and displaced grammatical functions with specialized labels taken from linguistic theory. These labels are:
- LDC = Left Dislocated Complements
- S_DIS = Dislocated Subject
- S_TOP = Topicalized Subject
- S_FOC = Focalized Subject

c. vitfragment: dubbed VITfrag, this file contains the first 500 sentences, lightly decomposed into 511 shorter sentences by separating interrogative sentences and sentences marked as direct speech. Each newly separated sentence carries the same number as the preceding half sentence, with an "a" appended.

2) Spoken VIT: The spontaneous speech corpus of Regional Italian contains 60,000 words and was created in the years 1995-2005. It contains two subcorpora collected under two national projects: AVIP/API, the Italian counterpart of the English MapTask project (API being the continuation of the earlier AVIP project), and IPAR. In this new revised version, 425 new fully parsed turns were added, for a total of 3,973. The total count of sentences is now 5,851. The most important feature of the spoken VIT is the annotation of overlaps, which are numerous in the dialogues: each overlap is transposed to the position inside the turn in which it was produced. 965 overlaps were annotated, distributed over 4,000 turns. This subset consists of both the parsed version of the corpus in constituent structure, called "parse_spokenVIT", and the tokenized version of each sentence composing each turn, in a separate file called "spokenVIT sentences".
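
As an illustration of the relation between the parenthesized Penn Treebank style version and the bracketed version intended for web-based tree visualization, the following is a minimal Python sketch that reads one parenthesized tree and re-emits it with square brackets. It is not the procedure used to build the release: the function names (tokenize, parse, to_brackets), the category labels, and the example sentence are all hypothetical, and the exact layout of the VIT files is an assumption that may differ from the distributed data.

```python
import re

def tokenize(text):
    """Split a parenthesized tree into '(', ')' and label/word tokens."""
    return re.findall(r"\(|\)|[^\s()]+", text)

def parse(tokens):
    """Build a nested (label, children) tuple from the token list.

    Children are either sub-tuples (constituents) or plain strings (words).
    """
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced parentheses")
        pos += 1  # consume the closing ')'
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree

def to_brackets(tree):
    """Serialize a (label, children) tuple using square brackets."""
    label, children = tree
    parts = [label]
    for child in children:
        parts.append(to_brackets(child) if isinstance(child, tuple) else child)
    return "[" + " ".join(parts) + "]"

# Hypothetical example; the labels are illustrative, not the VIT tagset.
example = "(S (NP (PRON io)) (VP (V dormo)))"
print(to_brackets(parse(tokenize(example))))
```

Parsing into a nested structure, rather than simply swapping the delimiter characters, makes unbalanced input fail loudly and leaves room for further processing, for instance counting constituents that carry a given label.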

Right Holder(s)