ISLRN

CAMIO Transcription Languages

Full Official Name: CAMIO Transcription Languages

Submission date: Dec. 14, 2022, 6:52 p.m.

Introduction: CAMIO Transcription Languages was developed by the Linguistic Data Consortium and contains nearly 70,000 images of machine-printed text with corresponding annotations and transcripts in the following 13 languages: Arabic, Chinese, English, Farsi, Hindi, Japanese, Kannada, Korean, Russian, Tamil, Thai, Urdu, and Vietnamese. This corpus is a subset of data created for a broader effort to support the development and evaluation of optical character recognition (OCR) and related technologies for 35 languages across 24 unique script types. The CAMIO (Corpus of Annotated Multilingual Images for OCR) collection was designed to address gaps in language and script coverage in existing corpora and to support future evaluation of OCR capabilities through a systematically constructed data set.

Data: Most images were annotated for text localization, resulting in over 2.3M line-level bounding boxes. For the 13 languages represented in this release, 1,250 images per language were also annotated with orthographic transcriptions of each line and a specification of reading order, yielding over 2.4M tokens of transcribed text. The resulting annotations are represented in a comprehensive XML output format defined for this corpus. The script for each language is indicated in parentheses: Arabic (Arabic), Chinese (Simplified), English (Latin), Farsi (Arabic), Hindi (Devanagari), Japanese (Japanese), Kannada (Kannada), Korean (Hangul), Russian (Cyrillic), Tamil (Tamil), Thai (Thai), Urdu (Arabic), and Vietnamese (Latin). Data for each language is partitioned into training, validation, and test sets.

Creator(s)

Michael Arrigo

Stephanie Strassel

Christopher Caruso

Distributor(s)

Linguistic Data Consortium

Right Holder(s)

Portions © 2007, 2015, 2017-2020 1399 picofiles, © 2015-2019 65tes-habeshamusic.com, © 2019-2020 Accessify.com, © 2019-2020 Adobe, © 2013, 2019-2020 Alamy Ltd., © 2010-2011, 2019-2020, Amazon.com, Inc. or its affiliates, © 2008, 2018-2019 ambebi.ge, © 2000, 2019-2020 A Medium Corporation, © 2019-2020 App Annie, © 2019 AppKiwi, © 2014, 2019 Armenian News - Tert.am, © 2012-2014, 2018-2019 ARMENPRESS, © 2002, 2006, 2008, 2010-2011, 2013-2014, 2019-2020 Assimba.org, © 2011-2019 Atv - Eritrean Satellite Television, © 2016-2017 AtYourService.pk, © 2018-2019 Aysor, © 2019 Bag, © 2002-2003, 2009-2019 Baidu, © 2017-2019 Bangla sms bengali shayari, © 2019 bbcode0.com, © 2014, 2019-2020 Benawa Network, © 2002, 2012-2019 Bennett Coleman & Co. Ltd., © 2013-2019 Best TV, © 2000, 2019-2020 BigCommerce Pty. Ltd., © 2019 Bnet Technologies, © 2017 BONDHU2U, © 2011, 2015-2018 BuzzFeed, Inc., © 2016-2017, 2019 CBSEPORTAL.COM, © 2019-2020 cinejosh.com, © 2019 Civic Network OPORA, © 2010, 2018-2019 Cli

Status: Accepted

ISLRN:

014-810-264-834-8

Version

1.0

Source

https://catalog.ldc.upenn.edu/LDC2022T07

Resource Type

Primary Text

Media Type

Image

Text

Language(s)

Arabic

Chinese

English

Hindi

Japanese

Kannada

Korean

Persian

Russian

Tamil

Thai

Urdu

Vietnamese

Access Medium

Web Download