Center for Unified Biometrics and Sensors: IBM Data Sets

The Center for Unified Biometrics and Sensors (CUBS) at the University at Buffalo is releasing a new handwriting dataset to the research community. The IBM_UB dataset is a bi-modal (online and offline), multilingual corpus of ground-truthed handwritten documents. It contains a variety of handwritten content, ranging from pages of free-form cursive writing to forms, spontaneously written letters, and tables of words, isolated characters, and symbols. We expect this dataset to be a valuable resource for multilingual OCR development and for information retrieval (IR) applications.

The handwritten data in this corpus was originally collected on IBM’s CrossPad™ device. The CrossPad™ was a portable digital notepad whose electronic pen produced real ink on paper while simultaneously capturing the online pen trajectories. Each handwriting sample was therefore available both as a hardcopy (offline) paper document and as online trajectory data in IBM’s native format.

Researchers at the University at Buffalo (CUBS) have (a) converted the online data from IBM’s native format into the InkML format, (b) scanned the hardcopy documents into 300 dpi grayscale images (PNG format), (c) developed visualization tools for the online data, and (d) established correspondence between the online and offline data and generated ground truth at different levels of granularity for a subset of the entire corpus.
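As a rough illustration of working with the converted online data, the sketch below reads pen strokes from an InkML document using Python's standard library. It assumes that each trace stores plain "x y" samples separated by commas and that the first two channels are X and Y; the actual channel layout of the IBM_UB files is not described here, so the function name parse_traces and the sample data are hypothetical and should be checked against the released files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C InkML specification.
INKML_NS = "{http://www.w3.org/2003/InkML}"


def parse_traces(inkml_text):
    """Return a list of strokes; each stroke is a list of (x, y) floats.

    Assumption: samples are explicit values ("x y, x y, ..."), with X and Y
    as the first two channels. Extra channels (e.g. time) are ignored.
    """
    root = ET.fromstring(inkml_text)
    strokes = []
    for trace in root.iter(INKML_NS + "trace"):
        points = []
        for sample in (trace.text or "").split(","):
            fields = sample.split()
            if len(fields) >= 2:
                points.append((float(fields[0]), float(fields[1])))
        strokes.append(points)
    return strokes


# Hypothetical miniature InkML document, not taken from the corpus.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 20, 11 22, 13 25</trace>
  <trace>30 40, 31 41</trace>
</ink>"""

strokes = parse_traces(SAMPLE)
```

Each stroke corresponds to one pen-down to pen-up movement, which is the unit most online recognizers operate on.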

The current release comprises two sections: IBM_UB_1 and IBM_UB_2.

IBM_UB_1 contains free-form cursive handwritten pages in English collected from 43 writers. It contains 6677 pages each of online and offline data.

The documents in the data set are characterized by summary text pages and corresponding query text pages.
The summary text contains one or two pages of writing on a particular topic.
The query text contains approximately 25 words that encapsulate the summary text.
Each summary-query pair is labeled with a unique ID.
Ground truth information is available for the online query text documents at the word level and for summary text documents at the page level.
A page-level correspondence between the online and offline documents has been established.
A visualization tool for the online data is also being released.
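To give a sense of what viewing online data involves, the sketch below draws pen strokes as polylines on a small character grid. It is not the visualization tool released with the corpus; the function name rasterize, the grid size, and the assumption that the y coordinate increases downward are all illustrative choices.

```python
def rasterize(strokes, width=40, height=12):
    """Draw strokes as polylines on a width x height character grid.

    strokes: list of strokes, each a list of (x, y) points.
    Assumption: y grows downward, so row 0 is the top of the page.
    """
    grid = [[" "] * width for _ in range(height)]
    xs = [x for stroke in strokes for x, _ in stroke]
    ys = [y for stroke in strokes for _, y in stroke]
    if not xs:
        return ["".join(row) for row in grid]

    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0  # avoid division by zero
    span_y = (max(ys) - min_y) or 1.0

    def to_cell(x, y):
        # Scale a pen coordinate into a (column, row) grid cell.
        col = round((x - min_x) / span_x * (width - 1))
        row = round((y - min_y) / span_y * (height - 1))
        return col, row

    for stroke in strokes:
        cells = [to_cell(x, y) for x, y in stroke]
        for col, row in cells[:1]:
            grid[row][col] = "#"  # covers single-point strokes
        for (c0, r0), (c1, r1) in zip(cells, cells[1:]):
            # Linear interpolation between consecutive samples.
            steps = max(abs(c1 - c0), abs(r1 - r0), 1)
            for i in range(steps + 1):
                col = c0 + round((c1 - c0) * i / steps)
                row = r0 + round((r1 - r0) * i / steps)
                grid[row][col] = "#"
    return ["".join(row) for row in grid]


# Hypothetical stroke: a single diagonal line.
rows = rasterize([[(0, 0), (10, 10)]], width=5, height=5)
print("\n".join(rows))
```

The same scaling step could be reused to overlay online strokes on the corresponding scanned page, provided the two coordinate systems are aligned first.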

IBM_UB_2 contains handwritten pages in French collected from 200 authors. The pages are organized as booklets, each of which has several typed lines that the author reproduces by hand. These lines contain short cursive sentences or discrete characters, symbols, and/or digits.

Additional data sets will be released as more sections of the corpus are processed and prepared. Future releases will include data for other Latin-script languages (Italian and German) and evaluation tools for both online and offline recognition. An online word recognizer is being developed to establish baseline performance on the online data set and will be released under an open-source license.

Prof. Venu Govindaraju
SUNY Distinguished Professor, University at Buffalo
+1 716 645 1558
govind@buffalo.edu

We are grateful to IBM for the donation of this valuable data and especially to Michael Perrone of IBM Research for his leadership in making this possible.

This data release was funded in part by a generous gift from Google Inc. We would like to thank Henry Rowley and Ashok Popat from Google for their support.

IBM-UB Online and Offline Multi-lingual Handwriting Data Set