IBM-UB Data Set

IBM-UB Online and Offline Multi-lingual Handwriting Data Set

The Center for Unified Biometrics and Sensors (CUBS) at the University at Buffalo is releasing a new handwriting dataset to the research community. The IBM_UB dataset is a bi-modal (online and offline), multilingual corpus of ground-truthed handwritten documents. It contains a variety of handwritten content, ranging from pages of free-form cursive writing to forms, spontaneously written letters, and tables of words, isolated characters, and symbols. We expect this dataset to be a valuable resource for multilingual OCR development and for information retrieval (IR) applications.


The handwritten data in this corpus was originally collected on IBM's CrossPad™ device. The CrossPad™ was a portable digital notepad whose electronic pen produced real ink on paper while simultaneously capturing the online pen trajectories. Each handwriting sample was therefore available both as a hardcopy (offline) paper document and as online trajectory data in IBM's native format.

Researchers at the University at Buffalo (CUBS) have (a) converted the online data from IBM's native format into the InkML format, (b) scanned the hardcopy documents into 300 dpi grayscale images (PNG format), (c) developed visualization tools for the online data, and (d) established correspondence between the online and offline data and generated ground truth at different levels of granularity for a subset of the entire corpus.
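As a rough illustration of working with the converted online data, the sketch below reads pen strokes from an InkML document using only the Python standard library. It assumes that each `<trace>` element holds comma-separated samples whose first two values are explicit X and Y coordinates. The actual channel layout of the IBM_UB files is not described here, and the embedded sample is invented for the example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the W3C InkML specification.
INKML_NS = "{http://www.w3.org/2003/InkML}"

# Invented sample document; real IBM_UB files may declare other channels.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace>10 20, 11 22, 13 25</trace>
  <trace>40 20, 41 21</trace>
</ink>"""


def parse_traces(inkml_text):
    """Return a list of strokes, each a list of (x, y) float tuples.

    Assumption: every sample lists explicit values and the first two
    are X and Y. InkML's difference-encoded prefixes are not handled.
    """
    root = ET.fromstring(inkml_text)
    strokes = []
    for trace in root.iter(INKML_NS + "trace"):
        points = []
        for sample in (trace.text or "").split(","):
            values = sample.split()
            if len(values) >= 2:
                points.append((float(values[0]), float(values[1])))
        strokes.append(points)
    return strokes


strokes = parse_traces(SAMPLE)
for index, stroke in enumerate(strokes):
    print(f"stroke {index}: {len(stroke)} points")
```

Each stroke comes back as an ordered list of points, which is the form a simple plotting or visualization routine would typically consume.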

The current release of data comprises two sections: IBM_UB_1 and IBM_UB_2.

IBM_UB_1 (v1) contains free-form cursive handwritten pages in English. It contains 6654 pages of online data collected from 43 writers (4138 summary pages and 2516 query pages). It also contains 5934 pages of offline data collected from 41 writers (3714 summary pages and 2220 query pages).

  • The documents in the data set are characterized by summary text pages and corresponding query text pages
  • The summary text contains one or two pages of writing on a particular topic
  • The query text contains approximately 25 words that encapsulate the summary text
  • Each summary-query pair is labeled with a unique ID
  • Ground truth information is available for the online query text documents at the word level and for summary text documents at the page level
  • A page level correspondence between the online and offline documents has been established
  • A visualization tool for the online data is also being released
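The summary-query structure described above can be sketched as a grouping of page records by their shared ID. The record fields (`pair_id`, `kind`, `page`) and the sample values are hypothetical, chosen only for illustration; the dataset's real naming and metadata conventions may differ.

```python
from collections import defaultdict

# Hypothetical page records; real IDs and field names may differ.
pages = [
    {"pair_id": "0001", "kind": "summary", "page": 1},
    {"pair_id": "0001", "kind": "summary", "page": 2},
    {"pair_id": "0001", "kind": "query", "page": 1},
    {"pair_id": "0002", "kind": "summary", "page": 1},
]


def group_pairs(records):
    """Group page numbers by pair ID, split into summary and query pages."""
    pairs = defaultdict(lambda: {"summary": [], "query": []})
    for record in records:
        pairs[record["pair_id"]][record["kind"]].append(record["page"])
    return dict(pairs)


def incomplete_pairs(pairs):
    """Return the IDs that lack either a summary page or a query page."""
    return sorted(
        pair_id
        for pair_id, parts in pairs.items()
        if not parts["summary"] or not parts["query"]
    )


pairs = group_pairs(pages)
missing = incomplete_pairs(pairs)
print(pairs["0001"])
print("incomplete:", missing)
```

A check like `incomplete_pairs` is useful because the online and offline sets cover different numbers of writers and pages, so some pairs may be present in one modality only.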
  
 
IBM_UB_2 contains handwritten pages in French collected from 200 authors. The pages are in the form of booklets, each of which has several typed lines that the author reproduces by hand. These lines contain short cursive sentences or discrete characters, symbols, and/or digits.

For this dataset:

  • Ground truth information is available at the line level
  • A page level correspondence between the online and offline documents has been established

Additional data sets will be released as more sections of the corpus are processed and prepared. Future releases will include data for other Latin-script languages (Italian and German) and evaluation tools for both online and offline recognition. An online word recognizer is being developed to establish baseline performance on the online data set and will be released under an open-source license.

Individuals and organizations interested in the data for research purposes will need to execute this sub-license agreement with UB before the data is released to them.

References

A. Shivram, C. Ramaiah, S. Setlur, and V. Govindaraju. IBM_UB_1: A Dual Mode Unconstrained English Handwriting Dataset. In Proc. of the 12th International Conference on Document Analysis and Recognition (ICDAR), pages 13-17, 2013.

For further details, please contact:

Prof. Venu Govindaraju
SUNY Distinguished Professor, University at Buffalo
+1 716 645 1558

We are grateful to IBM for the donation of this valuable data and especially to Michael Perrone of IBM Research for his leadership in making this possible.

This data release was funded in part by a generous gift from Google Inc. We would like to thank Henry Rowley and Ashok Popat from Google for their support.