How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs

Holley, Rose. How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine, 2009, vol. 15, n. 3/4. [Journal article (Unpaginated)]

English abstract

This article details the work undertaken by the National Library of Australia Newspaper Digitisation Program on identifying and testing solutions to improve OCR accuracy in large scale newspaper digitisation programs. In 2007 and 2008 several different solutions were identified, applied and tested on digitised material now available in the Australian Newspapers Digitisation Program beta service http://ndpbeta.nla.gov.au/ndp/del/home. This article gives a state-of-the-art overview of how OCR software works on newspapers, factors that affect OCR accuracy, methods of measuring accuracy, methods of improving accuracy, and testing methods and results for specific solutions that were considered viable for large scale text digitisation projects.

Item type: Journal article (Unpaginated)
Keywords: OCR accuracy, Optical Character Recognition, Historic Newspapers, OCR text correction
Subjects: L. Information technology and library technology > LZ. None of these, but in this section.
J. Technical services in libraries, archives, museum. > JG. Digitization.
Depositing user: Rose Holley
Date deposited: 30 Mar 2009
Last modified: 02 Oct 2014 12:13
URI: http://hdl.handle.net/10760/12908

