Digitizing Books - A Newer Way
Gentle readers, I've neglected you for all the other interesting things that life has offered this sunny (and in Alberta - snowy!) May. My apologies. There are a variety of things to tell you about, but I'll confine each to its own blog post so that you've got something to come back for.
I have a particular fascination with things that can serve two (opposite) purposes. Today - those things are CAPTCHAs.
When a website wants to give the public full access to some portion of its content, but none to the e-mail-harvesting and content-harvesting computer robots, clever designers now use CAPTCHAs (Completely Automated Public Turing* test to tell Computers and Humans Apart). Since a robot can't read the content of an image, it gives up in frustration. You, dear reader, I hope, are not turning away frustrated.
Now, in a 21st-century sense of efficiency, these little typing strokes are being employed in aid of making history more accessible. CBC reported this morning that CAPTCHAs are now being harnessed, converted from mustangs to workhorses, to help turn old books into digitized editions. How & why?
Why - I'll hint at it with the quotation from the end of the CBC article: "It's definitely a barn-raising to try to build the great library". That ought to get Mennonite attention!
Many books can now be scanned and converted to digitized versions (turned from images into searchable, editable, and formattable text) through a process known as Optical Character Recognition (OCR). Chances are that if you have a scanner at home, it has this amazing OCR capacity.
In OCR, each shape on the scanned page is assessed for its height, width, shape of edges, and blank spots to work out which character/number it likely represents. A new document is then created with the recognized character/number as text, until you have a full transliteration of the original document. Generally this works very well with an English-language source document that was originally word-processed/typed/typeset in a standard font like Courier, Times Roman, Arial, etc. I've had great success with this, getting well over 95% accuracy with contemporary documents on white paper that I wanted to manage in my computer (e.g. scanning in an obituary so that I can quote it in a family history).
But if the original is badly yellowed and the scan cannot easily pick out the edges of the characters/numbers from the background, or the paper is marked up with lines, or any of a bunch of other confounding elements are present, OCR will fail to be accurate - or simply fail.
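For the curious, here is a toy sketch in Python of the matching idea. It is not how any real scanner's OCR software is written: the three-letter alphabet, the tiny 3x5 bitmaps, the `recognize` function, and the 0.9 cut-off are all made up for illustration. It compares a scanned shape to each known character pixel by pixel, picks the closest, and gives up when even the closest is not close enough, which is roughly what happens with a smudged or yellowed page.

```python
# Toy 3x5 bitmaps for a hypothetical three-character alphabet ('#' = ink).
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def similarity(glyph, template):
    """Return the fraction of pixels on which two bitmaps agree."""
    cells = [
        (g, t)
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    ]
    return sum(g == t for g, t in cells) / len(cells)


def recognize(glyph, threshold=0.9):
    """Return (character, score); character is None if no template is close enough.

    The threshold is an arbitrary value chosen for this illustration.
    """
    best_char, best_score = max(
        ((ch, similarity(glyph, tpl)) for ch, tpl in TEMPLATES.items()),
        key=lambda pair: pair[1],
    )
    if best_score < threshold:
        return None, best_score
    return best_char, best_score


# A cleanly printed "T" matches its template exactly.
clean_t = ["###", ".#.", ".#.", ".#.", ".#."]
# The same letter with a stray speck and a faded stroke, as on an old page.
smudged_t = ["###", "##.", ".#.", "...", ".#."]

print(recognize(clean_t))
print(recognize(smudged_t))
```

The clean letter comes back as "T", while the smudged one falls below the cut-off and is reported as unrecognized. Those unrecognized shapes are exactly the leftovers that need a human eye.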
This is where the big boys are harnessing the CAPTCHA mustangs and making them workhorses. The big boys are partnering up with the folks who want old books digitized - and are putting scanned snippets of old documents into the CAPTCHAs that web visitors are happy to transliterate for access to interesting web content. Then they add that little transliteration to the previous ones ... and slowly build up a digitized version of the book.
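Here is a minimal Python sketch of how that slow build-up might work. The CBC story doesn't spell out the bookkeeping, so everything here is my assumption: the `BookTranscriber` class, the snippet numbering, and especially the rule that a word is only accepted once at least two visitors have typed the same thing.

```python
from collections import Counter, defaultdict


class BookTranscriber:
    """Collects what visitors type for each scanned snippet (hypothetical design)."""

    def __init__(self, min_votes=2):
        # Assumed rule: accept a word only after this many identical answers.
        self.min_votes = min_votes
        # snippet id -> tally of the answers typed for that snippet
        self.answers = defaultdict(Counter)

    def record(self, snippet_id, typed_text):
        """Add one visitor's answer for one snippet."""
        self.answers[snippet_id][typed_text.strip()] += 1

    def accepted_word(self, snippet_id):
        """Return the agreed word for a snippet, or None if there is no agreement yet."""
        votes = self.answers.get(snippet_id)
        if not votes:
            return None
        word, count = votes.most_common(1)[0]
        return word if count >= self.min_votes else None

    def text_so_far(self, snippet_ids):
        """Join accepted words in page order, marking unresolved snippets with [?]."""
        return " ".join(self.accepted_word(s) or "[?]" for s in snippet_ids)


book = BookTranscriber(min_votes=2)
book.record(0, "great")
book.record(0, "great ")
book.record(1, "library")
book.record(1, "1ibrary")  # one visitor misread the first letter
print(book.text_so_far([0, 1, 2]))

book.record(1, "library")  # a third visitor settles it
print(book.text_so_far([0, 1, 2]))
```

Asking for agreement before accepting a word means one sloppy typist can't spoil the book; the price is that each snippet has to be shown to more than one visitor.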
For today's CBC story, see: Web registration tool to encode books online. If the link fails for you (news stories do get deleted after a time), I've kept a copy of it for future use - I can send the full content as an attachment.
In kinship,
Judii
* "Turing" references another piece of computer history that would take us too far from our topic. If you'd like to read up on it, see: Wikipedia: Turing test
