[wordup] Digitizing Books One Word at a Time
Adam Shand
adam at shand.net
Sat May 26 19:47:44 EDT 2007
This is amazing. What a cool use of all that "wasted" time! I wonder
how many sites are actually getting their CAPTCHAs from reCAPTCHA? :-)
Adam.
Source: http://recaptcha.net/learnmore.html
A CAPTCHA is a program that can tell whether its user is a human or a
computer. You've probably seen them — colorful images with distorted
text at the bottom of Web registration forms. CAPTCHAs are used by
many websites to prevent abuse from "bots," or automated programs
usually written to generate spam. No computer program can read
distorted text as well as humans can, so bots cannot navigate sites
protected by CAPTCHAs.
About 60 million CAPTCHAs are solved by humans around the world every
day. In each case, roughly ten seconds of human time are being spent.
Individually, that's not a lot of time, but in aggregate these little
puzzles consume more than 150,000 hours of work each day. What if we
could make positive use of this human effort? reCAPTCHA does exactly
that by channeling the effort spent solving CAPTCHAs online into
"reading" books.
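As a quick sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. The constant and function names here are only illustrative; the two input numbers are the ones from the text.

```python
# Figures quoted in the text above.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10


def daily_hours(captchas_per_day, seconds_each):
    """Total human time spent per day, converted from seconds to hours."""
    return captchas_per_day * seconds_each / 3600


hours = daily_hours(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")  # prints "166,667 hours per day"
```

That works out to roughly 166,667 hours, which is consistent with the "more than 150,000 hours" claim.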
To archive human knowledge and to make information more accessible to
the world, multiple projects are currently digitizing physical books
that were written before the computer age. The book pages are being
photographically scanned, and then, to make them searchable,
transformed into text using "Optical Character Recognition" (OCR).
The transformation into text is useful because scanning a book
produces images, which are difficult to store on small devices,
expensive to download, and cannot be searched. The problem is that
OCR is not perfect.
[Attachment: sample-ocr.gif (image/gif, 10232 bytes)]
http://lists.spack.org/pipermail/wordup/attachments/20070526/5a5252d3/attachment.gif
reCAPTCHA improves the process of digitizing books by sending words
that cannot be read by computers to the Web in the form of CAPTCHAs
for humans to decipher. More specifically, each word that cannot be
read correctly by OCR is placed on an image and used as a CAPTCHA.
This is possible because most OCR programs alert you when a word
cannot be read correctly.
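The selection step could look something like the sketch below. It assumes the OCR output is available as (word, confidence) pairs and that a fixed cutoff decides what counts as unreadable. Both the data format and the 0.8 threshold are assumptions made for illustration, not details from the reCAPTCHA page.

```python
def words_needing_humans(ocr_words, threshold=0.8):
    """Return the words whose OCR confidence falls below the threshold.

    ocr_words: iterable of (word, confidence) pairs, confidence in [0, 1].
    The pair format and the default threshold are hypothetical.
    """
    return [word for word, confidence in ocr_words if confidence < threshold]


# Made-up sample data: two words the OCR engine was unsure about.
sample = [("the", 0.99), ("rnorning", 0.41), ("of", 0.97), ("upcn", 0.55)]
print(words_needing_humans(sample))  # prints ['rnorning', 'upcn']
```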
But if a computer can't read such a CAPTCHA, how does the system know
the correct answer to the puzzle? Here's how: Each new word that
cannot be read correctly by OCR is given to a user in conjunction
with another word for which the answer is already known. The user is
then asked to read both words. If they solve the one for which the
answer is known, the system assumes their answer is correct for the
new one. The system then gives the new image to a number of other
people to determine, with higher confidence, whether the original
answer was correct.
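The two-word check described above can be sketched roughly as follows. This is only an illustration of the idea, not reCAPTCHA's actual code: the function names, the case-insensitive comparison, and the rule of accepting a word after three matching answers are all assumptions (the text only says "a number of other people").

```python
from collections import Counter


def check_submission(control_answer, control_truth, unknown_answer, votes):
    """Count a guess for the unknown word only if the known word was solved.

    votes is a Counter of guesses collected so far for one unknown word.
    Returns True if the user passed the CAPTCHA.
    """
    if control_answer.strip().lower() != control_truth.strip().lower():
        return False  # failed the known word, so ignore the other guess
    votes[unknown_answer.strip().lower()] += 1
    return True


def resolve(votes, min_votes=3):
    """Return the leading guess once enough users agree, else None.

    min_votes=3 is an arbitrary assumption for this sketch.
    """
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


votes = Counter()
check_submission("morning", "morning", "upon", votes)   # counted
check_submission("mourning", "morning", "open", votes)  # rejected
check_submission("Morning", "morning", "upon", votes)   # counted
print(resolve(votes))  # prints None: only two matching answers so far
check_submission("morning", "morning", "upon", votes)   # counted
print(resolve(votes))  # prints upon
```

Requiring several independent users to agree is what lets the system trust an answer that no computer could verify on its own.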
Currently, we are helping to digitize books from the Internet Archive.