Microsoft and Google have been racing to digitize the books of the world. Over the past year, the battle of dueling press releases has seesawed back and forth as each has announced new agreements to digitize and index vast libraries.
The two companies' efforts differ in nature. Microsoft scans copyrighted material only if rights holders opt in to the service, while Google's project scans everything it can gain access to but provides only limited summaries and background for copyrighted materials whose rights holders haven't authorized full disclosure.
The latest two announcements have belonged to Google. In May, following a previously announced deal with the University of Lausanne to scan French-language documents, Google announced an ambitious international effort to add 800,000 texts stored in India at the University of Mysore. Some of the documents in the collection date back to the eighth century and are written in Sanskrit and Kannada. Many are handwritten, and digitizing them will require hand coding or specialized OCR (optical character recognition) techniques that Google built from a system called Tesseract. (Tesseract was developed at Hewlett-Packard over ten years, then abandoned and released as open source in 2005. Google latched on to the public code, dedicated some engineers to removing bugs and improving it, and is now using it for some of its book digitization projects.)
Today, coming back stateside, Google announced that twelve more universities, including the University of Chicago and the universities of the Big Ten conference (Ohio State, Michigan, Illinois, Indiana, Iowa, Michigan State, Northwestern, Penn State, Purdue, Wisconsin, and Minnesota), will also work with Google.
The newest schools to participate will join the University of Texas, Stanford, Harvard and the New York Public Library in Google's stable of library partners. Microsoft's Live Book Search Program has deals to work with Cornell University, the British Library, the University of Toronto and the University of California.
While Google seems to be signing up libraries more aggressively, it is facing a lawsuit by the Association of American Publishers and the Authors Guild over its plans to incorporate parts of copyrighted books (even though it plans to show only summaries or excerpts of copyrighted works). Elsewhere, Google is still fighting a huge copyright lawsuit from Viacom over the display of copyrighted videos on Google-owned YouTube.