
Peter, thanks for the additional post and discussion. I really appreciate it.

I concur that there's some inconsistency/confusion about the "archival" nature of Google Books. Courant, at the lunch session, implied that he could replace the "acid-paper" books with the scans provided by Google -- in other words, he incorrectly conceives of them as archival. It was at that point that I started asking technical questions about the nature of the scans themselves.

You are correct, of course, that Google hasn't released any average specs, and my statement there was a bit off-the-cuff. Our images are on the whole higher resolution and somewhat better-looking (heh, more sloppy statements) than what Google is serving online (there are many ways to suck whole pages out of Google for comparison; the resolution is not that great and the mistakes are many). They may have higher resolution up their sleeves, but because they rely on secret magical algorithms to unwarp their pages, it will never be possible to have as much confidence in the absolute quality of their scans relative to our flattened pages or even flatbed scans. There are numerous examples on various blogs showing the mistakes and problems introduced by their image-processing approach. They are software people, solving problems with software; we're operating on the principle that good input will equal good output, and the more problems we can solve physically, the less effort we have to expend on software. Not to mention that with every improvement in sensor and software technology (still coming hot and fast in cheap consumer cameras), our scanners improve. We're a long way from my junk scanner proof-of-concept, and we've only had a place to chat since June.

I can't speak for James, but I can speak for the people who've built their own scanners. As I said in my talk, we have all been burned by entities like Google/Amazon/Elsevier/Apple/etc and instead of complaining or buying into another broken system (also coming hot and fast right now), we are working on a very physical level to create a future of books that suits our needs. This future, practically, starts with accessible, affordable hardware, which prior to this year was simply unavailable, but is now unlikely to disappear. The whole reason that we have this settlement is Google's incredible rush to get GBS running before anyone else can make a competing project; in that sense, we represent the (unrepresented) long view that we have more than 5 short years and more than one legal system and continent with which to solve this problem. If you accept that on a long enough timeline (or with revision of the law), copyright on everything will expire, then even infringing uses of scanning systems become a public benefit eventually.

I think an interesting recent example is the abandonment of GeoCities by Yahoo!. They simply don't want to maintain that system anymore -- and made no effort to preserve it. While there are people working day and night to save the archives, Yahoo! is not among them. It may be difficult to imagine Google doing such a thing, but it is not out of the realm of possibility. Having multiple copies of everything, in different hands, under different copyright systems should help ensure the lasting preservation and accessibility of knowledge.

Recall for a minute that the whole conference was diverted to potential future GBS settlement outcomes -- we were all talking about a settlement that we all knew wasn't going through. In that sense, talking about a future of books that doesn't acknowledge or revolve around the GBS settlement made a lot more sense -- might have been a lot more germane -- than might have been immediately apparent.

Daniel, thanks for responding. The DIY Bookscanner project is interesting enough that I am going to do an off-topic post just on it, but let me address two of your points about the conference.

First, I didn't ask you about quality metrics at the conference because I don't think they are significant for Google books. We know that Google book scans are not what the library community has defined as a preservation-quality scan. Others, though, seem to feel that the less-than-desired quality of Google scans is an indictment of the whole project. I was surprised that these scanning purists did not speak up.

I am curious, though, how you can claim that your scans are as good or better than Google's. As far as I know, Google has never released average specifications on its scans, and I would think that the quality of your scans would be heavily dependent on the cameras and lights used.

You write: "My idea was to focus on the substantial non-infringing uses of the scanner and to point out that Google (and publishers, and more) are doing nothing to address the needs of many people." You did a great job of expressing this idea. I liked in particular your examples of groups that are using the scanner to digitize records and other materials that would otherwise remain inaccessible. This is great stuff.

What I found puzzling was the reason why James Grimmelmann wanted this to be part of the program. Was he suggesting that these types of activities should be regulated by the Google Settlement as well? (I hope not!) Was he trying to remind us that there is a world of information that lies outside the settlement? (Well, d'uh!) Or was he trying to suggest that with hardware and software applications similar to yours, some people were going to make copies of their own books (and possibly infringe copyright by doing so) while other people might digitize and post copies of the books to places like http://www.scribd.com? In short, what are the implications, if any, of a DIY book scanning ethos for the Google Book Settlement?

The DIY Book Scanners our community has built produce images that are, at worst, on par with Google Books and at best far superior. Unlike the Google scanners, which rely on software "magic" to make good images from fast scans of distorted pages, we flatten the pages with glass or acrylic and light them evenly and carefully. The result is a great-looking scan that OCRs well. The resolution is somewhat less than a flatbed scanner's (for large books), but for novels and small formats it is far greater. I don't know what you mean by a "face-up" scanner, but some of our scanners (like my folding scanner) approach the output quality of commercial scanners like the Atiz units, and are over 20x faster than flatbed scanners, which break bindings and take hours (a point I addressed in my talk).
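The resolution tradeoff described above is simple arithmetic: divide the camera's pixel count along the page's long edge by the page length in inches. A minimal sketch, using illustrative numbers I am assuming here (a 10 MP camera with 3648 px on the long side), not measurements from any particular scanner:

```python
def effective_dpi(sensor_px_long_edge, page_inches_long_edge):
    """Effective scan resolution when one camera frame covers one page:
    pixels along the page's long edge divided by its length in inches."""
    return sensor_px_long_edge / page_inches_long_edge

# A 6x9 in novel page filling the frame of a 3648 px sensor:
novel_dpi = effective_dpi(3648, 9)    # ~405 DPI, above a typical 300 DPI flatbed pass
# An 11x14 in folio page with the same camera:
folio_dpi = effective_dpi(3648, 14)   # ~261 DPI, below what a flatbed delivers
```

This is why a camera-based scanner beats a flatbed on small formats but falls behind on large ones, and why improvements in cheap consumer sensors translate directly into better scans.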

I didn't discuss quality during my presentation because I didn't have time, though you may have heard me talking tech with Dan Clancy at the microphone about image quality. Anticipating questions about how good our scans are, I brought some image quality samples in my slide deck, but no one, including and perhaps especially you, asked about image quality. Odd indeed.

My idea was to focus on the substantial non-infringing uses of the scanner and to point out that Google (and publishers, and more) are doing nothing to address the needs of many people. I felt this outlined a clear need for a DIY scanner beyond the crass, obvious (and oft-repeated) potential pirate uses which exist in every camera, scanner, and copier in existence. In fact, at this point in time I'm faintly annoyed that many people's first reaction is to call our project piracy -- like Dan Clancy said about Google Books, this scanner is its own Rorschach test. Digitizing your home library is one pedestrian use. Our community is already serving the disabled, helping out people struck by natural disasters abroad, and more.

While my first scanner was made from trash, the point in making public that it was made from trash in three days was to show that producing good-enough scans is, at this point in time, trivial. The point of the new scanner, which was produced with the latest in high-technology (laser manufacturing equipment), was to show that it is reproducible, robust, transportable, and affordable to almost anyone. And all the plans and software are Free.

What more you might want after that is beyond me. You're right that many people will find it acceptable to re-buy broken, rights-limited ebooks from a third party like iTunes, but not all of us find that an acceptable outcome. We are proactively creating the tools to do it right.

As a former cataloger and someone who studied descriptive bibliography, I take good metadata very seriously. I don't expect good metadata from Google, or from Amazon (which I have seen combine information from several different versions of a book into one description), or from the Internet Archive (which does not, for example, report how it manipulates scans provided to it). I do expect libraries to provide access to good metadata, and so am greatly disappointed when library projects such as OCLC's WorldCat Local refuse to recognize individual differences between copies of books and instead mash all records together into a generic "master record."

I do not, therefore, expect Google Books ever to be a source for serious scholarship. If we want good metadata that would be useful to Paul Duguid or Geoffrey Nunberg (and we do), that needs to come from the libraries that are receiving copies of the scans Google is making. Critics who want to do serious scholarship should be using the Hathi Trust and not Google Books.

Mr. Hirtle's witty indifference to the massive metadata and scanning foul-ups in Google Books (and as others have noted, in Google Scholar as well) -- which despite his quips he knows perfectly well to be orders of magnitude worse than anything in any of the catalogues of his or the other partner libraries -- would be more amusing if the errors weren't so deleterious to the use of the corpus for serious scholarship by the people whose interests he is ostensibly paid to serve. It would be more seemly if he could at least simulate some concern for the needs of his masters.
