(Continues D is for Digitize: Day 2 Morning. Posted by Peter Hirtle)
The afternoon sessions presented less that was entirely new to me and my note-taking skills started to flag, so the notes below have less on the actual presentations and more commentary from me.
K is for Keynote
Pamela Samuelson and Paul Courant gave a very engaging lunch-time presentation. I can’t do a better job of summarizing it than LJ has, so I won’t try.
C is for Culture
The four speakers in the afternoon session turned away from legal issues and instead looked at some of the broader cultural issues associated with GBS. The always-entertaining Paul Duguid, for example, while praising the existence of the Google Books database, worried about the scanning and metadata problems he has uncovered in it. His argument (and Geoffrey Nunberg’s similar rant) has always struck me as a little odd for two reasons. First, it strikes me that Google has been able to replicate in a period of 5 or 6 short years almost all the cataloging errors that it has taken librarians over a century to accumulate. Anyone want to make a guess as to who can clean up their data faster? Second, I don’t see Google as a library and don’t expect the same level of bibliographic accuracy from them as I do from a research library. If you want good metadata, then make sure that Google competitors (such as the Hathi Trust) develop quickly.
The really odd presentation was by Daniel Reetz of his scrap-material, low-cost do-it-yourself scanner. The talk – and the scanner – excited many in attendance and has been the subject of many blogs from the conference, including posts from Robin Sloan, Harry Lewis, and Eric Hellman. But it struck me as particularly odd. First, a common criticism of Google’s project has been its sub-standard scanning, which is far from preservation quality. There was no discussion at all, however, of the quality of the images produced from this little machine. Second, at a conference that had as a subtext whether Google was being respectful enough of copyright owners and publishers, we had a presentation on an approach that could ignore publishers and authors completely. Weird.
Are there situations in which the DIY Scanner might be useful? Sure. I always work on the assumption that something is better than nothing, and if you can’t afford a face-up scanner and want something faster than a camera and tripod, this would work. It might make it possible to digitize your home library. You can also buy a turntable that digitizes all the albums in the basement – but most find it easier to buy a better copy at iTunes.
P is for Public
The final session was devoted to public interest issues in the settlement. Lateef Mtima, who presented himself as normally a defender of rights holders, reminded everyone of the tremendous social good that GBS could bring to underserved populations. Chris Danielsen presented a moving argument in favor of increasing access to books for the visually impaired.
Cindy Cohn from EFF and John Verdi from EPIC talked about the importance of privacy issues in the settlement (as did Carrie Russell, in a masterful presentation that outlined the mixed feelings that most librarians have about the settlement). In a recent Digital Campus podcast, the hosts suggested that the privacy community is trying to use GBS as a place to argue their general concerns with privacy on the Internet, and I didn’t hear anything in the session to disabuse me of that.
The privacy issue in GBS seems particularly odd to me. First of all, I don’t view Google as a library and so don’t expect it to follow library confidentiality statutes. Libraries should only subscribe to the database if Google meets professional standards regarding privacy – but right now those are pretty low. For example, the International Coalition of Library Consortia’s “Privacy Guidelines for Electronic Resources Vendors” only requires that vendors (such as Google would be) “respect the privacy of the users of its products” and not disclose such information to a 3rd party without permission. Critics are expecting more from Google than from any other library vendor.
(Continues with D is for Digitize: Day 3)
Peter, thanks for the additional post and discussion. I really appreciate it.
I concur that there's some inconsistency/confusion about the "archival" nature of Google Books. Courant, at the lunch session, implied that he could replace the "acid-paper" books with the scans provided by Google -- in other words, he incorrectly conceives of them as archival. It was at that point that I started asking technical questions about the nature of the scans themselves.
You are correct, of course, that Google hasn't released any average specs, and my statement there was a bit off-the-cuff. Our images are on the whole higher resolution and somewhat better-looking (heh, more sloppy statements) than what Google is serving online (there are many ways to suck whole pages out of Google for comparison; the resolution is not that great and the mistakes are many). They may have higher resolution up their sleeves, but because they rely on secret magical algorithms to unwarp their pages, it will never be possible to have as much confidence in the absolute quality of their scans relative to our flattened pages or even flatbed scans. There are numerous examples on various blogs showing the mistakes and problems introduced by their image-processing approach. They are software people, solving problems with software; we're operating on the principle that good input will equal good output, and the more problems we can solve physically, the less effort we have to expend on software. Not to mention that with every improvement in sensor and software technology (still coming hot and fast in cheap consumer cameras), our scanners improve. We're a long way from my junk scanner proof-of-concept, and we've only had a place to chat since June.
I can't speak for James, but I can speak for the people who've built their own scanners. As I said in my talk, we have all been burned by entities like Google/Amazon/Elsevier/Apple/etc and instead of complaining or buying into another broken system (also coming hot and fast right now), we are working on a very physical level to create a future of books that suits our needs. This future, practically, starts with accessible, affordable hardware, which prior to this year was simply unavailable, but is now unlikely to disappear. The whole reason that we have this settlement is Google's incredible rush to get GBS running before anyone else can make a competing project; in that sense, we represent the (unrepresented) long view that we have more than 5 short years and more than one legal system and continent with which to solve this problem. If you accept that on a long enough timeline (or with revision of the law), copyright on everything will expire, then even infringing uses of scanning systems become a public benefit eventually.
I think an interesting recent example is the abandoning of GeoCities by Yahoo!. They simply don't want to maintain that system anymore -- and made no effort to preserve it. While there are people working day and night to save the archives, Yahoo! is not among them. It may be difficult to imagine Google doing such a thing, but it is not out of the realm of possibility. Having multiple copies of everything, in different hands, under different copyright systems should help ensure the lasting preservation and accessibility of knowledge.
Recall for a minute that the whole conference was diverted to potential future GBS settlement outcomes -- we were all talking about a settlement that we all knew wasn't going through. In that sense, talking about a future of books that doesn't acknowledge or revolve around the GBS settlement made a lot more sense... might have been a lot more germane... than might have been immediately apparent.
Posted by: Daniel Reetz | November 03, 2009 at 05:02 PM
Daniel, thanks for responding. The DIY Bookscanner project is interesting enough that I am going to do an off-topic post just on it, but let me address two of your points about the conference.
First, I didn't ask you about quality metrics at the conference because I don't think they are significant for Google books. We know that Google book scans are not what the library community has defined as a preservation-quality scan. Others, though, seem to feel that the less-than-desired quality of Google scans is an indictment of the whole project. I was surprised that these scanning purists did not speak up.
I am curious, though, how you can claim that your scans are as good as or better than Google's. As far as I know, Google has never released average specifications on its scans, and I would think that the quality of your scans would be heavily dependent on the cameras and lights used.
You write: "My idea was to focus on the substantial non-infringing uses of the scanner and to point out that Google (and publishers, and more) are doing nothing to address the needs of many people." You did a great job of expressing this idea. I liked in particular your examples of groups that are using the scanner to digitize records and other materials that would otherwise remain inaccessible. This is great stuff.
What I found puzzling was the reason why James Grimmelmann wanted this to be part of the program. Was he suggesting that these types of activities should be regulated by the Google Settlement as well? (I hope not!) Was he trying to remind us that there is a world of information that lies outside the settlement? (Well, d'uh!) Or was he trying to suggest that with hardware and software applications similar to yours, some people were going to make copies of their own books (and possibly infringe copyright by doing so) while other people might digitize and post copies of the books to places like http://www.scribd.com? In short, what are the implications, if any, of a DIY book scanning ethos for the Google Book Settlement?
Posted by: Peter Hirtle | October 27, 2009 at 09:35 AM
The DIY Book Scanners our community has built produce images which are, at worst, on par with Google Books and at best far superior. Unlike the Google scanners, which rely on software "magic" to make good images from fast scans of distorted pages, we flatten the pages with glass or acrylic and light them evenly and carefully. The result is a great looking scan that OCRs well. The resolution is somewhat less than a flatbed scanner (for large books), but for novels/small formats it is far greater. I don't know what you mean by a "face-up" scanner, but some of our scanners (like my folding scanner) approach the output quality of commercial scanners like the Atiz units, and are over 20x faster than flatbed scanners, which break bindings and take hours (this, I addressed in my talk).
I didn't discuss quality during my presentation because I didn't have time, though you may have heard me talking tech with Dan Clancy at the microphone about image quality. Anticipating questions about how good our scans are, I brought some image quality samples in my slide deck, but no one, including and perhaps especially you, asked about image quality. Odd indeed.
My idea was to focus on the substantial non-infringing uses of the scanner and to point out that Google (and publishers, and more) are doing nothing to address the needs of many people. I felt this outlined a clear need for a DIY scanner beyond the crass, obvious (and oft-repeated) potential pirate uses which exist in every camera, scanner, and copier in existence. In fact, at this point in time I'm faintly annoyed that many people's first reaction is to call our project piracy -- like Dan Clancy said about Google Books, this scanner is its own Rorschach test. Digitizing your home library is one pedestrian use. Our community is already serving the disabled, helping out people struck by natural disasters abroad, and more.
While my first scanner was made from trash, the point in making public that it was made from trash in three days was to show that producing good-enough scans is, at this point in time, trivial. The point of the new scanner, which was produced with the latest in high-technology (laser manufacturing equipment), was to show that it is reproducible, robust, transportable, and affordable to almost anyone. And all the plans and software are Free.
What more you might want after that is beyond me. You're right that many people will find it acceptable to re-buy broken, rights-limited ebooks from a third party like iTunes, but not all of us find that an acceptable outcome. We are proactively creating the tools to do it right.
Posted by: Daniel Reetz | October 25, 2009 at 10:07 PM
As a former cataloger and someone who studied descriptive bibliography, I take good metadata very seriously. I don't expect good metadata from Google, or from Amazon (which I have seen combine information from several different versions of a book into one description), or from the Internet Archive (which does not, for example, report how it manipulates scans provided to it). I do expect libraries to provide access to good metadata, and so am greatly disappointed when library projects such as OCLC's WorldCat Local refuse to recognize individual differences between copies of books and instead mash all records together into a generic "master record."
I do not, therefore, expect Google Books ever to be a source for serious scholarship. If we want good metadata that would be useful to Paul Duguid or Geoffrey Nunberg (and we do), that needs to come from the libraries that are receiving copies of the scans Google is making. Critics who want to do serious scholarship should be using the Hathi Trust and not Google Books.
Posted by: Peter Hirtle | October 23, 2009 at 11:08 PM
Mr. Hirtle's witty indifference to the massive metadata and scanning foul-ups in Google Books (and as others have noted, in Google Scholar as well) -- which despite his quips he knows perfectly well to be orders of magnitude worse than anything in any of the catalogues of his or the other partner libraries -- would be more amusing if the errors weren't so deleterious to the use of the corpus for serious scholarship by the people whose interests he is ostensibly paid to serve. It would be more seemly if he could at least simulate some concern for the needs of his masters.
Posted by: Dennis Yee | October 23, 2009 at 02:46 PM