(by Peter Hirtle)
CLIR has inaugurated a new publication series called Ruminations, and for its first report, it has published an incredibly interesting and important report by John Wilkin. In “Bibliographic Indeterminacy and the Scale of Problems and Opportunities of ‘Rights’ in Digital Collection Building,” Wilkin explores the status of copyrights in the 5+ million volumes that have been digitized and deposited with the Hathi Trust. He provides hard data that builds on the earlier work of Brian Lavoie and Lorcan Dempsey on the nature of the WorldCat database and Michael Cairns on the number of orphans.
Wilkin’s work is a rich and nuanced piece that stimulates thoughts of broad importance as well as questions about the soundness of some of his assumptions. Here are some of my initial thoughts, sparked by reading an early draft as well as the published product.
First, Wilkin’s analysis provides solid data on the scope of the orphan works issue (those works that are protected by copyright but for which a copyright owner cannot be located). “Even before we are finished digitizing our collections,” Wilkin concludes, “the potential numbers are significant and surprising: more than 800,000 US orphans and nearly 2 million non-US orphans.”
The size and scope of the orphan works problem was one of the subtexts in the debate about the Google Books Amended Settlement Agreement (ASA), and critics of the settlement argued that the ASA would give Google an unconscionable monopoly over orphans. Wilkin’s work, however, indicates that the universe of orphan works is much, much larger than the ASA would have made accessible. His calculations do not distinguish registered and renewed US copyrighted works from unregistered U.S. titles; the ASA (unlike the original settlement agreement) would have only given Google the right to use registered orphan works. That number would be far smaller than 800,000. As for the foreign works, we don’t know how many of the 2 million titles Wilkin identified as orphans were published in England, Canada, Australia, or New Zealand (the countries that would have been part of the ASA) and how many were published in the countries that were not included in the ASA. It is likely, however, that only a small percentage of these two million would have been accessible via the ASA. The major problem with the ASA, therefore, was not that it would give Google a monopoly over all orphan works, but rather that it would leave millions of titles still inaccessible. The original settlement agreement was actually better in this regard, even though it had a slew of other problems.
With the rejection of the ASA, even its partial solution to the orphan works issue is gone. The legislation proposed by the Copyright Office to address the issue in its final report on the orphan works problem is no better. Given the scope of the issue as identified by Wilkin, no mass digitization effort like the Hathi Trust’s could ever afford to engage in a title-by-title search for copyright owners. There needs to be a third solution. The Trust recently announced that it is going to engage in a study on how to locate the owners of orphan works that should help further explicate the scope of the orphan works problem.
Second, Wilkin’s analysis is based on the 5+ million books secured in the Hathi Trust. They represent a 31% (and growing) overlap with the holdings of ARL libraries. It is pretty clear to me that Hathi has started building the Digital Public Library of America that others are talking about. I am also curious as to how big that library will get, and whether the patterns Wilkin has identified will continue to hold as the collection grows. Can we assume, for example, that if 31% of academic library collections are duplicated in the 5 million volumes already in Hathi, then 16 million volumes will give us 100% overlap? And will the number of orphans also increase threefold?
Third, the bulk of Wilkin’s essay is devoted to counting the status of works in the database. He notes: “there is considerable nuance and some tricky exceptions to all of these rules, which I won't try to supply here.” I live for nuance and tricky rules, so the rest of this too-long review is devoted to his assumptions.
- While admitting it is an oversimplification, Wilkin assumes that “all pre-1923 books are in the public domain.” That is a pretty good assumption. There is only one instance that I can think of where a pre-1923 book could still be protected by copyright (excluding the weird Twin Books decision on foreign books in the 9th Circuit). If a pre-1923 book was published without the authority of the copyright owner, or included a work that was published without the authority of the copyright owner, that work might still be protected by copyright if the copyright owner published it after 1923. Reproducing the pre-1923 work would infringe on the later copyright. This is apparently the case with the song “Happy Birthday.” According to Robert Brauneis in his fascinating history of the “world’s most popular song,” the lyrics to Happy Birthday were reprinted many times starting in 1912, but the first authorized publication (and registration) occurred in 1935. Distributing any of the earlier versions could infringe on the copyrights created in 1935. But the number of cases of this must be tiny, so I can live with Wilkin’s assumption (and also the slight risk that it entails when digitizing a pre-1923 work).
- Wilkin also mentions the findings of Michigan’s Copyright Review Management System (CRMS), which has found that 55% of US works published between 1923 and 1964 are in the public domain. Wilkin doesn’t link to the protocols that CRMS follows when investigating the copyright status of a work, but they can be found here, and are excellent. For example, unlike other prominent digitization projects that look only at place and date of publication of the volume in hand, CRMS takes into account possible prior foreign publication of a title before its appearance in the US. A book published in the US in 1935 and not renewed would appear in the Copyright Office records as being in the public domain, but if a version was first published in London in 1934, the work would still be protected in the U.S. (I talk about this issue more here.) The CRMS protocol requires that investigators look for evidence of prior foreign publication.
- Wilkin posits this assumption: “For non-US works published between 1923-1963, roughly 20% will be in the public domain (e.g., because the author died before 1941, as would be the case for determining public domain status for works published in countries like the US that has a term of life plus 70 years).” Here he misconstrues the operation of Section 104(a), which restored copyrights in most foreign works. The key determination for most countries is whether the work was in the public domain in its home country as of 1 January 1996. So the question is how many works published between 1923 and 1996 had authors who died before 1926 (in a life+70 country such as the Netherlands or Germany) or 1946 (in a life+50 country, such as Canada) – not 1941. (One would also need to take into account local extensions, such as the additional copyright protection for authors of musical works who are “mort pour la France.”) I think that more research on foreign copyrights would need to be done before we could assume that 20% of pre-1964 works (which by itself is a date that only matters to US works, and hence odd to see used as a dividing line in this discussion of foreign works) are in the public domain.
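The restoration-date arithmetic can be sketched in a few lines of code. This is a deliberately simplified model, not a real copyright determination: it assumes only a home-country term of life plus a fixed number of years, running to the end of the calendar year, and ignores the many other factors that matter in practice (US formalities, country of origin, wartime term extensions, and so on). The function name and structure are my own illustration.

```python
# Simplified illustration of the Section 104A timing test discussed above.
# Assumption: the only rule modeled is "home-country term = life + N years,
# running to the end of the author's death year + N."
RESTORATION_YEAR = 1996  # 1 January 1996, the restoration date for most countries


def public_domain_at_restoration(death_year: int, term_years: int) -> bool:
    """Return True if a work whose home-country term is life + term_years
    had already expired (was public domain at home) on 1 January 1996,
    and so was NOT restored in the US under this simplified model."""
    # Term runs through 31 December of (death_year + term_years), so the
    # work is public domain on 1 January of the following year.
    return death_year + term_years < RESTORATION_YEAR


# Life+70 country: the author must have died before 1926.
assert public_domain_at_restoration(1925, 70)       # term ended 31 Dec 1995
assert not public_domain_at_restoration(1926, 70)   # term ran into 1996: restored

# Life+50 country: the author must have died before 1946.
assert public_domain_at_restoration(1945, 50)
assert not public_domain_at_restoration(1946, 50)
```

The asserts just make the cutoff years explicit: with a 1 January 1996 restoration date, life+70 works out to a death before 1926, and life+50 to a death before 1946.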
- Wilkin suggests that for copyrighted works published between 1923 and 1963, “we will be able to contact only 10% of the authors, publishers, or heirs who hold rights.” He hints that his estimate is based on the important work done by Denise Troll Covey, but I don't see how one can extrapolate any useful percentages from Covey's bar graphs.
This last point strikes at the real heart of the problem with Wilkin's conclusions. They are based on an assessment of how likely it is that we will be able to locate copyright owners. Right now, that is all guesswork. We have no way of knowing if 20%, or 30%, or 50% of books in the Hathi Trust collection will turn out to be true orphans. Wilkin's careful initial assessment of the nature of the collection, however, suggests that he and his Michigan colleagues will soon replace his guesses with estimates based on actual investigations.
UPDATES, 29 June 2011: A misspelling in the last paragraph has been corrected. John Mark Ockerbloom was kind enough to point out that I had mistakenly used 1998 as the date of most copyright restorations. The actual date as specified in 104A is of course 1 January 1996. I have updated the text accordingly.