Some critics of the Google Books Settlement (GBS) have argued that it gives Google a monopoly on orphan works. But most of the commentators who have discussed orphan works and the settlement are sloppy in their language. In this post, I want to clarify the language and then take a stab at some numbers.
There are two sets of books governed by the settlement. First, there are the in-copyright but out-of-print books whose rights holders sign up with Google. We can call these rights holders "active rights holders." Second, there are the in-copyright but out-of-print books whose authors do not register with Google or the Book Rights Registry: the "inactive rights holders."
Some like to call this second group "orphan works," but that is wrong. The group actually consists of two separate subgroups. First, there are rights holders who could be easily located but who have chosen not to sign up with the Registry. Foreign authors whose works are normally protected by their national reproduction rights organization come readily to mind. I suspect that many could not conceive that their works could be used without their explicit permission and so see no need to register. Others may not learn of the settlement in spite of Google's advertising campaign. Because these authors could be easily identified and located, however, their works are not orphan works. The other portion of the "inactive rights holders" subset is the true orphan works: works whose copyright owners cannot be located, either because they cannot be identified or because their whereabouts are unknown.
The scope of the orphan works problem
Coming up with numbers is very challenging, but here is a quick attempt to get some ballpark figures. First, we need to look at the potential scope of Google's database. The Lavoie article on the Google 5 said that WorldCat contained 32 million print book records in 2005. I think that number is too high because we know that there is a tremendous amount of duplication in WorldCat, but let's use it as the outer limit of the Google database. Bowker's Global Books in Print reports 18.5 million book items in print, which leaves 13.5 million titles that are out of print. (Since the Bowker figure is current and would include books published between 2005 and 2009, it is probably too high - but I also suspect that there are many in-print foreign titles that are not included. Let's use it.)
Lavoie reported that there were 5.4 million titles that were out of copyright (pre-1923), so we are left with roughly 8 million titles that are potentially in copyright but out of print. (Some of these would be American works that have not had their copyright renewed and hence are in the public domain, but I think the number could only be 150,000 [1.7 million] at most, and so I am going to ignore that.)
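The subtraction behind these ballpark figures can be written out as a short sketch. The inputs are the numbers quoted above (in millions of titles). The variable and function names are made up for illustration, and the result is only as good as those rough inputs.

```python
# Figures quoted in the post, in millions of titles.
WORLDCAT_PRINT_RECORDS = 32.0  # Lavoie: WorldCat print book records, 2005
BOOKS_IN_PRINT = 18.5          # Bowker's Global Books in Print
PRE_1923_TITLES = 5.4          # Lavoie: titles out of copyright (pre-1923)


def out_of_print(total, in_print):
    """Titles in the potential database that are no longer in print."""
    return total - in_print


def in_copyright_out_of_print(total, in_print, pre_1923):
    """Out-of-print titles that are potentially still in copyright."""
    return out_of_print(total, in_print) - pre_1923


oop = out_of_print(WORLDCAT_PRINT_RECORDS, BOOKS_IN_PRINT)
icoop = in_copyright_out_of_print(
    WORLDCAT_PRINT_RECORDS, BOOKS_IN_PRINT, PRE_1923_TITLES
)

print(f"out of print: {oop:.1f} million")
print(f"potentially in copyright but out of print: {icoop:.1f} million")
```

The second figure comes out slightly above 8 million, which is why the post rounds it to "roughly 8 million."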
[UPDATE: A good question about the number of works that might have entered the public domain pointed out that my original number is wrong. Here is my thinking: of the 8 million books, half are in English (following WorldCat numbers) and hence are likely to be American works. (I won't worry about books published only in England.) Of those 4 million, 63% according to Lavoie were published after 1963 and are still protected by copyright. That leaves 1.9 million works published between 1923 and 1964. A 1961 copyright study suggested that maybe 9% of these works were renewed and are still protected by copyright, though recent work by Michigan indicates that 41% of the works are still protected by copyright. If we assume 90% are public domain, then 1.7 million works are public domain. If 59% are PD, then 1.1 million are PD. Let's call it 1.5 million - and the number of in-copyright but out-of-print works should drop from 8 million to 6.5 million.]
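As a rough check on the update, here is a minimal sketch of the public-domain range. It takes the 1.9 million pool and the two public-domain shares (90% and 59%) exactly as stated above. The 1.5 million figure is the post's own round choice within that range, not something the code derives, and all names are illustrative.

```python
# Inputs as stated in the update, in millions of titles.
POOL_1923_TO_1963 = 1.9       # English-language works from the renewal era
PD_SHARE_1961_STUDY = 0.90    # share assumed public domain per the 1961 study
PD_SHARE_MICHIGAN = 0.59      # share public domain if 41% remain protected
CHOSEN_PD_ESTIMATE = 1.5      # the post's round figure, picked by the author
IN_COPYRIGHT_OOP = 8.0        # original in-copyright, out-of-print estimate


def public_domain_titles(pool, pd_share):
    """Titles in the pool that have fallen into the public domain."""
    return pool * pd_share


pd_high = public_domain_titles(POOL_1923_TO_1963, PD_SHARE_1961_STUDY)
pd_low = public_domain_titles(POOL_1923_TO_1963, PD_SHARE_MICHIGAN)
adjusted = IN_COPYRIGHT_OOP - CHOSEN_PD_ESTIMATE

print(f"public domain, high estimate: {pd_high:.1f} million")
print(f"public domain, low estimate: {pd_low:.1f} million")
print(f"adjusted in-copyright, out-of-print total: {adjusted:.1f} million")
```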
So we are talking about 8 [6.5] million works published since 1923 that are in copyright but out of print. Of those, how many are going to have inactive rights holders? What percentage of authors are going to register with Google, and what percentage will ignore the call? Or, to look at it another way, what percentage of these works are true orphans?
Denise Troll Covey's numbers might provide some guidance. In CMU's random trial, she was unable to locate 21% of publishers. (Their random sample was not limited to out-of-print books, so the percentage might actually be too low.) If we assume that number would hold for the 8 [6.5] million, that would mean that we have about 1.7 [1.4] million true orphan works in the total database of 13.5 million. (That number might actually be smaller since some rights holders other than publishers might come forward via the settlement.) The remaining 11.8 [10.6] million books would have rights holders who either registered with Google or chose not to register.
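The orphan estimate is a single multiplication followed by a subtraction, sketched below with the figures from the post. One assumption is mine rather than the post's: the bracketed 10.6 million appears to come from subtracting both the 1.5 million public-domain works and the 1.4 million orphans from 13.5 million. The names are illustrative.

```python
# Figures from the post, in millions of titles.
OUT_OF_PRINT_TOTAL = 13.5
UNLOCATABLE_SHARE = 0.21   # publishers CMU could not locate
IN_COPYRIGHT_ORIGINAL = 8.0
IN_COPYRIGHT_UPDATED = 6.5
PD_ADJUSTMENT = 1.5        # public-domain works removed in the update


def orphan_estimate(in_copyright_out_of_print, unlocatable_share):
    """Estimated true orphans, rounded to 0.1 million as in the post."""
    return round(in_copyright_out_of_print * unlocatable_share, 1)


orphans_original = orphan_estimate(IN_COPYRIGHT_ORIGINAL, UNLOCATABLE_SHARE)
orphans_updated = orphan_estimate(IN_COPYRIGHT_UPDATED, UNLOCATABLE_SHARE)

# Books whose rights holders either register or choose not to.
remaining_original = round(OUT_OF_PRINT_TOTAL - orphans_original, 1)
# Assumption: the updated figure also removes the public-domain works.
remaining_updated = round(
    OUT_OF_PRINT_TOTAL - PD_ADJUSTMENT - orphans_updated, 1
)

print(f"true orphans: {orphans_original} [{orphans_updated}] million")
print(f"remaining: {remaining_original} [{remaining_updated}] million")
```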
Even with orphan works legislation, the works of rights holders who could be located but chose not to register would not be eligible for inclusion in a digitized books database, since they are not true orphans. The Google Books settlement is the only way to get cost-effective access to them.
What we need in the settlement is a compulsory license that would allow anyone to license the use of any work held by an inactive rights holder, and not just orphan works.