
Missing Volumes in the Bookworm? #6

Open
bmschmidt opened this issue May 28, 2016 · 3 comments

@bmschmidt (Member) commented May 28, 2016

Integrating the MARC records with the Bookworm, I've noticed that there seem to be just under a million books that have MARC records but don't exist in the Bookworm. (I.e., the Bookworm has about 4.7 million volumes; there are about 5.5 million in the MARC records.)

The missing volumes are not evenly distributed. The losses include, most notably, every single Internet Archive-scanned book. Where have they gone? Maybe the entire open-open corpus is missing?

Here's a list of the scanners by number of volumes in the MARC files (from field 974$s):

bschmidt@sibelius:/raid/hathipd$ jq '.scanner' jsoncatalognew.txt | sort | uniq -c
    156 "bc"
   1169 "borndigital"
      2 "brooklynmuseum"
    292 "clark"
     89 "clements-umich"
      3 "cornell"
  68344 "cornell-ms"
   4772 "getty"
    977 "geu"
4880745 "google"
 483568 "ia"
  54847 "lit-dlps-dc"
   1062 "mcgill"
  10717 "mdl"
  10501 "mhs"
     11 "mou"
    191 "nnc"
    374 "northwestern"
      1 "private"
   1109 "tamu"
    346 "ucm"
     68 "udel"
   4192 "uiuc"
     57 "umd"
      7 "umn"
    875 "ump"
     17 "wau"
  22948 "yale"
    420 "yale2"

Here, on the other hand, are the sources inside the Bookworm (i.e., the MARC records that also exist inside the Bookworm).

Every IA-scanned book is gone, 68,000 Cornell-MS books are gone, and about 700,000 Google-scanned books are missing. (A sketch of the kind of query that surfaces these gaps follows the table.)

bschmidt@sibelius:/raid/hathipd$ mysql -e "SELECT scanner,COUNT(*) from contributing_library_serial_killer_guess GROUP BY scanner" hathipd
+----------------+----------+
| scanner        | COUNT(*) |
+----------------+----------+
| clements-umich |       32 |
| cornell        |        2 |
| google         |  4102853 |
| lit-dlps-dc    |    46365 |
| northwestern   |        1 |
| ucm            |       18 |
| yale           |    22947 |
+----------------+----------+
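
For reference, a minimal sketch of the kind of anti-join that would list the missing volumes directly. It assumes the MARC catalog was loaded into a hypothetical table marc_catalog(htid, scanner) and that both tables key on an htid column; those names are illustrative, not actual names from this database.

-- Count MARC volumes, per scanner, that have no matching row in the Bookworm table.
SELECT m.scanner, COUNT(*) AS missing
FROM marc_catalog m
LEFT JOIN contributing_library_serial_killer_guess b
       ON b.htid = m.htid
WHERE b.htid IS NULL
GROUP BY m.scanner;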
@bmschmidt (Member Author) commented May 28, 2016

It looks like this may be an upstream problem with the feature counts: there are only 4.8m volumes there. I spot-checked several at random (not enough to be confident, though) and all were Google-scanned. @organisciak or someone else: are the features supposed to exclude IA-scanned books? Do they? Can we get them into the Bookworm?

@organisciak (Member)
The EF files didn't exclude anything; it's just that the PD collection has grown since we crunched EF version 0.2 in February 2015. We're currently working on non-PD data, and we'll update the PD Extracted Features later.

@bmschmidt (Member Author)
My bad. It turns out this had to do with volume IDs: the IA-scanned books are also the ones that have colons and slashes in their volume IDs, and for whatever reason those characters are replaced with + and = in the volume identifiers in the Bookworm database. So the linkage was not happening on my end.

Oops. I should have listened to myself when I said I hadn't checked enough to be confident.
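
For anyone hitting the same mismatch, here is a sketch of the fix using the substitution described above. It reuses the hypothetical marc_catalog table and htid column from the earlier sketch; only the ':' -> '+' and '/' -> '=' replacement is taken from this thread.

-- Normalize MARC-side IDs the way the Bookworm database stores them,
-- replacing ':' with '+' and '/' with '=', so an IA-style ark ID such as
-- (hypothetically) 'loc.ark:/13960/t0000' joins as 'loc.ark+=13960=t0000'.
SELECT m.scanner, COUNT(*)
FROM marc_catalog m
JOIN contributing_library_serial_killer_guess b
  ON b.htid = REPLACE(REPLACE(m.htid, ':', '+'), '/', '=')
GROUP BY m.scanner;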
