
Missing Volumes in the Bookworm? #6

Open
bmschmidt opened this issue May 28, 2016 · 3 comments

@bmschmidt (Member) commented May 28, 2016

Integrating the MARC records with the Bookworm, I've noticed that there seem to be just under a million books that have MARC records but don't exist in the Bookworm. (I.e., the Bookworm has about 4.7 million volumes; there are about 5.5 million in the MARC records.)

The missing volumes are not evenly distributed. The losses include, most notably, every single Internet Archive-scanned book. Where have they gone? Maybe the entire open-open corpus is missing?

Here's a list of the scanners by number of volumes in the MARC files (from field 974$s):

bschmidt@sibelius:/raid/hathipd$ jq '.scanner' jsoncatalognew.txt | sort | uniq -c
    156 "bc"
   1169 "borndigital"
      2 "brooklynmuseum"
    292 "clark"
     89 "clements-umich"
      3 "cornell"
  68344 "cornell-ms"
   4772 "getty"
    977 "geu"
4880745 "google"
 483568 "ia"
  54847 "lit-dlps-dc"
   1062 "mcgill"
  10717 "mdl"
  10501 "mhs"
     11 "mou"
    191 "nnc"
    374 "northwestern"
      1 "private"
   1109 "tamu"
    346 "ucm"
     68 "udel"
   4192 "uiuc"
     57 "umd"
      7 "umn"
    875 "ump"
     17 "wau"
  22948 "yale"
    420 "yale2"

Here, on the other hand, are the sources inside the Bookworm (i.e., the MARC records that also exist inside the Bookworm).

Every IA-scanned book is gone, 68,000 Cornell-MS books are gone, and about 700,000 Google-scanned books are missing. (A sketch of the kind of query that surfaces these gaps follows the table.)

bschmidt@sibelius:/raid/hathipd$ mysql -e "SELECT scanner,COUNT(*) from contributing_library_serial_killer_guess GROUP BY scanner" hathipd
+----------------+----------+
| scanner        | COUNT(*) |
+----------------+----------+
| clements-umich |       32 |
| cornell        |        2 |
| google         |  4102853 |
| lit-dlps-dc    |    46365 |
| northwestern   |        1 |
| ucm            |       18 |
| yale           |    22947 |
+----------------+----------+
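
For reference, a minimal sketch of the kind of anti-join that would list the missing volumes directly. It assumes the MARC catalog was loaded into a hypothetical table marc_catalog(htid, scanner) and that both tables key on an htid column; those names are illustrative, not actual names from this database.

-- Count MARC volumes, per scanner, that have no matching row in the Bookworm table.
SELECT m.scanner, COUNT(*) AS missing
FROM marc_catalog m
LEFT JOIN contributing_library_serial_killer_guess b
       ON b.htid = m.htid
WHERE b.htid IS NULL
GROUP BY m.scanner;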
@bmschmidt (Member Author) commented May 28, 2016

It looks like this may be an upstream problem with the feature counts: there are only 4.8m volumes there. I spot-checked several at random (not enough to be confident, though) and all were Google-scanned. @organisciak or someone else: are the features supposed to exclude IA-scanned books? Do they? Can we get them into the Bookworm?

@organisciak (Member)
The EF files didn't exclude anything; it's just that the PD collection has grown since we crunched EF version 0.2 in February 2015. We're currently working on non-PD data, and we'll update the PD Extracted Features later.

@bmschmidt (Member Author)
My bad. It turns out this had to do with volume IDs: the IA-scanned books are also the ones that have colons and slashes in their volume IDs, and for whatever reason those characters are replaced with + and = in the volume identifiers in the Bookworm database. So the linkage was not happening on my end.

Oops. I should have listened to myself when I said I hadn't checked enough to be confident.
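
For anyone hitting the same mismatch, here is a sketch of the fix using the substitution described above. It reuses the hypothetical marc_catalog table and htid column from the earlier sketch; only the ':' -> '+' and '/' -> '=' replacement is taken from this thread.

-- Normalize MARC-side IDs the way the Bookworm database stores them,
-- replacing ':' with '+' and '/' with '=', so an IA-style ark ID such as
-- (hypothetically) 'loc.ark:/13960/t0000' joins as 'loc.ark+=13960=t0000'.
SELECT m.scanner, COUNT(*)
FROM marc_catalog m
JOIN contributing_library_serial_killer_guess b
  ON b.htid = REPLACE(REPLACE(m.htid, ':', '+'), '/', '=')
GROUP BY m.scanner;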
