Replace HTMetadata with Bookworm-MARC? #3

Open
bmschmidt opened this issue Sep 7, 2016 · 5 comments

Comments

@bmschmidt
Member

It's been my intention to replace the existing HTMetadata module here with the new Bookworm_MARC one I wrote in the spring. The goal is to pull the information we can directly from the MARC files, rather than intermediating through the Solr index.

I recommend this (in part because I think the date parsing is significantly better, and it captures some fields I think matter a lot, like contributing library), but it's possible this is not the best way to integrate all the existing work at HTRC. My original hope was that Bookworm-MARC would bundle some HTRC code, but that ship has sort of sailed.

Another possibility is to use the MARC fields by default, but create a second supplemental table from Solr and load those in using bookworm add_metadata. Or vice-versa; Solr primarily, and MARC for supplemental information.
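To make the "primary plus supplemental" idea concrete, here is a minimal sketch of merging two per-volume records so that one source wins and the other only fills gaps. Everything in it is an assumption for illustration: the function name `merge_metadata`, the field names, and the sample values are hypothetical, and it does not show how Bookworm-MARC, the Solr index, or `bookworm add_metadata` actually represent records.

```python
def merge_metadata(marc, solr, primary="marc"):
    """Combine two per-volume metadata dicts, letting the primary source win.

    Hypothetical helper: the field names used below are illustrative only.
    """
    first, second = (marc, solr) if primary == "marc" else (solr, marc)
    merged = dict(second)  # start from the supplemental source
    # Only non-empty primary values override the supplemental ones.
    merged.update({k: v for k, v in first.items() if v not in (None, "", [])})
    return merged


# Made-up records for a single volume.
marc_record = {"id": "example.0001", "date": "1895", "genre": ""}
solr_record = {"id": "example.0001", "date": "1890", "genre": "fiction"}

marc_first = merge_metadata(marc_record, solr_record, primary="marc")
solr_first = merge_metadata(marc_record, solr_record, primary="solr")
```

With `primary="marc"` the MARC date is kept and the empty MARC `genre` is filled from Solr; with `primary="solr"` the Solr values win wherever they are non-empty.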

There's also the question of whether we should use first_publisher (as I do in MARC) or any_publisher.

@organisciak
Member

Yes, definitely. I was intending to deprecate the old HTMetadata. Feeding from a single source is sensible.

Do you have an example of the fields you hope to index? Is this example still current: Bookworm-project/Bookworm-MARC#5? @tcole3 intended to put together metadata for BW following from the JSON files we crunched for the new full collection EF, but I think going completely with your codebase would be sensible. Tim was going to match up the proper date fields (for serial vs. non-serial), but I believe you've already done that?

@tcole3 might also have a thought about first_publisher vs any_publisher.

@organisciak
Member

I don't think publisher is a hugely enlightening snippet of information, so first_publisher seems sufficient, for what it's worth.

@bmschmidt
Member Author

Tim may have better ideas than I do about exactly which date fields are best. My strategy has been (I think) to treat the special Hathi field (974) as best, but I honestly don't remember how I prioritize after that; whether the MARC publication field (260c, maybe?) or the first date in field 008 wins when they conflict is completely unclear to me.

If there is some strategy that varies with serial/nonserial for what field to look at, that would be great. My impression was that any date in field 974 tends to be better than the record-level information since, as I found, so many serials seem to be listed as monographs and vice versa. http://rpubs.com/benmschmidt/189321

That link is pretty close to accurate, but I believe there may be some unpushed changes in the codebase. I will take a look after my class tomorrow.
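For discussion, the priority described above could be sketched roughly as follows. This is not the Bookworm-MARC code: the function names are hypothetical, the inputs are assumed to be already-extracted strings, and reading the item-level date from subfield `y` of field 974 is an assumption about the HathiTrust records. Because the order between 260c and 008 is exactly what is unclear, it is left as a parameter rather than decided here.

```python
import re

# A plausible four-digit year (1500-2099) not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def year_from(text):
    """Return the first plausible year found in text, or None."""
    if not text:
        return None
    match = YEAR.search(text)
    return int(match.group(1)) if match else None


def year_from_008(field_008):
    """Date 1 occupies character positions 07-10 of MARC field 008."""
    if not field_008 or len(field_008) < 11:
        return None
    return year_from(field_008[7:11])


def choose_year(hathi_974y=None, marc_260c=None, field_008=None,
                fallback=("260c", "008")):
    """Prefer the item-level 974 date; otherwise try the fallbacks in order.

    Returns (year, source); both are None when nothing parses.
    """
    item_year = year_from(hathi_974y)
    if item_year is not None:
        return item_year, "974"
    candidates = {"260c": year_from(marc_260c),
                  "008": year_from_008(field_008)}
    for name in fallback:
        if candidates[name] is not None:
            return candidates[name], name
    return None, None


# Made-up 008 value whose Date 1 is 1890.
sample_008 = "850423s1890    nyu           000 0 eng d"
```

Calling `choose_year(None, "c1895.", sample_008)` falls back to 260c, while passing `fallback=("008", "260c")` reverses the tie-break, which makes it easy to compare the two orderings on conflicting records.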

@bmschmidt
Member Author

bmschmidt commented Sep 7, 2016

Publisher is one of those things that certain book historians might care deeply about. But it probably requires extensive standardization to be useful; I started but did not complete that work.

@organisciak
Member

My limited understanding is that series need the enumeration/chronology information to get the correct date rather than the first-published date, but that field can be incorrect for republished books. Again, deferring to the experts.
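A rough sketch of that serial/non-serial split, under loud assumptions: the helper names are hypothetical, the enumeration/chronology value is assumed to be a free-text string such as "v.12 1895", and the rule "use the first year found for serials, the record date otherwise" is only one possible reading, not an established practice.

```python
import re

# A plausible four-digit year (1500-2099) not embedded in a longer number.
YEAR = re.compile(r"(?<!\d)(1[5-9]\d{2}|20\d{2})(?!\d)")


def enumcron_years(enumcron):
    """Return every plausible year found in an enumeration/chronology string."""
    return [int(y) for y in YEAR.findall(enumcron or "")]


def volume_year(record_year, enumcron, is_serial):
    """Use the enumcron year for serials when one parses, else the record date."""
    years = enumcron_years(enumcron)
    if is_serial and years:
        return years[0]
    return record_year
```

Two caveats follow from the thread itself: the `is_serial` flag is only as good as the cataloguing (serials are sometimes listed as monographs), and a volume number such as "no.1895" would be misread as a year by this pattern.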


2 participants