This repository has been archived by the owner on Mar 27, 2023. It is now read-only.

scrape esac data #244

Open
maxheld83 opened this issue Jul 9, 2020 · 6 comments
Assignees
Labels
esac our use of http://esac-initiative.org

Comments

@maxheld83
Contributor

The comprehensive data in ESAC_Transformative_Agreement_Übersicht_der_Verträge.xlsx has so far been entered by hand from the esac website.

Perhaps there might be a way to scrape this off the website programmatically and/or ask esac for the data in structured form.

Not sure how central this is to our mission though.

@maxheld83 maxheld83 self-assigned this Jul 9, 2020
@maxheld83 maxheld83 added the esac our use of http://esac-initiative.org label Jul 29, 2020
@maxheld83 maxheld83 added this to To do in Scholcomm Analytics Kanban via automation Jul 29, 2020
@maxheld83 maxheld83 moved this from To do to blocked in Scholcomm Analytics Kanban Jul 29, 2020
@maxheld83 maxheld83 changed the title from "parse / update esac data from website" to "get esac data programmatically" Jul 29, 2020
@maxheld83
Contributor Author

also opens up #251 and makes #240 much easier

@maxheld83
Contributor Author

I think it'd be really great to get the ESAC registry data in a programmatic way, ideally without scraping, since the data surely must exist in some database.
This would open a bunch of interesting applications for us (see esac label).

@njahn82 @Henrieke72:

  • I wasn't able to find the ESAC registry data anywhere but as HTML on their website. Is there already a proper source that I've missed?
  • if not, can I just go and ask Kai Geschuhn whether and how they'd be willing to share the underlying data?

@maxheld83
Contributor Author

and @njahn82 can you comment on how strategically important the ESAC registry data is for our project?

I really want to leverage the work that @Henrieke72 already did with it, and it seems to me that the opportunities to mash up the ESAC data with the rest of hoad could be quite interesting (see #251), but I might not have enough context.

Considering that the data is already mostly structured (and even tidy), properly cleaning and exposing it shouldn't be too much work, maybe a day or two.
Depending on what ESAC wants to do with their data, we can also wrap it up in a small R package that's separate from hoad, so more people can use it.

@maxheld83
Contributor Author

so this will be scraped in a separate package

@maxheld83 maxheld83 changed the title from "get esac data programmatically" to "scrape esac data" Jul 30, 2020
@Henrieke72

@maxheld83 Unfortunately, there is only the HTML version of the data, which is why I had to copy and paste it into an Excel sheet. As the registry data are very dynamic, maybe there is a way to automatically update the Excel file with the new data?

@maxheld83
Contributor Author

Thanks @Henrieke72! I'll do that; I'll scrape the data off the website and then offer an Excel export.
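
For illustration only, here is a rough sketch of what that scrape-then-export step might look like. It is not the code of the separate package mentioned above; it is written in Python with only the standard library, and it assumes the registry is served as a plain, well-formed HTML table. The `SAMPLE_HTML` string, its column names, and its rows are made-up placeholders rather than real registry content, and the helper names (`TableParser`, `html_table_to_rows`, `rows_to_csv`) are hypothetical. The export is written as CSV, which Excel can open.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> in an HTML document as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def html_table_to_rows(html):
    """Return all table rows found in `html` as lists of strings."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialise rows to CSV text, a format Excel can open."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Placeholder markup standing in for the registry page (assumption, not real data).
SAMPLE_HTML = """
<table>
  <tr><th>Publisher</th><th>Country</th><th>Start date</th></tr>
  <tr><td>Example Publisher A</td><td>DE</td><td>2020-01-01</td></tr>
  <tr><td>Example Publisher B</td><td>NL</td><td>2021-01-01</td></tr>
</table>
"""

rows = html_table_to_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

In a real run the HTML would be fetched from the registry page instead of a hard-coded string, and re-running the script on a schedule would be one way to keep the export in step with the dynamic registry data.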

@maxheld83 maxheld83 removed this from blocked in Scholcomm Analytics Kanban Jun 21, 2021
2 participants