
Draft: Update data on the CI #26

Open
jbruechert wants to merge 9 commits into master from ci-update
Conversation

jbruechert (Contributor)

No description provided.

@ulfgebhardt (Member) commented Mar 29, 2021

  1. The update & push to GitHub must become its own workflow, e.g. .github/workflows/crawl.yml, so we can separate stuff and schedule the job (see the sketch below). Of course, what you do is perfectly fine for testing.
name: "Close stale issues"
on:
  schedule:
  - cron: "0 0 * * *"

jobs:
  stale:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/stale@v1.1.0
      with:
        repo-token: ${{ secrets.GITHUB_TOKEN }}
        stale-issue-message: 'Message to comment on stale issues. If none provided, will not mark issues stale'
        stale-pr-message: 'Message to comment on stale PRs. If none provided, will not mark PRs stale'


  1. If you need secrets in the repo to push to the correct git repo, tell me and I'll make things happen.

  2. Is it smart to keep data like banz.json in this repo?

  3. ....

  4. Profit! Good work, very awesome to see this year-old problem get solved <3
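For illustration, a minimal sketch of what a dedicated crawl.yml could look like, assuming the update script is updatelawsgit.py at the repo root and that the committed data can be pushed with the workflow's default credentials; the schedule, script name, and data paths are placeholders, not part of this PR:

# Sketch only: a separate scheduled workflow for the crawl.
# "updatelawsgit.py", "requirements.txt" and the data/ path are assumptions.
name: "Update law data"
on:
  schedule:
  - cron: "0 3 * * *"        # once per day; the exact time is arbitrary
  workflow_dispatch:          # allow manual runs for testing

jobs:
  crawl:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - uses: actions/setup-python@v2
      with:
        python-version: "3.9"
    - run: pip install -r requirements.txt   # assumed dependency file
    - run: ./updatelawsgit.py                # assumed update entry point
    - run: |
        git config user.name "github-actions"
        git config user.email "actions@users.noreply.github.com"
        git add data/
        git commit -m "Update crawled data" || echo "nothing to commit"
        git push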

@jbruechert (Contributor, PR author)

Is it smart to keep data like banz.json in this repo?

So far we are treating this data as input data for our other tools. If someone has a use for it outside of this, we can move it to a new repo, I guess.

@ulfgebhardt (Member)

Oh, I did not know that this tool processes the crawled data further. It's cool, I just thought that for separation of code & data it might be wise to do so. If it's not, that's perfectly fine.

@jbruechert (Contributor, PR author)

Before we can merge this, we need to fill in the years between 2013 and 2021 with a manual local run, btw. The script currently only starts at the year of the latest commit in the gesetze repo, which is from this year, so it would get confused by the gap.

@ulfgebhardt (Member) commented Mar 29, 2021

I guess this is to increase efficiency? How long does a full scrape take? My experience with bundestag.de is that they change old stuff regularly, and their unstable web interfaces cause faulty data to show up every now and then; this can be detected with regular crawls. So what speaks against crawling everything once a day?

@jbruechert (Contributor, PR author) commented Mar 29, 2021

The Bundesanzeiger scraper is really slow. As far as I understand the code, the actual laws are updated completely, but the index of changed laws is only extended, not refreshed. Even crawling just 2021 in the Bundesanzeiger takes about 5 minutes iirc, and everything would take multiple hours.

jbruechert force-pushed the ci-update branch 2 times, most recently from a2c23e3 to 3d7570f on March 29, 2021 20:06
@Croydon commented Sep 14, 2021

What is the status of this? Any particular reason why the work in this PR did not get continued?

@jbruechert (Contributor, PR author)

The data from 2013 to 2021 needs to be added (for example by editing updatelawsgit.py locally). Afterwards, I think this could be merged. It would be nice if someone else could take over that work; I'm not that active here right now.
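For anyone picking this up, a rough sketch of what that one-off local backfill could look like, reusing the scraper calls from this PR with an explicit year range; the data file paths here are assumptions:

# Sketch only: one-off local backfill for 2013-2021.
# BANZ_FILE and BGBL_FILE are assumed paths; use the ones from updatelawsgit.py.
import subprocess

BANZ_FILE = "data/banz.json"
BGBL_FILE = "data/bgbl.json"

for command in (
    ["./banz_scraper.py", BANZ_FILE, "2013", "2021"],
    ["./bgbl_scraper.py", BGBL_FILE, "2013", "2021"],
):
    subprocess.check_call(command)  # raises if a scraper exits with an error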

@darkdragon-001 (Collaborator) left a comment


Thanks for your work. I would vote in favor of implementing #35 and #36 first, but this would be fine otherwise, since it is better than nothing.

Comment on lines +25 to +31
def clone_lawsgit() -> None:
print("Updating gesetze.git…")

if not os.path.exists(LAWS_PATH):
run_command(["git", "clone", "--depth=1", LAWS_REPOSITORY, LAWS_PATH])
else:
run_command(["git", "-C", LAWS_PATH, "pull"])

Is this really needed since you are running uses: actions/checkout@v2?
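If the goal is to drop the manual clone, actions/checkout can also fetch a second repository into a subdirectory; a hedged sketch, where the repository name and path are assumptions:

# Sketch only: checking out the laws repository via actions/checkout
# instead of cloning it in the script. Repository name and path are assumptions.
- uses: actions/checkout@v2          # this repository
- uses: actions/checkout@v2
  with:
    repository: bundestag/gesetze    # assumed upstream laws repo
    path: gesetze
    fetch-depth: 1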

Comment on lines +20 to +23
def run_command(command: List[str]) -> None:
if subprocess.check_call(command) != 0:
print("Error while executing", command)
exit(1)

Should we really use subprocesses, or would it be better to just import the Python files directly?
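For comparison, a sketch of the import-based approach; note that this assumes banz_scraper exposes a callable entry point, which would need to be checked against the actual module:

# Sketch only: run the scraper in-process instead of via subprocess.
# banz_scraper.main() and its argument list are assumptions; the module
# may currently only be usable as a standalone script.
import banz_scraper

def update_banz(banz_file: str, minyear: int, maxyear: int) -> None:
    banz_scraper.main([banz_file, str(minyear), str(maxyear)])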

Comment on lines +33 to +37
def get_latest_year() -> int:
repo = Repo(LAWS_PATH)
timestamp = repo.head.commit.committed_date
date = datetime.fromtimestamp(timestamp)
return date.year

How can we ensure that this correlation is always valid? Maybe we should check the content of the downloaded JSON files instead.
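One possible direction, sketched under the assumption that each entry in the downloaded JSON carries a date from which the year can be read; the real banz.json layout would need to be checked:

# Sketch only: derive the most recent year from the crawled data itself
# instead of the git history. Assumes entries expose a "date" field like
# "YYYY-MM-DD"; adjust to the actual JSON layout.
import json

def get_latest_year_from_data(path: str) -> int:
    with open(path) as f:
        data = json.load(f)
    return max(int(entry["date"][:4]) for entry in data.values())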

run_command(["./banz_scraper.py", BANZ_FILE, str(minyear), str(maxyear)])
run_command(["./bgbl_scraper.py", BGBL_FILE, str(minyear), str(maxyear)])

# TODO add the other indexes here, once they are working

Are they working now?

def fetch_raw_xml() -> None:
print("Downloading new xml from gesetze-im-internet.de…")

run_command(["./lawde.py", "loadall", f"--path={RAW_XML_PATH}"])

This is a lot of data. Do we really need to download everything daily again?
