Changes of joacmue #14

Status: Open. Wants to merge 139 commits into base: master.

Commits (139)
4177126
python3-ified some scripts
Dec 5, 2020
e7ad0c7
minor clean-up
Dec 5, 2020
ba7cc1c
made the --name only variant work
Dec 6, 2020
b1b935a
corrected fault with heading indentation
Dec 6, 2020
9487f75
Made tables work... sort of.
Dec 6, 2020
662fa16
made BR (line break?) work
Dec 7, 2020
fee8cde
fixed multi-row headers and lists
Dec 7, 2020
d13c8c8
skipping line breaks in "kommentar"
Dec 10, 2020
a8aaf26
Made lists render inside table (no indentation)
Dec 11, 2020
e761381
prettyfying alphanumeric list indices
Dec 11, 2020
69f3a71
clean-up of todos, tables should work now
Dec 17, 2020
a4609e0
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
5af2990
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
d1e84e8
made the banz scraper work again
joacmue Dec 23, 2020
4dca275
Added some notes on what this actually does
joacmue Mar 28, 2021
5d64c00
Merge remote-tracking branch 'gesetze-tools-upstream/master'
joacmue Mar 28, 2021
9c1aa53
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
f7ebdf1
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
f91a8cf
Some suggested changes from the PR
joacmue Apr 3, 2021
5bdfd5e
Suggested Changes from the PR
joacmue Apr 3, 2021
4da5c5c
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
70f8035
python3-ified some scripts
Dec 5, 2020
6dc6e7a
minor clean-up
Dec 5, 2020
7dd3f8c
made the --name only variant work
Dec 6, 2020
ab9b64f
Made tables work... sort of.
Dec 6, 2020
4005d7e
made BR (line break?) work
Dec 7, 2020
52b80c8
fixed multi-row headers and lists
Dec 7, 2020
b324727
skipping line breaks in "kommentar"
Dec 10, 2020
8e49ec8
Made lists render inside table (no indentation)
Dec 11, 2020
f64454f
prettyfying alphanumeric list indices
Dec 11, 2020
f9cb296
clean-up of todos, tables should work now
Dec 17, 2020
b25e56a
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
937b86f
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
b29530a
made the banz scraper work again
joacmue Dec 23, 2020
d3d84ba
Added some notes on what this actually does
joacmue Mar 28, 2021
2bd7a75
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
ee9bf6b
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
eb3ac97
Some suggested changes from the PR
joacmue Apr 3, 2021
f36e726
Suggested Changes from the PR
joacmue Apr 3, 2021
09c5004
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
35452ff
Merge branch 'master' of https://github.com/joacmue/gesetze-tools
joacmue Apr 6, 2021
17b5ac5
Corrected a copy typo of double brackets
joacmue Apr 6, 2021
d4ced4d
Removing two causes of linter errors
joacmue Apr 6, 2021
aac5558
removing banz_scraper python 2.x leftovers
joacmue Apr 6, 2021
f70a0ed
Removing some linter warnings
joacmue Apr 6, 2021
53b96fc
Minor clean-up
joacmue Apr 6, 2021
fb39d0a
minor clean-up
joacmue Apr 6, 2021
92b2bf7
Continuing to please the linter
joacmue Apr 6, 2021
823d076
Minor modifications.
darkdragon-001 Apr 6, 2021
2209244
Update data in separate commits/branches.
darkdragon-001 Apr 6, 2021
885061a
Some fixes
darkdragon-001 Apr 17, 2021
31f948a
Merge remote-tracking branch 'origin/master' into joacmue
darkdragon-001 Apr 18, 2021
246cc82
Removing regex qualifiers from non-regex strings
joacmue Apr 18, 2021
c41acef
Merge branch 'master' of https://github.com/joacmue/gesetze-tools
joacmue Apr 18, 2021
b065734
Re-adding the default flush when outside tables
joacmue Apr 18, 2021
49ef8a6
Removing special handling of lettered list indices
joacmue Apr 18, 2021
532da90
Cleaning up the backspaces in tables & lists
joacmue Apr 18, 2021
3a01c61
not printing leading line break on table headers
joacmue Apr 18, 2021
d48adf3
Removing mess around handling breaks
joacmue Apr 18, 2021
2264d43
Cleaning up custombreaks
joacmue Apr 25, 2021
279e19e
Adding empty cells for colspans
joacmue Apr 25, 2021
73aaf7b
Something was strange with the round function
joacmue Apr 25, 2021
a7b7e82
Making breaks on encounters of <BR> again
joacmue Apr 25, 2021
e4cf4bb
Removing special case for begin of <br>
joacmue Apr 25, 2021
2d13b97
Explicitly parsing colnames for colspans now
joacmue May 9, 2021
17bf66b
Making lawdown go over all laws without errors
joacmue May 13, 2021
fab87e6
Making multiline headers with colspan render nicer
joacmue May 14, 2021
b117e25
Cleaning up column list handling
joacmue May 14, 2021
358fd09
python3-ified some scripts
Dec 5, 2020
ac34cd9
minor clean-up
Dec 5, 2020
b1152ed
made the --name only variant work
Dec 6, 2020
ed7c5a0
Made tables work... sort of.
Dec 6, 2020
84c986b
made BR (line break?) work
Dec 7, 2020
923c6ec
fixed multi-row headers and lists
Dec 7, 2020
389e419
skipping line breaks in "kommentar"
Dec 10, 2020
abaefbf
Made lists render inside table (no indentation)
Dec 11, 2020
2264272
prettyfying alphanumeric list indices
Dec 11, 2020
c5b2048
clean-up of todos, tables should work now
Dec 17, 2020
1e21aaf
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
bd9763a
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
fe61b75
made the banz scraper work again
joacmue Dec 23, 2020
59eea7e
Added some notes on what this actually does
joacmue Mar 28, 2021
7128419
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
33e3ee6
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
5058f20
Some suggested changes from the PR
joacmue Apr 3, 2021
b0fa94d
Suggested Changes from the PR
joacmue Apr 3, 2021
7361631
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
6ec2842
python3-ified some scripts
Dec 5, 2020
a93f191
minor clean-up
Dec 5, 2020
4042407
made the --name only variant work
Dec 6, 2020
0c916fd
corrected fault with heading indentation
Dec 6, 2020
0cd86c3
Made tables work... sort of.
Dec 6, 2020
6319695
made BR (line break?) work
Dec 7, 2020
c3262fa
fixed multi-row headers and lists
Dec 7, 2020
6ce7e2e
skipping line breaks in "kommentar"
Dec 10, 2020
ff0ec27
Made lists render inside table (no indentation)
Dec 11, 2020
c408faf
prettyfying alphanumeric list indices
Dec 11, 2020
7e042b1
clean-up of todos, tables should work now
Dec 17, 2020
355d84b
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
b5ef45c
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
699ab06
made the banz scraper work again
joacmue Dec 23, 2020
dce5570
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
4ec214e
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
0836f20
Some suggested changes from the PR
joacmue Apr 3, 2021
a6ad1e6
Suggested Changes from the PR
joacmue Apr 3, 2021
3716a55
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
fe153d3
Corrected a copy typo of double brackets
joacmue Apr 6, 2021
1689312
Removing two causes of linter errors
joacmue Apr 6, 2021
aa9ad98
removing banz_scraper python 2.x leftovers
joacmue Apr 6, 2021
52e0641
Removing some linter warnings
joacmue Apr 6, 2021
170c0ff
Minor clean-up
joacmue Apr 6, 2021
f21b36f
minor clean-up
joacmue Apr 6, 2021
6227fcb
Continuing to please the linter
joacmue Apr 6, 2021
2d34d37
Removing regex qualifiers from non-regex strings
joacmue Apr 18, 2021
91bf7ea
Minor modifications.
darkdragon-001 Apr 6, 2021
f737049
Update data in separate commits/branches.
darkdragon-001 Apr 6, 2021
f1ec414
Some fixes
darkdragon-001 Apr 17, 2021
8e22485
Improve issue templates.
darkdragon-001 Apr 17, 2021
d2fcb9d
Try to fix formatting template.
darkdragon-001 Apr 18, 2021
f820c53
Enable CI also for PRs.
darkdragon-001 Apr 18, 2021
f8026bf
Re-adding the default flush when outside tables
joacmue Apr 18, 2021
06fe657
Removing special handling of lettered list indices
joacmue Apr 18, 2021
5dae428
Cleaning up the backspaces in tables & lists
joacmue Apr 18, 2021
4905e9a
not printing leading line break on table headers
joacmue Apr 18, 2021
1d2026a
Removing mess around handling breaks
joacmue Apr 18, 2021
45bb01b
Cleaning up custombreaks
joacmue Apr 25, 2021
99dc3c0
Adding empty cells for colspans
joacmue Apr 25, 2021
5f41808
Something was strange with the round function
joacmue Apr 25, 2021
2f61890
Making breaks on encounters of <BR> again
joacmue Apr 25, 2021
625bd5e
Removing special case for begin of <br>
joacmue Apr 25, 2021
83bcc5c
Explicitly parsing colnames for colspans now
joacmue May 9, 2021
724c33b
Making lawdown go over all laws without errors
joacmue May 13, 2021
074e674
Making multiline headers with colspan render nicer
joacmue May 14, 2021
bd96212
Cleaning up column list handling
joacmue May 14, 2021
0707c01
Rebased to master
joacmue May 15, 2021
10cff58
Re-rean bgbl_scraper, updated readme.md
joacmue May 15, 2021
c66ed0c
Merge branch 'master' of https://github.com/joacmue/gesetze-tools
joacmue May 15, 2021
978b871
aligned vkbl.json formatting with other files
joacmue May 15, 2021
4c76092
Minor fixes
darkdragon-001 May 15, 2021
Files changed
4 changes: 4 additions & 0 deletions .gitignore
@@ -1,3 +1,7 @@
test.zip
test.py
bgbl

Collaborator comment:
Who creates these files?

Reply:
That would be me. I did not want to push my notes and test scripts to the repo. I should probably store those outside the working folder and remove them from .gitignore.

laws
laws-md
.vscode
__pycache__
38 changes: 33 additions & 5 deletions README.md
@@ -4,7 +4,6 @@ BundesGit Gesetze Tools
These scripts are used to keep the law repository up to date.

Install requirements:

```bash
pip install -r requirements.txt
```
@@ -17,28 +16,57 @@ Downloads all laws as XML files from
[www.gesetze-im-internet.de](http://www.gesetze-im-internet.de/)
and extracts them to a directory.

Last tested: 2017-01-14 SUCCESS
### Usage
Update your list of laws first:
```bash
python lawde.py updatelist
python lawde.py loadall
```

Collaborator suggested change: drop `python lawde.py loadall` from this block ("This is still stated below and after the note.").

You can then download all laws by calling (**not recommended!**)
```bash
python lawde.py loadall
```
This will take approximately 2-3 hours.

Alternatively, you can look up the individual law you're interested in in [./data/laws.json](./data/laws.json), which is essentially a list of entries of this form:
```json
{"slug": "<shortname>", "name": "<longname>", "abbreviation": "<abbreviation>"}
```
You can download individual laws by calling (**recommended**)
```bash
python lawde.py load <shortname>
```
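For reference, a minimal sketch of how you might look up a slug in `data/laws.json` before calling `lawde.py load`. This assumes the file parses as a single JSON array of objects with the keys shown above; the abbreviation "BGB" is only an example search term:

```python
import json

# Hedged sketch: find the download slug for a law by its abbreviation.
# Assumes data/laws.json is a JSON array of objects with "slug", "name"
# and "abbreviation" keys, as in the example entry above.
with open('data/laws.json', encoding='utf-8') as f:
    laws = json.load(f)

for law in laws:
    if law.get('abbreviation') == 'BGB':  # example search term
        print(law['slug'], '-', law['name'])
        # Pass the printed slug to: python lawde.py load <shortname>
```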

Last tested: 2020-12-05 SUCCESS

## lawdown.py

Converts all XML laws to Markdown and copies them with other files related
to the law into specified working directory.

Last tested: 2017-01-14 SUCCESS
### Usage
```bash
python lawdown.py convert <inpath> <outpath>
python lawdown.py convert ./laws ./laws-md
```

Last tested: 2020-12-05 SUCCESS

## bgbl_scraper.py

Scrapes the table of contents of all issues of the Bundesgesetzblatt and dumps
the result to JSON.

Last tested: 2017-01-14 FAILED ("KeyError: xaversid")
Last tested: 2020-12-05 FAILED ("KeyError: xaversid")
The issue seems to be a restructuring of the Bundesanzeiger webpage; the original bgbl links now return a 404 error.

## banz_scraper.py

Scrapes the table of contents of all available issues of the Bundesanzeiger and
dumps the result to JSON.

Last tested: 2017-01-14 SUCCESS
Last tested: 2020-12-23 SUCCESS

## vkbl_scraper.py

48 changes: 37 additions & 11 deletions banz_scraper.py
@@ -17,6 +17,8 @@
banz_scaper.py data/banz.json

"""
import os
import sys
from pathlib import Path
import re
import json
@@ -34,6 +36,11 @@ class BAnzScraper:
LIST = ('genericsearch_param.edition=%s&genericsearch_param.sort_type='
'&%%28page.navid%%3Dofficial_starttoofficial_start_update%%29='
'Veröffentlichungen+anzeigen')
# Website changed, so I am changing the links here
BASE_URL = 'https://www.bundesanzeiger.de/pub/de/amtlicher-teil?'
BASE = ''
YEAR = '&year=%s'
LIST = '&edition=BAnz+AT+%s'

MONTHS = ['Januar', 'Februar', 'März', 'April', 'Mai', 'Juni', 'Juli',
'August', 'September', 'Oktober', 'November', 'Dezember']
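To illustrate how the rewritten URL constants above compose into request URLs, here is a standalone sketch. The year and edition values are made-up examples, and the exact edition string format is an assumption; in the scraper it comes from the date dropdown parsed in get_dates():

```python
# Illustrative only: how the new URL constants compose (values are examples).
BASE_URL = 'https://www.bundesanzeiger.de/pub/de/amtlicher-teil?'
BASE = ''
YEAR = '&year=%s'
LIST = '&edition=BAnz+AT+%s'

landing_url = BASE_URL + BASE                 # fetched by get_years() for the year dropdown
year_url = BASE_URL + YEAR % 2020             # fetched by get_dates() for one year's editions
items_url = BASE_URL + YEAR % 2020 + LIST % '22.12.2020'  # fetched by get_items(); date is a made-up example
print(items_url)
# https://www.bundesanzeiger.de/pub/de/amtlicher-teil?&year=2020&edition=BAnz+AT+22.12.2020
```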
@@ -55,30 +62,38 @@ def scrape(self, low=0, high=10000):

def get_years(self):
url = self.BASE_URL + self.BASE
# this is the landing page of the Bundesanzeiger
# https://www.bundesanzeiger.de/ebanzwww/wexsservlet?page.navid=to_official_part&global_data.designmode=eb
# which resolves to: https://www.bundesanzeiger.de/pub/de/amtlicher-teil
response = self.get(url)
years = []
root = lxml.html.fromstring(response.text)
selector = '#td_sub_menu_v li'
selector = '#id3' # was: selector = '#td_sub_menu_v li'
# This is the YEAR dropdown selector on top of the table (checked 2020/12/22)
for li in root.cssselect(selector):
try:
year = int(li.text_content())
years += [int(x) for x in li.text_content().split('\n') if x]
#was: year = int(li.text_content())
except ValueError:
continue
years.append(year)
#was: years.append(year)
return years
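As a standalone illustration of the dropdown-parsing approach used in get_years() above: take text_content() of the selected element, split on newlines, and keep the non-empty pieces. The markup below is a toy stand-in; the real Bundesanzeiger page and the '#id3' selector may well differ:

```python
import lxml.html

# Toy markup standing in for the year dropdown on the real page.
html = """
<html><body>
  <select id="id3">
    <option>2019</option>
    <option>2020</option>
    <option>2021</option>
  </select>
</body></html>
"""

root = lxml.html.fromstring(html)
years = []
for el in root.cssselect('#id3'):
    # text_content() concatenates the option texts; split on newlines and
    # keep only the non-empty pieces, as get_years() does above.
    years += [int(x) for x in el.text_content().split('\n') if x.strip()]
print(years)  # [2019, 2020, 2021]
```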

def get_dates(self, year):
url = self.BASE_URL + self.YEAR % year
response = self.get(url)
dates = []
root = lxml.html.fromstring(response.text)
selector = 'select[name="genericsearch_param.edition"] option'
selector = '#id4' # was: selector = 'select[name="genericsearch_param.edition"] option'
# This is the DATE dropdown selector on top of the table (checked 2020/12/22)
for option in root.cssselect(selector):
dates.append((option.attrib['value'], option.text_content().strip()))
#was: dates.append((option.attrib['value'], option.text_content().strip()))
dates += [x for x in option.text_content().split('\n') if x]
return dates

def get_items(self, year, date):
url = self.BASE_URL + self.LIST % date[0]
#url = self.BASE_URL + self.LIST % date[0]
url = self.BASE_URL + self.YEAR % year + self.LIST % date
response = self.get(url)
items = {}
root = lxml.html.fromstring(response.text)
@@ -121,14 +136,25 @@ def main(arguments):
maxyear = arguments['<maxyear>'] or 10000
minyear = int(minyear)
maxyear = int(maxyear)
print('This will scrape information from the Bundesanzeiger between ' + str(minyear) + ' and ' + str(maxyear) + '.')
print('Results will be stored in ' + arguments['<outputfile>'])
print('You will see all dates with publications appear below as they are parsed.')
banz = BAnzScraper()
data = {}
if Path(arguments['<outputfile>']).exists():
with open(arguments['<outputfile>']) as f:
data = json.load(f)
if os.path.exists(arguments['<outputfile>']):
if (sys.version_info > (3, 0)):
with open(arguments['<outputfile>']) as f:
data = json.load(f)
else:
with file(arguments['<outputfile>']) as f:
data = json.load(f)
data.update(banz.scrape(minyear, maxyear))
with open(arguments['<outputfile>'], 'w') as f:
json.dump(data, f)
if (sys.version_info > (3, 0)):
with open(arguments['<outputfile>'], 'w+', encoding='utf8') as f:
json.dump(data, f, indent=2, sort_keys=True, ensure_ascii=False)
else:
with file(arguments['<outputfile>'], 'w', encoding='utf8') as f:
json.dump(data, f)
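Since the rest of this PR targets Python 3 (per the "python3-ified" commits), the version-check branches above could likely be collapsed. A minimal Python-3-only sketch of the same load/dump behaviour, using the same pretty-printing options; the same pattern recurs in bgbl_scraper.py below:

```python
import json
from pathlib import Path

def load_existing(outputfile: str) -> dict:
    """Return previously scraped data if the output file already exists, else {}."""
    path = Path(outputfile)
    if path.exists():
        with open(path, encoding='utf8') as f:
            return json.load(f)
    return {}

def dump_pretty(data: dict, outputfile: str) -> None:
    """Write the merged data with the pretty-printing options used above."""
    with open(outputfile, 'w', encoding='utf8') as f:
        json.dump(data, f, indent=2, sort_keys=True, ensure_ascii=False)
```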

if __name__ == '__main__':
from docopt import docopt
24 changes: 18 additions & 6 deletions bgbl_scraper.py
@@ -16,12 +16,15 @@
from pathlib import Path
import re
import json
import sys
from collections import defaultdict

import lxml.html
import requests


# Landing page might be this one:
# https://www.bgbl.de/xaver/bgbl/start.xav#__bgbl__%2F%2F*[%40attr_id%3D'I_2020_57_inhaltsverz']__1607176275258
# https://www.bgbl.de/xaver/bgbl/start.xav?start=//*[@attr_id=%27%27]#__bgbl__%2F%2F*%5B%40attr_id%3D%27I_2020_62_inhaltsverz%27%5D__1608231069168
class BGBLScraper:
BASE_URL = 'http://www.bgbl.de/Xaver/'
START = 'start.xav?startbk=Bundesanzeiger_BGBl'
@@ -85,6 +88,7 @@ def parse(self, response):

def get_base_toc(self):
url = self.BASE_URL + self.BASE_TOC
print(url)
response = self.get(url)
root = self.parse(response)
selector = 'a.tocEntry'
@@ -214,12 +218,20 @@ def main(arguments):
maxyear = int(maxyear)
bgbl = BGBLScraper()
data = {}
if Path(arguments['<outputfile>']).exists():
with open(arguments['<outputfile>']) as f:
data = json.load(f)
if os.path.exists(arguments['<outputfile>']):
if (sys.version_info > (3, 0)):
with open(arguments['<outputfile>'], 'r') as f:
data = json.load(f)
else:
with file(arguments['<outputfile>']) as f:
data = json.load(f)
data.update(bgbl.scrape(minyear, maxyear))
with open(arguments['<outputfile>'], 'w') as f:
json.dump(data, f)
if (sys.version_info > (3, 0)):
with open(arguments['<outputfile>'], 'w+', encoding='utf8') as f:
json.dump(data, f, indent=2, sort_keys=True, ensure_ascii=False)
else:
with file(arguments['<outputfile>'], 'w') as f:
json.dump(data, f)

if __name__ == '__main__':
from docopt import docopt
289,504 changes: 289,503 additions & 1 deletion data/banz.json

Large diffs are not rendered by default.