Some bills (maybe 1/7 of them) give module 'lxml.html' has no attribute 'entities' #257

Open
demongolem opened this issue Apr 10, 2020 · 1 comment

I just downloaded all the bills using ./run govinfo --bulkdata=BILLSTATUS. Then I went on to ./run bills. I believe there are currently about 50,000 items to be processed from running it. About 7,300 of them failed with exactly the same stack trace, which is at the bottom of this issue report.

The files are scattered among type (hconres in this case) and session (113 in this case). I validated a few of the XML files that were returned from the govinfo run; they were valid and looked good. This leads me to believe (and this is my guess) that certain documents contain characters, or something of that sort, which cause the error. I will look into it more, but perhaps others have insight into what is going on. This might just be a problem with me using Python 3 and the difference between str and bytes in Python 2 and Python 3. However, so far I have gotten it to work with Python 3 (work I can share at some point if my version ever fully works).

[hconres25-113] Exception:

Traceback (most recent call last):
  File "/home/gwerner/from_greg/congress/tasks/utils.py", line 178, in process_$
    results = fetch_func(id, options, *extra_args)
  File "/home/gwerner/from_greg/congress/tasks/bills.py", line 101, in process_$
    bill_data = form_bill_json_dict(xml_as_dict)
  File "/home/gwerner/from_greg/congress/tasks/bills.py", line 173, in form_bil$
    'summary': bill_info.summary_for(bill_dict['summaries']['billSummaries']),
  File "/home/gwerner/from_greg/congress/tasks/bill_info.py", line 185, in summ$
    "text": strip_tags(summary['text']),
  File "/home/gwerner/from_greg/congress/tasks/bill_info.py", line 199, in stri$
    text = utils.unescape(text)
  File "/home/gwerner/from_greg/congress/tasks/utils.py", line 470, in unescape
    text = re.sub("&#?\w+;", fixup, text)
  File "/usr/lib64/python3.6/re.py", line 191, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/gwerner/from_greg/congress/tasks/utils.py", line 465, in fixup
    text = chr(html.entities.name2codepoint[text[1:-1]])
AttributeError: module 'lxml.html' has no attribute 'entities'
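
The message suggests that, inside utils.py, the name html is bound to lxml.html rather than to the standard library's html package, so html.entities cannot be resolved. The following is only a sketch of that failure mode: it uses an empty stand-in module named "lxml.html" (an assumption made so the example runs without lxml installed) instead of the real import.

```python
import types
import html.entities  # binds the name `html` to the standard-library package

stdlib_html = html  # keep a reference to the real stdlib module

# Stand-in for something like `from lxml import html` (assumed, not taken
# from the project's source): the name `html` now points at another module.
html = types.ModuleType("lxml.html")

try:
    html.entities.name2codepoint["quot"]
except AttributeError as exc:
    message = str(exc)

print(message)

# The stdlib module is still reachable under the other name.
print(stdlib_html.entities.name2codepoint["quot"])
```

Under that assumption, importing the standard-library module under a different name avoids the collision.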

demongolem commented Apr 10, 2020

Yeah, I have found how to correct one file, that being hconres2-113. It contained &quot; around a word. If I replace it with the double quote character before passing the text to utils.unescape(text), the bill is processed successfully.

Also something like Air Force RDT&E; creates problems because the regex detects &E; and thinks that is fishy, but really it is part of actual text and not HTML encoding.
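
To make the RDT&E; case concrete, here is a small check using only the standard library. The pattern is the one shown in the traceback; the sample string and the variable names are made up for illustration.

```python
import html
import re
from html.entities import name2codepoint

# Same pattern as in the traceback's re.sub call.
ENTITY_RE = re.compile(r"&#?\w+;")

sample = "Air Force RDT&E; budget"  # illustrative text, not from a real bill

# The pattern picks up "&E;" even though it is not an HTML entity.
matches = ENTITY_RE.findall(sample)

# "E" is not a named entity, so a direct name2codepoint lookup would fail.
is_known_entity = "E" in name2codepoint

# html.unescape leaves unknown references alone and decodes real ones.
unchanged = html.unescape(sample)
decoded = html.unescape("&quot;word&quot;")

print(matches, is_known_entity, unchanged, decoded)
```

So a lookup keyed on whatever the regex matched has to handle names that are not entities, whereas html.unescape already passes them through untouched.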

I think that for Python 3 anyway, this would take care of much of it without resorting to the fixup function (using ht because html is already a variable in the code)

import html as ht
text = ht.unescape(text)

So in utils.py, the solution for Python 3 anyway would be to change the bit in the unescape function to this

try:
    text = ht.unescape(text)
except Exception as e:
    print(repr(e))
# this line does not appear necessary for Python 3
# in fact it will cause errors
# text = re.sub("&#?\w+;", fixup, text)
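
As a self-contained sketch of that idea (not the project's actual utils.py; the function below only borrows the unescape name for illustration), the whole helper could reduce to a thin wrapper around the standard library:

```python
import html as ht  # aliased because `html` is already taken in utils.py


def unescape(text):
    """Decode HTML character references in text.

    Hypothetical replacement for the regex-plus-fixup approach; it relies
    entirely on the standard library's html.unescape (Python 3.4+).
    """
    return ht.unescape(text)


result = unescape("&quot;Hello&quot; &amp; RDT&E; &#169;")
print(result)
```

Named references (&quot;, &amp;) and numeric ones (&#169;) are decoded, while the literal RDT&E; passes through unchanged, so no separate regex pass is needed.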
