Some bills (maybe 1/7 of them) give module 'lxml.html' has no attribute 'entities' #257

demongolem · 2020-04-10T12:11:58Z

I just downloaded all the bills using ./run govinfo --bulkdata=BILLSTATUS. Then I ran ./run bills. I believe there are currently about 50,000 items to process. About 7,300 of them failed with exactly the same stack trace, which is included at the bottom of this issue report.

The failing files are scattered across bill types (hconres in this case) and sessions (113 in this case). I validated a few of the XML files returned by the govinfo run; they were valid and looked good. My guess is that certain documents contain characters, or something of that sort, that trigger the error. I will look into it further, but perhaps others have insight into what is going on. This might just be a problem with my using Python 3 and the difference between str and bytes in Python 2 and Python 3. So far, though, I have gotten it to work with Python 3 (work I can share at some point if my version ever fully works).

[hconres25-113] Exception:

Traceback (most recent call last):

File "/home/gwerner/from_greg/congress/tasks/utils.py", line 178, in process_$
results = fetch_func(id, options, *extra_args)

File "/home/gwerner/from_greg/congress/tasks/bills.py", line 101, in process_$
bill_data = form_bill_json_dict(xml_as_dict)

File "/home/gwerner/from_greg/congress/tasks/bills.py", line 173, in form_bil$
'summary': bill_info.summary_for(bill_dict['summaries']['billSummaries']),

File "/home/gwerner/from_greg/congress/tasks/bill_info.py", line 185, in summ$
"text": strip_tags(summary['text']),

File "/home/gwerner/from_greg/congress/tasks/bill_info.py", line 199, in stri$
text = utils.unescape(text)

File "/home/gwerner/from_greg/congress/tasks/utils.py", line 470, in unescape
text = re.sub("&#?\w+;", fixup, text)

File "/usr/lib64/python3.6/re.py", line 191, in sub
return _compile(pattern, flags).sub(repl, string, count)

File "/home/gwerner/from_greg/congress/tasks/utils.py", line 465, in fixup
text = chr(html.entities.name2codepoint[text[1:-1]])

AttributeError: module 'lxml.html' has no attribute 'entities'


demongolem · 2020-04-10T12:23:48Z

Yeah, I have found how to correct one file, hconres2-113. It contained &quot; around a word. If I replace it with the double quote character before passing the text to utils.unescape(text), the bill is processed successfully.

Also, something like Air Force RDT&E; creates problems because the regex matches &E; and treats it as an entity, when really it is part of the actual text and not an HTML entity.
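To make the two cases above concrete, here is a minimal sketch of a more tolerant fixup callback. It is not the project's code. It assumes the AttributeError comes from the name html in utils.py referring to lxml.html rather than the standard-library package, so it imports name2codepoint directly. The names ENTITY_RE, tolerant_fixup, and unescape_tolerant are hypothetical.

```python
import re
from html.entities import name2codepoint  # stdlib table, imported directly so a shadowed `html` name cannot interfere

# Same pattern as in the traceback, written as a raw string.
ENTITY_RE = re.compile(r"&#?\w+;")


def tolerant_fixup(match):
    """Decode one entity; return the matched text unchanged if it is not a real entity."""
    text = match.group(0)
    if text.startswith("&#"):
        # Numeric character reference: decimal (&#34;) or hexadecimal (&#x22;).
        try:
            if text[2:3] in ("x", "X"):
                return chr(int(text[3:-1], 16))
            return chr(int(text[2:-1]))
        except (ValueError, OverflowError):
            return text
    # Named entity: look it up, and leave unknown names such as &E; untouched.
    codepoint = name2codepoint.get(text[1:-1])
    return chr(codepoint) if codepoint is not None else text


def unescape_tolerant(text):
    """Replace every recognizable entity in text, leaving look-alikes as they are."""
    return ENTITY_RE.sub(tolerant_fixup, text)


if __name__ == "__main__":
    print(ENTITY_RE.findall("Air Force RDT&E; budget"))   # the regex does match &E;
    print(unescape_tolerant("Air Force RDT&E; budget"))   # but the text is left alone
    print(unescape_tolerant("&quot;word&quot;"))          # real entities are decoded
```

The key design choice is using dict.get instead of indexing, so text that only looks like an entity passes through unchanged rather than raising.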

I think that for Python 3 anyway, this would take care of much of it without resorting to the fixup function (using ht because html is already a variable in the code)

import html as ht
text = ht.unescape(text)

So in utils.py, the solution for Python 3 anyway would be to change the bit in the unescape function to this

try:
    text = ht.unescape(text)
except Exception as e:
    print(repr(e))
# this line does not appear necessary for Python 3
# in fact it will cause errors
# text = re.sub("&#?\w+;", fixup, text)
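As a quick check of the approach above, the following sketch runs the standard-library html.unescape (available since Python 3.4) on the two kinds of input mentioned in this thread. The sample strings are illustrative and were not taken from the actual bill files.

```python
import html as ht  # aliased, as suggested above, so it cannot collide with another `html` name

samples = [
    "&quot;word&quot;",          # a named entity wrapped around a word
    "Air Force RDT&E; budget",   # looks like an entity but is ordinary text
    "A &amp; B &#169; C",        # named and numeric references mixed
]

for sample in samples:
    # html.unescape decodes known references and leaves unknown ones such as &E; as they are.
    print(repr(sample), "->", repr(ht.unescape(sample)))
```

One thing to keep in mind is that html.unescape follows HTML5 rules, so it also decodes some references written without a trailing semicolon (for example &amp), which the original regex would not have touched.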
