Better parsing of nested sections #1750

Open
ajparsons opened this issue Dec 14, 2023 · 2 comments

Comments

@ajparsons
Contributor

Related to mysociety/parlparse#171 - but I think can be improved just in display.

There's something a bit off about how TWFY parses some complicated debates.

The navigation structure assumes: header starts, header ends, header starts, header ends.

But in practice, headings are sometimes nested.

For example, https://www.theyworkforyou.com/debates/?id=2020-06-30d.191.3 logically contains all the votes in the following 'debates', but these are split off because of the new headers.

Parliament's own site, by contrast, brings them all together on one page: https://hansard.parliament.uk/Commons/2020-06-30/debates/581DFFF9-B3ED-4B76-9F51-A1F2325334A6/ImmigrationAndSocialSecurityCo-Ordination(EUWithdrawal)Bill

In practice, the problem I have is making the link between a vote and its debate clearer.

Currently there isn't a good link in the tree: the parent debate contains only the text of the amendment (which is useful) but not the discussion, while the top-level debate (which I guess we could link to instead) does not contain the vote itself.
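
As a rough illustration of linking to the top-level debate instead, here is a minimal Python sketch. It assumes the scraped XML is a flat sequence of `major-heading`, `minor-heading`, `speech` and `division` elements; the `division` tag name, the `publicwhip` wrapper and all ids in the sample are assumptions made up for the example, and this is not how TWFY currently builds its navigation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: ids and the wrapper element are invented for illustration.
SAMPLE = """<publicwhip>
<major-heading id="major.1">Example Bill</major-heading>
<speech id="speech.1">Opening discussion</speech>
<minor-heading id="minor.1">New Clause 7</minor-heading>
<division id="div.1"/>
<minor-heading id="minor.2">Amendment 34</minor-heading>
<division id="div.2"/>
</publicwhip>"""


def link_divisions(root):
    """Map each division id to (top-level heading id, nearest heading id)."""
    links = {}
    major = minor = None
    for el in root:
        if el.tag == "major-heading":
            # A new top-level debate starts; forget any earlier minor heading.
            major, minor = el.get("id"), None
        elif el.tag == "minor-heading":
            minor = el.get("id")
        elif el.tag == "division":
            # Fall back to the major heading if no minor heading precedes the vote.
            links[el.get("id")] = (major, minor or major)
    return links


links = link_divisions(ET.fromstring(SAMPLE))
```

Here both votes map back to `major.1`, so a vote page could link to the full discussion as well as to the heading that holds the amendment text.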

@dracos
Member

dracos commented Dec 14, 2023

The issue is how the parsing code detects (or doesn't detect) headings, which has always been a problem; see e.g. mysociety/parlparse#53. I think Parliament's site is bad in the opposite way, in that "New Clause 7" (the "heading" of the second vote on that page) is output as pure body text, with no real way of noticing it's something new.

If you look at the source https://www.theyworkforyou.com/pwdata/scrapedxml/debates/debates2020-06-30d.xml you'll see we have it as:

<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.0" nospeaker="true" colnum="269" time="18:00:00" url=""> New Clause 7 </minor-heading>
<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.1" nospeaker="true" colnum="269" time="18:00:00" url=""> Time limit on immigration detention for EEA and Swiss nationals </minor-heading>

I thought there was code to combine two minor-headings like that on import if it found them, but presumably there isn't, or it's not working in some way. I see why it might be nice to have them all on one page, but that does make large debates even more unwieldy. I think you'd have to introduce more structure to the output if you wanted to do anything with this, and it's never been worth the effort involved.
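
For reference, combining two adjacent minor-headings could look something like the Python sketch below. This is only an illustration, not the existing import code: the function name, the `": "` separator and the `publicwhip` wrapper element are assumptions, and only the two `minor-heading` lines come from the source file above.

```python
import xml.etree.ElementTree as ET

# The two headings are copied from the scraped file; the wrapper is assumed.
SAMPLE = """<publicwhip>
<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.0" nospeaker="true" colnum="269" time="18:00:00" url=""> New Clause 7 </minor-heading>
<minor-heading id="uk.org.publicwhip/debate/2020-06-30d.269.1" nospeaker="true" colnum="269" time="18:00:00" url=""> Time limit on immigration detention for EEA and Swiss nationals </minor-heading>
</publicwhip>"""


def merge_adjacent_minor_headings(root, sep=": "):
    """Fold each minor-heading into the minor-heading directly before it."""
    previous = None
    for el in list(root):  # iterate over a copy, since we remove while looping
        if (
            el.tag == "minor-heading"
            and previous is not None
            and previous.tag == "minor-heading"
        ):
            # Append this heading's text to the earlier one and drop it.
            previous.text = (previous.text or "").strip() + sep + (el.text or "").strip()
            root.remove(el)
        else:
            previous = el
    return root


root = merge_adjacent_minor_headings(ET.fromstring(SAMPLE))
headings = root.findall("minor-heading")
```

The merged heading keeps the first element's id, so anything already pointing at the second id (`...269.1` here) would need redirecting.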

@ajparsons
Contributor Author

Yeah, I was specifically looking for debates with multiple votes to test a motion extractor - and that flushed out ones like this where things are more spread out than I expected.

If we sketched out (and funded) a project around clearer understanding of amendments and legislative process - a good approach to this would fit into it.
