Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"Trimming pages file" is parsing records wrong when the title contains parenthesis #100

Open
hut8 opened this issue Oct 29, 2022 · 3 comments
Assignees

Comments

@hut8
Copy link

hut8 commented Oct 29, 2022

I'm trying to import the dump from enwiki-20221001

+----------+--------------------+
| page_id  | page_title         |
+----------+--------------------+
| 71701640 | 104-2,3,(6),(7),11 |
+----------+--------------------+

This ends up creating this line (which has the wrong title, and also has only 2 columns instead of three) in pages.txt.gz:

71701640   104-2,3,(6

Here's some context for surrounding lines:

71701608        Alberta_Sovereignty_Act 0
71701611        Homeland_Defence_Act    0
71701613        Berlin_Nobody   0
71701617        Miss_Grand_Nepal_2022   0
71701639        Pgm2_c  1
71701640   104-2,3,(6
71701649        2022_Binh_Duong_karaoke_bar_fire        0
71701668        Chapel_of_the_Christ,_San_Pablo_del_Monte       0
71701673        Ximena_Aguilera 0
71701676        Wedding_dress_of_Katharine_Worsley      0
71701682        2022–23_Central_Michigan_Chippewas_men\'s_basketball_team       0

I will do some more research on this shortly.

@jwngr
Copy link
Owner

jwngr commented Nov 7, 2022

Thanks for opening this! I don't have time to investigate this myself, but would gladly marge in a PR if you put one together with a fix.

@corneliusroemer
Copy link

I think this should be easy to fix with a more sophisticated splitting regex, rather than splitting on ),(, we can split on \'\),\(([1-9][0-9]*,[0-9]+,\')

While this isn't absolutely foolproof (i.e. someone could make an adversarial redirect that would screw us), it's quite unlikely that this will happen.

),( is already unlikely, now '),(1,0,' is much more unlkely, as it requires an extra:

  • ' before
  • number comma number comma ' after

I'm rewriting the pipeline in snakemake to make it easier to debug and run in parallel.

@jwngr
Copy link
Owner

jwngr commented Dec 24, 2023

Thanks for the help @corneliusroemer - I would gladly merge in the snakemake PR if you're willing to open it.

@hut8 - Thanks for getting the conversation started!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants