Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve remove_markup handling of Wikipedia headers #2622

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

marat-s
Copy link

@marat-s marat-s commented Oct 6, 2019

Fix #2561

@gojomo
Copy link
Collaborator

gojomo commented Oct 7, 2019

The link to the issue in the PR description is good - but ideally the title will also describe the matters affected.

But also: see my comment on #2561 – despite the original reporter's preference that headings be stripped, I doubt that's the right default for everyone.

@mpenkov mpenkov changed the title Merging a bug fix #2561. Improve remove_markup handling of Wikipedia headers Oct 7, 2019
@mpenkov
Copy link
Collaborator

mpenkov commented Oct 7, 2019

I've taken the liberty of adjusting the title.

@gojomo Perhaps we can include a strip_header_markup keyword parameter (defaults to False) to handle both sides of the argument?

@gojomo
Copy link
Collaborator

gojomo commented Oct 7, 2019

@mpenkov A parameter-controlled option would certainly support the needs of the reporter, @0x0badc0de, and address my concern about inconveniencing others.

But this need might be narrow enough that the right advice, to those who need it, is to run the ~1-liner to strip headers yourself if you need it. (Roughly just a s/^(=+).*\1// regex replacement on the output text.)

@gojomo
Copy link
Collaborator

gojomo commented Oct 15, 2019

See the further discussion on #2561 for clarification on what kind of fix the original reporter was expecting, and would be best.

@@ -256,6 +262,9 @@ def remove_markup(text, promote_remaining=True, simplify_links=True):
text = re.sub(RE_P13, '\n', text) # leave only cell content
text = re.sub(RE_P17, '\n', text) # remove formatting lines

if not retain_heading_markup:
text = re.sub(RE_P18, '', text) # remove headings
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this retain the words of the heading? (I don't think so.)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gojomo,
The re.sub function using RE_P18 does retain the words in the heading. Please refer to the screenshot attached where I test if this works on a jupyter notebook.
image

A comment below asks for unit tests; If you could guide me, I could do that and fix any issues that could have been introduced due to the changes. I would appreciate any pointers and your expectations from the tests.

@mpenkov
Copy link
Collaborator

mpenkov commented Oct 24, 2019

Please add unit tests for your new functionaity.

@mpenkov mpenkov added this to Needs triage in PR triage Nov 3, 2019
@mpenkov mpenkov added this to Triage in PR triage 2021-06 Jun 22, 2021
@mpenkov mpenkov added the stale Waiting for author to complete contribution, no recent effort label Jun 22, 2021
@mpenkov mpenkov moved this from Unsorted to Handle after next release in PR triage 2021-06 Jun 22, 2021
@SagarDollin
Copy link

Please add unit tests for your new functionaity.

Hey @mpenkov, I could test this. Would you please guide me to your expectations from the unit test. And also any contribution guidelines before creating another PR. You can view a small test that I conducted here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale Waiting for author to complete contribution, no recent effort
Projects
PR triage
  
Needs triage
PR triage 2021-06
Handle after next release
Development

Successfully merging this pull request may close these issues.

WikiCorpus.filter_wiki/remove_markup don't remove heading-markup
5 participants