Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WikiCorpus.filter_wiki/remove_markup don't remove heading-markup #2561

Open
0x0badc0de opened this issue Jul 26, 2019 · 8 comments · May be fixed by #2622
Open

WikiCorpus.filter_wiki/remove_markup don't remove heading-markup #2561

0x0badc0de opened this issue Jul 26, 2019 · 8 comments · May be fixed by #2622
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills good first issue Issue for new contributors (not required gensim understanding + very simple) Hacktoberfest Issues marked for hacktoberfest help wanted impact LOW Low impact on affected users reach LOW Affects only niche use-case users

Comments

@0x0badc0de
Copy link

Problem description

I am trying to get clean wiki texts. But still getting headings markup.

Steps/code/corpus to reproduce

Just create WikiCorpus and call get_texts. Some texts will contain ==headings== (different number of = and different headings, of course).
Simple test-case:

>>> import gensim.corpora.wikicorpus
>>> print(gensim.corpora.wikicorpus.filter_wiki('===heading==='))
===heading===

Versions

Python 3.7.4 (default, Jul 13 2019, 14:04:11) 
[GCC 8.3.0]
NumPy 1.16.4
SciPy 1.3.0
gensim 3.8.0
FAST_VERSION 1
@mpenkov mpenkov added the bug Issue described a bug label Jul 29, 2019
@mpenkov mpenkov added good first issue Issue for new contributors (not required gensim understanding + very simple) help wanted difficulty medium Medium issue: required good gensim understanding & python skills Hacktoberfest Issues marked for hacktoberfest labels Sep 28, 2019
@marat-s marat-s linked a pull request Oct 6, 2019 that will close this issue
@gojomo
Copy link
Collaborator

gojomo commented Oct 7, 2019

Many users will want headings preserved - they include great text for full-text indexing, learning word-vectors, and other purposes.

As the headings are very recognizable in the text, users who don't want them should probably filter them out themselves, in a post-WikiCorpus step. Or, WikiCorpus could include an option to strip headings – but I would suggest that both for continuity with current behavior, and the many users who will want them retained, that any such option should be default 'off'.

@0x0badc0de
Copy link
Author

0x0badc0de commented Oct 7, 2019

Names like get_texts, remove_markup suggest that the output should contain text only, no any markup at all. Surely, that's up to you to decide but such behaviour looks just confusing.

Also look at documentation of filter_wiki:

Filter out wiki markup from `raw`, leaving only text.
...
    Returns
    -------
    str
        `raw` without markup.

The same for remove_markup.

PS: It can be a wrong title of this ticket. It is not about removing heading but heading markup. So instead of ==Heading== I'd expect to have just Heading on output.

@gojomo
Copy link
Collaborator

gojomo commented Oct 7, 2019

Thanks for clarifying, I understand much better now!

Yes, the methods/comments indicate 'markup' will be removed, which creates an expectation problem. And based on both the issue title, and the patch-submitter's initial implementation, I initially thought you wanted the headings entirely stripped – which seemed wrong to me, because even though the headings often aren't proper sentences, they've still always been of interest (even necessary!) when I've used wiki text.

How about:

  • a parameter option, retain_heading_markup, default True, which retains the current behavior for continuity for anyone still expecting it
  • if False, the markup (paired leading/training '=' or '-' characters) is removed, but the section-heading text itself is retained
  • in addition to documenting the new parameter, in places where names give the other impression (like remove_markup, get_texts, etc) add note to the effect of "Section heading markup (eg '== Early Life ==', etc) is retained unless class retain_heading_markup parameter is set false."

For completeness, as this markup slipped by, a quick visual check of a handful of longer article texts could be done to see if any other present-but-likely-unwanted markup persists – and then if so, depending on the complexity/desirability of removing that, either fixing now or adding a new issue.

@gojomo gojomo changed the title WikiCorpus.filter_wiki/remove_markup don't remove headings WikiCorpus.filter_wiki/remove_markup don't remove heading-markup Oct 7, 2019
@0x0badc0de
Copy link
Author

Sounds good. Probably in some new major version (where behaviour changes are acceptable) this flag can be set to False by default with an entry in release notes.
If there is a possibility for some other markups to pass by, probably it worth using some generic name for the flag like retain_important_markup or retain_gensim_v3_markup, or even really_clean_text_with_no_any_markup with default to False :)

@piskvorky piskvorky added reach LOW Affects only niche use-case users impact LOW Low impact on affected users labels Oct 8, 2019
@dsmith47
Copy link

Drawn in by the hacktoberfest label. Would be interested in working on this if it still needs done.

@gojomo
Copy link
Collaborator

gojomo commented Oct 15, 2019

Hi, @dsmith47! There's an existing PR, #2622 by @marat-s, that took a swing at this, but review of that proposed fix helped to reveal that the need was actually a bit different.

@marat-s, are you planning to update your PR to address the followup clarification/discussion?

@marat-s
Copy link

marat-s commented Oct 15, 2019

Hi @gojomo, sure, let's make it done. I'll adjust the PR during the week according to the discussion.

@marat-s
Copy link

marat-s commented Oct 22, 2019

Hi there, please find the adjustments at #2622

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Issue described a bug difficulty medium Medium issue: required good gensim understanding & python skills good first issue Issue for new contributors (not required gensim understanding + very simple) Hacktoberfest Issues marked for hacktoberfest help wanted impact LOW Low impact on affected users reach LOW Affects only niche use-case users
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants