Remove File/Image links in strip_code() #194

Open · wants to merge 2 commits into main
Conversation

@Mottl commented Aug 11, 2018

Closes issue #136

@Mottl (Author) commented Aug 11, 2018

The problem with AppVeyor is not related to the changes in the code. @earwig

@earwig (Owner) commented Aug 11, 2018

Thanks, I'll look into the AppVeyor issue.

Unfortunately, I'm not sure about this approach: now we need to maintain this namespace list for every language, which introduces an additional maintenance burden. (This is one of my problems with the original issue.)

I would prefer to have a way to retrieve this dynamically for the current wiki. That means mwparserfromhell needs a way to learn the context of the text it's parsing, and needs to grow functions to retrieve configuration from a live wiki (or some way for the user to provide it without an annoying amount of boilerplate: maybe predefined config for known wikis and a way to generate/cache it for unknown wikis?). This kind of thing has been largely out of scope for mwparserfromhell so far, but I'm willing to accept it as a feature if we can come up with a clean implementation. Of course, that's a much more involved task than what you've done here.

If we stick with this solution, there are some edge cases not covered. For example, [[Imagem:Foo]] on enwiki is a normal link (Imagem is not a namespace there), but this change will strip it out, which isn't the correct behavior. You might argue that's an unlikely case, but it makes me a bit uncomfortable. Unintentionally, I think, it will also strip out plain [[ImagemFoo]], which should be easy to fix.
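
A minimal sketch of a colon-aware check that fixes the second problem (a hypothetical helper, not the PR's actual code; English-only prefixes for illustration):

STRIP_NAMESPACES = ("file", "image", "media")

def should_strip(title):
    # Require the ":" separator so that a bare prefix match like
    # "[[ImagemFoo]]" is not stripped by accident.
    prefix, sep, _ = title.strip().partition(":")
    return bool(sep) and prefix.strip().lower() in STRIP_NAMESPACES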

It might be a good idea to make this a configurable feature of strip_code, using kwargs. Since image captions are arguably part of the article text, I would be sad if we didn't have a way to preserve them. For inspiration, see how Template.__strip__ handles this. However, this means we need to parse out things like thumbnail options and potentially alt text, which I've avoided until now.
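
To illustrate the kwargs idea (parameter names are hypothetical; strip_code does not currently accept them):

wikicode = mwparserfromhell.parse(text)

# Proposed behavior: drop file links, but keep their captions as text.
stripped = wikicode.strip_code(strip_file_links=True, keep_file_captions=True)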

We also need some tests in TestWikilink.test_strip.

More stylistic things:

  • You can replace the startswith for-loop with just _title.startswith(tuple(self._strip_links)), since str.startswith accepts a tuple of prefixes. (Or make self._strip_links a tuple in the first place; see the sketch after this list.)
  • If we do end up with a hardcoded namespace list like this, it should be moved to mwparserfromhell/definitions.py.
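
A before/after sketch of the first point (variable names taken from the PR; the branch body is elided):

# Before: explicit loop over the namespace prefixes
for prefix in self._strip_links:
    if _title.startswith(prefix):
        ...  # strip this link

# After: str.startswith accepts a tuple of prefixes directly
if _title.startswith(tuple(self._strip_links)):
    ...  # strip this link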

@lahwaacz (Contributor) commented
If we end up using a context object based on the API results, hardcoding a list of namespace names per language is not necessary, because the English prefixes always work and the translations can be deduced from these "canonical" names. See the relevant API query.
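
For reference, a sketch of that query and of building the canonical-to-local mapping from its result (requests-based; error handling omitted):

import requests

def fetch_namespace_names(api_url):
    # Maps each canonical English namespace name to the set of
    # local names and aliases reported by the wiki.
    params = {"action": "query", "meta": "siteinfo",
              "siprop": "namespaces|namespacealiases", "format": "json"}
    data = requests.get(api_url, params=params).json()["query"]
    names = {}
    for ns in data["namespaces"].values():
        if "canonical" in ns:
            names.setdefault(ns["canonical"], set()).add(ns["*"])
    for alias in data["namespacealiases"]:
        canonical = data["namespaces"][str(alias["id"])].get("canonical")
        if canonical:
            names[canonical].add(alias["*"])
    return names

# fetch_namespace_names("https://uk.wikipedia.org/w/api.php")["File"]
# -> {"Файл", "Зображення", ...}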

I'm totally against doing any API queries directly from mwparserfromhell. Besides the unnecessary code bloat, that would also make it difficult to integrate mwparserfromhell into projects which want to handle all queries themselves. I think a nice solution is the introduction of intermediate context classes through which users would supply the necessary information. Common wikis like enwiki could have an instance of the context provided by mwparserfromhell.
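
One possible shape for such a context class (purely illustrative; none of these names exist in mwparserfromhell):

from dataclasses import dataclass

@dataclass(frozen=True)
class WikiContext:
    # Lower-cased namespace names that mark file/image/media links.
    file_namespaces: frozenset = frozenset({"file", "image", "media"})

    def is_file_link(self, title):
        prefix, sep, _ = title.partition(":")
        return bool(sep) and prefix.strip().lower() in self.file_namespaces

# Instances for common wikis could ship with the package; users of other
# wikis would build their own (e.g. from the API query above).
ENWIKI = WikiContext()
UKWIKI = WikiContext(file_namespaces=ENWIKI.file_namespaces
                     | {"файл", "зображення"})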

Also, since we're talking about parsing MediaWiki titles, you might be interested in my helper Title class. That's probably the simplest case of context-sensitive parsing of MediaWiki things, but there are still some problems (e.g. lahwaacz/wiki-scripts#38).

@Mottl (Author) commented Aug 12, 2018

I agree with @lahwaacz that it would be more convenient to take localized names from the API instead of hardcoding them. We could do something like this:

# returns a dict of 'en'->'hi' translations
specials = mwparserfromhell.getLocalizedSpecials(lang='hi')
strip_specials = [specials[i] for i in ['File', 'Image', 'Media']]

wikicode = mwparserfromhell.parse(text)
s = wikicode.strip_code(strip_specials=strip_specials)

mwparserfromhell.getLocalizedSpecials() can download and cache results from the Wikipedia API. We can also ship a default cached value inside the package so it works out of the box; it would be trivial to keep the default cached value updated in the repo, @earwig.
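
A hypothetical implementation of that idea with an on-disk cache (getLocalizedSpecials does not exist in mwparserfromhell, and the cache path is invented):

import json
import pathlib
import requests

_CACHE_DIR = pathlib.Path.home() / ".cache" / "mwparserfromhell"

def getLocalizedSpecials(lang="en"):
    # Return {'File': <local name>, 'Media': <local name>, ...},
    # hitting the network only on the first call per language.
    cache_file = _CACHE_DIR / f"specials-{lang}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    resp = requests.get(f"https://{lang}.wikipedia.org/w/api.php",
                        params={"action": "query", "meta": "siteinfo",
                                "siprop": "namespaces", "format": "json"})
    # NB: 'Image' is an alias rather than a canonical name, so a complete
    # version would also query siprop=namespacealiases.
    specials = {ns["canonical"]: ns["*"]
                for ns in resp.json()["query"]["namespaces"].values()
                if "canonical" in ns}
    _CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(specials))
    return specials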

@legoktm (Contributor) commented Aug 13, 2018

I would recommend hardcoding the English versions (Image, File) in mwparserfromhell, and then having a setter method so frameworks/scripts can add localized versions from the API. I agree that mwph should not be calling the API itself; or at least, if we decide that mwph should be wiki-config-specific like Parsoid, that should probably be done separately, with a little more thought about the architecture.

mwparserfromhell.parser.add_file_localization([...])
mwparserfromhell.parser.add_ns_localization({'file': [...]})
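
A sketch of what that setter-based design could look like internally (a hypothetical module-level registry; these functions are not existing API):

_file_namespaces = {"file", "image", "media"}  # hardcoded English defaults

def add_ns_localization(mapping):
    # Merge localized names fetched by the caller's own framework;
    # mwparserfromhell itself never touches the network.
    for canonical, names in mapping.items():
        if canonical.lower() == "file":
            _file_namespaces.update(name.lower() for name in names)

# A framework or script registers what it fetched from the API:
add_ns_localization({"file": ["Файл", "Зображення"]})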

@zverok commented Sep 16, 2018

If I may...

(Disclaimer: I am currently moving a huge wikiparsing-related project from Ruby to Python. In Ruby, I developed a wikitext parser myself from scratch, so I am kinda aware of some of the problems and possible solutions :)

My library was built on top of a MediaWiki API client, so for me the problem was simpler: all the context necessary for proper parsing was fetched along with the page. But in the case of a parsing-only library like mwparserfromhell, I'd do this:

  1. Hardcode the English names. They are "more equal than others": on Ukrainian Wikipedia, [[Image: will still work (as well as the localized Ukrainian aliases, which work only there).
  2. Provide a way to inject "namespace aliases" into the parser, like, IDK,
mwparserfromhell.parse(wiki, namespaces={'File': ['Зображення']})
  3. Document which raw API call, and which, say, pywikibot call, will fetch those values (see the sketch below).
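
For item 3: the raw call is action=query&meta=siteinfo&siprop=namespaces|namespacealiases, and with pywikibot it could look roughly like this (a sketch, assuming pywikibot's Namespace attributes):

import pywikibot

site = pywikibot.Site("uk", "wikipedia")
file_ns = site.namespaces[6]  # namespace 6 is File:
local_names = {file_ns.canonical_name, file_ns.custom_name, *file_ns.aliases}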

If you'd like, I can try to prepare the PR (though I am still not sure about my Python-fu; I started the transition not long ago).

@bt2901 mentioned this pull request Oct 15, 2019