Remove File/Image links in strip_code() #194

Open · wants to merge 2 commits into main
Conversation

@Mottl commented Aug 11, 2018

Closes issue #136

@Mottl (Author) commented Aug 11, 2018

The problem with AppVeyor is not related to the changes in the code. @earwig

@earwig (Owner) commented Aug 11, 2018

Thanks, I'll look into the AppVeyor issue.

Unfortunately, I'm not sure about this approach: now we need to maintain this namespace list for every language, which introduces an additional maintenance burden. (This is one of my problems with the original issue.)

I would prefer to have a way to retrieve this dynamically for the current wiki. That means mwparserfromhell needs a way to learn the context of the text it's parsing, and needs to grow functions to retrieve configuration from a live wiki (or some way for the user to provide it without an annoying amount of boilerplate: maybe predefined config for known wikis and a way to generate/cache it for unknown wikis?). This kind of thing has been largely out of scope for mwparserfromhell so far, but I'm willing to accept it as a feature if we can come up with a clean implementation. Of course, that's a much more involved task than what you've done here.

If we stick with this solution, there are some edge cases not covered. For example, [[Imagem:Foo]] on enwiki is a normal link (Imagem is not a namespace there), but this change will strip it out, which isn't the correct behavior. You might argue that's an unlikely case, but it makes me a bit uncomfortable. Unintentionally, I think, it will also strip out plain [[ImagemFoo]], which should be easy to fix.
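
A minimal sketch of a colon-aware check that fixes the second problem (a hypothetical helper, not the PR's actual code; English-only prefixes for illustration):

STRIP_NAMESPACES = ("file", "image", "media")

def should_strip(title):
    # Require the ":" separator so that a bare prefix match like
    # "[[ImagemFoo]]" is not stripped by accident.
    prefix, sep, _ = title.strip().partition(":")
    return bool(sep) and prefix.strip().lower() in STRIP_NAMESPACES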

It might be a good idea to make this a configurable feature of strip_code, using kwargs. Since image captions are arguably part of the article text, I would be sad if we didn't have a way to preserve them. For inspiration, see how Template.__strip__ handles this. However, this means we need to parse out things like thumbnail options and potentially alt text, which I've avoided until now.
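
To illustrate the kwargs idea (parameter names are hypothetical; strip_code does not currently accept them):

wikicode = mwparserfromhell.parse(text)

# Proposed behavior: drop file links, but keep their captions as text.
stripped = wikicode.strip_code(strip_file_links=True, keep_file_captions=True)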

We also need some tests in TestWikilink.test_strip.

More stylistic things:

  • You can replace the startswith for-loop with just _title.startswith(tuple(self._strip_links)), since str.startswith accepts a tuple of prefixes. (Or make self._strip_links a tuple in the first place; see the sketch after this list.)
  • If we do end up with a hardcoded namespace list like this, it should be moved to mwparserfromhell/definitions.py.
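
A before/after sketch of the first point (variable names taken from the PR; the branch body is elided):

# Before: explicit loop over the namespace prefixes
for prefix in self._strip_links:
    if _title.startswith(prefix):
        ...  # strip this link

# After: str.startswith accepts a tuple of prefixes directly
if _title.startswith(tuple(self._strip_links)):
    ...  # strip this link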

@lahwaacz (Contributor) commented
If we end up using a context object based on the API results, hardcoding a list of namespace names per language is not necessary, because the English prefixes always work and the translations can be deduced from these "canonical" names. See the relevant API query.
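
For reference, a sketch of that query and of building the canonical-to-local mapping from its result (requests-based; error handling omitted):

import requests

def fetch_namespace_names(api_url):
    # Maps each canonical English namespace name to the set of
    # local names and aliases reported by the wiki.
    params = {"action": "query", "meta": "siteinfo",
              "siprop": "namespaces|namespacealiases", "format": "json"}
    data = requests.get(api_url, params=params).json()["query"]
    names = {}
    for ns in data["namespaces"].values():
        if "canonical" in ns:
            names.setdefault(ns["canonical"], set()).add(ns["*"])
    for alias in data["namespacealiases"]:
        canonical = data["namespaces"][str(alias["id"])].get("canonical")
        if canonical:
            names[canonical].add(alias["*"])
    return names

# fetch_namespace_names("https://uk.wikipedia.org/w/api.php")["File"]
# -> {"Файл", "Зображення", ...}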

I'm totally against doing any API queries directly from mwparserfromhell. Besides the unnecessary code bloat, that would also make it difficult to integrate mwparserfromhell into projects which want to handle all queries themselves. I think a nice solution is the introduction of intermediate context classes through which users would supply the necessary information. Common wikis like enwiki could have an instance of the context provided by mwparserfromhell.
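
One possible shape for such a context class (purely illustrative; none of these names exist in mwparserfromhell):

from dataclasses import dataclass

@dataclass(frozen=True)
class WikiContext:
    # Lower-cased namespace names that mark file/image/media links.
    file_namespaces: frozenset = frozenset({"file", "image", "media"})

    def is_file_link(self, title):
        prefix, sep, _ = title.partition(":")
        return bool(sep) and prefix.strip().lower() in self.file_namespaces

# Instances for common wikis could ship with the package; users of other
# wikis would build their own (e.g. from the API query above).
ENWIKI = WikiContext()
UKWIKI = WikiContext(file_namespaces=ENWIKI.file_namespaces
                     | {"файл", "зображення"})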

Also, since we're talking about parsing MediaWiki titles, you might be interested in my helper Title class. That's probably the simplest case of context-sensitive parsing of MediaWiki things, but there are still some problems (e.g. lahwaacz/wiki-scripts#38).

@Mottl (Author) commented Aug 12, 2018

I agree with @lahwaacz that it would be more convenient to take localized names from the API instead of hardcoding them. We could do something like this:

# returns a dict of 'en'->'hi' translations
specials = mwparserfromhell.getLocalizedSpecials(lang='hi')
strip_specials = [specials[i] for i in ['File', 'Image', 'Media']]

wikicode = mwparserfromhell.parse(text)
s = wikicode.strip_code(strip_specials=strip_specials)

mwparserfromhell.getLocalizedSpecials() can download and cache results from the Wikipedia API. We can also ship a default cached value inside the package so it works out of the box; it would be trivial to keep the default cached value updated in the repo, @earwig.
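
A hypothetical implementation of that idea with an on-disk cache (getLocalizedSpecials does not exist in mwparserfromhell, and the cache path is invented):

import json
import pathlib
import requests

_CACHE_DIR = pathlib.Path.home() / ".cache" / "mwparserfromhell"

def getLocalizedSpecials(lang="en"):
    # Return {'File': <local name>, 'Media': <local name>, ...},
    # hitting the network only on the first call per language.
    cache_file = _CACHE_DIR / f"specials-{lang}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    resp = requests.get(f"https://{lang}.wikipedia.org/w/api.php",
                        params={"action": "query", "meta": "siteinfo",
                                "siprop": "namespaces", "format": "json"})
    # NB: 'Image' is an alias rather than a canonical name, so a complete
    # version would also query siprop=namespacealiases.
    specials = {ns["canonical"]: ns["*"]
                for ns in resp.json()["query"]["namespaces"].values()
                if "canonical" in ns}
    _CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(specials))
    return specials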

@legoktm (Contributor) commented Aug 13, 2018

I would recommend hardcoding the English versions (Image, File) in mwparserfromhell, and then having a setter method so frameworks/scripts can add localized versions from the API. I agree that mwph should not be calling the API itself; or at least, if we decide that mwph should be wiki-config-specific like Parsoid, that should probably be done separately, with a little more thought about the architecture.

mwparserfromhell.parser.add_file_localization([...])
mwparserfromhell.parser.add_ns_localization({'file': [...]})
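
A sketch of what that setter-based design could look like internally (a hypothetical module-level registry; these functions are not existing API):

_file_namespaces = {"file", "image", "media"}  # hardcoded English defaults

def add_ns_localization(mapping):
    # Merge localized names fetched by the caller's own framework;
    # mwparserfromhell itself never touches the network.
    for canonical, names in mapping.items():
        if canonical.lower() == "file":
            _file_namespaces.update(name.lower() for name in names)

# A framework or script registers what it fetched from the API:
add_ns_localization({"file": ["Файл", "Зображення"]})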

@zverok commented Sep 16, 2018

If I may...

(Disclaimer: I am currently moving a huge wikiparsing-related project from Ruby to Python. In Ruby, I developed a wikitext parser myself from scratch, so I am kinda aware of some of the problems and possible solutions :)

My library was built on top of a MediaWiki API client, so for me the problem was simpler: all the context necessary for proper parsing was fetched along with the page. But in the case of a parsing-only library like mwparserfromhell, I'd do this:

  1. Hardcode the English names. They are "more equal than others": on Ukrainian Wikipedia, [[Image: will still work (as well as the localized Ukrainian aliases, which work only there).
  2. Provide a way to inject "namespace aliases" into the parser, like, IDK,
mwparserfromhell.parse(wiki, namespaces={'File': ['Зображення']})
  3. Document which raw API call, and which, say, pywikibot call, will fetch those values (see the sketch below).
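
For item 3: the raw call is action=query&meta=siteinfo&siprop=namespaces|namespacealiases, and with pywikibot it could look roughly like this (a sketch, assuming pywikibot's Namespace attributes):

import pywikibot

site = pywikibot.Site("uk", "wikipedia")
file_ns = site.namespaces[6]  # namespace 6 is File:
local_names = {file_ns.canonical_name, file_ns.custom_name, *file_ns.aliases}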

If you'd like, I can try to prepare the PR (though I am still not sure about my Python-fu; I started the transition not long ago).

@bt2901 mentioned this pull request Oct 15, 2019