strip_code() needs special case for [[File:...]] links and interwiki links #136

halfak · 2015-12-30T20:15:17Z

It looks like strip_code() treats File links as if they were regular wikilinks. Instead, it should remove everything except the caption/alt text. In the example below, the size parameter leaks into the output text.

$ python
Python 3.4.3 (default, Jul 28 2015, 18:20:59) 
[GCC 4.8.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import mwparserfromhell as mwp
>>> wikicode = mwp.parse("Foo bar [[File:EscherichiaColi NIAID.jpg|135px]] baz")
>>> wikicode.strip_code()
'Foo bar 135px baz'
>>> wikicode = mwp.parse("Foo bar [[Image:EscherichiaColi NIAID.jpg|135px]] baz")
>>> wikicode.strip_code()
'Foo bar 135px baz'

earwig · 2015-12-30T20:17:59Z

The problem here is that we are depending on wiki-specific namespace names, which are very unpredictable, so a way to configure that will be required.

halfak · 2015-12-30T21:09:56Z

That's a good point. It seems very awkward to configure this per wiki, since I assume the "File" namespace will be internationalized. Is there any way that wiki-install-specific params like this can be applied in the parser?

lahwaacz · 2015-12-30T23:23:24Z

What's the use case for such context-sensitive code stripping? There are many more similar problems: should category and interlanguage links be stripped too? And I don't even have to start with templates...
IMHO the most general way to handle files and images in a detached code like mwparserfromhell (or its clients) would be to treat them like templates and not links, which should have been done in the language itself from the beginning to not create a barrier like this.

halfak · 2015-12-30T23:28:40Z

Oh! We're working on extracting prose (sentences and paragraphs) from English Wikipedia and this was one of the issues that came up with using strip_code() to get rid of content that doesn't show up when rendered. We're not too concerned about handling what is rendered via templates. Generally, I agree -- it's a bummer that we use link syntax for images, but, we can't go back and fix that now. :/

jfolz · 2015-12-31T00:58:37Z

I did this once to identify what kind of thing a link points to. The file for all Wikipedias is 260k uncompressed. If you only need this for English Wikipedia it's pretty straightforward. You can ask it to give you all its namespaces:
https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces

However, there's some 200+ Wikipedias in total:
https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=languages

The IDs are consistent across languages and most namespaces should exist everywhere. This is OK if you target the major Wikipedias that don't go anywhere soon. However, many of those only have a few pages and are sometimes removed. New ones get added from time to time. I personally wouldn't want to have to support this as it can lead to all sorts of annoyingly complex debugging with different versions floating around. It's a bit like those poor guys handling timezones in pytz.

Now, here's the catch: You can also link to a Wikipedia thing in a different language. This includes images:
https://en.wikipedia.org/wiki/fr:Fichier:NCI_swiss_cheese.jpg
Not sure if that is a thing that is done often though, files should slowly move to commons. Which is, btw, also something you need to watch out for.

Oh, the joy of working with Wikipedia, where every dirty hack ever imagined is actually the preferred way of doing things ;)

Btw, templates can also generate text:
https://en.wikipedia.org/wiki/Template:IPA

halfak · 2015-12-31T01:50:57Z

+1 for not merging siteinfo data into mwparserfromhell. Maybe it would be more appropriate if I could provide the parser with such assets when I ask it to parse a chunk of wikitext.

Though, honestly, I'm starting to believe that this isn't something mwparserfromhell should handle. For my specific use-case, I can walk the syntax tree and check wikilink nodes as I go to make the decisions that work for my application.

earwig · 2015-12-31T09:14:20Z

There's probably something you can do involving API functions like the one that expands templates...

In EarwigBot's copyvio detector, I had to support a similar use case. It strips out images, references, etc. and leaves alone template parameter values that are longer than a certain length (which we interpret to be article content rather than technical stuff). It's rather hacky overall, but it works in the majority of cases.

That code is here: https://github.com/earwig/earwigbot/blob/develop/earwigbot/wiki/copyvios/parsers.py#L140 , maybe it can give you some ideas.

I am not sure if this is something we should support in the parser directly, though. strip_code() wasn't intended to be a well-developed feature that covered all the edge cases.

jfolz · 2015-12-31T15:35:17Z

strip_code could accept an optional kwarg for a custom mapping function and/or dict of mapping functions for different node types. Probably not worth it though just to remove a loop from client code, but there could be other uses.

earwig · 2016-01-01T20:02:09Z

That's sensible.

jayvdb · 2016-01-05T03:57:24Z

IMO strip_code should strip all attributes of media:, file: and category: links that are not prefixed with :. Even the file: title is more code than text, as it has lost its context without the media displayed.

I vote for the caller providing a 'namespaces map' which can be used to indicate what namespace names map to which namespace numbers.

lahwaacz · 2016-01-05T07:43:54Z

Why namespace numbers? There are no namespace numbers in the language, only in the database, which is not touched at all by mwparserfromhell. Even though the common namespace numbers are constant across all wikis, there is no need to bring numbers to the parser. What would be needed and sufficient is a map of localized and aliased namespace names to the corresponding canonical (English) names. Also attributes like first-letter sensitiveness and other quirks I may be unaware of would have to be described.
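
A minimal sketch of such a map, under stated assumptions: `NAMESPACE_NAMES` and `canonical_namespace` are hypothetical names, the entries are only the spellings that appear in this thread, and namespace prefixes are assumed to be case-insensitive with underscores equivalent to spaces.

```python
# Hypothetical per-wiki table: every accepted spelling (normalized) maps to
# the canonical English namespace name.
NAMESPACE_NAMES = {
    "file": "File",
    "image": "File",
    "fichier": "File",        # French
    "画像": "File",            # Japanese
    "category": "Category",
    "categoria": "Category",  # Italian
}


def canonical_namespace(title, names=NAMESPACE_NAMES):
    """Return (canonical namespace or None, embedded?) for a link target."""
    title = title.strip()
    # A leading colon makes the link an ordinary link instead of an embed.
    embedded = not title.startswith(":")
    prefix, sep, _rest = title.lstrip(":").partition(":")
    if not sep:
        return None, False
    # Normalize: underscores to spaces, collapse whitespace, ignore case.
    key = " ".join(prefix.replace("_", " ").split()).casefold()
    canonical = names.get(key)
    return canonical, embedded and canonical is not None


print(canonical_namespace("Categoria:Foo"))
print(canonical_namespace(":File:X.jpg"))
```

No namespace numbers are involved; the caller only supplies names.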

ricordisamoa · 2016-01-05T07:58:56Z

mwparserfromhell should IMHO be kept minimal. Anyone doing serious wikitext expansion should use custom routines, perhaps with some knowledge of the wiki.

ghost · 2016-01-05T08:13:53Z

I agree with what @ricordisamoa just said.

jayvdb · 2016-01-05T08:57:04Z

> Why namespace numbers?

Somehow the names of Media:, File: and Category: namespaces need to be provided, if they are to be treated differently. They have hard-coded well-known identifiers: namespace numbers. No need to invent a new method of identifying them.

moroboshi · 2016-01-05T18:11:55Z

On 05/01/2016 09:57, John Vandenberg wrote:

>> Why namespace numbers?
> Somehow the names of Media:, File: and Category: namespaces need to be provided, if they are to be treated differently. They have hard-coded well-known identifiers: namespace numbers. No need to invent a new method of identifying them.

A Wikipedia can also define a localized name for a namespace. For example, on it.wiki you can use "Category" (hard-coded) or "Categoria" (localized in Italian).

I am the darkness of the deepest black
Deeper than the blackest night
I am the sea of chaos, the source of all chaos

eranroz · 2016-07-30T09:24:30Z

I just had a similar case where I want to strip_code and remove all the images, so this is my solution:

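import re
import mwparserfromhell

# ALTERNATIVE_ALIASES stands in for the wiki's localized File namespace names.
img = re.compile('(File|Image|ALTERNATIVE_ALIASES):', re.I)
prs = mwparserfromhell.parser.Parser().parse(wikicode)  # wikicode: the page's wikitext
# Build the list first so the tree is not modified while it is being filtered.
remove_img = [f for f in prs.ifilter_wikilinks() if img.match(str(f.title))]
for f in remove_img:
    prs.remove(f)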

And to get ALTERNATIVE_ALIASES you can either:

Request it via the API: https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases
Get it from the PHP code: mediawiki/core/languages/messages/MessagesXX.php (namespaceAliases)

[IMO: both approaches are out of scope for mwparserfromhell]
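
A sketch of turning such an API response into the alias list. The `SAMPLE` payload is hand-written in the shape I'd expect from `format=json` (names under the `*` key), not a captured reply, and `names_for_namespace` is a hypothetical helper; 6 is the File namespace ID.

```python
import json
import re

# Hand-written sample shaped like a siteinfo response; illustrative only.
SAMPLE = json.loads("""
{"query": {
  "namespaces": {
    "0": {"id": 0, "*": ""},
    "6": {"id": 6, "canonical": "File", "*": "File"},
    "14": {"id": 14, "canonical": "Category", "*": "Category"}
  },
  "namespacealiases": [
    {"id": 6, "*": "Image"}
  ]
}}
""")


def names_for_namespace(siteinfo, ns_id):
    """Collect the local name, canonical name and aliases of one namespace."""
    query = siteinfo["query"]
    names = set()
    info = query["namespaces"].get(str(ns_id), {})
    for key in ("*", "canonical"):
        if info.get(key):
            names.add(info[key])
    for alias in query.get("namespacealiases", []):
        if alias["id"] == ns_id:
            names.add(alias["*"])
    return names


file_names = names_for_namespace(SAMPLE, 6)
# Escape each name, since localized names may contain regex metacharacters.
img = re.compile("(%s):" % "|".join(re.escape(n) for n in sorted(file_names)), re.I)
```

The resulting `img` pattern can replace the hard-coded one in the snippet above.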

Anyway, maybe the parser should provide a utility function along the lines of "remove links whose target matches x" (where x is a regex or a function). That seems useful and general enough.

earwig · 2016-07-30T17:06:06Z

> mwparserfromhell.parser.Parser().parse(wikicode)

Just use mwparserfromhell.parse(wikicode)...

harej · 2018-02-27T17:41:53Z

I've noticed problems with strip_code() and filtering HTML tags as well.

Based on https://de.wikipedia.org/wiki/User:FNBot?action=raw:

<div style="margin-left: -1em;"><div style="margin-left: -10%;">
<div style="width: 80%; margin: auto;margin-bottom:5px;"><div style="float:left;padding-right: 4px;">{{Benutzer:ZOiDberg/Vorlage:user py}}</div><div style="float:left;padding-right: 4px;">{{Benutzer:Sitic/Babel/tools}}</div><div style="float:left;">{{Benutzer:FNDE/Vorlage/Bot}}</div><br style="clear:both;"></div>
{{Bot|FNDE
|Kontakt = Benutzer_Diskussion:FNDE
|Modus   = automatisch
}}
<table style="width: 80%;margin:auto;background:#f8f9fa;border: 2px solid #eaecf0;padding:10px;margin-top:7px;">
<tr><td>FNBot kann im Notfall <b>sofort</b> angehalten werden.[{{fullurl:Benutzer:FNBot/Stop|action=edit&preload=Benutzer:FNBot/Stop/Preload&unwatch=1&section=new&nosummary=1&editintro=Benutzer:FNBot/Stop/Editnotice&summary=Bot%20angehalten}} <span style="padding:4px 7px 3px 7px;color:#fff;background: #de2525;border-radius: 4px;font-weight:bold;font-size:14pt;position: relative;top:2px;left:5px;">× NOT AUS</span>]</div>
</div></td></tr>
</table>

<table style="width: 80%;margin:auto;background:#fbfdd6;border: 2px solid #eaecf0;padding:10px;margin-top:7px;">
<tr><td>{{Benutzer:FNBot/Aufgaben}}</td></tr>
</table>

<table style="width: 80%;margin:auto;background:#fbfdd6;border: 2px solid #eaecf0;padding:10px;margin-top:7px;">
<tr><td>{{Benutzer:FNBot/Einzelaufträge}}</td></tr>
</table>

I do:

>>> r = requests.get('https://de.wikipedia.org/wiki/User:FNBot?action=raw').text
>>> mwparserfromhell.parse(r).strip_code()
'<table style="width: 80%;margin:auto;background:#f8f9fa;border: 2px solid #eaecf0;padding:10px;margin-top:7px;">\n<tr><td>FNBot kann im Notfall sofort angehalten werden.[ × NOT AUS]</td></tr>\n</table>'

As far as I know, that HTML spam should not be in there. Is that problem related to the problem documented in this Issue? Is this a different problem? Is this the intended behavior?

ghost · 2018-02-27T21:33:15Z

What you see is the intended behaviour. The HTML tags are imbalanced in your code sample, so mwparserfromhell treats the offending segment as just text.

For comparison, a real library like lxml would simply fix the imbalanced tags for you. mwparserfromhell doesn't, because we require that parsing never transforms the text, i.e. you can take the output of mwparserfromhell.parse(...), copy-paste it back into the edit window, and MediaWiki will not record an extra revision.

ctrlcctrlv · 2023-01-30T03:27:49Z

I've implemented this. See #301.

ctrlcctrlv · 2023-01-30T03:47:26Z

My solution leads to the following on the current wikitext for [[ja:w:西村博之]]:

$ python3 parse.py 西村博之.mediawiki|grep Wikilink|grep 画像:|column -t -s $'\t' -T4

Wikilink  画像:Hiroyuki Nishimura%27s_speech in Sapporo_20050831.jpg  ['220px', 'thumb']  [[札幌市]] 2005年
Wikilink  画像:Jim Watkins 1 (cropped).png                            ['225px', 'thumb']  [[2ちゃんねる]]と[[8chan]]を運営している[[ジム・ワトキンス]]
Wikilink  画像:Ronald Watkins (cropped).jpg                           ['225px', 'thumb']  [[2010年代]]に5ちゃんねると8kunの管理人を務めた[[ロン・ワトキンス]]
Wikilink  画像:2chan main page.png                                    ['250px', 'thumb']  [[4chan]]の親サイト「[[ふたば☆ちゃんねる]]」([https://www.2chan.net/ 2chan.net]) は、[[2ちゃんねる]]の分派として2001年に開設された。
Wikilink  画像:4chan's main page.png                                  ['250px', 'thumb']  "[[4chan]]は、2003年に開設された英語圏最大の匿名掲示板。2015年から[[CHANカルチャー]]創始者の「ひろゆき」によって管理・運営されている<ref name=""清義明 20210329""/><ref name=""藤原4""/><ref name=""GNET""/>。"
Wikilink  画像:Christopher Poole at XOXO Festival September 2012.jpg  ['250px', 'thumb']  [[4chan]]の創設者で初代管理人の[[クリストファー・プール|moot]]。mootは英語圏のインターネットフォーラム「[[:en:Something Awful|Something Awful]]」と日本語圏の[[画像掲示板]]「[[ふたば☆ちゃんねる]]」のアクティブユーザーであった。
Wikilink  画像:8chan logo.svg                                         ['thumb', '250px']  "[[4chan]]から派生した[[8chan]]（現：8kun）は「世界で最も卑劣なウェブサイト」とも評される[[匿名掲示板]]<ref name=""清義明 20210329""/>。また管理人の[[ロン・ワトキンス]]は、[[Qアノン]]陰謀論における「[[Qアノン#「Q」の正体|Q]]」の正体として一般的に信じ
Wikilink  画像:2ch_2007Logo.png                                       ['250px', 'thumb']  [[仮差押え]]する申し立てが[[東京地方裁判所|東京地裁]]に出された際、[[差押|差押え]]対象が2ch.netの[[ドメイン名|ドメイン]]にまで及んでいたことに対する痛烈な皮肉であった<ref>{{Cite news |title=「2ちゃんねる」閉鎖 これはギャグなのか |date=2007-01-15 |url

$ python3 parse.py 西村博之.mediawiki|grep Wikilink|head -n 20

Wikilink	1976年		
Wikilink	昭和		
Wikilink	11月16日		
Wikilink	日本		
Wikilink	実業家		
Wikilink	論客		
Wikilink	英語圏		
Wikilink	匿名		
Wikilink	4chan		
Wikilink	管理者		管理人
Wikilink	匿名掲示板		
Wikilink	2ちゃんねる (2ch.sc)		2ch.sc
Wikilink	未来検索ブラジル		
Wikilink	愛称		
Wikilink	通称		
Wikilink	匿名掲示板		
Wikilink	2ちゃんねる		
Wikilink	ドワンゴ		
Wikilink	動画共有サービス		動画配信サービス
Wikilink	ニコニコ動画

parse.py containing:

import argparse
import csv
import sys

import mwparserfromhell
from mwparserfromhell.nodes import Wikilink

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Parse wikitext for a Wikipedia page.')
    parser.add_argument('input', help='Input file')
    args = parser.parse_args()
    csvw = csv.writer(sys.stdout, delimiter="\t")

    with open(args.input, encoding="utf-8") as f:
        inp = f.read()

    # One TSV row per top-level node: the node type, then either the
    # wikilink's parts or the node's raw wikitext.
    for tag in mwparserfromhell.parse(inp).nodes:
        if isinstance(tag, Wikilink):
            # tag.args is the attribute added by #301; it is not available
            # without that change.
            row = [tag.title, tag.args, tag.text]
        else:
            row = [str(tag)]
        csvw.writerow([type(tag).__name__] + row)

earwig mentioned this issue Oct 28, 2016

strip_code() doesn't strip image formatting code #169

Closed

lahwaacz mentioned this issue Jun 4, 2017

rewrite and extend Caveats #180

Merged

earwig mentioned this issue Feb 26, 2018

strip_code() does not remove interwiki/in other language links #191

Closed

earwig changed the title ~~strip_code() needs special case for [[File:...]] links~~ strip_code() needs special case for [[File:...]] links and interwiki links Feb 26, 2018

Mottl mentioned this issue Aug 11, 2018

Remove File/Image links in strip_code() #194

Open

CXuesong mentioned this issue Apr 24, 2020

External link not parsed CXuesong/MwParserFromScratch#13

Closed

ctrlcctrlv linked a pull request Jan 30, 2023 that will close this issue

Add support for Wikilink.args for File: links #301

Open
