Non-textual elements score and mapframe are not filtered out #301

Open
adno opened this issue Dec 29, 2022 · 0 comments

adno commented Dec 29, 2022

Several elements with non-textual content, such as maps and musical scores (the mapframe and score elements), are not filtered out. Steps to reproduce:

  1. Download this dump: https://dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2
  2. Invoke the following command to list lines that contain the opening tags of these elements:
    wikiextractor --no-templates --html-safe '' -o - dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles1.xml-p1p41242.bz2 | grep '<\(score\|mapframe\)\b'

Output:

<score sound="1"> % Adding least one space before each line is recommended
<mapframe latitude="37.7235" longitude="23.4819" zoom="10" width="200" height="131" align="left" />Aegina is roughly triangular in shape, approximately from east to west and from north to south, with an area of .
<score sound="1">{ \time 4/4 c'4 c' g' g' | a' a' g'2 | f'4 f' e' e' | d' d' c'2 | g'4 g' f' f' | e' e' d'2 | g'4 \times 2/3 { f'8 f' f' } e'4 d' | c' r r2 | \bar "|." } \addlyrics { A B C D E F G H I J K L M N O P Q R S T U V dub- a- U X Y "Z(ed)" }</score>
<score sound="1">
<score sound="1">
<mapframe width=400 height=200 zoom=6 latitude=42.977 longitude=-76.506 align=left text="The original path of the Erie Canal">
<mapframe latitude="37.807984" longitude="-122.475411" zoom="18" width="213" height="180" align="right">
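
To summarise how often each tag leaks rather than eyeballing the grep output, the same pattern can be applied in Python. This is only a rough sketch: `count_leaked_tags` is a hypothetical helper name, not part of wikiextractor, and unlike grep it counts tag occurrences rather than matching lines.

```python
import re
import sys
from collections import Counter

# Same pattern as the grep call above: "<score" or "<mapframe" followed by a word boundary.
OPEN_TAG = re.compile(r"<(score|mapframe)\b")


def count_leaked_tags(lines):
    """Count opening <score>/<mapframe> tags in an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for match in OPEN_TAG.finditer(line):
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    # Reads the extractor output from stdin and prints one "tag<TAB>count" line per tag.
    for tag, n in count_leaked_tags(sys.stdin).most_common():
        print(f"{tag}\t{n}")
```

Assuming the script is saved as, say, `count_tags.py`, it can replace the `grep` stage at the end of the pipeline in step 2.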

In general, the output also contains the content delimited by the tags (musical scores and map data). In some cases, one of the opening/closing tags (or part of the score itself) is missing for musical scores, e.g. article id="152940" from dumps.wikimedia.org/enwiki/20221020/enwiki-20221020-pages-articles3.xml-p151574p311329.bz2 contains only the opening <score>:

Sheet music does not often explicitly indicate "Bebung". Composers assumed that, like other ornaments, performers would apply "bebung" at their discretion. Where sheet music does indicate "bebung", it appears as a series of dots above or below a note, with the number of dots indicating the number of finger movements. For example: <score>
Carl Philipp Emanuel Bach called the vibrato "Bebung", however other composers like Johann Mattheson had described the term earlier on. C.P.E Bach often used Bebung in his 

More often, we see the whole score with the closing tag, but no opening tag.
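
Until this is fixed in the extractor, a post-processing pass over the extracted text might serve as a stopgap. The sketch below is not wikiextractor code; `strip_nontextual` and the regexes are my own assumptions. It removes self-closing elements, then balanced elements together with their content, then any stray opening or closing tags. Where a tag is unbalanced (as in the cases above), only the tag itself is removed, because the extent of the score content can no longer be determined.

```python
import re

TAGS = r"(?:score|mapframe)"

# Self-closing element, e.g. <mapframe latitude="..." />
SELF_CLOSING = re.compile(rf"<{TAGS}\b[^>]*/>", re.IGNORECASE)
# Balanced element including everything between the opening and closing tag
PAIRED = re.compile(rf"<({TAGS})\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Leftover opening or closing tag without a partner
STRAY = re.compile(rf"</?{TAGS}\b[^>]*>", re.IGNORECASE)


def strip_nontextual(text):
    """Remove <score>/<mapframe> elements from extracted article text."""
    text = SELF_CLOSING.sub("", text)  # must run first so "/>" is not read as an opening tag
    text = PAIRED.sub("", text)        # drops the tags and the enclosed score/map data
    text = STRAY.sub("", text)         # drops unbalanced tags only; their content stays
    return text
```

A known limitation of this approach: if a stray opening tag precedes a balanced element, the non-greedy match starts at the stray tag and also removes the ordinary text in between.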

There are similar issues with other tags (#300) and table formatting (#298).
