This repository has been archived by the owner on Jan 3, 2024. It is now read-only.

[Norwegian] Some pages are not being scrapped properly #33

Open
C0rn3j opened this issue Jul 6, 2018 · 20 comments
Comments

@C0rn3j

C0rn3j commented Jul 6, 2018

Of all the issues I've opened here, this one is the most important to me: I've used this project to create a Kindle-compatible dictionary, and incomplete or missing entries are the bane of every dictionary project ^^


https://en.wiktionary.org/wiki/seg#Norwegian_Bokm%C3%A5l
Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": [], "audio": []}}]

https://en.wiktionary.org/wiki/ham#Norwegian_Bokm%C3%A5l
Missing completely

[{"etymology": "", "definitions": [], "pronunciations": {"text": ["IPA: /h\u0251m/"], "audio": []}}]

https://en.wiktionary.org/wiki/by#Norwegian_Bokm%C3%A5l
Missing the verb definition

[
	{
		"etymology": "From Old Norse býr (“place (to camp or settle), land, property, lot; and later settlement”).\n",
		"definitions": [
			{
				"partOfSpeech": "noun",
				"text": "by m (definite singular byen, indefinite plural byer, definite plural byene)\n\ntown, city (regardless of population size or land area)\n",
				"relatedWords": [
					{
						"relationshipType": "derived terms",
						"words": [
							"bydel",
							"byfornyelse, byfornying",
							"bygdeby",
							"bymessig",
							"bystat",
							"bystatus",
							"drabantby",
							"ferieby",
							"gamleby",
							"havneby",
							"hjemby",
							"landsby",
							"Mexico by",
							"naboby",
							"spøkelsesby",
							"storby"
						]
					}
				],
				"examples": []
			}
		],
		"pronunciations": {
			"text": [],
			"audio": []
		}
	},
	{
		"etymology": "From byde, from Old Norse bjóða, from Proto-Germanic *beudaną (“to offer”), from Proto-Indo-European *bʰewdʰ- (“to wake, rise up”).\n",
		"definitions": [],
		"pronunciations": {
			"text": [],
			"audio": []
		}
	}
]

Here's a list of errors from my project for words in Norwegian Bokmål. It's entirely possible that some of them are due to mistakes in my own scripts, but every one I checked was caused by WiktionaryParser parsing the entry incorrectly or not at all.

https://haste.rys.pw/raw/vevafamiwo

Another half-broken entry -

https://en.wiktionary.org/wiki/for#Norwegian_Bokm%C3%A5l
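
A minimal sketch of how entries like the ones above could be flagged automatically, assuming only the JSON shape shown in the pasted outputs (a list of entries, each with a `definitions` list whose items carry `partOfSpeech` and `text`). The function name `find_incomplete` is hypothetical and not part of WiktionaryParser; the `by` sample is trimmed from the output above.

```python
def find_incomplete(entries):
    """Return (entry_index, problem) pairs for entries that look incomplete."""
    problems = []
    for index, entry in enumerate(entries):
        definitions = entry.get("definitions", [])
        if not definitions:
            # The whole entry came back without any definition block.
            problems.append((index, "no definitions"))
            continue
        for definition in definitions:
            if not definition.get("text"):
                part = definition.get("partOfSpeech", "unknown")
                problems.append((index, "empty text for " + part))
    return problems


# Output for "seg" as pasted above.
seg = [{"etymology": "", "definitions": [],
        "pronunciations": {"text": [], "audio": []}}]

# Output for "by", trimmed: the noun entry is fine, the verb entry is empty.
by = [
    {"etymology": "From Old Norse býr",
     "definitions": [{"partOfSpeech": "noun",
                      "text": "by m\n\ntown, city\n",
                      "relatedWords": [], "examples": []}],
     "pronunciations": {"text": [], "audio": []}},
    {"etymology": "From byde",
     "definitions": [],
     "pronunciations": {"text": [], "audio": []}},
]

print(find_incomplete(seg))  # [(0, 'no definitions')]
print(find_incomplete(by))   # [(1, 'no definitions')]
```

A check like this only catches entries that are empty; it cannot tell whether a non-empty entry is missing a part of speech that exists on the Wiktionary page.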

@C0rn3j changed the title from "Some pages are not being scrapped properly" to "[Norwegian] Some pages are not being scrapped properly" on Jul 12, 2018
@suyashb95
Owner

Seems fixed in 39ba274

@C0rn3j
Author

C0rn3j commented Jul 15, 2018

Seems fixed indeed. Thank you a LOT.

Is there anywhere I can send you a few bucks? PayPal?

@suyashb95
Owner

Appreciate it, but it's a hobby project, so that's not necessary :D

@C0rn3j
Author

C0rn3j commented Jul 17, 2018

And your hobby project is incredibly helpful to me, so if you change your mind and I ever see a donation page/button on the main page, I'll use it ^^


Actually, I found one more: løsrive is missing the inflection part - https://en.wiktionary.org/wiki/l%C3%B8srive#Norwegian_Bokm%C3%A5l

[
  {
    "etymology": "From løs +‎ rive",
    "definitions": [
      {
        "partOfSpeech": "verb",
        "text": "(often reflexive, with seg / oneself)\nto break away\nto detach (oneself)\nto tear oneself away (fra / from)\nto secede (fra / from)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

EDIT: And one more in Bokmål - øl - it strips the first inflection line

https://en.wiktionary.org/wiki/%C3%B8l#Norwegian_Bokm%C3%A5l

[
  {
    "etymology": "From Old Norse ǫl, from Proto-Germanic *alu, from Proto-Indo-European *h₂elut- (“beer”).\n",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "øl m (definite singular ølen, indefinite plural øl, definite plural ølene) (a glass, bottle or can of beer)\n\nbeer (alcoholic drink)\na beer (in a glass, bottle or can)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [
        "IPA: /œl/",
        "Rhymes: -œl"
      ],
      "audio": []
    }
  }
]

@C0rn3j reopened this Jul 17, 2018
@suyashb95
Owner

Inflections seem to be turning up properly now, although they're a part of the definition text itself

@C0rn3j
Author

C0rn3j commented Aug 4, 2018

Amazing, looking forward to a new release ^^

@C0rn3j closed this as completed Aug 4, 2018
@C0rn3j
Author

C0rn3j commented Aug 4, 2018

That seems to have broken more than it fixed.

konkurs in Norwegian Bokmål in 0.0.8:

[
  {
    "etymology": "From Latin concursus",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": "konkurs (indeclinable)\n\nbankrupt\n",
        "relatedWords": [],
        "examples": [
          "gå konkurs - go bankrupt"
        ]
      },
      {
        "partOfSpeech": "noun",
        "text": "konkurs (indeclinable)\n\nbankrupt\nkonkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

and after in 0.0.91:

[
  {
    "etymology": "From Latin concursus",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": "konkurs (indeclinable)\n\nbankrupt\n",
        "relatedWords": [],
        "examples": [
          "gå konkurs - go bankrupt"
        ]
      },
      {
        "partOfSpeech": "noun",
        "text": "konkurs m (definite singular konkursen, indefinite plural konkurser, definite plural konkursene)\n\na bankruptcy\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

heis in 0.0.91 has a duplicated entry:

[
  {
    "etymology": "From the verb heise",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  },
  {
    "etymology": "From the verb heise",
    "definitions": [
      {
        "partOfSpeech": "verb",
        "text": "heis m (definite singular heisen, indefinite plural heiser, definite plural heisene)\n\nelevator (US), lift (UK)\nheis\nimperative of heise\n",
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

Here are more entries that broke, for testing (the first word of every line is the entry it belongs to; this is a diff):

image

@C0rn3j reopened this Aug 4, 2018
@suyashb95
Owner

Whoops, added a fix in another release

@C0rn3j
Author

C0rn3j commented Aug 5, 2018

Okay, that looks much better, just a few things.

My scripts operate on the assumption that the inflections come before the first line break. I'm unsure whether that was true for every word in 0.0.8, but it certainly was for 99.9%+ of them.

In 0.0.92 this is no longer the case for bor and a handful of other entries, like faksimile. They otherwise seem to be scraped correctly, but there are now line breaks between the two inflection lines. Is this by design, and should I write a different kind of detection? It wasn't like this before; I think the two lines were separated by just a space, as in the other words.

image

image
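
A minimal sketch of the assumption described above, namely that everything before the first line break of a string-valued `text` field is the inflection line. The helper name `split_inflection` is hypothetical; the sample strings are copied from parser output quoted in this thread.

```python
def split_inflection(text):
    """Split a definition string into (inflection_line, definition_lines).

    Assumes the inflection line is everything before the first line break.
    """
    head, _, rest = text.partition("\n")
    definitions = [line for line in rest.split("\n") if line.strip()]
    return head, definitions


heis = ("heis m (definite singular heisen, indefinite plural heiser, "
        "definite plural heisene)\n\nelevator (US), lift (UK)\n")
inflection, definitions = split_inflection(heis)
print(inflection)
print(definitions)  # ['elevator (US), lift (UK)']

# When the inflection line is missing (as reported for pantergaupe),
# the first definition is mistaken for the inflection line.
pantergaupe = "Iberian lynx; Lynx pardinus\n"
print(split_inflection(pantergaupe))  # ('Iberian lynx; Lynx pardinus', [])
```

Under this assumption, an inflection that spans two lines would leave its second line in the definitions list, which is the symptom described for bor and faksimile.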

Other than that it seems to have broken a single word - pantergaupe, which is now missing the inflection part.

[
  {
    "etymology": "panter +‎ gaupe",
    "definitions": [
      {
        "partOfSpeech": "noun",
        "text": "Iberian lynx; Lynx pardinus\n",
        "relatedWords": [
          {
            "relationshipType": "synonyms",
            "words": [
              "iberisk gaupe",
              "spansk gaupe"
            ]
          }
        ],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [
        "IPA: /pan.ter.ɡæʉ.pe/, [ˈpɑn.təɾ.ˌɡæʉ̯ː.pə]"
      ],
      "audio": []
    }
  }
]

@suyashb95
Owner

Some of the inflections span multiple lines, so they'll be parsed that way. I'm going to fix inflection parsing for other words like pantergaupe in the dev branch for now. I'm experimenting with returning definitions as a list of sentences instead of one long string; let's see if that works.

@C0rn3j
Author

C0rn3j commented Aug 5, 2018

Ohhhh you're totally right! Never noticed nor realized this would be the problem.

image

I skimmed my definition list and apparently this was already an issue I was not handling. Your fix just made it more visible.

@C0rn3j
Author

C0rn3j commented Aug 13, 2018

Not sure if it's the same problem as pantergaupe, but maldivisk is missing the inflection line in the second definition (0.0.92).

https://en.wiktionary.org/wiki/maldivisk#Norwegian_Bokm%C3%A5l

image

BTW: I rewrote the detection part of my script, it seems to be working great, thanks for the fixes!

@suyashb95
Owner

Added some changes in 2ba2eea to fix this. Also, the definition text is now a list, so you may have to change your script.
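
A minimal sketch of how a consuming script might cope with that change, assuming `text` can be either the older newline-separated string or the newer list of strings. The helper name `definition_lines` is hypothetical and not part of WiktionaryParser; the sample values are taken from parser output quoted in this thread.

```python
def definition_lines(text):
    """Return definition text as a list of non-empty lines.

    Accepts either the older single-string form or the newer list form.
    """
    if isinstance(text, str):
        lines = text.split("\n")
    else:
        lines = list(text)
    return [line.strip() for line in lines if line.strip()]


# Older, string-valued form.
old_style = "konkurs (indeclinable)\n\nbankrupt\n"
print(definition_lines(old_style))  # ['konkurs (indeclinable)', 'bankrupt']

# Newer, list-valued form.
new_style = [
    "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
    "alternative form of forreven",
]
print(definition_lines(new_style) == new_style)  # True
```

Normalizing to one shape at the boundary keeps the rest of a script independent of which release produced the data.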

@C0rn3j
Author

C0rn3j commented Sep 8, 2018

Finally kicked myself to work on my script again, changes look awesome, thanks!

@C0rn3j closed this as completed Sep 8, 2018
@C0rn3j
Author

C0rn3j commented Sep 8, 2018

Okay I only looked at my inflections output, premature celebration.

At some point your changes seem to have added garbage, in the form of the word itself, to some entries.

For example, forrevet (https://en.wiktionary.org/wiki/forrevet) has a definition 'forrevet', which really shouldn't be there.

[
  {
    "etymology": "",
    "definitions": [
      {
        "partOfSpeech": "adjective",
        "text": [
          "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
          "alternative form of forreven",
          "forrevet",
          "neuter singular of forreven"
        ],
        "relatedWords": [],
        "examples": []
      }
    ],
    "pronunciations": {
      "text": [],
      "audio": []
    }
  }
]

https://en.wiktionary.org/wiki/foreskrevet has the exact same issue, and I'm sure there are a bunch of others.

@C0rn3j reopened this Sep 8, 2018
@suyashb95
Owner

I haven't encountered multiple subheadings under a definition yet. The subheadings usually contain inflections so the parser adds that to the list of definitions. I guess it should either not include them or separate them out from the definition list, probably in a field called word/inflections in the JSON

@C0rn3j
Author

C0rn3j commented Sep 9, 2018

Yeah, it should either separate them out or not include them, as I can't simply filter out cases where word X contains definition X, because some words really are that way (best in Bokmål means best).

If you need more examples where this happens: støvete, uomskåret.
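
A small sketch of why the simple filter is unsafe on the consumer side. `drop_headword_lines` is a hypothetical helper; the forrevet list is the parser output quoted earlier, while the list for best is hand-written for illustration only and is not actual parser output.

```python
def drop_headword_lines(word, lines):
    """Naive filter: drop every line that is exactly the headword."""
    return [line for line in lines if line != word]


forrevet = [
    "forrevet (indefinite singular forrevet, definite singular and plural forrevne)",
    "alternative form of forreven",
    "forrevet",
    "neuter singular of forreven",
]
# Works here: the stray headword line is removed.
print(drop_headword_lines("forrevet", forrevet))

# Hypothetical, hand-written entry: the genuine definition equals the word.
best = ["<inflection line>", "best"]
# Fails here: the real definition is removed along with the "garbage".
print(drop_headword_lines("best", best))  # ['<inflection line>']
```

Because the two cases look identical to the consumer, the distinction has to be made in the parser, for example by emitting headword and inflection lines in a separate field.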

@C0rn3j
Author

C0rn3j commented Sep 20, 2018

It looks like one of the updates also broke nested definitions

https://en.wiktionary.org/wiki/v%C3%A6re_glad_i

image

They weren't exactly scraped perfectly in the first place, it seems, but now they're not scraped at all.

image

@suyashb95
Owner

Nested definitions and examples have ambiguous formatting so figuring that out is going to take some time

@C0rn3j
Author

C0rn3j commented Sep 23, 2018

I've had luck with Wiktionary contributors being willing to redo old formatting and use a newer template for some snowflake definitions I ran into.

Not sure whether that applies to these nested definitions. I could ask about them, but that'd require me to go through the diff and pick them out, and right now the diff has a lot of the "garbage" I mentioned above, so it'd be a pain to go through in this state.

image
