Several issues with pronunciation #7

Open
IvanUkhov opened this issue Jan 6, 2016 · 11 comments
@IvanUkhov
Contributor

Hello,

The following screenshot demonstrates a number of issues with pronunciation:

  • the sj sound gets replaced by the dollar sign,
  • the underscore gets replaced by the plus sign, and
  • the superscript in the second alternative is not properly formatted.

[Screenshot: skjorta]

Regarding the first problem, try to look up words with the tj sound like tjugo; the sound will be erroneously represented by the letter c.

Regards,
Ivan

hashier added the bug label Jan 9, 2016
@hashier
Owner

hashier commented Jan 9, 2016

What do you mean by "the superscript in the second alternative is not properly formatted"?

@IvanUkhov
Contributor Author

@hashier, there are two possible pronunciations, but accent 2 (grave) is denoted correctly only in the first one. So, it should be like this (just as on Folkets lexikon):

[²sj'o:r_ta el. ²sj'or_t:a]

Note the second “2”. While we’re on it, why have you decided to denote stress by capitalizing letters instead of using the traditional notation? Thanks!

@IvanUkhov
Contributor Author

@hashier, sorry for picking on details. I just think that pronunciation is the most important part of the language, and it’s also the one that is the most difficult to master. It’s of great help to be able to clearly see how to pronounce words. I wish Dictionary had sound.

@hashier
Owner

hashier commented Jan 9, 2016

Ah, I see what you mean with the 2.

I didn't pick anything; I just used what was in the dataset that I got from Folkets lexikon. Since I always find it hard to read anyway, I never realised that it is completely wrong (:

I checked what the "data" is for the pronunciation of skjorta

<word class="nn" lang="sv" value="skjorta">
   <translation value="shirt" />
   <phonetic soundFile="skjorta.swf" value="²$O:r+ta el. 2$Or+t:a" />
[...]

Seems like it's already broken in that file, so I guess there is nothing we can do to fix it :(
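
For anyone who wants to inspect the raw value themselves, here is a minimal sketch of pulling it out of an entry like the one above. It assumes the XML is available as a string; `extractPhonetic` is a hypothetical helper, the entry is trimmed to the lines quoted in this comment, and a regular expression stands in for a real XML parser.

```javascript
// Hypothetical helper: returns the value attribute of the first <phonetic>
// element in an XML string, or null if there is none.
// Assumption: attribute values are double-quoted and contain no '>' or '"'.
function extractPhonetic(xml) {
  const match = /<phonetic\b[^>]*\bvalue="([^"]*)"/.exec(xml);
  return match ? match[1] : null;
}

// The entry quoted above, trimmed to the quoted lines and closed by hand.
// \u00B2 is the superscript two that already appears in the data.
const entry = [
  '<word class="nn" lang="sv" value="skjorta">',
  '   <translation value="shirt" />',
  '   <phonetic soundFile="skjorta.swf" value="\u00B2$O:r+ta el. 2$Or+t:a" />',
  '</word>',
].join("\n");

const rawPhonetic = extractPhonetic(entry);
console.log(rawPhonetic);
```

The string it prints is the raw, unprocessed form: the dollar sign, the plus sign, the capital letters, and the plain "2" in the second alternative are all still there.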

@IvanUkhov
Copy link
Contributor Author

Hmm, the interesting thing is that their web interface pulls data from the same database, and this “broken” representation is exactly what it gets to work with. For instance, here is the server’s response for “skjorta”:

//OK[6,0,0,1,5,4,2,3,0,0,1,2,2,0,0,1,["se.algoritmica.folkets.client.LookUpResult/1089098233","[I/2970817851","[Ljava.lang.String;/2600011424","<word class=\"nn\" date=\"2011-03-03\" id=\"158400\" lang=\"sv\" lexinid=\"15841\" origin=\"lexin\" value=\"skjorta\"><translation date=\"2011-03-03\" id=\"15559\" value=\"shirt\"></translation><phonetic date=\"2011-03-03\" soundFile=\"skjorta.swf\" value=\"²$O:r+ta el. 2$Or+t:a\"></phonetic><paradigm date=\"2011-03-03\" id=\"13806\" origin=\"lexin\"><inflection value=\"skjortan\"></inflection><inflection value=\"skjortor\"></inflection></paradigm><see date=\"2011-03-03\" origin=\"saldo\" type=\"saldo\" value=\"skjorta||skjorta..1||skjorta..nn.1\"></see><compound date=\"2011-03-03\" id=\"5537\" value=\"bomullsskjorta\"><translation value=\"cotton shirt\"></translation></compound><compound date=\"2011-03-03\" id=\"5538\" inflection=\"skjort|kragen\" value=\"skjort|krage\"><translation value=\"shirt collar\"></translation></compound><idiom date=\"2011-03-03\" id=\"1358\" value=\"kosta skjortan (&amp;quot;kosta väldigt mycket&amp;quot;)\"><translation value=\"cost a packet (&amp;quot;cost very much&amp;quot;)\"></translation></idiom><definition date=\"2011-03-03\" id=\"15341\" value=\"ett tunnare klädesplagg med krage, ärmar och knäppning fram\"></definition><url date=\"2011-03-03\" origin=\"lexin\" type=\"any\" value=\"8/herr.swf\"></url></word>","<word value=\"shirt\" lang=\"en\" class=\"nn\" id=\"379721\" origin=\"lexin\" date=\"2009-02-24\"><translation id=\"379721-1\" value=\"skjorta\" origin=\"lexin\" date=\"2009-02-24\"></translation><example id=\"379721-2\" value=\"Tom put on a clean white shirt and a tie.\" origin=\"lexin\" date=\"2009-02-24\"><translation value=\"Tom satte på sig en ren vit skjorta och en slips.\" origin=\"lexin\" date=\"2009-02-24\"></translation></example><explanation value=\"A piece of clothing with collar, sleeves and buttons down the front.\" origin=\"lexin\" 
date=\"2009-02-24\"></explanation></word>","skjorta"],0,7]

If you scroll to the right, you’ll see exactly what you wrote above. So, I guess, there’s some post-processing on the client side that makes it look pretty.

@hashier
Owner

hashier commented Jan 9, 2016

How did you make the request?

I don't think they do post-processing. I assume their DB -> XML export is "broken" and their homepage is not using DB -> XML but something else. Of course, if they use the same interface that you used, then I have no idea how they fix it.

@IvanUkhov
Contributor Author

If you have Chrome,

  1. open the skjorta page,
  2. open the Developer tools,
  3. go to the Network tab,
  4. reload the page,
  5. select “lookupword” in the list on the left-hand side, and
  6. go to the Response tab.

I’m not claiming that that’s how they do it; I really have no idea. Maybe there’s some other mechanism, which is intentionally hidden.

@hashier
Owner

hashier commented Jan 9, 2016

nah, that's just us talking to their web server, that's not how they talk internally to their DB.

But it's interesting that it shows the correct stuff on the homepage in the end... maybe reading their JavaScript that handles the response might solve this problem, but who's got time to do that (:

@IvanUkhov
Contributor Author

I’ve found the code that seems to be doing the translation. Unfortunately, it’s heavily obfuscated and pretty much useless:

function Rbb(b) {
    var i, j;
    Arb[i = ++Brb] = Rbb;
    Crb[i] = KXb + sMb, Fbb();
    var c, d, e, f, g;
    f = new(Crb[i] = KXb + ZRb, pZ)(utb);
    g = OY((Crb[i] = KXb + '547', b));
    for (Crb[i] = KXb + SEb, d = 0, e = g.length;
        (Crb[i] = KXb + SEb, d) < e; Crb[i] = KXb + SEb, ++d) {
        c = g[d];
        switch (Crb[i] = KXb + TEb, c) {
            case 50:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + tCb, f).a).a += '\xB2';
                break;
            case 43:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + WEb, f).a).a += '_';
                break;
            case 64:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + wCb, f).a).a += 'ng';
                break;
            case 99:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + yCb, f).a).a += 'tj';
                break;
            case 36:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + XEb, f).a).a += 'sj';
                break;
            case 65:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + jZb, f).a).a += "'a";
                break;
            case 69:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + zCb, f).a).a += "'e";
                break;
            case 73:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + aFb, f).a).a += "'i";
                break;
            case 79:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + aGb, f).a).a += "'o";
                break;
            case 85:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + xOb, f).a).a += "'u";
                break;
            case 89:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + NEb, f).a).a += "'y";
                break;
            case 197:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + _Rb, f).a).a += "'\xE5";
                break;
            case 196:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + dMb, f).a).a += "'\xE4";
                break;
            case 214:
                Crb[i] = dvb + ttb, (Crb[i] = tTb + Vwb, (Crb[i] = KXb + PRb, f).a).a += "'\xF6";
                break;
            default:
                Crb[i] = dvb + ptb, (Crb[i] = tTb + Lwb, (Crb[i] = KXb + fGb, f).a).a += (Crb[i] = xub + kBb, (Crb[i] = xub + kBb, String).fromCharCode((Crb[i] = KXb + fGb, c)));
        }
    }
    j = (Crb[i] = dvb + Gtb, (Crb[i] = tTb + _vb, (Crb[i] = KXb + ePb, f).a).a);
    Brb = i - 1;
    return j
}

@IvanUkhov
Contributor Author

That code indeed works. Here is a more human-friendly version:

var mapping = {
  50: '\xB2',
  43: '_',
  64: 'ng',
  99: 'tj',
  36: 'sj',
  65: "'a",
  69: "'e",
  73: "'i",
  79: "'o",
  85: "'u",
  89: "'y",
  197: "'\xE5",
  196: "'\xE4",
  214: "'\xF6",
};

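// Replace each character that has an entry in `mapping` (keyed by character
// code) with its readable form; copy every other character unchanged.
function translate(text) {
  var buffer = "";
  for (var i = 0, length = text.length, next; i < length; i++) {
    next = mapping[text.charCodeAt(i)];
    if (next === undefined) {
      next = text[i];
    }
    buffer += next;
  }
  return buffer;
}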
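
As a quick check, here is a self-contained sketch that applies the same table to the raw value quoted earlier in this thread. The table and the replacement rule are the ones above; the `raw` and `pretty` variable names and the use of `const`/`for...of` are just choices made for this example.

```javascript
// Same table as above: character code -> readable notation.
const mapping = {
  50: "\xB2", // "2" -> superscript two (accent 2)
  43: "_",    // "+" -> underscore
  64: "ng",   // "@" -> ng
  99: "tj",   // "c" -> tj
  36: "sj",   // "$" -> sj
  65: "'a",   // capital vowels -> stress mark + lowercase vowel
  69: "'e",
  73: "'i",
  79: "'o",
  85: "'u",
  89: "'y",
  197: "'\xE5",
  196: "'\xE4",
  214: "'\xF6",
};

function translate(text) {
  let buffer = "";
  for (const ch of text) {
    const next = mapping[ch.charCodeAt(0)];
    buffer += next === undefined ? ch : next;
  }
  return buffer;
}

// Raw value for "skjorta" as it appears in the dataset.
const raw = "\xB2$O:r+ta el. 2$Or+t:a";
const pretty = translate(raw);
console.log(pretty); // prints: ²sj'o:r_ta el. ²sj'or_t:a
```

The output matches the notation shown on Folkets lexikon, including the second superscript.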

@hashier
Owner

hashier commented Jan 9, 2016

wow! Batshit crazy! This was really something to get the obfuscated code to something like this simple! <3 love it
