Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contradictory conversions in seo_zenphoto extension #1174

Open
kochs-online opened this issue Jul 2, 2018 · 11 comments
Open

Contradictory conversions in seo_zenphoto extension #1174

kochs-online opened this issue Jul 2, 2018 · 11 comments

Comments

@kochs-online
Copy link
Contributor

I came back to this age-old thread and tried to write a plugin to change the names of new album folders to lower-case. I also wanted to convert some special characters used in the German language (ß and Umlauts) into their basic Latin alphabet transcriptions (ä → ae, ö → oe, …).

That being said I discovered that there already is an official extension named seo_zenphoto that's supposed to do exactly this. I gave it a test run and noticed that while it replaces "ä" with "ae" and "ö" with "oe", it replaces "ü" with "u" instead of "ue" as expected.
The reason for this is that there are some duplicates in the large replacement array $specialchars of seo_zenphoto.php. One of these entries replaces "ü" with "ue" and two more entries replace "ü" with "u".

Here is a list of the duplicates I found (that comparison wasn't exactly a work of science so there might be more 😉):

  • Å
  • ø
  • Ø
  • ŕ
  • ř
  • ŗ
  • ü
  • Ü
  • ǖ

Not all of these duplicates are contradictory, but some are.
I don't know the source of the list used for that array, but maybe it was locale-dependent? Should this replacement be locale-aware as well?

Bonus round: The capital ẞ is missing in the list of replacements. Correct replacement would be "SS".

Bonus round number 2: When starting to write my own extension I tried to keep Zenphoto's standard behavior for replacements and just add my own set. Unfortunately this is either/or. So you can either have Zenphoto's set of replacement rules or your own, but cannot combine them and maintaining an extension would require you to check all new versions if Zenphoto's implementation has changed. Any chance to improve this?

@acrylian
Copy link
Member

acrylian commented Jul 2, 2018

  1. Duplicates: That might have sneaked in as this list was from various sources and extended manually as there is no way to automate this (which is why it will not work that good with non European char systesm like cyrillc or Asian ones). But since I am German actually, I though at least those all are covered. If "ü" is not replaced correctly with "ue" that is a bug.
  2. Bonus round: I am sure when this was introduce there was no captial ß at all ;-) (And I won't get into that all capital ß I saw so far are typographical disasters ;-))
  3. Bonus round 2: Yes, that's true. But there might indeed be a chance in the next major release ;-)

So 1 and 2 I welcome an update to the plugin respectively the array of defines.

sbillard referenced this issue in sbillard/netPhotoGraphics-DEV Jul 2, 2018
@sbillard
Copy link
Contributor

sbillard commented Jul 2, 2018

Just curious about Bonus 2. Is your set exclusive to the zenphoto one? If not, at least for the PHP combination there should be no problem with adding it. These transformations are provided by applying a filter. As many filters as you like may be registered to a target. They are applied in priority order, so by an additional plugin would be no issue there. The javascript handling would be a bit more complicated, but also doable. The only thing required would be to remove the $plugin_disable line.

sbillard referenced this issue in sbillard/netPhotoGraphics-DEV Jul 2, 2018
sbillard referenced this issue in sbillard/netPhotoGraphics-DEV Jul 2, 2018
sbillard referenced this issue in netPhotoGraphics/netPhotoGraphics Jul 2, 2018
also better performance of seo_zenphoto
@kochs-online
Copy link
Contributor Author

kochs-online commented Jul 3, 2018

Is your set exclusive to the zenphoto one?

No, it's meant to be additive. I was trying to apply locale-specific conversions (replacement of German Umlauts) to the SEO-relevant string. This was before I found the seo_zenphoto extension.
Currently the code calling the plugin from function seoFriendly() looks like this:

if (zp_has_filter('seoFriendly')) {
	$string = zp_apply_filter('seoFriendly', $string);
} else { // no filter, do basic cleanup
	$string = trim($string);
	$string = preg_replace("/\s+/", "-", $string);
	$string = preg_replace("/[^a-zA-Z0-9_.-]/", "-", $string);
	$string = str_replace(array('---', '--'), '-', $string);
}

So I can either have my filter applied or have the basic cleanup, but not both. (At least not without pulling a trick like having my extension unregister itself, call seoFriendly() again and than re-register itself.)

Edit: Ideally there would be a way to let extensions decide whether they are additive or exclusive. This could be done using an optional call-by-reference parameter:

static function my_filter($string, &$exclusive = false) {
	$exclusive = true;
	// Work exclusive magic here!
}

@acrylian
Copy link
Member

acrylian commented Jul 3, 2018

Yes, since there is only a basic default the plugin takes over or not. But have you tried to call your plugin later in order than the official one? It is certainly possible to attach more than one filter calls to a filter hook.

As already mentioned in a later version this will move to core and allow adding directly (since the array is passed through the filter, the function attached can then decided what to do).

Meanwhile I would appreciate the update to the conversion array to the plugin since the duplicates are a bug.

Edit: I apparently didn't remember that the plugin author had decided back then to allow only one seo filter at the time. Try to remove $plugin_disable = (zp_has_filter('seoFriendly') && !extensionEnabled('seo_zenphoto')) ? sprintf(gettext('Only one SEO filter plugin may be enabled. <a href="#%1$s"><code>%1$s</code></a> is already enabled.'), stripSuffix(get_filterScript('seoFriendly'))) : ''; from the plugin and see if this works with calling your plugin later in order.

@sbillard
Copy link
Contributor

sbillard commented Jul 3, 2018

You are correct in observing that the "standard" seo removes non-alphanumeric characters and consolidates spaces into a hyphen. That service could be provided even if a filter is applied. But it is probably inappropriate for the default to be to remove the non-alphanumeric characters. Consolidating spaces into hyphens is ok, though.

If you create a filter it needs to replace the non-alphanumeric characters (if that is your desire.) For instance, the seo_null plugin deliberately does not remove them. If you remove the $plugin_disable lines you can use your filter as an additive. If the order of application matters, make sure the filter priorities match your required order.

The problem with the current implementation (besides not consolidating the spaces/hyphens centrally) is that the javascript version of the functions creates a function rather than provide an additive set of statements. (this is fixed by netPhotoGraphics/netPhotoGraphics@1f4e621) The javascript code is used only for creating new albums.

@kochs-online
Copy link
Contributor Author

Try to remove $plugin_disable […] from the plugin and see if this works with calling your plugin later in order.

I understand that my filter could be called in addition to other filters, but that wasn't my point. I don't like the idea of copy & pasting code to plugins or redefined functions/methods because that way you wouldn't notice if the original code was changed.

Imagine me copy & pasting the following line from Zenphoto to my filter:

$string = str_replace(array('---', '--'), '-', $string);

Tomorrow someone discovers that this line of code only replaces doubles and triplets of hyphens but not quadruplets and changes Zenphoto to use this improved version:

$string = preg_replace("/--+/", "-", $string);

My filter would now show a different behavior and I have successfully avoided to benefit from improvements to the base code. 😉

But it is probably inappropriate for the default to remove the non-alphanumeric characters. Consolidating spaces into hyphens is ok, though.

🤔 Agreed.

@acrylian
Copy link
Member

acrylian commented Jul 3, 2018

@kochs-online I do understand. This is why this is currently either/or. The point here was to remove that line and see if it works out calling both together. I'll consider to remove that requirement of one at the them since it does not make sense to limit this.

$string = preg_replace("/--+/", "-", $string);

You can just send it with the pull request for the updated conversion array fixing what you found ;-)

In the future when this will be core with some selections we'll consider to provide an additional filter to modify the conversion array itself additionally.

@kochs-online
Copy link
Contributor Author

Working on the updated conversion table. What a snakepit! 😣

As suspected conversions are locale-specific and may differ from language to language. One example would be the Cyrillic letter "ь", that "has no phonetic value of its own and is purely an orthographic device" in Modern Russian according to Wikipedia. As such it is dropped when converting SEO-relevant strings as of now. In Bulgarian however that letter is "representing the mid back unrounded vowel" and should be replaced by the Latin letter "a".

I know, the whole topic is rather academic, but nonetheless I wonder what a less error-prone solution would look like. Make it a switch/case-statement and evaluate the locale?

switch (setlocale(LC_CTYPE, "0")) {
	case 'de_DE':	// German
		// replace "ä" with "ae", "ö" with "oe" etc.
		break;
	case 'bg_BG':	// Bulgarian
		// replace "ь" with "a" etc.
		break;
	case 'ru_RU':	// Russian
		// drop "ь" etc.
		break;
	default:
		// replace any non-ASCII character with "-"
		break;
}

At least this would make the code a lot more readable and maintanable while also solving the problem of contradicting conversions. Downside would be that I have no idea how to assign the current replacements to various locales.

Another approach would be to use iconv() (or a similar function) and leave it to the framework, but reading some of the comments for this function I got the impression it has a bunch of pitfalls of its own. 😕

@acrylian
Copy link
Member

acrylian commented Jul 4, 2018

Unless I am mistaken iconv() is not always available. I don't think that it is really useful to create a switch. Overcomplicates stuff IMHO.

We should generally offer a general clearance to use plain ABC/Numbers for urls like other CMS do it. We will never be able to cover all possible languages and their specialities 100% correctly since you probably have to know those languages to be able to (see the speciality of the cyrillic char). I mean look at the array…

If anyone requires specific things really a specific plugin can/needs to be created. Also probably careful with any setlocale usage as it may intefere with other settings unwantedly.

@kochs-online
Copy link
Contributor Author

I analyzed the current array more thoroughly and identified these entries as (possibly) problematic:

$specialchars index UTF-8 code Character Substitute
563 c2a8 ¨ none
4 c385 Å A
26 c385 Å A
179 c398 Ø OE
807 c398 Ø O
244 c39c Ü UE
830 c39c Ü Ue
463 c3b8 ø oe
806 c3b8 ø o
533 c3bc ü ue
828 c3bc ü u
829 c3bc ü u
99 c4a6 Ħ H
660 c4a6 Ħ H
127 c4bb Ļ L
743 c4bb Ļ K
129 c4bd Ľ L
741 c4bd Ľ K
130 c4bf Ŀ L
742 c4bf Ŀ K
468 c595 ŕ p
817 c595 ŕ r
474 c597 ŗ p
819 c597 ŗ r
470 c599 ř p
818 c599 ř r
522 c796 ǖ u
525 c796 ǖ u
568 cdba ͺ none
567 ce84 ΄ none
566 ce85 ΅ none
983 d0aa Ъ none
987 d0ac Ь none
982 d18a ъ none
986 d18c ь none
113 e1b8ac I
114 e1b8ac I
125 e1b8b6 L
126 e1b8b6 L
583 e1bebd none
570 e1bebf ᾿ none
578 e1bf80 none
579 e1bf81 none
572 e1bf8d none
574 e1bf8e none
576 e1bf8f none
573 e1bf9d none
575 e1bf9e none
577 e1bf9f none
582 e1bfad none
580 e1bfae none
581 e1bfaf none
571 e1bfbe none

For comparison these are ordered by their (hexadecimal) UTF-8 code in order to find duplicates.

@acrylian
Copy link
Member

acrylian commented Jul 4, 2018

Okay, that is not that much. Feel free to make the changes to the existing array. Users can always manually edit the name/titlelink of items if it does not generate the 100% fitting replacement, too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants