Contradictory conversions in seo_zenphoto extension #1174

kochs-online · 2018-07-02T12:04:16Z

I came back to this age-old thread and tried to write a plugin to change the names of new album folders to lower-case. I also wanted to convert some special characters used in the German language (ß and Umlauts) into their basic Latin alphabet transcriptions (ä → ae, ö → oe, …).

That being said I discovered that there already is an official extension named seo_zenphoto that's supposed to do exactly this. I gave it a test run and noticed that while it replaces "ä" with "ae" and "ö" with "oe", it replaces "ü" with "u" instead of "ue" as expected.
The reason for this is that there are some duplicates in the large replacement array $specialchars of seo_zenphoto.php. One of these entries replaces "ü" with "ue" and two more entries replace "ü" with "u".

Here is a list of the duplicates I found (that comparison wasn't exactly a work of science so there might be more 😉):

Å
Ḭ
Ḷ
ø
Ø
ŕ
ř
ŗ
ü
Ü
ǖ

Not all of these duplicates are contradictory, but some are.
I don't know the source of the list used for that array, but maybe it was locale-dependent? Should this replacement be locale-aware as well?

Bonus round: The capital ẞ is missing in the list of replacements. Correct replacement would be "SS".

Bonus round number 2: When starting to write my own extension I tried to keep Zenphoto's standard behavior for replacements and just add my own set. Unfortunately this is either/or. So you can either have Zenphoto's set of replacement rules or your own, but cannot combine them and maintaining an extension would require you to check all new versions if Zenphoto's implementation has changed. Any chance to improve this?

The text was updated successfully, but these errors were encountered:

acrylian · 2018-07-02T13:07:10Z

Duplicates: That might have sneaked in as this list was from various sources and extended manually as there is no way to automate this (which is why it will not work that good with non European char systesm like cyrillc or Asian ones). But since I am German actually, I though at least those all are covered. If "ü" is not replaced correctly with "ue" that is a bug.
Bonus round: I am sure when this was introduce there was no captial ß at all ;-) (And I won't get into that all capital ß I saw so far are typographical disasters ;-))
Bonus round 2: Yes, that's true. But there might indeed be a chance in the next major release ;-)

So 1 and 2 I welcome an update to the plugin respectively the array of defines.

also see 6649e71

sbillard · 2018-07-02T19:41:08Z

Just curious about Bonus 2. Is your set exclusive to the zenphoto one? If not, at least for the PHP combination there should be no problem with adding it. These transformations are provided by applying a filter. As many filters as you like may be registered to a target. They are applied in priority order, so by an additional plugin would be no issue there. The javascript handling would be a bit more complicated, but also doable. The only thing required would be to remove the $plugin_disable line.

This reverts commit b239629.

also better performance of seo_zenphoto

kochs-online · 2018-07-03T07:27:31Z

Is your set exclusive to the zenphoto one?

No, it's meant to be additive. I was trying to apply locale-specific conversions (replacement of German Umlauts) to the SEO-relevant string. This was before I found the seo_zenphoto extension.
Currently the code calling the plugin from function seoFriendly() looks like this:

if (zp_has_filter('seoFriendly')) {
	$string = zp_apply_filter('seoFriendly', $string);
} else { // no filter, do basic cleanup
	$string = trim($string);
	$string = preg_replace("/\s+/", "-", $string);
	$string = preg_replace("/[^a-zA-Z0-9_.-]/", "-", $string);
	$string = str_replace(array('---', '--'), '-', $string);
}

So I can either have my filter applied or have the basic cleanup, but not both. (At least not without pulling a trick like having my extension unregister itself, call seoFriendly() again and than re-register itself.)

Edit: Ideally there would be a way to let extensions decide whether they are additive or exclusive. This could be done using an optional call-by-reference parameter:

static function my_filter($string, &$exclusive = false) {
	$exclusive = true;
	// Work exclusive magic here!
}

acrylian · 2018-07-03T08:38:20Z

Yes, since there is only a basic default the plugin takes over or not. But have you tried to call your plugin later in order than the official one? It is certainly possible to attach more than one filter calls to a filter hook.

As already mentioned in a later version this will move to core and allow adding directly (since the array is passed through the filter, the function attached can then decided what to do).

Meanwhile I would appreciate the update to the conversion array to the plugin since the duplicates are a bug.

Edit: I apparently didn't remember that the plugin author had decided back then to allow only one seo filter at the time. Try to remove $plugin_disable = (zp_has_filter('seoFriendly') && !extensionEnabled('seo_zenphoto')) ? sprintf(gettext('Only one SEO filter plugin may be enabled. <a href="#%1$s"><code>%1$s</code></a> is already enabled.'), stripSuffix(get_filterScript('seoFriendly'))) : ''; from the plugin and see if this works with calling your plugin later in order.

sbillard · 2018-07-03T15:47:13Z

You are correct in observing that the "standard" seo removes non-alphanumeric characters and consolidates spaces into a hyphen. That service could be provided even if a filter is applied. But it is probably inappropriate for the default to be to remove the non-alphanumeric characters. Consolidating spaces into hyphens is ok, though.

If you create a filter it needs to replace the non-alphanumeric characters (if that is your desire.) For instance, the seo_null plugin deliberately does not remove them. If you remove the $plugin_disable lines you can use your filter as an additive. If the order of application matters, make sure the filter priorities match your required order.

The problem with the current implementation (besides not consolidating the spaces/hyphens centrally) is that the javascript version of the functions creates a function rather than provide an additive set of statements. (this is fixed by netPhotoGraphics/netPhotoGraphics@1f4e621) The javascript code is used only for creating new albums.

kochs-online · 2018-07-03T16:51:59Z

Try to remove $plugin_disable […] from the plugin and see if this works with calling your plugin later in order.

I understand that my filter could be called in addition to other filters, but that wasn't my point. I don't like the idea of copy & pasting code to plugins or redefined functions/methods because that way you wouldn't notice if the original code was changed.

Imagine me copy & pasting the following line from Zenphoto to my filter:

$string = str_replace(array('---', '--'), '-', $string);

Tomorrow someone discovers that this line of code only replaces doubles and triplets of hyphens but not quadruplets and changes Zenphoto to use this improved version:

$string = preg_replace("/--+/", "-", $string);

My filter would now show a different behavior and I have successfully avoided to benefit from improvements to the base code. 😉

But it is probably inappropriate for the default to remove the non-alphanumeric characters. Consolidating spaces into hyphens is ok, though.

🤔 Agreed.

acrylian · 2018-07-03T17:23:10Z

@kochs-online I do understand. This is why this is currently either/or. The point here was to remove that line and see if it works out calling both together. I'll consider to remove that requirement of one at the them since it does not make sense to limit this.

$string = preg_replace("/--+/", "-", $string);

You can just send it with the pull request for the updated conversion array fixing what you found ;-)

In the future when this will be core with some selections we'll consider to provide an additional filter to modify the conversion array itself additionally.

kochs-online · 2018-07-04T08:42:45Z

Working on the updated conversion table. What a snakepit! 😣

As suspected conversions are locale-specific and may differ from language to language. One example would be the Cyrillic letter "ь", that "has no phonetic value of its own and is purely an orthographic device" in Modern Russian according to Wikipedia. As such it is dropped when converting SEO-relevant strings as of now. In Bulgarian however that letter is "representing the mid back unrounded vowel" and should be replaced by the Latin letter "a".

I know, the whole topic is rather academic, but nonetheless I wonder what a less error-prone solution would look like. Make it a switch/case-statement and evaluate the locale?

switch (setlocale(LC_CTYPE, "0")) {
	case 'de_DE':	// German
		// replace "ä" with "ae", "ö" with "oe" etc.
		break;
	case 'bg_BG':	// Bulgarian
		// replace "ь" with "a" etc.
		break;
	case 'ru_RU':	// Russian
		// drop "ь" etc.
		break;
	default:
		// replace any non-ASCII character with "-"
		break;
}

At least this would make the code a lot more readable and maintanable while also solving the problem of contradicting conversions. Downside would be that I have no idea how to assign the current replacements to various locales.

Another approach would be to use iconv() (or a similar function) and leave it to the framework, but reading some of the comments for this function I got the impression it has a bunch of pitfalls of its own. 😕

acrylian · 2018-07-04T09:45:49Z

Unless I am mistaken iconv() is not always available. I don't think that it is really useful to create a switch. Overcomplicates stuff IMHO.

We should generally offer a general clearance to use plain ABC/Numbers for urls like other CMS do it. We will never be able to cover all possible languages and their specialities 100% correctly since you probably have to know those languages to be able to (see the speciality of the cyrillic char). I mean look at the array…

If anyone requires specific things really a specific plugin can/needs to be created. Also probably careful with any setlocale usage as it may intefere with other settings unwantedly.

kochs-online · 2018-07-04T10:44:49Z

I analyzed the current array more thoroughly and identified these entries as (possibly) problematic:

`$specialchars` index	UTF-8 code	Character	Substitute
563	c2a8	¨	none
4	c385	Å	A
26	c385	Å	A
179	c398	Ø	OE
807	c398	Ø	O
244	c39c	Ü	UE
830	c39c	Ü	Ue
463	c3b8	ø	oe
806	c3b8	ø	o
533	c3bc	ü	ue
828	c3bc	ü	u
829	c3bc	ü	u
99	c4a6	Ħ	H
660	c4a6	Ħ	H
127	c4bb	Ļ	L
743	c4bb	Ļ	K
129	c4bd	Ľ	L
741	c4bd	Ľ	K
130	c4bf	Ŀ	L
742	c4bf	Ŀ	K
468	c595	ŕ	p
817	c595	ŕ	r
474	c597	ŗ	p
819	c597	ŗ	r
470	c599	ř	p
818	c599	ř	r
522	c796	ǖ	u
525	c796	ǖ	u
568	cdba	ͺ	none
567	ce84	΄	none
566	ce85	΅	none
983	d0aa	Ъ	none
987	d0ac	Ь	none
982	d18a	ъ	none
986	d18c	ь	none
113	e1b8ac	Ḭ	I
114	e1b8ac	Ḭ	I
125	e1b8b6	Ḷ	L
126	e1b8b6	Ḷ	L
583	e1bebd	᾽	none
570	e1bebf	᾿	none
578	e1bf80	῀	none
579	e1bf81	῁	none
572	e1bf8d	῍	none
574	e1bf8e	῎	none
576	e1bf8f	῏	none
573	e1bf9d	῝	none
575	e1bf9e	῞	none
577	e1bf9f	῟	none
582	e1bfad	῭	none
580	e1bfae	΅	none
581	e1bfaf	`	none
571	e1bfbe	῾	none

For comparison these are ordered by their (hexadecimal) UTF-8 code in order to find duplicates.

acrylian · 2018-07-04T11:46:36Z

Okay, that is not that much. Feel free to make the changes to the existing array. Users can always manually edit the name/titlelink of items if it does not generate the 100% fitting replacement, too.

…1174

sbillard referenced this issue in sbillard/netPhotoGraphics-DEV Jul 2, 2018

fix for netPhotoGraphics#1174

b239629

also see 6649e71

sbillard referenced this issue in sbillard/netPhotoGraphics-DEV Jul 2, 2018

Revert "fix for netPhotoGraphics#1174"

c8b4893

This reverts commit b239629.

sbillard referenced this issue in sbillard/netPhotoGraphics-DEV Jul 2, 2018

Revert "fix for netPhotoGraphics#1174"

7454060

This reverts commit b239629.

sbillard referenced this issue in netPhotoGraphics/netPhotoGraphics Jul 2, 2018

fix #1174

1f4e621

also better performance of seo_zenphoto

acrylian added the enhancement label Jul 7, 2018

acrylian added a commit that referenced this issue Sep 30, 2018

seo_cleanup: remove some duplicates and improve hyphen replacements #…

d124977

…1174

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Contradictory conversions in seo_zenphoto extension #1174

Contradictory conversions in seo_zenphoto extension #1174

kochs-online commented Jul 2, 2018

acrylian commented Jul 2, 2018 •

edited

sbillard commented Jul 2, 2018

kochs-online commented Jul 3, 2018 •

edited

acrylian commented Jul 3, 2018 •

edited

sbillard commented Jul 3, 2018

kochs-online commented Jul 3, 2018

acrylian commented Jul 3, 2018 •

edited

kochs-online commented Jul 4, 2018

acrylian commented Jul 4, 2018

kochs-online commented Jul 4, 2018

acrylian commented Jul 4, 2018 •

edited

Contradictory conversions in seo_zenphoto extension #1174

Contradictory conversions in seo_zenphoto extension #1174

Comments

kochs-online commented Jul 2, 2018

acrylian commented Jul 2, 2018 • edited

sbillard commented Jul 2, 2018

kochs-online commented Jul 3, 2018 • edited

acrylian commented Jul 3, 2018 • edited

sbillard commented Jul 3, 2018

kochs-online commented Jul 3, 2018

acrylian commented Jul 3, 2018 • edited

kochs-online commented Jul 4, 2018

acrylian commented Jul 4, 2018

kochs-online commented Jul 4, 2018

acrylian commented Jul 4, 2018 • edited

acrylian commented Jul 2, 2018 •

edited

kochs-online commented Jul 3, 2018 •

edited

acrylian commented Jul 3, 2018 •

edited

acrylian commented Jul 3, 2018 •

edited

acrylian commented Jul 4, 2018 •

edited