y-019 #680

vr8hub · 2024-04-03T00:03:35Z

The first test looks for ”[^“‘]+?”, i.e. two consecutive rdquo's with no intervening ld/squo.
If then filters those to exclude nested quotations; I haven't gotten to that part yet.

Then it adds "additional matches" using another slightly simplified regex, ^[^“]+” to find a rdquo with no preceding ldquo on the line. Here there are exceptions: previous sibling isn't a blockquote, it's not in a blockquote itself, and it's not in a z3998:poem/verse/song/hymn.

The issue is that the first test is going to catch things that the "additional matches" excludes. E.g., if a blockquote has a single rdquo with no preceding ldquo, then the second regex would specifically exclude it. But, if anywhere above that blockquote there was a rdquo, then the first regex is going to report the rdquo in the blockquote, even though the second regex wanted to exclude it. (I found this in testing, so this is real, not theoretical.)

The question is, do you care? This goes to intent more than code: if the reason for the exceptions in the second regex is valid, then shouldn't the first regex deal with them as well? Or, if we don't care where the second consecutive rdquo is, then are the exceptions on the second regex needed?

On an unrelated note, the additional matches only show the last twenty characters of the node, rather than whole node. That's seems a little odd to me, is inconsistent with what is done elsewhere, and in many cases (all, in my testing cases) doesn't show the problematic rdquo at all. Why aren't we showing the whole node there?

The text was updated successfully, but these errors were encountered:

acabal · 2024-04-03T17:01:43Z

I need to see an example to visualize what's going on.

If anything, the first regex should be merged into the xpath, and the xpath updated to exclude the subset of <td>s that we don't want to look at. I probably didn't do that to begin with because the xpath would have been very complicated.

The match only shows a few preceding characters to make it easier for the user to see where the error is in very long paragraphs. Otherwise we'd get constant mailing list messages asking why this error is occurring because it can be hard to spot a single miscurled quote in long paragraphs with possibly multiple quotes.

vr8hub · 2024-04-03T17:45:44Z

Attached is the file I've started for testing. The FAIL 1 and FAIL 2 were written as exclusions because they are excluded by the second regex, but I had to convert them to failures because they fail due to the first regex.

It doesn't show the preceding characters of the match, it shows the last twenty characters of the node. Thus, if you run lint on the attached file, the last error doesn't show the offending rdquo at all, just the last twenty characters of the paragraph. I appreciate the "hard to spot," and had issues with Proust over this very thing because of the lengths of his paragraphs, but IMO showing no context is worse than showing all of it.

chapter-1.txt

acabal · 2024-04-03T18:10:55Z

OK so you can just remove the subsetting then.

As far as the structure of the test goes, if it works then I'm not too concerned about the internals. If you want to try to merge those two regexes into a single xpath, that gets the same results, then go for it.

vr8hub · 2024-04-03T18:41:35Z

Right, it's the "if it works" that's the question. The first regex is including hits that the second regex is explicitly excluding. IOW, the exclusions don't work if there are quotes prior to them. If you don't care about that, I don't either. :)

I'll remove the subset, and probably just include it in the PR for the next set of tests.

acabal · 2024-04-03T18:49:45Z

If it works it works. If not, feel free to make any tweaks that might resolve the issue.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

y-019 #680

y-019 #680

vr8hub commented Apr 3, 2024

acabal commented Apr 3, 2024 •

edited

vr8hub commented Apr 3, 2024

acabal commented Apr 3, 2024

vr8hub commented Apr 3, 2024

acabal commented Apr 3, 2024

y-019 #680

y-019 #680

Comments

vr8hub commented Apr 3, 2024

acabal commented Apr 3, 2024 • edited

vr8hub commented Apr 3, 2024

acabal commented Apr 3, 2024

vr8hub commented Apr 3, 2024

acabal commented Apr 3, 2024

acabal commented Apr 3, 2024 •

edited