Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

y-019 #680

Open
vr8hub opened this issue Apr 3, 2024 · 5 comments
Open

y-019 #680

vr8hub opened this issue Apr 3, 2024 · 5 comments

Comments

@vr8hub
Copy link
Contributor

vr8hub commented Apr 3, 2024

The first test looks for ”[^“‘]+?”, i.e. two consecutive rdquo's with no intervening ld/squo.
If then filters those to exclude nested quotations; I haven't gotten to that part yet.

Then it adds "additional matches" using another slightly simplified regex, ^[^“]+” to find a rdquo with no preceding ldquo on the line. Here there are exceptions: previous sibling isn't a blockquote, it's not in a blockquote itself, and it's not in a z3998:poem/verse/song/hymn.

The issue is that the first test is going to catch things that the "additional matches" excludes. E.g., if a blockquote has a single rdquo with no preceding ldquo, then the second regex would specifically exclude it. But, if anywhere above that blockquote there was a rdquo, then the first regex is going to report the rdquo in the blockquote, even though the second regex wanted to exclude it. (I found this in testing, so this is real, not theoretical.)

The question is, do you care? This goes to intent more than code: if the reason for the exceptions in the second regex is valid, then shouldn't the first regex deal with them as well? Or, if we don't care where the second consecutive rdquo is, then are the exceptions on the second regex needed?

On an unrelated note, the additional matches only show the last twenty characters of the node, rather than whole node. That's seems a little odd to me, is inconsistent with what is done elsewhere, and in many cases (all, in my testing cases) doesn't show the problematic rdquo at all. Why aren't we showing the whole node there?

@acabal
Copy link
Member

acabal commented Apr 3, 2024

I need to see an example to visualize what's going on.

If anything, the first regex should be merged into the xpath, and the xpath updated to exclude the subset of <td>s that we don't want to look at. I probably didn't do that to begin with because the xpath would have been very complicated.

The match only shows a few preceding characters to make it easier for the user to see where the error is in very long paragraphs. Otherwise we'd get constant mailing list messages asking why this error is occurring because it can be hard to spot a single miscurled quote in long paragraphs with possibly multiple quotes.

@vr8hub
Copy link
Contributor Author

vr8hub commented Apr 3, 2024

Attached is the file I've started for testing. The FAIL 1 and FAIL 2 were written as exclusions because they are excluded by the second regex, but I had to convert them to failures because they fail due to the first regex.

It doesn't show the preceding characters of the match, it shows the last twenty characters of the node. Thus, if you run lint on the attached file, the last error doesn't show the offending rdquo at all, just the last twenty characters of the paragraph. I appreciate the "hard to spot," and had issues with Proust over this very thing because of the lengths of his paragraphs, but IMO showing no context is worse than showing all of it.

chapter-1.txt

@acabal
Copy link
Member

acabal commented Apr 3, 2024

OK so you can just remove the subsetting then.

As far as the structure of the test goes, if it works then I'm not too concerned about the internals. If you want to try to merge those two regexes into a single xpath, that gets the same results, then go for it.

@vr8hub
Copy link
Contributor Author

vr8hub commented Apr 3, 2024

Right, it's the "if it works" that's the question. The first regex is including hits that the second regex is explicitly excluding. IOW, the exclusions don't work if there are quotes prior to them. If you don't care about that, I don't either. :)

I'll remove the subset, and probably just include it in the PR for the next set of tests.

@acabal
Copy link
Member

acabal commented Apr 3, 2024

If it works it works. If not, feel free to make any tweaks that might resolve the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants