New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
y-019 #680
Comments
I need to see an example to visualize what's going on. If anything, the first regex should be merged into the xpath, and the xpath updated to exclude the subset of The match only shows a few preceding characters to make it easier for the user to see where the error is in very long paragraphs. Otherwise we'd get constant mailing list messages asking why this error is occurring because it can be hard to spot a single miscurled quote in long paragraphs with possibly multiple quotes. |
Attached is the file I've started for testing. The FAIL 1 and FAIL 2 were written as exclusions because they are excluded by the second regex, but I had to convert them to failures because they fail due to the first regex. It doesn't show the preceding characters of the match, it shows the last twenty characters of the node. Thus, if you run lint on the attached file, the last error doesn't show the offending rdquo at all, just the last twenty characters of the paragraph. I appreciate the "hard to spot," and had issues with Proust over this very thing because of the lengths of his paragraphs, but IMO showing no context is worse than showing all of it. |
OK so you can just remove the subsetting then. As far as the structure of the test goes, if it works then I'm not too concerned about the internals. If you want to try to merge those two regexes into a single xpath, that gets the same results, then go for it. |
Right, it's the "if it works" that's the question. The first regex is including hits that the second regex is explicitly excluding. IOW, the exclusions don't work if there are quotes prior to them. If you don't care about that, I don't either. :) I'll remove the subset, and probably just include it in the PR for the next set of tests. |
If it works it works. If not, feel free to make any tweaks that might resolve the issue. |
The first test looks for
”[^“‘]+?”
, i.e. two consecutive rdquo's with no intervening ld/squo.If then filters those to exclude nested quotations; I haven't gotten to that part yet.
Then it adds "additional matches" using another slightly simplified regex,
^[^“]+”
to find a rdquo with no preceding ldquo on the line. Here there are exceptions: previous sibling isn't a blockquote, it's not in a blockquote itself, and it's not in a z3998:poem/verse/song/hymn.The issue is that the first test is going to catch things that the "additional matches" excludes. E.g., if a blockquote has a single rdquo with no preceding ldquo, then the second regex would specifically exclude it. But, if anywhere above that blockquote there was a rdquo, then the first regex is going to report the rdquo in the blockquote, even though the second regex wanted to exclude it. (I found this in testing, so this is real, not theoretical.)
The question is, do you care? This goes to intent more than code: if the reason for the exceptions in the second regex is valid, then shouldn't the first regex deal with them as well? Or, if we don't care where the second consecutive rdquo is, then are the exceptions on the second regex needed?
On an unrelated note, the additional matches only show the last twenty characters of the node, rather than whole node. That's seems a little odd to me, is inconsistent with what is done elsewhere, and in many cases (all, in my testing cases) doesn't show the problematic rdquo at all. Why aren't we showing the whole node there?
The text was updated successfully, but these errors were encountered: