Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add localization (and reduce regex duplication) #84

Open
TBG-FR opened this issue Aug 4, 2021 · 0 comments
Open

Add localization (and reduce regex duplication) #84

TBG-FR opened this issue Aug 4, 2021 · 0 comments

Comments

@TBG-FR
Copy link

TBG-FR commented Aug 4, 2021

Hello again,

While working and searching in the Parser/EmailParser.php file, I thought about something:


Instead of having things like

private $quoteHeadersRegex = array(
    '/^.{0,5}(On(?:(?!\bOn\b|\bwrote(\s|\xc2\xa0)?:).){0,1000}wrote(\s|\xc2\xa0)?:)$/ms', // On DATE, NAME <EMAIL> wrote:
    '/^.{0,5}(Le\b(?:(?!\bLe\b|\bécrit(\s|\xc2\xa0)?:).){0,1000}écrit(\s|\xc2\xa0)?:)$/ms', // Le DATE, NAME <EMAIL> a écrit :
    '/^.{0,5}(El(?:(?!\bEl\b|\bescribió\s?:).){0,1000}escribió\s?:)$/ms', // El DATE, NAME <EMAIL> escribió:
    '/^.{0,5}(El(?:(?!\bEl\b|\bha escrit\s?:).){0,1000}ha escrit\s?:)$/ms', // El DATE, NAME <EMAIL> ha escrit:
    '/^.{0,5}(Il(?:(?!\bIl\b|\bscritto(\s|\xc2\xa0)?:).){0,1000}scritto(\s|\xc2\xa0)?:)$/ms', // Il DATE, NAME <EMAIL> ha scritto:
    [...]
    '/^\s*(From\s?:.+\s?(\[|<).+(\]|>))/mu', // "From: NAME <EMAIL>" OR "From : NAME <EMAIL>" OR "From : NAME<EMAIL>"(With support whitespace before start and before <)
    '/^\s*(发件人\s?:.+\s?(\[|<).+(\]|>))/mu', // "发件人: NAME <EMAIL>" OR "发件人 : NAME <EMAIL>" OR "发件人 : NAME<EMAIL>"(With support whitespace before start and before <)
    '/^\s*(De\s?:.+\s?(\[|<).+(\]|>))/mu', // "De: NAME <EMAIL>" OR "De : NAME <EMAIL>" OR "De : NAME<EMAIL>"  (With support whitespace before start and before <)
    '/^\s*(Van\s?:.+\s?(\[|<).+(\]|>))/mu', // "Van: NAME <EMAIL>" OR "Van : NAME <EMAIL>" OR "Van : NAME<EMAIL>"  (With support whitespace before start and before <)
    '/^\s*(Da\s?:.+\s?(\[|<).+(\]|>))/mu', // "Da: NAME <EMAIL>" OR "Da : NAME <EMAIL>" OR "Da : NAME<EMAIL>"  (With support whitespace before start and before <)
    [...]
);

couldn't we have only one variabilized line for each "type" of reply like that (of course it's only a draft):

private $quoteHeadersRegex = array(
    '/^.{0,5}($on(?:(?!\b$on\b|\b$wrote(\s|\xc2\xa0)?:).){0,1000}$wrote(\s|\xc2\xa0)?:)$/ms', // On DATE, NAME <EMAIL> wrote:
    [...]
    '/^\s*($from\s?:.+\s?(\[|<).+(\]|>))/mu', // "From: NAME <EMAIL>" OR "From : NAME <EMAIL>" OR "From : NAME<EMAIL>"(With support whitespace before start and before <)
    [...]
);

Then we would run these Regex checks using a list of language files, so for example $wrote would be checked with "wrote", then "a écrit", then "escribió", ...


Here are the advantages I see in that modification:

  • Adding a new language or variation is easier
  • You don't have to duplicate X times the same Regex, modifying one or two words each time
  • You're less likely to make a mistake in a Regex

That was my two cents, thanks for reading 😉

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant