Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fuzzy matching to name completion #2563

Open
wants to merge 35 commits into
base: master
Choose a base branch
from

Conversation

przepompownia
Copy link
Contributor

@przepompownia przepompownia commented Feb 26, 2024

Co-authored-by: @mamazu

Resolves #2562

@przepompownia przepompownia changed the title Implement ShortNameFuzzilyMatchesTo Criteria Add fuzzy matching to name completion Feb 26, 2024
return false;
}

$regex = '#' . implode('.*', array_map(preg_quote(...), str_split($this->name))) . '#i';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we move this to the constructor? That way we don't have to build the regex every time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea

@mamazu
Copy link
Contributor

mamazu commented Feb 26, 2024

Alternatively you could try something less regex:

function fuzzySearch(string $search, string $subject): bool
{
    $index = -1;
    foreach(str_split($search) as $char){
        $index = stripos($subject, $char, $index + 1);
        if ($index === false) {
            return false;
        }
    }
    
    return true;
}

Would need some benchmarking though.

@przepompownia
Copy link
Contributor Author

przepompownia commented Feb 26, 2024

First I needed to learn how to write my first benchmark for phpbench and used existing Criteria to compare:

\Phpactor\Indexer\Tests\Benchmark\Model\Query\Criteria\ShortNameMatchesToBench

    benchRegexFuzzyMatching # substring.....I4 - Mo3.035μs (±1.56%)
    benchRegexFuzzyMatching # subsequence...I4 - Mo2.963μs (±1.79%)
    benchBeginsWith # substring.............I4 - Mo2.197μs (±2.12%)
    benchBeginsWith # subsequence...........I4 - Mo2.242μs (±3.74%)

Subjects: 2, Assertions: 0, Failures: 0, Errors: 0
+------+-------------------------+-------------------------+-------------+------+------------+----------+--------------+----------------+
| iter | benchmark               | subject                 | set         | revs | mem_peak   | time_avg | comp_z_value | comp_deviation |
+------+-------------------------+-------------------------+-------------+------+------------+----------+--------------+----------------+
| 0    | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring   | 1000 | 5,152,448b | 3.023μs  | -0.55σ       | -0.87%         |
| 1    | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring   | 1000 | 5,152,448b | 3.132μs  | +1.73σ       | +2.71%         |
| 2    | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring   | 1000 | 5,152,448b | 3.058μs  | +0.18σ       | +0.28%         |
| 3    | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring   | 1000 | 5,152,448b | 2.988μs  | -1.29σ       | -2.01%         |
| 4    | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring   | 1000 | 5,152,448b | 3.046μs  | -0.07σ       | -0.11%         |
| 0    | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 3.023μs  | +1.27σ       | +2.28%         |
| 1    | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.918μs  | -0.71σ       | -1.27%         |
| 2    | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.953μs  | -0.05σ       | -0.09%         |
| 3    | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.880μs  | -1.43σ       | -2.56%         |
| 4    | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 3.004μs  | +0.91σ       | +1.64%         |
| 0    | ShortNameMatchesToBench | benchBeginsWith         | substring   | 1000 | 5,152,400b | 2.188μs  | -0.27σ       | -0.57%         |
| 1    | ShortNameMatchesToBench | benchBeginsWith         | substring   | 1000 | 5,152,400b | 2.207μs  | +0.14σ       | +0.29%         |
| 2    | ShortNameMatchesToBench | benchBeginsWith         | substring   | 1000 | 5,152,400b | 2.202μs  | +0.03σ       | +0.06%         |
| 3    | ShortNameMatchesToBench | benchBeginsWith         | substring   | 1000 | 5,152,400b | 2.276μs  | +1.62σ       | +3.43%         |
| 4    | ShortNameMatchesToBench | benchBeginsWith         | substring   | 1000 | 5,152,400b | 2.130μs  | -1.51σ       | -3.21%         |
| 0    | ShortNameMatchesToBench | benchBeginsWith         | subsequence | 1000 | 5,152,400b | 2.415μs  | +1.29σ       | +4.84%         |
| 1    | ShortNameMatchesToBench | benchBeginsWith         | subsequence | 1000 | 5,152,400b | 2.399μs  | +1.11σ       | +4.14%         |
| 2    | ShortNameMatchesToBench | benchBeginsWith         | subsequence | 1000 | 5,152,400b | 2.221μs  | -0.96σ       | -3.59%         |
| 3    | ShortNameMatchesToBench | benchBeginsWith         | subsequence | 1000 | 5,152,400b | 2.219μs  | -0.98σ       | -3.67%         |
| 4    | ShortNameMatchesToBench | benchBeginsWith         | subsequence | 1000 | 5,152,400b | 2.264μs  | -0.46σ       | -1.72%         |
+------+-------------------------+-------------------------+-------------+------+------------+----------+--------------+----------------+

Regex fuzzy looks ~50% slower than lead string equivalent.

@przepompownia
Copy link
Contributor Author

Maybe more examples are needed - at the moment I do not se any significant difference:

Details
\Phpactor\Indexer\Tests\Benchmark\Model\Query\Criteria\ShortNameMatchesToBench
benchRegexFuzzyMatching # substring.....I4 - Mo2.754μs (±1.41%)
benchRegexFuzzyMatching # subsequence...I4 - Mo2.763μs (±2.36%)
benchStringFuzzyMatching # substring....I4 - Mo2.794μs (±3.08%)
benchStringFuzzyMatching # subsequence..I4 - Mo2.918μs (±1.96%)
benchBeginsWith # substring.............I4 - Mo2.190μs (±3.51%)
benchBeginsWith # subsequence...........I4 - Mo2.115μs (±2.80%)

Subjects: 3, Assertions: 0, Failures: 0, Errors: 0
+------+-------------------------+--------------------------+-------------+------+------------+----------+--------------+----------------+
| iter | benchmark | subject | set | revs | mem_peak | time_avg | comp_z_value | comp_deviation |
+------+-------------------------+--------------------------+-------------+------+------------+----------+--------------+----------------+
| 0 | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring | 1000 | 5,152,448b | 2.742μs | -0.97σ | -1.37% |
| 1 | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring | 1000 | 5,152,448b | 2.752μs | -0.72σ | -1.01% |
| 2 | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring | 1000 | 5,152,448b | 2.811μs | +0.78σ | +1.11% |
| 3 | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring | 1000 | 5,152,448b | 2.842μs | +1.57σ | +2.22% |
| 4 | ShortNameMatchesToBench | benchRegexFuzzyMatching | substring | 1000 | 5,152,448b | 2.754μs | -0.67σ | -0.94% |
| 0 | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.775μs | -0.30σ | -0.71% |
| 1 | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.914μs | +1.81σ | +4.27% |
| 2 | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.812μs | +0.26σ | +0.62% |
| 3 | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.742μs | -0.80σ | -1.89% |
| 4 | ShortNameMatchesToBench | benchRegexFuzzyMatching | subsequence | 1000 | 5,152,448b | 2.731μs | -0.97σ | -2.28% |
| 0 | ShortNameMatchesToBench | benchStringFuzzyMatching | substring | 1000 | 5,152,496b | 2.835μs | -0.02σ | -0.05% |
| 1 | ShortNameMatchesToBench | benchStringFuzzyMatching | substring | 1000 | 5,152,496b | 2.958μs | +1.39σ | +4.29% |
| 2 | ShortNameMatchesToBench | benchStringFuzzyMatching | substring | 1000 | 5,152,496b | 2.907μs | +0.81σ | +2.49% |
| 3 | ShortNameMatchesToBench | benchStringFuzzyMatching | substring | 1000 | 5,152,496b | 2.729μs | -1.23σ | -3.79% |
| 4 | ShortNameMatchesToBench | benchStringFuzzyMatching | substring | 1000 | 5,152,496b | 2.753μs | -0.95σ | -2.94% |
| 0 | ShortNameMatchesToBench | benchStringFuzzyMatching | subsequence | 1000 | 5,152,496b | 2.927μs | +0.66σ | +1.29% |
| 1 | ShortNameMatchesToBench | benchStringFuzzyMatching | subsequence | 1000 | 5,152,496b | 2.918μs | +0.50σ | +0.98% |
| 2 | ShortNameMatchesToBench | benchStringFuzzyMatching | subsequence | 1000 | 5,152,496b | 2.895μs | +0.09σ | +0.18% |
| 3 | ShortNameMatchesToBench | benchStringFuzzyMatching | subsequence | 1000 | 5,152,496b | 2.779μs | -1.95σ | -3.83% |
| 4 | ShortNameMatchesToBench | benchStringFuzzyMatching | subsequence | 1000 | 5,152,496b | 2.930μs | +0.71σ | +1.39% |
| 0 | ShortNameMatchesToBench | benchBeginsWith | substring | 1000 | 5,152,400b | 2.264μs | +1.39σ | +4.88% |
| 1 | ShortNameMatchesToBench | benchBeginsWith | substring | 1000 | 5,152,400b | 2.093μs | -0.87σ | -3.04% |
| 2 | ShortNameMatchesToBench | benchBeginsWith | substring | 1000 | 5,152,400b | 2.192μs | +0.44σ | +1.55% |
| 3 | ShortNameMatchesToBench | benchBeginsWith | substring | 1000 | 5,152,400b | 2.053μs | -1.39σ | -4.89% |
| 4 | ShortNameMatchesToBench | benchBeginsWith | substring | 1000 | 5,152,400b | 2.191μs | +0.43σ | +1.50% |
| 0 | ShortNameMatchesToBench | benchBeginsWith | subsequence | 1000 | 5,152,400b | 2.110μs | -0.20σ | -0.56% |
| 1 | ShortNameMatchesToBench | benchBeginsWith | subsequence | 1000 | 5,152,400b | 2.096μs | -0.43σ | -1.22% |
| 2 | ShortNameMatchesToBench | benchBeginsWith | subsequence | 1000 | 5,152,400b | 2.210μs | +1.48σ | +4.16% |
| 3 | ShortNameMatchesToBench | benchBeginsWith | subsequence | 1000 | 5,152,400b | 2.034μs | -1.48σ | -4.14% |
| 4 | ShortNameMatchesToBench | benchBeginsWith | subsequence | 1000 | 5,152,400b | 2.159μs | +0.63σ | +1.75% |
+------+-------------------------+--------------------------+-------------+------+------------+----------+--------------+----------------+

@przepompownia
Copy link
Contributor Author

As expected, the result of the compound check is between str_starts_with and str_starts_with + fuzzy (depending on example)

with PHP version 8.3.3-1+0~20240216.17+debian12~1.gbp87e37b, xdebug ❌, opcache ✔

\Phpactor\Indexer\Tests\Benchmark\Model\Query\Criteria\ShortNameMatchesToBench

    benchRegexFuzzyMatching # leading subst.I4 - Mo2.850μs (±2.40%)
    benchRegexFuzzyMatching # empty search..I4 - Mo2.145μs (±5.57%)
    benchRegexFuzzyMatching # subsequence...I4 - Mo2.819μs (±1.17%)
    benchRegexFuzzyMatching # multibyte.....I4 - Mo2.638μs (±1.02%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.185μs (±1.35%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo2.065μs (±2.98%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.303μs (±2.46%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.176μs (±1.35%)
    benchStringFuzzyMatching # leading subs.I4 - Mo2.822μs (±1.26%)
    benchStringFuzzyMatching # empty search.I4 - Mo2.006μs (±10.71%)
    benchStringFuzzyMatching # subsequence..I4 - Mo2.757μs (±2.02%)
    benchStringFuzzyMatching # multibyte....I4 - Mo2.719μs (±2.06%)
    benchBeginsWith # leading substring.....I4 - Mo2.109μs (±3.38%)
    benchBeginsWith # empty search..........I4 - Mo1.905μs (±2.01%)
    benchBeginsWith # subsequence...........I4 - Mo2.139μs (±7.05%)
    benchBeginsWith # multibyte.............I4 - Mo2.167μs (±1.19%)

@przepompownia
Copy link
Contributor Author

ShortNameBeginsOrFuzzilyMatchesWith2 composed of two classes is slower than using both ways in one class:

    benchRegexFuzzyMatching # leading subst.I4 - Mo2.685μs (±7.27%)
    benchRegexFuzzyMatching # empty search..I4 - Mo2.128μs (±2.82%)
    benchRegexFuzzyMatching # subsequence...I4 - Mo2.814μs (±11.79%)
    benchRegexFuzzyMatching # multibyte.....I4 - Mo2.758μs (±13.90%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.327μs (±1.33%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo1.984μs (±2.34%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.291μs (±1.82%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.139μs (±1.75%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.366μs (±1.52%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo2.968μs (±1.80%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.801μs (±1.31%)
    benchShortNameBeginsOrFuzzilyMatchesWit.I4 - Mo3.772μs (±0.69%)
    benchStringFuzzyMatching # leading subs.I4 - Mo2.788μs (±0.72%)
    benchStringFuzzyMatching # empty search.I4 - Mo1.933μs (±1.72%)
    benchStringFuzzyMatching # subsequence..I4 - Mo2.835μs (±0.91%)
    benchStringFuzzyMatching # multibyte....I4 - Mo2.749μs (±5.00%)
    benchBeginsWith # leading substring.....I4 - Mo2.074μs (±2.48%)
    benchBeginsWith # empty search..........I4 - Mo1.889μs (±3.06%)
    benchBeginsWith # subsequence...........I4 - Mo2.054μs (±1.79%)
    benchBeginsWith # multibyte.............I4 - Mo2.115μs (±4.89%)

so it's probably to remove although looks more elegant.

@przepompownia
Copy link
Contributor Author

I'd like to clean this PR from redundant implementations but waiting for comments from @dantleech before.

@dantleech
Copy link
Collaborator

dantleech commented Mar 8, 2024

Have been testing, and notice that with the PR it can no longer find TestCase in a largish project (it's not listed at all):

image

Without this PR:

image

@mamazu
Copy link
Contributor

mamazu commented Mar 8, 2024

Is there a test that checks if a full match still works with fuzzy matching?

@przepompownia
Copy link
Contributor Author

The hint where the response can be truncated by exceeding any time limit (on the server side) could be helpful for me.

On the other side I need to prepare (probably with systemd-run) enough slow and restricted resources to running Phpactor within them (and without loosing any needed env etc.).

@przepompownia
Copy link
Contributor Author

przepompownia commented Mar 8, 2024

At the moment I want to defer expreiments but have the example

#!/usr/bin/env bash

# see: man systemd.resource-control
# see: man systemd-run

systemd-run \
    --quiet \
    --user \
    --pty \
    --pipe \
    --same-dir \
    --property=PrivateUsers=yes \
    --property=CPUQuota=10% \
    --property=IOWeight=10 \
    --setenv=XDG_CACHE_HOME="${XDG_CACHE_HOME}" \
    -- \
    "$(command -v phpactor)" "$@"

that seems to be too good (at least to displaying help - I did not check if it communicates with my client):

 13:48:09 arctgx@del: phpactor [fuzzy] $ time XDG_CACHE_HOME='/tmp' phpactor-limited --help >/dev/null

real    0m8.946s
user    0m0.023s
sys     0m0.014s

@przepompownia
Copy link
Contributor Author

przepompownia commented Mar 8, 2024

image
image
image

Please test and suggest better names than the current 😀

@przepompownia
Copy link
Contributor Author

@mamazu your proposition seemed to be easiest to convert to "camel matching".

@przepompownia
Copy link
Contributor Author

image

@thomas-hiron
Copy link

Hello, I'm using this branch for almost 3 weeks now, and here are a few thoughts:

  • I've always found what I was looking for
  • Autocomplete is slower but acceptable given I find classes with fewer keystrokes
  • Disabling this by default may be a good idea

Thanks for the work!

PHP LOC
Directories                                        513
Files                                             2478

Size
  Lines of Code (LOC)                           189174
  Comment Lines of Code (CLOC)                    9577 (5.06%)
  Non-Comment Lines of Code (NCLOC)             179597 (94.94%)
  Logical Lines of Code (LLOC)                   34914 (18.46%)
    Classes                                      34788 (99.64%)
    Functions                                       20 (0.06%)
    Not in classes or functions                    106 (0.30%)

@przepompownia
Copy link
Contributor Author

Nice to read. Do you use the recent ("camel case") implementation? I was going to clean this PR from other implementations but still wait for opinions (especially from @dantleech and @mamazu).

@thomas-hiron
Copy link

I don't think so, I cloned the repo on the 6th of March and didn't pull after that. I was using this commit all along: przepompownia@492bc9e5
I updated the project this morning, it seems a little faster!
I can give an update in a few days.

@mamazu
Copy link
Contributor

mamazu commented Mar 25, 2024

Looks good to me, haven't tried it on my projects out yet. I think something that we should also maybe look into is trying to sort them by namespace. (Like own code before vendor code) that would also help if you have the same classes in both directories.

@przepompownia
Copy link
Contributor Author

przepompownia commented Mar 25, 2024

I think something that we should also maybe look into is trying to sort them by namespace. (Like own code before vendor code) that would also help if you have the same classes in both directories.

Good idea but let do it in a separate PR.

For example, both `filgc` and `fil_g_c` match to `file_get_contents`
buf `fits` don't.
@przepompownia
Copy link
Contributor Author

Yet one addition:
image

@przepompownia
Copy link
Contributor Author

I reduced this PR to "begin or camel-fuzzy" proposition to make it ready to review. If you want to compare the current matcher with any previously tried, switch back to 64b46aa.

@przepompownia przepompownia marked this pull request as ready for review April 17, 2024 15:20
Copy link
Contributor

@mamazu mamazu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me.

@dantleech
Copy link
Collaborator

I need to test this, if there is any performance penalty or if it significantly changes the behavior it will need to be configurable.

@przepompownia
Copy link
Contributor Author

Please possibly change the way of this configuration (including especially renaming, spell mistakes, etc.). I started with a choice between the old and the new method but replaced it with a simple boolean option (indexer.searcher_semi_fuzzy).

As a side effect I added tests for both methods: they revealed missing mb_strtolower in one place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add fuzzy matching for name completion.
4 participants