Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Content Search: use AND operator by default and don't apply mask for phrases #6979

Merged
merged 24 commits into from
May 21, 2024
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
Show all changes
24 commits
Select commit Hold shift + click to select a range
0b2ce57
Content Search: use AND operator by default and don't apply mask for …
yurabakhtin Apr 30, 2024
fcbe691
Fix test
yurabakhtin Apr 30, 2024
4f92d84
Fix SearchQueryTest
yurabakhtin Apr 30, 2024
a74a574
Merge branch 'develop' into enh/230-content-search-default-operator
yurabakhtin Apr 30, 2024
c233527
Merge branch 'develop' into enh/230-content-search-default-operator
yurabakhtin May 2, 2024
06c6fe0
Fix creating of FULLTEXT index for mysql 8+
yurabakhtin May 2, 2024
c75eee8
Merge branch 'develop' into enh/230-content-search-default-operator
yurabakhtin May 2, 2024
92e4881
Fix mysql search driver for short words
yurabakhtin May 3, 2024
4fc9d98
Improve keyword highlighting
yurabakhtin May 3, 2024
30379f1
Use fixed value instead of `mysql.ft_min_word_len`
yurabakhtin May 3, 2024
ce85d0c
Exclude highlighting on console request
yurabakhtin May 6, 2024
308e400
Escape special char `@` on mysql search driver
yurabakhtin May 6, 2024
ff2705f
Merge branch 'develop' into enh/230-content-search-default-operator
yurabakhtin May 7, 2024
7aed7d6
Escape special chars on mysql search driver
yurabakhtin May 7, 2024
3673239
Escape char `/` on mysql search driver
yurabakhtin May 7, 2024
b37078d
Escape quote chars on mysql search driver
yurabakhtin May 7, 2024
bda6c61
Improve word highlighting with quote or apostrophe
yurabakhtin May 7, 2024
a6bcf94
Merge branch 'develop' into enh/230-content-search-default-operator
yurabakhtin May 9, 2024
9cec0c6
Add code comment for search query and driver
yurabakhtin May 9, 2024
9faa03a
Merge branch 'develop' into enh/230-content-search-default-operator
yurabakhtin May 16, 2024
b850754
Refactor search query
yurabakhtin May 17, 2024
5bdb059
Fix test for ZendLucence Driver
yurabakhtin May 17, 2024
2d165ae
Fix ZendLucence Driver for phrases
yurabakhtin May 17, 2024
f21605e
Merge branch 'develop' into enh/230-content-search-default-operator
luke- May 21, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG-DEV.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ HumHub Changelog
- Fix #6970: MultiSelect loads wrong options (since #6768 in 1.16.0-beta.1)
- Enh #6974: Highlight a content searching keyword on show more comments
- Fix #6977: Index sub comments for searching
- Enh #6979: Content Search: use AND operator by default and don't apply mask for phrases

1.16.0-beta.2 (April 9, 2024)
-----------------------------
Expand Down
26 changes: 17 additions & 9 deletions protected/humhub/libs/SearchQuery.php
Original file line number Diff line number Diff line change
Expand Up @@ -60,11 +60,20 @@ public function __construct(string $query)

if (!empty($result[0]) && is_array($result[0])) {
foreach ($result[0] as $i => $term) {
if (!preg_match('/^(\+|\-|AND |NOT )?".+"$/', $term)) {
// A not quoted term should be searched with mask by default
$term = rtrim($term, '*') . '*';
}

// Remove quotation marks
$term = str_replace('"', '', $term);

if (str_starts_with($term, '+') || str_starts_with($term, 'AND ')) {
if (str_starts_with($term, 'OR ')) {
$orTerms[] = preg_replace('/^((?i)OR )?/', '', $term);
} elseif (str_starts_with($term, '-') || str_starts_with($term, 'NOT ')) {
$notTerms[] = preg_replace('/^\-?((?i)NOT )?/', '', $term);
} else {
// Use AND operator by default

/**
* Special Case: In search queries like "Apple AND Banana", Apple should
Expand All @@ -76,18 +85,17 @@ public function __construct(string $query)
}

$andTerms[] = preg_replace('/^\+?((?i)AND )?/', '', $term);
} elseif (str_starts_with($term, '-') || str_starts_with($term, 'NOT ')) {
$notTerms[] = preg_replace('/^\-?((?i)NOT )?/', '', $term);
} else {
$orTerms[] = preg_replace('/^((?i)OR )?/', '', $term);
}
}
}

$this->notTerms = array_filter($notTerms);
$this->orTerms = array_filter($orTerms);
$this->andTerms = array_filter($andTerms);
$this->notTerms = array_filter($notTerms, [$this, 'filterEmptyTerms']);
$this->orTerms = array_filter($orTerms, [$this, 'filterEmptyTerms']);
$this->andTerms = array_filter($andTerms, [$this, 'filterEmptyTerms']);
}


private function filterEmptyTerms($term): bool
{
return !empty($term) && $term !== '*';
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -129,10 +129,10 @@ private function createMysqlFullTextQuery(SearchQuery $query, array $matchFields
$againstQuery = '';

foreach ($query->andTerms as $keyword) {
$againstQuery .= '+' . rtrim($keyword, '*') . '* ';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurabakhtin Is there a reason why you removed the wildcards?

Basically, the search for “Apple” should also find “Applepie”.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luke- I have moved the wildcards from drivers(mysql, zend, solr) to SearchQuery code here, because the chars " are removed on the SearchQuery side, so we cannot know what original term was really when we are on the driver side.

This is how it worked before the change:

  • Apple => Apple* (find Applepie)
  • "Apple" => Apple* (find Applepie)

After the change:

  • Apple => Apple* (find Applepie)
  • "Apple" => Apple (find only Apple, no Applepie)

As I understand it should work as it is described in the first table here.

Tests for quoted cases are here:

I have done the change for the request "Phrases should actually work. Can you check what is causing the problem here and add tests if necessary?", please correct me if I misunderstood it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurabakhtin My intention was for “SearchQuery” to be as neutral a representation of the search query as possible. Without any syntax like wildcards or terms in quotation marks.

e.g. Asterisk or Quotation marks is more of a driver-specific syntax. (Even if the similar for MySQL and Lucene)

But I understand that, for example, with SearchRequest::andTerms we cannot distinguish whether it was originally quoted or not.

Do you have any ideas?

  • We could introduce new attributes such as andParts (with Wildcard) and andTerms (with Quotationmarks).

  • We could define that as soon as a space is contained in e.g. andTerms, it is seen as a term, otherwise it is provided with a wildcard. (But then “Apple” always becomes Apple*.

  • ?

Or did I get something wrong here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@luke-

My intention was for “SearchQuery” to be as neutral a representation of the search query as possible. Without any syntax like wildcards or terms in quotation marks.

e.g. Asterisk or Quotation marks is more of a driver-specific syntax. (Even if the similar for MySQL and Lucene)

Ok, I have understood your intention.

For solve it we could implement new array as andParts or extend the existing array with new array item to mark each term like this:

SearchQuery->andTerms = [
    ['Apple', 'mask' => false],
    ['Pine', 'mask' => true],
];

so on driver side we will know what term should be masked with *.

But I am thinking why should we keep an additional array andParts or use the complex structure if we can use a last char of each term as flag. I.e. if the last char is * then we can decide such term should be processed with specific way on a driver side.
For current drivers(mysql, zend, solr) we can keep the code as is, but for future drivers the last char * can be used as flag like this, if the driver doesn't support terms ended with *:

foreach ($request->getSearchQuery()->andTerms as $term) {
    $keywordQuery->addSubquery(str_ends_with($term, '*') ? $this->prepareMaskedTerm($term) : $term);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurabakhtin Okay, I got your point.

Do I understand it correctly that now e.g. in SearchQuery::andTerms, can contain entries that are either quoted and contain an asterisk. Or is only the asterisk included?

Your approach is probably the right one. Since in fact all current drivers support this and it also corresponds to the query language of SearchQuery.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do I understand it correctly that now e.g. in SearchQuery::andTerms, can contain entries that are either quoted and contain an asterisk. Or is only the asterisk included?

@luke- Yes, but a bit correction about "quoted":

  • entries that are either simple word without quotes or contain an asterisk

Examples how a requested keyword is stored in terms:

  • Apple => Apple*
  • "Apple" => Apple
  • Foo bar => Foo bar*
  • "Foo bar" => Foo bar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurabakhtin Can you please add this example somewhere in the SearchQuery PHPDoc?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurabakhtin Can you please add this example somewhere in the SearchQuery PHPDoc?

@luke- 9cec0c6#diff-70f82fb3d09d4959fb9f44a526f8455144c112df88ec7299a2efe28a76171169R28-R64

$againstQuery .= '+' . $keyword . ' ';
}
foreach ($query->orTerms as $keyword) {
$againstQuery .= rtrim($keyword, '*') . '* ';
$againstQuery .= $keyword . ' ';
}
foreach ($query->notTerms as $keyword) {
$againstQuery .= '-' . $keyword . ' ';
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -163,11 +163,11 @@ protected function buildSearchQuery(SearchRequest $request): Boolean

$keywordQuery = new Boolean();
foreach ($request->getSearchQuery()->orTerms as $term) {
$keywordQuery->addSubquery(new Wildcard(new Term(mb_strtolower($term) . '*')), null);
$keywordQuery->addSubquery(new Wildcard(new Term(mb_strtolower($term))), null);
}

foreach ($request->getSearchQuery()->andTerms as $term) {
$keywordQuery->addSubquery(new Wildcard(new Term(mb_strtolower($term) . '*')), true);
$keywordQuery->addSubquery(new Wildcard(new Term(mb_strtolower($term))), true);
}

foreach ($request->getSearchQuery()->notTerms as $term) {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -48,11 +48,17 @@ public function testKeywords()
// Test Multiple AND Keywords
#$this->assertCount(1, $this->getSearchResultByKeyword('Marabru')->results);

$this->assertCount(1, $this->getSearchResultByKeyword('Marabru Leav Abcd')->results);
$this->assertCount(1, $this->getSearchResultByKeyword('"Marabru" Tes')->results);
$this->assertCount(0, $this->getSearchResultByKeyword('"Marabr" Test')->results);

$this->assertCount(1, $this->getSearchResultByKeyword('Marabru Leav')->results);
$this->assertCount(1, $this->getSearchResultByKeyword('Marabru Leav OR Abcd')->results);
$this->assertCount(2, $this->getSearchResultByKeyword('OR Marabru OR Something')->results);
$this->assertCount(0, $this->getSearchResultByKeyword('+Marabru +Leav* +Abcd')->results);
$this->assertCount(0, $this->getSearchResultByKeyword('Marabru Leav +Abcd')->results);

$this->assertCount(1, $this->getSearchResultByKeyword('Something -Marabru')->results);
$this->assertCount(1, $this->getSearchResultByKeyword('Something -Marab')->results);

// Wildcards
$this->assertCount(1, $this->getSearchResultByKeyword('Marabr*')->results);
Expand Down
50 changes: 25 additions & 25 deletions protected/humhub/tests/codeception/unit/libs/SearchQueryTest.php
Original file line number Diff line number Diff line change
Expand Up @@ -16,64 +16,64 @@ public function testTermsWithSigns()
{
$query = new SearchQuery('Apple Pie +"Foo" -"Foo bar"');

$this->assertContains('Apple', $query->orTerms);
$this->assertContains('Pie', $query->orTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertContains('Pie*', $query->andTerms);
$this->assertContains('Foo', $query->andTerms);
$this->assertContains('Foo bar', $query->notTerms);

$query = new SearchQuery('Apple');
$this->assertContains('Apple', $query->orTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertEmpty($query->orTerms);
$this->assertEmpty($query->notTerms);
$this->assertEmpty($query->andTerms);

$query = new SearchQuery('-Apple');
$this->assertContains('Apple', $query->notTerms);
$this->assertContains('Apple*', $query->notTerms);
$this->assertEmpty($query->orTerms);
$this->assertEmpty($query->andTerms);

$query = new SearchQuery('Apple +Banana');
$this->assertContains('Apple', $query->orTerms);
$this->assertContains('Banana', $query->andTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertContains('Banana*', $query->andTerms);
}

public function testTermsWithWords()
{
$query = new SearchQuery('Apple Pie AND "Foo" NOT "Foo bar"');

$this->assertContains('Apple', $query->orTerms);
$this->assertContains('Pie', $query->orTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertContains('Pie*', $query->andTerms);
$this->assertContains('Foo', $query->andTerms);
$this->assertContains('Foo bar', $query->notTerms);

$query = new SearchQuery('Apple');
$this->assertContains('Apple', $query->orTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertEmpty($query->orTerms);
$this->assertEmpty($query->notTerms);
$this->assertEmpty($query->andTerms);

$query = new SearchQuery('NOT Apple');
$this->assertContains('Apple', $query->notTerms);
$this->assertContains('Apple*', $query->notTerms);
$this->assertEmpty($query->orTerms);
$this->assertEmpty($query->andTerms);

$query = new SearchQuery('Apple OR Banana');
$this->assertContains('Apple', $query->orTerms);
$this->assertContains('Banana', $query->orTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertContains('Banana*', $query->orTerms);

$query = new SearchQuery('Apple AND Banana');
$this->assertContains('Apple', $query->andTerms);
$this->assertContains('Banana', $query->andTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertContains('Banana*', $query->andTerms);

$query = new SearchQuery('Apple AND Banana OR Grape');
$this->assertContains('Apple', $query->andTerms);
$this->assertContains('Banana', $query->andTerms);
$this->assertContains('Grape', $query->orTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertContains('Banana*', $query->andTerms);
$this->assertContains('Grape*', $query->orTerms);
}

public function testInvalid()
{
$query = new SearchQuery('Apple "Pie');
$this->assertContains('Apple', $query->orTerms);
$this->assertContains('Pie', $query->orTerms);
$this->assertContains('Apple*', $query->andTerms);
$this->assertContains('Pie*', $query->andTerms);

$query = new SearchQuery('');
$this->assertEmpty($query->orTerms);
Expand All @@ -99,13 +99,13 @@ public function testInvalid()
public function testTermsWithNumbers()
{
$query = new SearchQuery('Quote 2024');
$this->assertContains('2024', $query->orTerms);
$this->assertContains('Quote', $query->orTerms);
$this->assertEmpty($query->andTerms);
$this->assertContains('2024*', $query->andTerms);
$this->assertContains('Quote*', $query->andTerms);
$this->assertEmpty($query->orTerms);
$this->assertEmpty($query->notTerms);

$query = new SearchQuery('"Quote 2024"');
$this->assertContains('Quote 2024', $query->orTerms);
$this->assertContains('Quote 2024', $query->andTerms);

}
}