Add ManticoresearchAdapter #103

Draft: wants to merge 2 commits into base 0.5 from feature/add-manticoresearch-adapter

alexander-schranz
Member

@alexander-schranz alexander-schranz commented Feb 13, 2023

Manticore Search is a Sphinx fork with a PHP client available at https://github.com/manticoresoftware/manticoresearch-php. As requested by some people on Reddit, we are trying to support it as well.

TODO

  • SchemaManagerTest
  • AdapterTest
    • Create and Drop Indexes
    • Create and Drop Schemas
    • Write Document
    • Read Document by id/uuid (currently stuck on using uuids)
  • ConnectionTest
    • testSaveDeleteIdentifierCondition
    • testFindMultipleIndexes Skipped TODO issue
    • testSearchCondition
    • testNoneSearchableFields
    • testLimitAndOffset
    • testEqualCondition
    • testMultiEqualCondition
    • testNotEqualCondition
    • testGreaterThanCondition
    • testGreaterThanEqualCondition
    • testLessThanCondition
    • testLessThanEqualCondition
    • testSortByAsc
    • testSortByDesc

External

@alexander-schranz alexander-schranz added the features New feature or request label Feb 13, 2023
@alexander-schranz alexander-schranz force-pushed the feature/add-manticoresearch-adapter branch 2 times, most recently from ab0ccb5 to b74f9aa Compare February 13, 2023 22:21
@alexander-schranz alexander-schranz added the Adapter: Manticoresearch Manticoresearch Adapter related issue label Feb 13, 2023
@alexander-schranz alexander-schranz force-pushed the feature/add-manticoresearch-adapter branch from e85cd0b to 7a97e4c Compare February 15, 2023 22:08
@alexander-schranz alexander-schranz added the help wanted Extra attention is needed label Feb 15, 2023
@alexander-schranz alexander-schranz force-pushed the feature/add-manticoresearch-adapter branch 4 times, most recently from 8cff708 to 2d70635 Compare February 22, 2023 21:43
@alexander-schranz alexander-schranz marked this pull request as draft April 1, 2023 17:41
@sanikolaev

currently stuck on create valid schema for complex document

Hi. I'm a member of Manticore team. Please let me know if we can help with this.

@alexander-schranz alexander-schranz force-pushed the feature/add-manticoresearch-adapter branch 5 times, most recently from c95518c to c508f6c Compare May 16, 2023 19:10
@alexander-schranz
Member Author

alexander-schranz commented May 16, 2023

@sanikolaev your help is really welcome here. Maybe we can do this step by step. First, it would be nice if you could help with how to map a SEAL schema to a Manticore Search schema.

The abstraction supports the following kinds of fields. For single-value fields I currently use the following mapping, which I think should be correct. Datetimes / timestamps seem to be represented in Manticore Search as numbers, so our converter turns "2023-..." into a numeric timestamp, as we already do for Apache Solr. So a very basic mapping should look like this; I hope at least this is correct:

| SEAL Field Type | Manticore Field Type | Example in JSON |
|---|---|---|
| BooleanField | bool | { "field": false } |
| IntegerField | int | { "field": 1 } |
| FloatField | float | { "field": 1.5 } |
| DateTimeField | timestamp | { "field": 1684265085 } |
| TextField | text | { "field": "Text" } |
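The date conversion mentioned above could look roughly like the following. This is only a sketch: `toManticoreTimestamp` is a hypothetical helper name, not part of SEAL or the Manticore client, and it assumes the incoming value is an ISO 8601 string with an offset.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of SEAL or manticoresearch-php):
 * converts an ISO 8601 date string into a Unix timestamp in seconds,
 * which is the numeric form a `timestamp` column expects.
 */
function toManticoreTimestamp(string $isoDate): int
{
    // DateTimeImmutable honours the UTC offset embedded in the string.
    return (new DateTimeImmutable($isoDate))->getTimestamp();
}

// Example: convert the `created` value before sending the document.
$document = ['created' => '2022-01-24T12:00:00+01:00'];
$document['created'] = toManticoreTimestamp($document['created']);
```

The same conversion would have to be reversed when reading documents back, if the abstraction is expected to return date strings.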

But now the more difficult part: every field can be multiple. I'm not yet sure how to map anything other than an integer field to an array, because according to https://manual.manticoresearch.com/Creating_a_table/Data_types#Multi-value-integer-(MVA) multi-value attributes currently only work for integers, not for other value types. How can other types of data that are represented as an array be stored?

| SEAL Schema Type | Manticore Field Type | Example in JSON |
|---|---|---|
| multi BooleanField | ??? | { "field": [false, true] } |
| multi IntegerField | multi | { "field": [1, 3] } |
| multi FloatField | ??? | { "field": [1.5, 2.5] } |
| multi DateTimeField | ??? | { "field": [1684265085, 1684265025] } |
| multi TextField | ??? | { "field": ["Text 1", "Text"] } |

The TextField has a special flag called searchable, which can be true (the default) or false. Based on that flag I currently add ['indexed'] or not. I think that should work, but as the issue above is currently blocking, I was not able to test it yet: https://manual.manticoresearch.com/Creating_a_table/Data_types#Character-data-types:

| SEAL Schema Type | Manticore Field Type | Example in JSON |
|---|---|---|
| TextField searchable | text ['indexed'] | { "field": "Text 1" } |
| multi TextField searchable | ??? ['indexed'] | { "field": ["Text 1", "Text"] } |

While reading the documentation about text / string, I'm not sure whether a non-searchable field that contains text would be better as a string instead of a text field.

| SEAL Schema Type | Manticore Field Type | Example in JSON |
|---|---|---|
| TextField not searchable | string | { "field": "Text 1" } |
| multi TextField not searchable | ??? | { "field": ["Text 1", "Text"] } |

All kinds of fields can be filterable and sortable. Based on the documentation, such fields are required to be marked as attribute: https://manual.manticoresearch.com/Creating_a_table/Data_types#Character-data-types:

| SEAL Schema Type | Manticore Field Type | Example in JSON |
|---|---|---|
| BooleanField | bool ['attribute'] | { "field": false } |
| IntegerField | int ['attribute'] | { "field": 1 } |
| FloatField | float ['attribute'] | { "field": 1.5 } |
| DateTimeField | timestamp ['attribute'] | { "field": 1684265085 } |
| TextField | text ['attribute'] | { "field": "Text" } |
| multi BooleanField | ??? ['attribute'] | { "field": [false, true] } |
| multi IntegerField | multi ['attribute'] | { "field": [1, 3] } |
| multi FloatField | ??? ['attribute'] | { "field": [1.5, 2.5] } |
| multi DateTimeField | ??? ['attribute'] | { "field": [1684265085, 1684265025] } |
| multi TextField | ??? ['attribute'] | { "field": ["Text 1", "Text"] } |

The problem with the multiple fields is what currently makes the implementation crash, as I'm not sure how this can be handled with the Manticore Search engine or Sphinx:

Warning
Manticoresearch\Exceptions\ResponseException: "MVA elements should be integers"

From the previous discussion, a JSON field might support this, but I'm not sure how to define those types correctly, as it fails in combination with indexed:

Warning
Manticoresearch\Exceptions\ResponseException: "sphinxql: syntax error, unexpected INDEXED, expecting ')' or ',' near 'indexed,blocks_text_description json indexed,blocks_text_media multi,blocks_embed_title json indexed,blocks_embed_media json,footer_title text indexed,created timestamp,commentsCount integer,rating float,comments_email json,comments_text json indexed,tags json indexed attribute,categoryIds multi,_source text)'"

As an example, our test has a tags field, and the tags are defined as a filterable, searchable, multi TextField; see the definitions here.

@alexander-schranz alexander-schranz force-pushed the feature/add-manticoresearch-adapter branch from c508f6c to 38db6ff Compare May 16, 2023 19:49
@alexander-schranz alexander-schranz force-pushed the feature/add-manticoresearch-adapter branch from 38db6ff to 9087029 Compare May 16, 2023 20:02
@sanikolaev

How to store other type of data then which represented in an array?

This is only possible using the json type, e.g.:

mysql> drop table if exists t; create table t(string_array json, float_array json, bool_array json); insert into t values(0, '["abc", "def"]', '[1.23, 2.34]', '[true, false]'),(0, '["ghi", "jkl"]', '[3.45, 4.56]', '[true, true]'); select *, any(x = 'abc' for x in string_array), any(x > 3.0 and x < 4.0 for x in float_array), all(x = 1 for x in bool_array) from t;
--------------
drop table if exists t
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
create table t(string_array json, float_array json, bool_array json)
--------------

Query OK, 0 rows affected (0.01 sec)

--------------
insert into t values(0, '["abc", "def"]', '[1.23, 2.34]', '[true, false]'),(0, '["ghi", "jkl"]', '[3.45, 4.56]', '[true, true]')
--------------

Query OK, 2 rows affected (0.00 sec)

--------------
select *, any(x = 'abc' for x in string_array), any(x > 3.0 and x < 4.0 for x in float_array), all(x = 1 for x in bool_array) from t
--------------

+---------------------+---------------+---------------------+--------------+--------------------------------------+-----------------------------------------------+--------------------------------+
| id                  | string_array  | float_array         | bool_array   | any(x = 'abc' for x in string_array) | any(x > 3.0 and x < 4.0 for x in float_array) | all(x = 1 for x in bool_array) |
+---------------------+---------------+---------------------+--------------+--------------------------------------+-----------------------------------------------+--------------------------------+
| 1515343812221005444 | ["abc","def"] | [1.230000,2.340000] | [true,false] |                                    1 |                                             0 |                              0 |
| 1515343812221005445 | ["ghi","jkl"] | [3.450000,4.560000] | [true,true]  |                                    0 |                                             1 |                              1 |
+---------------------+---------------+---------------------+--------------+--------------------------------------+-----------------------------------------------+--------------------------------+
2 rows in set (0.00 sec)

BTW timestamp internally is just int, so an array of timestamps would be multi.
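Putting the answer above together with the mapping tables from the earlier comment, a type resolver for the adapter might be sketched as follows. `manticoreTypeFor` and the lowercase type keys are hypothetical names chosen for illustration; the SEAL field classes are represented as plain strings so the snippet stays self-contained.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical mapping helper (not the actual adapter code).
 * Single values use the scalar types from the table above; arrays use
 * `multi` for integer-backed types and `json` for everything else.
 */
function manticoreTypeFor(string $sealType, bool $multiple): string
{
    $singleTypes = [
        'boolean'  => 'bool',
        'integer'  => 'int',
        'float'    => 'float',
        'datetime' => 'timestamp',
        'text'     => 'text',
    ];

    if (!isset($singleTypes[$sealType])) {
        throw new InvalidArgumentException("Unknown field type: $sealType");
    }

    if (!$multiple) {
        return $singleTypes[$sealType];
    }

    // Timestamps are stored as integers, so arrays of them fit into `multi`.
    return in_array($sealType, ['integer', 'datetime'], true) ? 'multi' : 'json';
}

echo manticoreTypeFor('float', true), "\n";    // json
echo manticoreTypeFor('datetime', true), "\n"; // multi
```

Whether `json` columns can also carry the `indexed` or `attribute` options is a separate question that this sketch does not address.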

@alexander-schranz
Member Author

@sanikolaev thanks for the response. What about indexed and attribute on these fields? It currently ends in the following error:

Warning
Manticoresearch\Exceptions\ResponseException: "sphinxql: syntax error, unexpected INDEXED, expecting ')' or ',' near 'indexed,blocks_text_description json indexed,blocks_text_media multi,blocks_embed_title json indexed,blocks_embed_media json,footer_title text indexed,created timestamp,commentsCount integer,rating float,comments_email json,comments_text json indexed,tags json indexed attribute,categoryIds multi,_source text)'"

@alexander-schranz
Member Author

I tried to skip the attribute and indexed part for the json fields but still run into another error. These are the Manticore field definitions; I had to use _ instead of . as the field separator for nested objects:

{
    "title": {
        "type": "text",
        "options": [
            "indexed"
        ]
    },
    "header_image_media": {
        "type": "integer",
        "options": []
    },
    "header_video_media": {
        "type": "string",
        "options": []
    },
    "article": {
        "type": "text",
        "options": [
            "indexed"
        ]
    },
    "blocks_text_title": {
        "type": "json",
        "options": []
    },
    "blocks_text_description": {
        "type": "json",
        "options": []
    },
    "blocks_text_media": {
        "type": "multi",
        "options": []
    },
    "blocks_embed_title": {
        "type": "json",
        "options": []
    },
    "blocks_embed_media": {
        "type": "json",
        "options": []
    },
    "footer_title": {
        "type": "text",
        "options": [
            "indexed"
        ]
    },
    "created": {
        "type": "timestamp",
        "options": []
    },
    "commentsCount": {
        "type": "integer",
        "options": []
    },
    "rating": {
        "type": "float",
        "options": []
    },
    "comments_email": {
        "type": "json",
        "options": []
    },
    "comments_text": {
        "type": "json",
        "options": []
    },
    "tags": {
        "type": "json",
        "options": []
    },
    "categoryIds": {
        "type": "multi",
        "options": []
    },
    "_source": {
        "type": "string",
        "options": []
    }
}

This is the document:

{
    "title": "New Blog",
    "header_image_media": 1,
    "article": "<article><h2>New Subtitle<\/h2><p>A html field with some content<\/p><\/article>",
    "blocks_text_title": "[\"Titel\",\"Titel 2\",\"Titel 4\"]",
    "blocks_text_description": "[\"<p>Description<\\\/p>\",\"<p>Description 4<\\\/p>\"]",
    "blocks_text_media": [
        3,
        4,
        3,
        4
    ],
    "blocks_embed_title": "[\"Video\"]",
    "blocks_embed_media": "[\"https:\\\/\\\/www.youtube.com\\\/watch?v=iYM2zFP3Zn0\"]",
    "footer_title": "New Footer",
    "created": "2022-01-24T12:00:00+01:00",
    "commentsCount": 2,
    "rating": 3.5,
    "comments_email": "[\"admin.nonesearchablefield@localhost\",\"example.nonesearchablefield@localhost\"]",
    "comments_text": "[\"Awesome blog!\",\"Like this blog!\"]",
    "tags": "[\"Tech\",\"UI\"]",
    "categoryIds": [
        1,
        2
    ],
    "_source": "{\"unrelated\":\"Unrelated\"}"
}

It is indexed via the PHP client this way:

$searchIndex = $this->client->index('test_complex');
$searchIndex->addDocument($aboveDocument, '23b30f01-d8fd-4dca-b36a-4710e360a965');

But when I try to load that document via:

$searchIndex = $this->client->index('test_complex');
$searchIndex->getDocumentById('23b30f01-d8fd-4dca-b36a-4710e360a965');

It errors with:

Manticoresearch\Exceptions\ResponseException: "index test_complex: unsupported filter type 'string' on attribute 'id'"

Not sure why this is happening.

@sanikolaev

what about indexed and attribute on this fields

indexed makes sense only for textual fields. It makes the field full-text indexed. It may be a little bit confusing since "string" and "text" without additional properties mean different things, but when you add one of the properties ("indexed", "stored", "attribute") they become aliases. We tried to describe all of that in the docs. Let me know if something is not clear, I'll be glad to help and update the docs afterwards.

had to use _ instead of . for field separator for nested objects

I see. This is right. Manticore doesn't natively support nested objects and the period sign is used for json, e.g.: where json_attr.a.b = 123, that's why it's not allowed in column names.
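A key-flattening step along those lines could be sketched like this. `flattenFieldNames` is a hypothetical helper, not code from the PR, and it only handles nested associative arrays; collecting values across a list of block objects (as in blocks_text_title) would need additional logic.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: flattens nested associative arrays into a single
 * level, joining the key path with "_" because "." is reserved for JSON
 * attribute access in Manticore column names.
 */
function flattenFieldNames(array $document, string $prefix = ''): array
{
    $flat = [];

    foreach ($document as $key => $value) {
        $name = $prefix === '' ? (string) $key : $prefix . '_' . $key;

        // Treat only non-empty, non-list arrays as nested objects;
        // lists such as [1, 2] are kept as multi-value field content.
        $isNestedObject = is_array($value)
            && $value !== []
            && array_keys($value) !== range(0, count($value) - 1);

        if ($isNestedObject) {
            $flat += flattenFieldNames($value, $name);
        } else {
            $flat[$name] = $value;
        }
    }

    return $flat;
}

print_r(flattenFieldNames([
    'title' => 'New Blog',
    'header' => ['image' => ['media' => 1]],
    'categoryIds' => [1, 2],
]));
```

One caveat of this naming scheme is that a field literally called header_image_media would collide with the flattened path, so the adapter would presumably need to reject or escape underscores in field names.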

unsupported filter type 'string' on attribute 'id'

Manticore doesn't support string IDs. The ID requirements can be found here https://manual.manticoresearch.com/Creating_a_table/Data_types#Document-ID.
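One possible workaround, offered purely as an assumption and not something confirmed in this thread, is to derive a numeric ID from the UUID and keep the UUID itself in a separate string field. `numericIdFromUuid` is a hypothetical helper; it assumes 64-bit PHP and accepts a small theoretical collision risk because the hash is truncated.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: derives a deterministic positive integer from a
 * string identifier. Only the first 15 hex digits (60 bits) of the hash
 * are used so the result always fits into a signed 64-bit integer.
 * Truncation means collisions are possible in principle.
 */
function numericIdFromUuid(string $uuid): int
{
    $hex = substr(hash('sha256', $uuid), 0, 15);

    // hexdec() returns an int as long as the value fits, which 60 bits do
    // on 64-bit PHP; the cast only documents the intent.
    return (int) hexdec($hex);
}

$uuid = '23b30f01-d8fd-4dca-b36a-4710e360a965';
$document = ['uuid' => $uuid, 'title' => 'New Blog'];

// The numeric value would be passed as the document ID, while the
// original UUID stays available in the stored document.
$numericId = numericIdFromUuid($uuid);
```

The alternative is to let Manticore assign IDs and look documents up by filtering on the stored uuid field, which avoids collisions at the cost of an extra query when updating or deleting.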

@alexander-schranz
Member Author

alexander-schranz commented May 18, 2023

indexed makes sense only for textual fields. It makes the field full-text indexed. It may be a little bit confusing since "string" and "text" without additional properties mean different things

From the document above, we have text that is searchable but is represented by an array of texts, as we flattened the whole blocks objects. As you suggested, I now use the type json for these array text fields (tags, blocks_text_description, blocks_text_title, ...). Is that maybe not the correct way for searchable text, if json cannot be indexed?

The tags field, sometimes called search keywords, is I think a very common example of a multi field that is both searchable and filterable. In Elasticsearch this is achieved this way:

[
    'type' => 'text',
    'index' => true,
    'fields' => [
        'raw' => ['type' => 'keyword'],
    ],
]

So a field tags is created for searchability and a field tags.raw for filterability. How is this done in Manticore Search?

@sanikolaev

The equivalent of Elasticsearch's

The tags field I think is a very common example which is multi field which is searchable and filterable. In elasticsearch this is achieved this way:

[
    'type' => 'text',
    'index' => true,
    'fields' => [
        'raw' => ['type' => 'keyword'],
    ],
]

in Manticore is type text indexed attribute or type string indexed attribute, e.g.:

create table t(type text indexed attribute, type2 string indexed attribute)
--------------

Query OK, 0 rows affected (0.00 sec)

--------------
desc t
--------------

+-------+--------+-------------------+
| Field | Type   | Properties        |
+-------+--------+-------------------+
| id    | bigint |                   |
| type  | string | indexed attribute |
| type2 | string | indexed attribute |
+-------+--------+-------------------+
3 rows in set (0.00 sec)

@alexander-schranz
Member Author

alexander-schranz commented May 18, 2023

I'm not sure if I understood you correctly. Tags or keywords are a list of data that can contain zero or more keywords, e.g.:

{
   "uuid": "23b30f01-d8fd-4dca-b36a-4710e360a965",
   "tags": ["UI", "UX"]
}

For searchability, I think we could use type string indexed by converting the document's multi text fields to something like this:

{
   "uuid": "23b30f01-d8fd-4dca-b36a-4710e360a965",
   "tags": "UI UX"
}

But how could we then still get filterability to work, to find the documents tagged with a given tag? I think that would still require a json field, but how can we filter on that json attribute to receive only documents having a specific tag?

{
   "uuid": "23b30f01-d8fd-4dca-b36a-4710e360a965",
   "tags": "UI UX",
   "tags_raw": ["UI", "UX"]
}

PS: we are using https://github.com/manticoresoftware/manticoresearch-php here, so we do not actually write any create table ... statements ourselves.

@sanikolaev

how can we make that json filter attribute to receive only documents having a specific tag?

For filtering in an array of strings separated with a space, there's a special mechanism in Manticore which you can use to avoid using JSON. Here's what it looks like in PHP:

<?php
require_once __DIR__ . '/vendor/autoload.php';
use Manticoresearch\Search;

$config = ['host'=>'127.0.0.1','port'=>9308];
$client = new \Manticoresearch\Client($config);
$index = $client->index('test');

$index->drop();

$index->create([
  'uuid'=>['type'=>'string'],
  'tags'=>['type'=>'text indexed attribute']
]);

$index->addDocument([
  'uuid' => '23b30f01-d8fd-4dca-b36a-4710e360a965',
  'tags' => 'UI UX'
]);

echo "--------------- \$index->search('UI')->get(): -------------------\n";
$docs = $index->search('UI')->get();
foreach($docs as $doc) print_r($doc);

echo "--------------- \$index->search('UI')->get(): -------------------\n";
$docs = $index->search('UI')->get();
foreach($docs as $doc) print_r($doc);

echo "--------------- \$index->search('')->filter('any(tags)', 'in', ['UI'])->get(): -------------------\n";
$docs = $index->search('')->filter('any(tags)', 'in', ['UI'])->get();
foreach($docs as $doc) print_r($doc);

which will give you:

➜  ~ php schranz.php
--------------- $index->search('UI')->get(): -------------------
Manticoresearch\ResultHit Object
(
    [data:protected] => Array
        (
            [_id] => 1515343812221005526
            [_score] => 1500
            [_source] => Array
                (
                    [tags] => UI UX
                    [uuid] => 23b30f01-d8fd-4dca-b36a-4710e360a965
                )

        )

)
--------------- $index->search('UI')->get(): -------------------
Manticoresearch\ResultHit Object
(
    [data:protected] => Array
        (
            [_id] => 1515343812221005526
            [_score] => 1500
            [_source] => Array
                (
                    [tags] => UI UX
                    [uuid] => 23b30f01-d8fd-4dca-b36a-4710e360a965
                )

        )

)
--------------- $index->search('')->filter('any(tags)', 'in', ['UI'])->get(): -------------------
Manticoresearch\ResultHit Object
(
    [data:protected] => Array
        (
            [_id] => 1515343812221005526
            [_score] => 1
            [_source] => Array
                (
                    [tags] => UI UX
                    [uuid] => 23b30f01-d8fd-4dca-b36a-4710e360a965
                )

        )

)

So in this specific case 'tags'=>['type'=>'text indexed attribute'] should suffice.
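On the adapter side, the list of tags from the original document would then need to be joined before indexing. A minimal sketch, assuming a hypothetical helper named `tagsToTextField` and assuming that a tag containing whitespace cannot be represented in this encoding (that limitation is an assumption, not something verified here):

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: turns ["UI", "UX"] into "UI UX" so the value can be
 * stored in a single `text indexed attribute` field.
 *
 * @param list<string> $tags
 */
function tagsToTextField(array $tags): string
{
    foreach ($tags as $tag) {
        // A tag with whitespace would be indistinguishable from two tags
        // after joining, so it is rejected up front in this sketch.
        if (preg_match('/\s/', $tag) === 1) {
            throw new InvalidArgumentException("Tag contains whitespace: '$tag'");
        }
    }

    return implode(' ', $tags);
}

$document = [
    'uuid' => '23b30f01-d8fd-4dca-b36a-4710e360a965',
    'tags' => tagsToTextField(['UI', 'UX']),
];
```

When reading documents back, the reverse step would be to split the stored value on spaces to restore the original array.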

From the docs:

[Two images from the docs; not preserved in this text]
