I'm dissatisfied with several features of the way memory tables are handled.

The queries that create the memory tables rely on some complicated joins that can take a while (I've seen hours) to execute. This means a system restart takes a bookworm offline for hours.
## Current situation
The memory tables speed things up, or did when I benchmarked things five years ago.
There are currently up to four tables for each variable. Let me take the `languages` field from `hathipd` as an example.
## On-disk tables
The first is the raw data imported from the user. This is keyed to some id variable (usually, but not necessarily, `bookid`). The field itself is text.
```
MariaDB [hathipd]> SELECT * FROM languagesDisk LIMIT 3;
+--------+-----------+
| bookid | languages |
+--------+-----------+
|      1 | English   |
|      2 | English   |
|      3 | English   |
+--------+-----------+
3 rows in set (0.00 sec)
```
A SQL query then generates a table of ids and counts for each language.
The ids are useful because they make grouping and filtering significantly faster.
The ids are in decreasing order of prevalence; this isn't strictly necessary, but it enables a very useful query constraint I use a lot that restricts to only the most common elements: `{"languages__id":{"lte":10}}`.
In order to get this property, the `__id` values are assigned by the SQL query that creates the table. It might be possible to create it in Python instead.
The `__count` field is not, AFAIK, used anywhere and could safely be deleted.
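The exact query isn't reproduced here, but the idea can be sketched like this (using Python's `sqlite3` and a window function purely for illustration; the real tables live in MariaDB, where an `AUTO_INCREMENT` column filled by an ordered `INSERT` would achieve the same ordering):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw on-disk table as imported from the user: bookid -> text field.
cur.execute("CREATE TABLE languagesDisk (bookid INTEGER, languages TEXT)")
cur.executemany("INSERT INTO languagesDisk VALUES (?, ?)",
                [(1, "English"), (2, "English"), (3, "English"),
                 (4, "German"), (5, "German"), (6, "French")])

# Derive the __id table with ids in decreasing prevalence, so that a
# constraint like {"languages__id": {"lte": 10}} keeps only the ten
# most common values.
cur.execute("""
    CREATE TABLE languages__id AS
    SELECT ROW_NUMBER() OVER (ORDER BY n DESC) AS languages__id,
           languages,
           n AS languages__count
    FROM (SELECT languages, COUNT(*) AS n
          FROM languagesDisk
          GROUP BY languages)
""")
```

With the toy data above, English (most common) gets id 1, German id 2, French id 3.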
```
SELECT * FROM languages__id LIMIT 3;
+---------------+-----------+------------------+
| languages__id | languages | languages__count |
+---------------+-----------+------------------+
|             1 | English   |          2793265 |
|             2 | German    |           578369 |
|             3 | French    |           541092 |
+---------------+-----------+------------------+
3 rows in set (0.00 sec)
```
## In-memory tables
Those tables persist through a restart, but are not used in queries. Instead, two other tables are used.
The first is the Lookup table, a hash map between ids and keys. If you search for `"languages":["English"]`, this is the field that will be checked during the search for each book. (I assume, though I'm not sure, that after the first hit MySQL will actually cache the proper `languages__id` and this table isn't actually consulted 10 million times.)
This table is just about the same as `languages__id`, except that:

- it lacks a `languages -> languages__id` index, because those sometimes confuse the query optimizer;
- it lives in memory;
- it lacks the useless count field.
```
SELECT * FROM languagesLookup LIMIT 3;
+---------------+-----------+
| languages__id | languages |
+---------------+-----------+
|             1 | English   |
|             2 | German    |
|             3 | French    |
+---------------+-----------+
3 rows in set (0.00 sec)
```
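A sketch of how the Lookup table could be derived from `languages__id` (SQLite stands in for MariaDB here; the real table would be declared `ENGINE=MEMORY`, which SQLite doesn't have):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Pretend this already exists on disk (the __id table described above).
cur.execute("""CREATE TABLE languages__id
               (languages__id INTEGER, languages TEXT, languages__count INTEGER)""")
cur.executemany("INSERT INTO languages__id VALUES (?, ?, ?)",
                [(1, "English", 2793265), (2, "German", 578369), (3, "French", 541092)])

# The Lookup table: the same id -> key mapping, minus the count field.
# Deliberately no index on `languages`, to keep the optimizer from
# choosing a wrong-way lookup.
cur.execute("CREATE TABLE languagesLookup (languages__id INTEGER PRIMARY KEY, languages TEXT)")
cur.execute("INSERT INTO languagesLookup SELECT languages__id, languages FROM languages__id")
```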
In order to hit that lookup table, there needs to be another memory table that maps from `bookid` (or other id) to the language id. This is called the heap table.
This memory table is created by merging two of the above tables.
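A sketch of that merge, and of how a `"languages":["English"]` search then resolves through it (again with SQLite as a stand-in; the real heap table would be `ENGINE=MEMORY`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The two tables being merged: the raw disk table and the id lookup.
cur.execute("CREATE TABLE languagesDisk (bookid INTEGER, languages TEXT)")
cur.executemany("INSERT INTO languagesDisk VALUES (?, ?)",
                [(1, "English"), (2, "English"), (3, "French")])
cur.execute("CREATE TABLE languagesLookup (languages__id INTEGER, languages TEXT)")
cur.executemany("INSERT INTO languagesLookup VALUES (?, ?)",
                [(1, "English"), (2, "German"), (3, "French")])

# The heap table: bookid -> languages__id, built by joining the two.
cur.execute("""
    CREATE TABLE languagesheap AS
    SELECT d.bookid, l.languages__id
    FROM languagesDisk d JOIN languagesLookup l ON d.languages = l.languages
""")

# A search like {"languages": ["English"]} then filters via the lookup:
hits = cur.execute("""
    SELECT h.bookid FROM languagesheap h
    JOIN languagesLookup l ON h.languages__id = l.languages__id
    WHERE l.languages = 'English'
""").fetchall()
```

With the toy data, `hits` contains books 1 and 2.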
## Solution

There's potentially a single solution for all three of the above problems:

1. If the Lookup table is empty, use the `__id` table instead (with index hints to avoid paralyzingly slow wrong-way lookups).
2. Create a new table on disk with the `bookid -> languages__id` map (like `languagesheap`). This will make creation of the in-memory `languagesheap` table faster.
3. On db queries, try to use the memory tables; if they don't exist (or have no rows, indicating a restart), fall back to the on-disk versions.
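That memory-then-disk fallback might look something like this (a hypothetical helper, `pick_table`, sketched against `sqlite3`; the table names here are illustrative, not the codebase's actual API):

```python
import sqlite3

def pick_table(cur, memory_name, disk_name):
    """Prefer the memory table; fall back to the on-disk version when the
    memory table is missing or empty (e.g. after a server restart)."""
    try:
        (n,) = cur.execute(f"SELECT COUNT(*) FROM {memory_name}").fetchone()
    except sqlite3.OperationalError:  # table doesn't exist
        return disk_name
    return memory_name if n > 0 else disk_name

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Only the on-disk stand-in exists at first, as after a restart.
cur.execute("CREATE TABLE diskheap (bookid INTEGER, languages__id INTEGER)")
cur.execute("INSERT INTO diskheap VALUES (1, 1)")
before = pick_table(cur, "memheap", "diskheap")

# Once the memory table has been (re)built, it wins again.
cur.execute("CREATE TABLE memheap AS SELECT * FROM diskheap")
after = pick_table(cur, "memheap", "diskheap")
```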
Turning on memory tables then becomes a specific performance optimization you can enable for particular bookworms, without requiring a huge memory overhead for every bookworm.
## Namespace issues
It would be nice to do a full rename of all these tables, but that might break some existing bookworms. So instead I think the path of least resistance is to add two new disk tables, called `languagesheap_` and `languages__id_` (i.e., with trailing underscores). If the memory tables are empty or nonexistent, these will be used instead; and the memory tables will be created from these tables, rather than from the other on-disk tables. (The old on-disk tables could be deleted, I guess.)
## Base tables
This process can apply to the base tables as well. The disk -> memory pattern also exists in the table pairs:

- `catalog` -> `fastcat`
- `words` -> `wordsheap`
The creation of these tables (especially `fastcat`) is going to be a big pain on the new hathi bookworm, so the existence of a `fastcat_` table will help there a lot.