Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

We must migrate the mysql databases from utf8mb3 to utf8mb4 #938

Open
regisb opened this issue Nov 14, 2023 · 18 comments
Open

We must migrate the mysql databases from utf8mb3 to utf8mb4 #938

regisb opened this issue Nov 14, 2023 · 18 comments
Assignees

Comments

@regisb
Copy link
Contributor

regisb commented Nov 14, 2023

Currently, Open edX runs with MySQL tables that are encoded with utf8mb3. It is time that we upgrade to utf8mb4 (the default).

For reference:
#887
https://discuss.openedx.org/t/palm-upgrade-running-rdbms-in-tutor-mysql-vs-mariadb-utf8-vs-utf8mb4/11030

edX.org is currently upgrading their environment to MySQL 8.0 (and utf8mb4). We should use this opportunity to learn from them any pitfalls they faced during the upgrade.

@regisb regisb self-assigned this Nov 24, 2023
@regisb
Copy link
Contributor Author

regisb commented Nov 24, 2023

This is blocked by the investigation currently being led by 2U. Migrating from utf8mb3 to utf8mb4 would require to extend the maximum length of certain columns.

Until this investigation is complete, we should probably put all new users on utf8mb4, such that they don't have to migrate later.

@ormsbee
Copy link
Contributor

ormsbee commented Dec 8, 2023

@regisb: I think we should be careful to distinguish between changing the connection settings and converting existing table data. We can change the connection settings without rebuilding tables or changing CharField lengths. The length of those fields might have been a problem a couple versions ago, but MySQL 8.0 is going to be making new tables with a row format of DYNAMIC by default anyway, and the indexing-length limits there are substantially higher. So changing the connection settings would leave a bunch of older tables as utf8mb3 (and possibly older row formats), but at least let the new stuff be made in utf8mb4 by default.

@ormsbee
Copy link
Contributor

ormsbee commented Dec 8, 2023

@regisb: I'm talking with folks at 2U about switching over the Studio connection initially, followed by LMS if that goes smoothly, hopefully in the next couple of weeks. Migrating existing tables from ut8mb3 data to utf8mb4 data would be a separate conversation. If the connection transition goes smoothly, would you be up for switching the connection options for Tutor and addressing the data migration as a separate step?

@regisb
Copy link
Contributor Author

regisb commented Dec 8, 2023

Yes, that would be great Dave.

@regisb
Copy link
Contributor Author

regisb commented Apr 8, 2024

As discussed today in our Tutor users group meetup, we should tackle this issue in time for Redwood: https://openedx.atlassian.net/wiki/spaces/COMM/pages/3583016961/Tutor+Users+Group#2024-04-08
cc @DawoudSheraz

@Danyal-Faheem
Copy link
Contributor

Danyal-Faheem commented Apr 9, 2024

Listing my findings here for future reference.

Running Tutor-17.0.3-nightly,I performed the following steps:

  1. I ran this script on the mysql databases
  2. Updated the connection string by creating the openedx-lms-common-settings and openedx-cms-common-settings patches with the setting:
DATABASES["default"]["OPTIONS"]["charset"] = "utf8mb4"
  1. Replaced the do sqlshell command charset to utf8mb4 (probably didn't need to for now)
  2. Created a new course with Emojis in the titles and content and it successfully worked. I was able to publish and view in the LMS as well.
    image
    image
    image
  3. Tested the existing democourse and that was working just fine as well. Made changes to the demo course by adding an emoji and the changes were reflected on the LMS.

@DawoudSheraz DawoudSheraz assigned Danyal-Faheem and unassigned regisb Apr 26, 2024
@Danyal-Faheem
Copy link
Contributor

Hey @ormsbee.

While working on a fix for this issue, I upgraded the MySQL character_set and collation in Tutor-nightly to utf8mb4 and utf8mb4_general_ci respectively which you can find in this commit for reference.

This is working perfectly fine for me, however when I looked into the DB, I noticed some columns were not upgraded to the utf8mb4 character set. The list of these columns and tables is:

image

Similarly, some columns seemed to force the utf8mb4_unicode_cicollation and some columns were forcing the utf8mb4_bin collation. The list of these columns and tables is:

image
image

The rest followed the utf8mb4_general_ci collation as specified.

If I used utf8mb4_unicode_ci instead as the default collation, then some columns seemed to force the utf8mb4_general_ci collation. The list of these columns and tables is:

image

The rest followed the utf8mb4_unicode_ci collation as specified.

What I wanted to confirm from you is if these are intentionally saved in this specific charset and collation by edx-platform for a reason or if they are just an issue on our end that we need to resolve.

One more thing I wanted to confirm is if we are using the utf8mb4_general_ci collation or the utf8mb4_unicode_ci collation or any other collation for the Redwood release or later on. The previous releases in Tutor were using the utf8mb3_general_ci collation.

CC: @DawoudSheraz

@ormsbee
Copy link
Contributor

ormsbee commented May 2, 2024

Thank you so much for taking on this important work!

While working on a fix for this issue, I upgraded the MySQL character_set and collation in Tutor-nightly to utf8mb4 and utf8mb4_general_ci respectively which you can find in this commit for reference.

We should prefer utf8mb4_unicode_ci over utf8mb4_general_ci. From the MySQL 8.0 docs:

utf8mb4_general_ci is a legacy collation that does not support expansions, contractions, or ignorable characters.

With respect to the explicit encodings set in the system...

This is working perfectly fine for me, however when I looked into the DB, I noticed some columns were not upgraded to the utf8mb4 character set. The list of these columns and tables is:

image

Tagging @bradenmacdonald, since he wrote the split_modulestore stuff here. It should be fine to update these to utf8mb4 / utf8mb4_bin, but we should also address whatever root issue is setting these to utf8mb3_bin. These are all subclasses of OpaqueKeyField, so I don't know if there's something in opaque-keys that does this or whether it's being manually set in migrations in edx-platform.

Similarly, some columns seemed to force the utf8mb4_unicode_cicollation and some columns were forcing the utf8mb4_bin collation. The list of these columns and tables is:

image image

These are intentionally using the different collations to indicate whether something should be treated like a machine identifier (case-sensitive) or like normal char fields (case-insensitive). They should keep these encodings and collations.

If I used utf8mb4_unicode_ci instead as the default collation, then some columns seemed to force the utf8mb4_general_ci collation. The list of these columns and tables is:

image

You can ignore these tables. They will be removed shortly.

@ormsbee
Copy link
Contributor

ormsbee commented May 2, 2024

Also, I'm not sure how we want to handle the migrations of really big tables for the community. Converting something like courseware_studentmodule may cause significant downtime for large sites.

@ormsbee
Copy link
Contributor

ormsbee commented May 2, 2024

As background, the explicitly created encodings/collations for content_staging and the oel_ apps use openedx_learning helper functions to define their fields:

Those use the MultiCollationMixin defined here:

https://github.com/openedx/openedx-learning/blob/main/openedx_learning/lib/collations.py

This also helps us normalize case sensitivity between sqlite and MySQL so tests behave more like production.

@ormsbee
Copy link
Contributor

ormsbee commented May 2, 2024

Also, we need to summarize this work in the operator release notes for Redwood before the release goes out.

@DawoudSheraz
Copy link
Contributor

@ormsbee Hi, thanks for the detailed input on this.

Also, I'm not sure how we want to handle the migrations of really big tables for the community. Converting something like courseware_studentmodule may cause significant downtime for large sites.

I was thinking of using factory-boy to generate dummy/test data for CSM, around 10-20M, and see how it plays out. Or should we increase the record count? I understand CSM is The heaviest table when it comes to data.

@ormsbee
Copy link
Contributor

ormsbee commented May 3, 2024

Oh, actually I think I may have a hypothesis on the utf8mb3 thing for course/context keys. The CourseOverviews table has its id field set to be a CourseKeyField, meaning that everything that has a foreign key to it need to have the same field type. So if new CourseKeyFields are created with a type of utf8mb4 and it's trying to make a foreign key, that won't work.

@ormsbee
Copy link
Contributor

ormsbee commented May 3, 2024

I was thinking of using factory-boy to generate dummy/test data for CSM, around 10-20M, and see how it plays out. Or should we increase the record count? I understand CSM is The heaviest table when it comes to data.

I think the important thing is that we're going to have to make it opt-in, i.e. we don't require that table be converted, and we don't trigger it automatically. Because on some sites, running that conversion could literally take days (or weeks if they're trying to do it while serving some traffic).

@ormsbee
Copy link
Contributor

ormsbee commented May 9, 2024

@DawoudSheraz, @Danyal-Faheem: Is it still the intention to get this into Redwood?

@DawoudSheraz
Copy link
Contributor

Contributor

Hey Dave. So far, it is the plan. We will update the issue if we run into challenges.

@Danyal-Faheem
Copy link
Contributor

Hi @ormsbee, after discussing with @DawoudSheraz, we've decided that we're going to add the utf8mb4 upgrade only for new tutor installations, i.e. tutor 18.0.0 and onwards. There's already a PR up for that in #1065.

A method to upgrade the individual tables one by one for existing tutor installations will be provided in a later release but hopefully pretty soon.

@Danyal-Faheem
Copy link
Contributor

Hi @ormsbee.
Just wanted to provide a status update on this issue. The draft PR #1065 already contains this work for now.

The default charset and collation in Tutor 18.0.0 aka redwood and onwards will utf8mb4 and utf8mb4_unicode_ci respectively.

We've also developed a tutor do command to manually upgrade the charset and collation of the tables in MySQL. For now, this command will not be automatically executed upon upgrading to redwood but will be left up to the user to do so. The command also provides options to upgrade:

  1. All the tables at once
  2. Upgrade a select few tables
  3. Upgrade all the tables barring select few tables

We're still in the process of testing how this command works out for earlier installations and how long it takes on tables with a lot of data. From my initial testing, I dumped about 1.2 Million rows into the courseware_studentmodule table using StudentModuleFactory.

The upgrade process had the following benchmarks:

  • Tables upgraded: courseware_studentmodule and courseware_studentmodulehistory
  • No of rows in each of the tables: ~1.2M each
  • Time it took to upgrade both the tables: ~35 secs

For now, we've only tried to upgrade the tables the openedx database but from our initial testing, services that import data from openedx, e.g, discovery importing course titles with emojis in them would break the service. The databases for these services will also need to be upgraded alongside the openedx database.

Feel free to share your thoughts on this matter and if you have something that we should add/remove.

CC: @regisb @DawoudSheraz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In review
Development

No branches or pull requests

4 participants