
Improve general performance around short URLs for instances with a large number #2112

Open

marijnvandevoorde opened this issue Apr 25, 2024 · 11 comments · May be fixed by #2113

@marijnvandevoorde
Contributor

Shlink version

4.1.0

PHP version

8.x

How do you serve Shlink

Docker image

Database engine

MySQL

Database version

8.x

Current behavior

The recently introduced DeleteExpiredShortUrlsCommand to clean up expired links is great, as we foresee our database growing into the millions of records over time. However, since there is no index on the expiry date column, this command will become increasingly heavy, as it has to go over every single record.

Expected behavior

An index on the expiry date column will drastically increase the speed of the DeleteExpiredShortUrlsCommand.
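
For illustration, a minimal sketch of such an index, assuming the expiry date lives in a valid_until column on the short_urls table (names inferred from the discussion below, not verified against Shlink's migrations):

```sql
-- Assumed table/column names; adjust to the actual schema.
CREATE INDEX idx_short_urls_valid_until ON short_urls (valid_until);
```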

Minimum steps to reproduce

1. Fill the database with a few thousand (or a few million) links.
2. Run the DeleteExpiredShortUrlsCommand.
3. Add the index and compare the performance of the DeleteExpiredShortUrlsCommand (one way to compare query plans is sketched below).
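
One way to verify step 3 is to compare query plans before and after creating the index; EXPLAIN works for DELETE statements on both MySQL (>= 5.6) and Postgres. The WHERE clause below is only an approximation of what the command runs:

```sql
-- Without the index this should show a full table scan;
-- with it, an index range scan on valid_until.
EXPLAIN
DELETE FROM short_urls
WHERE valid_until IS NOT NULL
  AND valid_until < NOW();
```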

@acelaya
Member

acelaya commented Apr 25, 2024

Thanks!

> An index on the expiry date column will drastically increase the speed

Yep, I think you are right there. I'll do some testing.

@marijnvandevoorde
Contributor Author

I took the liberty of adding it; I hope I didn't overlook any guidelines you have while doing so. Either way, it's not the biggest change, and the PR can maybe provide you with some ready-to-copy-paste material ;-)

@acelaya
Member

acelaya commented Apr 26, 2024

Yep, I saw it, thanks a lot 🙂. I was definitely planning to use that migration to test this.

@acelaya
Member

acelaya commented Apr 26, 2024

I have tried inserting 1M short URLs, all with a valid_until date in the past. When trying to delete them, it takes around 20 seconds (on my laptop) regardless of the presence of the index, so it doesn't seem to have much effect.

I have only tried with Postgres so far; I will at least try with MySQL as well.
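
For anyone reproducing this, a sketch of how such test data could be generated on Postgres. Column names are assumptions for illustration, not taken from Shlink's actual schema:

```sql
-- Insert 1M already-expired short URLs (Postgres; assumed column names).
INSERT INTO short_urls (short_code, original_url, date_created, valid_until)
SELECT
    'test_' || g,                       -- unique short code per row
    'https://example.com/page/' || g,   -- arbitrary long URL
    NOW() - INTERVAL '30 days',
    NOW() - INTERVAL '1 day'            -- valid_until in the past
FROM generate_series(1, 1000000) AS g;
```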

@marijnvandevoorde
Contributor Author

I was super surprised by your message at first, but I think I have an explanation: since all links are expired, the command has to delete all of them, so it has to go over all links regardless of the presence of an index.

The real test would be to have 1M links but only 1k expired, and then run it. It should be almost instant with an index, but will probably take 20 seconds without one.
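
Building on the test-data sketch above, that scenario could be approximated like this (Postgres, same assumed column names):

```sql
-- Leave only ~1k rows expired and push the rest into the future.
-- Assumes the bulk insert above produced sequential ids.
UPDATE short_urls
SET valid_until = NOW() + INTERVAL '1 year'
WHERE id > 1000;
```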

@acelaya
Member

acelaya commented Apr 26, 2024

Hmm, yeah, that actually makes sense. I'll try that and come back with the results.

I also want to experiment a bit with short URLs which are "expired" due to having reached the max amount of visits. In that case, the query requires a subquery, so it'll need a different optimization approach.
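
Conceptually, that second case boils down to something like the following. This is only a sketch with assumed table/column names, not the query Shlink actually generates through Doctrine:

```sql
-- Delete short URLs that have reached their max visit count.
-- "visits" and "short_url_id" are assumed names for illustration.
DELETE FROM short_urls
WHERE max_visits IS NOT NULL
  AND max_visits <= (
      SELECT COUNT(*)
      FROM visits
      WHERE visits.short_url_id = short_urls.id
  );
```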

@acelaya
Member

acelaya commented Apr 27, 2024

> I was super surprised by your message at first, but I think I have an explanation: since all links are expired, the command has to delete all of them, so it has to go over all links regardless of the presence of an index.
>
> The real test would be to have 1M links but only 1k expired, and then run it. It should be almost instant with an index, but will probably take 20 seconds without one.

Well, the execution times are again very similar 😅

To be honest, with just 1k expired URLs and the rest having valid_since in the future, the query takes ~30ms with the index and ~150ms without it, so it does have some effect, but it is not terribly bad without it.

And the more expired short URLs there are, the less of a difference the index makes. I then tried with 100k expired short URLs out of the 1M total, and it took ~2s in both cases.

Again, I have only tested with Postgres, and it could be that this difference is bigger with MySQL, especially with >= 8.0 versions.

I feel tempted to merge the PR anyway, with a couple of changes, as it definitely doesn't make things worse, but I feel like I should test other database engines first.

In the meantime, feel free to do some tests yourself, as I might be making some mistakes.
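
A variant that might be worth including in those tests (a suggestion, not something the PR does): if most short URLs never get an expiry date, Postgres supports a partial index covering only the rows that can ever expire, which keeps the index small. MySQL has no equivalent feature:

```sql
-- Postgres-only sketch: index just the rows with an expiry date set.
-- The planner can still use it for "valid_until < NOW()" predicates.
CREATE INDEX idx_short_urls_valid_until_set
    ON short_urls (valid_until)
    WHERE valid_until IS NOT NULL;
```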

@marijnvandevoorde
Contributor Author

We were planning to do some performance testing, so we can do that as well. To be honest, the PR also comes from "this should have an impact" rather than from experiencing issues.

Unrelated, we also noticed there's no index on the long URLs. Shouldn't that be there for people who enable the "findIfExists" feature? I can imagine that becoming slow when the number of URLs starts to rise. Again, this is just what we would expect; we haven't done any testing. We just anticipate serving several million links over time and we're checking if there are any reasons NOT to use Shlink :-)
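
If such an index were ever added, one caveat: long URLs can exceed MySQL's maximum index key length, so a prefix index or an indexed hash column are the usual workarounds for exact-match lookups like the one "findIfExists" needs. A purely illustrative sketch, with original_url as an assumed column name:

```sql
-- MySQL sketches; "original_url" is an assumed column name.

-- Option 1: prefix index (191 is an arbitrary prefix length).
CREATE INDEX idx_short_urls_original_url
    ON short_urls (original_url(191));

-- Option 2 (MySQL >= 5.7): index a generated hash column for exact matches.
ALTER TABLE short_urls
    ADD COLUMN original_url_hash CHAR(64)
        GENERATED ALWAYS AS (SHA2(original_url, 256)) STORED,
    ADD INDEX idx_short_urls_original_url_hash (original_url_hash);
```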

@acelaya
Member

acelaya commented Apr 29, 2024

> We were planning to do some performance testing, so we can do that as well. To be honest, the PR also comes from "this should have an impact" rather than from experiencing issues.

That's fine. These things are sometimes "easy" to spot once you have experienced them a couple of times. Your original report did not strike me as something crazy or impossible to happen.

> Unrelated, we also noticed there's no index on the long URLs. Shouldn't that be there for people who enable the "findIfExists" feature?

Could be helpful as well, although I suspect that feature is not extensively used.

However, I did notice a bit of slowness just by listing short URLs while testing this, so that's something I want to dig into and improve, and adding more indexes might be needed.

In general, I have focused on improving performance around visits more than around short URLs, as for most users the list of visits will be exponentially larger than the list of short URLs.

> We just anticipate serving several million links over time and we're checking if there are any reasons NOT to use Shlink :-)

I cannot guarantee there are absolutely no bottlenecks, as there are always code branches which are less frequently hit and could have been overlooked.

What I can guarantee is that I will work on improving any rough edges as soon as they are found.

@marijnvandevoorde
Contributor Author

That last part is what we were looking for. Also happy to contribute. I think we'll have a slightly different use case (high-volume links, each only visited a few times), so it might be interesting for Shlink to see how it holds up under that use case ;-).

Either way, thanks for the responsiveness, much appreciated.

@acelaya acelaya added this to the 4.2.0 milestone May 13, 2024
@acelaya acelaya changed the title Pruning expired links will become increasingly slow as the table grows Improve general performance around short URLs for instances with a large number May 13, 2024
@acelaya
Member

acelaya commented May 13, 2024

I'm going to re-purpose this issue with a more general context around short URLs. I'll list the different improvements as they have been applied.
