BUG: Enable prometheus metrics #3675

Open
lyz-code opened this issue Apr 4, 2024 · 8 comments
Labels
bug Things that should work, but don’t
Comments

@lyz-code

lyz-code commented Apr 4, 2024

Describe the bug
I've seen that Prometheus metrics have been available for a while, but I haven't been able to get them to work.

To Reproduce
Steps to reproduce the behavior:

  1. Set PROMETHEUS_ENABLED=true in your aleph.env file and restart Aleph
  2. Open a shell in the API container: docker exec -it aleph_api_1 bash
  3. Run curl http://localhost:9100
  4. See the error: curl: (7) Failed to connect to localhost port 9100: Connection refused
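
The connection check in steps 2 to 4 can also be scripted. The following is a minimal sketch, not part of Aleph: port_is_open is a hypothetical helper, and the host localhost and port 9100 are taken from the steps above. It only tests whether anything accepts TCP connections on the port, which is the condition that curl error (7) reports as failing.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Port 9100 is the metrics port mentioned in this issue.
    print("metrics port reachable:", port_is_open("localhost", 9100))
```

A result of False here corresponds to the "Connection refused" symptom; it does not say why nothing is listening.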

Expected behavior
Prometheus metrics are fetched

Aleph version
3.15.5

Additional context
I'm not able to unset the command directive in docker-compose.yml. Maybe that's preventing the Prometheus metrics server from being loaded.

@lyz-code added the bug and triage labels on Apr 4, 2024
@tillprochaska
Contributor

Hi, sorry for the confusion. This is indeed related to an incorrect default command specified in the Dockerfile. I’ve fixed this in 0a154ff, but the fix hasn’t been released yet.

In the meantime, can you try setting the following command in docker-compose.yml and let me know if that works for you?

gunicorn --config /aleph/gunicorn.conf.py --workers 6 --log-level debug --log-file -

This should be roughly equivalent to the command that was previously specified in docker-compose.yml, except that a separate Gunicorn configuration file is loaded. This file makes sure that Gunicorn binds to port 9100 when Prometheus metrics are enabled.
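
To illustrate what such a configuration file might do, here is a minimal sketch. It is not the actual /aleph/gunicorn.conf.py: the build_bind helper, port 8000 for the API, and reading PROMETHEUS_ENABLED directly from the environment are all assumptions made for illustration. What it relies on is that bind is a standard Gunicorn setting that accepts a list of addresses.

```python
import os


def build_bind(env) -> list:
    """Return the addresses Gunicorn should listen on."""
    # Assumption: the API itself listens on port 8000.
    addresses = ["0.0.0.0:8000"]
    # Assumption: the flag is read straight from the environment.
    if env.get("PROMETHEUS_ENABLED", "false").lower() == "true":
        # Port 9100 is the metrics port mentioned in this thread.
        addresses.append("0.0.0.0:9100")
    return addresses


# Gunicorn reads module-level names such as `bind` from the config file.
bind = build_bind(os.environ)
```

With a file like this, the extra listener only exists when the flag is set, which is why the --config flag matters more than the other command-line options.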

(If you have adjusted some of the Gunicorn configuration values, such as the number of workers or the log level, that’s fine. The only thing that matters is that you specify the --config flag.)

Sorry again for the inconvenience! (We also have documentation about this feature in the works.)

@tillprochaska
Contributor

@lyz-code Here’s a link to the documentation. Please take it with a grain of salt, as it is still a work in progress: https://github.com/alephdata/aleph/blob/docs/tech-docs/docs/src/pages/developers/how-to/operations/prometheus/index.mdx

If you run into other problems, please let me know. I’m happy to help and will make sure to update the documentation accordingly.

@lyz-code
Author

lyz-code commented Apr 4, 2024

Hi @tillprochaska, first of all, thank you so much for the Prometheus work; it looks very promising. I haven't seen many applications with such detailed app metrics, so congratulations.

I've followed your instructions and now I'm seeing the following error on the API:

Error: '/aleph/gunicorn.conf.py' doesn't exist

I didn't set any Gunicorn configuration myself.

@tillprochaska
Contributor

@lyz-code I think you’re onto something. It seems there was a mistake in our release process. I’ll let you know when I know more.

@stchris removed the triage label on Apr 9, 2024
@tillprochaska
Contributor

tillprochaska commented Apr 10, 2024

Hi @lyz-code, sorry, just a quick update: This is indeed an issue with the 3.15.5 release. While we did include the Prometheus feature in the release candidates for 3.15.5, we made a mistake when releasing 3.15.5 and so it’s not actually included in that release. We’ll try to do a proper, new release soon.

@lyz-code
Author

lyz-code commented Apr 25, 2024

Hi @tillprochaska I've seen that 3.15.6 didn't fix the bug. I know you didn't say it has but I wanted to try :P. FYI, I'm seeing another error when spawning the exporter on the latest version.

exporter_1       | [2024-04-25 09:16:33 +0000] [8] [ERROR] Exception in worker process
exporter_1       | Traceback (most recent call last):
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/arbiter.py", line 609, in spawn_worker
exporter_1       |     worker.init_process()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/workers/base.py", line 134, in init_process
exporter_1       |     self.load_wsgi()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/workers/base.py", line 146, in load_wsgi
exporter_1       |     self.wsgi = self.app.wsgi()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/app/base.py", line 67, in wsgi
exporter_1       |     self.callable = self.load()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/app/wsgiapp.py", line 58, in load
exporter_1       |     return self.load_wsgiapp()
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/app/wsgiapp.py", line 48, in load_wsgiapp
exporter_1       |     return util.import_app(self.app_uri)
exporter_1       |   File "/usr/local/lib/python3.8/dist-packages/gunicorn/util.py", line 371, in import_app
exporter_1       |     mod = importlib.import_module(module)
exporter_1       |   File "/usr/lib/python3.8/importlib/__init__.py", line 127, in import_module
exporter_1       |     return _bootstrap._gcd_import(name[level:], package, level)
exporter_1       |   File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
exporter_1       |   File "<frozen importlib._bootstrap>", line 991, in _find_and_load
exporter_1       |   File "<frozen importlib._bootstrap>", line 961, in _find_and_load_unlocked
exporter_1       |   File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
exporter_1       |   File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
exporter_1       |   File "<frozen importlib._bootstrap>", line 991, in _find_and_load
exporter_1       |   File "<frozen importlib._bootstrap>", line 973, in _find_and_load_unlocked
exporter_1       | ModuleNotFoundError: No module named 'aleph.metrics'
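
One way to check whether the running image actually ships the module named in the traceback is to look it up without importing the module itself (parent packages are still imported during the lookup). This is a minimal sketch, assuming it is run with the same Python interpreter inside the container; module_available is a hypothetical helper, not part of Aleph.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if `name` can be located by the import system."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `aleph`) is itself missing.
        return False


if __name__ == "__main__":
    print("aleph.metrics available:", module_available("aleph.metrics"))
```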

I've also noticed that the suggested port for the aleph_exporter in the docker-compose file is 9100. That port is usually taken by node_exporter, so it might be better to use a different one by default.

@tillprochaska
Contributor

"I've seen that 3.15.6 didn't fix the bug. I know you didn't say it has but I wanted to try"

Yes, you’re right! 3.15.6 is a security patch release, so we decided to not include anything else besides these patches. I’ll post an update here once the Prometheus feature is properly released.

@Rosencrantz added this to the 3.15.7 milestone on May 7, 2024
@tillprochaska
Contributor

Sorry for the slow response. I’ve published a release candidate for a new release that should fix this issue. A final release will hopefully follow soon.

Note that if you want to test this release candidate, you might need to adjust your docker-compose.yml file again to remove the command override for the api service. Also see the Compose config at the 3.17.0-rc1 tag.
