
#1340 better metric support #1343

Open
jelmd wants to merge 12 commits into master from #1340_better_metric_support
Conversation

@jelmd (Contributor) commented Dec 6, 2023

Fix #1340 - this PR provides live monitoring of the turnserver application via up-to-date metrics exposed over HTTP in the OpenMetrics/Prometheus exposition format, at relay-thread level by default or at relay-thread/session level if desired. For more details see docs/Metrics.md.

@jelmd (Contributor, Author) commented Dec 6, 2023

The build fails because CI tries to link against the wrong/buggy prometheus-client-c lib.
See 121b319

@jelmd jelmd force-pushed the #1340_better_metric_support branch from 99ff06b to 9969511 Compare December 7, 2023 15:28
@jelmd (Contributor, Author) commented Dec 7, 2023

These tests are an absolute nightmare! They seem to be pretty bogus. Please advise what is going wrong and what needs to be fixed.

@jelmd jelmd force-pushed the #1340_better_metric_support branch from 089ebdc to 6f737da Compare December 8, 2023 07:20
@jelmd jelmd force-pushed the #1340_better_metric_support branch from 26c7a72 to 9e49ddc Compare December 8, 2023 15:05
Jens Elkner added 2 commits December 8, 2023 16:35
Actually the s390x bloat never succeeded and a priori prevents
all PRs from being accepted. I have never heard that anybody is
running coturn on a non-{Linux,Solaris,BSD} platform.
Having tests for all possible OS/arches is one thing; actually
testing against all of them is a completely different thing - not
green at all and a big waste of resources. So IMHO the test
matrix should be revised to reflect what is really needed.
@jelmd jelmd force-pushed the #1340_better_metric_support branch from 9e49ddc to 6980c5b Compare December 8, 2023 15:36
@jelmd (Contributor, Author) commented Dec 8, 2023

Test bugs fixed (home-grown install wrt. the libprom libs and includes, and different clang-format versions depending on the distro in use).

@jelmd (Contributor, Author) commented Dec 8, 2023

fixes #1337, too.

@jelmd (Contributor, Author) commented Dec 16, 2023

Nobody there to review this PR?

@KangLin , @eakraly , @misi ?

It really helps to better understand coturn and get a feeling for it. E.g. the grafana dashboard I use:
[screenshot: turn-metrics Grafana dashboard]

@KangLin (Contributor) left a comment

Very good work!

Contributor:

The recommendation: define these options as a structure:

struct _PROMETHEUS {
  int prom;
  int prom_port;
  int prom_compact;
  int prom_realm;
  int prom_usernames;
  int prom_usid;
  int prom_rsid;
  vint prom_sid_retain;
  vint log_ip;
};

@@ -1,8 +1,8 @@
# Prometheus setup
# Metrics setup

Contributor:

Add a metrics introduction here:

- https://docs.victoriametrics.com/keyConcepts.html
- https://docs.victoriametrics.com/FAQ.html#what-is-high-cardinality
- https://docs.victoriametrics.com/FAQ.html#what-is-high-churn-rate

Contributor:

Add here:

Usage

E.g. add the Grafana dashboard configuration and usage documents.

Metrics setup

pms_t *sample_session_state;
pms_t *sample_stun_req;
pms_t *sample_stun_resp;
#endif
Contributor:

The recommendation: define these as a structure.

@eakraly (Collaborator) commented Dec 19, 2023

Thank you @jelmd for this monumental piece of work.

But I do not think we can accept this PR in its current form:

  1. It does not do what it claims it does - there is a bunch of changes all loosely related to the prometheus interface
  2. Yes, digitalocean/prometheus-client-c looks abandoned but I doubt it will serve anyone to replace it with another potentially undermaintained library. Not sure what other good candidates may be though.
  3. Realtime update of metrics: great idea but it has nothing to do with a different library. Should be separate.
  4. Removing support for platforms - why?
  5. clang-format: let's not reformat back and forth
  6. There is a lot more metrics collected which creates a problem, and introduces solutions to the problem ...
  7. Packages or sources stored on a university ftp server running openssl/1.0.2u (deprecated 4 years ago) will not be taken seriously by security conscious people. Which will introduce work in the future (to address those concerns) or people would just stop using coturn silently for lack of trust - speaking from experience here on both accounts.
  8. Additional labels like realm: good idea! But again, nothing to do with the library and should be separated out. Another thing is that this going toward using prometheus as debugging which it is not (IP address). But I can see how commercial users would benefit from that (while destroying performance obviously and introducing a way to break)
  9. Solutions to keep prometheus consumed memory in check: I need to dive more into this - it's a double edged sword and moves the problem upstream (and creates new problems that need solutions). Restarting counters is not a trivial thing.
  10. We should have new tests, especially for the new functionality

Though this list is long and negative - there are many changes we can adopt but as separate PRs. I'm trying to be practical - if something gets broken it is going to be very hard to track it and fix it (broken as in addition to the list above).

@eakraly eakraly self-requested a review December 19, 2023 06:37
@eakraly (Collaborator) left a comment

Sending back to author to split into separate PRs

@jelmd (Contributor, Author) commented Dec 20, 2023

Very good work!

* Why use https://github.com/jelmd/libprom instead of digitalocean/prometheus-client-c?

Basically the Readme gives you a coarse overview. Also, it exposes a string-buffer and a simple log interface, so that one does not need to re-invent the wheel just to add metrics and stuff to one's "project" (but can if needed).

So a way better user experience than the (machine-translated Go-to-C?) prometheus-client-c interface.

@jelmd (Contributor, Author) commented Dec 20, 2023

Thank you @jelmd for this monumental piece of work.

Thanx @eakraly for your time and having a look at it!

But I do not think we can accept this PR in its current form:

1. It does not do what it claims it does - there is a bunch of changes all losely related to prometheus interface

Not sure what you mean. Is it possible to pinpoint the sections which are a problem? Basically my focus is not Prometheus, but metrics. They finally get exposed in the OpenMetrics/Prometheus exposition format, yes - but I wouldn't insist on this (a lot of other formats are possible). The meat is that the metrics get stored somehow, and to not re-invent the wheel, libprom is good enough to store the samples and as a side effect can easily be used to expose the data. However, the OpenMetrics/Prometheus exposition format is so simple that one could generate a report simply as metric_name[{key="value"[,key="value"]...}] %g\n lines and would not need libprom at all (and libpromhttp is a really simple wrapper around libmicrohttpd, so not really needed either). It just helps the "do not want all the bloat" people, makes life easier.
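For illustration, a minimal sketch (not the PR's actual code) of how such a report could be generated from a plain key:value store; the metric names and the metric_t type are invented for this example, the output is the exposition format just described:

#include <stdio.h>

/* Hypothetical in-memory sample store - names and fields are illustrative only. */
typedef struct {
    const char *key;   /* e.g. "turn_sessions{tid=\"0\"}" */
    double value;
} metric_t;

/* Emit all samples in OpenMetrics/Prometheus exposition format: "name{labels} value". */
static void metrics_report(FILE *out, const metric_t *m, size_t n) {
    for (size_t i = 0; i < n; i++)
        fprintf(out, "%s %g\n", m[i].key, m[i].value);
}

int main(void) {
    metric_t store[] = {
        { "turn_sessions{tid=\"0\"}", 12 },
        { "turn_traffic_rcvb{tid=\"0\"}", 345678 },
    };
    metrics_report(stdout, store, sizeof store / sizeof store[0]);
    return 0;
}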

2. Yes, digitalocean/prometheus-client-c looks abandoned but I doubt it will serve anyone to replace it with another potentially undermaintained library. Not sure what other good candidates may be though.

Yes, if no-one uses it, vendors certainly are not going to include it in their distros (is there really a difference between un- and under-maintained?). I'm not sure what "potentially undermaintained" means in this context. Do you mean it needs to be bloated every month or year to give people the impression that some stuff really got changed and to urge them to update ASAP? Hmm, I maintain stuff for free, so I update on demand and not to earn money for questionable stuff. However, if there is a good sponsor, I could increment the version on demand each month ;-).

3. Realtime update of metrics: great idea but it has nothing to do with a different library. Should be separate.

What? The problem right now is that "metric" data only get updated when an allocation gets closed. So one can really forget it - the data are worth nothing at all, just paperweight, and do not provide what a provider really needs. Similar with the redis stuff - besides not providing lifetime data but only coarse-grained samples, it is probably more resource-demanding than the prom stuff (IIRC I've seen some issues regarding high CPU usage wrt. redis). So right now coturn really lacks a proper monitoring facility which allows operators to identify leakages and bottlenecks, monitor bandwidth and resource usage, and size their environments as needed.

4. Removing support for platforms - why?

You mean dropping the s390x stuff? Well, actually I just ditched it because the tests always failed on it: not because the software had an error, but because the test platform was so unreliable that the tests couldn't even be started. =8-(

A very, very bad test environment, which prevents PRs from succeeding just because the test platform cannot be kept in a stable state to start any test at all. This furthermore raises the question: for what purpose does the software get tested on every known platform on earth? IMHO not green at all, a huge waste of resources for nothing but "compiles on every platform on earth". Who cares? Who needs it? x86_64 is sufficient for our purposes; if it works on arm64 or sparc64 and friends, fine. Everything else: please give a hint!

5. clang-format let's not reformat back and forth

The PR rules want make lint to succeed. So there is not really an option, unless one approves PRs which do not pass ..., which probably happened in the past (or a different linter had been used and a new one got introduced without testing the current sources) =8-( . E.g. tests were made against Ubuntu 16.04 (BTW unsupported since April 2021) and were ok; however, the make lint tests were made on Ubuntu 20.04 (or 22.04) and failed - on old, unchanged code!!! So either the test under 20.04 failed, or if fixed for that one, it failed for 16.04. Hmmm, yes, the current test env seems to be more kindergarten chaos than a proper CI env.

6. There is a lot more metrics collected which creates a problem, and introduces solutions to the problem ...

Ehmm, nope. Can you elaborate, please? Right now, as said, the metrics are completely useless. They do not even deserve the term monitoring; it is just a junk yard of meaningless data. With this PR one gets for the first time the opportunity to really monitor the coturn application on the fly and to turn the knobs as needed. Depending on how much you want to play with your knobs, you may monitor coarse-grained at relay-thread level or more fine-grained at session/allocation level. So everyone's mileage may vary, but the default level should be sufficient for production envs and the overhead negligible (I would worry much more if redis is enabled). Of course, if one uses one's old C64, all bets are off ;-).

7. Packages or sources stored on a university ftp server running openssl/1.0.2u (deprecated 4 years ago) will not be taken seriously by security conscious people. Which will introduce work in the future (to address those concerns) or people would just stop using coturn silently for lack of trust - speaking from experience here on both accounts.

Thanks for nitpicking this one :). Actually, paranoid people would download the source via GitHub using HTTPS and compile it by themselves. Really paranoid people would download, review and probably file an issue if they discover something that is probably not ok (whatever that means). All others do not really care at all and would be happy even with HTTP (my experience so far) - they just want to get it to work. Anybody can host/provide it for their customers/clients. So "lack of trust" or "stop using coturn" is IMHO a really questionable assumption without any proof or relation to reality (usually distro vendors provide packages, not the users themselves, and in turn, if they do not trust their vendors ...), especially because it is open source! Last but not least: yes, people trust software much more just because they can download it via OpenSSL 1000+.x.y instead of 1.0.2? Yepp, that's for sure, this software must be safe ;-).

8. Additional labels like realm: good idea! But again, nothing to do with the library and should be separated out. Another thing is that this going toward using prometheus as debugging which it is not (IP address). But I can see how commercial users would benefit from that (while destroying performance obviously and introducing a way to break)

Ehhhm, realm is not my idea, but I can see a use for it (e.g. one may group metrics by realm). Therefore I just kept it, but only as a session_state attribute - it would otherwise blow up the metrics numbers if it really gets used with different values (threadID:sessionID is already unique, so other labels are just additional info - noise - one usually does not need). Wrt. performance: I doubt such general assumptions and prefer some numbers. Actually I think that if one disables redis and just uses the prom stuff, it would be much, much better performance-wise and less resource-demanding. Unfortunately I haven't had time to waste on redis - just had a look at the code :(.

9. Solutions to keep prometheus consumed memory in check: I need to dive more into this - it's a double edged sword and moves the problem upstream (and creates new problems that need solutions). Restarting counters is not a trivial thing.

I'm not sure whether I understand what you mean. For each metric you need to keep counters somewhere. If one does it file-based, no one needs to wonder about performance. So the obvious thing to do is to keep the counters in memory. Simple apps do it by maintaining global vars; others do it by using a lib like libprom to keep the counters (key:value pairs alias samples) in a hash map and provide an API which makes it thread-safe (thread-safety is not really an issue for coturn, because the related counters are always touched by the same thread anyway). So basically one could completely ditch libprom et al. if one bloats one's own code with counters (like int server->metricXY), updates them in a thread-safe manner and generates a report as needed. So I can't see the double-edged sword and would be happy to understand what you mean.
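A minimal sketch of the "just keep counters in memory" alternative described above - all names are hypothetical, and since each per-relay-thread counter set is only touched by its owning thread, no locking is used here:

#include <stdio.h>
#include <stdint.h>

#define MAX_RELAY_THREADS 16   /* hypothetical upper bound, for illustration only */

/* One counter set per relay thread; each set is updated by its own thread only. */
struct relay_metrics {
    uint64_t rcvd_bytes;
    uint64_t sent_bytes;
};

static struct relay_metrics relay_stats[MAX_RELAY_THREADS];

/* Called from relay thread `tid` whenever traffic has been relayed. */
void account_traffic(int tid, size_t rcvd, size_t sent) {
    relay_stats[tid].rcvd_bytes += rcvd;
    relay_stats[tid].sent_bytes += sent;
}

/* Dump the whole store in exposition format when /metrics is requested. */
void dump_metrics(FILE *out, int nthreads) {
    for (int t = 0; t < nthreads; t++) {
        fprintf(out, "relay_rcvd_bytes{tid=\"%d\"} %g\n", t, (double)relay_stats[t].rcvd_bytes);
        fprintf(out, "relay_sent_bytes{tid=\"%d\"} %g\n", t, (double)relay_stats[t].sent_bytes);
    }
}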

10. We should have new tests, especially for the new functionality

What tests? Right now it is by far better than before. Wondering whether there are any tests for the current useless metrics implementation (or even the redis things?).

Though this list is long and negative - there are many changes we can adopt but as separate PRs. I'm trying to be practical - if something gets broken it is going to be very hard to track it and fix it (broken as in addition to the list above).

Not really, and I'm not sure whether it makes any sense to break it into separate PRs - actually I think it would make the feature's implementation much harder to understand. But let's see, I'm open to any suggestions :)

@KangLin (Contributor) left a comment

@jelmd

  • prometheus-client-c is pre-distributed in most systems
  • It took me two days to look at the source code (prometheus-client-c and your libprom) and learn prometheus.
    • I think the interface of prometheus-client-c is well defined. Simple to use. Although its implementation is a bit buggy, that does not affect our use.
    • IMHO, the interface defined by libprom is a bit messy (machine-translated). Some of the internal interfaces are exposed. If used in a project, it can cause a performance penalty.

So I recommend going ahead with prometheus-client-c.
If you fix bugs, please submit a pull request to prometheus-client-c.

// clang-format off
// different clang-format versions have different opinions on this and prevent PRs to succeed
enum /* permitted values for its `has_arg' field... */
{ no_argument = 0, /* option never takes an argument */
Contributor:

Typo?

Contributor Author:

Ehmm, the PR failed because 'make lint' was not happy with it when it got linted on Ubuntu 20.04. If fixed for the linter there, 'make lint' on Ubuntu 22.04 failed. So a deadlock for PRs (wondering how all the other PRs got through ;-)).

@KangLin (Contributor) commented Dec 21, 2023

@jelmd About http server:

  • Use prometheus-client-c's built-in promhttp directly in the program. Currently it works in the admin service thread.
  • In practice, I think that in most cases, nginx, apache etc http servers will be used. So I think there needs to be a plugin as well. prometheus-client-c already provides a simple and easy-to-use interface 😄

@jelmd (Contributor, Author) commented Dec 21, 2023

@jelmd

* prometheus-client-c is pre-distributed in most systems

Really? Is any software actually using it? I didn't find any when I decided to fork (but obviously that was not the only reason to fork). Really strange that this gets bundled. But package maintainers do not necessarily understand the software they package or have a closer look at it; sometimes they just volunteer to package it because someone asked for it ... ;-)

* It took me two days to look at the source code(prometheus-client-c and your libprom) and learn prometheus.

Thanx a lot :)

  * I think the interface of prometheus-client-c is well defined. Simple to use.

cluttered, sometimes not really smart/inefficient, nightmare to build (just look at the bloat coturn CI needs to get something usable out of it), buggy wrt. histograms and even tests, ...

Although its implementation is a bit bug, but it does not affect our use.
* IMHO, the interface defined by libprom is a bit messy(machine translated).

Messy? Plz give a hint.

Some of the internal interfaces are exposed.

That's exactly the point. IMHO reinventing the wheel does not solve any problem - it is a problem. I wrote some metric exporters, enhanced apps with them, and always found it boring and unproductive to copy-and-paste all the boilerplate code like "loggers" and/or string buffers etc. Therefore I exposed them intentionally, so that one may focus on the real task (data collection and metric export) and does not need to waste a lot of time finding full-blown logger, string buffer, etc. libs, which are actually not needed because they are already there. So the lib should not be understood as a data collection/formatting thing, but as a utility that helps to get the more or less pretty simple job done. Finally, an interface is an interface, and the maintainer decides whether to expose it or to keep it hidden because it is unstable or buggy or just a bad hack, whatever ...

If used in a project, it can cause a performance penalty.

How? The purpose is exactly to have something which is safe and fast enough to do the job. If someone has to write a new string buffer implementation, I'm not sure whether it will be bug-free and as fast as the one in libprom ;-) Same thing wrt. a simple logger. Last but not least, as for any lib, one does not need to use all the features/functions it provides (e.g. libapr, libexpat, libssl, ...) - it just offers functions to make life easier, there is no "you have to use ...".

So I recommend going ahead with prometheus-client-c. You fix bug. please pull request to prometheus-client-c.

Sorry, but nope. There was a reason to fork it, and if you have a look at the diffs and at its Issues and PRs, it is pretty obvious that a) the project is dead, b) nobody would review a PR, and thus c) it would be a huge waste of time (which I do not have, BTW) to create a PR which sits in the review queue forever. Actually, these are some of the reasons why GitHub encourages people to fork ;-)

@jelmd (Contributor, Author) commented Dec 22, 2023

@jelmd About http server:

* Use prometheus-client-c's built-in promhttp directly in the program.

What do you mean? Please name it! Do you mean promhttp_start_daemon()? Otherwise I do not know what the "built-in promhttp" should be.

Currently it works in the admin service thread.

Everything needs to be started somewhere, no change here BTW. It gets started at the same place as before, which is ok - all other threads get started there, too. And for completeness: promhttp_start_daemon() is just a simple wrapper around MHD_start_daemon(), which in turn just creates some worker threads which listen for requests on the specified port and call the related handler if a request comes in - really, really simple stuff - just have a look at the source.

So no, it does not "work in the admin service thread" (wouldn't it block it, BTW, if it did ;-)); only the serving threads get started there.

see also: docs/Metrics.md.
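For reference, a hedged sketch of what starting such a listener looks like when using libmicrohttpd directly - roughly what promhttp_start_daemon() wraps. render_metrics() is a hypothetical helper returning the malloc'ed exposition text, and the exact flag/return types differ slightly between libmicrohttpd versions:

#include <stdlib.h>
#include <string.h>
#include <microhttpd.h>

extern char *render_metrics(void);   /* hypothetical: returns malloc'ed exposition text */

static enum MHD_Result handle_req(void *cls, struct MHD_Connection *conn,
        const char *url, const char *method, const char *version,
        const char *upload_data, size_t *upload_size, void **con_cls) {
    (void)cls; (void)version; (void)upload_data; (void)upload_size; (void)con_cls;
    if (strcmp(method, "GET") != 0 || strcmp(url, "/metrics") != 0) {
        /* answer anything but GET /metrics with Bad Request */
        static char empty[] = "";
        struct MHD_Response *err = MHD_create_response_from_buffer(0, empty, MHD_RESPMEM_PERSISTENT);
        enum MHD_Result r = MHD_queue_response(conn, MHD_HTTP_BAD_REQUEST, err);
        MHD_destroy_response(err);
        return r;
    }
    char *body = render_metrics();
    struct MHD_Response *resp = MHD_create_response_from_buffer(strlen(body), body, MHD_RESPMEM_MUST_FREE);
    enum MHD_Result ret = MHD_queue_response(conn, MHD_HTTP_OK, resp);
    MHD_destroy_response(resp);
    return ret;
}

/* Spawn the internal listener thread(s); older libmicrohttpd uses MHD_USE_SELECT_INTERNALLY instead. */
struct MHD_Daemon *start_metrics_http(unsigned short port) {
    return MHD_start_daemon(MHD_USE_INTERNAL_POLLING_THREAD, port,
                            NULL, NULL, &handle_req, NULL, MHD_OPTION_END);
}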

* In practice, I think that in most cases, nginx, apache etc http servers will be used.

OMG. Why would one do that? I get goose bumps ;-). But of course, if someone's life is not hard enough yet, nothing prevents them from doing such strange things. Can you perhaps explain what the advantage would be and what it actually has to do with the implementation?

So I think there needs to be a plugin as well.

LOL - no, metrics are an integral part of the software. Not only for performance reasons does the app need to collect the data itself: only the app [authors] know where, when, why and what to collect, and are able to do it at the right place. Using an external plugin for that does not make any sense at all. If you refer to the HTTP part - of course there needs to be an interface where the collected data are provided to the "public". This part gets done using the OpenMetrics/Prometheus exposition format and providing the "snapshots" via HTTP. There is no need at all to clutter the app with "plugins" which are able to format data in 1000 different formats or to expose it via SNMP or InfluxDB or whatever (and describe it via an XML schema :)) - just paperweight, because the OpenMetrics format is so simple that it can easily be transformed by the clients themselves into whatever is needed.

So IMHO, plugin === bad idea.

prometheus-client-c already provides a simple and easy-to-use interface 😄

Can you please be specific here? What do you mean by "simple and easy-to-use interface"? Somehow this sounds like the opposite of your plugin idea ;-)

@KangLin (Contributor) commented Dec 22, 2023

About HTTP servers. There are several solutions:

  1. The HTTP service is used as a thread for the turnserver. Currently in use.
  • Pros: Easy to develop
  • Cons:
    • Stability is determined by the developer's capabilities
    • A turnserver must contain the full functionality of an HTTP server (turnserver must pay for the resource overhead of the HTTP server, such as CPU, memory, program size, etc.)
  2. The HTTP service is used as a child process of the turnserver.
  • Pros:
    • High stability. Even if the HTTP service crashes, it doesn't affect turnserver
  • Cons:
    • A turnserver must contain the full functionality of an HTTP server (turnserver must pay for the resource overhead of the HTTP server, such as CPU, memory, program size, etc.)
    • Difficult to develop. The problem of inter-process communication needs to be solved.
  3. The HTTP server is used as a separate program (as a push gateway). Develop the appropriate plugins. For example: nginx, apache, etc. HTTP servers; push gateway
  • Pros:
    • High stability. Even if the HTTP service crashes, it doesn't affect turnserver
    • turnserver doesn't need to include an HTTP server. Reduces the resource overhead of the HTTP server, such as CPU, memory, and program size.
    • Easy to use. One HTTP server can serve multiple turnservers
  • Cons: Difficult to develop. Need to:
    • Inter-process communication.
    • Cross-machine communication.

We are currently talking about option 1.

From the perspective of Prometheus design and stability, it is recommended to use Option 2.

In O&M, option 3 is also needed.

Functional requirements for Prometheus:

  • Provides operations on metric types (counter, gauge, and histogram).
  • Provide conversion of metric types (counter, Gauge, Histogram) to data model (string format)
  • Provide a full HTTP service function.(used as a thread)

Well, prometheus-client-c provides just that.

Implementation of option 1 (a hedged sketch follows after this comment):

  • Start a sub-thread in the main thread of turnserver that calls promhttp_start_daemon() to provide the HTTP service. It already contains the metric-type conversion operations (hidden from the user).
  • Define and document the metrics to be monitored
  • Initialize the metrics
  • Call the metric manipulation functions where needed

@jelmd @eakraly
Let's discuss option 1 first, and if we still have the energy, we will do other options.
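For concreteness, a hedged sketch of option 1 using the upstream prometheus-client-c API as I understand it (treat the signatures as assumptions; the metric name, label and port handling are invented for illustration, not taken from the PR):

#include <microhttpd.h>
#include <prom.h>
#include <promhttp.h>

static prom_counter_t *relayed_packets;   /* hypothetical example metric */

/* Register the metric(s) and start the scrape endpoint in its own MHD thread(s). */
int start_metrics(unsigned short port) {
    prom_collector_registry_default_init();
    relayed_packets = prom_counter_new("turn_relayed_packets_total",
                                       "Relayed packets per relay thread",
                                       1, (const char *[]){"tid"});
    prom_collector_registry_must_register_metric(relayed_packets);
    promhttp_set_active_collector_registry(NULL);   /* NULL means: use the default registry */
    return promhttp_start_daemon(MHD_USE_SELECT_INTERNALLY, port, NULL, NULL) ? 0 : -1;
}

/* Called wherever the relay loop accounts a packet. */
void count_packet(const char *tid) {
    prom_counter_inc(relayed_packets, (const char *[]){tid});
}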

@jelmd (Contributor, Author) commented Dec 22, 2023

About HTTP servers. There are several solutions:

Ehmm, 1st thing to note: libmicrohttpd as used by libprom is an HTTP server (it even supports HTTPS). It is not named nginx, apache httpd, or whatever, but it is an HTTP server, pretty stable and used in several other projects (e.g. flutter). But if one doesn't trust it, one can try to provide one's own implementation, like coturn does for the web admin interface, or acme-redirect, etc. ...

1. The HTTP service is used as a thread for the turnserver. Currently in use.

* Pros: Easy to develop
* Cons:
  
  * Stability is determined by the developer's capabilities

... and by the other projects and the community using it. BTW: I have used it for several years and never had an issue with it.

  * A turnserver must contain the full functionality of an HTTP server (turnserver must pay for the resource overhead of the HTTP server, such as CPU, memory, program size, etc.)

Please define "full functionality of an HTTP server". The metrics interface speaks HTTP as far as needed, and that's it. Nobody claims that it needs to provide all the features apache httpd, nginx, etc. do - so if you mean that, the answer has to be: plain wrong, it does not need to ... The protocol to get the data is simply HTTP, and here the interface just needs to "print out" all metrics when a correct request for the path /metrics has been received. That's it. Everything else should be answered with Bad Request, unless the implementation decides to provide additional data via other URLs.

2. The HTTP service is used as a child process of turnserver.

You mean a forked fat process?

* Pros:
  
  * High stability. Even if the HTTP service crashes,

What happens if a thread dies ;-) How should coturn control the forked process?

it doesn't affect turnserver

How do the metrics get transferred to the forked process/its threads? I would be happy to see a "beam me up, Scotty" implementation, but as long as that isn't usable on earth, I doubt the "doesn't affect" claim wrt. any problems and especially wrt. performance.

* Cons:
  
  * A turnserver must contain the full functionality of an HTTP server,

Again, please define "functionality of an HTTP server". Is it possible to make a list of what you think libmicrohttpd is missing? As already said, in general a metrics provider should be able to answer HTTP requests and provide the metrics via the /metrics path - that's all.

(turnserver must pay for the resource overhead of the HTTP server, such as CPU, memory, program size, etc.)

Where is the difference? If you want to answer an HTTP request, you always need a request listener (no matter whether you hide it in a cascade of other servers spawned as threads or fat processes, or process requests directly). And you certainly need a request processor which answers the request. No matter where it sits, or what one names the cat, the resources are always needed. However, one can do it in an efficient way (right now, similar to what coturn does with the relay threads), or by introducing a lot of overhead and complexity (perhaps IPC, or file-based exchange ;-) ), which would simply be a nightmare. I prefer KISSes.

BTW: Especially in the coturn context I always hear "performance, performance, performance", but then actually read the opposite, incl. a lot of unfounded assumptions. Where does this come from?

  * Difficult to develop. The problem of inter-process communication needs to be solved.

Indeed. It does not make any sense in this context.

3. The HTTP server is used as a programe(as push gateway). Develop the appropriate plugins. For example: nginx, apache etc http servers; [push gateway](https://github.com/prometheus/pushgateway)

A dangerous misconception, a flawed design. Again, just because a service provides data via the HTTP protocol, it does not need to provide all the features all other known HTTP servers provide. The purpose is key. If someone wants to put a proxy in front of it, for whatever strange reason, they can do it. However, if one does that, the env certainly already has other severe problems regarding security, performance and stability - basically everything you claim it should actually solve ;-). At that stage it is better to do the homework first, and to consider providing services once that has been done.

* Pros: 
  * High stability. Even if the HTTP service crashes, it doesn't affect turnserver

If you make such an assumption, please provide some proof for it. I think libmicrohttpd users/developers would answer back: "The opposite is the case". Just the much higher complexity (the features they provide, the third-party SW they use) is an indicator that they are affected by many more bugs, and thus "high stability" in this context is a really questionable term. One may also check the Issues and PR pages of the related projects to get more indicators of how "stable" they actually are. Last but not least, google CVEs for the mentioned servers and check the results - apache httpd e.g. wrt. libapr/libexpat/libxslt/..., nginx for a lot of other things which are completely unrelated to the metrics interface, ... ;-)

Deduced from the apps/utilities I have had in use for several years and from what I've read so far, including the mentioned web pages, I can say: libmicrohttpd and its use is at least as stable as all the other thingies you mentioned, and probably much more reliable.
If your request processor is flawed/buggy (and dies), then it doesn't matter what HTTP implementation gets used - one should really not blame it for one's own faults.

  * turnserver doesn't need to include an HTTP server.

Doesn't it already? What about src/apps/relay/http_server.c ;-)

Reduces the resource overhead of the HTTP server, such as CPU, memory, and program size.

Nope, the opposite is the case.

  * Easy to use.

Huhh? Please define "easy to use".

One HTTP server can serve multiple turnservers

And why should the metrics provider in coturn implement such a feature? This is completely unrelated and a clear indicator that your idea about infrastructure setup/metric providers is severely flawed. See above.

* Cons: Difficult to develop. Need to:
  
  * Inter-process communication.
  * Cross-machine communication.

We are currently talking about option 1.

From the perspective of Prometheus design and stability, it is recommended to use Option 2.

Not at all. It is the opposite of what you claim it should be.

In O&M. Option 3. is also needed.

Never ever. This is not a coturn but an operator problem. E.g. consider the speedometer of your car being a metrics provider. It provides the data to its operator aka driver, so that they might be able to avoid dangerous situations and getting a ticket for being too fast. No normal human being would mount a 65+" OLED display on the roof of their car to show the current speed to the public. Total nonsense, hurts performance, and implies that the affected living things are able to recognize and decipher the shown stuff. Sound and experience are usually sufficient for Papa and Mama Tomato to know when they need to urge baby tomato to catch up ;-).

Functional requirements for Prometheus:

Ehmm, absolutely confusing. Are we talking about coturn or one of its potential consumers? Please do not mix up everything.

* Provides operations on metric types (counter, gauge, and histogram).
* Provide conversion of metric types (counter, Gauge, Histogram) to data model (string format)
* Provide a full HTTP service function.(used as a thread)

You're totally missing the point. Actually, one does not need any of it - neither coturn nor any metrics consumer! What a metrics provider simply needs is a key:value store and a way to answer HTTP requests for /metrics. If such a request comes in, it is as simple as print "HTTP/1.1 200 OK\r\n\r\n"; foreach (key:val in $store) { printf "$key $val\n" }. Not more (the key has to have the format $metric_name[{$attr_name="$attr_value"[,$attr_name="$attr_value"]...}] and the value needs to be a double - most implementations use printf's "%g"; I think I've already mentioned it, but you are free to study the OpenMetrics format specs if in doubt).
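To make the described key format concrete, a couple of hypothetical exposition lines (metric and label names invented for illustration):

turn_session_traffic_rcvb{tid="2",sid="17"} 48211
turn_session_traffic_sndb{tid="2",sid="17"} 50342
turn_allocations_total 3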

Well, prometheus-client-c provides just that.

As libprom does. But as said, that's not really the point.

* Provides operations on metric types (counter, gauge, and histogram).
  * counter: https://github.com/digitalocean/prometheus-client-c/blob/master/prom/include/prom_counter.h
  * gauge: https://github.com/digitalocean/prometheus-client-c/blob/master/prom/include/prom_gauge.h

Technically - under the hood - this doesn't matter. Both are just counters; the API merely makes sure that a counter can only be incremented. Again, a simple key:value store accomplishes the same.

  * Histogram: https://github.com/digitalocean/prometheus-client-c/blob/master/prom/include/prom_histogram.h

Key:value pairs as well; they just mimic one value sliced into several pieces. So a key of the form $key{idx="$bucketIdx"} would be an alternative to use.
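For example, the standard exposition of a histogram already is just such a set of sliced key:value pairs - hypothetical metric name, with "le" being the conventional bucket label:

turn_alloc_lifetime_bucket{le="60"} 4
turn_alloc_lifetime_bucket{le="600"} 9
turn_alloc_lifetime_bucket{le="+Inf"} 11
turn_alloc_lifetime_sum 3120
turn_alloc_lifetime_count 11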

* Provides [prom_collector_registry_bridge](https://github.com/digitalocean/prometheus-client-c/blob/c57034d196582d99267d027abb52a05a55dc07f6/prom/include/prom_collector_registry.h#L107)
  to implement the conversion function.

Not at all. That's an implementation detail of the libraries. It is a bit odd to say "we need library XY" and then enumerate all the functions it provides as requirements. I have pointed out several times what a metrics provider needs. If in doubt, read the [OpenMetrics](https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md) or Prometheus exposition format docs. BTW: for good reason they do not say how the provider has to implement it, just that the metrics need to be exposed via HTTP and how to format them. And please try to understand this first, before making any further claims and misleading/wrong assumptions.

* Has implemented an HTTP server in one thread as option 1. and it provide pipe too(It can be used as an option to achieve Option 2). [promhttp_start_daemon()](https://github.com/digitalocean/prometheus-client-c/blob/c57034d196582d99267d027abb52a05a55dc07f6/promhttp/src/promhttp.c#L62)

Plain wrong again.

  Starting a thread in the turnserver main() to call promhttp_start_daemon() is a simple HTTP service

Again, requiring the opposite of what you actually want does not help. Or do you ask for an additional thread to start the relay threads as well? That simply makes no sense at all and would, BTW, be a very bad server design (it should fail early ...).

Implementation of option 1

* Start a sub-thread in the main thread of turnserver to call promhttp_start_daemon() to simply complete the HTTP service. And it already contains Metric types conversion operations (hidden from the user).

* Define and document the metrics to be monitored

What about docs/Metrics.md - did you really read/review the whole PR?

* Initialize the metric

What do you think src/apps/relay/prom_server.c:init_metrics() does?

* Call the function of the manipulative indicator where needed

LOL - isn't this obvious? If you do not collect data, what metrics do you expect to get out of it ;-)

@jelmd @eakraly Let's discuss option 1 first, and if we still have the energy, we will do other options.

As said, the only option to discuss is IMHO whether to implement the metrics provider ourselves. libpromhttp is really simple and can be replaced without any effort by using libmicrohttpd directly; replacing libprom with a proper key:value store seems to require much more work and testing, because it needs to be dynamic wrt. tid:sid. If one really does it, I guess it would probably end up with a hash-map-like thing again - and thus reinvent the wheel [which might be quadratic].

So there are other options, but not really good ones, especially if you ride the performance and stability horse again and again.

@@ -350,9 +350,9 @@ void str_buffer_append_sz(struct str_buffer *sb, size_t sz) {
str_buffer_append(sb, ssz);
}

void str_buffer_append_sid(struct str_buffer *sb, turnsession_id sid) {
void str_buffer_append_sid(struct str_buffer *sb, turnsession_id sid, uint32_t rsid) {
char ssz[129];
Contributor:

is this enough space to hold the new data in all cases?

Contributor Author:

Yes. Actually 20 + 1 + 10 + 1 = 32 would be sufficient. Not sure why the original author chose such a big buffer (doesn't make any sense to me).
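A hedged sketch of the sizing argument (the exact format string is an assumption): an unsigned 64-bit session id needs at most 20 decimal digits, the separator 1, an unsigned 32-bit rsid at most 10, plus the terminating NUL:

#include <stdio.h>
#include <stdint.h>

typedef uint64_t turnsession_id;

/* 20 (u64 digits) + 1 ('.') + 10 (u32 digits) + 1 (NUL) = 32 bytes suffice. */
void format_sid(char out[32], turnsession_id sid, uint32_t rsid) {
    snprintf(out, 32, "%llu.%u", (unsigned long long)sid, rsid);
}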

static char Usage[] =
"Usage: turnserver [options]\n"
"Options:\n"
" -d, --listening-device <device-name> Listener interface device (NOT RECOMMENDED. Optional, Linux "
"only).\n"
" -p, --listening-port <port> TURN listener port (Default: 3478).\n"
" -p, --listening-port <port> TURN listener port. Default: " _STRVAL(DEFAULT_STUN_PORT) ".\n"
Contributor:

The changes to stamping the port number into the help text would be lovely to have as a separate PR

Same with disabling prometheus at compile time, though that's a much bigger mess.

Contributor Author:

Hmmm, maybe. But hardcoding the port stuff and then rewriting it again takes too much time w/o any benefit. Not sure what is meant by "disabling prometheus at compile time".

@@ -45,6 +45,7 @@ static pthread_barrier_t barrier;

////////////// Auth Server ////////////////

uint32_t relay_servers_in_use = 0;
Contributor:

should this be a static variable?

Contributor Author:

It is needed in src/apps/relay/ns_ioalib_engine_impl.c and is a "global var". So no, unless there is a special collection of global vars somewhere else...
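For what it's worth, the usual pattern for such a shared counter (a sketch; the file and header names are assumptions) is a single definition in one translation unit plus an extern declaration in a shared header:

/* shared header, e.g. ns_turn_utils.h (name assumed for illustration) */
extern uint32_t relay_servers_in_use;

/* one .c file, e.g. netengine.c (name assumed) - the single definition */
uint32_t relay_servers_in_use = 0;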

#ifdef POOL_TRACE
#define _TRACE TURN_LOG_FUNC
#else
// dirty talk - is for developers, only
Contributor:

I'm not really understanding this comment

Contributor Author:

Would "log trashing" be better?

TURN_LOG_LEVEL_INFO, "session %018llu: realm <%s> user <%s>: incoming packet %s processed, success\n",
(unsigned long long)(ss->id), (const char *)(ss->realm_options.name), (const char *)(ss->username), method);
TURN_LOG_FUNC(TURN_LOG_LEVEL_INFO,
"session %012llu.%d: realm <%s> user <%s>: incoming packet %s processed, success\n",
Contributor:

You replace %018llu with %012llu in a bunch of different places.

Maybe it would be better to make a #define for this string, so it can be reused globally?

Contributor Author:

I assume the original author had a reason for not using a macro. Perhaps to avoid clashes when the format gets changed.

@@ -40,7 +40,10 @@ extern "C" {

//////////// defines //////////////

#define TURN_SESSION_ID_FACTOR (1000000000000000LL)
// Even if a single threaded app handles on average about 1 session/s
Contributor:

I suspect 1 session per second is on the low side. My production instances handle many thousands of concurrent users per thread.

Not that your analysis is wrong in the general sense, just that maybe the explanation should be a bit more cautious.

Contributor Author:

Actually the unit (k) is missing (1k * 1G = 1T, i.e. ~12 zeros; 1 Gs ≈ 32 years). Even if a box handles 50k sessions/s, the session ID would only wrap after ~231 days: unlikely that all sessions last that long ...
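The arithmetic behind the ~231 days, assuming the per-thread ID space is 10^12 (i.e. TURN_SESSION_ID_FACTOR reduced from 10^15 to 10^12, as the "12 zeros" above suggests):

10^12 ids per relay thread / 50,000 sessions/s = 2 * 10^7 s ≈ 231 days
at ~1,000 sessions/s: 10^12 / 10^3 = 10^9 s (1 Gs) ≈ 31.7 years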

@KangLin (Contributor) commented Jan 22, 2024

@jelmd OK! I select prometheus-client-c; its interface is defined well.
It is recommended that you only fix bugs and add Windows support. Don't modify the interface.
The metrics are OK! I hope a metric for dropped packets gets added.

@jelmd jelmd force-pushed the #1340_better_metric_support branch from d0acd7a to ed41ff2 Compare February 6, 2024 12:08
Successfully merging this pull request may close these issues.

better metric support
4 participants