Slow performance and excessive SQL queries #156

Open
llamafilm opened this issue Jan 8, 2024 · 4 comments

@llamafilm
Contributor

While this plugin works great with a small number of devices, it becomes very slow at scale. For example, with 1158 devices in my test, the following API call takes 23 seconds.

This instance is running v3.7.0 on an EC2 m5.2xlarge (8 vCPU, 32 GB memory), connected to Postgres 15 in RDS Aurora.

/api/plugins/prometheus-sd/devices/?status=active&cf_prometheus_blackbox_module__n=null

I did a little investigation and found that 99% of that time is spent querying the database; the HTTP download is only 10 ms. The Postgres log shows that this single API call issues 6157 SELECT statements, about two-thirds of which return 0 rows. You can see the log here: prometheus-sd-query.psql.zip

To replicate this issue easily, launch a fresh instance of netbox-docker and run the following code in nbshell. Reproduced this way it is a bit faster, taking 13 seconds, perhaps because the database is local.

# >>> exec(open('/etc/netbox/config/scripts/init-dummy-data.py').read())

import ipaddress
from django.db import transaction

s = Site(name='Dummy Site', slug='dummy-site')
s.full_clean()
s.save()

m = Manufacturer(name='Generic', slug='generic')
m.full_clean()
m.save()

dt = DeviceType(model='Generic Device', manufacturer=m, slug='generic-device')
dt.full_clean()
dt.save()

it = InterfaceTemplate(name='eth0', type='virtual', device_type=dt)
it.full_clean()
it.save()

dr = DeviceRole(name='Generic Role', slug='generic-role')
dr.full_clean()
dr.save()

print(f"Created Dummy Site, Manufacturer, Device Type, and Device Role")
print("Creating {num_hosts} dummy devices...")

hosts = list(ipaddress.IPv4Network('127.0.0.0/16'))
num_hosts= 2000
for i in range(num_hosts):
    # if any nested atomic transactions fail then device will not be created
    with transaction.atomic():
        # create device (and interface)
        d = Device(
            site=s,
            name=f'Dummy Device {i:03}',
            device_type=dt,
            role=dr
        )
        d.full_clean()
        d.save()
        
        # create IP and assign to interface
        ip = IPAddress(address=f"{hosts[i]}/32")
        ip.assigned_object = d.interfaces.first()
        with transaction.atomic():
            ip.full_clean()
            ip.save()
        
        # assign device primary IP
        d.primary_ip4 = ip
        with transaction.atomic():
            d.full_clean()
            d.save()

print(f"Done!")
@FlxPeters
Owner

I understand the issue. Unfortunately this is not an issue with the plugin itself, but rather with NetBox, Django, and the underlying Django REST Framework. The only things this plugin does differently from the standard NetBox API are removing pagination and using a custom serializer.

Can you check the standard API's performance for a direct comparison? A ?limit=0 should do the trick.

The serializer is another thing we might be able to improve. It does a lot of lookups to related objects, and I'm not sure how much of the request cycle is spent on them; that would need to be measured as well.
I think this requires a proper analysis with a profiler to see where most of the time is spent in detail.

As a mitigation, you could add filters to your API query or shard it into multiple scrape jobs, each fetching a different subset filtered via the URL.
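
If the related-object lookups do turn out to dominate, the usual Django-side mitigation is to prefetch those relations on the queryset before the serializer iterates over it, so each device no longer triggers its own round of SELECTs. This is only a rough sketch of the idea; the field list below is illustrative and not the plugin's actual queryset:

from dcim.models import Device

# Resolve the common relations in a handful of queries up front,
# instead of one (or more) queries per device during serialization.
queryset = (
    Device.objects.filter(status='active')
    .select_related('site', 'role', 'platform', 'tenant',
                    'device_type__manufacturer',
                    'primary_ip4', 'primary_ip6')
    .prefetch_related('tags')
)

for device in queryset:
    # These attribute accesses now read from the prefetched cache.
    print(device.name, device.site.slug, device.primary_ip4)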

@llamafilm
Contributor Author

llamafilm commented Jan 11, 2024

I don't understand SQL well enough to know why these queries are happening, but they do seem to be caused by this plugin, because the behavior doesn't occur with the endpoint you suggested.

I've moved my testing to a local Docker instance to compare these two endpoints in a more repeatable way. I'm happy to share the scripts if it helps you reproduce. Here are the results with 1000 devices:

/api/dcim/devices/?limit=0
17 SELECT statements
Time: 2.4 seconds

/api/plugins/prometheus-sd/devices/
3016 SELECT statements
Time: 6.4 seconds

I'm hoping to scale this up to around 10K devices, all scraped by a single instance of blackbox exporter. I'm not sure yet whether that's feasible or whether it will need to be split up; I'm effectively testing blackbox exporter, this plugin, and NetBox all at once.
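
For reference, these per-request query counts can be reproduced from nbshell without reading the Postgres log, by wrapping a request in Django's CaptureQueriesContext. A rough sketch, assuming an existing API token and that ALLOWED_HOSTS accepts the test client's 'testserver' host (netbox-docker's default '*' does):

import time
from django.db import connection
from django.test import Client
from django.test.utils import CaptureQueriesContext
from users.models import Token  # Token model location as of NetBox 3.x

# Assumes at least one API token already exists (create one in the UI otherwise).
token = Token.objects.first()
client = Client(HTTP_AUTHORIZATION=f"Token {token.key}")

for path in ("/api/dcim/devices/?limit=0",
             "/api/plugins/prometheus-sd/devices/"):
    with CaptureQueriesContext(connection) as ctx:
        start = time.monotonic()
        response = client.get(path)
        elapsed = time.monotonic() - start
    print(f"{path}: HTTP {response.status_code}, "
          f"{len(ctx.captured_queries)} queries, {elapsed:.1f}s")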

@FlxPeters
Owner

Good catch. It might be the lookups to related objects. I'll have to look into Django's ORM to understand how those queries are issued even when there is nothing in the related list.

One option to reduce this could be a plugin config that disables some of the lookups before the hasattr check is even executed.
See: https://github.com/FlxPeters/netbox-plugin-prometheus-sd/blob/main/netbox_prometheus_sd/api/utils.py#L26
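
Purely as a hypothetical sketch of that idea (the setting name, helper, and label keys below are invented for illustration and are not existing plugin options):

from django.conf import settings

# Hypothetical plugin setting; not an existing option of netbox-plugin-prometheus-sd.
PLUGIN_SETTINGS = settings.PLUGINS_CONFIG.get('netbox_prometheus_sd', {})
ENABLED_LOOKUPS = set(PLUGIN_SETTINGS.get(
    'enabled_lookups', ['platform', 'tenant', 'site']
))

def extract_labels(obj, labels):
    # Short-circuit before any attribute access, so disabled lookups
    # never touch the database at all.
    if 'platform' in ENABLED_LOOKUPS and getattr(obj, 'platform', None):
        labels['platform'] = obj.platform.slug
    if 'tenant' in ENABLED_LOOKUPS and getattr(obj, 'tenant', None):
        labels['tenant'] = obj.tenant.slug
    if 'site' in ENABLED_LOOKUPS and getattr(obj, 'site', None):
        labels['site'] = obj.site.slug
    return labels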

@llamafilm
Contributor Author

This performance issue has led me to develop a different approach, which I'd like to share:
I have a NetBox Event Rule that triggers a custom script every time a device is created or modified. The script rebuilds the Prometheus configuration and writes it to local disk, where it is then served directly by Apache. This vastly reduces the database load, and as a minor benefit it also eliminates extra labels that I don't use.
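
For anyone interested in that approach, here is a rough sketch of what such a custom script could look like; the output path, label names, and query filter are examples, and the Event Rule that runs it on device create/update is configured separately in the NetBox UI:

import json

from dcim.models import Device
from extras.scripts import Script

# Example path served by the local web server; adjust to your environment.
OUTPUT_FILE = '/opt/netbox/prometheus/devices.json'

class RebuildPrometheusTargets(Script):
    class Meta:
        name = 'Rebuild Prometheus file_sd targets'
        commit_default = False

    def run(self, data, commit):
        targets = []
        devices = (
            Device.objects.filter(status='active', primary_ip4__isnull=False)
            .select_related('site', 'role', 'primary_ip4')
        )
        for device in devices:
            targets.append({
                'targets': [str(device.primary_ip4.address.ip)],
                'labels': {
                    'netbox_device': device.name,
                    'netbox_site': device.site.slug,
                    'netbox_role': device.role.slug,
                },
            })

        # Prometheus file_sd format: a JSON list of {"targets": [...], "labels": {...}}.
        with open(OUTPUT_FILE, 'w') as fh:
            json.dump(targets, fh, indent=2)

        return f'Wrote {len(targets)} targets to {OUTPUT_FILE}'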
