Slow performance and excessive SQL queries #156

Open
llamafilm opened this issue Jan 8, 2024 · 4 comments

@llamafilm
Contributor

While this plugin works great with a small number of devices, it becomes very slow at scale. For example, with 1158 devices in my test, the following API call takes 23 seconds.

This instance is running v3.7.0 on an EC2 m5.2xlarge (8 vCPU, 32 GB memory), connected to Postgres 15 in RDS Aurora.

/api/plugins/prometheus-sd/devices/?status=active&cf_prometheus_blackbox_module__n=null

I did a little investigation and found that 99% of that time is spent querying the database; the HTTP download is only 10 ms. The Postgres log shows that this single API call issues 6157 SELECT statements, about two-thirds of which return 0 rows. You can see the log here: prometheus-sd-query.psql.zip

To replicate this issue easily, launch a fresh instance of netbox-docker and run the following code in nbshell. Reproduced this way it is a bit faster, taking 13 seconds, perhaps because the database is local.

# >>> exec(open('/etc/netbox/config/scripts/init-dummy-data.py').read())

import ipaddress
from django.db import transaction

s = Site(name='Dummy Site', slug='dummy-site')
s.full_clean()
s.save()

m = Manufacturer(name='Generic', slug='generic')
m.full_clean()
m.save()

dt = DeviceType(model='Generic Device', manufacturer=m, slug='generic-device')
dt.full_clean()
dt.save()

it = InterfaceTemplate(name='eth0', type='virtual', device_type=dt)
it.full_clean()
it.save()

dr = DeviceRole(name='Generic Role', slug='generic-role')
dr.full_clean()
dr.save()

print(f"Created Dummy Site, Manufacturer, Device Type, and Device Role")
print("Creating {num_hosts} dummy devices...")

hosts = list(ipaddress.IPv4Network('127.0.0.0/16'))
num_hosts= 2000
for i in range(num_hosts):
    # if any nested atomic transactions fail then device will not be created
    with transaction.atomic():
        # create device (and interface)
        d = Device(
            site=s,
            name=f'Dummy Device {i:03}',
            device_type=dt,
            role=dr
        )
        d.full_clean()
        d.save()
        
        # create IP and assign to interface
        ip = IPAddress(address=f"{hosts[i]}/32")
        ip.assigned_object = d.interfaces.first()
        with transaction.atomic():
            ip.full_clean()
            ip.save()
        
        # assign device primary IP
        d.primary_ip4 = ip
        with transaction.atomic():
            d.full_clean()
            d.save()

print(f"Done!")
@FlxPeters
Owner

I understand the issue. Unfortunately this is not an issue with the plugin itself, but rather with NetBox, Django, and the underlying Django REST Framework. The only things this plugin does differently from the standard NetBox API are removing pagination and using a custom serializer.

Can you check the standard API's performance for a direct comparison? A ?limit=0 should do the trick.

The serializer is another thing we might be able to improve. It does a lot of lookups to related objects, and I'm not sure how much of the request cycle is spent on them; that would need to be measured as well.
I think this requires a proper analysis with a profiler to see where most of the time is spent in detail.

As a mitigation, you could add filters to your API query or shard it into multiple scrape jobs, each fetching a different subset filtered via the URL.
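
If the related-object lookups do turn out to dominate, the usual Django-side mitigation is to prefetch those relations on the queryset before the serializer iterates over it, so each device no longer triggers its own round of SELECTs. This is only a rough sketch of the idea; the field list below is illustrative and not the plugin's actual queryset:

from dcim.models import Device

# Resolve the common relations in a handful of queries up front,
# instead of one (or more) queries per device during serialization.
queryset = (
    Device.objects.filter(status='active')
    .select_related('site', 'role', 'platform', 'tenant',
                    'device_type__manufacturer',
                    'primary_ip4', 'primary_ip6')
    .prefetch_related('tags')
)

for device in queryset:
    # These attribute accesses now read from the prefetched cache.
    print(device.name, device.site.slug, device.primary_ip4)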

@llamafilm
Contributor Author

llamafilm commented Jan 11, 2024

I don't understand SQL well enough to know why these queries are happening, but they do seem to be caused by this plugin, because the behavior doesn't occur with the endpoint you suggested.

I've moved my testing to a local Docker instance to compare these two endpoints in a more repeatable way. I'm happy to share the scripts if it helps you reproduce. Here are the results with 1000 devices:

/api/dcim/devices/?limit=0
17 SELECT statements
Time: 2.4 seconds

/api/plugins/prometheus-sd/devices/
3016 SELECT statements
Time: 6.4 seconds

I'm hoping to scale this up to around 10K devices, all scraped by a single instance of blackbox exporter. I'm not sure yet whether that's feasible or whether it will need to be split up; I'm effectively testing blackbox exporter, this plugin, and NetBox all at once.
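
For reference, these per-request query counts can be reproduced from nbshell without reading the Postgres log, by wrapping a request in Django's CaptureQueriesContext. A rough sketch, assuming an existing API token and that ALLOWED_HOSTS accepts the test client's 'testserver' host (netbox-docker's default '*' does):

import time
from django.db import connection
from django.test import Client
from django.test.utils import CaptureQueriesContext
from users.models import Token  # Token model location as of NetBox 3.x

# Assumes at least one API token already exists (create one in the UI otherwise).
token = Token.objects.first()
client = Client(HTTP_AUTHORIZATION=f"Token {token.key}")

for path in ("/api/dcim/devices/?limit=0",
             "/api/plugins/prometheus-sd/devices/"):
    with CaptureQueriesContext(connection) as ctx:
        start = time.monotonic()
        response = client.get(path)
        elapsed = time.monotonic() - start
    print(f"{path}: HTTP {response.status_code}, "
          f"{len(ctx.captured_queries)} queries, {elapsed:.1f}s")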

@FlxPeters
Owner

Good catch. It might be the lookups to related objects. I'll have to look into Django's ORM to understand how those queries are issued even when there is nothing in the related list.

One option to reduce this could be a plugin config that disables some of the lookups before the hasattr check is even executed.
See: https://github.com/FlxPeters/netbox-plugin-prometheus-sd/blob/main/netbox_prometheus_sd/api/utils.py#L26
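
Purely as a hypothetical sketch of that idea (the setting name, helper, and label keys below are invented for illustration and are not existing plugin options):

from django.conf import settings

# Hypothetical plugin setting; not an existing option of netbox-plugin-prometheus-sd.
PLUGIN_SETTINGS = settings.PLUGINS_CONFIG.get('netbox_prometheus_sd', {})
ENABLED_LOOKUPS = set(PLUGIN_SETTINGS.get(
    'enabled_lookups', ['platform', 'tenant', 'site']
))

def extract_labels(obj, labels):
    # Short-circuit before any attribute access, so disabled lookups
    # never touch the database at all.
    if 'platform' in ENABLED_LOOKUPS and getattr(obj, 'platform', None):
        labels['platform'] = obj.platform.slug
    if 'tenant' in ENABLED_LOOKUPS and getattr(obj, 'tenant', None):
        labels['tenant'] = obj.tenant.slug
    if 'site' in ENABLED_LOOKUPS and getattr(obj, 'site', None):
        labels['site'] = obj.site.slug
    return labels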

@llamafilm
Contributor Author

This performance issue has led me to develop a different approach, which I'd like to share:
I have a NetBox Event Rule that triggers a custom script every time a device is created or modified. The script rebuilds the Prometheus configuration and writes it to local disk, where it is then served directly by Apache. This vastly reduces the database load, and as a minor benefit it also eliminates extra labels that I don't use.
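
For anyone interested in that approach, here is a rough sketch of what such a custom script could look like; the output path, label names, and query filter are examples, and the Event Rule that runs it on device create/update is configured separately in the NetBox UI:

import json

from dcim.models import Device
from extras.scripts import Script

# Example path served by the local web server; adjust to your environment.
OUTPUT_FILE = '/opt/netbox/prometheus/devices.json'

class RebuildPrometheusTargets(Script):
    class Meta:
        name = 'Rebuild Prometheus file_sd targets'
        commit_default = False

    def run(self, data, commit):
        targets = []
        devices = (
            Device.objects.filter(status='active', primary_ip4__isnull=False)
            .select_related('site', 'role', 'primary_ip4')
        )
        for device in devices:
            targets.append({
                'targets': [str(device.primary_ip4.address.ip)],
                'labels': {
                    'netbox_device': device.name,
                    'netbox_site': device.site.slug,
                    'netbox_role': device.role.slug,
                },
            })

        # Prometheus file_sd format: a JSON list of {"targets": [...], "labels": {...}}.
        with open(OUTPUT_FILE, 'w') as fh:
            json.dump(targets, fh, indent=2)

        return f'Wrote {len(targets)} targets to {OUTPUT_FILE}'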
