New Server Set up #8736
Hi! Could you @NateBrady23 please share the specs of the new servers?
Hi @NateBrady23! And later we need to show the last run ID from both servers.
In general I agree, but I prefer to tune things for the extreme use cases, and benchmarking is definitely one such case.
I second this.
All machines are identical, with these specs: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
Network
Mellanox!? Juicy!
Sounds great! While the faster network won't help with the majority of the tests (only the cached queries and plaintext tests should see an improvement, and maybe the fortunes one, since it was doing around 5 Gb/s of network traffic, if I am not mistaken), the doubling of the cores and the jump from the Skylake to the Ice Lake microarchitecture should help. The latter should not require Spectre mitigations that are as harsh, I believe.
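To put the network claim in perspective, here is a back-of-the-envelope bandwidth estimate (the request rate and response size below are illustrative assumptions, not measured TFB values):

```shell
# Estimate link utilization: responses/sec x bytes/response -> Gb/s.
# 7e6 RPS and 130 bytes per response (headers included) are assumptions.
awk 'BEGIN { rps = 7e6; bytes = 130; printf "%.1f Gb/s\n", rps * bytes * 8 / 1e9 }'
```

At those assumed numbers, the app-to-client link alone would carry about 7.3 Gb/s, which is why only the highest-throughput tests (plaintext, cached queries) are sensitive to NIC speed.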
It is actually 28 cores and 56 threads.
Right, my comment is wrong.
Even for a corporation, it is a pretty huge and unusual setup, especially the network part. Only the SSD is a weird choice: a SATA drive for the database server? In 2024? Really?
Thanks for providing the update @sebastienros! Sorry this setup is taking so long. It's been a matter of ordering things and people in the office at the right time to work on it. @msmith-techempower is doing some work with this today and I'm in on Thursday.
Just as a general update - I am really trying to get these up and working, but the going is slow given that I am not an IT professional by trade 😅. I know everyone, myself included, is anxious to get the continuous runs back up as soon as possible, and I don't want anyone thinking we are sitting on our hands.
Another update - we have gotten the machines mostly spun up and verified. Thank you to everyone for being so patient, but I am seeing light at the end of this tunnel and hope to have runs started back up soon.
I second this as I updated my benchmarks in the meantime and would love to see the impact independent from the hardware changes. Looking forward to the new environment, keep up the good work!
I get that you guys are just about across the finish line. But I recommend updating the announcement banner at the top of https://tfb-status.techempower.com/ anyway. It's a one-liner in your website's HTML (aside from publishing the change). This will encourage thousands of your site's followers and, regardless, "better late than never".
@joanhey @Kaliumhexacyanoferrat Yes, the first real run from the new servers will be with the last full run's commit. Great idea. Pinging @msmith-techempower ^ We got the "final" parts in on Friday evening at the office. Mike, give us hope for Monday or Tuesday! 🙏
Hardware install complete and "flash point" tested. Everything appears to be working correctly, and one of our major concerns appears to be okay (issue with power draw). Tomorrow, I'll be getting the software environments up and running and HOPEFULLY (not promising anything - yes, you Nate) get the parity commit run started. I am sure there will be more to fix/hone/etc. in the coming week or two, but we are slowly getting the new environment on its feet. Again, thank you all for your continued patience!
What version of Ubuntu are you using? 24.04 is almost there...
We have 22 atm, but it may end up prudent to move to 24 when it's released, since it's LTS.
Are you using the regular kernel or the Hardware Enablement (HWE) one, as I suggested here? Using the HWE kernel essentially eliminates the need to move to Ubuntu 24.04 (when it is out) until possibly early 2025, because it would be updated to the same release as the one that 24.04 is based on, and IMHO the differences due to other software components amount to a rounding error. The switch to the HWE is done with a simple command and a reboot.
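For reference, the switch being described looks roughly like this on Ubuntu 22.04 (a sketch based on Ubuntu's standard HWE package naming; it needs root and a reboot):

```shell
# Install the Hardware Enablement (HWE) kernel stack on Ubuntu 22.04.
# It tracks the newer kernel series that later Ubuntu releases ship with.
sudo apt-get update
sudo apt-get install --install-recommends linux-generic-hwe-22.04
sudo reboot
# After the reboot, confirm which kernel is running:
uname -r
```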
HWE |
HOWDY! Okay, I believe that we have a run started. So far, nothing seems out of the ordinary, so we will see how it plays out over the next few days. In the meantime, please be aware that this is a first attempt, and there are sure to be issues that creep up. Please report those issues here, and we will trudge on! Again, thank you for your continued patience!
Same run with commit https://github.com/TechEmpower/FrameworkBenchmarks/tree/625684fcc442767af013de2dfd1fc90dd73f1744 - old servers vs. new servers.
About the kernels: the new Ubuntu 24.04 comes with kernel 6.8.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e7aeb78ab01 We want it, but we will check it!
The current run is stuck!
Yes, the page has not refreshed since yesterday:
Confirmed - I am looking into it now. Appears to have been a thermal issue on the primary machine. About 4 hours (I think) into the run, the machine shut itself down.
Ok, things are back up and running, and we're still monitoring. Just so you guys know, all of us at TechEmpower get an email when the Citrine environment stops getting updates. You don't have to add to the thread or open issues when it crashes; it may happen a few more times. But we appreciate everyone's enthusiasm!
OKAY. Little update. TechEmpower is located in a small office and we do not have a dedicated server rack any longer - we bought a small rack that has insulation (it's very loud), but that resulted in the switch being too close to the app server... and it produces a TON of heat which, in turn, tripped the heat sensor on the intake of the machine, which fired off a safety shutdown. I fiddled with a bunch of setups, but what seems to be working at the moment is having the switch powered down and plugging in the fiber directly. So, App is connected to Database on 10.0.0.x, and App is connected to Client on 10.0.1.x. I tested this setup. Anyway, the current run has benchmarked a couple of frameworks, I am monitoring temperature (among other stats) while it is running, and hopefully we will be okay moving forward.
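For anyone reproducing a similar switch-less setup, the point-to-point addressing described above could be sketched with `ip` commands like these (the interface names `ens1f0`/`ens1f1` and peer addresses are hypothetical placeholders, not TFB's actual configuration):

```shell
# App server, with the switch powered down: one NIC port cabled straight
# to the database server, the other straight to the load-generating client.
sudo ip addr add 10.0.0.1/24 dev ens1f0   # direct link to Database
sudo ip addr add 10.0.1.1/24 dev ens1f1   # direct link to Client
sudo ip link set ens1f0 up
sudo ip link set ens1f1 up
# Sanity check: each peer should answer on its point-to-point subnet.
ping -c 1 10.0.0.2 && ping -c 1 10.0.1.2
```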
Have no fear, the continuous run is still going on and everything looks healthy! Just an issue with tfb-status receiving updates. Should be fixed shortly. FYI: The parity run we're doing is with Round 22: https://tfb-status.techempower.com/results/66d86090-b6d0-46b3-9752-5aa4913b2e33 I'll be out early next week; when this run completes, it will automatically start a new run from the current state of the repo.
Impressive numbers! I think that it would be good to create…
Yes, the numbers are very, very, very nice. It seems that we still have the network bandwidth bottleneck for the cached queries and the plaintext tests, though in the former case only one implementation…
I am assuming that the network adapter on the application server is dual-ported, in which case wouldn't this be a superior configuration? If the machine is connected to a switch via a single port, then the traffic both from the load generator and the database would pass through the same link, so there might be some interference, while in the current configuration everything would be nicely isolated. |
@sebastienros Thanks for clarifying the number of physical cores later in the thread. Would you be willing to re-edit the 6th comment here, with the specs, so the top text does not continue to say, "56 physical cores, 1 socket, 1 NUMA, 64 GB RAM"? I fear some who read this will view that 'headline' and perhaps miss the later clarification. |
The run did fail, and is aborting:
Some details:
Too much Martini, perhaps, or shaken when it should not have been, according to Agent 007. About the hardware and storage (it has nothing to do with the problem): I wonder why these servers have huge CPU, RAM and network, but a slow SATA drive. At least for the DB, the number of IOs does make a difference.
@synopse the database data in this bench is very small, and it will always fit in memory. And that is correct for a framework benchmark - we don't want to bench the HD of the database server. Still, I have new database configs ready for this big server, but those will come after the next run, when all the databases update their versions. @volyrique the vulnerability mitigations are still a big performance problem.
IMO the only solution is to replace the CPUs that have the vulnerabilities, to get good performance again.
@joanhey I'm pretty sure that updates generate a significant load on the disk, even with a minimal WAL level. Should we just create the…
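For context on the WAL/fsync trade-off being discussed: if one did want to minimize disk load from updates in a disposable benchmark database, the usual PostgreSQL knobs look like this (a sketch of common speed-over-safety settings, not what TFB actually configures):

```shell
# Benchmark-only PostgreSQL tuning: trade crash safety for less disk I/O.
# Acceptable here only because the data set is disposable and reloaded per run.
psql -U postgres <<'SQL'
ALTER SYSTEM SET synchronous_commit = off;  -- commits don't wait for WAL flush
ALTER SYSTEM SET fsync = off;               -- never force writes to stable storage
ALTER SYSTEM SET full_page_writes = off;    -- shrink WAL volume after checkpoints
SQL
sudo systemctl restart postgresql           # the fsync change requires a restart
```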
I think that the database discussion is for another Issue. |
The run seems stuck... I would like to check the failures for Netty/Vert.x and Quarkus (of which I am a developer), because in our CI tests we didn't have anything similar... Regarding it being a NUMA CPU: I have to double-check, but I think it is a kind of NUMA arch - or rather, there is no partitioning of memory, but (last-level) cache accesses have heterogeneous access costs. On my local machine (Ryzen 7950X) I had to enable it... more info at https://www.reddit.com/r/Amd/comments/ce6pj9/ccd_equivalent_to_numa_in_functionality/
@franz1981 after the Martini framework
@joanhey In production (we would like to reproduce the production state, right?) we should enable fsync in PostgreSQL. Anyway, we have to make it pass and run all the tests before trying to maximize the hardware.
Are they really? The speedup in the database updates test is in line with the one in the multiple queries test (i.e. 2-3 times faster) - just check. Obviously, we can't expect the database updates test to have the same performance as the multiple queries one - it must be slower.
I say again:
PS: open new issues to discuss it!
Sorry folks. This was a partitioning mistake. It's been fixed and we've restarted the Round 22 parity run. |
Howdy! The latest run completed successfully (and didn't run out of disk space this time). It looks like everything is operating smoothly. Please feel free to report if you notice anything out of the ordinary or have questions. The newest run in the continuous series is the latest pull from GitHub, so it will include everything merged in as of this morning. I THINK we are about ready to close this ticket, but I will leave it open for a bit longer while this next run is going. Thanks again for the ongoing support and patience.
@msmith-techempower I have just one comment - the only weirdness I have noticed in the results is the…
@volyrique I believe I may have jumped the gun on this one. I thought that I had installed HWE initially, but then wanted to double-check, so I stopped the current run and installed it as recommended.
Hi. It seems the dashboard is currently stuck. It hasn't updated in almost a day.
@p8 Yeah, I'm troubleshooting this... we're experiencing thermal issues again, and the server decided to power itself down late yesterday. Honestly, we have these in a small rack that has airflow problems, and it seems like these new machines have lower heat tolerances than the previous ones. Weighing our options, but it's hard to say when we will get continuous runs back in the short term.
Thanks @msmith-techempower ! |
In the meantime, we don't need a continuous run; a half-yearly or quarterly one would do!
We do not have credits/funding for cloud benchmarks, nor do we have the infrastructure set up. If someone wants to support that, including the time to maintain, we'd absolutely be open to having that discussion. |
Some clouds give free servers to open source projects; perhaps we could ask (Azure, Amazon, Digital Ocean, ...) 🤔 But we have another option now that we use Docker. We only need to decide the number of CPUs and the amount of memory to use.
Limiting CPU and RAM is not all that a cloud server does. The RAM limitation does not change anything for most solutions (except perhaps Java). And it is very likely to have a separate cloud DB instance hosted by the provider, with some specific network abilities, and so on...
I said "simulating a cloud server", but we can change it to "simulating commodity servers". And yes, the CPU throttling is actually a problem.
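Simulating a commodity server with Docker's resource limits might look like this (the image name and the 4-CPU/8-GiB profile are placeholders, not a proposed standard):

```shell
# Run a framework container with a commodity-server resource profile,
# regardless of how big the physical host is. --cpus throttles CPU time
# via the scheduler; --memory hard-caps RAM at 8 GiB.
docker run --rm \
  --cpus 4 \
  --memory 8g \
  --name tfb-limited \
  some-framework-image   # placeholder image name
```

Note that `--cpus` throttles CPU time rather than shrinking the visible core count, so it only approximates a genuinely smaller machine, which is the throttling concern raised above.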
@synopse you are a newcomer with a fast framework! Problems always come up; they had to fix the same thermal issues with the old servers.
Update: the application server shut down due to heat again. I'm still looking into resolutions, but this is a blocker for the moment. |
Update to the update: I have redone the geometry of the machines, wires, and whatnot in the small rack and tried a few things to improve airflow. I kicked off another run (and I see results coming into TFBStatus now) and will continue to monitor throughout the week. I am 99% certain that we will need to install some additional airflow measures (intake/exhaust fans, push/pull setup, etc.), but we will cross that bridge when we get to it.
Okay, closing this issue. Looking like we're getting good runs reliably with thermal issues under control (we'll see when summer rolls in). Feel free to follow-up here if you have questions/concerns, but otherwise it'll be business as usual moving forward. |
Good morning, friends!
We are working through some issues with the new servers. Nothing serious, but it has required ordering some extra parts/cables, and the delay will be a bit longer. I appreciate everyone's patience while we work through this. We're getting the 40-gigabit fiber setup working, dealing with some power issues, and the SFP connectors don't fit in our current enclosure.