
New Server Set up #8736

Closed
NateBrady23 opened this issue Feb 9, 2024 · 57 comments


@NateBrady23
Member

Good morning, friends!

We are working through some issues with the new servers. Nothing serious, but it's required ordering some extra parts/cables and the delay will be a bit longer. I appreciate everyone's patience while we work through this. We're attempting to get the 40-gigabit fiber setup working, some power issues, and the SFP connectors don't fit in our current enclosure.

@itrofimow
Contributor

Hi!

Could you @NateBrady23 please share the specs of the new servers?
My framework requires some manual tuning of its configuration for the best performance, and I'd like to do that upfront, if possible.

@joanhey
Contributor

joanhey commented Feb 27, 2024

Hi,
It would be good to show which frameworks work best without any changes!
And for any framework that doesn't, that should be an enhancement to work on!

@NateBrady23
please do the first run on the new servers with the last full-run commit: [0ec8ed4](https://github.com/TechEmpower/FrameworkBenchmarks/tree/0ec8ed488ec87718eaee9ed05c0ffd51ca48113b)

And later we should publish the last run IDs from both server setups.

@joanhey
Contributor

joanhey commented Feb 27, 2024

😕
Please, we need more info:
[image]

We understand that you are busy, but please send news!

@itrofimow
Contributor

> And that need to be an enhancement to any framework !!

In general I agree, but I prefer to tune things for extreme use cases, and benchmarking is definitely one of those. Users of my framework (myself included) are fine with tuning it for their specific production workloads, and if what you maintain hits its best numbers on any possible workload without even slight manual tuning, that's something to be really proud of, I think.

> please run the first run with the new servers, with the last full run commit

I second this.

@sebastienros
Contributor

sebastienros commented Mar 1, 2024

All machines are identical with these specs

Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
56 logical cores, 1 socket, 1 NUMA, 64 GB RAM
40Gbit/s network
SSD 960GB

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         46 bits physical, 57 bits virtual
  Byte Order:            Little Endian
CPU(s):                  56
  On-line CPU(s) list:   0-55
Vendor ID:               GenuineIntel
  Model name:            Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
    CPU family:          6
    Model:               106
    Thread(s) per core:  2
    Core(s) per socket:  28
    Socket(s):           1
    Stepping:            6
    CPU max MHz:         3100.0000
    CPU min MHz:         800.0000
    BogoMIPS:            4000.00
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr
                         sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good
                         nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2
                         ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes
                         xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba
                         ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust
                         bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt
                         clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc
                         cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp
                         hwp_act_window hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni
                         avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities
Virtualization features:
  Virtualization:        VT-x
Caches (sum of all):
  L1d:                   1.3 MiB (28 instances)
  L1i:                   896 KiB (28 instances)
  L2:                    35 MiB (28 instances)
  L3:                    42 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-55
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

SSD - 960GB

Network

       description: Ethernet interface
       product: MT28908 Family [ConnectX-6]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:10:00.0
       logical name: ens1f0np0
       version: 00
       capacity: 40Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical fibre 1000bt-fd 10000bt-fd 25000bt-fd 40000bt-fd autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.15.0-73-generic duplex=full firmware=20.33.1048 (MT_0000000594) ip=10.0.0.121 latency=0 link=yes multicast=yes port=fibre
       resources: irq:18 memory:b0000000-b1ffffff memory:b2000000-b20fffff

@franz1981
Contributor

Mellanox!? Juicy!

@volyrique
Contributor

Sounds great! While the faster network won't help with the majority of the tests (only the cached queries and plaintext tests should see an improvement, and maybe the fortunes one, since it was doing around 5 Gb/s of network traffic, if I am not mistaken), the doubling of the cores and the jump from the Skylake to the Ice Lake microarchitecture should. The latter should not require Spectre mitigations that are as harsh, I believe.

> 56 physical cores

It is actually 28 cores and 56 threads, visible from the lscpu output.

@sebastienros
Contributor

> It is actually 28 cores and 56 threads, visible from the lscpu output.

Right, my comment is wrong.

@synopse

synopse commented Mar 4, 2024

Even for a corporation, this is a pretty huge and unusual setup, especially the network part.

Only the SSD is a weird choice: a SATA version for the database? In 2024? Really?

@NateBrady23
Member Author

Thanks for providing the update @sebastienros! Sorry this setup is taking so long. It's been a matter of ordering things and people in the office at the right time to work on it. @msmith-techempower is doing some work with this today and I'm in on Thursday.

@msmith-techempower
Member

Just as a general update - I am really trying to get these up and working, but the going is slow given that I am not an IT professional by trade 😅. I know everyone, myself included, is anxious to get the continuous runs back up as soon as possible, and I don't want anyone thinking we are sitting on our hands.

@msmith-techempower
Member

Another update - we have gotten the machines mostly spun up and verified (using iperf as a baseline) the 40Gbps connections over fiber. We are still trying to get each machine able to connect to the internet (which has been a slog, but I think the hardware for it should be arriving today), but once that is done we will start in on the software side of setup.

Thank you to everyone for being so patient, but I am seeing light at the end of this tunnel and hope to have runs started back up soon.
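For anyone curious, a point-to-point bandwidth check like the one described is typically done with iperf in server/client mode. A sketch, assuming the iperf3 variant; the address and flag values are illustrative, not the team's actual invocation:

```shell
# On the receiving machine (e.g. the database server), start a listener:
iperf3 -s

# On the sending machine (e.g. the app server), run a 30-second test with
# 8 parallel streams, which helps saturate a 40 Gbps link:
iperf3 -c 10.0.0.121 -t 30 -P 8
```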

@Kaliumhexacyanoferrat
Contributor

> @NateBrady23 please run the first run with the new servers, with the last full run commit

I second this as I updated my benchmarks in the meantime and would love to see the impact independent from the hardware changes.

Looking forward to the new environment, keep up the good work!

@mkvalor

mkvalor commented Mar 24, 2024

I get that you guys are just about across the finish line. But I recommend updating the announcement banner at the top of https://tfb-status.techempower.com/ anyway. It's a one-liner in your website's HTML (aside from publishing the change). This will encourage thousands of your site's followers and, regardless, "better late than never".

@NateBrady23
Member Author

@joanhey @Kaliumhexacyanoferrat Yes, the first real run from the new servers will be with the last full run's commit. Great idea.

Pinging @msmith-techempower ^

We got the "final" parts in on Friday evening at the office. Mike, give us hope for Monday or Tuesday! 🙏

@msmith-techempower
Member

Hardware install complete and "flash point" tested. Everything appears to be working correctly, and one of our major concerns appears to be okay (issue with power draw). Tomorrow, I'll be getting the software environments up and running and HOPEFULLY (not promising anything - yes, you Nate) get the parity commit run started. I am sure there will be more to fix/hone/etc. in the coming week or two, but we are slowly getting the new environment on its feet.

Again, thank you all for your continued patience!

@sebastienros
Contributor

What version of Ubuntu are you using? 24.04 is almost there...

February 29, 2024 – Feature Freeze
March 21, 2024 – User Interface Freeze
April 4, 2024 – Ubuntu 24.04 Beta
April 11, 2024 – Kernel Freeze
April 25, 2024 – Ubuntu 24.04 LTS Released

@msmith-techempower
Member

We have 22.04 at the moment, but it may prove prudent to move to 24.04 when it's released, since it's LTS.

@volyrique
Contributor

volyrique commented Mar 26, 2024

Are you using the regular kernel or the Hardware Enablement (HWE) one, as I suggested here? Using the HWE kernel essentially eliminates the need to move to Ubuntu 24.04 (when it is out) until possibly early 2025 because it would be updated to the same release as the one that 24.04 is based on, and IMHO the differences due to other software components amount to a rounding error. The switch to the HWE is done with a simple command and a reboot.
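For reference, on Ubuntu 22.04 that switch amounts to a single package install followed by a reboot (a sketch; the exact HWE metapackage name can differ between server and desktop images):

```shell
# Install the Hardware Enablement (HWE) kernel stack on Ubuntu 22.04 LTS,
# then reboot into the newer kernel:
sudo apt install --install-recommends linux-generic-hwe-22.04
sudo reboot
```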

@msmith-techempower
Member

HWE

@msmith-techempower
Member

HOWDY! Okay, I believe that we have a run started. So far, nothing seems out of the ordinary, so we will see how it plays out over the next few days.

In the meantime, please be aware that this is a first attempt, and there are sure to be issues that creep up. Please report those issues here, and we will trudge on!

Again, thank you for your continued patience!

@joanhey
Contributor

joanhey commented Mar 28, 2024

About the kernels:
The latest Ubuntu 22.04.4 (February 2024) changed to kernel 6.5 (from 5.15):
https://ubuntu.com/about/release-cycle#ubuntu-kernel-release-cycle
We didn't notice this change!

The new Ubuntu 24.04 comes with kernel 6.8.
And the next Ubuntu 22.04.5 will also come with 6.8 (after 24.04).

Network-related: Linux 6.8 includes networking buffs that provide better cache efficiency. This is said to improve “TCP performances with many concurrent connections up to 40%” – a sizeable uplift, though to what degree most users will benefit is unclear.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3e7aeb78ab01

We want it, but we will check it !!

@joanhey
Contributor

joanhey commented Mar 28, 2024

The current run is stuck!

@synopse

synopse commented Mar 28, 2024

Yes, the page hasn't refreshed since yesterday:
last updated 2024-03-27 at 4:02 PM
https://tfb-status.techempower.com/

@msmith-techempower
Member

Confirmed - I am looking into it now. Appears to have been a thermal issue on the primary machine. About 4 hours (I think) into the run the machine shut itself down.

@NateBrady23
Member Author

Ok things are back up and running and we're still monitoring.

Just so you all know, everyone at TechEmpower gets an email when the citrine environment stops getting updates. You don't have to add to the thread or open issues when it crashes; it may happen a few more times. But we appreciate everyone's enthusiasm!

@msmith-techempower
Member

OKAY.

Little update. TechEmpower is located in a small office and we do not have a dedicated server rack any longer - we bought a small rack that has insulation (it's very loud), but that resulted in the switch being too close to the app server... and it produces a TON of heat which, in turn, tripped the heat sensor on the intake of the machine, which fired off a safety shutdown.

I fiddled with a bunch of setups, but what seems to be working at the moment is having the switch powered down and plugging in the fiber directly. So, App is connected to Database on 10.0.0.x, and App is connected to Client on 10.0.1.x. I tested this setup with iperf as I did with the switch and saw no appreciable difference in throughput, so I am hoping this is a fair way to test. VERY OPEN TO COMMENT HERE!

Anyway, the current run has benchmarked a couple, I am monitoring temperature (among other stats) while it is running, and hopefully we will be okay moving forward.

@NateBrady23
Member Author

NateBrady23 commented Mar 29, 2024

Have no fear, the continuous run is still going on and everything looks healthy! Just an issue with tfb-status receiving updates. Should be fixed shortly.

FYI: The parity run we're doing is with Round 22 https://tfb-status.techempower.com/results/66d86090-b6d0-46b3-9752-5aa4913b2e33

I'll be out early next week; when this run completes, it will automatically start a new run from the current state of the repo.

@joanhey
Contributor

joanhey commented Mar 30, 2024

Impressive numbers!
We'll need some time to analyze them.

I think it would be good to create a Round 22N, so regular visitors can see the difference.
It would also make comparison with Round 23 easier.

@volyrique
Contributor

Yes, the numbers are very, very, very nice. libh2o is between 2 and 3 times faster in the tests that did not suffer from a network bandwidth bottleneck (i.e. everything except cached queries), which is more or less the expected number - we have 2 times the number of cores running at the same or slightly lower frequency and the rest of the difference could be explained by the microarchitectural improvements, the larger CPU caches, the faster and more numerous memory channels (I am assuming an 8 x 8 GB configuration), and last, but not least, the newer kernel release. The plaintext numbers seem to imply that the effect of the speculative execution vulnerability mitigations is not as bad as before because the gap between libh2o and faf decreased significantly, but that might be purely due to the kernel.

It seems that we still have the network bandwidth bottleneck for the cached queries and the plaintext tests, though in the former case only one implementation, fiber prefork, so far scaled perfectly, so it is probably not much of a problem.

> I fiddled with a bunch of setups, but what seems to be working at the moment is having the switch powered down, and plugging the fiber directly. So, App is connected to Database on 10.0.0.x, and App is connected to Client on 10.0.1.x. I tested this setup with iperf as I did with the switch and saw not appreciable difference in throughput, so I am hoping this is a fair way to test. VERY OPEN TO COMMENT HERE!

I am assuming that the network adapter on the application server is dual-ported, in which case wouldn't this be a superior configuration? If the machine is connected to a switch via a single port, then the traffic both from the load generator and the database would pass through the same link, so there might be some interference, while in the current configuration everything would be nicely isolated.

@mkvalor

mkvalor commented Mar 31, 2024

@sebastienros Thanks for clarifying the number of physical cores later in the thread. Would you be willing to re-edit the 6th comment here, with the specs, so the top text does not continue to say, "56 physical cores, 1 socket, 1 NUMA, 64 GB RAM"? I fear some who read this will view that 'headline' and perhaps miss the later clarification.

@synopse

synopse commented Apr 1, 2024

The run did fail, and is aborting:

791/791 frameworks tested (last was zysocket-v)
398 frameworks started and stopped cleanly
393 frameworks had problems starting or stopping

Some details:

    "martian": "20240331220730",
    "martini": "error during test: [Errno 28] No space left on device",
    "may-minihttp": "ERROR: Problem starting may-minihttp",
    "microdot": "ERROR: Problem starting microdot",
    "microdot-async": "ERROR: Problem starting microdot-async",
    "microdot-async-raw": "ERROR: Problem starting microdot-async-raw",
    "microdot-raw": "ERROR: Problem starting microdot-raw",
    "microhttp": "ERROR: Problem starting microhttp",
    "microhttp-mysql": "ERROR: Problem starting microhttp-mysql",
    "micronaut": "ERROR: Problem starting micronaut",
    "micronaut-data-jdbc": "ERROR: Problem starting micronaut-data-jdbc",
    "micronaut-data-jdbc-graalvm": "ERROR: Problem starting micronaut-data-jdbc-graalvm",
    "micronaut-data-mongodb": "ERROR: Problem starting micronaut-data-mongodb",
... and all following frameworks were abandoned.

Too much Martini, perhaps, or shaken when it should have been stirred, as agent 007 would put it.
"martini": "error during test: [Errno 28] No space left on device"
I guess something like a wrong partition (e.g. a small root partition) was used for the log storage.

About the hardware and storage (it has nothing to do with this problem): I wonder why these servers have huge CPU, RAM, and network capacity, but a slow SATA drive. At least for the DB, the number of IOs does make a difference.

@joanhey
Contributor

joanhey commented Apr 1, 2024

@synopse the database data in this bench is very small and will always fit in memory. And that's correct for a framework benchmark: we don't want to benchmark the database server's disk.

I already have new database configs ready for this big server, but they'll go in after the next run, once all the databases have updated versions, to isolate the effect of the new versions from that of the new configs.

@volyrique the vulnerability mitigations are still a big performance problem.
And the kernel helps less than the new CPU does.
The new CPU is not affected by Meltdown, Retbleed, etc., so it doesn't need those mitigations.

Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Mitigation; Clear CPU buffers; SMT vulnerable
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
  Srbds:                 Not affected
  Tsx async abort:       Not affected

IMO the only solution is to replace the CPUs that have these vulnerabilities, to get good performance again.

@itrofimow
Contributor

@joanhey I'm pretty sure that updates generate a significant load on the disk, even with a minimal WAL level.

Should we just create the World table as UNLOGGED?
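For context, that change would be a one-liner in PostgreSQL. A sketch, where the database and table names are assumptions following the benchmark's conventions, not confirmed by this thread; UNLOGGED tables skip write-ahead logging, which cuts disk I/O for updates but loses the table's contents after a crash, trading durability for speed:

```shell
# Hypothetical sketch: switch the benchmark's World table to UNLOGGED.
# UNLOGGED tables bypass WAL writes, so updates generate far less disk I/O,
# but the table is truncated after a crash -- fine for a benchmark,
# not for production.
psql -d hello_world -c 'ALTER TABLE "World" SET UNLOGGED;'
```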

@joanhey
Contributor

joanhey commented Apr 1, 2024

I think that the database discussion is for another Issue.
But first we need to wait for the next database versions and configs.

@franz1981
Contributor

The run seems stuck... I would like to check the failures for Netty/Vert.x and Quarkus (of which I am a developer), because in our CI tests we didn't see anything similar.

Related to this being a NUMA CPU: I have to double-check, but I think it is a kind of NUMA arch. Or rather, there is no partitioning of memory, but (last-level) cache accesses have heterogeneous access costs. On my local machine (Ryzen 7950X) I had to enable it; more info at https://www.reddit.com/r/Amd/comments/ce6pj9/ccd_equivalent_to_numa_in_functionality/

@joanhey
Contributor

joanhey commented Apr 1, 2024

@franz1981 after the Martini framework failed with
"martini": "error during test: [Errno 28] No space left on device"
all subsequent frameworks failed, so there is no need to check those failures.

@synopse

synopse commented Apr 1, 2024

> @synopse the database data in this bench is very small and it'll fit in memory always. And it's correct for a framework benchmark. We don't want to bench the HD from the database server.

@joanhey In production (and we would like to reproduce a production state, right?) we should enable fsync on PostgreSQL, so that writes wait for the data to actually be stored on disk, not just changed in memory:
https://postgresqlco.nf/doc/en/param/fsync/
Even if the data is small enough to fit in memory, it is still written to disk, and we would need to wait for fsync.
Here a fast NVMe SSD makes a difference compared to the SATA SSD in this setup.
We could expect better updates performance with the new hardware: updates are somewhat slow compared to the other tests in the current run.

Anyway, we have to make everything pass and run all the tests before trying to get the most out of the hardware.
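For reference, both durability settings relevant to this discussion default to on in PostgreSQL, so a production-like benchmark only needs to leave them alone. A minimal postgresql.conf sketch:

```
# postgresql.conf -- durability settings relevant to the fsync discussion.
fsync = on                # writes wait for WAL to reach stable storage (default)
synchronous_commit = on   # commits wait for the WAL flush (default)
```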

@volyrique
Contributor

> We could expect better updates performance with a new hardware: updates are somewhat slow in respect to other tests in the current run.

Are they really? The speedup in the database updates test is in line with the one in the multiple queries test (i.e. 2-3 times faster) - just check axum [postgresql], h2o, and just-js (I am looking at the fastest results because they are the least likely to have another scalability bottleneck on the software side in the framework implementation). A fast NVMe SSD exposes more parallelism than a SATA one, but I don't think that the tests have a level of concurrency that is affected enough by this potential bottleneck; neither do I think that we are bandwidth-limited.

Obviously, we can't expect the database updates test to have the same performance as the multiple queries one - it must be slower.

@joanhey
Contributor

joanhey commented Apr 1, 2024

I'll say it again:

> I think that the database discussion is for another Issue.
> But first we need to wait for the next database versions and configs.

PS: open new issues to discuss it!
The benchmark has a life of its own, and it will never, ever suit everyone. But all frameworks play by the same rules (servers, configs, ...).

@NateBrady23
Member Author

Sorry folks. This was a partitioning mistake. It's been fixed and we've restarted the Round 22 parity run.

@msmith-techempower
Member

Howdy!

The latest run completed successfully (and didn't run out of disk space this time >_<) and can be inspected here. This round was run against the same commit as Round 22, but with Ubuntu 22, the HWE kernel, and the direct fiber networking.

It looks like everything is operating smoothly. Please feel free to report anything out of the ordinary, or to ask questions. The newest continuous run uses the latest pull from GitHub, so it will include everything merged as of this morning.

I THINK we are about ready to close this ticket, but I will leave it open for a bit longer while this next run is going.

Thanks again for the ongoing support and patience.

@volyrique
Contributor

@msmith-techempower I have just one comment - h2o reported the kernel version as Linux 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024, so it looks like the HWE kernel was not running for some reason. Of course, that makes the comparison with the previous hardware even more precise 😄.

The only weirdness I have noticed in the results is the fiber-prefork result in the cached queries test, but it doesn't seem to imply any kind of issue with the benchmarking environment setup, so I wouldn't comment on it any further.

@msmith-techempower
Member

@volyrique I believe I may have jumped the gun on this one. I thought that I had installed HWE initially, but then wanted to double-check, so I stopped the current run and installed it as recommended via sudo apt install linux-generic-hwe-22.04 on all the machines. Now, uname -a says Linux tfb-server 6.5.0-27-generic #28~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Mar 15 10:51:06 UTC 2 x86_64 x86_64 x86_64 GNU/Linux, which I think is right. I'm kicking off another run now, but let me know if there was something else we were expecting.

@p8
Contributor

p8 commented Apr 18, 2024

Hi. It seems the dashboard is currently stuck. It hasn't updated in almost a day.

@msmith-techempower
Member

@p8 Yeah, I'm troubleshooting this... we're experiencing thermal issues again and the server decided to power itself down late yesterday. Honestly, we have these in a small rack that has airflow problems, and it seems like these new machines have lower heat tolerances than the previous ones. We're weighing our options, but it's hard to say when we will get continuous runs back in the short term.

@p8
Contributor

p8 commented Apr 18, 2024

Thanks @msmith-techempower !

@joanhey
Contributor

joanhey commented Apr 19, 2024

In the meantime: where are the cloud benchmarks?

We don't need a continuous run there; twice a year or quarterly would do!
Many frameworks only optimize for big enterprise servers, while the majority of users run more modest servers, like the ones in the cloud benchmarks!

@NateBrady23
Member Author

We do not have credits/funding for cloud benchmarks, nor do we have the infrastructure set up.

If someone wants to support that, including the time to maintain, we'd absolutely be open to having that discussion.

@joanhey
Contributor

joanhey commented Apr 23, 2024

Some clouds give free servers to open source projects; perhaps we could ask (Azure, Amazon, DigitalOcean, ...).

🤔 But we have another option, now that we use Docker.
Any container can be limited in CPU and memory.
So each quarter we could do a limited run (simulating a cloud server). And without thermal issues :)

We only need to decide the number of CPUs and the amount of memory to use.
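Such a limited run could be sketched with plain Docker flags; the CPU count, memory size, and image name below are placeholders, not the benchmark's actual invocation:

```shell
# Constrain a framework's container to 2 CPUs and 4 GiB of RAM,
# roughly simulating a small cloud instance:
docker run --cpus=2 --memory=4g --rm hypothetical-framework-image
```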

@synopse

synopse commented Apr 23, 2024

> So we can run each quarter a limited run (simulating a cloud server). And without thermal issues :)

Limiting CPU and RAM is not all a cloud server does. A RAM limit changes nothing for most solutions (except perhaps Java). And a cloud setup is very likely to have a separate cloud DB instance hosted by the provider, with specific network characteristics, and so on.
Also, limiting CPU and memory won't necessarily fix the thermal issues: it might prevent the shutdowns, but not the CPU throttling. There should be no thermal issues at all, or the whole idea of fair benchmarking is pointless.

@joanhey
Contributor

joanhey commented Apr 23, 2024

I said "simulating a cloud server", but we can change that to "simulating commodity servers".

And yes, the CPU throttling is a real problem right now.

@joanhey
Contributor

joanhey commented Apr 23, 2024

@synopse you are a newcomer with a fast framework!
This bench has existed for more than 10 years, during which the TechEmpower people (and others) have given their free time, for free, to keep the benchmark fair!

Problems always come up; they had the same thermal issues with the old servers, and they eventually fixed them.
If we can help, that's better than just saying what we don't like!
@NateBrady23 open a GitHub Sponsors page for this bench!

@msmith-techempower
Member

Update: the application server shut down due to heat again. I'm still looking into resolutions, but this is a blocker for the moment.

@msmith-techempower
Member

Update to the update: I have redone the geometry of the machines, wires, and whatnot in the small rack and tried a few things to improve airflow. I kicked off another run (and I see results coming into TFBStatus now) and will continue to monitor throughout the week.

I am 99% certain that we will need to install some additional airflow measures (intake/exhaust fans, push/pull setup, etc), but we will cross that bridge when we get to it.

@msmith-techempower
Member

Okay, closing this issue. It looks like we're reliably getting good runs with the thermal issues under control (we'll see when summer rolls in). Feel free to follow up here if you have questions or concerns; otherwise, it'll be business as usual moving forward.
