
Added run_batch functionality for the web mode solver, added parallel… #1654

Closed
prashkh wants to merge 2 commits

Conversation


@prashkh prashkh commented May 1, 2024

…ization using joblib for the local mode solver, and added new functions for unit testing.

Since there was a mix-up between the develop branch and pre/2.7 when I rebased, I created a new PR and implemented all the changes you requested. I also added parallelization for the local mode solver and created two new unit tests, one for the local mode solver and one for the web run_batch.

…ization using joblib for the local mode solver, and added new functions for unit testing
@prashkh prashkh requested a review from tylerflex May 1, 2024 01:18
for i in range(num_of_sims):
    results[i] = mode_solver_list[i].solve()

assert any(isinstance(x,ModeSolverData) for x in results)
Author

This should be all instead of any.

# Run mode solver one at a time
results = msweb.run_batch(mode_solver_list, verbose = False)

assert any(isinstance(x,ModeSolverData) for x in results)
Author

This should be all instead of any. Will fix it when I incorporate all other suggestions.
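
For reference, the corrected check would be (a sketch of the fix described above, not the committed code):

```python
# Every result returned by the batch should be a ModeSolverData instance.
assert all(isinstance(x, ModeSolverData) for x in results)
```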

Collaborator

@tylerflex tylerflex left a comment

Looking good! A few new comments, and I'd recommend getting another reviewer (maybe Lucas?). Also, you'll need to use black to format the code for the tests to pass. Check out the link to the wiki that I sent for details. Thanks!

CHANGELOG.md Outdated
@@ -26,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Properties `num_time_steps_adjoint` and `tmesh_adjoint` to `JaxSimulation` to estimate adjoint run time.
- Ability to add `path` to `updated_copy()` method to recursively update sub-components of a tidy3d model. For example `sim2 = sim.updated_copy(size=new_size, path="structures/0/geometry")` creates a recursively updated copy of `sim` where `sim.structures[0].geometry` is updated with `size=new_size`.
- Python 3.12 support. Python 3.8 deprecation. Updated dependencies.
- A batch of ModeSolver objects can be run concurrently using `tidy3d.plugins.mode.web.run_batch()`
Collaborator

Please put `ModeSolver` inside single backticks.

CHANGELOG.md Outdated
@@ -36,6 +37,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Changed sign of mode solver PML conductivity to match the `exp(-1j * omega * t)` convention used everywhere else. Previously, loss due to PML was coming out as negative `k_eff` of the mode, while now it comes out as positive `k_eff`, in line with material loss.
- Augmented the PML mode solver profile to the form `s(x) = kappa(x) + 1j * sigma(x) / (omega * EPSILON_0)`. Previously, `kappa` was not used. The parameters are currently set to `kappa_min = 1`, `kappa_max = 3`, `sigma_max = 2`. These are not yet exposed to the user.
- Also scaling `sigma_max` by the average relative impedance in the mode solver PML region.
- Changed `_solve_all_freqs` function of the local mode solver code to run mode solving in parallel.
Collaborator

Since this is a private method (leading _ in the name), maybe this won't mean much to users. I wonder if we can just say "the local mode solver will run in parallel over each frequency".


# Check type to make sure the user is submitting a list of mode solvers
if not all(isinstance(x, ModeSolver) for x in mode_solvers):
    log.error("[red]Stopped running a batch of mode solver simulations. Please provide a list of mode solvers of type [cyan]`td.plugins.mode.mode_solver.ModeSolver` [red]to run a batch of mode solver simulations.")
Collaborator

We usually just raise ValueError; let's change to that.

The error message could be more descriptive, such as:

mode_solvers passed to mode.web.run_batch() should be a list containing all ModeSolver instances. Can't run the batched mode solver.
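
For illustration, the suggested change could look roughly like this (a sketch based on the review comment, not necessarily the final code in the PR):

```python
# Validate the input and raise rather than logging an error.
if not all(isinstance(x, ModeSolver) for x in mode_solvers):
    raise ValueError(
        "'mode_solvers' passed to 'mode.web.run_batch()' should be a list "
        "containing all 'ModeSolver' instances. Can't run the batched mode solver."
    )
```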

results_file = results_files[index],
verbose = False,
progress_callback_upload = progress_callback_upload,
progress_callback_download = progress_callback_download
Collaborator

The style guide we follow usually doesn't put spaces around the = in kwargs. You can automatically format the code using the black tool; see our Notion page in the EM wiki. In fact, the tests won't pass until the code is unchanged under black and ruff. Make sure you have the version prescribed in pyproject.toml.

https://www.notion.so/flexcompute/Python-Style-Guide-afac5dbbf949463e89c18fe39b7453c9
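
For example, black would format the keyword arguments in the snippet above roughly like this (the surrounding call name is hypothetical; only the kwarg spacing matters):

```python
run_mode_solver(  # hypothetical call site
    results_file=results_files[index],
    verbose=False,
    progress_callback_upload=progress_callback_upload,
    progress_callback_download=progress_callback_download,
)
```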

    time.sleep(retry_delay)
    handle_mode_solver(index, retries + 1)
else:
    log.warning(f"The {index}-th mode solver failed after {max_retries} tries.")
Collaborator

What are we going to do about it? Maybe add another sentence saying that the data returned by this mode solver will be None?
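
For instance, the warning could be extended along these lines (a sketch using the names from the snippet above):

```python
log.warning(
    f"The {index}-th mode solver failed after {max_retries} tries. "
    "The data returned for this mode solver will be 'None'."
)
```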

@tylerflex
Collaborator

Another random thought: is it worth building in a safeguard in case this fails, at least in the local mode solver? Like a try/except, so that if the multi-threaded run fails, we just warn and use the original serial path?

@prashkh prashkh requested a review from lei-flex May 1, 2024 18:38
@prashkh
Author

prashkh commented May 1, 2024

@tylerflex that makes sense, so we can decouple running the mode solver from potential dependency issues with joblib. I can implement that as well.
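
A rough sketch of what such a fallback could look like (hypothetical; the names are taken from the snippets quoted elsewhere in this thread, and the except clause is intentionally broad):

```python
from joblib import Parallel, delayed

try:
    # Multi-process solve over frequencies via joblib.
    solver_data = Parallel(n_jobs=n_jobs)(
        delayed(self._solve_single_freq)(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    )
except Exception:
    # If the parallel path fails for any reason, warn and fall back to the serial loop.
    log.warning("Parallel mode solve failed; falling back to the serial frequency sweep.")
    solver_data = [
        self._solve_single_freq(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    ]
```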

@prashkh
Author

prashkh commented May 1, 2024

@tylerflex another thing: when I run parallel processing for the local mode solver I am using all but one core. The mode solver runs fast, but I can hear the fan spinning up even on my Mac for large frequency sweeps. I suppose this is fine for usability?

@@ -10,6 +10,9 @@
import pydantic.v1 as pydantic
import xarray as xr

from joblib import Parallel, delayed
Collaborator

joblib will have to be added as a dependency in pyproject.toml, right? Curious why you chose to use that instead of the native Python multiprocessing? As far as I understand, joblib seems to use that under the hood?

One thing I'm noticing is that you're not explicitly setting the OpenMP, etc., thread counts to 1. This was a problem when you were doing multiprocessing. Did using joblib somehow remove the need for that?

Author

@prashkh prashkh May 2, 2024

@momchil-flex I initially tried using multiprocessing but was getting errors like `AttributeError: Can't pickle local object 'ModeSolver._solve_all_freqs..'`. I was also getting errors when the make_simulation function that creates the mode solver loads GDS files. I tried messing around with multiprocessing, including using libraries like dill for serializing, but couldn't make it work. Multithreading for concurrency works fine, but since this task is compute-bound we want multiprocessing. Then I found this library, and it worked right away and was embarrassingly simple, like they say on their website :) By default it uses backend="loky", which spawns multiple worker processes, with n_jobs defining the number of CPU cores it uses. You can also switch to backend="threading" for multithreading, but that is not what we want for processing multiple frequencies in parallel. More details here: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html
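
For readers unfamiliar with joblib, the pattern being described is roughly the following (a minimal, self-contained illustration, not the PR code):

```python
from joblib import Parallel, delayed

def square(x):
    return x * x

# backend="loky" (the default) spawns separate worker processes;
# n_jobs sets how many of them run concurrently.
results = Parallel(n_jobs=2, backend="loky")(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```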

Collaborator

Yeah, I could help you get around the "can't pickle object" problem, which we've hit multiple times by now (and the answer is not to use dill, even though that's what you find on StackOverflow :) You usually just need to re-arrange things a bit). But I do like this joblib library and we may even switch to it in other places! It seems pretty light and has no dependencies for basic functionality.

Author

Yeah, would love to learn how you fixed the "can't pickle object" error. joblib just looks simple and very readable!

coords = coords,
symmetry = symmetry)
for freq in self.freqs
)
Collaborator

Interestingly, we do something very similar on the backend, which is why we have this _solve_single_freq in the first place. We use multiprocessing there. I'm even wondering if we shouldn't just bring the backend code here. But yeah, the explicit thread setting is generally required or else things can get substantially slower. Similarly, one reason I didn't want to put this on the frontend is that it can get stuck or crash if the user runs out of memory (big mode solves in parallel). Do you think this is a problem in your implementation?

Collaborator

It does seem like joblib provides a mechanism to avoid the over-subscription, but it doesn't seem like you are using it?

https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources

Author

@prashkh prashkh May 2, 2024

@momchil-flex So, what I am doing is looking at the number of cores the local computer has. If the number of cores is greater than 2, I use all but one core for parallel computation. So, if I have 4 cores, the code uses 3 of them.
If the backend code uses multiprocessing, I would assume the same implementation also works on the frontend. I have tested this with hundreds of frequency simulations, but have not tested it with big mode solves in parallel. Do you have some examples I can use to test this? I can also just make a really fine mesh and see how it does?
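
The core-count logic described here amounts to something like the following (a sketch with hypothetical variable names):

```python
import os

cpu_count = os.cpu_count() or 1
# Use all but one core when more than two are available, otherwise a single worker.
n_jobs = cpu_count - 1 if cpu_count > 2 else 1
```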

Collaborator

There are two things here. The first one is memory. I think you should definitely test with some large mode planes; multiprocessing does not handle that smoothly, and it can definitely freeze if you run out of memory. So on the backend we do some checks to estimate how much memory will be needed in order to decide how many cores to use.

The other point is about disabling multi-threading within each process to avoid over-subscription, which can completely slow things down. This is what the joblib link I posted is about. I remember you experienced this in your initial testing (you needed to set the env variables I sent you to 1), but here you don't seem to be doing that. So I'm wondering what's happening: whether it's joblib doing something (which it seems to be able to, but you're not using that functionality as far as I can tell), or something else.
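
For context, the env-variable approach mentioned here typically looks like the following (the exact variable list is an assumption; the limits generally need to be set before numpy/scipy initialize their thread pools):

```python
import os

# Force single-threaded math libraries inside each worker process.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"
```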

Author

@momchil-flex okay, that makes sense to set inner_max_num_threads=1. Apparently, for the 'loky' backend, "by default each worker process will have environment variables set to allow a maximum of cpu_count() // n_jobs" threads (which is 1 in my case with 12 CPUs and 12-1 = 11 n_jobs), "so that the total number of threads used by all the workers does not exceed the number of CPUs of the host."

This is probably why it was working fine.

So now I have something like this, and it works the same as before:

with parallel_config(backend="loky", inner_max_num_threads=1):
    results = Parallel(n_jobs=n_jobs)(
        delayed(self._solve_single_freq)(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    )

Author

@momchil-flex okay, I tried it with a crazy fine resolution of min_steps_per_wvl=1000 compared to 15 before (so around 4400 times more grid points). This time it crashed, saying the system is out of memory. We need to implement checks similar to what you had in the backend, then. Could we discuss this tomorrow?

results = msweb.run_batch(mode_solver_list, verbose = False)

assert any(isinstance(x,ModeSolverData) for x in results)
assert (results[i].n_eff.shape == (num_freqs, i+1) for i in range(num_of_sims))
Collaborator

This should also be all(...)
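
The corrected assertions would look roughly like this (a sketch of the fix; note that the bare generator in the second assert above is always truthy, so wrapping it in all(...) is what actually evaluates the condition):

```python
assert all(isinstance(x, ModeSolverData) for x in results)
assert all(results[i].n_eff.shape == (num_freqs, i + 1) for i in range(num_of_sims))
```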

results[i] = mode_solver_list[i].solve()

assert any(isinstance(x,ModeSolverData) for x in results)
assert (results[i].n_eff.shape == (num_freqs, i+1) for i in range(num_of_sims))
Collaborator

This should also be all(...)

…changes to local mode solver, updated errors in test_mode_solver_web_run_batch test function
@prashkh
Author

prashkh commented May 6, 2024

@tylerflex implemented all the feedback and ran both black and ruff according to the Notion instructions. I made a new commit; could you please review it? Also, I see that the black tool made a lot of small code changes. Hope this is okay. Let me know if everything is good now.

@tylerflex
Collaborator

Hi @prashkh, I should have warned you about this, but we need to use a specific version of black as defined in pyproject.toml. I think the version you used created a ton of extra changes, which will unfortunately clutter our code history, so we'll probably need to remove that commit and re-run with the correct version of black.

I can take a more detailed look tomorrow if you don't feel totally comfortable with it, sorry for the inconvenience :(

@prashkh
Author

prashkh commented May 7, 2024

@tylerflex I tried to fix it by reverting to the commit before this one, but I'm not sure it worked. I might need your help tomorrow. Sorry about it; I should have used the correct black and ruff versions.

@tylerflex
Collaborator

Hey @prashkh, check out #1685 and see if this PR contains all the changes you'd like made. If so, we can try fixing this PR up to incorporate them, or even just merge that PR.

@tylerflex
Collaborator

Closing this PR as it seems we're going with #1685.

@tylerflex tylerflex closed this May 22, 2024