
Added run_batch functionality for the web mode solver, added parallel… #1654

Closed
prashkh wants to merge 2 commits

Conversation


@prashkh prashkh commented May 1, 2024

…ization using joblib for the local mode solver, and added new functions for unit testing.

Since there was a mix-up between the develop branch and pre/2.7 when I rebased, I created a new PR and implemented all the changes you requested. I also added parallelization for the local mode solver and created two new unit tests, one for the local mode solver and one for the web run_batch.

…ization using joblib for the local mode solver, and added new functions for unit testing
@prashkh prashkh requested a review from tylerflex May 1, 2024 01:18
for i in range(num_of_sims):
    results[i] = mode_solver_list[i].solve()

assert any(isinstance(x,ModeSolverData) for x in results)
Author

This should be all instead of any.

# Run mode solver one at a time
results = msweb.run_batch(mode_solver_list, verbose = False)

assert any(isinstance(x,ModeSolverData) for x in results)
Author

This should be all instead of any. Will fix it when I incorporate all other suggestions.
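
For reference, the corrected check would be (a sketch of the fix described above, not the committed code):

```python
# Every result returned by the batch should be a ModeSolverData instance.
assert all(isinstance(x, ModeSolverData) for x in results)
```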

Collaborator

@tylerflex tylerflex left a comment

Looking good! A few new comments, and I'd recommend getting another reviewer (maybe Lucas?). Also, you'll need to use black to format the code for the tests to pass. Check out the link to the wiki that I sent for details. Thanks!

CHANGELOG.md Outdated
@@ -26,6 +26,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Properties `num_time_steps_adjoint` and `tmesh_adjoint` to `JaxSimulation` to estimate adjoint run time.
- Ability to add `path` to `updated_copy()` method to recursively update sub-components of a tidy3d model. For example `sim2 = sim.updated_copy(size=new_size, path="structures/0/geometry")` creates a recursively updated copy of `sim` where `sim.structures[0].geometry` is updated with `size=new_size`.
- Python 3.12 support. Python 3.8 deprecation. Updated dependencies.
- A batch of ModeSolver objects can be run concurrently using `tidy3d.plugins.mode.web.run_batch()`
Collaborator

Please put `ModeSolver` inside single backticks.

CHANGELOG.md Outdated
@@ -36,6 +37,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Changed sign of mode solver PML conductivity to match the `exp(-1j * omega * t)` convention used everywhere else. Previously, loss due to PML was coming out as negative `k_eff` of the mode, while now it comes out as positive `k_eff`, in line with material loss.
- Augmented the PML mode solver profile to the form `s(x) = kappa(x) + 1j * sigma(x) / (omega * EPSILON_0)`. Previously, `kappa` was not used. The parameters are currently set to `kappa_min = 1`, `kappa_max = 3`, `sigma_max = 2`. These are not yet exposed to the user.
- Also scaling `sigma_max` by the average relative impedance in the mode solver PML region.
- Changed `_solve_all_freqs` function of the local mode solver code to run mode solving in parallel.
Collaborator

Since this is a private method (leading _ in the name), maybe this won't mean much to users. I wonder if we can just say "the local mode solver will run in parallel over each frequency".


# Check type to make sure the user is submitting a list of mode solvers
if not all(isinstance(x, ModeSolver) for x in mode_solvers):
    log.error("[red]Stopped running a batch of mode solver simulations. Please provide a list of mode solvers of type [cyan]`td.plugins.mode.mode_solver.ModeSolver` [red]to run a batch of mode solver simulations.")
Collaborator

We usually just raise ValueError; let's change to that.

The error message could be more descriptive, such as:

mode_solvers passed to mode.web.run_batch() should be a list containing all ModeSolver instances. Can't run the batched mode solver.
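
For illustration, the suggested change could look roughly like this (a sketch based on the review comment, not necessarily the final code in the PR):

```python
# Validate the input and raise rather than logging an error.
if not all(isinstance(x, ModeSolver) for x in mode_solvers):
    raise ValueError(
        "'mode_solvers' passed to 'mode.web.run_batch()' should be a list "
        "containing all 'ModeSolver' instances. Can't run the batched mode solver."
    )
```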

results_file = results_files[index],
verbose = False,
progress_callback_upload = progress_callback_upload,
progress_callback_download = progress_callback_download
Collaborator

The style guide we follow usually doesn't put spaces around the = in kwargs. You can automatically format the code using the black tool; see our Notion page in the EM wiki. In fact, the tests won't pass until the code is unchanged under black and ruff. Make sure you have the version prescribed in pyproject.toml.

https://www.notion.so/flexcompute/Python-Style-Guide-afac5dbbf949463e89c18fe39b7453c9
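
For example, black would format the keyword arguments in the snippet above roughly like this (the surrounding call name is hypothetical; only the kwarg spacing matters):

```python
run_mode_solver(  # hypothetical call site
    results_file=results_files[index],
    verbose=False,
    progress_callback_upload=progress_callback_upload,
    progress_callback_download=progress_callback_download,
)
```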

    time.sleep(retry_delay)
    handle_mode_solver(index, retries + 1)
else:
    log.warning(f"The {index}-th mode solver failed after {max_retries} tries.")
Collaborator

What are we going to do about it? Maybe add another sentence saying that the data returned by this mode solver will be None?
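
For instance, the warning could be extended along these lines (a sketch using the names from the snippet above):

```python
log.warning(
    f"The {index}-th mode solver failed after {max_retries} tries. "
    "The data returned for this mode solver will be 'None'."
)
```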

@tylerflex
Collaborator

Another random thought: is it worth building in a safeguard in case this fails, at least in the local mode solver? Like a try/except, so that if the multi-threaded run fails, we just warn and use the original serial path?

@prashkh prashkh requested a review from lei-flex May 1, 2024 18:38
@prashkh
Author

prashkh commented May 1, 2024

@tylerflex that makes sense, so we can decouple running the mode solver from potential dependency issues with joblib. I can implement that as well.
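
A rough sketch of what such a fallback could look like (hypothetical; the names are taken from the snippets quoted elsewhere in this thread, and the except clause is intentionally broad):

```python
from joblib import Parallel, delayed

try:
    # Multi-process solve over frequencies via joblib.
    solver_data = Parallel(n_jobs=n_jobs)(
        delayed(self._solve_single_freq)(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    )
except Exception:
    # If the parallel path fails for any reason, warn and fall back to the serial loop.
    log.warning("Parallel mode solve failed; falling back to the serial frequency sweep.")
    solver_data = [
        self._solve_single_freq(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    ]
```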

@prashkh
Author

prashkh commented May 1, 2024

@tylerflex another thing: when I run parallel processing for the local mode solver I am using all but one core. The mode solver runs fast, but I can hear the fan spinning up even on my Mac for large frequency sweeps. I suppose this is fine for usability?

@@ -10,6 +10,9 @@
import pydantic.v1 as pydantic
import xarray as xr

from joblib import Parallel, delayed
Collaborator

joblib will have to be added as a dependency in pyproject.toml, right? Curious why you chose to use that instead of the native Python multiprocessing? As far as I understand, joblib seems to use that under the hood?

One thing I'm noticing is that you're not explicitly setting the OpenMP, etc., thread counts to 1. This was a problem when you were doing multiprocessing. Did using joblib somehow remove the need for that?

Author

@prashkh prashkh May 2, 2024

@momchil-flex I initially tried using multiprocessing but was getting errors like `AttributeError: Can't pickle local object 'ModeSolver._solve_all_freqs..'`. I was also getting errors when the make_simulation function that creates the mode solver loads GDS files. I tried messing around with multiprocessing, including using libraries like dill for serializing, but couldn't make it work. Multithreading for concurrency works fine, but since this task is compute-bound we want multiprocessing. Then I found this library, and it worked right away and was embarrassingly simple, like they say on their website :) By default it uses backend="loky", which spawns multiple worker processes, with n_jobs defining the number of CPU cores it uses. You can also switch to backend="threading" for multithreading, but that is not what we want for processing multiple frequencies in parallel. More details here: https://joblib.readthedocs.io/en/latest/generated/joblib.Parallel.html
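
For readers unfamiliar with joblib, the pattern being described is roughly the following (a minimal, self-contained illustration, not the PR code):

```python
from joblib import Parallel, delayed

def square(x):
    return x * x

# backend="loky" (the default) spawns separate worker processes;
# n_jobs sets how many of them run concurrently.
results = Parallel(n_jobs=2, backend="loky")(delayed(square)(i) for i in range(10))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```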

Collaborator

Yeah, I could help you get around the "can't pickle object" problem, which we've hit multiple times by now (and the answer is not to use dill, even though that's what you find on StackOverflow :) You usually just need to re-arrange things a bit). But I do like this joblib library and we may even switch to it in other places! It seems pretty light and has no dependencies for basic functionality.

Author

Yeah, would love to learn how you fixed the "can't pickle object" error. joblib just looks simple and very readable!

coords = coords,
symmetry = symmetry)
for freq in self.freqs
)
Collaborator

Interestingly, we do something very similar on the backend, which is why we have this _solve_single_freq in the first place. We use multiprocessing there. I'm even wondering if we shouldn't just bring the backend code here. But yeah, the explicit thread setting is generally required or else things can get substantially slower. Similarly, one reason I didn't want to put this on the frontend is that it can get stuck or crash if the user runs out of memory (big mode solves in parallel). Do you think this is a problem in your implementation?

Collaborator

It does seem like joblib provides a mechanism to avoid the over-subscription, but it doesn't seem like you are using it?

https://joblib.readthedocs.io/en/latest/parallel.html#avoiding-over-subscription-of-cpu-resources

Author

@prashkh prashkh May 2, 2024

@momchil-flex So, what I am doing is looking at the number of cores the local computer has. If the number of cores is greater than 2, I use all but one core for parallel computation. So, if I have 4 cores, the code uses 3 of them.
If the backend code uses multiprocessing, I would assume the same implementation also works on the frontend. I have tested this with hundreds of frequency simulations, but have not tested it with big mode solves in parallel. Do you have some examples I can use to test this? I can also just make a really fine mesh and see how it does?
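
The core-count logic described here amounts to something like the following (a sketch with hypothetical variable names):

```python
import os

cpu_count = os.cpu_count() or 1
# Use all but one core when more than two are available, otherwise a single worker.
n_jobs = cpu_count - 1 if cpu_count > 2 else 1
```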

Collaborator

There are two things here. The first one is memory. I think you should definitely test with some large mode planes; multiprocessing does not handle that smoothly, and it can definitely freeze if you run out of memory. So on the backend we do some checks to estimate how much memory will be needed in order to decide how many cores to use.

The other point is about disabling multi-threading within each process to avoid over-subscription, which can completely slow things down. This is what the joblib link I posted is about. I remember you experienced this in your initial testing (you needed to set the env variables I sent you to 1), but here you don't seem to be doing that. So I'm wondering what's happening: whether it's joblib doing something (which it seems to be able to, but you're not using that functionality as far as I can tell), or something else.
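
For context, the env-variable approach mentioned here typically looks like the following (the exact variable list is an assumption; the limits generally need to be set before numpy/scipy initialize their thread pools):

```python
import os

# Force single-threaded math libraries inside each worker process.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "NUMEXPR_NUM_THREADS"):
    os.environ[var] = "1"
```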

Author

@momchil-flex okay, that makes sense to set inner_max_num_threads=1. Apparently, for the 'loky' backend, "by default each worker process will have environment variables set to allow a maximum of cpu_count() // n_jobs" threads (which is 1 in my case with 12 CPUs and 12-1 = 11 n_jobs), "so that the total number of threads used by all the workers does not exceed the number of CPUs of the host."

This is probably why it was working fine.

So now I have something like this, and it works the same as before:

with parallel_config(backend="loky", inner_max_num_threads=1):
    results = Parallel(n_jobs=n_jobs)(
        delayed(self._solve_single_freq)(freq=freq, coords=coords, symmetry=symmetry)
        for freq in self.freqs
    )

Author

@momchil-flex okay, I tried it with a crazy fine resolution of min_steps_per_wvl=1000 compared to 15 before (so around 4400 times more grid points). This time it crashed, saying the system is out of memory. We need to implement checks similar to what you had in the backend, then. Could we discuss this tomorrow?

results = msweb.run_batch(mode_solver_list, verbose = False)

assert any(isinstance(x,ModeSolverData) for x in results)
assert (results[i].n_eff.shape == (num_freqs, i+1) for i in range(num_of_sims))
Collaborator

This should also be all(...)
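
The corrected assertions would look roughly like this (a sketch of the fix; note that the bare generator in the second assert above is always truthy, so wrapping it in all(...) is what actually evaluates the condition):

```python
assert all(isinstance(x, ModeSolverData) for x in results)
assert all(results[i].n_eff.shape == (num_freqs, i + 1) for i in range(num_of_sims))
```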

results[i] = mode_solver_list[i].solve()

assert any(isinstance(x,ModeSolverData) for x in results)
assert (results[i].n_eff.shape == (num_freqs, i+1) for i in range(num_of_sims))
Collaborator

This should also be all(...)

…changes to local mode solver, updated errors in test_mode_solver_web_run_batch test function
@prashkh
Author

prashkh commented May 6, 2024

@tylerflex implemented all the feedback and ran both black and ruff according to the Notion instructions. I made a new commit; could you please review it? Also, I see that the black tool made a lot of small code changes. Hope this is okay. Let me know if everything is good now.

@tylerflex
Collaborator

Hi @prashkh, I should have warned you about this, but we need to use a specific version of black as defined in pyproject.toml. I think the version you used created a ton of extra changes, which will unfortunately clutter our code history, so we'll probably need to remove that commit and re-run with the correct version of black.

I can take a more detailed look tomorrow if you don't feel totally comfortable with it, sorry for the inconvenience :(

@prashkh
Author

prashkh commented May 7, 2024

@tylerflex I tried to fix it by reverting to the commit before this one, but I'm not sure it worked. I might need your help tomorrow. Sorry about it; I should have used the correct black and ruff versions.

@tylerflex
Collaborator

Hey @prashkh, check out #1685 and see if this PR contains all the changes you'd like made. If so, we can try fixing this PR up to incorporate them, or even just merge that PR.

@tylerflex
Collaborator

Closing this PR as it seems we're going with #1685.

@tylerflex tylerflex closed this May 22, 2024