feat(Azure): multiprocessing #53

Draft
wants to merge 13 commits into worker-pool-v4.0
Conversation

ericbutera
Contributor

@ericbutera ericbutera commented Oct 25, 2023

Add multiprocessing support to the Azure provider.

This PR should be merged into branch worker-pool-v4.0 when ready.

- aws_endpoint_url setting
- add max proc setting
- test scan_all pool fix
- pool can't be pickled; need to figure out the self.pool issue
- introduced AwsScanContext to manage the state of the outer loop in an easier way inside workers (see the sketch after this list)
- so many TODOs for refactoring class state into context
- scan context everywhere
- introduce cloud events
- emit payloads
- hacky Aurora client
- research log line showing provider info
- add provider to log (aws only for now)
- change add_seed to use a list and submit_seed_payload
- change add_cloud_asset to use a map + submit_cloud_asset_payload
- rough and ready aurora client
- remove unused comment code
- add temp_sts_credential to ctx
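
As a reference for the AwsScanContext and self.pool items above, here is a minimal sketch of one way this could look. The field names and class shape are assumptions for illustration only, not the connector's actual API:

# Hypothetical sketch: keep the multiprocessing pool out of instance state so
# the connector stays picklable, and pass a small context object to each worker.
from dataclasses import dataclass, field
from multiprocessing import Pool
from typing import Optional


@dataclass
class AwsScanContext:
    """Assumed shape: the outer-loop state each worker needs."""
    account_number: str
    region: str
    temp_sts_credential: Optional[dict] = None
    possible_labels: set = field(default_factory=set)


class AwsConnector:
    def scan_all(self, contexts: list) -> None:
        # Create the pool locally instead of storing it on self; an object that
        # holds a Pool (or its locks) cannot be pickled and sent to workers.
        with Pool(processes=4) as pool:
            pool.map(self.scan, contexts)

    def scan(self, ctx: AwsScanContext) -> None:
        # Only the small, picklable context crosses the process boundary.
        print(f"scanning account {ctx.account_number} in {ctx.region}")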
@ericbutera ericbutera added the do not merge This PR is not ready to be merged label Oct 25, 2023
@ericbutera ericbutera self-assigned this Oct 25, 2023
for provider_setting in provider_settings.values():
# this is so confusing - plural settings to setting?
Contributor
maybe for azure_settings in provider_settings.values()?
or for azure_entry in provider_settings.values()?
or for provider_entry in provider_settings.values()?

Contributor Author

I think changing the context property name from provider_settings to provider_setting would make the most sense. It's not a big deal, but I tripped on it a couple of times. AWS and GCP are probably doing the same.


# TODO: figure out how to make this wait until scan is finished:
Contributor

**After writing this, option 2 is definitely better and is a fine solution. Figured I would leave my thought process here, though.**


I don't think this solution is compatible with multiprocessing. Since we don't scan by region, but region is part of the label, the options I see are:

  1. setting a TTL on the asset and only submitting what we find
    Pros: easy
    Cons: could leave seeds "stale" for up to 48 hours (it's probably ~24 hours now depending on attribution)
  2. One label will correspond to one process in this case, so we should be able to move the submission of empty payloads for labels_not_found into the single process. In that case it would go right after super().scan(**kwargs) in self.scan(). If we split up seeds/assets like we talked about for aws/gcp:
scan_all()
  scan_seeds()
    super().scan_seeds()
      get_seeds()
  scan_cloud_assets()
    super().scan_cloud_assets()
      get_cloud_assets()

then it would be at the end of scan_seeds(), after it calls super().scan_seeds(), where "it" is:

if self.scan_all_regions:
  for label_not_found in scan_context.possible_labels:
    self.delete_seeds_by_label(label_not_found)
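
Roughly, that placement could look like the sketch below; the class and method names are assumed from the outline above, and the base class is just a stub:

# Illustrative placement only; names are assumed from the outline above.
class CloudConnector:
    def scan_seeds(self, **kwargs) -> None:
        ...  # would call get_seeds() and submit the seeds it finds


class AwsCloudConnector(CloudConnector):
    scan_all_regions = True

    def scan_seeds(self, **kwargs) -> None:
        scan_context = kwargs["scan_context"]
        super().scan_seeds(**kwargs)

        # One label maps to one process here, so the leftover-label cleanup can
        # run right after the seed scan instead of waiting for scan_all().
        if self.scan_all_regions:
            for label_not_found in scan_context.possible_labels:
                self.delete_seeds_by_label(label_not_found)

    def delete_seeds_by_label(self, label: str) -> None:
        ...  # submit an empty payload for a label that was not found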

Contributor Author

Let's talk about this one some more.

My concern is that we will further split up how the connector works over time. I don't know that we can rely on an account or any level of region being in a single process over time. I could see each resource type being spawned off as well.

Another strategy could be to keep track of a "run id", or create a single value for the entire run. Upon completion, we could remove any items that aren't part of the current run (see the sketch below).

The cloud provider change streams might be a better way to handle this prune routine as well.

It will be hard to manage our concurrency patterns with features like this over time. With large enough customers we might not be able to hold all of the possible labels in RAM either. I frequently saw Out Of Memory (OOM) errors due to patterns like this while trying to build ingestion pipelines, and those instances had something like 16 GB of RAM.
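
For the "run id" idea, a minimal sketch under the assumption of a store with tag-and-delete helpers (nothing here is the connector's real API):

# Hypothetical run-id pruning: every payload submitted during a run carries the
# same run_id, and a final step deletes anything tagged with an older run.
import uuid


class RunIdPruner:
    def __init__(self, store):
        # `store` is an assumed seed/asset store exposing submit() and
        # delete_where_run_id_not(); both are placeholders for illustration.
        self.store = store
        self.run_id = uuid.uuid4().hex

    def submit(self, payload: dict) -> None:
        # Tag each payload with the current run so stale items are identifiable.
        self.store.submit({**payload, "run_id": self.run_id})

    def prune(self) -> None:
        # After all workers finish, drop items from previous runs without having
        # to hold every possible label in memory.
        self.store.delete_where_run_id_not(self.run_id)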

self.logger.info(
f"Scanning AWS account {self.account_number} in region {self.region}"
f"Scanning AWS - account:{scan_context.account_number} region:{scan_context.region}"
Contributor
Maybe specify "scanning for seeds" here? Otherwise we'll have the same log twice and not know which was for seeds vs. cloud assets. Not sure if that's a big deal.
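
Something like the sketch below would make the two phases distinguishable; the phase argument and helper are assumptions, not existing code:

import logging


def log_scan_start(logger: logging.Logger, phase: str, scan_context) -> None:
    # `phase` would be "seeds" or "cloud assets", so the otherwise-identical
    # log lines can be told apart.
    logger.info(
        "Scanning AWS %s - account:%s region:%s",
        phase,
        scan_context.account_number,
        scan_context.region,
    )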

- credential passing
- possible labels + delete seeds by label
- cloud asset use label prefix
- healthcheck log errors if dry run enabled