feat(Azure): multiprocessing #53

Draft
wants to merge 13 commits into worker-pool-v4.0
Conversation

ericbutera
Contributor

@ericbutera ericbutera commented Oct 25, 2023

Add multiprocessing support to the Azure provider.

This PR should be merged into branch worker-pool-v4.0 when ready.

- aws_endpoint_url setting
- add max proc setting
- test scan_all pool fix
- pool can't be pickled; need to figure out the self.pool issue
- introduced AwsScanContext to manage the state of the outer loop in an easier way inside workers (see the sketch after this list)
- so many TODOs for refactoring class state into context
- scan context everywhere
- introduce cloud events
- emit payloads
- hacky Aurora client
- research log line showing provider info
- add provider to log (aws only for now)
- change add_seed to use a list and submit_seed_payload
- change add_cloud_asset to use a map + submit_cloud_asset_payload
- rough and ready aurora client
- remove unused comment code
- add temp_sts_credential to ctx
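
As a reference for the AwsScanContext and self.pool items above, here is a minimal sketch of one way this could look. The field names and class shape are assumptions for illustration only, not the connector's actual API:

# Hypothetical sketch: keep the multiprocessing pool out of instance state so
# the connector stays picklable, and pass a small context object to each worker.
from dataclasses import dataclass, field
from multiprocessing import Pool
from typing import Optional


@dataclass
class AwsScanContext:
    """Assumed shape: the outer-loop state each worker needs."""
    account_number: str
    region: str
    temp_sts_credential: Optional[dict] = None
    possible_labels: set = field(default_factory=set)


class AwsConnector:
    def scan_all(self, contexts: list) -> None:
        # Create the pool locally instead of storing it on self; an object that
        # holds a Pool (or its locks) cannot be pickled and sent to workers.
        with Pool(processes=4) as pool:
            pool.map(self.scan, contexts)

    def scan(self, ctx: AwsScanContext) -> None:
        # Only the small, picklable context crosses the process boundary.
        print(f"scanning account {ctx.account_number} in {ctx.region}")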
@ericbutera ericbutera added the do not merge This PR is not ready to be merged label Oct 25, 2023
@ericbutera ericbutera self-assigned this Oct 25, 2023
for provider_setting in provider_settings.values():
# this is so confusing - plural settings to setting?
Contributor
maybe for azure_settings in provider_settings.values()?
or for azure_entry in provider_settings.values()?
or for provider_entry in provider_settings.values()?

Contributor Author

I think changing the context property name from provider_settings to provider_setting would make the most sense. It's not a big deal, but I tripped on it a couple of times. AWS and GCP are probably doing the same.


# TODO: figure out how to make this wait until scan is finished:
Contributor

**After writing this, option 2 is definitely better and is a fine solution. Figured I would leave my thought process here, though.**


I don't think this solution is compatible with multiprocessing. Since we don't scan by region, but region is part of the label, the options I see are:

  1. setting a TTL on the asset and only submitting what we find
    Pros: easy
    Cons: could leave seeds "stale" for up to 48 hours (it's probably ~24 hours now depending on attribution)
  2. One label will correspond to one process in this case, so we should be able to move the submission of empty payloads for labels_not_found into the single process. In that case it would go right after super().scan(**kwargs) in self.scan(). If we split up seeds/assets like we talked about for aws/gcp:
scan_all()
  scan_seeds()
    super().scan_seeds()
      get_seeds()
  scan_cloud_assets()
    super().scan_cloud_assets()
      get_cloud_assets()

then it would be at the end of scan_seeds(), after it calls super().scan_seeds(), where "it" is:

if self.scan_all_regions:
  for label_not_found in scan_context.possible_labels:
    self.delete_seeds_by_label(label_not_found)
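
Roughly, that placement could look like the sketch below; the class and method names are assumed from the outline above, and the base class is just a stub:

# Illustrative placement only; names are assumed from the outline above.
class CloudConnector:
    def scan_seeds(self, **kwargs) -> None:
        ...  # would call get_seeds() and submit the seeds it finds


class AwsCloudConnector(CloudConnector):
    scan_all_regions = True

    def scan_seeds(self, **kwargs) -> None:
        scan_context = kwargs["scan_context"]
        super().scan_seeds(**kwargs)

        # One label maps to one process here, so the leftover-label cleanup can
        # run right after the seed scan instead of waiting for scan_all().
        if self.scan_all_regions:
            for label_not_found in scan_context.possible_labels:
                self.delete_seeds_by_label(label_not_found)

    def delete_seeds_by_label(self, label: str) -> None:
        ...  # submit an empty payload for a label that was not found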

Contributor Author

Let's talk about this one some more.

My concern is that we will further split up how the connector works over time. I don't know that we can rely on an account or any level of region being in a single process over time. I could see each resource type being spawned off as well.

Another strategy could be to keep track of a "run id", or create a single value for the entire run. Upon completion, we could remove any items that aren't part of the current run (see the sketch below).

The cloud provider change streams might be a better way to handle this prune routine as well.

It will be hard to manage our concurrency patterns with features like this over time. With large enough customers we might not be able to hold all of the possible labels in RAM either. I frequently saw Out Of Memory (OOM) errors due to patterns like this while trying to build ingestion pipelines, and those instances had something like 16 GB of RAM.
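
For the "run id" idea, a minimal sketch under the assumption of a store with tag-and-delete helpers (nothing here is the connector's real API):

# Hypothetical run-id pruning: every payload submitted during a run carries the
# same run_id, and a final step deletes anything tagged with an older run.
import uuid


class RunIdPruner:
    def __init__(self, store):
        # `store` is an assumed seed/asset store exposing submit() and
        # delete_where_run_id_not(); both are placeholders for illustration.
        self.store = store
        self.run_id = uuid.uuid4().hex

    def submit(self, payload: dict) -> None:
        # Tag each payload with the current run so stale items are identifiable.
        self.store.submit({**payload, "run_id": self.run_id})

    def prune(self) -> None:
        # After all workers finish, drop items from previous runs without having
        # to hold every possible label in memory.
        self.store.delete_where_run_id_not(self.run_id)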

self.logger.info(
f"Scanning AWS account {self.account_number} in region {self.region}"
f"Scanning AWS - account:{scan_context.account_number} region:{scan_context.region}"
Contributor
Maybe specify "scanning for seeds" here? Otherwise we'll have the same log twice and not know which was for seeds vs. cloud assets. Not sure if that's a big deal.
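
Something like the sketch below would make the two phases distinguishable; the phase argument and helper are assumptions, not existing code:

import logging


def log_scan_start(logger: logging.Logger, phase: str, scan_context) -> None:
    # `phase` would be "seeds" or "cloud assets", so the otherwise-identical
    # log lines can be told apart.
    logger.info(
        "Scanning AWS %s - account:%s region:%s",
        phase,
        scan_context.account_number,
        scan_context.region,
    )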

- credential passing
- possible labels + delete seeds by label
- cloud asset use label prefix
- healthcheck log errors if dry run enabled