
load_as_spark using config as variable instead of file #483

Open · stevenayers-bge opened this issue Apr 24, 2024 · 7 comments

@stevenayers-bge

Hi all,

I want to use load_as_spark, but instead of saving the config credentials as a file, I want to pull them from a secrets manager (Azure Key Vault, AWS Secrets Manager, Databricks Secrets). I want to avoid saving the JSON config to the filesystem for security reasons.

For example:

delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

df = delta_sharing.load_as_spark(delta_sharing_config + '#<share name>.<schema>.<table name>')

Is this possible?

Thanks

@aimtsou
Copy link

aimtsou commented Apr 25, 2024

@stevenayers-bge: Would you be satisfied with building the JSON object in Python, pulling the secrets from the key vault and any other configuration from your application's resource configuration (or from the key vault as well)?
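Something like this (a rough sketch of what I mean; the endpoint URL and secret names here are placeholders):

import json
from delta_sharing.protocol import DeltaSharingProfile

# Keep only the sensitive value in the secret store and build the rest in code.
bearer_token = dbutils.secrets.get(scope="delta-sharing", key="bearer-token")

profile = DeltaSharingProfile.from_json(json.dumps({
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",  # placeholder
    "bearerToken": bearer_token,
}))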

@stevenayers

@aimtsou Initializing the REST client works fine, for example:

import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

# GET SECRET VALUE
delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

# PASS SECRET CONFIG WITHOUT SAVING TO FILE
profile = DeltaSharingProfile.from_json(delta_sharing_config)
client = delta_sharing.SharingClient(profile)

# List tables available in share
client.list_all_tables()

The issue arises when you go to read the data into a DataFrame. You cannot use SharingClient or DeltaSharingProfile to authenticate when reading the DataFrame; you have to pass the file path in the URL argument of load_as_spark:

import delta_sharing

share_file = "/tmp/my-secret-config-stored-in-a-file" # security problem!

df = delta_sharing.load_as_spark(share_file + '#nep_test.reference.settlement_period_calendar')
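A stopgap I've considered is writing the secret to a short-lived file and deleting it straight after the load, but that only narrows the exposure window rather than removing it (and it assumes the Scala side reads the profile eagerly at load time):

import os
import tempfile

import delta_sharing

delta_sharing_config = dbutils.secrets.get(scope="delta-sharing", key="my-config")

# tempfile.mkstemp creates the file with owner-only (0600) permissions
fd, share_file = tempfile.mkstemp(suffix=".share")
try:
    with os.fdopen(fd, "w") as f:
        f.write(delta_sharing_config)
    df = delta_sharing.load_as_spark(share_file + '#nep_test.reference.settlement_period_calendar')
finally:
    os.remove(share_file)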

@stevenayers

@aimtsou @linzhou-db This old PR would solve the issue: #103

@aimtsou

aimtsou commented Apr 25, 2024

@stevenayers,

Now I get your point. Yes, the load_as_spark function does not support SharingClient, so you are proposing something different from me: passing the whole JSON stored in secure storage as a profile to Spark, whereas I proposed building the JSON object in Python, which is closer to what @zsxwing suggests. I see two different implementations, to be honest.

My question is: what is your incentive for storing the whole file in a secret store? Usually we retrieve only the secret value. I mean, in your case, anyone who has access to this Databricks workspace already has access to the whole file, since it is in the secret store.

@stevenayers-bge

@aimtsou I don't mind how it's done 🙂 I'm not sure your PR will help, though: load_as_spark doesn't interact with the Python SharingClient.

load_as_spark passes the config file path to the JVM, and the file is read and parsed within the Delta Sharing Scala library.
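For reference, the Python side is essentially just this (paraphrased from the delta-sharing source, so details may differ between versions):

from pyspark.sql import SparkSession

def load_as_spark(url: str):
    # `url` is "<profile file path>#<share>.<schema>.<table>"; the path part
    # is forwarded as-is and only read/parsed by the Scala data source on the JVM.
    spark = SparkSession.getActiveSession()
    assert spark is not None, "No active SparkSession"
    return spark.read.format("deltaSharing").load(url)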

@stevenayers-bge

stevenayers-bge commented Apr 25, 2024

> @stevenayers,
>
> Now I get your point. Yes, the load_as_spark function does not support SharingClient, so you are proposing something different from me: passing the whole JSON stored in secure storage as a profile to Spark, whereas I proposed building the JSON object in Python, which is closer to what @zsxwing suggests. I see two different implementations, to be honest.
>
> My question is: what is your incentive for storing the whole file in a secret store? Usually we retrieve only the secret value. I mean, in your case, anyone who has access to this Databricks workspace already has access to the whole file, since it is in the secret store.

I don't have an incentive to store the whole file in a secrets manager, but it is often easier to store the whole JSON config in one secret.

It would be equally secure if you only stored the bearerToken in a secrets manager, but then you're storing some of the config on the filesystem and some in a secrets manager, which is a bit messy.

All I'm proposing is that we can pass the profile configuration via parameters, rather than passing in a file path. So something like:

share_endpoint = ...  # pulled from a secrets manager or wherever
share_token = ...     # pulled from a secrets manager or wherever

df = delta_sharing.load_as_spark(
    share_name='<share name>',
    schema_name='<schema name>',
    table_name='my_table',
    endpoint=share_endpoint,
    bearer_token=share_token,
)

# or pass a client object?
client: SharingClient = ...
df = delta_sharing.load_as_spark(client=client, share_name=..., ...)

@aimtsou does that make sense?

@aimtsou
Copy link

aimtsou commented Apr 25, 2024

For me, yes. And as you said, there is only a df.load(url) in load_as_spark, so indeed it needs to be done on the Scala side.
