
DBT python model passing date variables with different data types #583

Open
tade0726 opened this issue Feb 9, 2024 · 3 comments
Labels
enhancement New feature or request

Comments


tade0726 commented Feb 9, 2024

Describe the feature

I noticed that date variables arrive in the Python notebook with two different data types.

The Python notebook mentioned here is the notebook dbt-databricks submits to Databricks for running Python models. It contains a config object defined according to the vars configured in dbt_project.yml.

As in:

# This part is user provided model code
# you will need to copy the next section to run the code
# COMMAND ----------
# this part is dbt logic for get ref work, do not modify

def ref(*args, **kwargs):
    refs = {}
    key = '.'.join(args)
    version = kwargs.get("v") or kwargs.get("version")
    if version:
        key += f".v{version}"
    dbt_load_df_function = kwargs.get("dbt_load_df_function")
    return dbt_load_df_function(refs[key])


def source(*args, dbt_load_df_function):
    sources = {"prod_staging.ciqfininstance": "`savana_prod`.`dbt_staging`.`ciqfininstance`"}
    key = '.'.join(args)
    return dbt_load_df_function(sources[key])


config_dict = {'start_date': datetime.date(2018, 7, 1), 'end_date': ''}


class config:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)

class this:
    """dbt.this() or dbt.this.identifier"""
    database = "savana_prod"
    schema = "dbt_gold_pipelines"
    identifier = "int_date_table"

    def __repr__(self):
        return '`savana_prod`.`dbt_gold_pipelines`.`int_date_table`'


class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function)
        self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = False
# COMMAND ----------

# how to execute python model in notebook
# dbt = dbtObj(spark.table)
# df = model(dbt, spark)

Case 1:

When a date variable is set in dbt_project.yml, the Python notebook in Databricks receives it as a datetime.date object.

As in the config Python dictionary:

{"start_date": datetime.date(2021, 1, 1)}

Case 2:

When a date variable is passed with --vars "{'date': '2021-01-01'}" on the dbt command line,

it arrives as

{"start_date": '2021-01-01'}
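Until the types are made consistent, each model has to reconcile the two shapes itself. A minimal sketch of that reconciliation (the helper name `coerce_date` is hypothetical, not part of dbt):

```python
import datetime


def coerce_date(value):
    """Normalize a dbt date var that may arrive either as a
    datetime.date (set in dbt_project.yml) or as an ISO 'YYYY-MM-DD'
    string (passed via --vars on the CLI)."""
    if isinstance(value, datetime.date):
        return value
    return datetime.date.fromisoformat(value)


# Both arrival shapes normalize to the same value:
coerce_date(datetime.date(2021, 1, 1))  # Case 1: from dbt_project.yml
coerce_date("2021-01-01")               # Case 2: from --vars
```

This is exactly the kind of boilerplate that currently has to be copied into every Python model.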

Describe alternatives you've considered

I suggest representing the date consistently as a string.

Additional context

I am implementing an integration with Dagster + Databricks cloud + dbt, and I need both kinds of configuration to yield the same variable type; otherwise, redundant reconciliation logic has to be copied and pasted into each Python model to handle the difference.

This is because dbt Python models cannot currently share abstracted helper functions.

Who will this benefit?

It will no longer be necessary to write extra logic to distinguish the date types.

Integration with Dagster might depend on the --vars CLI flag to pass configuration, but it would be easier to debug in the dbt project if those variables arrived in the same shape as the ones set in the yml file.

Are you interested in contributing to this feature?

Yes, I would love to.

@tade0726 tade0726 added the enhancement New feature or request label Feb 9, 2024
@leo-schick

@tade0726 you can ensure that by making a small code adjustment to case 1:

instead of

{"start_date": datetime.date(2021, 1, 1)}

write

{"start_date": datetime.date(2021, 1, 1).strftime("%d.%m.%Y")}

see also: Python Dates
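(Editorial note: if the goal is to match the string that --vars already delivers, `date.isoformat()` produces the same 'YYYY-MM-DD' shape, whereas the strftime pattern above yields a dotted European format.)

```python
import datetime

# isoformat() yields the same 'YYYY-MM-DD' string that
# --vars "{'date': '2021-01-01'}" delivers on the CLI.
iso = datetime.date(2021, 1, 1).isoformat()   # '2021-01-01'

# The strftime pattern suggested above gives a different shape:
dotted = datetime.date(2021, 1, 1).strftime("%d.%m.%Y")  # '01.01.2021'
```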

@tade0726
Author

tade0726 commented Feb 12, 2024

Hi @leo-schick ,

It is not related to a user-defined dictionary; it is a config template generated by dbt-databricks.

I also updated the issue description for clarity.

See example below:

# This part is user provided model code
# you will need to copy the next section to run the code
# COMMAND ----------
# this part is dbt logic for get ref work, do not modify

def ref(*args, **kwargs):
    refs = {}
    key = '.'.join(args)
    version = kwargs.get("v") or kwargs.get("version")
    if version:
        key += f".v{version}"
    dbt_load_df_function = kwargs.get("dbt_load_df_function")
    return dbt_load_df_function(refs[key])


def source(*args, dbt_load_df_function):
    sources = {"prod_staging.ciqfininstance": "`savana_prod`.`dbt_staging`.`ciqfininstance`"}
    key = '.'.join(args)
    return dbt_load_df_function(sources[key])


config_dict = {'start_date': datetime.date(2018, 7, 1), 'end_date': ''}


class config:
    def __init__(self, *args, **kwargs):
        pass

    @staticmethod
    def get(key, default=None):
        return config_dict.get(key, default)

class this:
    """dbt.this() or dbt.this.identifier"""
    database = "savana_prod"
    schema = "dbt_gold_pipelines"
    identifier = "int_date_table"
    
    def __repr__(self):
        return '`savana_prod`.`dbt_gold_pipelines`.`int_date_table`'


class dbtObj:
    def __init__(self, load_df_function) -> None:
        self.source = lambda *args: source(*args, dbt_load_df_function=load_df_function)
        self.ref = lambda *args, **kwargs: ref(*args, **kwargs, dbt_load_df_function=load_df_function)
        self.config = config
        self.this = this()
        self.is_incremental = False

# COMMAND ----------

# how to execute python model in notebook
# dbt = dbtObj(spark.table)
# df = model(dbt, spark)

@benc-db
Collaborator

benc-db commented Feb 12, 2024

@tade0726 Please go ahead and submit a PR :). I agree that it makes most sense for the data type to be consistent regardless of source.
