Skip to content
This repository has been archived by the owner on May 17, 2024. It is now read-only.

Namespacing CLI options #860

Closed
wants to merge 19 commits into from
Closed

Namespacing CLI options #860

wants to merge 19 commits into from

Conversation

sar009
Copy link
Contributor

@sar009 sar009 commented Jan 25, 2024

The __main__.py is blotted with a lot of CLI variables impacting readability and loose typing. So namespacing the CLI options under the class CliOptions.

Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
@sar009
Copy link
Contributor Author

sar009 commented Jan 25, 2024

fixing #859

Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
@sungchun12
Copy link
Contributor

Can you add a fuller description on your approach in this PR? I'll plan to review this next week when it's ready!

Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
@sar009
Copy link
Contributor Author

sar009 commented Jan 25, 2024

@sungchun12 I have updated the description please have a look.

@sungchun12 sungchun12 mentioned this pull request Feb 5, 2024
@sungchun12 sungchun12 self-requested a review February 5, 2024 20:10
Copy link
Contributor

@sungchun12 sungchun12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I really like this approach and you did a great job of cleaning up the implementation emails and reducing kwargs bloat for function calls.

Once you update the data class for a pydantic class and update the relevant attribute calls, I'll give this another review and aim to merge this week!

@@ -14,7 +14,7 @@
def connect_to_table(
db_info: Union[str, dict],
table_name: Union[DbPath, str],
key_columns: str = ("id",),
key_columns: Tuple[str] = ("id",),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, good catch. You're reminding me that we need to add static type checking to this codebase

data_diff/config.py Outdated Show resolved Hide resolved


@dataclass
class CliOptions:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing all this really puts into perspective just how many CLI options are available that need validation. Can you refactor this using pydantic?

Until we have static type checking in place, runtime type validation will serve this better

quick pseudo-code

from typing import Optional, Literal, Dict, Union, Tuple
from pydantic import BaseModel


class CliOptions(BaseModel):
    bisection_factor: int
    bisection_threshold: int
    table_write_limit: int
    database1: Union[str, Dict, None] = None
    table1: Optional[str] = None
    database2: Union[str, Dict, None] = None
    table2: Optional[str] = None
    key_columns: Tuple[str, ...] = ()
    update_column: Optional[str] = None
    columns: Tuple[str, ...] = ()
    limit: Optional[int] = None
    materialize_to_table: Optional[str] = None
    min_age: Optional[str] = None
    max_age: Optional[str] = None
    stats: bool = False
    debug: bool = False
    json_output: bool = False
    verbose: bool = False
    version: bool = False
    interactive: bool = False
    no_tracking: bool = False
    case_sensitive: bool = False
    assume_unique_key: bool = False
    sample_exclusive_rows: bool = False
    materialize_all_rows: bool = False
    threads: Union[None, int, Literal["serial"]] = None
    threads1: Optional[int] = None
    threads2: Optional[int] = None
    threaded: bool = False
    where: Optional[str] = None
    algorithm: Literal["auto", "joindiff", "hashdiff"] = "auto"
    conf: Optional[str] = None
    run: Optional[str] = None
    dbt: bool = False
    cloud: bool = False
    dbt_profiles_dir: Optional[str] = None
    dbt_project_dir: Optional[str] = None
    select: Optional[str] = None
    state: Optional[str] = None
    prod_database: Optional[str] = None
    prod_schema: Optional[str] = None
    __conf__: Optional[Dict] = None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will have to change the attribute calls for the rest of the files as a result

assert threaded
assert threads == 6
_set_threads(cli_options)
assert str(value_error.exception) == "Error: threads must be of type int, or value must be 'serial'."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pydantic will resolve this at runtime validation, but this custom error message is nice

sar009 and others added 12 commits February 7, 2024 11:38
Co-authored-by: Sung Won Chung <sungwonchung3@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
Signed-off-by: Sarad Mohanan <sarad.mohanan@gmail.com>
@sar009
Copy link
Contributor Author

sar009 commented Feb 20, 2024

@sungchun12 I have addressed the review comments please take a look

@sungchun12
Copy link
Contributor

@sar009 thank you! I'll review later this week.

@sungchun12
Copy link
Contributor

Mirror this PR for the full test suite to uncover any edge cases: #871

Copy link
Contributor

@sungchun12 sungchun12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, really good!

Just need new tests for these methods/functions given this codebase hasn't tested them before and we need to make sure the new args we're passing are doing exactly as expected.

  • _apply_config
  • apply_config_from_file
  • _print_result
  • _data_diff
  • create_schema
  • diff_schemas

f"Maximum number of rows to write when creating materialized or sample tables, per thread. "
f"Default={TABLE_WRITE_LIMIT}"
),
type=int,
metavar="COUNT",
)
@click.option(
"-j",
"--threads",
default=None,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
default=None,
default=None,
type=int,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cli_options.dbt_project_dir = os.path.expanduser(cli_options.dbt_project_dir)

if cli_options.dbt:
dbt_diff(cli_options, log_status_handler=log_handlers.get("log_status_handler"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to verify how cli_options kwargs are being passed into this method


return new_kw
cli_options.run_args = run_args
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this makes sense instead of returning new_kw since cli_options stores config state now

from pydantic import BaseModel, PositiveInt


class CliOptions(BaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class is bloated and will become bloated over time, but I'm willing to accept that tradeoff for better runtime validation of flag types and simplification of kwargs** across method calls. Overall, this PR has opened my eyes that a major refactor needs to happen across lots of the codebase for simpler data flows, but in the short-term, this approach is worth it.

Copy link
Contributor

This pull request has been marked as stale because it has been open for 60 days with no activity. If you would like the pull request to remain open, please comment on the pull request and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

@github-actions github-actions bot added the stale Issues/PRs that have gone stale label May 10, 2024
Copy link
Contributor

Although we are closing this pull request as stale, it's not gone forever. PRs can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.

@github-actions github-actions bot closed this May 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
stale Issues/PRs that have gone stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants