Allow __splink__df_concat to be computed without linker #2142

Open
RobinL opened this issue Apr 15, 2024 · 1 comment

RobinL commented Apr 15, 2024

We plan to let users do some forms of exploratory analysis without needing to create a linker, such as profile_columns and various types of blocking analysis (e.g. #2136).

But this means that __splink__df_concat needs to be computed without the linker.

At the moment, this requires a lot of code that is confusing to read and would have to be repeated in each function that needs it:

# Register the input table(s) with the database backend
tables = ensure_is_list(table_or_tables)
tables = db_api.process_input_tables(tables)
splink_df_dict = db_api.register_multiple_tables(tables)
input_dataframes = list(splink_df_dict.values())
input_aliases = list(splink_df_dict.keys())

# Default to every column of the first input table
input_columns = input_dataframes[0].columns_escaped
if not column_expressions:
    column_expressions_raw = input_columns
else:
    column_expressions_raw = ensure_is_list(column_expressions)
column_expressions = expressions_to_sql(column_expressions_raw)

# Build one SELECT per input table and join them with UNION ALL
pipeline = CTEPipeline(input_dataframes)
cols_to_select = ", ".join(input_columns)
template = """
select {cols_to_select}
from {table_name}
"""
sql_df_concat = " UNION ALL".join(
    [
        template.format(cols_to_select=cols_to_select, table_name=table_name)
        for table_name in input_aliases
    ]
)
pipeline.enqueue_sql(sql_df_concat, "__splink__df_concat")

This issue can be addressed by removing the need for a linker when computing __splink__df_concat, which gives us reusable code for profile_columns, blocking analysis, etc.

RobinL commented Apr 15, 2024

We want vertically_concatenate_sql to be modified so that it doesn't take a linker as an argument, but functions like vertically_concatenate.compute_df_concat should still take the linker as an argument.
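
A rough sketch of what that split could look like, for discussion only. Everything here is an assumption rather than Splink's actual API: the `_sketch` function names and their signatures, the `StandInLinker` class (a hypothetical object exposing only a connection, table aliases and column names), and the use of sqlite3 in place of a real db_api backend. The point is just that the SQL builder needs plain inputs, while the compute function keeps taking the linker and delegates.

```python
import sqlite3
from dataclasses import dataclass
from typing import List, Sequence


def vertically_concatenate_sql_sketch(
    input_aliases: Sequence[str], cols_to_select: Sequence[str]
) -> str:
    """Linker-free: build the UNION ALL query from table aliases and columns.

    Hypothetical signature; column names are assumed to be escaped already.
    """
    if not input_aliases:
        raise ValueError("at least one input table is required")
    cols = ", ".join(cols_to_select)
    selects = [f"select {cols}\nfrom {alias}" for alias in input_aliases]
    return "\nUNION ALL\n".join(selects)


@dataclass
class StandInLinker:
    """Hypothetical stand-in exposing only what the wrapper below needs."""

    con: sqlite3.Connection
    input_aliases: List[str]
    columns: List[str]


def compute_df_concat_sketch(linker: StandInLinker) -> list:
    """Still takes the linker, but delegates SQL generation to the pure function."""
    sql = vertically_concatenate_sql_sketch(linker.input_aliases, linker.columns)
    return linker.con.execute(sql).fetchall()


# Small demo with two in-memory tables that share the same columns
con = sqlite3.connect(":memory:")
con.execute("create table df_a (unique_id integer, first_name text)")
con.execute("create table df_b (unique_id integer, first_name text)")
con.execute("insert into df_a values (1, 'ann'), (2, 'bob')")
con.execute("insert into df_b values (3, 'cat')")

linker = StandInLinker(con, ["df_a", "df_b"], ["unique_id", "first_name"])
rows = compute_df_concat_sketch(linker)
```

With this shape, profile_columns or a blocking analysis function could call the SQL builder directly with aliases from registered tables, without constructing a linker first.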
