Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata cleanup #272

Draft
wants to merge 19 commits into
base: v0.2-integration
Choose a base branch
from
Draft

Metadata cleanup #272

wants to merge 19 commits into from

Conversation

tnixon
Copy link
Contributor

@tnixon tnixon commented Oct 25, 2022

This is the first big chunk of the v0.2 refactor work. This update includes a major reworking of the metadata structure for TSDF. I wanted to present it internally to the team as a PR to get eyes on this.

THIS IS A WORK IN PROGRESS
I have done quite a bit to get test code passing, but not all are working yet. There are still some loose ends to clean up and there will be more changes coming. The focus here should be on the new meta-data structure for TSDF, and its new constructors, and how these will simplify and streamline other code. I invite your reviews, questions and comments - let's get a good discussion going!

Also note - this is not going into master, just into the v0.2-integration branch. This is just a temporary place to integrate lots of changes and get things ready for a final merge to master when all changes are completed.

@lgtm-com
Copy link

lgtm-com bot commented Oct 25, 2022

This pull request introduces 6 alerts when merging ab1ff0c into 38ec63f - view on LGTM.com

new alerts:

  • 4 for Wrong name for an argument in a class instantiation
  • 2 for `__eq__` not overridden when adding attributes

@lgtm-com
Copy link

lgtm-com bot commented Oct 25, 2022

This pull request introduces 6 alerts when merging 1459ff5 into 38ec63f - view on LGTM.com

new alerts:

  • 4 for Wrong name for an argument in a class instantiation
  • 2 for `__eq__` not overridden when adding attributes

python/tempo/tsdf.py Show resolved Hide resolved
python/tempo/tsdf.py Show resolved Hide resolved
Comment on lines 80 to 81
if validate_schema:
self.ts_schema.validate(df.schema)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the scenario where we would not want to validate the schema?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see there are some protected methods where we don't validate schema, but seems like exposing this arg could cause issues if set to False when users initialize a TSDF.

I also don't think it hurts to validate the schema each time we manipulate the underlying DF in any way, even protected args.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most TSDF transformer methods make some changes to the underlying DF and then return it wrapped in a new TSDF object. I think of this validation as primarily for end-users who might need guidance on how they're building a TSDF. Internal transformations should already be safe, so shouldn't require validation.

However, I'm open to doing validation on every constructor. I dont' think it'll be a hugely heavy function.

# If we see a string, we will proactively created a double
# version of the string timestamp for sorting purposes and
# rename to ts_col
@classmethod
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use __withTransformedDF instead of a class method for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__withTransformedDF returns a TSDF, this returns a DataFrame. It's a helper to build compound-timestamp columns from multiple columns for special situations (timestamp + sub-sequence, string col -> parsed ts col, etc.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I'm struggling to understand why we'd instantiate an instance of TSDF to do a df -> df operation. Seems like this could be a static method other than it's protected. And if it's protected, do we want to expose it as a class method?

def __addColumnsFromOtherDF(self, other_cols: Sequence[str]):
return TSDF(df, ts_col=ts_col, series_ids=self.series_ids)

def __addColumnsFromOtherDF(self, other_cols):
"""
Add columns from some other DF as lit(None), as pre-step before union.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thought! This could simplify things a lot here!

Comment on lines 483 to 487
if set(self.structural_cols).issubset(set(cols)):
return self.__withTransformedDF(self.df.select(*cols))
else:
raise Exception(
"In TSDF's select statement original ts_col, partitionCols and seq_col_stub(optional) must be present"
raise TSDFStructureChangeError(
"select that does not include all structural columns"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 this is super clean. Can you tag #248 in the PR descritption?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! definitely.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also... couldn't we just short-cut select("*") by just returning self?
After all select("*") doesn't change anything...

dbutils.fs.ls("/")
return full_smry
# TODO: Can we raise something other than generic Exception?
# perhaps refactor to check for IS_DATABRICKS
except Exception:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
except Exception:
except NameError:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a thought to narrow this assuming we are just trying to catch dbutils not being imported/installed

Comment on lines 259 to 260
Timeseries Index when we have a primary timeseries column and a secondary sequencing
column that indicates the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent cliffhanger 😂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤣

@tnixon tnixon marked this pull request as draft April 10, 2023 22:32
@CLAassistant
Copy link

CLAassistant commented Nov 27, 2023

CLA assistant check
All committers have signed the CLA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tsdf.select("*") throws Exception stating columns must be present when they are by nature of the projection
3 participants