Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Cross-validation support for multiple covariate matching #64

Closed

Conversation

mnarayan
Copy link
Collaborator

No description provided.

@mnarayan mnarayan linked an issue Feb 12, 2021 that may be closed by this pull request
Copy link
Collaborator

@arokem arokem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

def stratified_crossvalidate(
covariate_df,
pipeline_estimator=None,
cv_splitter=StratifiedShuffleSplit(n_splits=1, test_size=0.25),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might set this to None, and initialize below. Just to be on the safe side.


# Check covariate_df is a dataframe
if not isinstance(covariate_df, pd.DataFrame):
raise TypeError
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
raise TypeError
raise TypeError("`covariate_df` input needs to be a DataFrame object")

Is that crucial, by the way? Why not take arrays?

if not isinstance(covariate_df, pd.DataFrame):
raise TypeError

X = covariate_df.to_numpy()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you can check if it's a dataframe, and if not, skip this line

if pipeline_estimator is None:
pipeline_estimator = make_pipeline(
KNeighborsTransformer(n_neighbors=10, mode="distance"),
Isomap(n_components=5, neighbors_algorithm="auto"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might expose n_neighbors and n_components as key-word arguments for the whole function, set to these values as default.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not. If they want to customize, just pass your own pipeline_estimator

)

X_reduced = pipeline_estimator.fit_transform(X)
cluster_stratify = KMeans(n_clusters=5).fit_predict(X_reduced)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here : expose n_clusters

for foldno, (train_index, test_index) in enumerate(
cv_splitter.split(X, cluster_stratify)
):
print("Fold No: {}".format(foldno))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use logging instead of print?

A scikit-learn object for splitting the dataset into
train/test set such that splits are stratified

Yields
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it Returns these, not Yields

print(train_clusters)
print(train_freq)
print(test_clusters)
print(test_freq)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, logging?

@arokem
Copy link
Collaborator

arokem commented Feb 22, 2021

@mnarayan : any progress on this one? Looks like the build failures are just some minor linting.

Would you mind rebasing and adding tests?

@arokem
Copy link
Collaborator

arokem commented Nov 6, 2022

I believe this was superseded by #107

@arokem arokem closed this Nov 6, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement stratified cross-validation for multiple covariates
2 participants