Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Noorain pgm baseline #3

Open
wants to merge 36 commits into
base: main
Choose a base branch
from
Open

Noorain pgm baseline #3

wants to merge 36 commits into from

Conversation

noorains
Copy link
Collaborator

No description provided.

@Polichinel Polichinel self-assigned this Feb 29, 2024
Copy link
Collaborator

@Polichinel Polichinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of things are happening in this PR. Mostly, I'm confused as to whether this contains more than just the baseline models indicated by the title.

The task was to create a no-change baseline model and a zero baseline model, but a lot of other things appear to be happening here.

I need this PR to focus on these two baselines, or at least the readmes and the title should reflect what is going on otherwise.

The fact that there are no docstrings or comments does also not help me understand what is happening. I need more inline code documentation to parse what is happening without needing to run this on my own machine

Lastly, have you checked the documentation regarding the repository structure and naming? And if so, where? If we have something misleading around, it should be removed (yes, I realize that having text on both GitHub and Notion can result in one source not being updated...). Anyway, please make sure to consult the documentation before making a PR so that trivial issues like this are in order.

We'll set up a GitHub action set of rules soon enough to catch these things automatically, so we might as well get used to reviewing the documentation carefully (Sara will be setting up a wiki when she has a moment where she is not actively working on the repository code, and hopefully, you'll have an easier time finding what you need).

.gitignore Outdated
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you deleting these things in the gitignore? Sure a lot of the things are not currently relevant but many of the things you do delete could become relevant very quickly.

You can of course add to the gitignore all you want - to the extent that it does not interfere with other people's code, but there is usually no reason for you to delete stuff in here. If, for some reason we need to delete something in here, please raise the issue. As I see it this have nothing to do with anything else in this PR - but please correct me if I'm wrong?

So please undelete and add you own stuff - unless very good reason, then I'm all ears

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this comment. I assumed that some of the frameworks like Django, Scrapy, Flask and various file types were not relevant to our models and data pipeline. So I only kept basic files in .gitignore relevant to the model folder, Python, parquet, Jupyter. If all the others are relevant I can add them back for future. :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please write a small paragraph about what the point of this model is. Why do we have a zero model? What does it mean that it is a baseline? Just you two or three sentences more and we're good

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted. I'll update the readme.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll might change this format soon - but for now don't worry about it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is clear after the retreat that the config structure should be different. I'll present the structure on Monday, so unless this PR takes a long time to resolve I suggest you just leave the configs as is, for now. We'll make a new branch to fix those next week instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

@@ -0,0 +1,21 @@
from ..blue_sea.src.dataloaders.fetch_data_run_query import fetch_data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a big fan of implicit relative imports, such as from . import module. I think it is a bit more prudent that we are more explicit (yet still relative), for instance:

# Set the base path relative to the current script's location
src_path = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# Alternative version using pathlib's Path (potentially more best-practice)
# src_path = f"{Path(__file__).parent.parent}"

# Add the required directories to the system path
sys.path.insert(0, os.path.join(src_path, "architectures"))
sys.path.insert(0, os.path.join(src_path, "configs"))
sys.path.insert(0, os.path.join(src_path, "utils"))

# Now a module from, e.g., utils can be loaded directly
from utils import util_module

Note that if everyone, thinks I'm being silly here I am willing to discuss it, but currently this is my conviction

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks for this comment. I will modify imports using sys if that is the requirement. :)

Comment on lines 6 to 22
import wandb
def evaluate_mse() -> float:
data = pd.read_parquet(
f"{Path(__file__).parent.parent.parent}/data/generated/forecasts.parquet")
mse_mean = 0
for i in range(1, 37):
mse_mean = mse_mean + mean_squared_error(data['ln_ged_sb_dep'], data[f'step_pred_{i}'])
print('The MSE is ', mse_mean/36)

#project = wandb_config.project_config['project']
#entity=wandb_config.project_config['entity']

#wandb_log(project, entity, mse_mean/36, 'mse_mean_36_months')
wandb.log({'Mean MSE 36 months': mse_mean/36})

return mse_mean/36

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have a fixe W&B scheme/standard yet, so this is fine for now. But we'll have to converge at some point

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

"test", "predict", model['RunResult_test'].data)

predictions_test.to_parquet(
f"{Path(__file__).parent.parent.parent}/data/generated/{model['modelname']}_true_forecasts.parquet")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config in future - fine for now

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks!

Comment on lines 27 to 28
model['RunResult_test'] = RunResult.retrain_or_retrieve(
retrain=bool(config.common_config['force_retrain']),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Whether or not this is a zero baseline or something completely different:) We'll go with a different solution here - please consult Xiolong - or his code in the stepshifter branch (if the beach is not there is is simply because it has been merge to main)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blue_sea is all zero baseline, yellow_duck is no change model. :) green_oracle which is referred here is a test model. I'll change the readme of green_oracle.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a lot of hassle for a zero baseline model? I'm starting to suspect that is not what is happening here? Either the readme is wrong or?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blue_sea is all zero baseline, yellow_duck is no change model. :) green_oracle which is referred here is a test model. I'll change the readme of green_oracle.

Comment on lines 27 to 28
model['RunResult_calib'] = RunResult.retrain_or_retrieve(
retrain=bool(config.common_config['force_retrain']),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll go another way - please consult Xiaolong or his code in the stepshifter branch (if deleted, it is merged with main)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thanks! baseline models will not require this for now and so I'll take the latest trainer for future models.

@noorains
Copy link
Collaborator Author

Hi, thank you for the feedback. I think I have taken care of most of the comments and suggestions. This branch would require a check of the outputs from the models so that we are sure that the baseline models are forecasting in the way that we expect them to. If this can be checked then we are ready to merge this branch to main. I have provided with enough documentation and docstrings in this branch. I am sure that this branch can be cloned and run locally. Please give me a feedback if there is an issue running the branch and if the outputs are not as expected. This will also check that the branch have no issues running on clients which were not used for development. It can work well with Prefect workflow code, I have tested it using Prefect. Just use the main.py files in the model specific folder to run them through Prefect or through Terminal. Also, as decided in one of our meetings, the models store the forecasts locally and not on the prediction store. So, please look into the data/generated/ folder to download the predictions in the form of parquet file. Thanks!

@sarakallis
Copy link
Collaborator

sarakallis commented Apr 8, 2024

Morning! Here are some comments while we wait for Simon and Xiaolong to get back.

General comments here, more specific ones in code:

  • I wonder why there is __init__.py in every folder. I assume this would only needed for packages like Xiaolong's views_stepshift that we have in common_utils. The rest of us 3 also don't have these – so it should be redundant, no?
  • Please make sure that you adhere to the file naming conventions that are always most up-to-date in the main README (you have to toggle it now to see it). Changes will be needed in the py file names in dataloaders, evaluation, forecasting, artifacts, and configs.
  • Have a look at the new path solution outlined here.
  • I couldn't find wandb initialization in blue_sea -- could you re-add this to main.py? Same goes for yellow_duck, where it's in the evaluation script. On that note, we are implementing more evaluation metrics than MSE, outlined here. I'm also still working on implementing this, but if you have time it would be good to start looking through it as well :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants