
Upload only predictions instead of trained model via run #1231

Open
ArturDev42 opened this issue Mar 21, 2023 · 9 comments
Labels
Documentation, Flow, OpenML concept

Comments

@ArturDev42

Description

My understanding is that in order to upload the results for a particular task, I need to create a run from a trained model and a task, as follows:

task = openml.tasks.get_task(119) 
clf = ensemble.RandomForestClassifier() 
run = openml.runs.run_model_on_task(clf, task)

Would it be possible to upload only the predictions (e.g. as a CSV file) for a given task, instead of needing to upload a trained model? Like it is done in Kaggle competitions or in challenges on https://eval.ai/.

@ArturDev42 ArturDev42 changed the title Upload only predictions instead of publishing run Upload only predictions instead of trained model via run Mar 21, 2023
@PGijsbers
Collaborator

PGijsbers commented Mar 21, 2023

Yes / no. Currently all runs must be linked to a flow (which is created automatically within that function). Here is a tutorial on creating a custom flow and using that to upload run results without using run_model_on_task: https://openml.github.io/openml-python/main/examples/30_extended/custom_flow_.html#sphx-glr-examples-30-extended-custom-flow-py

Hopefully that should work for you :)
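For reference, the flow creation in that tutorial boils down to filling in a handful of metadata fields. The sketch below only builds the argument dictionaries with the standard library; the actual `openml.flows.OpenMLFlow(...)` and `.publish()` calls are left as comments since they require a configured client and server. The field names follow the tutorial, while all concrete values are made-up placeholders.

```python
from collections import OrderedDict

# Metadata for a custom flow, following the fields used in the
# openml-python custom-flow tutorial. All concrete values below are
# made-up placeholders, not a real flow.
general = OrderedDict([
    ("name", "my-prediction-upload-flow"),
    ("description", "Uploads externally computed predictions for a task."),
    ("external_version", "0.1"),
])

# With no hyperparameters and no sub-flows, these can stay empty.
parameters = OrderedDict()
parameters_meta_info = OrderedDict()
components = OrderedDict()

# The actual upload would then look roughly like (see the tutorial for
# the full, authoritative argument list):
# flow = openml.flows.OpenMLFlow(**general, parameters=parameters,
#                                parameters_meta_info=parameters_meta_info,
#                                components=components, model=None, ...)
# flow.publish()
print(general["name"])
```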

@ArturDev42
Author

Thank you for the quick answer!

I am trying to follow the tutorial you referenced. I am using a flow that contains a KNeighborsClassifier (Flow ID: 90800, https://test.openml.org/f/90800), but I'm not sure how to define the parameter 'components' as listed here: https://openml.github.io/openml-python/main/generated/openml.flows.OpenMLFlow.html#openml.flows.OpenMLFlow

I defined all other parameters required by openml.flows.OpenMLFlow() according to the flow I would like to use.

How can I specify the used flow in components?

@ArturDev42
Author

ArturDev42 commented Apr 4, 2023

Inspecting the JSON file of the autosklearn_flow provided in the tutorial, I see a component identifier "component":{"identifier":"automl_tool","flow":. However, I do not see this for my own flow (https://test.openml.org/f/90800). Is this something I need to specify when creating a flow?

@LennartPurucker
Contributor

Heyho,

So I have looked into this and have come to the following conclusion (@PGijsbers feel free to correct me):

You can specify components, but you do not need to.
You could leave the parameter empty with components=OrderedDict(), and this works as far as I can tell.

If you want to reference an already existing flow in your new flow, you would need to first get the flow and then reference it as part of the components:

some_flow = openml.flows.get_flow(X)
components=OrderedDict(some_key=some_flow)

However, I think you can only reference one sub-flow. While the Python API technically allows specifying multiple sub-flows (components=OrderedDict(some_key=some_flow, some_key2=some_flow2)), they are lost when uploading them to the server.

The API also calls this key component instead of components -- I am unsure if this is intended or if the Python API simply supports more than is currently specified by the server. In fact, the Python API supports reading flows with multiple components from the server, i.e., it could parse XML files with multiple components. In short, the server only accepts the first sub-flow and ignores any other sub-flow.
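The behaviour described above can be illustrated with plain OrderedDicts: the Python side happily holds several components, but, per the observation that the server ignores everything after the first sub-flow, only the first entry effectively survives. The flow objects here are string stand-ins for what `openml.flows.get_flow(...)` would return, purely to show the structure.

```python
from collections import OrderedDict

# Stand-ins for flow objects as returned by openml.flows.get_flow(...).
some_flow, some_flow2 = "flow_A", "flow_B"

# The Python API accepts multiple sub-flows ...
components = OrderedDict(some_key=some_flow, some_key2=some_flow2)

# ... but, per the server behaviour described above, only the first
# entry is kept on upload; the rest are silently dropped.
kept = OrderedDict([next(iter(components.items()))])
print(kept)
```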

@ArturDev42, I hope this answers your questions. Feel free to ask any other questions regarding custom flows so that this issue may function as additional temporary documentation for custom flows.

@ArturDev42
Author

Hi @LennartPurucker, thanks a lot for your response!

I had actually already tried components=OrderedDict(), and creating the flow worked. If I wanted to reference an already existing flow, can I use any key of the flow? I used flow_id, which I could also find in the JSON file of the flow, i.e., components=OrderedDict(flow_id=...)

In my previous comment #1231 (comment) I was trying to use an already existing flow, because I thought this is necessary for a custom flow.

But for my specific use case, I am not sure why I would need an already existing flow. Basically, I only want to upload predictions (without a trained model) for a given task and compare them with the ground truth from that task.

Is my understanding correct that I can create a custom flow without referencing any already existing flow and basically only use it to be able to upload run results for a given task? It seemed to work for me so far.

If yes, then I was wondering why it is still necessary to specify parameters and parameters_meta_info when creating the custom flow?

Thanks!

@LennartPurucker
Contributor

If I wanted to reference an already existing flow, can I use any key of the flow?

I do not think the key itself would be enough; you would need to use openml.flows.get_flow(your_flow_key).

Is my understanding correct that I can create a custom flow without referencing any already existing flow and basically only use it to be able to upload run results for a given task? It seemed to work for me so far.

Yes, to my understanding, that should work, and I do not see a reason why it would fail. I think a lot of flows do not reference a subflow anyway. Your use case seems to be a good example of sharing prediction data without a trained model.

If yes, then I was wondering why it is still necessary to specify parameters and parameters_meta_info when creating the custom flow?

From a code perspective, if you set parameters=OrderedDict(), parameters_meta_info=OrderedDict(), it should work as well. From a data perspective, IMO, it would be good to include (some) parameters information, such that others know where the predictions came from. But this depends on your use case, and another user could technically filter such flows if they require parameter information. Moreover, so far, the server does not require these keys to be filled, so it might be considered optional (unless this is a bug, @PGijsbers might know more about this).

@ArturDev42
Author

From a code perspective, if you set parameters=OrderedDict(), parameters_meta_info=OrderedDict(), it should work as well. From a data perspective, IMO, it would be good to include (some) parameters information, such that others know where the predictions came from. But this depends on your use case, and another user could technically filter such flows if they require parameter information. Moreover, so far, the server does not require these keys to be filled, so it might be considered optional (unless this is a bug, @PGijsbers might know more about this).

Can I also use an empty dict for parameter_settings when creating the run?

my_run = openml.runs.OpenMLRun(
    task_id=task_id,
    flow_id=flow_id,
    dataset_id=dataset_id,
    parameter_settings=OrderedDict(),
    data_content=predictions,
    tags=["custom_flow_test"],
    description_text="Run generated by the My Custom Flow#3 without any subflows.",
)

I get the following error when running it:

OpenMLServerException: https://test.openml.org/api/v1/xml/run/ returned code 202: 
Could not validate run xml by xsd - XML does not correspond to XSD schema. Error Element '{http://openml.org/openml}parameter_setting': Missing child element(s). 
Expected is ( {http://openml.org/openml}name ). on line 4 column 0.

@LennartPurucker
Contributor

You would need to use parameter_settings=[] for the run. Then the run would relate to your flow without any parameter values.

If you define some hyperparameters with parameters for the flow, you could specify the values of these hyperparameters here.
Here is an example for this following the ideas from the custom flow example:

some_flow = openml.flows.OpenMLFlow(
    **general,  # name, description, etc., as in the custom-flow tutorial
    parameters=OrderedDict(your_hp="1"),  # default value is 1
    parameters_meta_info=OrderedDict(your_hp=OrderedDict(description="A hyperparameter", data_type="int")),
    components=OrderedDict(),
    model=None,
)
some_flow.publish()

my_run = openml.runs.OpenMLRun(
    task_id=task_id,
    flow_id=some_flow.flow_id,  # the flow ID assigned on publish
    dataset_id=dataset_id,
    parameter_settings=[OrderedDict([("oml:name", "your_hp"), ("oml:value", 2)])],  # set value to 2 for this run
    data_content=predictions,
    description_text="Run generated by the Custom Flow tutorial.",
)
my_run.publish()
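As a quick sanity check before publishing, one could verify that parameter_settings has the shape the server's XSD apparently expects (a list of mappings, each carrying oml:name and oml:value). This is a stdlib-only sketch based on the "Missing child element(s). Expected is ( {http://openml.org/openml}name )" error above; the helper name is made up, not part of openml-python.

```python
from collections import OrderedDict

def check_parameter_settings(settings):
    """Raise ValueError if any entry lacks the 'oml:name'/'oml:value'
    keys that the run XSD seems to require (cf. the server error about
    a missing {http://openml.org/openml}name child element)."""
    for i, entry in enumerate(settings):
        for key in ("oml:name", "oml:value"):
            if key not in entry:
                raise ValueError(f"parameter_settings entry {i} is missing {key!r}")
    return True

# An empty list is valid (a run with no parameter values) ...
print(check_parameter_settings([]))
# ... and so is the shape used in the example above.
print(check_parameter_settings([OrderedDict([("oml:name", "your_hp"), ("oml:value", 2)])]))
```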

@PGijsbers
Collaborator

  1. Python allows multiple components because the XSD specifies this is legal. In my opinion, the server discarding those additional components is an error.
  2. Uploading predictions without any information about how they were created is rather useless from an experiment-store perspective. I'd suggest either giving some flow description (e.g., a link to GitHub), or creating one specific flow, used for all such uploaded runs, whose name/metadata makes clear that the origin of the predictions is unknown. In the latter case, we should probably standardise this somehow so that everyone can use that same flow for the same purpose.
