
Upload only predictions instead of trained model via run #1231

Open
ArturDev42 opened this issue Mar 21, 2023 · 9 comments
Labels
Documentation, Flow, OpenML concept

Comments

@ArturDev42

Description

My understanding is that in order to upload the results for a particular task, I need to create a run from a trained model and a task, as follows:

task = openml.tasks.get_task(119) 
clf = ensemble.RandomForestClassifier() 
run = openml.runs.run_model_on_task(clf, task)

Would it be possible to upload only the predictions (e.g. as a CSV file) for a given task, instead of needing to upload a trained model? Like it is done in Kaggle competitions or in challenges on https://eval.ai/.

@ArturDev42 ArturDev42 changed the title Upload only predictions instead of publishing run Upload only predictions instead of trained model via run Mar 21, 2023
@PGijsbers
Collaborator

PGijsbers commented Mar 21, 2023

Yes / no. Currently all runs must be linked to a flow (which is created automatically within that function). Here is a tutorial on creating a custom flow and using that to upload run results without using run_model_on_task: https://openml.github.io/openml-python/main/examples/30_extended/custom_flow_.html#sphx-glr-examples-30-extended-custom-flow-py

Hopefully that should work for you :)
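For reference, the flow creation in that tutorial boils down to filling in a handful of metadata fields. The sketch below only builds the argument dictionaries with the standard library; the actual `openml.flows.OpenMLFlow(...)` and `.publish()` calls are left as comments since they require a configured client and server. The field names follow the tutorial, while all concrete values are made-up placeholders.

```python
from collections import OrderedDict

# Metadata for a custom flow, following the fields used in the
# openml-python custom-flow tutorial. All concrete values below are
# made-up placeholders, not a real flow.
general = OrderedDict([
    ("name", "my-prediction-upload-flow"),
    ("description", "Uploads externally computed predictions for a task."),
    ("external_version", "0.1"),
])

# With no hyperparameters and no sub-flows, these can stay empty.
parameters = OrderedDict()
parameters_meta_info = OrderedDict()
components = OrderedDict()

# The actual upload would then look roughly like (see the tutorial for
# the full, authoritative argument list):
# flow = openml.flows.OpenMLFlow(**general, parameters=parameters,
#                                parameters_meta_info=parameters_meta_info,
#                                components=components, model=None, ...)
# flow.publish()
print(general["name"])
```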

@ArturDev42
Author

Thank you for the quick answer!

I am trying to follow the tutorial you referenced. I am using a flow that contains a KNeighborsClassifier (Flow ID: 90800, https://test.openml.org/f/90800), but I'm not sure how to define the parameter 'components' as listed here: https://openml.github.io/openml-python/main/generated/openml.flows.OpenMLFlow.html#openml.flows.OpenMLFlow

I defined all other parameters required by openml.flows.OpenMLFlow() according to the flow I would like to use.

How can I specify the used flow in components?

@ArturDev42
Author

ArturDev42 commented Apr 4, 2023

Inspecting the JSON file of the autosklearn_flow provided in the tutorial, I see a component identifier "component":{"identifier":"automl_tool","flow":. However, I do not see this for my own flow (https://test.openml.org/f/90800). Is this something I need to specify when creating a flow?

@LennartPurucker
Contributor

Heyho,

So I have looked into this and have come to the following conclusion (@PGijsbers feel free to correct me):

You can specify components, but you do not need to.
You could leave the parameter empty with components=OrderedDict(), and this works as far as I can tell.

If you want to reference an already existing flow in your new flow, you would need to first get the flow and then reference it as part of the components:

some_flow = openml.flows.get_flow(X)
components=OrderedDict(some_key=some_flow)

However, I think you can only reference one sub-flow. While the Python API technically allows specifying multiple sub-flows (components=OrderedDict(some_key=some_flow, some_key2=some_flow2)), they are lost when uploading them to the server.

The API also calls this key component instead of components -- I am unsure if this is intended or if the Python API simply supports more than is currently specified by the server. In fact, the Python API supports reading flows with multiple components from the server, i.e., it could parse XML files with multiple components. In short, the server only accepts the first sub-flow and ignores any other sub-flow.
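The behaviour described above can be illustrated with plain OrderedDicts: the Python side happily holds several components, but, per the observation that the server ignores everything after the first sub-flow, only the first entry effectively survives. The flow objects here are string stand-ins for what `openml.flows.get_flow(...)` would return, purely to show the structure.

```python
from collections import OrderedDict

# Stand-ins for flow objects as returned by openml.flows.get_flow(...).
some_flow, some_flow2 = "flow_A", "flow_B"

# The Python API accepts multiple sub-flows ...
components = OrderedDict(some_key=some_flow, some_key2=some_flow2)

# ... but, per the server behaviour described above, only the first
# entry is kept on upload; the rest are silently dropped.
kept = OrderedDict([next(iter(components.items()))])
print(kept)
```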

@ArturDev42, I hope this answers your questions. Feel free to ask any other questions regarding custom flows so that this issue may function as additional temporary documentation for custom flows.

@ArturDev42
Author

Hi @LennartPurucker, thanks a lot for your response!

I had actually already tried components=OrderedDict(), and creating the flow worked. If I wanted to reference an already existing flow, can I use any key of the flow? I used flow_id, which I could also find in the JSON file of the flow, i.e., components=OrderedDict(flow_id=...)

In my previous comment #1231 (comment) I was trying to use an already existing flow, because I thought this is necessary for a custom flow.

But for my specific use case, I am not sure why I would need an already existing flow. Basically, I only want to upload predictions (without a trained model) for a given task and compare them with the ground truth from that task.

Is my understanding correct that I can create a custom flow without referencing any already existing flow and basically only use it to be able to upload run results for a given task? It seemed to work for me so far.

If yes, then I was wondering why it is still necessary to specify parameters and parameters_meta_info when creating the custom flow?

Thanks!

@LennartPurucker
Contributor

If I wanted to reference an already existing flow, can I use any key of the flow?

I do not think the key itself would be enough; you would need to use openml.flows.get_flow(your_flow_key).

Is my understanding correct that I can create a custom flow without referencing any already existing flow and basically only use it to be able to upload run results for a given task? It seemed to work for me so far.

Yes, to my understanding, that should work, and I do not see a reason why it would fail. I think a lot of flows do not reference a subflow anyway. Your use case seems to be a good example of sharing prediction data without a trained model.

If yes, then I was wondering why it is still necessary to specify parameters and parameters_meta_info when creating the custom flow?

From a code perspective, if you set parameters=OrderedDict(), parameters_meta_info=OrderedDict(), it should work as well. From a data perspective, IMO, it would be good to include (some) parameters information, such that others know where the predictions came from. But this depends on your use case, and another user could technically filter such flows if they require parameter information. Moreover, so far, the server does not require these keys to be filled, so it might be considered optional (unless this is a bug, @PGijsbers might know more about this).

@ArturDev42
Author

From a code perspective, if you set parameters=OrderedDict(), parameters_meta_info=OrderedDict(), it should work as well. From a data perspective, IMO, it would be good to include (some) parameters information, such that others know where the predictions came from. But this depends on your use case, and another user could technically filter such flows if they require parameter information. Moreover, so far, the server does not require these keys to be filled, so it might be considered optional (unless this is a bug, @PGijsbers might know more about this).

Can I also use an empty dict for parameter_settings when creating the run?

my_run = openml.runs.OpenMLRun(
    task_id=task_id,
    flow_id=flow_id,
    dataset_id=dataset_id,
    parameter_settings=OrderedDict(),
    data_content=predictions,
    tags=["custom_flow_test"],
    description_text="Run generated by the My Custom Flow#3 without any subflows.",
)

I get the following error when running it:

OpenMLServerException: https://test.openml.org/api/v1/xml/run/ returned code 202: 
Could not validate run xml by xsd - XML does not correspond to XSD schema. Error Element '{http://openml.org/openml}parameter_setting': Missing child element(s). 
Expected is ( {http://openml.org/openml}name ). on line 4 column 0.

@LennartPurucker
Contributor

You would need to use parameter_settings=[] for the run. Then the run would relate to your flow without any parameter values.

If you define some hyperparameters with parameters for the flow, you could specify the values of these hyperparameters here.
Here is an example for this following the ideas from the custom flow example:

some_flow = openml.flows.OpenMLFlow(
    **general,  # name, description, etc., as in the custom-flow tutorial
    parameters=OrderedDict(your_hp="1"),  # default value is 1
    parameters_meta_info=OrderedDict(your_hp=OrderedDict(description="A hyperparameter", data_type="int")),
    components=OrderedDict(),
    model=None,
)
some_flow.publish()

my_run = openml.runs.OpenMLRun(
    task_id=task_id,
    flow_id=some_flow.flow_id,  # the flow ID assigned on publish
    dataset_id=dataset_id,
    parameter_settings=[OrderedDict([("oml:name", "your_hp"), ("oml:value", 2)])],  # set value to 2 for this run
    data_content=predictions,
    description_text="Run generated by the Custom Flow tutorial.",
)
my_run.publish()
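As a quick sanity check before publishing, one could verify that parameter_settings has the shape the server's XSD apparently expects (a list of mappings, each carrying oml:name and oml:value). This is a stdlib-only sketch based on the "Missing child element(s). Expected is ( {http://openml.org/openml}name )" error above; the helper name is made up, not part of openml-python.

```python
from collections import OrderedDict

def check_parameter_settings(settings):
    """Raise ValueError if any entry lacks the 'oml:name'/'oml:value'
    keys that the run XSD seems to require (cf. the server error about
    a missing {http://openml.org/openml}name child element)."""
    for i, entry in enumerate(settings):
        for key in ("oml:name", "oml:value"):
            if key not in entry:
                raise ValueError(f"parameter_settings entry {i} is missing {key!r}")
    return True

# An empty list is valid (a run with no parameter values) ...
print(check_parameter_settings([]))
# ... and so is the shape used in the example above.
print(check_parameter_settings([OrderedDict([("oml:name", "your_hp"), ("oml:value", 2)])]))
```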

@PGijsbers
Collaborator

  1. Python allows multiple components because the XSD specifies this is legal. In my opinion, the server discarding those additional components is an error.
  2. Uploading predictions without any information about how they were created is rather useless from an experiment-store perspective. I'd suggest either giving some flow description (e.g., a link to GitHub), or creating one specific flow, used for all such uploaded runs, whose name/metadata makes clear that the origin of the predictions is unknown. In the latter case, we should probably standardise this somehow so that everyone can use that same flow for the same purpose.
