Issues uploading to test server (production works fine) #1178

Open
PGijsbers opened this issue Feb 3, 2023 · 0 comments

I experienced issues uploading a dataset to the OpenML test server.
There seem to be no issues with the live server.
I figured I'd report it anyway because of the unusual behavior, but feel free to ignore.
The exact error I receive depends on the amount of data uploaded.
The full dataset in ARFF format is around 80 MB, but even small subsamples (~5%) give errors.

In the example below, SUBSAMPLE_SIZE determines how many rows of the data are uploaded:

  • Uploading works fine when SUBSAMPLE_SIZE < 7710.
  • Uploading gives an error when 7710 <= SUBSAMPLE_SIZE < 30434: openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/ returned code 130: Problem with file uploading - File dataset: Filesize is 0 bytes, which is not allowed.
  • Uploading gives a different error when 30434 <= SUBSAMPLE_SIZE: openml.exceptions.OpenMLServerException: https://test.openml.org/api/v1/xml/data/ returned code 102: No authentication (Please provide API key for all requests other than HTTP GET) - None
SUBSAMPLE_SIZE = -1  # rows to upload; note that iloc[:-1] keeps every row except the last

import openml
import numpy as np
import pickle
import os
import textwrap
from openml.datasets.functions import create_dataset

openml.config.start_using_configuration_for_example()

long_description = """
TALLO - a global tree allometry and crown architecture database.

This is the Tallo dataset described in Jucker et al. (2022) but recreated with Python scripts from Laurens Bliek.
The scripts can be found at https://github.com/lbliek/TALLO_ML/tree/TALLO_ML1.

The Tallo database (v1.0.0) is a collection of 498,838 georeferenced 
and taxonomically standardized records of individual trees for which stem diameter, 
height and/or crown radius have been measured. 
Data were compiled from 61,856 globally distributed sites and include measurements for 5,163 tree species (Jucker et al., 2022).

The constructed data set and associated meta-data is for use case 3 in the referenced paper: 
predicting tree height based on climate data and stem diameter.
This means a large portion of data is ignored by default (set as attributes to be ignored).

Note: Samples are taken from different sources spanning decades, and multiple samples may be taken from distinct trees in the same approximate geographical location.
These relationships between samples are ignored when generating tasks on OpenML.

List with a description for each feature:

|Field|Description|
|---|---|
|tree_id|Unique tree identifier code|
|division|Major phylogenetic division (Angiosperm or Gymnosperm)|
|family|Family name|
|genus|Genus name|
|species|Species binomial name|
|latitude|Latitude (in decimal degrees)|
|longitude|Longitude (in decimal degrees)|
|stem_diameter_cm|Stem diameter (in cm). For multi-stemmed trees values for individual stems (Di) were pooled into a single value calculated as: sqrt(sum(Di^2)). Log-scaled (base 10).|
|height_m|Tree height (in m). Log-scaled (base 10).|
|crown_radius_m|Crown radius (in m)|
|height_outlier|Identifier for trees with height values flagged as outliers (Y = outlier; N = non-outlier)|
|crown_radius_outlier|Identifier for trees with crown radius values flagged as outliers (Y = outlier; N = non-outlier)|
|reference_id|Reference code corresponding to the data source from which a record was obtained (see 'Reference_look_up_table.csv' for details on data sources).|
|realm|"Biogeographic realm. Follows the classification of Olson et al. (2001) BioScience, 51, 933-938"|
|biome|"Biome type. Follows the classification of Olson et al. (2001) BioScience, 51, 933-938"|
|mean_annual_rainfall|Mean annual rainfall (in mm/yr). Values were obtained from the WorldClim2 database based on the geographic coordinates of the tree.|
|rainfall_seasonality|Rainfall seasonality (coefficient of variation). Values were obtained from the WorldClim2 database based on the geographic coordinates of the tree.|
|aridity_index|Aridity index (calculated as mean annual precipitation / potential evapotranspiration). Values were obtained from the Global Aridity Index and Potential Evapotranspiration Climate Database (v2) based on the geographic coordinates of the tree. Log-scaled (base 10).|
|mean_annual_temperature|Mean annual temperature (in degree C). Values were obtained from the WorldClim2 database based on the geographic coordinates of the tree.|
|maximum_temperature|Maximum temperature of the warmest month (in degree C). Values were obtained from the WorldClim2 database based on the geographic coordinates of the tree.|
|AT_AI| Ratio of 'mean annual temperature' over log-scaled 'aridity index'.|
"""
long_description = textwrap.dedent(long_description)

with open(os.path.expanduser('~/repositories/tmp/TALLO_ML/Tallo_data.pkl'), 'rb') as file:
    data = pickle.load(file)

data['aridity_index'] = np.log10(data['aridity_index'])
data['stem_diameter_cm'] = np.log10(data['stem_diameter_cm'])
data['AT_AI'] = data['mean_annual_temperature']/data['aridity_index']
data["height_m"] = np.log10(data["height_m"])

#data = data[["height_m", "stem_diameter_cm"]]
data = data.iloc[:SUBSAMPLE_SIZE, :]


dataset = create_dataset(
    # description meta-data
    name="Tallo",
    description=long_description,
    creator="Jucker et al.",
    contributor="Laurens Bliek",
    collection_date=None,
    language="English",
    licence="CC BY 4.0",
    citation='Jucker, Tommaso, et al. "Tallo: A global tree allometry and crown architecture database." Global change biology 28.17 (2022): 5254-5268.',
    original_data_url="https://zenodo.org/record/6637599#.Y9vAii8w35g",
    paper_url="https://onlinelibrary.wiley.com/doi/full/10.1111/gcb.16302",

    # data properties
    attributes="auto",
    data=data,
    default_target_attribute="height_m",
    ignore_attribute=[
        'reference_id', 'crown_radius_m', 'biome',
        'height_outlier', 'realm', 'species', 'crown_radius_outlier',
        'mean_annual_rainfall', 'longitude', 'division', 'biome_division',
        'genus', 'family', 'latitude'
    ],
    row_id_attribute="tree_id",

)

dataset.publish()
print(dataset.openml_url)
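
For anyone trying to reproduce the row-count thresholds listed above, a bisection over SUBSAMPLE_SIZE saves a lot of manual uploads. The following is only a minimal sketch, not part of the original script: `first_failing_size` and `fake_upload_ok` are hypothetical names, and the sketch assumes failures are monotonic (once a size fails, every larger size fails too). The stand-in uploader simply encodes the first threshold reported in this issue so the example runs without contacting any server.

```python
from typing import Callable


def first_failing_size(upload_ok: Callable[[int], bool], low: int, high: int) -> int:
    """Return the smallest n in (low, high] for which upload_ok(n) is False.

    Assumes upload_ok(low) is True, upload_ok(high) is False, and that
    failures are monotonic in the number of rows.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if upload_ok(mid):
            low = mid   # mid rows still upload fine, so the threshold is above mid
        else:
            high = mid  # mid rows already fail, so the threshold is at or below mid
    return high


# Hypothetical stand-in for a real upload attempt; it mimics the first
# threshold reported in this issue instead of calling the test server.
def fake_upload_ok(n_rows: int) -> bool:
    return n_rows < 7710


threshold = first_failing_size(fake_upload_ok, low=1, high=30434)
print(threshold)
```

To use this against the test server, `upload_ok(n)` would presumably wrap the `create_dataset(...)` and `dataset.publish()` calls from the script above with `data.iloc[:n, :]`, returning False when `openml.exceptions.OpenMLServerException` is raised. Each probe is a real upload, so the roughly 15 probes needed for this range are still not free.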