Backfill job fails when sharding on datetime[ns] #1291

FirefoxMetzger · 2023-01-31T08:48:34Z

Chances are that this is a user error on my part, though I couldn't work it out from the docs. I figured I'd ask here so that we can see whether the docs can be improved and/or whether there is an actual issue.

I'm trying to create a sharded/partitioned feature group, which uses both a primary key and a partition key. The feature group is created successfully, but inserting data into it fails:

import pandas as pd
import numpy as np
import hopsworks
from datetime import datetime, timedelta

rng = np.random.default_rng(1234)

# insertion using numpy types works fine
np_data = (
    pd.DataFrame({
        "index": np.arange(10),
        "feature": np.arange(10, 0, -1),
    })
    .astype(np.int8)
    .assign(event_time=[datetime.now()+timedelta(seconds=int(x)) for x in rng.integers(0, 100, 10)])
)

ctx = hopsworks.login()
fs = ctx.get_feature_store()

feature_group = fs.get_or_create_feature_group(
    name="foo",
    version="1",
    description="an example",
    primary_key=["index"],
    partition_key=["event_time"]
)

feature_group.insert(np_data)  # raises FeatureStoreException

Here is a link to the (failed) backfill job: https://c.app.hopsworks.ai/p/16549/jobs/named/foo_1_offline_fg_backfill/executions
(I can also share the logs if necessary.)

moritzmeister · 2023-02-01T08:04:35Z

We would need to see the logs to know what's going on.
Are there any exceptions in the logs?

It doesn't look like it in your snippet, but are you also setting the "event_time" column, as "event time" of the feature group? There is a little guide here about the layout of feature groups together with event time.

FirefoxMetzger · 2023-02-02T08:02:35Z

Sure. Here are the logs. You can also replicate the behavior locally by running the snippet I shared above.
stderr_log.txt
stdout_log.txt

> It doesn't look like it in your snippet, but are you also setting the "event_time" column, as "event time" of the feature group?

No, I'm not specifying an event time key. I didn't quite understand what the implications of this are. In particular: (1) does this refer to the time at which the entry into the feature group is made, or can I specify it manually; (2) will it become a primary/unique key in the table; and (3) what does this mean for the table's sharding behavior?

The last part (3) is particularly interesting to me. BigQuery has the ability to select which shards to read, which reduces the data processed (and thus cost) and the query time (less data to load). Ideally I would like to re-create this behavior in Hopsworks, and my attempt at doing so was to use a partition_key.
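
To illustrate point (3), here is a minimal sketch of partition pruning in plain Python. It does not use the Hopsworks or BigQuery APIs. `partition_by_day` and `read_partitions` are hypothetical helpers, and deriving a day-level key from `event_time` is an assumption about how one might keep the number of partitions small. Nothing in this thread confirms that the partition granularity is what makes the backfill job fail.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def partition_by_day(rows):
    """Group rows into partitions keyed by the calendar day of event_time.

    Hypothetical helper: a day-level key yields one partition per day,
    whereas the raw timestamp would yield one partition per distinct value.
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["event_time"].date()].append(row)
    return dict(partitions)


def read_partitions(partitions, wanted_days):
    """Return only the rows stored under the requested partition keys.

    Partitions that are not requested are never touched, which is the
    cost and latency saving that partition pruning is meant to provide.
    """
    selected = []
    for day in wanted_days:
        selected.extend(partitions.get(day, []))
    return selected


# Hypothetical sample data: ten rows, eight hours apart, spanning four days.
start = datetime(2023, 1, 30, 0, 0, 0)
rows = [
    {"index": i, "feature": 10 - i, "event_time": start + timedelta(hours=8 * i)}
    for i in range(10)
]

partitions = partition_by_day(rows)

# Read a single day's partition instead of scanning all ten rows.
jan_31 = read_partitions(partitions, [date(2023, 1, 31)])
```

With this sample data the ten distinct timestamps collapse into four day-level partitions, and reading 31 January touches only the three rows stored under that key.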
