Backfill job fails when sharding on datetime[ns] #1291

Open
FirefoxMetzger opened this issue Jan 31, 2023 · 2 comments
Comments

@FirefoxMetzger
Chances are that this is a user error on my part, though I couldn't work it out from the docs. Figured I'll ask here so that we can see if there is a way to improve the docs and/or if there is an issue.

I'm trying to create a sharded/partitioned feature group that uses both a primary key and a partition key. The feature group is created successfully, but I can't insert data into it:

import pandas as pd
import numpy as np
import hopsworks
from datetime import datetime, timedelta

rng = np.random.default_rng(1234)

# insertion using numpy types works fine
np_data = (
    pd.DataFrame({
        "index": np.arange(10),
        "feature": np.arange(10, 0, -1),
    })
    .astype(np.int8)
    .assign(event_time=[datetime.now()+timedelta(seconds=int(x)) for x in rng.integers(0, 100, 10)])
)

ctx = hopsworks.login()
fs = ctx.get_feature_store()

feature_group = fs.get_or_create_feature_group(
    name="foo",
    version="1",
    description="an example",
    primary_key=["index"],
    partition_key=["event_time"]
)

feature_group.insert(np_data)  # FeatureStoreException

Here is a link to the (failed) backfill job: https://c.app.hopsworks.ai/p/16549/jobs/named/foo_1_offline_fg_backfill/executions
(I can also share the logs if necessary.)

@moritzmeister
Contributor

We would need to see the logs to know what's going on.
Are there any exceptions in the logs?

It doesn't look like it in your snippet, but are you also setting the "event_time" column as the "event time" of the feature group? There is a little guide here about the layout of feature groups together with event time.

@FirefoxMetzger
Author

Sure. Here are the logs. You can also replicate the behavior locally by running the snippet I shared above.
stderr_log.txt
stdout_log.txt

It doesn't look like it in your snippet, but are you also setting the "event_time" column as the "event time" of the feature group?

No, I'm not specifying an event time key. I didn't quite get what the implications of this are. In particular: (1) does this refer to the time at which the entry into the feature group is made, or can I specify it manually; (2) will it become a primary/unique key in the table; and (3) what does this mean for the table's sharding behavior?

The last part (3) is particularly interesting to me. BigQuery can select which shards to read, which reduces the amount of data processed (and thus cost) as well as query time (less data to load). Ideally I would like to re-create this behavior in Hopsworks, and my attempt at doing so was to use a partition_key.
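
To make the sharding question concrete, here is a minimal sketch of one possible layout. It assumes that partitioning on a coarse, day-level column derived from event_time is acceptable, rather than partitioning on the raw nanosecond timestamp. The helper add_day_partition and the column name event_date are hypothetical and not part of Hopsworks, and nothing in this thread confirms that this resolves the failing backfill job.

```python
from datetime import datetime, timedelta

import numpy as np
import pandas as pd


def add_day_partition(df: pd.DataFrame, time_col: str = "event_time") -> pd.DataFrame:
    """Return a copy of df with a day-level string column named `event_date`.

    `event_date` is a hypothetical column name chosen for this sketch.
    """
    out = df.copy()
    out["event_date"] = out[time_col].dt.strftime("%Y-%m-%d")
    return out


# Fixed base timestamp so the example is reproducible.
base = datetime(2023, 1, 31, 12, 0, 0)
df = pd.DataFrame({
    "index": np.arange(10, dtype=np.int8),
    "feature": np.arange(10, 0, -1, dtype=np.int8),
    "event_time": [base + timedelta(seconds=7 * i) for i in range(10)],
})
partitioned = add_day_partition(df)

# Number of distinct partition values with the raw timestamp vs. the day column.
print(df["event_time"].nunique(), partitioned["event_date"].nunique())
```

With the raw timestamp, every row falls into its own partition value, while the derived column groups all rows of a day together. The frame could then be inserted with partition_key=["event_date"], and, assuming get_or_create_feature_group accepts an event_time argument as the comment above suggests, with event_time="event_time".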
