
v0.9.0

@sdreyer sdreyer released this 06 Jun 04:08
· 497 commits to main since this release
0c7f68a

What's New

Vector Database and Embedding Support

You can use Featureform to define and orchestrate data pipelines that generate embeddings. Featureform can write them into Redis for nearest-neighbor lookup. This also allows users to version, re-use, and manage embeddings declaratively.

Registering Redis for use as a Vector Store (the process is the same as a typical Redis registration)

ff.register_redis(
    name="redis",
    description="Example inference store",
    team="Featureform",
    host="0.0.0.0",
    port=6379,
)

A Pipeline to Generate Embeddings from Text

docs = spark.register_file(...)

@spark.df_transform(
    inputs=[docs],
)
def embed_docs(docs):
    # Embed each document's text with OpenAI's ada-002 embedding model.
    docs["embedding"] = docs["text"].map(
        lambda txt: openai.Embedding.create(
            model="text-embedding-ada-002",
            input=txt,
        )["data"][0]["embedding"]
    )
    return docs

Defining and Versioning an Embedding

@ff.entity
class Article:
    embedding = ff.Embedding(
        embed_docs[["id", "embedding"]],
        dims=1536,  # text-embedding-ada-002 produces 1536-dimensional vectors
        vector_db=redis,
    )

The same embedding pinned to an explicit variant for versioning:

@ff.entity
class Article:
    embedding = ff.Embedding(
        embed_docs[["id", "embedding"]],
        dims=1536,
        variant="test-variant",
        vector_db=redis,
    )

Performing a Nearest Neighbor Lookup

client.Nearest(Article.embedding, "id_123", 25)
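The call above returns the 25 entities whose stored vectors are closest to the embedding of "id_123". As a purely illustrative sketch (not Featureform's or Redis's implementation), a nearest-neighbor lookup over embeddings amounts to ranking vectors by a similarity metric such as cosine similarity:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(vectors, query_id, k):
    # Rank every other stored vector by similarity to the query vector
    # and return the ids of the top k, most similar first.
    query = vectors[query_id]
    scored = [(cosine_similarity(query, vec), doc_id)
              for doc_id, vec in vectors.items() if doc_id != query_id]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Toy 2-dimensional "vector store" for illustration.
vectors = {
    "id_123": [1.0, 0.0],
    "id_456": [0.9, 0.1],
    "id_789": [0.0, 1.0],
}
print(nearest(vectors, "id_123", 2))  # → ['id_456', 'id_789']
```

Production vector stores use approximate indexes (e.g. HNSW) rather than this exhaustive scan, but the ranking semantics are the same.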

Interact with Training Sets as Dataframes

You can already interact with sources as dataframes; this release adds the same functionality to training sets as well.

Interacting with a training set as Pandas

import featureform as ff

client = ff.Client(...)
df = client.training_set("fraud", "simple").dataframe()
print(df.head())
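A common next step is splitting the returned dataframe into features and a label for model training. A minimal sketch, using a stand-in dataframe with hypothetical column names (the real columns depend on how the "fraud" training set was registered):

```python
import pandas as pd

# Stand-in for client.training_set("fraud", "simple").dataframe();
# the column names here are hypothetical.
df = pd.DataFrame({
    "avg_transaction": [10.0, 250.0, 35.0],
    "num_purchases": [3, 1, 7],
    "label": [0, 1, 0],
})

X = df.drop(columns=["label"])  # feature columns
y = df["label"]                 # training label
print(X.shape)  # → (3, 2)
```

From here, X and y can be fed directly into any dataframe-friendly training library.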

Enhanced Scheduling across Offline Stores

Featureform supports cron syntax for scheduling transformations to run. This release overhauls that functionality to make it more stable and efficient, and adds more verbose error messages.

A transformation that runs every hour on Snowflake

@snowflake.sql_transform(schedule="0 * * * *")
def avg_transaction_price():
    return "SELECT user, AVG(price) FROM {{transaction}} GROUP BY user"
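The schedule string is standard five-field cron: minute, hour, day-of-month, month, day-of-week, so "0 * * * *" fires at minute 0 of every hour. As an illustrative sketch of how such an expression matches a timestamp (not Featureform's scheduler, and covering only "*" and exact numbers, not ranges, lists, or steps):

```python
from datetime import datetime

def field_matches(spec, value):
    # "*" matches anything; otherwise require an exact numeric match.
    return spec == "*" or int(spec) == value

def cron_matches(expr, ts):
    # Minimal matcher for the five-field cron form used above.
    # (Day-of-week numbering is simplified here; real cron also
    # supports ranges, lists, and step values.)
    minute, hour, dom, month, dow = expr.split()
    return (field_matches(minute, ts.minute)
            and field_matches(hour, ts.hour)
            and field_matches(dom, ts.day)
            and field_matches(month, ts.month)
            and field_matches(dow, ts.weekday()))

# "0 * * * *" fires at minute 0 of every hour.
print(cron_matches("0 * * * *", datetime(2023, 6, 6, 14, 0)))   # → True
print(cron_matches("0 * * * *", datetime(2023, 6, 6, 14, 30)))  # → False
```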

Run Pandas Transformations on K8s with S3

Featureform schedules and runs your transformations for you. We support running Pandas directly: Featureform spins up a Kubernetes job to run it. This isn't a replacement for distributed processing frameworks like Spark (which we also support), but it's a great option for teams that are already using Pandas in production.

Defining our Pandas on Kubernetes Provider

aws_creds = ff.AWSCredentials(
    aws_access_key_id="<aws_access_key_id>",
    aws_secret_access_key="<aws_secret_access_key>",
)

s3 = ff.register_s3(
    name="s3",
    credentials=aws_creds,
    bucket_path="<s3_bucket_path>",
    bucket_region="<s3_bucket_region>",
)

pandas_k8s = ff.register_k8s(
    name="k8s",
    description="Native featureform kubernetes compute",
    store=s3,
    team="featureform-team",
)

Registering a file in S3 and a Transformation on it

src = pandas_k8s.register_file(...)

@pandas_k8s.df_transform(inputs=[src])
def transform(src):
    return src.groupby("CustomerID")["TransactionAmount"].mean()