Looking into Feast [1], the open source feature store, and whether there is support for using Feast as an interface around Parquet and/or Delta tables, for use in a PySpark batch inference Databricks environment. Parquet is mentioned in the quickstart [2], which uses Parquet as the offline store component and SQLite as the online store component. The offline store component is described as intended for training. Maybe it can be useful for a batch inference case too?

I wonder what it means that the FileSource [3] is described as "for development purposes only and is not optimized for production use."

However, it does also look like a path-based SparkSource [4] is available:

from feast.infra.offline_stores.contrib.spark_offline_store.spark_source import (
    SparkSource,
)

my_spark_source = SparkSource(
    path=f"{CURRENT_DIR}/data/driver_hourly_stats",
    file_format="parquet",
    timestamp_field="event_timestamp",
    created_timestamp_column="created",
)

And I like that the Spark offline store implements [5] both get_historical_features and pull_latest_from_table_or_query, making me hopeful that it can be helpful for both batch inference and training needs.
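As a sketch of what that batch retrieval might look like (assuming a feature repo is already configured with the Spark offline store, and that a feature view named driver_hourly_stats with conv_rate/acc_rate columns exists; those names, and the repo path, are hypothetical):

```python
import pandas as pd


def make_entity_df(driver_ids, as_of) -> pd.DataFrame:
    """Entity rows for a point-in-time join: entity keys plus an
    event_timestamp column. The same shape serves both training and
    batch inference."""
    return pd.DataFrame(
        {"driver_id": driver_ids, "event_timestamp": [as_of] * len(driver_ids)}
    )


def retrieve_batch_features(repo_path: str, entity_df: pd.DataFrame) -> pd.DataFrame:
    """Join features onto entity_df as of each row's event_timestamp.

    Assumes `repo_path` points at a feature repo whose feature_store.yaml
    configures the Spark offline store; feature view and field names are
    hypothetical.
    """
    from feast import FeatureStore  # lazy import; requires feast installed

    store = FeatureStore(repo_path=repo_path)
    return store.get_historical_features(
        entity_df=entity_df,
        features=[
            "driver_hourly_stats:conv_rate",
            "driver_hourly_stats:acc_rate",
        ],
    ).to_df()
```

The difference between the two use cases would then mostly be the entity dataframe: for batch inference the event_timestamp is typically "now" for every row, while for training it is the historical label timestamps.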

Registry

There is a concept of a metadata store [6, 7] that defines feature views and entities. The registry can be used to list, retrieve, and delete metadata, backed either by a file (protobuf, or a SQLite db) or by SQL (PostgreSQL). The registry is defined in feature_store.yaml, with registry: data/registry.db (or registry.pb) for the file case, or with a SQL definition like:

project: <your project name>
provider: <provider name>
online_store: redis
offline_store: file
registry:
    registry_type: sql
    path: postgresql://postgres:mysecretpassword@127.0.0.1:55001/feast
    cache_ttl_seconds: 60

You can then refer to the registry programmatically, for listing and updates:

from feast import FeatureStore, RepoConfig, RegistryConfig

repo_config = RepoConfig(
    registry=RegistryConfig(path="gs://feast-test-gcs-bucket/registry.pb"),
    project="feast_demo_gcp",
    provider="gcp",
    offline_store="file",  # Could also be the OfflineStoreConfig e.g. FileOfflineStoreConfig
    online_store="null",  # Could also be the OnlineStoreConfig e.g. RedisOnlineStoreConfig
)
store = FeatureStore(config=repo_config)
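Given a FeatureStore handle like the one above, listing the registry's contents is straightforward (a sketch; list_entities and list_feature_views are Feast SDK methods, but the helper below and its output shape are my own):

```python
def summarize_registry(store) -> dict:
    """Return the names of entities and feature views in the registry.

    `store` is expected to be a feast.FeatureStore (or anything exposing
    list_entities() / list_feature_views() returning objects with a .name).
    """
    return {
        "entities": [e.name for e in store.list_entities()],
        "feature_views": [fv.name for fv in store.list_feature_views()],
    }
```

This kind of read-only summary is also what you would run against the more locked-down production registry, where apply/delete operations are restricted.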

And the docs [7] discuss using separate registries for staging and production, where the latter is more locked down for editing.

But importantly, this registry is also recommended to be version controlled in git, across updates!

Adding features

Adding features is referred to as feature registration [8]. "Offline feature retrieval for batch predictions" is once again called out as using get_historical_features, and the same API call is also referenced for "Training data generation", which logically makes a lot of sense!

Feast refers to Entities [9] as groups of related features.

But features can be retrieved as Feature Views [10] whether or not they are related to each other as logical entities.
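Tying these concepts together, defining an Entity and a FeatureView over the SparkSource from earlier might look roughly like this (a sketch assuming feast is installed; the entity, field, and view names are hypothetical, mirroring the driver_hourly_stats example):

```python
from datetime import timedelta


def build_driver_objects(spark_source):
    """Define an Entity and a FeatureView over a SparkSource.

    All names here (driver, driver_id, conv_rate, acc_rate) are
    hypothetical; feast is imported lazily so this module loads
    without it installed.
    """
    from feast import Entity, FeatureView, Field
    from feast.types import Float32

    # The entity supplies the join key used in point-in-time joins.
    driver = Entity(name="driver", join_keys=["driver_id"])

    # The feature view groups columns of the source under one name.
    driver_stats_fv = FeatureView(
        name="driver_hourly_stats",
        entities=[driver],
        ttl=timedelta(days=1),
        schema=[
            Field(name="conv_rate", dtype=Float32),
            Field(name="acc_rate", dtype=Float32),
        ],
        source=spark_source,
    )
    return driver, driver_stats_fv


# Registration ("feature registration" in the docs) would then be:
#   store.apply([driver, driver_stats_fv])
```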

References

  1. https://docs.feast.dev
  2. https://docs.feast.dev/v0.14-branch/getting-started/quickstart
  3. https://docs.feast.dev/v0.27-branch/reference/data-sources/file
  4. https://docs.feast.dev/v0.27-branch/reference/data-sources/spark
  5. https://docs.feast.dev/v0.27-branch/reference/offline-stores/spark#functionality-matrix
  6. https://docs.feast.dev/v0.27-branch/reference/offline-stores/spark#example
  7. https://docs.feast.dev/v0.27-branch/getting-started/concepts/registry
  8. https://docs.feast.dev/v0.27-branch/getting-started/concepts/overview#feature-registration-and-retrieval
  9. https://docs.feast.dev/v0.27-branch/getting-started/concepts/entity
  10. https://docs.feast.dev/v0.27-branch/getting-started/concepts/feature-view