
A Spicepod can contain one or more datasets referenced by relative path, or defined inline.

datasets

Inline example:

spicepod.yaml

datasets:
  - from: spice.ai/eth.beacon.eigenlayer
    name: strategy_manager_deposits
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb / sqlite / postgres
      refresh_check_interval: 1h
      refresh_mode: full # / append

spicepod.yaml

datasets:
  - from: databricks:spiceai.datasets.specific_table
    name: uniswap_eth_usd
    params:
      environment: prod
    acceleration:
      enabled: true
      mode: memory # / file
      engine: arrow # / duckdb
      refresh_check_interval: 1h
      refresh_mode: full # / append

Relative path example:

spicepod.yaml

datasets:
  - ref: datasets/eth_recent_transactions

datasets/eth_recent_transactions/dataset.yaml

from: spice.ai/eth.recent_transactions
name: eth_recent_transactions
type: overwrite
acceleration:
  enabled: true
  refresh: 1h

from​

The from field is a string that represents the Uniform Resource Identifier (URI) for the dataset. This URI is composed of two parts: a prefix indicating the Data Connector to use to connect to the dataset, and the path to the dataset within the source.

The syntax for the from field is as follows:

from: <data_connector>:<path>

Where:

  • <data_connector>: The Data Connector to use to connect to the dataset

    Currently supported data connectors:

    If the Data Connector is not explicitly specified, it defaults to spiceai.

  • <path>: The path to the dataset within the source.
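As an illustration of the syntax above, the following sketch annotates how the from value used in the Databricks example on this page splits into its two parts. The comments are explanatory only and are not part of the configuration.

```yaml
datasets:
  # <data_connector> = databricks
  # <path>           = spiceai.datasets.specific_table
  - from: databricks:spiceai.datasets.specific_table
    name: uniswap_eth_usd
```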

name​

The name of the dataset. This is used to reference the dataset in the pod manifest, as well as in external data sources.

time_column​

Optional. The name of the column that represents the temporal (time) ordering of the dataset.

Required to enable a retention policy on the dataset.

time_format​

Optional. The format of the time_column. The following values are supported:

  • unix_seconds - Default. Unix timestamp in seconds.
  • unix_millis - Unix timestamp in milliseconds.
  • ISO8601 - ISO 8601 format.

Current Limitations
  • String-based columns are assumed to be ISO8601 format.
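A minimal sketch of a dataset that declares its time column, assuming time_column and time_format are set at the dataset level alongside from and name (as the field names on this page suggest). The column name block_timestamp is hypothetical and used only for illustration.

```yaml
datasets:
  - from: spice.ai/eth.recent_transactions
    name: eth_recent_transactions
    time_column: block_timestamp # hypothetical column name
    time_format: unix_seconds    # the default, shown explicitly
```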

acceleration​

Optional. Accelerate queries to the dataset by caching data locally.

acceleration.enabled​

Enable or disable acceleration. Defaults to true.

acceleration.engine​

The acceleration engine to use. Defaults to arrow. The following engines are supported:

  • arrow - Accelerated in memory, backed by Apache Arrow DataTables.
  • duckdb - Accelerated by an embedded DuckDB database.
  • postgres - Accelerated by a Postgres database.
  • sqlite - Accelerated by an embedded SQLite database.

acceleration.mode​

Optional. The mode of acceleration. The following values are supported:

  • memory - Store acceleration data in-memory.
  • file - Store acceleration data in a file. Only supported for the duckdb and sqlite acceleration engines.
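For example, a file-backed acceleration might be configured as in the sketch below. It pairs mode: file with the duckdb engine, one of the engines listed above as supporting file mode. The dataset is the one used in the examples at the top of this page.

```yaml
datasets:
  - from: spice.ai/eth.recent_transactions
    name: eth_recent_transactions
    acceleration:
      enabled: true
      engine: duckdb # file mode requires duckdb or sqlite
      mode: file
```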

acceleration.refresh_mode​

Optional. How to refresh the dataset. The following values are supported:

  • full - Refresh the entire dataset.
  • append - Append new data to the dataset.

acceleration.refresh_check_interval​

Optional. How often data should be refreshed. Only supported for full refresh_mode datasets. For append datasets, the refresh check interval is not used.

See Duration
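A short sketch of a fully refreshed dataset that is re-fetched every hour. The 1h value follows the duration style used in the examples on this page.

```yaml
datasets:
  - from: spice.ai/eth.recent_transactions
    name: eth_recent_transactions
    acceleration:
      enabled: true
      refresh_mode: full          # refresh_check_interval applies only to full
      refresh_check_interval: 1h
```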

acceleration.refresh_sql​

Optional. Filters the data fetched from the source to be stored in the accelerator engine. Only supported for full refresh_mode datasets.

Must be of the form SELECT * FROM {name} WHERE {refresh_filter}, where {name} is the dataset name declared above and {refresh_filter} is any SQL expression that can be used to filter the data, e.g. WHERE city = 'Seattle'. This reduces the working set of data that is accelerated within Spice from the data source.

Current Limitations
  • The refresh SQL only supports filtering data from the current dataset - joining across other datasets is not supported.
  • Selecting a subset of columns isn't supported - the refresh SQL needs to start with SELECT * FROM {name}.
  • Queries for data that have been filtered out will not fall back to querying against the federated table.
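Putting the form and limitations above together, a refresh_sql entry might look like the following sketch. The city column is hypothetical (borrowed from the filter example above) and is assumed to exist in the source table. The statement keeps SELECT * and references only the dataset's own name.

```yaml
datasets:
  - from: databricks:spiceai.datasets.specific_table
    name: uniswap_eth_usd
    acceleration:
      enabled: true
      refresh_mode: full
      # Must start with SELECT * FROM <dataset name>; city is a hypothetical column.
      refresh_sql: |
        SELECT * FROM uniswap_eth_usd WHERE city = 'Seattle'
```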

acceleration.refresh_data_window​

Optional. A duration that limits refresh queries against the source to recent data, measured back from the current time. Requires time_column and time_format to also be configured. Only supported for full refresh_mode datasets.

For example, refresh_data_window: 24h will include only records with a timestamp within the last 24 hours.

See Duration
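A sketch combining refresh_data_window with the time fields it depends on, assuming time_column and time_format sit at the dataset level. The column name block_timestamp is hypothetical.

```yaml
datasets:
  - from: spice.ai/eth.recent_transactions
    name: eth_recent_transactions
    time_column: block_timestamp # hypothetical column name
    time_format: unix_seconds
    acceleration:
      enabled: true
      refresh_mode: full
      refresh_check_interval: 1h
      refresh_data_window: 24h   # only rows from the last 24 hours are fetched
```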

acceleration.params​

Optional. Parameters to pass to the acceleration engine. The parameters are specific to the acceleration engine used.

acceleration.engine_secret​

Optional. The secret store key to use for the acceleration engine connection credential. For supported data connectors, use spice login to store the secret.

acceleration.retention_check_enabled​

Optional. Enable or disable the retention policy check. Defaults to false.

acceleration.retention_period​

Optional. The retention period for the dataset. Used together with time_column and time_format to determine whether data should be retained.

Required when acceleration.retention_check_enabled is true.

See Duration

acceleration.retention_check_interval​

Optional. How often the retention policy should be checked.

Required when acceleration.retention_check_enabled is true.

See Duration
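The retention settings above could be combined as in this sketch, assuming time_column and time_format are dataset-level fields. The column name block_timestamp is hypothetical, and the durations are illustrative values in the style used elsewhere on this page.

```yaml
datasets:
  - from: spice.ai/eth.recent_transactions
    name: eth_recent_transactions
    time_column: block_timestamp # hypothetical; required for retention
    time_format: unix_seconds
    acceleration:
      enabled: true
      retention_check_enabled: true
      retention_period: 24h        # required when the check is enabled
      retention_check_interval: 1h # required when the check is enabled
```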