A Spicepod can contain one or more datasets referenced by relative path, or defined inline.
datasets
Inline example:
spicepod.yaml
datasets:
- from: spice.ai/eth.beacon.eigenlayer
name: strategy_manager_deposits
acceleration:
enabled: true
mode: memory # / file
engine: arrow # / duckdb / sqlite / postgres
refresh_check_interval: 1h
refresh_mode: full / append # update / incremental
spicepod.yaml
datasets:
- from: databricks:spiceai.datasets.specific_table
name: uniswap_eth_usd
params:
environment: prod
acceleration:
enabled: true
mode: memory # / file
engine: arrow # / duckdb
refresh_check_interval: 1h
refresh_mode: full / append # update / incremental
Relative path example:
spicepod.yaml
datasets:
- ref: datasets/eth_recent_transactions
datasets/eth_recent_transactions/dataset.yaml
from: spice.ai/eth.recent_transactions
name: eth_recent_transactions
type: overwrite
acceleration:
enabled: true
refresh: 1h
from
​
The from
field is a string that represents the Uniform Resource Identifier (URI) for the dataset. This URI is composed of two parts: a prefix indicating the Data Connector to use to connect to the dataset, and the path to the dataset within the source.
The syntax for the from
field is as follows:
from: <data_connector>:<path>
Where:
-
<data_connector>
: The Data Connector to use to connect to the datasetCurrently supported data connectors:
If the Data Connector is not explicitly specified, it defaults to
spiceai
. -
<path>
: The path to the dataset within the source.
name
​
The name of the dataset. This is used to reference the dataset in the pod manifest, as well as in external data sources.
time_column
​
Optional. The name of the column that represents the temporal (time) ordering of the dataset.
Required to enable a retention policy on the dataset.
time_format
​
Optional. The format of the time_column
. The following values are supported:
unix_seconds
- Default. Unix timestamp in seconds.unix_millis
- Unix timestamp in milliseconds.ISO8601
- ISO 8601 format.
- String-based columns are assumed to be ISO8601 format.
acceleration
​
Optional. Accelerate queries to the dataset by caching data locally.
acceleration.enabled
​
Enable or disable acceleration, defaults to true
.
acceleration.engine
​
The acceleration engine to use, defaults to arrow
. The following engines are supported:
arrow
- Accelerated in-memory backed by Apache Arrow DataTables.duckdb
- Accelerated by an embedded DuckDB database.postgres
- Accelerated by a Postgres database.sqlite
- Accelerated by an embedded Sqlite database.
acceleration.mode
​
Optional. The mode of acceleration. The following values are supported:
memory
- Store acceleration data in-memory.file
- Store acceleration data in a file. Only supported forduckdb
andsqlite
acceleration engines.
mode
is currently only supported for the duckdb
engine.
acceleration.refresh_mode
​
Optional. How to refresh the dataset. The following values are supported:
full
- Refresh the entire dataset.append
- Append new data to the dataset.
acceleration.refresh_check_interval
​
Optional. How often data should be refreshed. Only supported for full
refresh_mode datasets. For append
datasets, the refresh check interval not used.
See Duration
acceleration.refresh_sql
​
Optional. Filters the data fetched from the source to be stored in the accelerator engine. Only supported for full
refresh_mode datasets.
Must be of the form SELECT * FROM {name} WHERE {refresh_filter}
. {name}
is the dataset name declared above, {refresh_filter}
is any SQL expression that can be used to filter the data, i.e. WHERE city = 'Seattle'
to reduce the working set of data that is accelerated within Spice from the data source.
- The refresh SQL only supports filtering data from the current dataset - joining across other datasets is not supported.
- Selecting a subset of columns isn't supported - the refresh SQL needs to start with
SELECT * FROM {name}
. - Queries for data that have been filtered out will not fall back to querying against the federated table.
acceleration.refresh_data_window
​
Optional. A duration to filter dataset refresh source queries to recent data (duration into past from now). Requires time_column
and time_format
to also be configured. Only supported for full
refresh mode datasets.
For example, refresh_data_window: 24h
will include only records with a timestamp within the last 24 hours.
See Duration
acceleration.params
​
Optional. Parameters to pass to the acceleration engine. The parameters are specific to the acceleration engine used.
acceleration.engine_secret
​
Optional. The secret store key to use the acceleration engine connection credential. For supported data connectors, use spice login
to store the secret.
acceleration.retention_check_enabled
​
Optional. Enable or disable retention policy check, defaults to false
.
acceleration.retention_period
​
Optional. The retention period for the dataset. Combine with time_column
and time_format
to determine if the data should be retained or not.
Required when acceleration.retention_check_enabled
is true
.
See Duration
acceleration.retention_check_interval
​
Optional. How often the retention policy should be checked.
Required when acceleration.retention_check_enabled
is true
.
See Duration