# S3 Data Connector
The S3 Data Connector enables federated SQL queries on files stored in S3 or S3-compatible systems (e.g. MinIO, Cloudflare R2). If a folder is provided, all child files are loaded.

File formats are specified using the `file_format` parameter, as described in Object Store File Formats.
Example `spicepod.yml`:
```yaml
datasets:
  # Using access keys
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_auth: key
      s3_key: ${secrets:S3_KEY}
      s3_secret: ${secrets:S3_SECRET}

  # Using IAM roles or Kubernetes service accounts with assigned IAM roles
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset2.parquet
    name: cool_dataset2
    params:
      s3_auth: iam_role

  # Using a public bucket
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
```
## Dataset Schema Reference
### `from`

The S3-compatible URI to a folder or object, in the form `from: s3://<bucket>/<file>`.

Example: `from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet`
### `name`

The dataset name.

Example: `name: cool_dataset`
### `params`

- `file_format`: Specifies the data file format. Required if the format cannot be inferred from the `from` path.
  - `parquet`: Parquet file format.
  - `csv`: CSV file format.
- `s3_endpoint`: The S3 endpoint, or equivalent (e.g. a MinIO endpoint), for the S3-compatible storage. Defaults to the region endpoint. E.g. `s3_endpoint: https://my.minio.server`
- `s3_region`: Region of the S3 bucket, if region-specific. Default value is `us-east-1`. E.g. `s3_region: us-east-1`
- `client_timeout`: Specifies the timeout for S3 operations. Default value is `30s`. E.g. `client_timeout: 60s`
- `hive_partitioning_enabled`: Enables hive-style partitioning inferred from the folder structure. Defaults to `false`.

More CSV-related parameters can be configured; see CSV Parameters.
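
As a sketch of how these parameters combine, consider a CSV dataset served from an S3-compatible endpoint; the bucket, path, and endpoint below are illustrative placeholders:

```yaml
datasets:
  - from: s3://example-bucket/data/events/   # folder path: file_format cannot be inferred
    name: events
    params:
      file_format: csv
      s3_endpoint: https://my.minio.server   # S3-compatible (e.g. MinIO) endpoint
      s3_region: us-east-1
      client_timeout: 60s                    # raise the default 30s timeout
```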
## Auth

Optional for public endpoints. Use the secret replacement syntax to load the key and secret from a secret store, e.g. `${secrets:S3_SECRET}`.
- `s3_auth`: (Optional) The authentication method to use. Values are `public`, `key`, and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`.
- `s3_key`: The access key (e.g. `AWS_ACCESS_KEY_ID` for AWS).
- `s3_secret`: The secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS).
For non-public buckets, `s3_auth: key` or `s3_auth: iam_role` is required. `s3_auth: iam_role` uses the AWS IAM role of the currently running instance. The following IAM policy is the least-privileged policy required by the S3 connector:
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::yourcompany-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::yourcompany-bucketname-datasets/*"
    }
  ]
}
```
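
As one way to apply this, the policy could be attached inline to the IAM role the instance runs under using the AWS CLI; the role name and policy name below are placeholders, not names defined by Spice:

```bash
# Save the policy above as s3-connector-policy.json, then attach it
# inline to the instance's IAM role (both names are placeholders).
aws iam put-role-policy \
  --role-name spice-runtime-role \
  --policy-name spice-s3-connector \
  --policy-document file://s3-connector-policy.json
```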
### Env
```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE \
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
spice run

# Or use the CLI to write the secrets into an `.env` file
spice login s3 -k AKIAIOSFODNN7EXAMPLE -s wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
`.env`:

```bash
SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
`spicepod.yaml`:

```yaml
version: v1beta1
kind: Spicepod
name: spice-app

secrets:
  - from: env
    name: env

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${env:SPICE_S3_KEY}
      s3_secret: ${env:SPICE_S3_SECRET}
```
Learn more about Env Secret Store.
### Kubernetes

```bash
kubectl create secret generic s3 \
  --from-literal=key='AKIAIOSFODNN7EXAMPLE' \
  --from-literal=secret='wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY'
```
`spicepod.yaml`:

```yaml
version: v1beta1
kind: Spicepod
name: spice-app

secrets:
  - from: kubernetes:s3
    name: s3

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${s3:key}
      s3_secret: ${s3:secret}
```
Learn more about Kubernetes Secret Store.
### Keyring

Add new keychain entries (macOS) for the key and secret:

```bash
# Add the key to the keychain
security add-generic-password -l "S3 Key" \
  -a spiced -s spice_s3_key \
  -w AKIAIOSFODNN7EXAMPLE

# Add the secret to the keychain
security add-generic-password -l "S3 Secret" \
  -a spiced -s spice_s3_secret \
  -w wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
`spicepod.yaml`:

```yaml
version: v1beta1
kind: Spicepod
name: spice-app

secrets:
  - from: keyring
    name: keyring

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${keyring:spice_s3_key}
      s3_secret: ${keyring:spice_s3_secret}
```
Learn more about Keyring Secret Store.
## Examples
### MinIO Example

Create a dataset named `cool_dataset` from a Parquet file stored in MinIO.

```yaml
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: https://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
```
### S3 Public Example

Create a dataset named `taxi_trips` from a public S3 folder.

```yaml
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet
```
### Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

```
s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
```

Spice can automatically infer these partition columns from the directory structure when `hive_partitioning_enabled` is set to `true`.
```yaml
version: v1beta1
kind: Spicepod
name: hive_data

datasets:
  - from: s3://spiceai-public-datasets/hive_partitioned_data/
    name: hive_data_infer
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```
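
Once partition columns are inferred, filters on them prune which folders are scanned. A minimal query sketch, assuming the layout above yields string-typed `year`, `month`, and `day` columns on `hive_data_infer`:

```sql
-- Only objects under year=2024/month=03/ are scanned for this query
-- (partition column names and string typing are assumptions based on
-- the directory layout shown above).
SELECT COUNT(*)
FROM hive_data_infer
WHERE year = '2024' AND month = '03';
```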