S3 Data Connector

The S3 Data Connector enables federated SQL query on files stored in S3 or S3-compatible systems (e.g. MinIO, Cloudflare R2).

If a folder is provided, all child files will be loaded.

File formats are specified using the file_format parameter, as described in Object Store File Formats.

Example spicepod.yml:

datasets:
  # Using access keys
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_auth: key
      s3_key: ${secrets:S3_KEY}
      s3_secret: ${secrets:S3_SECRET}

  # Using IAM roles or Kubernetes service accounts with assigned IAM roles
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset2.parquet
    name: cool_dataset2
    params:
      s3_auth: iam_role

  # Using a public bucket
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
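
Once the runtime is started, the registered datasets can be queried with SQL, e.g. via the Spice SQL REPL (spice sql). A minimal sketch using the dataset names defined above:

-- Query the dataset registered from the public bucket
SELECT *
FROM taxi_trips
LIMIT 10;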

Dataset Schema Reference

from

The S3-compatible URI to a folder or object, in the form from: s3://<bucket>/<path>

Example: from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet

name

The dataset name.

Example: name: cool_dataset

params

  • file_format: Specifies the data file format. Required if the format cannot be inferred from the from path.
    • parquet: Parquet file format.
    • csv: CSV file format.
  • s3_endpoint: The S3 endpoint, or equivalent (e.g. MinIO endpoint), for the S3-compatible storage. Defaults to the region endpoint. E.g. s3_endpoint: https://my.minio.server
  • s3_region: Region of the S3 bucket, if region-specific. Default value is us-east-1. E.g. s3_region: us-east-1
  • client_timeout: Specifies the timeout for S3 operations. Default value is 30s. E.g. client_timeout: 60s
  • hive_partitioning_enabled: Enables hive-style partitioning inferred from the folder structure. Defaults to false.

More CSV-related parameters can be configured; see CSV Parameters.
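
For illustration, a minimal sketch combining several of these parameters; the bucket name, path, and dataset name are placeholders:

datasets:
  - from: s3://my-bucket/data/
    name: my_data
    params:
      file_format: csv
      s3_endpoint: https://my.minio.server
      s3_region: us-east-1
      client_timeout: 60s
      hive_partitioning_enabled: false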

Auth

Optional for public endpoints. Use the secret replacement syntax to load credentials from a secret store, e.g. ${secrets:S3_SECRET}.

  • s3_auth: (Optional) The authentication method to use. Values are public, key, and iam_role. Defaults to public if s3_key and s3_secret are not provided, otherwise defaults to key.
  • s3_key: The access key (e.g. AWS_ACCESS_KEY_ID for AWS).
  • s3_secret: The secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS).

For non-public buckets, s3_auth: key or s3_auth: iam_role is required. s3_auth: iam_role uses the AWS IAM role of the currently running instance. The following IAM policy is the least-privileged policy required by the S3 connector:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::yourcompany-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::yourcompany-bucketname-datasets/*"
    }
  ]
}
Provide the access key and secret to the runtime via environment variables:

SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE \
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY \
spice run

# Or use the CLI to write the secrets to an `.env` file
spice login s3 -k AKIAIOSFODNN7EXAMPLE -s wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

.env

SPICE_S3_KEY=AKIAIOSFODNN7EXAMPLE
SPICE_S3_SECRET=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

spicepod.yaml

version: v1beta1
kind: Spicepod
name: spice-app

secrets:
  - from: env
    name: env

datasets:
  - from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
    name: cool_dataset
    params:
      s3_region: us-east-1
      s3_key: ${env:SPICE_S3_KEY}
      s3_secret: ${env:SPICE_S3_SECRET}

Learn more about Env Secret Store.

Examples

MinIO Example

Create a dataset named cool_dataset from a Parquet file stored in MinIO.

- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: https://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO

S3 Public Example

Create a dataset named taxi_trips from a public S3 folder.

- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet

Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet

Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.

version: v1beta1
kind: Spicepod
name: hive_data

datasets:
  - from: s3://spiceai-public-datasets/hive_partitioned_data/
    name: hive_data_infer
    params:
      file_format: parquet
      hive_partitioning_enabled: true
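
Once loaded, the inferred partition columns (year, month, and day in the layout above) can be filtered on so that only matching folders are scanned. A minimal sketch, assuming the partition values are read as strings:

-- Count rows for a single day; only the matching partition folder is scanned
SELECT COUNT(*)
FROM hive_data_infer
WHERE year = '2024' AND month = '03' AND day = '15';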