S3 Data Connector

The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).

If a folder path is specified as the dataset source, all files within the folder will be loaded.

File formats are specified using the file_format parameter, as described in Object Store File Formats.

```yaml
datasets:
  - from: s3://spiceai-demo-datasets/taxi_trips/2024/
    name: taxi_trips
    params:
      file_format: parquet
```

Configuration

from

S3-compatible URI to a folder or file, in the format `s3://<bucket>/<path>`.

Example: `from: s3://my-bucket/path/to/file.parquet`

name

The dataset name. This will be used as the table name within Spice.

Example:

```yaml
datasets:
  - from: s3://s3-bucket-name/taxi_sample.csv
    name: cool_dataset
    params:
      file_format: csv
```

```sql
SELECT COUNT(*) FROM cool_dataset;
```

```
+----------+
| count(*) |
+----------+
| 6001215  |
+----------+
```

params

| Parameter Name | Description |
| --- | --- |
| `file_format` | Specifies the data format. Required if it cannot be inferred from the object URI. Options: `parquet`, `csv`, `json`. |
| `s3_endpoint` | S3 endpoint URL (e.g., for MinIO). Defaults to the region endpoint. E.g. `s3_endpoint: https://my.minio.server` |
| `s3_region` | S3 bucket region. Default: `us-east-1`. |
| `client_timeout` | Timeout for S3 operations. Default: `30s`. |
| `hive_partitioning_enabled` | Enable hive-style partitioning inferred from the folder structure. Default: `false`. |
| `s3_auth` | Authentication type. Options: `public`, `key`, and `iam_role`. Defaults to `public` if `s3_key` and `s3_secret` are not provided, otherwise defaults to `key`. |
| `s3_key` | Access key (e.g. `AWS_ACCESS_KEY_ID` for AWS). |
| `s3_secret` | Secret key (e.g. `AWS_SECRET_ACCESS_KEY` for AWS). |
| `allow_http` | Allow insecure HTTP connections to `s3_endpoint`. Default: `false`. |

For additional CSV parameters, see CSV Parameters.
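To show how these parameters fit together, here is an illustrative dataset definition combining several of them; the bucket, path, and dataset name are hypothetical:

```yaml
datasets:
  - from: s3://my-bucket/data/ # hypothetical bucket and path
    name: my_data
    params:
      file_format: csv
      s3_region: us-west-2
      client_timeout: 60s
      hive_partitioning_enabled: true
```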

Authentication

No authentication is required for public endpoints. For private buckets, set `s3_auth` to `key` or `iam_role`. With `iam_role`, the AWS IAM role of the running instance is used; this also covers Kubernetes Service Accounts with assigned IAM roles.

Minimum IAM policy for S3 access:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::company-bucketname-datasets/*"
    }
  ]
}
```

Examples

Public Bucket Example

Create a dataset named taxi_trips from a public S3 folder.

```yaml
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
  name: taxi_trips
  params:
    file_format: parquet
```

MinIO Example

Create a dataset named cool_dataset from a Parquet file stored in MinIO.

```yaml
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
  name: cool_dataset
  params:
    s3_endpoint: http://my.minio.server
    s3_region: 'us-east-1' # Best practice for MinIO
    allow_http: true
```

Hive Partitioning Example

Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.

For example, a dataset partitioned by year, month, and day might have a directory structure like:

```
s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
```

Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled is set to true.

```yaml
version: v1beta1
kind: Spicepod
name: hive_data

datasets:
  - from: s3://spiceai-public-datasets/hive_partitioned_data/
    name: hive_data_infer
    params:
      file_format: parquet
      hive_partitioning_enabled: true
```
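Once inferred, the partition columns can be queried like ordinary columns, and filtering on them lets unmatched directories be skipped. A sketch against the `hive_data_infer` dataset above; the comparison values are illustrative, and inferred partition values are often typed as strings:

```sql
-- Only directories under year=2024/month=03 need to be scanned.
SELECT COUNT(*)
FROM hive_data_infer
WHERE year = '2024' AND month = '03';
```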

Secrets

Spice supports three types of secret stores. Explore the different options to manage sensitive data securely.