S3 Data Connector
The S3 Data Connector enables federated SQL querying on files stored in S3 or S3-compatible systems (e.g., MinIO, Cloudflare R2).
If a folder path is specified as the dataset source, all files within the folder will be loaded.
File formats are specified using the file_format
parameter, as described in Object Store File Formats.
datasets:
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
params:
file_format: parquet
Configuration​
from
​
S3-compatible URI to a folder or file, in the format s3://<bucket>/<path>
Example: from: s3://my-bucket/path/to/file.parquet
name
​
The dataset name. This will be used as the table name within Spice.
Example:
datasets:
- from: s3://s3-bucket-name/taxi_sample.csv
name: cool_dataset
params:
file_format: csv
SELECT COUNT(*) FROM cool_dataset;
+----------+
| count(*) |
+----------+
| 6001215 |
+----------+
params
​
Parameter Name | Description |
---|---|
file_format | Specifies the data format. Required if it cannot be inferred from the object URI. Options: parquet , csv , json . |
s3_endpoint | S3 endpoint URL (e.g., for MinIO). Default is the region endpoint. E.g. s3_endpoint: https://my.minio.server |
s3_region | S3 bucket region. Default: us-east-1 . |
client_timeout | Timeout for S3 operations. Default: 30s . |
hive_partitioning_enabled | Enable partitioning using hive-style partitioning from the folder structure. Defaults to false |
s3_auth | Authentication type. Options: public , key and iam_role . Defaults to public if s3_key and s3_secret are not provided, otherwise defaults to key . |
s3_key | Access key (e.g. AWS_ACCESS_KEY_ID for AWS) |
s3_secret | Secret key (e.g. AWS_SECRET_ACCESS_KEY for AWS) |
allow_http | Allow insecure HTTP connections to s3_endpoint . Defaults to false |
For additional CSV parameters, see CSV Parameters
Authentication​
No authentication is required for public endpoints. For private buckets, set s3_auth to key or iam_role. For Kubernetes Service Accounts with assigned IAM roles, set s3_auth
to iam_role
. If using iam_role, the AWS IAM role of the running instance is used.
Minimum IAM policy for S3 access:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:ListBucket"],
"Resource": "arn:aws:s3:::company-bucketname-datasets"
},
{
"Effect": "Allow",
"Action": ["s3:GetObject"],
"Resource": "arn:aws:s3:::company-bucketname-datasets/*"
}
]
}
Examples​
Public bucket Example​
Create a dataset named taxi_trips
from a public S3 folder.
- from: s3://spiceai-demo-datasets/taxi_trips/2024/
name: taxi_trips
params:
file_format: parquet
MinIO Example​
Create a dataset named cool_dataset
from a Parquet file stored in MinIO.
- from: s3://s3-bucket-name/path/to/parquet/cool_dataset.parquet
name: cool_dataset
params:
s3_endpoint: http://my.minio.server
s3_region: 'us-east-1' # Best practice for MinIO
allow_http: true
Hive Partitioning Example​
Hive partitioning is a data organization technique that improves query performance by storing data in a hierarchical directory structure based on partition column values. This allows for efficient data retrieval by skipping unnecessary data scans.
For example, a dataset partitioned by year, month, and day might have a directory structure like:
s3://bucket/dataset/year=2024/month=03/day=15/data_file.parquet
s3://bucket/dataset/year=2024/month=03/day=16/data_file.parquet
Spice can automatically infer these partition columns from the directory structure when hive_partitioning_enabled
is set to true
.
version: v1beta1
kind: Spicepod
name: hive_data
datasets:
- from: s3://spiceai-public-datasets/hive_partitioned_data/
name: hive_data_infer
params:
file_format: parquet
hive_partitioning_enabled: true
Secrets​
Spice supports three types of secret stores:
Explore the different options to manage sensitive data securely.