Data Connectors
Data Connectors provide connections to databases, data warehouses, and data lakes for federated SQL queries and data replication.
Currently supported Data Connectors include:
Name | Description | Status | Protocol/Format | Refresh Modes | Supports Ingestion | Supports Documents |
---|---|---|---|---|---|---|
`abfs` | Azure BlobFS | Alpha | Parquet, CSV | `append`, `full` | Roadmap | ✅ |
`clickhouse` | Clickhouse | Alpha | | `append`, `full` | ❌ | ❌ |
`databricks` | Databricks | Beta | Spark Connect, S3 / Delta Lake | `append`, `full` | Roadmap | ❌ |
`debezium` | Debezium | Alpha | CDC, Kafka | `append`, `full`, `changes` | ❌ | ❌ |
`delta_lake` | Delta Lake | Beta | Delta Lake | `append`, `full` | Roadmap | ❌ |
`dremio` | Dremio | Alpha | Arrow Flight SQL | `append`, `full` | ❌ | ❌ |
`file` | File | Alpha | Parquet, CSV | `append`, `full` | Roadmap | ✅ |
`flightsql` | FlightSQL | Beta | Arrow Flight SQL | `append`, `full` | ❌ | ❌ |
`ftp`, `sftp` | FTP/SFTP | Alpha | Parquet, CSV | `append`, `full` | ❌ | ✅ |
`github` | GitHub | Beta | GraphQL, REST | `append`, `full` | ❌ | ❌ |
`graphql` | GraphQL | Alpha | GraphQL | `append`, `full` | ❌ | ❌ |
`http`, `https` | HTTP(s) | Alpha | Parquet, CSV | `append`, `full` | ❌ | ❌ |
`mssql` | MS SQL Server | Alpha | Tabular Data Stream (TDS) | `append`, `full` | ❌ | ❌ |
`mysql` | MySQL | Beta | | `append`, `full` | Roadmap | ❌ |
`odbc` | ODBC | Beta | | `append`, `full` | ❌ | ❌ |
`postgres` | PostgreSQL | Beta | | `append`, `full` | Roadmap | ❌ |
`s3` | S3 | Beta | Parquet, CSV | `append`, `full` | Roadmap | ✅ |
`sharepoint` | SharePoint | Alpha | | `append`, `full` | ❌ | ✅ |
`snowflake` | Snowflake | Alpha | Arrow | `append`, `full` | Roadmap | ❌ |
`spiceai` | Spice.ai | Beta | Arrow Flight | `append`, `full` | ✅ | ❌ |
`spark` | Spark | Alpha | Spark Connect | `append`, `full` | ❌ | ❌ |
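For example, a dataset selects its connector with the `from: <connector>:<path>` field of a spicepod, and a refresh mode from the table above is set on the dataset's acceleration. A minimal sketch, assuming a PostgreSQL table `public.orders`; the table name and acceleration settings are illustrative, not prescribed:

```yaml
datasets:
  - name: orders
    # <connector>:<path>; `postgres` selects the PostgreSQL Data Connector
    from: postgres:public.orders
    acceleration:
      enabled: true
      refresh_mode: full # one of the Refresh Modes listed above
```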
Object Store File Formats
For data connectors that are object-store compatible, if a folder is provided, the file format must be specified with `params.file_format`. If a file is provided, the file format is inferred and `params.file_format` is unnecessary.
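A sketch of the folder-versus-file distinction, assuming an S3 bucket and paths invented for illustration:

```yaml
datasets:
  # Folder path: the format cannot be inferred, so file_format is required
  - name: events
    from: s3://my-bucket/events/
    params:
      file_format: parquet

  # Single file: the format is inferred from the .parquet extension,
  # so params.file_format is unnecessary
  - name: events_sample
    from: s3://my-bucket/events/2024.parquet
```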
File formats currently supported are:
Name | Parameter | Supported | Is Document Format |
---|---|---|---|
Apache Parquet | `file_format: parquet` | ✅ | ❌ |
CSV | `file_format: csv` | ✅ | ❌ |
Apache Iceberg | `file_format: iceberg` | Roadmap | ❌ |
JSON | `file_format: json` | Roadmap | ❌ |
Microsoft Excel | `file_format: xlsx` | Roadmap | ❌ |
Markdown | `file_format: md` | ✅ | ✅ |
Text | `file_format: txt` | ✅ | ✅ |
PDF | `file_format: pdf` | Alpha | ✅ |
Microsoft Word | `file_format: docx` | Alpha | ✅ |
File formats support additional parameters in `params` (like `csv_has_header`), described in File Formats.
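For instance, a headerless CSV dataset could be declared as below; `csv_has_header` is the parameter named above, while the bucket and path are assumptions for illustration:

```yaml
datasets:
  - name: raw_logs
    from: s3://my-bucket/logs/
    params:
      file_format: csv
      csv_has_header: false # the CSV files contain no header row
```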
If a format is a document format, each file will be treated as a document, as per Document Support below.
Document formats in Alpha (e.g. `pdf`, `docx`) may not parse all structure or text from the underlying documents correctly.
Document Support
If a Data Connector supports documents, then when the appropriate file format is specified (see above), each file is treated as a row in the table, with the contents of the file in the `content` column. Additional columns exist, dependent on the data connector.
Example
Consider a local filesystem:

```shell
>>> ls -la
total 232
drwxr-sr-x@ 22 jeadie staff  704 30 Jul 13:12 .
drwxr-sr-x@ 18 jeadie staff  576 30 Jul 13:12 ..
-rw-r--r--@  1 jeadie staff 1329 15 Jan  2024 DR-000-Template.md
-rw-r--r--@  1 jeadie staff 4966 11 Aug  2023 DR-001-Dremio-Architecture.md
-rw-r--r--@  1 jeadie staff 2307 28 Jul  2023 DR-002-Data-Completeness.md
```
And the spicepod:

```yaml
datasets:
  - name: my_documents
    from: file:docs/decisions/
    params:
      file_format: md
```

A document table will be created:
```shell
>>> SELECT * FROM my_documents LIMIT 3
+----------------------------------------------------+--------------------------------------------------+
| location                                           | content                                          |
+----------------------------------------------------+--------------------------------------------------+
| Users/docs/decisions/DR-000-Template.md            | # DR-000: DR Template                            |
|                                                    | **Date:** <>                                     |
|                                                    | **Decision Makers:**                             |
|                                                    | - @<>                                            |
|                                                    | - @<>                                            |
|                                                    | ...                                              |
| Users/docs/decisions/DR-001-Dremio-Architecture.md | # DR-001: Add "Cached" Dremio Dataset            |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | We use [Dremio](https://www.dremio.com/) to p... |
| Users/docs/decisions/DR-002-Data-Completeness.md   | # DR-002: Append-Only Data Completeness          |
|                                                    |                                                  |
|                                                    | ## Context                                       |
|                                                    |                                                  |
|                                                    | Our Ethereum append-only dataset is incomple...  |
+----------------------------------------------------+--------------------------------------------------+
```
Data Connector Docs
📄️ Azure BlobFS Data Connector
📄️ Clickhouse Data Connector
📄️ Databricks Data Connector
📄️ Debezium Data Connector
📄️ Delta Lake Data Connector
📄️ Dremio Data Connector
📄️ DuckDB Data Connector
📄️ File Data Connector
📄️ Flight SQL Data Connector
📄️ FTP/SFTP Data Connector
📄️ GitHub Data Connector
📄️ GraphQL Data Connector
📄️ HTTP(s) Data Connector
📄️ Microsoft SQL Server Data Connector
📄️ MySQL Data Connector
📄️ ODBC Data Connector
📄️ PostgreSQL Data Connector
📄️ S3 Data Connector
📄️ SharePoint Data Connector
📄️ Snowflake Data Connector
📄️ Apache Spark Connector
📄️ Spice.ai Data Connector