S3
The S3 connector allows you to create a Shaped model directly from a set of Parquet, CSV, TSV, or JSONL files within an S3 bucket.
Shaped fetches data from the given S3 bucket periodically, picking up new files as they are added to the bucket. To ensure your model is trained on the most recent data, make sure you push the latest data to S3 regularly.
In order for Shaped to incrementally load data from S3, the files must be lexicographically sorted by name, and the file names must be unique. The simplest way to achieve this is to include a timestamp suffix in the file names, such that the most recent files are lexicographically last.
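As a rough illustration of the naming scheme described above, the following Python sketch builds file names with a UTC timestamp suffix. The `timestamped_key` helper, the `events` prefix, and the exact timestamp layout are assumptions made for this example and are not part of the Shaped API; any zero-padded, year-first format should sort the same way.

```python
from datetime import datetime, timezone


def timestamped_key(prefix: str, when: datetime, extension: str = "parquet") -> str:
    """Build a file name whose timestamp suffix sorts lexicographically by time.

    Hypothetical helper for illustration only.
    """
    # Zero-padded, year-first UTC components make string order match time order.
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{prefix}_{stamp}.{extension}"


older = timestamped_key("events", datetime(2024, 1, 9, 23, 59, 0, tzinfo=timezone.utc))
newer = timestamped_key("events", datetime(2024, 1, 10, 0, 0, 0, tzinfo=timezone.utc))

# The most recent file sorts last, as the connector requires.
assert older < newer
```

Note that this layout has one-second resolution, so two files written within the same second would share a name. If that can happen in your pipeline, add a finer-grained component to keep names unique.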
Resource Access
Shaped needs access to the S3 bucket that contains your files. This can be done by modifying the bucket policy to grant explicit read access to the Shaped AWS Customer Data Access IAM User.
To grant access:
Update the bucket policy attached to your S3 bucket to grant the following permissions:
- s3:GetObject
- s3:ListBucket
In this bucket policy, grant the Shaped AWS Customer Data Access IAM User permissions to access the bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::{shaped_account_id}:user/CustomerDataAccessUser"
},
"Action": [
"s3:GetObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::{your_bucket}",
"arn:aws:s3:::{your_bucket}/*"
]
}
]
}
The Shaped AWS Account ID is available on request from the Shaped team.
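If you prefer to generate the policy rather than edit the placeholders by hand, a minimal Python sketch might look like the following. The `build_bucket_policy` function is a hypothetical helper, and the account ID and bucket name passed to it are placeholder values; substitute the account ID you receive from the Shaped team and your own bucket name.

```python
import json


def build_bucket_policy(shaped_account_id: str, bucket: str) -> dict:
    """Fill in the bucket policy template with concrete values (hypothetical helper)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{shaped_account_id}:user/CustomerDataAccessUser"
                },
                "Action": ["s3:GetObject", "s3:ListBucket"],
                # The bucket ARN covers s3:ListBucket; the /* ARN covers s3:GetObject.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_bucket_policy("123456789012", "shaped-data")
print(json.dumps(policy, indent=2))
```

The printed JSON can then be pasted into the bucket policy editor for your bucket.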
Dataset Configuration
Required fields
Field | Example | Description |
---|---|---|
schema_type | CUSTOM | Specifies the connector schema type, in this case "CUSTOM". |
column_schema | | The schema of the data contained within files. |
s3_path | s3://shaped-data/product/ | The path to the directory within the bucket, in the format s3://{bucket}/{path}/. If a directory path is provided, it is expanded to include all files within the directory. If a glob pattern is provided, it is used to filter the files, e.g. s3://shaped-data/product/*.parquet.gz. |
s3_format | PARQUET | The type of file to sync from S3 into Shaped. Currently supports PARQUET, CSV, TSV, and JSON. |
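To show how the required fields fit together, here is a small Python sketch that assembles a configuration and checks it against the table above. The `validate_dataset_config` function is a hypothetical local check, not a Shaped API call, and the shape of `column_schema` (a mapping of column name to type name) along with its column names and types is an assumption made for this example.

```python
from typing import List

REQUIRED_FIELDS = ("schema_type", "column_schema", "s3_path", "s3_format")
SUPPORTED_FORMATS = {"PARQUET", "CSV", "TSV", "JSON"}


def validate_dataset_config(config: dict) -> List[str]:
    """Return a list of problems found in the config (empty if none)."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if name not in config
    ]
    if "s3_path" in config and not str(config["s3_path"]).startswith("s3://"):
        problems.append("s3_path must have the format s3://{bucket}/{path}/")
    if "s3_format" in config and config["s3_format"] not in SUPPORTED_FORMATS:
        problems.append(f"unsupported s3_format: {config['s3_format']}")
    return problems


config = {
    "schema_type": "CUSTOM",
    # Assumed layout: column name mapped to a type name.
    "column_schema": {"product_id": "String", "price": "Float"},
    "s3_path": "s3://shaped-data/product/",
    "s3_format": "PARQUET",
}

assert validate_dataset_config(config) == []
```

A check like this can catch a missing field or a mistyped format before the configuration is submitted.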