Apache Iceberg
Preparation
To allow Shaped to connect to your Iceberg table, you need to grant Shaped’s AWS service account read-only access to your data lake. You can do this through the AWS console or with the following steps:
- Contact us via email to obtain our service account.
- Grant our service account permission to access your data lake via the appropriate IAM role.
For example, if your Iceberg table is stored in S3, you can grant our service account the s3:GetObject and s3:ListBucket permissions on the relevant S3 bucket.
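The S3 permissions mentioned above can be written as an IAM policy document. The sketch below builds one in Python and prints it as JSON so it can be pasted into the AWS console. The bucket name `my-iceberg-bucket` and the helper `build_read_only_policy` are hypothetical, and your setup may need further permissions (for example on your catalog) that are not covered here.

```python
import json


def build_read_only_policy(bucket: str) -> dict:
    """Return an IAM policy document granting read-only access to one S3 bucket.

    `bucket` is a placeholder for the bucket that holds your Iceberg table.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket is a bucket-level action, so it targets the bucket ARN.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # s3:GetObject is an object-level action, so it targets the objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = build_read_only_policy("my-iceberg-bucket")
print(json.dumps(policy, indent=2))
```

Splitting the two actions into separate statements keeps each one paired with the resource type it applies to: the bucket itself for listing, and the objects under it for reading.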
Dataset Configuration
Required fields
Field | Example | Description |
---|---|---|
schema_type | ICEBERG | Specifies the connector schema type, in this case "ICEBERG". |
catalog_type | glue or hive | Specifies the type of the Iceberg catalog. |
catalog_name | my_glue_catalog or my_hive_catalog | Specifies the name of the Iceberg catalog. |
table_name | my_iceberg_table | Specifies the name of the Iceberg table. |
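Before submitting a configuration, it can help to check locally that the required fields in the table above are present. The following is a minimal sketch, not part of the Shaped CLI: the function `check_iceberg_config` is hypothetical, and the allowed catalog types are taken only from the examples in the table, so the real list may be longer.

```python
REQUIRED_FIELDS = ("schema_type", "catalog_type", "catalog_name", "table_name")

# Assumption: only the catalog types shown in the table above are accepted.
ALLOWED_CATALOG_TYPES = {"glue", "hive"}


def check_iceberg_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []

    # Every required field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if not config.get(field):
            problems.append(f"missing required field: {field}")

    # If present, schema_type must be the Iceberg connector type.
    schema_type = config.get("schema_type")
    if schema_type and schema_type != "ICEBERG":
        problems.append(f"schema_type must be ICEBERG, got: {schema_type}")

    # If present, catalog_type must be one of the documented examples.
    catalog_type = config.get("catalog_type")
    if catalog_type and catalog_type not in ALLOWED_CATALOG_TYPES:
        problems.append(f"unsupported catalog_type: {catalog_type}")

    return problems


config = {
    "schema_type": "ICEBERG",
    "catalog_type": "glue",
    "catalog_name": "my_glue_catalog",
    "table_name": "my_iceberg_table",
}
print(check_iceberg_config(config))
```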
Optional fields
Field | Example | Description |
---|---|---|
aws_role_arn | arn:aws:iam::123456789012:role/my_role | Specifies the ARN of an AWS role to assume when accessing the Iceberg table. This is required if the Iceberg table is stored in a secure location, such as an S3 bucket with restricted access. |
aws_region | us-east-1 | Specifies the AWS region where the Iceberg table is located. This is required if the Iceberg table is stored in a region other than the default region for your AWS account. |
unique_keys | ["productId"] | Specifies a list of columns that uniquely identify a row in the table. If duplicate rows are inserted with these keys, the latest row is used. |
batch_size | 10000 | Specifies the number of records to fetch in each batch. The default value is 10000. |
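To illustrate the unique_keys behaviour described in the table above, here is a small sketch of "latest row wins" deduplication. It assumes that "latest" means the row inserted last and that rows arrive in insertion order; how Shaped implements this internally is not specified here, and `dedupe_latest` is a hypothetical helper.

```python
def dedupe_latest(rows: list[dict], unique_keys: list[str]) -> list[dict]:
    """Keep one row per unique-key combination, preferring the row seen last."""
    latest = {}
    for row in rows:
        # Build a hashable key from the values of the unique-key columns.
        key = tuple(row[k] for k in unique_keys)
        # A later row with the same key overwrites the earlier one.
        latest[key] = row
    return list(latest.values())


rows = [
    {"productId": 1, "price": 10},
    {"productId": 2, "price": 5},
    {"productId": 1, "price": 12},  # duplicate key, inserted later
]
print(dedupe_latest(rows, ["productId"]))
# → [{'productId': 1, 'price': 12}, {'productId': 2, 'price': 5}]
```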
Dataset Creation Example
Below is an example of an Iceberg dataset connector configuration:
name: my_iceberg_dataset
schema_type: ICEBERG
catalog_type: glue
catalog_name: my_glue_catalog
table_name: my_iceberg_table
aws_role_arn: arn:aws:iam::123456789012:role/my_role
aws_region: us-east-1
The following command uses the Shaped CLI to create the Iceberg dataset from the configuration above (saved as dataset.yaml) and begin syncing data into Shaped.
shaped create-dataset --file dataset.yaml