Feature types

Shaped supports a wide range of feature types for training and scoring models. Feature types determine how data is processed, encoded, and used in model training. You can specify feature types explicitly using schema_override in your engine configuration, or let Shaped infer types automatically.

Specifying feature types

Use schema_override in the data block to specify how columns should be interpreted:

data:
  item_table:
    name: movies
    type: table
  schema_override:
    item:
      id: item_id
      features:
        - name: genre
          type: Sequence[TextCategory]
        - name: poster_url
          type: Image
        - name: movie_title
          type: Text
        - name: movie_age
          type: Numerical
        - name: primary_genre
          type: TextCategory
      created_at: created_at

If you don't specify a column in schema_override, Shaped will infer the type automatically based on the column data.
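As a rough illustration of that split, the sketch below mirrors the schema_override block above as a Python dict and separates columns that are typed explicitly from those left to inference. The helper split_columns and the extra columns runtime_minutes and director are hypothetical and are not part of Shaped's API.

```python
# In-memory mirror of the schema_override block above (illustrative only).
schema_override = {
    "item": {
        "id": "item_id",
        "features": [
            {"name": "genre", "type": "Sequence[TextCategory]"},
            {"name": "poster_url", "type": "Image"},
            {"name": "movie_title", "type": "Text"},
            {"name": "movie_age", "type": "Numerical"},
            {"name": "primary_genre", "type": "TextCategory"},
        ],
        "created_at": "created_at",
    }
}


def split_columns(table_columns, entity_override):
    """Return (explicit, inferred) column lists for one entity."""
    named = {f["name"] for f in entity_override.get("features", [])}
    # id and created_at are mapped through dedicated keys, not the features list.
    for key in ("id", "created_at"):
        if key in entity_override:
            named.add(entity_override[key])
    explicit = [c for c in table_columns if c in named]
    inferred = [c for c in table_columns if c not in named]
    return explicit, inferred


# Assumed column list for the movies table; the last two names are made up.
columns = [
    "item_id", "genre", "poster_url", "movie_title", "movie_age",
    "primary_genre", "created_at", "runtime_minutes", "director",
]
explicit, inferred = split_columns(columns, schema_override["item"])
print(inferred)  # ['runtime_minutes', 'director']
```

In this sketch, runtime_minutes and director are the columns whose types would be left to automatic inference.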

Supported feature types

Basic types

Category: Discrete categorical values (e.g., product categories, event types). Use for low-cardinality categorical data.

TextCategory: Text-based categorical values (e.g., product names, user segments). Similar to Category but optimized for text strings.

HighCardinalityCategory: High-cardinality categorical values that are still treated as categories (e.g., user IDs or item IDs when used as features).

Numerical: Continuous numeric values (e.g., price, rating, count).

Boolean: Binary true/false values.

Binary: Binary numeric values (0 or 1).

Timestamp: Temporal data such as timestamps and dates (e.g., created_at, last_signin).

Text: Unstructured text data (e.g., descriptions, bios, reviews). Text features can be used to create embeddings for semantic search.

Url: URL strings. Often used for image or media URLs.

Sequence types

Sequence types represent ordered lists of values:

Sequence[Category]: Ordered list of categorical values (e.g., ['action', 'drama', 'thriller']).

Sequence[TextCategory]: Ordered list of text-based categories.

Sequence[Text]: Ordered list of text strings (e.g., tags, keywords).

Sequence[Numerical]: Ordered list of numeric values.

Sequence[Binary]: Ordered list of binary values.

Set types

Set types represent unordered collections of unique values:

Set[Category]: Unordered set of categorical values.

Set[TextCategory]: Unordered set of text-based categories.

Set[Text]: Unordered set of text strings.

Set[Numerical]: Unordered set of numeric values.

Set[Binary]: Unordered set of binary values.
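The difference between the two container kinds can be sketched in plain Python. The helpers parse_feature_type and normalize are hypothetical and only illustrate the semantics described above: a Sequence keeps order and duplicates, a Set keeps neither. The allowed element types are taken from the two lists above.

```python
import re

# Element types listed above for Sequence[...] and Set[...].
ELEMENT_TYPES = {"Category", "TextCategory", "Text", "Numerical", "Binary"}

_TYPE_RE = re.compile(r"^(?:(Sequence|Set)\[(\w+)\]|(\w+))$")


def parse_feature_type(type_name):
    """Split 'Sequence[TextCategory]' into ('Sequence', 'TextCategory').

    Scalar names such as 'Numerical' come back as (None, 'Numerical').
    """
    match = _TYPE_RE.match(type_name.strip())
    if match is None:
        raise ValueError(f"unrecognized type string: {type_name!r}")
    container, element, scalar = match.groups()
    if container is None:
        return None, scalar
    if element not in ELEMENT_TYPES:
        raise ValueError(f"{element!r} is not a listed element type")
    return container, element


def normalize(values, container):
    """Apply the container semantics to a raw list of values."""
    if container == "Sequence":
        return list(values)       # order and duplicates are preserved
    if container == "Set":
        return frozenset(values)  # duplicates collapse, order is dropped
    raise ValueError(f"unknown container: {container!r}")


genres = ["action", "drama", "action"]
print(parse_feature_type("Sequence[TextCategory]"))
print(normalize(genres, "Sequence"))
print(normalize(genres, "Set") == frozenset({"action", "drama"}))
```

If the position or repetition of a value carries meaning (for example, genres ranked by relevance), a Sequence type fits better; if only membership matters, a Set type does.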

Media types

Media types are processed into embeddings using pre-trained models:

Image: Image data (e.g., product images, content thumbnails). Can be specified as URLs or base64-encoded strings. Processed using vision models like CLIP.

Audio: Audio data. Processed using audio understanding models.

Video: Video data. Processed using video understanding models.

Other types

Vector: Pre-computed vector embeddings. Use when you have existing embeddings you want to use directly.

Id: Entity identifier columns (e.g., item_id, user_id). Specified separately in the id field of schema_override.

Feature type inference

If you don't specify feature types in schema_override, Shaped automatically infers types based on:

  • Column data types from your source tables
  • Data patterns and cardinality
  • Column names (e.g., columns named created_at are inferred as Timestamp)
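To make the three signals above concrete, here is a toy heuristic in Python. It is not Shaped's inference logic: the function guess_feature_type, the name hints, the rule order, and the 0.5 distinct-value ratio are all assumptions chosen for illustration.

```python
from datetime import datetime

# Hypothetical name hints; both names appear as Timestamp examples in this page.
TIMESTAMP_NAME_HINTS = {"created_at", "last_signin"}


def guess_feature_type(column_name, values, max_distinct_ratio=0.5):
    """Guess a feature type from the column name, value types, and cardinality."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # nothing to go on

    # Signal 1: column name, or native datetime values.
    if column_name in TIMESTAMP_NAME_HINTS or all(
        isinstance(v, datetime) for v in non_null
    ):
        return "Timestamp"

    # Signal 2: source data type. bool is checked first because
    # bool is a subclass of int in Python.
    if all(isinstance(v, bool) for v in non_null):
        return "Boolean"
    if all(isinstance(v, (int, float)) for v in non_null):
        return "Numerical"

    # Signal 3: cardinality. Repeated strings look like categories,
    # mostly unique strings look like free text.
    if all(isinstance(v, str) for v in non_null):
        ratio = len(set(non_null)) / len(non_null)
        return "TextCategory" if ratio <= max_distinct_ratio else "Text"

    return None


print(guess_feature_type("genre", ["action", "drama", "action", "drama"]))
print(guess_feature_type("price", [9.99, 12, None]))
```

A heuristic like this cannot tell that a string column holds image URLs or a list-valued column should be a Set rather than a Sequence, which is one reason explicit types are useful.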

Explicitly specifying types in schema_override is recommended when:

  • You want to override automatic inference
  • You need to specify sequence or set types
  • You want to ensure media types (Image, Audio, Video) are processed correctly
  • You're working with complex data structures

Using feature types in embeddings

Text and media features can be used to create embeddings in the index block:

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/modernbert
        item_fields:
          - title
          - description
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        item_fields:
          - image_url

Features typed as Text can be passed to text embedding models. Features typed as Image can be passed to vision models such as CLIP.
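A pre-flight check along these lines can catch an embedding that points at a field of the wrong type before the configuration is submitted. This is a minimal sketch: check_embedding_fields is a hypothetical helper, not a Shaped function, and the embeddings are passed as a plain mapping from embedding name to field names so that the sketch does not depend on the exact layout of the index block.

```python
# Text and media types, which this page says can be used to create embeddings.
EMBEDDABLE_TYPES = {"Text", "Image", "Audio", "Video"}


def check_embedding_fields(embedding_fields, item_features):
    """Return a list of human-readable problems; an empty list means no issues.

    embedding_fields: dict of embedding name -> list of item field names.
    item_features: list of {"name": ..., "type": ...} dicts from schema_override.
    """
    declared = {f["name"]: f["type"] for f in item_features}
    problems = []
    for emb_name, fields in embedding_fields.items():
        for field in fields:
            ftype = declared.get(field)
            if ftype is None:
                problems.append(f"{emb_name}: {field} has no explicit type")
            elif ftype not in EMBEDDABLE_TYPES:
                problems.append(
                    f"{emb_name}: {field} is {ftype}, not a text or media type"
                )
    return problems


item_features = [
    {"name": "title", "type": "Text"},
    {"name": "description", "type": "Text"},
    {"name": "image_url", "type": "Image"},
    {"name": "price", "type": "Numerical"},
]
embedding_fields = {
    "text_embedding": ["title", "description"],
    "image_embedding": ["image_url"],
}
print(check_embedding_fields(embedding_fields, item_features))  # []
```

Fields without an explicit type are reported rather than rejected, since their type would be left to automatic inference.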

Example: Complete schema override

data:
  item_table:
    name: products
    type: table
  schema_override:
    item:
      id: item_id
      features:
        - name: title
          type: Text
        - name: description
          type: Text
        - name: category
          type: TextCategory
        - name: tags
          type: Set[TextCategory]
        - name: price
          type: Numerical
        - name: image_url
          type: Image
        - name: in_stock
          type: Boolean
        - name: rating
          type: Numerical
      created_at: created_at
    user:
      id: user_id
      features:
        - name: age
          type: Numerical
        - name: location
          type: TextCategory
        - name: preferences
          type: Set[TextCategory]
    interaction:
      features:
        - name: event_type
          type: Category