Feature types

Shaped supports a wide range of feature types for training and scoring models. Feature types determine how data is processed, encoded, and used in model training. You can specify feature types explicitly using schema_override in your engine configuration, or let Shaped infer types automatically.

Specifying feature types

Use schema_override in the data block to specify how columns should be interpreted:

data:
  item_table:
    name: movies
    type: table
  schema_override:
    item:
      id: item_id
      features:
        - name: genre
          type: Sequence[TextCategory]
        - name: poster_url
          type: Image
        - name: movie_title
          type: Text
        - name: movie_age
          type: Numerical
        - name: primary_genre
          type: TextCategory
      created_at: created_at

If you don't specify a column in schema_override, Shaped will infer the type automatically based on the column data.

Supported feature types

Basic types

Category: Discrete categorical values (e.g., product categories, event types). Use for low-cardinality categorical data.

TextCategory: Text-based categorical values (e.g., product names, user segments). Similar to Category but optimized for text strings.

HighCardinalityCategory: High-cardinality categorical values that are still treated as categories (e.g., user IDs, item IDs when used as features).

Numerical: Continuous numeric values (e.g., price, rating, count).

Boolean: Binary true/false values.

Binary: Binary numeric values (0 or 1).

Timestamp: Temporal data such as timestamps and dates (e.g., created_at, last_signin).

Text: Unstructured text data (e.g., descriptions, bios, reviews). Text features can be used to create embeddings for semantic search.

Url: URL strings. Often used for image or media URLs.
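As a rough sketch of how several basic types might be declared side by side, the fragment below reuses the schema_override layout shown earlier. The table name and all column names (listings, department, on_sale, and so on) are hypothetical and chosen only to show one column per type.

```yaml
data:
  item_table:
    name: listings              # hypothetical table name
    type: table
  schema_override:
    item:
      id: item_id
      features:
        - name: department      # a handful of distinct values
          type: Category
        - name: price           # continuous number
          type: Numerical
        - name: on_sale         # true/false flag
          type: Boolean
        - name: last_restocked  # date or timestamp
          type: Timestamp
        - name: long_description
          type: Text
        - name: product_page
          type: Url
```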

Sequence types

Sequence types represent ordered lists of values:

Sequence[Category]: Ordered list of categorical values (e.g., ['action', 'drama', 'thriller']).

Sequence[TextCategory]: Ordered list of text-based categories.

Sequence[Text]: Ordered list of text strings (e.g., tags, keywords).

Sequence[Numerical]: Ordered list of numeric values.

Sequence[Binary]: Ordered list of binary values.
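For instance, columns whose values are ordered lists could be declared as in the sketch below. Only the item block of schema_override is shown, and the column names are hypothetical.

```yaml
schema_override:
  item:
    id: item_id
    features:
      - name: chapter_titles       # order matters: first chapter, second chapter, ...
        type: Sequence[Text]
      - name: weekly_play_counts   # one number per week, oldest first
        type: Sequence[Numerical]
```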

Set types

Set types represent unordered collections of unique values:

Set[Category]: Unordered set of categorical values.

Set[TextCategory]: Unordered set of text-based categories.

Set[Text]: Unordered set of text strings.

Set[Numerical]: Unordered set of numeric values.

Set[Binary]: Unordered set of binary values.
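By contrast, a column where order carries no meaning and each value appears at most once would be a candidate for a set type. A minimal sketch, again with hypothetical column names:

```yaml
schema_override:
  user:
    id: user_id
    features:
      - name: followed_topics      # e.g. a user's topics, in no particular order
        type: Set[TextCategory]
      - name: saved_search_terms
        type: Set[Text]
```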

Media types

Media types are processed into embeddings using pre-trained models:

Image: Image data (e.g., product images, content thumbnails). Can be specified as URLs or base64-encoded strings. Processed using vision models like CLIP.

Audio: Audio data. Processed using audio understanding models.

Video: Video data. Processed using video understanding models.
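A media column is declared like any other feature. The sketch below uses hypothetical column names. This section describes URLs and base64-encoded strings only for Image, so how the Audio and Video columns are populated is left open here.

```yaml
schema_override:
  item:
    id: item_id
    features:
      - name: thumbnail_url   # URL or base64-encoded image string
        type: Image
      - name: preview_clip    # hypothetical video column
        type: Video
      - name: theme_song      # hypothetical audio column
        type: Audio
```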

Other types

Vector: Pre-computed vector embeddings. Use when you have existing embeddings you want to use directly.

Id: Entity identifier columns (e.g., item_id, user_id). Not listed under features; specified separately in the id field of each entity (such as item or user) in schema_override.
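To illustrate where these two types appear, the sketch below declares the identifier in the id field and a pre-computed embedding column as a Vector feature. The column name title_embedding is hypothetical, and this section does not say whether a vector dimension must also be declared.

```yaml
schema_override:
  item:
    id: item_id                 # Id column, declared outside the features list
    features:
      - name: title_embedding   # hypothetical column of pre-computed vectors
        type: Vector
```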

Feature type inference

If you don't specify feature types in schema_override, Shaped automatically infers types based on:

Column data types from your source tables
Data patterns and cardinality
Column names (e.g., columns named created_at are inferred as Timestamp)

Explicitly specifying types in schema_override is recommended when:

You want to override automatic inference
You need to specify sequence or set types
You want to ensure media types (Image, Audio, Video) are processed correctly
You're working with complex data structures
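Because unlisted columns fall back to inference, an override can stay small. The sketch below reuses the movies example from earlier and lists only the two columns that need an explicit sequence or media type; it assumes the remaining columns are acceptable as inferred.

```yaml
data:
  item_table:
    name: movies
    type: table
  schema_override:
    item:
      id: item_id
      features:
        # Only columns that need an explicit type are listed here;
        # every other column is left to automatic inference.
        - name: genre
          type: Sequence[TextCategory]
        - name: poster_url
          type: Image
```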

Using feature types in embeddings

Text and media features can be used to create embeddings in the index block:

index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/modernbert
        batch_size: 256
        item_fields:
          - title
          - description
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
        item_fields:
          - image_url

Features typed as Text can be used with text embedding models. Features typed as Image can be used with vision models like CLIP.

The batch_size parameter controls how many items are processed in each batch during embedding generation. Larger batch sizes (e.g., 256) are recommended for text embeddings and smaller batch sizes (e.g., 32) for image embeddings.
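To make the link between declared types and embeddings concrete, the sketch below puts a data block and an index block in one configuration, so that each entry in item_fields names a feature typed in schema_override. It assumes data and index sit side by side at the top level of the engine configuration, and it reuses the column and model names from the examples on this page.

```yaml
data:
  item_table:
    name: products
    type: table
  schema_override:
    item:
      id: item_id
      features:
        - name: description
          type: Text          # referenced by text_embedding below
        - name: image_url
          type: Image         # referenced by image_embedding below
index:
  embeddings:
    - name: text_embedding
      encoder:
        type: hugging_face
        model_name: sentence-transformers/modernbert
        batch_size: 256
        item_fields:
          - description
    - name: image_embedding
      encoder:
        type: hugging_face
        model_name: openai/clip-vit-base-patch32
        batch_size: 32
        item_fields:
          - image_url
```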

Example: Complete schema override

data:
  item_table:
    name: products
    type: table
  schema_override:
    item:
      id: item_id
      features:
        - name: title
          type: Text
        - name: description
          type: Text
        - name: category
          type: TextCategory
        - name: tags
          type: Set[TextCategory]
        - name: price
          type: Numerical
        - name: image_url
          type: Image
        - name: in_stock
          type: Boolean
        - name: rating
          type: Numerical
      created_at: created_at
    user:
      id: user_id
      features:
        - name: age
          type: Numerical
        - name: location
          type: TextCategory
        - name: preferences
          type: Set[TextCategory]
    interaction:
      features:
        - name: event_type
          type: Category
      created_at: created_at
