Defining signals

Signal Engine is a declarative feature engineering system for GBDT (Gradient Boosted Decision Tree) models. Instead of writing custom feature transforms, you define signals — typed, composable feature definitions — that the engine resolves automatically at both training and scoring time.

Signal Engine is configured through the feature_definitions field on the gbdt model policy. For when to use GBDT versus other models, see Choose a model.

Quick start

Add a feature_definitions list to a gbdt model in your engine config. Each entry is a signal with a type, a name, and type-specific parameters:

training:
  models:
    - name: click_score
      policy_type: gbdt
      feature_definitions:
        - type: lookup
          name: item_price
          input: item.price

        - type: lookup
          name: user_age
          input: user.age

        - type: aggregation
          name: clicks_7d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 7d

When feature_definitions is omitted, Signal Engine generates a default feature set: a lookup signal for every numeric user and item column, plus interaction-count aggregations over 7-day and 30-day windows.

Column references in signals use a fixed convention: user columns as user.<column_name>, item columns as item.<column_name>, and spine (interaction) columns by their bare column name (e.g. user_id, item_id, created_at, label).

How it works

At training time, Signal Engine joins your interaction spine with the user and item metadata tables, then resolves every signal in order. Aggregation signals only see data from before the current row, which prevents time leakage.
At scoring time, Signal Engine resolves the same signals using precomputed state stored in RocksDB, so features are available in real time without recomputation.

tip

Aggregations only use data from before the current row, so time-windowed features cannot see future events. This prevents time leakage in both training and scoring.
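
The leakage rule can be illustrated outside the engine. The following is a minimal Python sketch, not Signal Engine's implementation: the function name count_before is hypothetical, and excluding events that share the current row's timestamp is an assumption made for illustration.

```python
from datetime import datetime, timedelta

def count_before(events, window):
    """For each (user_id, timestamp) row, count the same user's earlier
    events inside the look-back window, excluding the row itself."""
    counts = []
    for user, ts in events:
        start = ts - window
        # Strict "< ts" keeps the current row and anything later out.
        n = sum(1 for u, t in events if u == user and start <= t < ts)
        counts.append(n)
    return counts

def day(d):
    return datetime(2024, 1, d)

events = [("u1", day(1)), ("u1", day(3)), ("u2", day(4)),
          ("u1", day(9)), ("u1", day(20))]
counts = count_before(events, timedelta(days=7))
print(counts)  # [0, 1, 0, 1, 0]
```

The first row for each user always gets 0, because nothing precedes it. The row on day 9 counts only the day 3 event, since day 1 falls outside the 7-day window.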

Signals can reference the output of earlier signals by name, so you can chain transforms:

feature_definitions:
  - type: lookup
    name: raw_price
    input: item.price

  - type: transform
    name: log_price
    input: raw_price
    method: log1p

Signal types

lookup

Direct column access from user, item, or interaction tables.

- type: lookup
  name: item_price
  input: item.price

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Column reference (e.g. `user.age`, `item.price`)
`precomputed`	no	Store in RocksDB for online scoring (default `false`)

expression

Arbitrary arithmetic expression over columns or other signals.

- type: expression
  name: price_per_rating
  expr: item.price / item.avg_rating

Field	Required	Description
`name`	yes	Output feature name
`expr`	yes	Expression string (e.g. `item.price * 2`)
`precomputed`	no	Default `false`

ratio

Safe division with optional additive smoothing to avoid division by zero.

- type: ratio
  name: click_rate
  numerator: clicks_7d
  denominator: impressions_7d
  smooth: 1.0

Field	Required	Description
`name`	yes	Output feature name
`numerator`	yes	Numerator column or signal
`denominator`	yes	Denominator column or signal
`smooth`	no	Additive smoothing constant (default `0`)
`precomputed`	no	Default `false`
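
The exact smoothing formula is not spelled out here, so the Python sketch below is only one plausible reading. It assumes the smooth constant is added to the denominator only, and that a hypothetical default value is returned when the adjusted denominator is still zero. Neither assumption is confirmed by Signal Engine.

```python
def smoothed_ratio(numerator, denominator, smooth=0.0, default=0.0):
    # Assumption: smoothing is added to the denominator only.
    adjusted = denominator + smooth
    if adjusted == 0:
        # Assumed fallback when the division is still undefined.
        return default
    return numerator / adjusted

print(smoothed_ratio(3, 9, smooth=1.0))  # 0.3
print(smoothed_ratio(0, 0, smooth=1.0))  # 0.0
```

With smooth: 1.0, a user with no impressions gets a click rate of 0 instead of an undefined value.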

transform

Unary mathematical transform.

- type: transform
  name: log_price
  input: item.price
  method: log1p

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Input column or signal
`method`	yes	One of `log1p`, `log`, `sqrt`, `abs`
`precomputed`	no	Default `false`

cast

Explicit type casting.

- type: cast
  name: year_int
  input: item.year
  target_type: int

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Input column or signal
`target_type`	yes	One of `int`, `int32`, `int64`, `float`, `float32`, `float64`, `str`, `bool`
`precomputed`	no	Default `false`

clip

Clip values to a [min, max] range.

- type: clip
  name: clipped_price
  input: item.price
  min: 0
  max: 1000

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Input column or signal
`min`	yes	Lower bound
`max`	yes	Upper bound
`precomputed`	no	Default `false`

normalization

Normalize values using standard scaling or min-max scaling. The scaler parameters (mean/std or min/max) must be provided explicitly.

- type: normalization
  name: price_scaled
  input: item.price
  method: standard_scaler
  mean: 49.99
  std: 25.0

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Input column or signal
`method`	yes	`standard_scaler` or `min_max_scaler`
`mean`	conditional	Required for `standard_scaler`
`std`	conditional	Required for `standard_scaler`
`min`	conditional	Required for `min_max_scaler`
`max`	conditional	Required for `min_max_scaler`
`precomputed`	no	Default `false`
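
For reference, the two methods correspond to the standard textbook formulas. The Python sketch below uses hypothetical function names and does not show how Signal Engine handles edge cases such as a zero std or equal min and max.

```python
def standard_scale(x, mean, std):
    # standard_scaler: center on the mean, divide by the standard deviation.
    return (x - mean) / std

def min_max_scale(x, lo, hi):
    # min_max_scaler: map [lo, hi] onto [0, 1].
    return (x - lo) / (hi - lo)

z = standard_scale(74.99, mean=49.99, std=25.0)  # about one std above the mean
m = min_max_scale(25.0, lo=0.0, hi=100.0)        # a quarter of the way up the range
```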

bucket

Discretize continuous values into buckets. When boundaries is empty, quartile boundaries are computed at training time.

- type: bucket
  name: price_bucket
  input: item.price
  boundaries: [10, 50, 100, 500]
  output: ordinal_index

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Input column or signal
`boundaries`	no	Sorted list of bucket boundaries (default: auto quartiles)
`output`	no	`ordinal_index` (default) or `one_hot`
`precomputed`	no	Default `false`
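
The two output modes can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and treating a value equal to a boundary as belonging to the upper bucket is an assumption, since the boundary convention is not stated here.

```python
from bisect import bisect_right

def bucket_index(value, boundaries):
    # N sorted boundaries define N + 1 buckets, indexed 0..N.
    return bisect_right(boundaries, value)

def bucket_one_hot(value, boundaries):
    index = bucket_index(value, boundaries)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]

boundaries = [10, 50, 100, 500]
print(bucket_index(75, boundaries))    # 2
print(bucket_one_hot(75, boundaries))  # [0, 0, 1, 0, 0]
```

For tree models the ordinal index is usually enough, since a GBDT can split on it directly; one_hot widens the feature set by one column per bucket.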

multi_hot

Multi-hot encoding for categorical or list columns.

- type: multi_hot
  name: genres_encoded
  input: item.genres
  vocab: [action, comedy, drama, horror, sci-fi]

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Categorical or list column
`vocab`	no	Fixed vocabulary list (inferred if omitted)
`precomputed`	no	Default `false`
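
As an illustration of the encoding, the Python sketch below maps a list of values onto a fixed vocabulary. The function name is hypothetical, and silently ignoring values outside the vocabulary is an assumption rather than documented behavior.

```python
def multi_hot(values, vocab):
    # One slot per vocabulary entry, set to 1 when that entry is present.
    present = set(values)
    return [1 if token in present else 0 for token in vocab]

vocab = ["action", "comedy", "drama", "horror", "sci-fi"]
print(multi_hot(["comedy", "sci-fi"], vocab))  # [0, 1, 0, 0, 1]
```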

aggregation

Time-windowed aggregation over interaction history. Aggregations are precomputed by default and stored in RocksDB for real-time scoring.

- type: aggregation
  name: purchases_30d
  input: label
  aggregation_fn: count
  group_by: [user_id]
  window: 30d
  filter: "event_type = purchase"

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Column to aggregate
`aggregation_fn`	yes	`count`, `sum`, `avg`, `min`, `max`, or `count_distinct`
`group_by`	yes	List of columns to group by (e.g. `[user_id]`)
`window`	no	Time window: `<number><unit>` where unit is `d`, `h`, `m`, or `s`
`filter`	no	Filter expression (e.g. `event_type = purchase`)
`explode`	no	Column to explode before aggregation
`precomputed`	no	Default `true`

tip

For a simple row count per group (e.g. "number of interactions per user"), use the group key as input with aggregation_fn: count — e.g. input: user_id, group_by: [user_id].

time_since_last

Time elapsed since the last interaction matching optional filter criteria.

- type: time_since_last
  name: days_since_last_click
  group_by: user_id
  unit: days
  filter: "event_type = click"

Field	Required	Description
`name`	yes	Output feature name
`group_by`	yes	Group-by column
`unit`	yes	`days`, `hours`, `minutes`, or `seconds`
`input`	no	Timestamp column (defaults to spine timestamp)
`filter`	no	Filter expression
`precomputed`	no	Default `false`

cyclic_time

Cyclic sin/cos encoding for timestamp components. Produces two output columns: {name}_sin and {name}_cos.

- type: cyclic_time
  name: hour_of_day
  input: created_at
  component: hour

Field	Required	Description
`name`	yes	Output column name prefix
`input`	yes	Timestamp column
`component`	yes	`hour`, `dayofweek`, `day`, `month`, or `minute`
`precomputed`	no	Default `false`
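
The point of the sin/cos pair is that values at the ends of a cycle (for example 23:00 and 00:00) end up close together. The Python sketch below shows the usual formulation for three of the components. The function name, the period table, and the Monday-as-zero weekday convention are assumptions, not Signal Engine internals.

```python
import math
from datetime import datetime

# Assumed cycle lengths for the components covered by this sketch.
PERIODS = {"hour": 24, "minute": 60, "dayofweek": 7}

def cyclic_encode(ts, component):
    value = {
        "hour": ts.hour,
        "minute": ts.minute,
        "dayofweek": ts.weekday(),  # Monday = 0 (assumed convention)
    }[component]
    angle = 2 * math.pi * value / PERIODS[component]
    return math.sin(angle), math.cos(angle)  # ({name}_sin, {name}_cos)

midnight = cyclic_encode(datetime(2024, 1, 1, 0, 0), "hour")
late = cyclic_encode(datetime(2024, 1, 1, 23, 0), "hour")
noon = cyclic_encode(datetime(2024, 1, 1, 12, 0), "hour")
```

In the encoded space, late sits much closer to midnight than noon does, even though the raw hour values 23 and 0 are far apart.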

time_component

Scalar extraction of a timestamp component.

- type: time_component
  name: day_of_week
  input: created_at
  component: dayofweek

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Timestamp column
`component`	yes	`dayofweek`, `day`, `month`, `year`, `hour`, or `minute`
`precomputed`	no	Default `false`

cross

Hashed interaction or dot product of multiple columns.

- type: cross
  name: user_category
  inputs: [user_id, item.category]
  method: hash
  buckets: 10000

Field	Required	Description
`name`	yes	Output feature name
`inputs`	yes	List of columns to cross (minimum 2)
`method`	no	`hash` (default) or `dot_product`
`buckets`	conditional	Required when `method` is `hash`
`precomputed`	no	Default `false`
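
A hashed cross can be pictured as hashing the combined input values into a fixed number of buckets. The Python sketch below is illustrative only: the function name, the separator character, and the use of MD5 are assumptions, and Signal Engine's actual hash function is not specified here.

```python
import hashlib

def hash_cross(values, buckets):
    # Join with a separator that is unlikely to appear in real values.
    key = "\x1f".join(str(v) for v in values)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # A stable hash keeps the mapping identical across processes and runs.
    return int.from_bytes(digest[:8], "big") % buckets

bucket = hash_cross(["u42", "shoes"], buckets=10000)
```

A larger buckets value reduces collisions between distinct (user, category) pairs at the cost of a sparser feature.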

factorize

Map categorical values to integer indices.

- type: factorize
  name: category_id
  input: item.category
  hash_bucket_size: 1000

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Categorical column
`vocab`	conditional	Fixed vocabulary list (provide `vocab` or `hash_bucket_size`)
`hash_bucket_size`	conditional	Hash bucket size for unknown values
`precomputed`	no	Default `true`

lag

Value of a column N interactions ago.

- type: lag
  name: prev_item_price
  input: item.price
  group_by: user_id
  amount: 1

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Input column
`group_by`	yes	Group-by column
`amount`	yes	Number of steps to lag
`precomputed`	no	Default `false`

diff

Difference between the current value and the value N steps ago.

- type: diff
  name: price_change
  input: item.price
  group_by: user_id
  amount: 1

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Input column
`group_by`	yes	Group-by column
`amount`	yes	Number of steps to diff
`precomputed`	no	Default `false`
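
The lag and diff signals can be pictured together. The Python sketch below assumes rows arrive in time order and that a missing history yields None; both are assumptions for illustration, and the function name is hypothetical.

```python
from collections import defaultdict

def lag_and_diff(rows, amount=1):
    """rows: time-ordered (group_key, value) pairs; amount must be >= 1."""
    history = defaultdict(list)
    lags, diffs = [], []
    for key, value in rows:
        past = history[key]
        # Value from `amount` interactions ago in the same group, if any.
        lagged = past[-amount] if len(past) >= amount else None
        lags.append(lagged)
        diffs.append(None if lagged is None else value - lagged)
        past.append(value)
    return lags, diffs

rows = [("u1", 10.0), ("u2", 5.0), ("u1", 12.5), ("u1", 7.5)]
lags, diffs = lag_and_diff(rows, amount=1)
print(lags)   # [None, None, 10.0, 12.5]
print(diffs)  # [None, None, 2.5, -5.0]
```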

vector_similarity

Cosine or dot-product similarity between two embedding vectors. Requires vector store tables to be configured on the engine.

- type: vector_similarity
  name: user_item_sim
  query: user.embedding
  candidate: item.embedding
  method: cosine

Field	Required	Description
`name`	yes	Output feature name
`query`	yes	Query vector column
`candidate`	yes	Candidate vector column
`method`	no	`cosine` (default) or `dot_product`
`precomputed`	no	Default `true`

warning

Vector signals (vector_similarity, vector_aggregation) require vector store tables to be configured on your engine (e.g. via your embedding or index configuration). Without them, resolution will fail at train or score time.
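
For reference, the two similarity methods follow their standard definitions. The Python sketch below uses hypothetical function names, and returning 0.0 for a zero-length vector is an assumption.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    # Assumed convention: similarity with a zero vector is 0.0.
    return dot_product(a, b) / norm if norm else 0.0

print(dot_product([1, 2], [3, 4]))  # 11
print(cosine([1, 0], [0, 1]))       # 0.0
```

Cosine ignores vector magnitude, while dot_product rewards longer vectors, so the better choice depends on whether the embeddings are normalized.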

sequence

Extract a fixed-length sequence of IDs from interaction history.

- type: sequence
  name: recent_items
  input: item_id
  group_by: user_id
  max_len: 50
  window: 30d

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	ID column to collect
`group_by`	yes	Group-by column
`max_len`	yes	Maximum sequence length
`window`	no	Time window
`padding`	no	Padding value for shorter sequences
`filter`	no	Filter expression
`precomputed`	no	Default `true`

vector_aggregation

Aggregate embedding vectors over time windows using mean, sum, or max pooling.

- type: vector_aggregation
  name: user_embedding_avg
  input: item.embedding
  group_by: user_id
  op: mean
  window: 30d

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	Vector column
`group_by`	yes	Group-by column
`op`	yes	`mean`, `sum`, or `max`
`window`	no	Time window
`precomputed`	no	Default `true`

list_op

Operations on list-valued columns.

- type: list_op
  name: num_tags
  input: item.tags
  op: len

Field	Required	Description
`name`	yes	Output feature name
`input`	yes	List column
`op`	yes	`len`, `jaccard_index`, or `{"contains": "value"}`
`precomputed`	no	Default `false`

vector_flatten

Flatten a vector column into individual scalar columns ({name}_0, {name}_1, ...).

- type: vector_flatten
  name: embedding
  input: item.embedding
  dim: 64

Field	Required	Description
`name`	yes	Output column name prefix
`input`	yes	Vector column
`dim`	no	Fixed dimension (inferred from data if omitted)
`precomputed`	no	Default `false`

geo_distance

Haversine distance between two geographic points.

- type: geo_distance
  name: distance_km
  lat1: user.latitude
  lon1: user.longitude
  lat2: item.latitude
  lon2: item.longitude
  unit: km

Field	Required	Description
`name`	yes	Output feature name
`lat1`, `lon1`	yes	First point (latitude, longitude columns)
`lat2`, `lon2`	yes	Second point (latitude, longitude columns)
`unit`	no	`km` (default) or `miles`
`precomputed`	no	Default `false`
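
The haversine formula itself is standard. The Python sketch below is a reference implementation under stated assumptions: the mean Earth radius constant and the function name are choices made here, and Signal Engine may use slightly different constants.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (assumed constant)
KM_PER_MILE = 1.609344

def haversine(lat1, lon1, lat2, lon2, unit="km"):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    km = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))
    if unit == "km":
        return km
    if unit == "miles":
        return km / KM_PER_MILE
    raise ValueError(f"unsupported unit: {unit}")

# One degree of longitude along the equator is roughly 111.2 km.
d_km = haversine(0.0, 0.0, 0.0, 1.0)
```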

geo_hash

Encode geographic coordinates as a geohash string.

- type: geo_hash
  name: user_geohash
  lat: user.latitude
  lon: user.longitude
  precision: 6

Field	Required	Description
`name`	yes	Output feature name
`lat`	yes	Latitude column
`lon`	yes	Longitude column
`precision`	yes	Geohash precision (1–12)
`precomputed`	no	Default `false`
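
To show what the precision parameter controls, here is a compact Python sketch of the standard geohash algorithm: each character encodes five bits, alternating between longitude and latitude. It illustrates the encoding and is not taken from Signal Engine; the function name is hypothetical.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash(lat, lon, precision):
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half
        else:
            bits <<= 1
            rng[1] = mid  # keep the lower half
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

cell = geohash(57.64911, 10.40744, precision=6)
```

Each extra character narrows the cell, so nearby points share a longer common prefix. A lower precision therefore groups users into coarser regions.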

Hidden signals

Any signal can set hide: true to compute the signal without including it in the final feature set. Hidden signals are still available as inputs to other signals:

feature_definitions:
  - type: lookup
    name: raw_price
    input: item.price
    hide: true

  - type: bucket
    name: price_tier
    input: raw_price
    boundaries: [10, 50, 100]

Here raw_price is computed and fed into price_tier, but only price_tier appears as a model feature.
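
The interaction between chaining and hide can be sketched as a tiny resolver. The Python below is a toy model covering only lookup and bucket, with a hypothetical resolve function. It assumes signals are resolved in list order and that a boundary value falls into the upper bucket.

```python
from bisect import bisect_right

def resolve(row, definitions):
    values = dict(row)  # raw columns plus every resolved signal
    features = {}       # only the signals exposed to the model
    for d in definitions:
        source = values[d["input"]]
        if d["type"] == "lookup":
            out = source
        elif d["type"] == "bucket":
            out = bisect_right(d["boundaries"], source)
        else:
            raise ValueError(f"unsupported type in this sketch: {d['type']}")
        values[d["name"]] = out  # later signals can reference it by name
        if not d.get("hide", False):
            features[d["name"]] = out
    return features

definitions = [
    {"type": "lookup", "name": "raw_price", "input": "item.price", "hide": True},
    {"type": "bucket", "name": "price_tier", "input": "raw_price",
     "boundaries": [10, 50, 100]},
]
features = resolve({"item.price": 75.0}, definitions)
print(features)  # {'price_tier': 2}
```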

Precomputed signals

Signals marked with precomputed: true have their values stored in RocksDB during training. At scoring time, these precomputed values are read directly instead of being recomputed, enabling real-time feature resolution.

The aggregation, factorize, sequence, vector_similarity, and vector_aggregation signal types default to precomputed: true because they depend on historical state that isn't available at scoring time.

note

If the GBDT config sets use_session_interactions: false, session interactions are not passed to the engine at scoring time. Time-windowed aggregations then rely only on precomputed state; very recent behavior won't be reflected until the next training run updates state.

End-to-end example

This example builds a GBDT scoring model for an e-commerce recommendation engine that combines user attributes, item attributes, behavioral aggregations, time features, and embedding similarity.

Engine configuration

data:
  item_table:
    name: products
    type: table
  user_table:
    name: users
    type: table
  interaction_table:
    name: interactions
    type: table

training:
  models:
    - name: purchase_score
      policy_type: gbdt
      objective: binary
      feature_definitions:
        # User features
        - type: lookup
          name: user_age
          input: user.age
        - type: lookup
          name: user_account_days
          input: user.account_age_days

        # Item features
        - type: lookup
          name: item_price
          input: item.price
        - type: transform
          name: log_price
          input: item_price
          method: log1p
        - type: bucket
          name: price_tier
          input: item_price
          boundaries: [10, 25, 50, 100, 250]

        # Behavioral aggregations
        - type: aggregation
          name: views_30d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 30d
        - type: aggregation
          name: purchases_30d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 30d
          filter: "event_type = purchase"
        # Numerator and denominator share the same 30-day window
        - type: ratio
          name: purchase_rate
          numerator: purchases_30d
          denominator: views_30d
          smooth: 1.0

        # Time features
        - type: cyclic_time
          name: hour
          input: created_at
          component: hour
        - type: time_since_last
          name: days_since_purchase
          group_by: user_id
          unit: days
          filter: "event_type = purchase"

        # Cross features
        - type: cross
          name: user_category
          inputs: [user_id, item.category]
          method: hash
          buckets: 50000

        # Geo features
        - type: geo_distance
          name: delivery_distance
          lat1: user.latitude
          lon1: user.longitude
          lat2: item.warehouse_lat
          lon2: item.warehouse_lon
          unit: km

queries:
  product_ranking:
    query:
      type: rank
      from: item
      retrieve:
        - type: column_order
          columns:
            - name: _derived_popular_rank
              ascending: true
          limit: 1000
      score:
        type: score_ensemble
        value_model: purchase_score
        input_user_id: $parameters.user_id
        input_interactions_item_ids: $parameters.interaction_item_ids
      limit: 50
    parameters:
      user_id:
        default: null
      interaction_item_ids:
        default: null
