
Defining signals

Signal Engine is a declarative feature engineering system for GBDT (Gradient Boosted Decision Tree) models. Instead of writing custom feature transforms, you define signals — typed, composable feature definitions — that the engine resolves automatically at both training and scoring time.

Signal Engine is configured through the feature_definitions field on the gbdt model policy. For when to use GBDT versus other models, see Choose a model.

Quick start

Add a feature_definitions list to a gbdt model in your engine config. Each entry is a signal with a type, a name, and type-specific parameters:

```yaml
training:
  models:
    - name: click_score
      policy_type: gbdt
      feature_definitions:
        - type: lookup
          name: item_price
          input: item.price

        - type: lookup
          name: user_age
          input: user.age

        - type: aggregation
          name: clicks_7d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 7d
```

When feature_definitions is omitted, Signal Engine generates a reasonable default: lookup signals for every numeric user and item column plus interaction count aggregations over 7-day and 30-day windows.

Column references in signals use a fixed convention: user columns as user.<column_name>, item columns as item.<column_name>, and spine (interaction) columns by their bare column name (e.g. user_id, item_id, created_at, label).

How it works

  1. At training time, Signal Engine joins your interaction spine with user and item metadata tables, then resolves every signal in order. Aggregation signals only see data before the current row to prevent time leakage.
  2. At scoring time, Signal Engine resolves the same signals using precomputed state stored in RocksDB, so features are available in real-time with no recomputation.
tip

Aggregations only use data from before the current row, so time-windowed features cannot see future events. This prevents time leakage in both training and scoring.
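The point-in-time rule can be sketched in a few lines of plain Python (an illustration of the constraint, not the engine's implementation): a windowed count for a given row only sees that group's events strictly before the row's timestamp.

```python
from datetime import datetime, timedelta

def windowed_count(events, user_id, as_of, window=timedelta(days=7)):
    """events: list of (user_id, timestamp) tuples.
    Counts the user's events strictly before `as_of` within the window."""
    return sum(
        1 for uid, ts in events
        if uid == user_id and as_of - window <= ts < as_of
    )

events = [
    ("u1", datetime(2024, 1, 1)),
    ("u1", datetime(2024, 1, 5)),
    ("u1", datetime(2024, 1, 9)),  # future relative to a Jan 6 row
]
# Resolving a row for u1 on Jan 6: only the two earlier events count;
# the Jan 9 event is invisible even though it exists in the data.
print(windowed_count(events, "u1", datetime(2024, 1, 6)))  # 2
```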

Signals can reference the output of earlier signals by name, so you can chain transforms:

```yaml
feature_definitions:
  - type: lookup
    name: raw_price
    input: item.price

  - type: transform
    name: log_price
    input: raw_price
    method: log1p
```

Signal types

lookup

Direct column access from user, item, or interaction tables.

```yaml
- type: lookup
  name: item_price
  input: item.price
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Column reference (e.g. user.age, item.price) |
| precomputed | no | Store in RocksDB for online scoring (default false) |

expression

Arbitrary arithmetic expression over columns or other signals.

```yaml
- type: expression
  name: price_per_rating
  expr: item.price / item.avg_rating
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| expr | yes | Expression string (e.g. item.price * 2) |
| precomputed | no | Default false |

ratio

Safe division with optional additive smoothing to avoid division by zero.

```yaml
- type: ratio
  name: click_rate
  numerator: clicks_7d
  denominator: impressions_7d
  smooth: 1.0
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| numerator | yes | Numerator column or signal |
| denominator | yes | Denominator column or signal |
| smooth | no | Additive smoothing constant (default 0) |
| precomputed | no | Default false |
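The exact smoothing formula isn't spelled out above; one common additive-smoothing reading (an assumption here, not the engine's documented behavior) adds the constant to both numerator and denominator, so empty groups yield a neutral value instead of dividing by zero:

```python
# Sketch of additive smoothing for a ratio signal. The precise formula
# the engine uses is an assumption; this is one conventional form.
def smoothed_ratio(numerator, denominator, smooth=0.0):
    return (numerator + smooth) / (denominator + smooth)

# A user with 3 clicks over 40 impressions, smooth=1.0:
print(smoothed_ratio(3, 40, smooth=1.0))  # 4/41 ≈ 0.0976
# A brand-new group with no history never divides by zero:
print(smoothed_ratio(0, 0, smooth=1.0))   # 1.0
```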

transform

Unary mathematical transform.

```yaml
- type: transform
  name: log_price
  input: item.price
  method: log1p
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Input column or signal |
| method | yes | One of log1p, log, sqrt, abs |
| precomputed | no | Default false |

cast

Explicit type casting.

```yaml
- type: cast
  name: year_int
  input: item.year
  target_type: int
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Input column or signal |
| target_type | yes | One of int, int32, int64, float, float32, float64, str, bool |
| precomputed | no | Default false |

clip

Clip values to a [min, max] range.

```yaml
- type: clip
  name: clipped_price
  input: item.price
  min: 0
  max: 1000
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Input column or signal |
| min | yes | Lower bound |
| max | yes | Upper bound |
| precomputed | no | Default false |

normalization

Normalize values using standard scaling or min-max scaling. The scaler parameters (mean/std or min/max) must be provided explicitly.

```yaml
- type: normalization
  name: price_scaled
  input: item.price
  method: standard_scaler
  mean: 49.99
  std: 25.0
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Input column or signal |
| method | yes | standard_scaler or min_max_scaler |
| mean | conditional | Required for standard_scaler |
| std | conditional | Required for standard_scaler |
| min | conditional | Required for min_max_scaler |
| max | conditional | Required for min_max_scaler |
| precomputed | no | Default false |
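The two methods correspond to the standard formulas, with the parameters supplied explicitly in the config rather than fitted from data. A plain-Python sketch (not the engine's code):

```python
# standard_scaler: (x - mean) / std
def standard_scale(x, mean, std):
    return (x - mean) / std

# min_max_scaler: (x - min) / (max - min)
def min_max_scale(x, lo, hi):
    return (x - lo) / (hi - lo)

print(standard_scale(75.0, mean=50.0, std=25.0))  # 1.0
print(min_max_scale(50.0, lo=0.0, hi=100.0))      # 0.5
```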

bucket

Discretize continuous values into buckets. When boundaries is empty, quartile boundaries are computed at training time.

```yaml
- type: bucket
  name: price_bucket
  input: item.price
  boundaries: [10, 50, 100, 500]
  output: ordinal_index
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Input column or signal |
| boundaries | no | Sorted list of bucket boundaries (default: auto quartiles) |
| output | no | ordinal_index (default) or one_hot |
| precomputed | no | Default false |
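With ordinal_index output, N boundaries yield N+1 buckets. A minimal sketch of the mapping (boundary inclusivity is an assumption; the engine may treat edge values differently):

```python
import bisect

# Ordinal bucketing: with boundaries [10, 50, 100, 500] a value falls
# into one of 5 buckets, from "below the first boundary" (index 0)
# through "above the last" (index 4).
def bucket_index(value, boundaries):
    return bisect.bisect_right(boundaries, value)

boundaries = [10, 50, 100, 500]
print(bucket_index(5, boundaries))     # 0
print(bucket_index(75, boundaries))    # 2
print(bucket_index(9999, boundaries))  # 4
```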

multi_hot

Multi-hot encoding for categorical or list columns.

```yaml
- type: multi_hot
  name: genres_encoded
  input: item.genres
  vocab: [action, comedy, drama, horror, sci-fi]
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Categorical or list column |
| vocab | no | Fixed vocabulary list (inferred if omitted) |
| precomputed | no | Default false |
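Multi-hot encoding turns each vocabulary entry into a 0/1 indicator, so a list-valued column becomes a fixed-width vector. A plain-Python sketch:

```python
# One indicator per vocab entry; order follows the vocab list.
def multi_hot(values, vocab):
    present = set(values)
    return [1 if v in present else 0 for v in vocab]

vocab = ["action", "comedy", "drama", "horror", "sci-fi"]
print(multi_hot(["drama", "sci-fi"], vocab))  # [0, 0, 1, 0, 1]
```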

aggregation

Time-windowed aggregation over interaction history. Aggregations are precomputed by default and stored in RocksDB for real-time scoring.

```yaml
- type: aggregation
  name: purchases_30d
  input: label
  aggregation_fn: count
  group_by: [user_id]
  window: 30d
  filter: "event_type = purchase"
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Column to aggregate |
| aggregation_fn | yes | count, sum, avg, min, max, or count_distinct |
| group_by | yes | List of columns to group by (e.g. [user_id]) |
| window | no | Time window: <number><unit> where unit is d, h, m, or s |
| filter | no | Filter expression (e.g. event_type = purchase) |
| explode | no | Column to explode before aggregation |
| precomputed | no | Default true |
tip

For a simple row count per group (e.g. "number of interactions per user"), use the group key as input with aggregation_fn: count — e.g. input: user_id, group_by: [user_id].


time_since_last

Time elapsed since the last interaction matching optional filter criteria.

```yaml
- type: time_since_last
  name: days_since_last_click
  group_by: user_id
  unit: days
  filter: "event_type = click"
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| group_by | yes | Group-by column |
| unit | yes | days, hours, minutes, or seconds |
| input | no | Timestamp column (defaults to spine timestamp) |
| filter | no | Filter expression |
| precomputed | no | Default false |
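Conceptually, the signal finds the most recent earlier event matching the filter and reports the elapsed time in the configured unit. A plain-Python sketch (the missing-history behavior here — returning None — is an assumption):

```python
from datetime import datetime

# events: (timestamp, event_type) pairs for a single group key.
def time_since_last(events, as_of, event_type=None, unit_seconds=86400):
    matching = [ts for ts, et in events
                if ts < as_of and (event_type is None or et == event_type)]
    if not matching:
        return None  # no prior matching event (assumed behavior)
    return (as_of - max(matching)).total_seconds() / unit_seconds

events = [(datetime(2024, 1, 1), "view"), (datetime(2024, 1, 3), "click")]
print(time_since_last(events, datetime(2024, 1, 8), event_type="click"))  # 5.0
```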

cyclic_time

Cyclic sin/cos encoding for timestamp components. Produces two output columns: {name}_sin and {name}_cos.

```yaml
- type: cyclic_time
  name: hour_of_day
  input: created_at
  component: hour
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output column name prefix |
| input | yes | Timestamp column |
| component | yes | hour, dayofweek, day, month, or minute |
| precomputed | no | Default false |
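Sin/cos encoding maps a periodic value onto the unit circle, so hour 23 and hour 0 end up adjacent rather than 23 apart. A plain-Python sketch of the standard formula:

```python
import math

# Map a value with period p to (sin, cos) of its angle on the unit circle.
def cyclic_encode(value, period):
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

print(cyclic_encode(0, 24))   # (0.0, 1.0) — midnight
print(cyclic_encode(6, 24))   # (1.0, ~0.0) — 6 AM, a quarter turn
```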

time_component

Scalar extraction of a timestamp component.

```yaml
- type: time_component
  name: day_of_week
  input: created_at
  component: dayofweek
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Timestamp column |
| component | yes | dayofweek, day, month, year, hour, or minute |
| precomputed | no | Default false |

cross

Hashed interaction or dot product of multiple columns.

```yaml
- type: cross
  name: user_category
  inputs: [user_id, item.category]
  method: hash
  buckets: 10000
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| inputs | yes | List of columns to cross (minimum 2) |
| method | no | hash (default) or dot_product |
| buckets | conditional | Required when method is hash |
| precomputed | no | Default false |
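The idea behind a hashed cross: join the input values, hash the result deterministically, and reduce modulo the bucket count. A sketch under assumptions — the engine's actual hash function and separator are not documented here, so MD5 and a unit separator stand in:

```python
import hashlib

# Hashed feature cross. Python's built-in hash() is salted per process,
# so a stable digest (MD5 here, as an illustrative choice) is used.
def hashed_cross(values, buckets):
    key = "\x1f".join(str(v) for v in values)
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % buckets

b = hashed_cross(["user_42", "electronics"], 10000)
print(0 <= b < 10000)                                        # True
print(b == hashed_cross(["user_42", "electronics"], 10000))  # True: deterministic
```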

factorize

Map categorical values to integer indices.

```yaml
- type: factorize
  name: category_id
  input: item.category
  hash_bucket_size: 1000
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Categorical column |
| vocab | conditional | Fixed vocabulary list (provide vocab or hash_bucket_size) |
| hash_bucket_size | conditional | Hash bucket size for unknown values |
| precomputed | no | Default true |
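One plausible reading of the vocab/hash_bucket_size interplay (an assumption — the actual index layout is not documented here): known values map to stable vocabulary indices, and unseen values fall into hash buckets offset past the vocabulary.

```python
import hashlib

# Hypothetical factorize: vocab indices first, hash buckets for the rest.
def factorize(value, vocab, hash_bucket_size):
    if value in vocab:
        return vocab.index(value)
    digest = hashlib.md5(str(value).encode()).hexdigest()
    return len(vocab) + int(digest, 16) % hash_bucket_size

vocab = ["books", "electronics", "toys"]
print(factorize("toys", vocab, 1000))   # 2
idx = factorize("garden", vocab, 1000)  # unseen value -> some hash bucket
print(3 <= idx < 3 + 1000)              # True
```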

lag

Value of a column N interactions ago.

```yaml
- type: lag
  name: prev_item_price
  input: item.price
  group_by: user_id
  amount: 1
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Input column |
| group_by | yes | Group-by column |
| amount | yes | Number of steps to lag |
| precomputed | no | Default false |

diff

Difference between the current value and the value N steps ago.

```yaml
- type: diff
  name: price_change
  input: item.price
  group_by: user_id
  amount: 1
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Input column |
| group_by | yes | Group-by column |
| amount | yes | Number of steps to diff |
| precomputed | no | Default false |
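Lag and diff over a time-ordered interaction stream can be sketched together (plain Python, assuming rows arrive sorted by timestamp; missing-history handling as None is an assumption):

```python
from collections import defaultdict

# For each group key, remember past values and emit the value N steps
# ago (lag) and current minus that value (diff).
def lag_and_diff(rows, amount=1):
    history = defaultdict(list)
    out = []
    for key, value in rows:
        past = history[key]
        lagged = past[-amount] if len(past) >= amount else None
        out.append((lagged, None if lagged is None else value - lagged))
        past.append(value)
    return out

rows = [("u1", 10.0), ("u1", 12.5), ("u1", 11.0)]
print(lag_and_diff(rows))  # [(None, None), (10.0, 2.5), (12.5, -1.5)]
```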

vector_similarity

Cosine or dot-product similarity between two embedding vectors. Requires vector store tables to be configured on the engine.

```yaml
- type: vector_similarity
  name: user_item_sim
  query: user.embedding
  candidate: item.embedding
  method: cosine
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| query | yes | Query vector column |
| candidate | yes | Candidate vector column |
| method | no | cosine (default) or dot_product |
| precomputed | no | Default true |
warning

Vector signals (vector_similarity, vector_aggregation) require vector store tables to be configured on your engine (e.g. via your embedding or index configuration). Without them, resolution will fail at train or score time.
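The two methods are the standard similarity formulas; a plain-Python sketch for clarity (the engine resolves these against the vector store, not raw lists):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Cosine similarity: dot product of the vectors divided by the
# product of their Euclidean norms.
def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

u = [3.0, 4.0]
v = [4.0, 3.0]
print(cosine(u, v))  # 0.96 (dot = 24, norms are 5 each)
print(dot(u, v))     # 24.0
```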


sequence

Extract a fixed-length sequence of IDs from interaction history.

```yaml
- type: sequence
  name: recent_items
  input: item_id
  group_by: user_id
  max_len: 50
  window: 30d
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | ID column to collect |
| group_by | yes | Group-by column |
| max_len | yes | Maximum sequence length |
| window | no | Time window |
| padding | no | Padding value for shorter sequences |
| filter | no | Filter expression |
| precomputed | no | Default true |
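A sketch of the collection step for one group (plain Python; newest-first ordering and a padding value of 0 are assumptions, since the doc doesn't fix either):

```python
# Collect a user's most recent item IDs, truncate to max_len,
# and pad shorter histories to a fixed width.
def recent_sequence(item_ids, max_len, padding=0):
    seq = list(reversed(item_ids))[:max_len]  # newest first (assumed)
    return seq + [padding] * (max_len - len(seq))

print(recent_sequence([101, 102, 103], max_len=5))  # [103, 102, 101, 0, 0]
```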

vector_aggregation

Aggregate embedding vectors over time windows using mean, sum, or max pooling.

```yaml
- type: vector_aggregation
  name: user_embedding_avg
  input: item.embedding
  group_by: user_id
  op: mean
  window: 30d
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | Vector column |
| group_by | yes | Group-by column |
| op | yes | mean, sum, or max |
| window | no | Time window |
| precomputed | no | Default true |

list_op

Operations on list-valued columns.

```yaml
- type: list_op
  name: num_tags
  input: item.tags
  op: len
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| input | yes | List column |
| op | yes | len, jaccard_index, or {"contains": "value"} |
| precomputed | no | Default false |
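The Jaccard index between two list columns is |A ∩ B| / |A ∪ B|; the other ops are element count and membership. A plain-Python sketch (the empty-lists result of 0.0 is an assumption):

```python
# Jaccard index over the sets of list elements.
def jaccard_index(a, b):
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0  # assumed convention for two empty lists
    return len(sa & sb) / len(sa | sb)

tags = ["sale", "new", "summer"]
print(len(tags))                                # len op: 3
print("sale" in tags)                           # contains op: True
print(jaccard_index(tags, ["sale", "winter"]))  # 1 shared / 4 total = 0.25
```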

vector_flatten

Flatten a vector column into individual scalar columns ({name}_0, {name}_1, ...).

```yaml
- type: vector_flatten
  name: embedding
  input: item.embedding
  dim: 64
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output column name prefix |
| input | yes | Vector column |
| dim | no | Fixed dimension (inferred from data if omitted) |
| precomputed | no | Default false |

geo_distance

Haversine distance between two geographic points.

```yaml
- type: geo_distance
  name: distance_km
  lat1: user.latitude
  lon1: user.longitude
  lat2: item.latitude
  lon2: item.longitude
  unit: km
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| lat1, lon1 | yes | First point (latitude, longitude columns) |
| lat2, lon2 | yes | Second point (latitude, longitude columns) |
| unit | no | km (default) or miles |
| precomputed | no | Default false |
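The haversine formula computes the great-circle distance between two latitude/longitude points. A plain-Python sketch using a mean Earth radius of 6371 km (the radius constant the engine uses is an assumption):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# One degree of longitude along the equator is roughly 111.2 km.
print(haversine_km(0.0, 0.0, 0.0, 1.0))
```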

geo_hash

Encode geographic coordinates as a geohash string.

```yaml
- type: geo_hash
  name: user_geohash
  lat: user.latitude
  lon: user.longitude
  precision: 6
```

| Field | Required | Description |
| --- | --- | --- |
| name | yes | Output feature name |
| lat | yes | Latitude column |
| lon | yes | Longitude column |
| precision | yes | Geohash precision (1–12) |
| precomputed | no | Default false |
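Geohashing interleaves longitude and latitude bits and encodes them in a base-32 alphabet; precision is the number of output characters, so nearby points share a prefix. A sketch of the standard algorithm (assuming the engine follows the conventional encoding):

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits, even = [], True  # even bits encode longitude
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            bits.append(1 if lon >= mid else 0)
            lon_lo, lon_hi = (mid, lon_hi) if lon >= mid else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bits.append(1 if lat >= mid else 0)
            lat_lo, lat_hi = (mid, lat_hi) if lat >= mid else (lat_lo, mid)
        even = not even
    # Pack each group of 5 bits into one base-32 character.
    chars = []
    for i in range(0, len(bits), 5):
        idx = 0
        for b in bits[i:i + 5]:
            idx = idx * 2 + b
        chars.append(BASE32[idx])
    return "".join(chars)

print(geohash(57.64911, 10.40744, 6))  # u4pruy
```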

Hidden signals

Any signal can set hide: true to compute the signal without including it in the final feature set. Hidden signals are still available as inputs to other signals:

```yaml
feature_definitions:
  - type: lookup
    name: raw_price
    input: item.price
    hide: true

  - type: bucket
    name: price_tier
    input: raw_price
    boundaries: [10, 50, 100]
```

Here raw_price is computed and fed into price_tier, but only price_tier appears as a model feature.

Precomputed signals

Signals marked with precomputed: true have their values stored in RocksDB during training. At scoring time, these precomputed values are read directly instead of being recomputed, enabling real-time feature resolution.

Aggregation, factorize, sequence, vector_similarity, and vector_aggregation signals default to precomputed: true because they require historical state that isn't available at scoring time.

note

If the GBDT config sets use_session_interactions: false, session interactions are not passed to the engine at scoring time. Time-windowed aggregations then rely only on precomputed state; very recent behavior won't be reflected until the next training run updates state.

End-to-end example

This example builds a GBDT scoring model for an e-commerce recommendation engine that combines user attributes, item attributes, behavioral aggregations, time features, and embedding similarity.

Engine configuration

```yaml
data:
  item_table:
    name: products
    type: table
  user_table:
    name: users
    type: table
  interaction_table:
    name: interactions
    type: table

training:
  models:
    - name: purchase_score
      policy_type: gbdt
      objective: binary
      feature_definitions:
        # User features
        - type: lookup
          name: user_age
          input: user.age
        - type: lookup
          name: user_account_days
          input: user.account_age_days

        # Item features
        - type: lookup
          name: item_price
          input: item.price
        - type: transform
          name: log_price
          input: item_price
          method: log1p
        - type: bucket
          name: price_tier
          input: item_price
          boundaries: [10, 25, 50, 100, 250]

        # Behavioral aggregations
        - type: aggregation
          name: views_7d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 7d
        - type: aggregation
          name: purchases_30d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 30d
          filter: "event_type = purchase"
        - type: ratio
          name: purchase_rate
          numerator: purchases_30d
          denominator: views_7d
          smooth: 1.0

        # Time features
        - type: cyclic_time
          name: hour
          input: created_at
          component: hour
        - type: time_since_last
          name: days_since_purchase
          group_by: user_id
          unit: days
          filter: "event_type = purchase"

        # Cross features
        - type: cross
          name: user_category
          inputs: [user_id, item.category]
          method: hash
          buckets: 50000

        # Geo features
        - type: geo_distance
          name: delivery_distance
          lat1: user.latitude
          lon1: user.longitude
          lat2: item.warehouse_lat
          lon2: item.warehouse_lon
          unit: km

queries:
  product_ranking:
    query:
      type: rank
      from: item
      retrieve:
        - type: column_order
          columns:
            - name: _derived_popular_rank
              ascending: true
          limit: 1000
      score:
        type: score_ensemble
        value_model: purchase_score
        input_user_id: $parameters.user_id
        input_interactions_item_ids: $parameters.interaction_item_ids
        limit: 50
    parameters:
      user_id:
        default: null
```