Defining signals
Signal Engine is a declarative feature engineering system for GBDT (Gradient Boosted Decision Tree) models. Instead of writing custom feature transforms, you define signals — typed, composable feature definitions — that the engine resolves automatically at both training and scoring time.
Signal Engine is configured through the feature_definitions field on
the gbdt model policy. For when to use GBDT versus other models, see
Choose a model.
Quick start
Add a feature_definitions list to a gbdt model in your engine
config. Each entry is a signal with a type, a name, and
type-specific parameters:
training:
  models:
    - name: click_score
      policy_type: gbdt
      feature_definitions:
        - type: lookup
          name: item_price
          input: item.price
        - type: lookup
          name: user_age
          input: user.age
        - type: aggregation
          name: clicks_7d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 7d
When feature_definitions is omitted, Signal Engine generates a
reasonable default: lookup signals for every numeric user and item
column plus interaction count aggregations over 7-day and 30-day
windows.
Column references in signals use a fixed convention: user columns
as user.<column_name>, item columns as item.<column_name>,
and spine (interaction) columns by their bare column name (e.g.
user_id, item_id, created_at, label).
How it works
- At training time, Signal Engine joins your interaction spine with the user and item metadata tables, then resolves every signal in order.
- At scoring time, Signal Engine resolves the same signals using precomputed state stored in RocksDB, so features are available in real time without recomputation.
Aggregations only use data from before the current row, so time-windowed features cannot see future events. This prevents time leakage in both training and scoring.
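The leakage rule can be illustrated with a small Python sketch. This is not Signal Engine's implementation: the helper name `count_prior_events`, the tuple layout, and the choice of a half-open window `[ts - window, ts)` are assumptions made for illustration. The quadratic loop is kept for clarity rather than speed.

```python
from datetime import datetime, timedelta

def count_prior_events(events, window):
    """Hypothetical helper: for each (user_id, timestamp) event, count the
    same user's events whose timestamp lies in [ts - window, ts)."""
    ordered = sorted(events, key=lambda e: e[1])  # chronological order
    counts = []
    for user_id, ts in ordered:
        n = sum(
            1
            for other_user, other_ts in ordered
            # Strictly earlier than the current row, so the row never sees
            # itself or any future event.
            if other_user == user_id and ts - window <= other_ts < ts
        )
        counts.append((user_id, ts, n))
    return counts

events = [
    ("u1", datetime(2024, 1, 1)),
    ("u1", datetime(2024, 1, 3)),
    ("u1", datetime(2024, 1, 12)),
    ("u2", datetime(2024, 1, 3)),
]
result = count_prior_events(events, timedelta(days=7))
```

In this sketch the third `u1` event gets a count of 0: the two earlier events are older than seven days, and the event itself is excluded.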
Signals can reference the output of earlier signals by name, so you can chain transforms:
feature_definitions:
  - type: lookup
    name: raw_price
    input: item.price
  - type: transform
    name: log_price
    input: raw_price
    method: log1p
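Chaining can be pictured as a single ordered pass in which each signal reads either a raw column or the output of an earlier signal. The Python sketch below is only an illustration under that assumption; `resolve_signals` and the dictionary-based row are hypothetical and do not reflect the engine's internals, and only the two signal types from the example are handled.

```python
import math

def resolve_signals(row, definitions):
    """Hypothetical resolver: walk the definitions in order so that later
    signals can read earlier outputs by name."""
    resolved = {}
    for signal in definitions:
        source = signal["input"]
        # An input is either an earlier signal's name or a raw column reference.
        value = resolved[source] if source in resolved else row[source]
        if signal["type"] == "lookup":
            resolved[signal["name"]] = value
        elif signal["type"] == "transform" and signal["method"] == "log1p":
            resolved[signal["name"]] = math.log1p(value)
        else:
            raise ValueError(f"unsupported signal in this sketch: {signal}")
    return resolved

definitions = [
    {"type": "lookup", "name": "raw_price", "input": "item.price"},
    {"type": "transform", "name": "log_price", "input": "raw_price", "method": "log1p"},
]
features = resolve_signals({"item.price": 99.0}, definitions)
```

Because resolution is ordered, swapping the two definitions would fail in this sketch: `log_price` would look for `raw_price` before it exists.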
Signal types
lookup
Direct column access from user, item, or interaction tables.
- type: lookup
  name: item_price
  input: item.price
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Column reference (e.g. user.age, item.price) |
| precomputed | no | Store in RocksDB for online scoring (default false) |
expression
Arbitrary arithmetic expression over columns or other signals.
- type: expression
  name: price_per_rating
  expr: item.price / item.avg_rating
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| expr | yes | Expression string (e.g. item.price * 2) |
| precomputed | no | Default false |
ratio
Safe division with optional additive smoothing to avoid division by zero.
- type: ratio
  name: click_rate
  numerator: clicks_7d
  denominator: impressions_7d
  smooth: 1.0
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| numerator | yes | Numerator column or signal |
| denominator | yes | Denominator column or signal |
| smooth | no | Additive smoothing constant (default 0) |
| precomputed | no | Default false |
transform
Unary mathematical transform.
- type: transform
  name: log_price
  input: item.price
  method: log1p
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Input column or signal |
| method | yes | One of log1p, log, sqrt, abs |
| precomputed | no | Default false |
cast
Explicit type casting.
- type: cast
  name: year_int
  input: item.year
  target_type: int
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Input column or signal |
| target_type | yes | One of int, int32, int64, float, float32, float64, str, bool |
| precomputed | no | Default false |
clip
Clip values to a [min, max] range.
- type: clip
  name: clipped_price
  input: item.price
  min: 0
  max: 1000
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Input column or signal |
| min | yes | Lower bound |
| max | yes | Upper bound |
| precomputed | no | Default false |
normalization
Normalize values using standard scaling or min-max scaling. The
scaler parameters (mean/std or min/max) must be provided
explicitly.
- type: normalization
  name: price_scaled
  input: item.price
  method: standard_scaler
  mean: 49.99
  std: 25.0
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Input column or signal |
| method | yes | standard_scaler or min_max_scaler |
| mean | conditional | Required for standard_scaler |
| std | conditional | Required for standard_scaler |
| min | conditional | Required for min_max_scaler |
| max | conditional | Required for min_max_scaler |
| precomputed | no | Default false |
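For reference, the two scaling methods correspond to the textbook formulas sketched below in Python. The function names are hypothetical, and the zero-spread checks are an assumption added so the sketch fails loudly; how Signal Engine handles a zero std or an empty range is not stated here.

```python
def standard_scale(x, mean, std):
    """Standard scaling: (x - mean) / std."""
    if std == 0:
        raise ValueError("std must be non-zero")  # assumed behaviour
    return (x - mean) / std

def min_max_scale(x, lo, hi):
    """Min-max scaling: (x - min) / (max - min)."""
    if hi == lo:
        raise ValueError("max must differ from min")  # assumed behaviour
    return (x - lo) / (hi - lo)

# With the parameters from the example above, a price one standard
# deviation above the mean scales to roughly 1.
scaled = standard_scale(74.99, mean=49.99, std=25.0)
```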
bucket
Discretize continuous values into buckets. When boundaries is
empty, quartile boundaries are computed at training time.
- type: bucket
  name: price_bucket
  input: item.price
  boundaries: [10, 50, 100, 500]
  output: ordinal_index
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Input column or signal |
| boundaries | no | Sorted list of bucket boundaries (default: auto quartiles) |
| output | no | ordinal_index (default) or one_hot |
| precomputed | no | Default false |
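One way to picture the two output modes is a binary search over the sorted boundaries. The Python sketch below makes two assumptions that the text above does not confirm: N boundaries yield N + 1 buckets, and a value equal to a boundary falls into the upper bucket. The helper names are hypothetical.

```python
from bisect import bisect_right

def bucket_index(value, boundaries):
    """Ordinal index of the bucket containing value (boundaries must be sorted)."""
    return bisect_right(boundaries, value)

def one_hot(index, num_buckets):
    """One-hot vector with a 1 at the bucket's position."""
    return [1 if i == index else 0 for i in range(num_buckets)]

boundaries = [10, 50, 100, 500]
idx = bucket_index(75, boundaries)           # 75 lies between 50 and 100
encoded = one_hot(idx, len(boundaries) + 1)  # assumed N + 1 buckets
```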
multi_hot
Multi-hot encoding for categorical or list columns.
- type: multi_hot
  name: genres_encoded
  input: item.genres
  vocab: [action, comedy, drama, horror, sci-fi]
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Categorical or list column |
| vocab | no | Fixed vocabulary list (inferred if omitted) |
| precomputed | no | Default false |
aggregation
Time-windowed aggregation over interaction history. Aggregations are precomputed by default and stored in RocksDB for real-time scoring.
- type: aggregation
  name: purchases_30d
  input: label
  aggregation_fn: count
  group_by: [user_id]
  window: 30d
  filter: "event_type = purchase"
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Column to aggregate |
| aggregation_fn | yes | count, sum, avg, min, max, or count_distinct |
| group_by | yes | List of columns to group by (e.g. [user_id]) |
| window | no | Time window: <number><unit> where unit is d, h, m, or s |
| filter | no | Filter expression (e.g. event_type = purchase) |
| explode | no | Column to explode before aggregation |
| precomputed | no | Default true |
For a simple row count per group (e.g. "number of interactions per
user"), use the group key as input with aggregation_fn: count —
e.g. input: user_id, group_by: [user_id].
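The window format from the table (a number followed by d, h, m, or s) maps naturally onto a duration. The Python sketch below shows one way to parse it; `parse_window` is a hypothetical helper, and reading the units as days, hours, minutes, and seconds is an assumption based on the unit names used by time_since_last.

```python
import re
from datetime import timedelta

# Assumed meaning of each unit suffix.
_UNITS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}

def parse_window(window):
    """Parse a '<number><unit>' string such as '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([dhms])", window)
    if match is None:
        raise ValueError(f"invalid window: {window!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})

seven_days = parse_window("7d")
```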
time_since_last
Time elapsed since the last interaction matching optional filter criteria.
- type: time_since_last
  name: days_since_last_click
  group_by: user_id
  unit: days
  filter: "event_type = click"
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| group_by | yes | Group-by column |
| unit | yes | days, hours, minutes, or seconds |
| input | no | Timestamp column (defaults to spine timestamp) |
| filter | no | Filter expression |
| precomputed | no | Default false |
cyclic_time
Cyclic sin/cos encoding for timestamp components. Produces two output
columns: {name}_sin and {name}_cos.
- type: cyclic_time
  name: hour_of_day
  input: created_at
  component: hour
| Field | Required | Description |
|---|---|---|
| name | yes | Output column name prefix |
| input | yes | Timestamp column |
| component | yes | hour, dayofweek, day, month, or minute |
| precomputed | no | Default false |
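The usual form of a cyclic encoding maps a component onto the unit circle, so that values at the two ends of the range (such as hour 23 and hour 0) end up close together. The Python sketch below shows that standard construction; `cyclic_encode` is a hypothetical name, and the period passed for each component (24 for hours here) is an assumption about how the engine parameterizes it.

```python
import math

def cyclic_encode(value, period):
    """Return (sin, cos) of the angle that value occupies on a circle of the given period."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

midnight = cyclic_encode(0, 24)
late_evening = cyclic_encode(23, 24)
noon = cyclic_encode(12, 24)
```

In the encoded space, `late_evening` sits much closer to `midnight` than `noon` does, which a raw integer hour cannot express.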
time_component
Scalar extraction of a timestamp component.
- type: time_component
  name: day_of_week
  input: created_at
  component: dayofweek
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Timestamp column |
| component | yes | dayofweek, day, month, year, hour, or minute |
| precomputed | no | Default false |
cross
Hashed interaction or dot product of multiple columns.
- type: cross
  name: user_category
  inputs: [user_id, item.category]
  method: hash
  buckets: 10000
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| inputs | yes | List of columns to cross (minimum 2) |
| method | no | hash (default) or dot_product |
| buckets | conditional | Required when method is hash |
| precomputed | no | Default false |
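A hashed cross generally joins the input values into one key, hashes it, and takes the result modulo the bucket count. The Python sketch below illustrates that idea only. The hash function (SHA-256), the separator, and the name `hash_cross` are assumptions; the hash that Signal Engine actually uses is not specified here, so bucket ids from this sketch will not match the engine's.

```python
import hashlib

def hash_cross(values, buckets):
    """Map a combination of values to a stable bucket id in [0, buckets)."""
    # Join with a separator unlikely to occur in the data, so that
    # ("ab", "c") and ("a", "bc") produce different keys.
    key = "\x1f".join(str(v) for v in values)
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

bucket_id = hash_cross(["user_42", "shoes"], buckets=10000)
```

A cryptographic hash is used here only because it is stable across processes, unlike Python's built-in `hash` for strings.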
factorize
Map categorical values to integer indices.
- type: factorize
  name: category_id
  input: item.category
  hash_bucket_size: 1000
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Categorical column |
| vocab | conditional | Fixed vocabulary list (provide vocab or hash_bucket_size) |
| hash_bucket_size | conditional | Hash bucket size for unknown values |
| precomputed | no | Default true |
lag
Value of a column N interactions ago.
- type: lag
  name: prev_item_price
  input: item.price
  group_by: user_id
  amount: 1
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Input column |
| group_by | yes | Group-by column |
| amount | yes | Number of steps to lag |
| precomputed | no | Default false |
diff
Difference between the current value and the value N steps ago.
- type: diff
  name: price_change
  input: item.price
  group_by: user_id
  amount: 1
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Input column |
| group_by | yes | Group-by column |
| amount | yes | Number of steps to diff |
| precomputed | no | Default false |
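The lag and diff signals are closely related: diff is the current value minus the lagged value within the same group. The Python sketch below illustrates both on chronologically ordered rows. The helper name `lag_and_diff` and the use of `None` when a group has too little history are assumptions; the engine's missing-value handling is not described here.

```python
from collections import defaultdict

def lag_and_diff(rows, amount=1):
    """rows: (group_key, value) pairs in chronological order.
    Returns (group_key, value, lagged_value, diff) tuples."""
    if amount < 1:
        raise ValueError("amount must be at least 1")
    history = defaultdict(list)  # past values per group
    out = []
    for key, value in rows:
        past = history[key]
        # Value `amount` steps back within the same group, if it exists.
        lagged = past[-amount] if len(past) >= amount else None
        delta = value - lagged if lagged is not None else None
        out.append((key, value, lagged, delta))
        past.append(value)
    return out

rows = [("u1", 10.0), ("u2", 5.0), ("u1", 12.5), ("u1", 9.0)]
result = lag_and_diff(rows, amount=1)
```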
vector_similarity
Cosine or dot-product similarity between two embedding vectors. Requires vector store tables to be configured on the engine.
- type: vector_similarity
  name: user_item_sim
  query: user.embedding
  candidate: item.embedding
  method: cosine
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| query | yes | Query vector column |
| candidate | yes | Candidate vector column |
| method | no | cosine (default) or dot_product |
| precomputed | no | Default true |
Vector signals (vector_similarity, vector_aggregation) require
vector store tables to be configured on your engine (e.g. via your
embedding or index configuration). Without them, resolution will
fail at train or score time.
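For reference, the two similarity methods follow their standard definitions, sketched below in plain Python. The function names are hypothetical, and returning 0.0 for a zero-length vector is an assumption rather than documented engine behaviour.

```python
import math

def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    norm = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    return dot_product(a, b) / norm if norm else 0.0  # assumed zero-vector handling

user_vec = [1.0, 2.0]
item_vec = [2.0, 4.0]
sim = cosine_similarity(user_vec, item_vec)
```

Cosine similarity ignores vector length, so `user_vec` and `item_vec` (which point in the same direction) score approximately 1, while their dot product grows with magnitude.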
sequence
Extract a fixed-length sequence of IDs from interaction history.
- type: sequence
  name: recent_items
  input: item_id
  group_by: user_id
  max_len: 50
  window: 30d
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | ID column to collect |
| group_by | yes | Group-by column |
| max_len | yes | Maximum sequence length |
| window | no | Time window |
| padding | no | Padding value for shorter sequences |
| filter | no | Filter expression |
| precomputed | no | Default true |
vector_aggregation
Aggregate embedding vectors over time windows using mean, sum, or max pooling.
- type: vector_aggregation
  name: user_embedding_avg
  input: item.embedding
  group_by: user_id
  op: mean
  window: 30d
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | Vector column |
| group_by | yes | Group-by column |
| op | yes | mean, sum, or max |
| window | no | Time window |
| precomputed | no | Default true |
list_op
Operations on list-valued columns.
- type: list_op
  name: num_tags
  input: item.tags
  op: len
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| input | yes | List column |
| op | yes | len, jaccard_index, or {"contains": "value"} |
| precomputed | no | Default false |
vector_flatten
Flatten a vector column into individual scalar columns
({name}_0, {name}_1, ...).
- type: vector_flatten
  name: embedding
  input: item.embedding
  dim: 64
| Field | Required | Description |
|---|---|---|
| name | yes | Output column name prefix |
| input | yes | Vector column |
| dim | no | Fixed dimension (inferred from data if omitted) |
| precomputed | no | Default false |
geo_distance
Haversine distance between two geographic points.
- type: geo_distance
  name: distance_km
  lat1: user.latitude
  lon1: user.longitude
  lat2: item.latitude
  lon2: item.longitude
  unit: km
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| lat1, lon1 | yes | First point (latitude, longitude columns) |
| lat2, lon2 | yes | Second point (latitude, longitude columns) |
| unit | no | km (default) or miles |
| precomputed | no | Default false |
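The haversine formula itself is standard and can be sketched in a few lines of Python. The function name, the mean Earth radius constant, and the miles conversion below are choices made for this illustration; the exact constants used by Signal Engine are not stated here, so results may differ in the last digits.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius, assumed for this sketch
KM_PER_MILE = 1.609344

def haversine(lat1, lon1, lat2, lon2, unit="km"):
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    km = 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))
    return km if unit == "km" else km / KM_PER_MILE

# One degree of longitude along the equator.
one_degree_km = haversine(0.0, 0.0, 0.0, 1.0)
```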
geo_hash
Encode geographic coordinates as a geohash string.
- type: geo_hash
  name: user_geohash
  lat: user.latitude
  lon: user.longitude
  precision: 6
| Field | Required | Description |
|---|---|---|
| name | yes | Output feature name |
| lat | yes | Latitude column |
| lon | yes | Longitude column |
| precision | yes | Geohash precision (1–12) |
| precomputed | no | Default false |
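To show what the precision parameter controls, the Python sketch below implements the standard geohash algorithm: longitude and latitude ranges are halved alternately, and every five bits become one base-32 character. `geohash_encode` is a hypothetical helper for illustration; it is assumed, not confirmed, that Signal Engine produces the same standard encoding.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = 1 if lon >= mid else 0
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = 1 if lat >= mid else 0
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        bits = (bits << 1) | bit
        bit_count += 1
        use_lon = not use_lon
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(_BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

code = geohash_encode(57.64911, 10.40744, precision=6)
```

Each extra character narrows the cell, so a shorter hash is always a prefix of a longer hash for the same point.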
Hidden signals
Any signal can set hide: true to compute the signal without
including it in the final feature set. Hidden signals are still
available as inputs to other signals:
feature_definitions:
  - type: lookup
    name: raw_price
    input: item.price
    hide: true
  - type: bucket
    name: price_tier
    input: raw_price
    boundaries: [10, 50, 100]
Here raw_price is computed and fed into price_tier, but only
price_tier appears as a model feature.
Precomputed signals
Signals marked with precomputed: true have their values stored in
RocksDB during training. At scoring time, these precomputed values
are read directly instead of being recomputed, enabling real-time
feature resolution.
Aggregation, factorize, sequence, vector_similarity, and
vector_aggregation signals default to precomputed: true because
they require historical state that isn't available at scoring time.
If the GBDT config sets use_session_interactions: false, session
interactions are not passed to the engine at scoring time. Time-windowed
aggregations then rely only on precomputed state; very recent behavior
won't be reflected until the next training run updates state.
End-to-end example
This example builds a GBDT scoring model for an e-commerce recommendation engine that combines user attributes, item attributes, behavioral aggregations, time features, and embedding similarity.
Engine configuration
data:
  item_table:
    name: products
    type: table
  user_table:
    name: users
    type: table
  interaction_table:
    name: interactions
    type: table
training:
  models:
    - name: purchase_score
      policy_type: gbdt
      objective: binary
      feature_definitions:
        # User features
        - type: lookup
          name: user_age
          input: user.age
        - type: lookup
          name: user_account_days
          input: user.account_age_days
        # Item features
        - type: lookup
          name: item_price
          input: item.price
        - type: transform
          name: log_price
          input: item_price
          method: log1p
        - type: bucket
          name: price_tier
          input: item_price
          boundaries: [10, 25, 50, 100, 250]
        # Behavioral aggregations
        - type: aggregation
          name: views_7d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 7d
        - type: aggregation
          name: purchases_30d
          input: user_id
          aggregation_fn: count
          group_by: [user_id]
          window: 30d
          filter: "event_type = purchase"
        - type: ratio
          name: purchase_rate
          numerator: purchases_30d
          denominator: views_7d
          smooth: 1.0
        # Time features
        - type: cyclic_time
          name: hour
          input: created_at
          component: hour
        - type: time_since_last
          name: days_since_purchase
          group_by: user_id
          unit: days
          filter: "event_type = purchase"
        # Cross features
        - type: cross
          name: user_category
          inputs: [user_id, item.category]
          method: hash
          buckets: 50000
        # Geo features
        - type: geo_distance
          name: delivery_distance
          lat1: user.latitude
          lon1: user.longitude
          lat2: item.warehouse_lat
          lon2: item.warehouse_lon
          unit: km
queries:
  product_ranking:
    query:
      type: rank
      from: item
      retrieve:
        - type: column_order
          columns:
            - name: _derived_popular_rank
              ascending: true
          limit: 1000
      score:
        type: score_ensemble
        value_model: purchase_score
        input_user_id: $parameters.user_id
        input_interactions_item_ids: $parameters.interaction_item_ids
      limit: 50
    parameters:
      user_id:
        default: null