Two-Tower (Neural Retrieval)

Description

The Two-Tower model architecture separates the computation for users and items into two distinct neural networks ("towers") to efficiently generate embeddings for large-scale retrieval.

  • User Tower: Processes user-related features (ID, demographics, interaction history, context) to output a user embedding u.
  • Item Tower: Processes item-related features (ID, metadata, content features) to output an item embedding v in the same vector space.

Affinity is typically calculated using a simple similarity function (dot product or cosine similarity) between u and v.
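
To make the architecture concrete, the following is a minimal PyTorch-style sketch of the two towers and the dot-product affinity. Layer sizes, feature dimensions, and names here are illustrative assumptions, not this policy's exact implementation.

# Minimal two-tower sketch (illustrative; dimensions and layer sizes are assumptions).
import torch
import torch.nn as nn

class Tower(nn.Module):
    """Small MLP mapping raw feature vectors to an embedding."""
    def __init__(self, input_dim: int, embedding_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 256),
            nn.ReLU(),
            nn.Linear(256, embedding_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class TwoTower(nn.Module):
    """User and item towers producing embeddings u and v in a shared space."""
    def __init__(self, user_dim: int, item_dim: int, embedding_dim: int = 128):
        super().__init__()
        self.user_tower = Tower(user_dim, embedding_dim)
        self.item_tower = Tower(item_dim, embedding_dim)

    def forward(self, user_feats: torch.Tensor, item_feats: torch.Tensor) -> torch.Tensor:
        u = self.user_tower(user_feats)    # user embedding u
        v = self.item_tower(item_feats)    # item embedding v
        return (u * v).sum(dim=-1)         # dot-product affinity

model = TwoTower(user_dim=64, item_dim=32)                 # feature dims are placeholders
scores = model(torch.randn(4, 64), torch.randn(4, 32))     # affinity scores for 4 (user, item) pairs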

Its primary strength lies in decoupling computation for serving: item embeddings can be pre-computed offline, and Approximate Nearest Neighbor (ANN) search can be used at inference time to efficiently retrieve the candidate items whose embeddings are most similar to a user embedding computed in real time. This makes it highly suitable for the candidate generation (retrieval) stage in multi-stage recommendation systems.
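
The serving pattern can be sketched as follows, with FAISS standing in for the vector index; the catalog size, the random arrays used in place of real tower outputs, and the choice of an exact inner-product index are assumptions for illustration.

# Illustrative retrieval flow: item embeddings indexed offline, user embedding computed online.
import faiss
import numpy as np

embedding_dim = 128

# Offline: item-tower outputs for the full catalog (random stand-ins here).
item_embeddings = np.random.rand(100_000, embedding_dim).astype("float32")
index = faiss.IndexFlatIP(embedding_dim)   # exact inner-product search; swap in an ANN index (e.g. HNSW/IVF) at scale
index.add(item_embeddings)

# Online: run only the user tower, then retrieve the nearest items.
user_embedding = np.random.rand(1, embedding_dim).astype("float32")   # stand-in for the user-tower output
scores, item_ids = index.search(user_embedding, 100)                  # top-100 candidates for the ranking stage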

While powerful for retrieval, the standard Two-Tower design defers all interaction between user and item features to the final similarity calculation, so explicit cross-feature interactions between users and items cannot be modeled inside the towers.

Policy Type: two-tower
Supports: embedding_policy

Hyperparameter tuning

  • batch_size: Number of samples processed before updating model weights.
  • n_epochs: Number of complete passes through the training dataset.
  • device: Compute device to run training on (e.g., CPU or GPU).
  • negative_samples_count: Number of negative samples per positive sample for contrastive learning (see the loss sketch after this list).
  • embedding_dims: Dimensionality of the user and item embeddings.
  • lr: Learning rate for gradient descent optimization.
  • weight_decay: L2 regularization term to prevent overfitting.
  • use_item_ids_as_features: Whether to use item IDs as features.
  • strategy
  • patience: Number of epochs to wait without improvement before early stopping.
  • num_workers: Number of worker processes used for data loading.
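
As a rough illustration of the objective that batch_size and negative_samples_count control, the sketch below scores each positive (user, item) pair against N sampled negatives and applies a softmax cross-entropy so the positive outranks the negatives. The tensor shapes and the sampling scheme are assumptions, not the policy's exact loss.

# Contrastive loss over 1 positive and N sampled negatives (illustrative).
import torch
import torch.nn.functional as F

def contrastive_loss(user_emb, pos_item_emb, neg_item_emb):
    # user_emb:     (B, D)    user-tower embeddings
    # pos_item_emb: (B, D)    item-tower embeddings of the positive items
    # neg_item_emb: (B, N, D) item-tower embeddings of N sampled negative items
    pos_scores = (user_emb * pos_item_emb).sum(-1, keepdim=True)      # (B, 1)
    neg_scores = torch.einsum("bd,bnd->bn", user_emb, neg_item_emb)   # (B, N)
    logits = torch.cat([pos_scores, neg_scores], dim=1)               # (B, 1 + N)
    labels = torch.zeros(logits.size(0), dtype=torch.long)            # the positive sits at index 0
    return F.cross_entropy(logits, labels)

# Example with batch_size = 32, embedding_dims = 128, negative_samples_count = 5:
loss = contrastive_loss(torch.randn(32, 128), torch.randn(32, 128), torch.randn(32, 5, 128))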

V1 API

policy_configs:
  embedding_policy:
    policy_type: two-tower
    # Training Hyperparameters
    batch_size: 32                # Samples per training batch
    n_epochs: 5                   # Number of training epochs
    negative_samples_count: 5     # Negative samples per positive for contrastive loss
    lr: 0.001                     # Learning rate
    weight_decay: 0.0005          # L2 regularization strength
    patience: 5                   # Epochs for early stopping patience
    # Architecture Hyperparameters
    embedding_dims: 128           # Dimensionality of the shared embedding space (u and v)
    activation_fn: "relu"         # Activation function in hidden layers (e.g., "relu", "gelu")
    dropout: 0.2                  # Dropout rate for regularization

Usage

Use this model when:

  • You have rich item metadata (text descriptions, categories, images)
  • You need efficient large-scale vector search and retrieval
  • You want to leverage both collaborative and content signals
  • You need production-ready embeddings for real-time recommendations
  • You want strong performance on general item similarity tasks

Choose a different model when:

  • You have only interaction data without item features (use ALS/ELSA)
  • You need to model strict sequential patterns (use SASRec/BERT4Rec)
  • You have very limited compute resources
  • You need a simple baseline model

Use cases

  • E-commerce with product descriptions, categories, images (e.g., clothing, electronics)
  • Content platforms with rich metadata (articles, videos with descriptions)
  • Job recommendations with job descriptions and requirements
  • Real estate with property descriptions and features
  • Any domain where items have rich textual or categorical attributes
