BERT4Rec (Sequential)

Description

The BERT4Rec policy adapts the bidirectional Transformer architecture (BERT) for sequential recommendation. Unlike unidirectional models (like SASRec), it uses bidirectional self-attention and is typically trained using a masked item prediction objective (predicting masked items based on both past and future context within the sequence). This allows it to learn rich, context-aware item representations.
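
To make the masked item prediction ("Cloze") objective concrete, here is a minimal sketch in plain Python — not the policy's actual implementation — of how a fraction of items in an interaction sequence can be replaced with a reserved mask id and kept as labels; the model then predicts each masked item from the items on both sides of it. The MASK_TOKEN value and the helper name are illustrative assumptions.

# Sketch of the masked item prediction ("Cloze") objective used to train
# BERT4Rec-style models: hide a fraction of items and recover them from
# context on BOTH sides of each mask.
import random

MASK_TOKEN = 0        # assumed reserved id for the mask token
MASK_RATE = 0.2       # corresponds to the mask_rate hyperparameter

def mask_sequence(item_ids, mask_rate=MASK_RATE, seed=None):
    """Return (masked_sequence, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for item in item_ids:
        if rng.random() < mask_rate:
            masked.append(MASK_TOKEN)   # hide the item
            labels.append(item)         # the model must predict it here
        else:
            masked.append(item)
            labels.append(None)         # no loss computed at this position
    return masked, labels

# Example: when predicting a masked position, the model sees the items
# both before and after it in the sequence.
sequence = [12, 7, 93, 41, 5, 88]
print(mask_sequence(sequence, seed=42))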

Policy Type: bert4rec
Supports: embedding_policy, scoring_policy

Hyperparameter tuning

  • batch_size: Number of samples processed before updating model weights.
  • eval_batch_size: Batch size used during model evaluation.
  • n_epochs: Number of complete passes through the training dataset.
  • negative_samples_count: Number of negative samples per positive sample for contrastive learning.
  • device: Compute device used for training (CPU or GPU).
  • hidden_size: Size of the hidden layers in the transformer (see the sketch after this list).
  • inner_size: Size of the feed-forward network inner layer.
  • learning_rate: Learning rate for gradient descent optimization.
  • attn_dropout_prob: Dropout probability for attention layers.
  • hidden_act: Activation function used in the transformer's feed-forward layers.
  • hidden_dropout_prob: Dropout probability for hidden layers.
  • n_heads: Number of attention heads in the transformer.
  • n_layers: Number of transformer layers.
  • layer_norm_eps: Epsilon added in layer normalization for numerical stability.
  • initializer_range: Standard deviation of the normal distribution used to initialize model weights.
  • mask_rate: Fraction of items in each sequence masked during training.
  • loss_type: Type of training loss to optimize.
  • max_seq_length: Maximum length of input sequences.
  • sample_strategy: Strategy used when sampling training data.
  • sample_seed: Random seed for sampling, for reproducibility.
  • sample_ratio: Fraction of data drawn when sampling.
  • eval_step: Interval (in epochs) between evaluations during training.
  • early_stopping_step: Number of evaluation steps without improvement before training stops early.
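
As a rough illustration of what the architecture hyperparameters control, the following PyTorch sketch wires hidden_size, n_heads, n_layers, inner_size, hidden_act, hidden_dropout_prob, layer_norm_eps and max_seq_length into a standard bidirectional Transformer encoder. This is not the bert4rec policy's actual code; values such as num_items and the use of nn.TransformerEncoder are assumptions for illustration (note, for example, that attn_dropout_prob is not separately exposed in this simplified stack).

# Illustrative mapping of the architecture hyperparameters onto a plain
# bidirectional Transformer encoder (PyTorch). Assumed, not the policy's code.
import torch
import torch.nn as nn

num_items      = 10_000   # assumed catalogue size (one extra slot reserved for the mask token)
hidden_size    = 64       # embedding / hidden dimensionality
n_heads        = 2        # self-attention heads per layer
n_layers       = 2        # stacked Transformer encoder layers
inner_size     = 256      # feed-forward inner dimension
hidden_act     = "gelu"   # activation inside the feed-forward block
hidden_dropout = 0.2      # hidden_dropout_prob
layer_norm_eps = 1e-12    # epsilon inside LayerNorm
max_seq_length = 50       # longest interaction sequence the model sees

item_embedding     = nn.Embedding(num_items + 1, hidden_size)   # +1 for the mask token
position_embedding = nn.Embedding(max_seq_length, hidden_size)

encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=n_heads,
    dim_feedforward=inner_size,
    dropout=hidden_dropout,
    activation=hidden_act,
    layer_norm_eps=layer_norm_eps,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# Bidirectional: no causal mask is passed, so every position attends to
# items both before and after it in the sequence.
item_ids  = torch.randint(1, num_items, (8, max_seq_length))    # batch of 8 sequences
positions = torch.arange(max_seq_length).unsqueeze(0)
hidden    = encoder(item_embedding(item_ids) + position_embedding(positions))
print(hidden.shape)  # (8, 50, 64)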

V1 API

policy_configs:
  scoring_policy:              # Can also be used under embedding_policy
    policy_type: bert4rec
    # Training Hyperparameters
    batch_size: 1000           # Samples per training batch
    n_epochs: 1                # Number of training epochs
    negative_samples_count: 2  # Negative samples (often relevant for loss calculation)
    learning_rate: 0.001       # Optimizer learning rate
    dropout_rate: 0.2          # General dropout rate for regularization
    # Architecture Hyperparameters
    hidden_size: 64            # Dimensionality of hidden layers/embeddings
    n_heads: 2                 # Number of self-attention heads
    n_layers: 2                # Number of Transformer layers
    max_seq_length: 50         # Maximum input sequence length

Usage

Use this model when:

  • You have sequential data and want bidirectional context understanding
  • You need richer item representations than unidirectional models
  • You want to leverage both past and future context in sequences
  • You're working with sequences where context matters in both directions

Choose a different model when:

  • You need real-time next-item prediction (unidirectional is more natural)
  • You want the simplest sequential model (Item2Vec or SASRec)
  • You don't have sequential data
  • You primarily need general item similarity (use Two-Tower or ALS)

Use cases

  • Context-aware sequential recommendations
  • Playlist generation with full context
  • Reading sequences where future context matters
  • Educational content sequences
  • Any sequential recommendation where bidirectional understanding helps

Reference