BERT4Rec (Sequential)

Description

The BERT4Rec policy adapts the bidirectional Transformer architecture (BERT) for sequential recommendation. Unlike unidirectional models (like SASRec), it uses bidirectional self-attention and is typically trained using a masked item prediction objective (predicting masked items based on both past and future context within the sequence). This allows it to learn rich, context-aware item representations.
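
To make the masked item prediction ("Cloze") objective concrete, here is a minimal sketch in plain Python — not the policy's actual implementation — of how a fraction of items in an interaction sequence can be replaced with a reserved mask id and kept as labels; the model then predicts each masked item from the items on both sides of it. The MASK_TOKEN value and the helper name are illustrative assumptions.

# Sketch of the masked item prediction ("Cloze") objective used to train
# BERT4Rec-style models: hide a fraction of items and recover them from
# context on BOTH sides of each mask.
import random

MASK_TOKEN = 0        # assumed reserved id for the mask token
MASK_RATE = 0.2       # corresponds to the mask_rate hyperparameter

def mask_sequence(item_ids, mask_rate=MASK_RATE, seed=None):
    """Return (masked_sequence, labels); labels are None where nothing is predicted."""
    rng = random.Random(seed)
    masked, labels = [], []
    for item in item_ids:
        if rng.random() < mask_rate:
            masked.append(MASK_TOKEN)   # hide the item
            labels.append(item)         # the model must predict it here
        else:
            masked.append(item)
            labels.append(None)         # no loss computed at this position
    return masked, labels

# Example: when predicting a masked position, the model sees the items
# both before and after it in the sequence.
sequence = [12, 7, 93, 41, 5, 88]
print(mask_sequence(sequence, seed=42))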

Policy Type: bert4rec
Supports: embedding_policy, scoring_policy

Hyperparameter tuning

  • batch_size: Number of samples processed before updating model weights.
  • eval_batch_size: Batch size used during model evaluation.
  • n_epochs: Number of complete passes through the training dataset.
  • negative_samples_count: Number of negative samples per positive sample for contrastive learning.
  • device: Compute device used for training (CPU or GPU).
  • hidden_size: Size of the hidden layers in the transformer (see the sketch after this list).
  • inner_size: Size of the feed-forward network inner layer.
  • learning_rate: Learning rate for gradient descent optimization.
  • attn_dropout_prob: Dropout probability for attention layers.
  • hidden_act: Activation function used in the transformer's feed-forward layers.
  • hidden_dropout_prob: Dropout probability for hidden layers.
  • n_heads: Number of attention heads in the transformer.
  • n_layers: Number of transformer layers.
  • layer_norm_eps: Epsilon added in layer normalization for numerical stability.
  • initializer_range: Standard deviation of the normal distribution used to initialize model weights.
  • mask_rate: Fraction of items in each sequence masked during training.
  • loss_type: Type of training loss to optimize.
  • max_seq_length: Maximum length of input sequences.
  • sample_strategy: Strategy used when sampling training data.
  • sample_seed: Random seed for sampling, for reproducibility.
  • sample_ratio: Fraction of data drawn when sampling.
  • eval_step: Interval (in epochs) between evaluations during training.
  • early_stopping_step: Number of evaluation steps without improvement before training stops early.
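
As a rough illustration of what the architecture hyperparameters control, the following PyTorch sketch wires hidden_size, n_heads, n_layers, inner_size, hidden_act, hidden_dropout_prob, layer_norm_eps and max_seq_length into a standard bidirectional Transformer encoder. This is not the bert4rec policy's actual code; values such as num_items and the use of nn.TransformerEncoder are assumptions for illustration (note, for example, that attn_dropout_prob is not separately exposed in this simplified stack).

# Illustrative mapping of the architecture hyperparameters onto a plain
# bidirectional Transformer encoder (PyTorch). Assumed, not the policy's code.
import torch
import torch.nn as nn

num_items      = 10_000   # assumed catalogue size (one extra slot reserved for the mask token)
hidden_size    = 64       # embedding / hidden dimensionality
n_heads        = 2        # self-attention heads per layer
n_layers       = 2        # stacked Transformer encoder layers
inner_size     = 256      # feed-forward inner dimension
hidden_act     = "gelu"   # activation inside the feed-forward block
hidden_dropout = 0.2      # hidden_dropout_prob
layer_norm_eps = 1e-12    # epsilon inside LayerNorm
max_seq_length = 50       # longest interaction sequence the model sees

item_embedding     = nn.Embedding(num_items + 1, hidden_size)   # +1 for the mask token
position_embedding = nn.Embedding(max_seq_length, hidden_size)

encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=n_heads,
    dim_feedforward=inner_size,
    dropout=hidden_dropout,
    activation=hidden_act,
    layer_norm_eps=layer_norm_eps,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# Bidirectional: no causal mask is passed, so every position attends to
# items both before and after it in the sequence.
item_ids  = torch.randint(1, num_items, (8, max_seq_length))    # batch of 8 sequences
positions = torch.arange(max_seq_length).unsqueeze(0)
hidden    = encoder(item_embedding(item_ids) + position_embedding(positions))
print(hidden.shape)  # (8, 50, 64)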

V1 API

policy_configs:
  scoring_policy:              # Can also be used under embedding_policy
    policy_type: bert4rec
    # Training Hyperparameters
    batch_size: 1000           # Samples per training batch
    n_epochs: 1                # Number of training epochs
    negative_samples_count: 2  # Negative samples (often relevant for loss calculation)
    learning_rate: 0.001       # Optimizer learning rate
    dropout_rate: 0.2          # General dropout rate for regularization
    # Architecture Hyperparameters
    hidden_size: 64            # Dimensionality of hidden layers/embeddings
    n_heads: 2                 # Number of self-attention heads
    n_layers: 2                # Number of Transformer layers
    max_seq_length: 50         # Maximum input sequence length

Usage

Use this model when:

  • You have sequential data and want bidirectional context understanding
  • You need richer item representations than unidirectional models
  • You want to leverage both past and future context in sequences
  • You're working with sequences where context matters in both directions

Choose a different model when:

  • You need real-time next-item prediction (unidirectional is more natural)
  • You want the simplest sequential model (Item2Vec or SASRec)
  • You don't have sequential data
  • You primarily need general item similarity (use Two-Tower or ALS)

Use cases

  • Context-aware sequential recommendations
  • Playlist generation with full context
  • Reading sequences where future context matters
  • Educational content sequences
  • Any sequential recommendation where bidirectional understanding helps

Reference