BERT4Rec (Sequential)
Description
The BERT4Rec policy adapts the bidirectional Transformer architecture (BERT) for sequential recommendation. Unlike unidirectional models (like SASRec), it uses bidirectional self-attention and is typically trained using a masked item prediction objective (predicting masked items based on both past and future context within the sequence). This allows it to learn rich, context-aware item representations.
Policy Type: bert4rec
Supports: embedding_policy, scoring_policy
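To make the masked item prediction objective concrete, the sketch below shows one way an interaction sequence could be masked before training. It is a minimal illustration, not the platform's implementation: the MASK_ID constant and the mask_sequence helper are hypothetical names, and the fraction of masked positions corresponds to the mask_rate hyperparameter described below.

```python
import random

MASK_ID = 0  # hypothetical reserved id for the [MASK] token

def mask_sequence(item_ids, mask_rate=0.2, seed=None):
    """Mask random positions in an item sequence for Cloze-style training.

    Returns (masked, labels): labels hold the original item at masked
    positions and None elsewhere, so the loss is computed only on the
    positions the model must reconstruct from both left and right context.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for item in item_ids:
        if rng.random() < mask_rate:
            masked.append(MASK_ID)  # hide the item from the model...
            labels.append(item)     # ...and ask it to predict the original
        else:
            masked.append(item)
            labels.append(None)
    return masked, labels

# A toy session of item ids; the model sees `masked` and is scored on `labels`.
session = [12, 7, 43, 5, 19, 8]
masked, labels = mask_sequence(session, mask_rate=0.3, seed=42)
print(masked)  # masked positions are replaced by MASK_ID
print(labels)  # labels keep the hidden item ids, None elsewhere
```

Because masked positions can sit anywhere in the sequence, the model is free to use items on both sides when filling them in, which is what distinguishes it from left-to-right models such as SASRec.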
Hyperparameter tuning
- batch_size: Number of samples processed before updating model weights.
- eval_batch_size: Batch size used during model evaluation.
- n_epochs: Number of complete passes through the training dataset.
- negative_samples_count: Number of negative samples per positive sample for contrastive learning.
- device: Device the model is trained on (e.g. CPU or GPU).
- hidden_size: Size of the hidden layers in the transformer.
- inner_size: Size of the feed-forward network inner layer.
- learning_rate: Learning rate for gradient descent optimization.
- attn_dropout_prob: Dropout probability for attention layers.
- hidden_act: Activation function used in the hidden layers.
- hidden_dropout_prob: Dropout probability for hidden layers.
- n_heads: Number of attention heads in the transformer.
- n_layers: Number of transformer layers.
- layer_norm_eps: Epsilon used for layer normalization.
- initializer_range: Standard deviation used when initializing weights.
- mask_rate: Fraction of tokens to mask during training.
- loss_type: Loss function used for training.
- max_seq_length: Maximum length of input sequences.
- sample_strategy: Sampling strategy used during training.
- sample_seed: Random seed for sampling.
- sample_ratio: Sampling ratio.
- eval_step: How often the model is evaluated during training.
- early_stopping_step: Number of evaluation rounds without improvement before training stops early.
V1 API
```yaml
policy_configs:
  scoring_policy: # Can also be used under embedding_policy
    policy_type: bert4rec
    # Training Hyperparameters
    batch_size: 1000             # Samples per training batch
    n_epochs: 1                  # Number of training epochs
    negative_samples_count: 2    # Negative samples drawn per positive example
    learning_rate: 0.001         # Optimizer learning rate
    dropout_rate: 0.2            # General dropout rate for regularization
    # Architecture Hyperparameters
    hidden_size: 64              # Dimensionality of hidden layers/embeddings
    n_heads: 2                   # Number of self-attention heads
    n_layers: 2                  # Number of Transformer layers
    max_seq_length: 50           # Maximum input sequence length
```
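As a rough illustration of what the architecture hyperparameters control, the snippet below assembles a comparable bidirectional Transformer encoder in plain PyTorch. It is a sketch, not the policy's implementation, and the catalogue size, inner_size, and dropout values are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

n_items = 10_000       # assumed catalogue size (illustrative)
hidden_size = 64       # embedding / hidden dimension; must be divisible by n_heads
n_heads = 2            # self-attention heads
n_layers = 2           # stacked Transformer layers
inner_size = 256       # feed-forward inner dimension (assumed value)
max_seq_length = 50    # maximum sequence length
dropout = 0.2

item_emb = nn.Embedding(n_items + 1, hidden_size)  # +1 reserves an id for the mask token
pos_emb = nn.Embedding(max_seq_length, hidden_size)

encoder_layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=n_heads,
    dim_feedforward=inner_size,
    dropout=dropout,
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

# One forward pass over a batch of (already masked) item-id sequences.
# No causal mask is applied, so every position attends to both past and
# future items, unlike a unidirectional model such as SASRec.
batch = torch.randint(1, n_items, (32, max_seq_length))
positions = torch.arange(max_seq_length).unsqueeze(0)
hidden = encoder(item_emb(batch) + pos_emb(positions))  # (32, 50, 64)
logits = hidden @ item_emb.weight.T                     # scores over all items per position
```

During training, the logits at masked positions are compared against the hidden items (as in the masking sketch above); at inference time, appending a mask token to the end of a sequence turns the same model into a next-item recommender.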
Usage
Use this model when:
- You have sequential data and want bidirectional context understanding
- You need richer item representations than unidirectional models
- You want to leverage both past and future context in sequences
- You're working with sequences where context matters in both directions
Choose a different model when:
- You need real-time next-item prediction (unidirectional is more natural)
- You want the simplest sequential model (Item2Vec or SASRec)
- You don't have sequential data
- You primarily need general item similarity (use Two-Tower or ALS)
Use cases
- Context-aware sequential recommendations
- Playlist generation with full context
- Reading sequences where future context matters
- Educational content sequences
- Any sequential recommendation where bidirectional understanding helps
Reference
- Sun, F., et al. (2019). BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. CIKM.