BERT4Rec (Sequential)
Description
The BERT4Rec policy adapts the bidirectional Transformer architecture (BERT) for sequential recommendation. Unlike unidirectional models (like SASRec), it uses bidirectional self-attention and is typically trained using a masked item prediction objective (predicting masked items based on both past and future context within the sequence). This allows it to learn rich, context-aware item representations.
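The masked item prediction (Cloze) objective can be sketched in a few lines of plain Python. The snippet below is illustrative only (the `MASK_TOKEN` id and function name are hypothetical, not part of Shaped's API): each item is replaced by a mask token with probability `mask_rate`, and the model is trained to recover the original items at the masked positions using both left and right context.

```python
import random

MASK_TOKEN = 0  # hypothetical reserved id for the [MASK] token

def mask_sequence(items, mask_rate=0.2, seed=None):
    """Cloze-style masking: replace each item with MASK_TOKEN with
    probability mask_rate. Labels keep the original item at masked
    positions and -1 (ignored by the loss) elsewhere."""
    rng = random.Random(seed)
    masked, labels = [], []
    for item in items:
        if rng.random() < mask_rate:
            masked.append(MASK_TOKEN)
            labels.append(item)   # the model must recover this item
        else:
            masked.append(item)
            labels.append(-1)     # position is ignored by the loss
    return masked, labels

seq = [12, 7, 33, 5, 18, 42]
masked, labels = mask_sequence(seq, mask_rate=0.4, seed=1)
```

At inference time, next-item prediction is obtained by appending a single mask token to the end of the user's sequence and predicting the item at that position.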
Policy Type: bert4rec
Supports: embedding_policy, scoring_policy
Hyperparameter tuning
- batch_size: Number of samples processed before updating model weights.
- eval_batch_size: Batch size used during model evaluation.
- n_epochs: Number of complete passes through the training dataset.
- negative_samples_count: Number of negative samples per positive sample for contrastive learning.
- device: Hardware device used for training (e.g. CPU or GPU).
- hidden_size: Size of the hidden layers in the transformer.
- inner_size: Size of the feed-forward network inner layer.
- learning_rate: Learning rate for gradient descent optimization.
- attn_dropout_prob: Dropout probability for attention layers.
- hidden_act: Activation function.
- hidden_dropout_prob: Dropout probability for hidden layers.
- n_heads: Number of attention heads in the transformer.
- n_layers: Number of transformer layers.
- layer_norm_eps: Epsilon added for numerical stability in layer normalization.
- initializer_range: Range used when initializing model weights.
- mask_rate: Fraction of tokens to mask during training.
- loss_type: Loss function used for training.
- max_seq_length: Maximum length of input sequences.
- sample_strategy: Strategy used for drawing negative samples.
- sample_seed: Random seed used for sampling.
- sample_ratio: Ratio used when sampling training data.
- eval_step: Number of training steps between evaluations.
- early_stopping_step: Number of evaluation steps without improvement before training stops early.
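To see how the architecture hyperparameters fit together, here is a minimal NumPy sketch of one bidirectional self-attention layer. This is not Shaped's implementation: the learned query/key/value projections are omitted for brevity, and the point is simply that, unlike a causally masked (unidirectional) model, every position attends to all other positions in the sequence.

```python
import numpy as np

def self_attention(x, n_heads=2):
    """Bidirectional scaled dot-product self-attention (no causal mask),
    sketched without learned projections. x: (seq_len, hidden_size)."""
    seq_len, hidden = x.shape
    head_dim = hidden // n_heads  # hidden_size must be divisible by n_heads
    heads = []
    for h in range(n_heads):
        # Slice the hidden dimension into per-head subspaces.
        q = k = v = x[:, h * head_dim:(h + 1) * head_dim]
        # Every position attends to every other position (no masking).
        scores = q @ k.T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        heads.append(weights @ v)
    return np.concatenate(heads, axis=-1)

# Shapes matching the example config: max_seq_length=50, hidden_size=64.
x = np.random.default_rng(0).normal(size=(50, 64))
out = self_attention(x, n_heads=2)
```

Stacking `n_layers` of these blocks (each followed by a feed-forward network of width `inner_size`) gives the full encoder.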
V1 API
```yaml
policy_configs:
  scoring_policy: # Can also be used under embedding_policy
    policy_type: bert4rec

    # Training Hyperparameters
    batch_size: 1000           # Samples per training batch
    n_epochs: 1                # Number of training epochs
    negative_samples_count: 2  # Negative samples per positive (used in loss calculation)
    learning_rate: 0.001       # Optimizer learning rate
    dropout_rate: 0.2          # General dropout rate for regularization

    # Architecture Hyperparameters
    hidden_size: 64            # Dimensionality of hidden layers/embeddings
    n_heads: 2                 # Number of self-attention heads
    n_layers: 2                # Number of Transformer layers
    max_seq_length: 50         # Maximum input sequence length
```
Usage
Use this model when:
- You have sequential data and want bidirectional context understanding
- You need richer item representations than unidirectional models
- You want to leverage both past and future context in sequences
- You're working with sequences where context matters in both directions
Choose a different model when:
- You need real-time next-item prediction (unidirectional is more natural)
- You want the simplest sequential model (Item2Vec or SASRec)
- You don't have sequential data
- You primarily need general item similarity (use Two-Tower or ALS)
Use cases
- Context-aware sequential recommendations
- Playlist generation with full context
- Reading sequences where future context matters
- Educational content sequences
- Any sequential recommendation where bidirectional understanding helps
Reference
- Sun, F., et al. (2019). BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer. In Proceedings of CIKM 2019.