
Categoricals

Categorical features are the bedrock of descriptive data in search and recommendation systems. They represent distinct labels or groups, such as product_category, brand, color, content_type, user_segment, country_code, or item_tag. Unlike numerical data, these features don't have an inherent mathematical magnitude but provide critical structure and context. Understanding and effectively leveraging categories allows systems to grasp:

  • Item & User Attributes: What kind of item is this? What group does this user belong to?
  • Group-Based Preferences: Do users from a specific country prefer certain brands? Do users interested in one category often interact with another?
  • Filtering & Faceting: Enabling users to narrow down results based on specific attributes (e.g., filter by "Electronics" category).
  • Rule-Based Logic: Implementing business rules like boosting items of a specific content_type.
  • Identity: Representing unique entities like user_id and item_id, which are fundamental for personalization.

Transforming these non-numeric labels into meaningful signals, or features, that machine learning models can utilize is a foundational, yet nuanced, aspect of feature engineering. Get it right, and you enable precise filtering, grouping, and personalization. Handle it poorly, and models may misinterpret information or fail to learn critical relationships. The standard path involves various encoding strategies tailored to the nature of the category.

The Standard Approach: Building Your Own Categorical Feature Pipeline

Leveraging categorical data requires converting labels into numerical representations suitable for ML models. The strategy depends heavily on the characteristics of the feature, especially its cardinality.

Step 1: Gathering and Initial Handling

  • Collection: Aggregate categorical data from diverse sources – databases, event streams, APIs.
  • Handling Nulls / Missing Values: A common issue. Often best handled by explicitly creating a dedicated category level (e.g., "Unknown", "Missing", or _NULL_) rather than arbitrary imputation. This allows the model to potentially learn patterns associated with missingness.
  • Type Inference: Distinguish true categorical features from numerical IDs that might look like categories but represent distinct entities (like user_id), or from short free-text fields that might be better treated as language data.

The Challenge: Ensuring consistent representation of categories across data sources. Deciding on the optimal null handling strategy. Correctly identifying the nature of categorical-like fields.
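
As a quick illustration of the dedicated-level strategy, here is a minimal pandas sketch; the column names and the _NULL_ token are just placeholders:

import pandas as pd

# Toy item metadata with missing brand values (hypothetical columns).
items = pd.DataFrame({
    "item_id": ["i1", "i2", "i3", "i4"],
    "brand": ["Acme", None, "Globex", None],
})

# Treat missingness as an explicit category level instead of imputing a value.
items["brand"] = items["brand"].fillna("_NULL_").astype("category")
# '_NULL_' is now an ordinary level the model can learn patterns from.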

Step 2: Understanding Cardinality

The number of unique levels in a categorical feature drastically impacts the choice of encoding.

  • Low Cardinality: Features with a small, fixed number of unique values (e.g., day_of_week [7 levels], is_subscribed [2 levels], device_type [~3-5 levels]). Relatively easy to handle.
  • High Cardinality: Features with many, potentially thousands or millions, of unique values (e.g., user_id, item_id, product_sku, city, artist_name). These pose significant challenges for traditional methods.

The Challenge: High cardinality features can lead to extremely high-dimensional and sparse feature spaces if not handled carefully, making models difficult to train and prone to overfitting.
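
Before committing to an encoding, it is worth auditing cardinality directly. A small pandas sketch, where the file path, column names, and the 50-level threshold are purely illustrative:

import pandas as pd

df = pd.read_parquet("interactions.parquet")  # hypothetical source table

# Count unique levels per categorical column to guide the encoding choice.
for col in ["device_type", "country_code", "item_id", "user_id"]:
    n = df[col].nunique()
    strategy = "one-hot / index" if n <= 50 else "embedding (or hashing)"
    print(f"{col}: {n} unique levels -> consider {strategy}")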

Step 3: Choosing the Right Encoding Strategy (Sparse vs. Dense)

Converting labels to numbers is the core task.

  • One-Hot Encoding (OHE): Creates a binary column for each category level.
    • Pros: Simple, interpretable, makes no assumptions about order. Standard for linear models.
    • Cons: Leads to very high dimensionality and sparsity for high-cardinality features. Doesn't capture relationships between categories.
    • Output: Sparse representation.
  • Label / Index Encoding: Assigns a unique integer to each category level (e.g., "Red": 0, "Green": 1, "Blue": 2).
    • Pros: Dimensionally efficient.
    • Cons: Can imply a false ordinal relationship (Blue > Green > Red?) that misleads models treating the integer as a magnitude, notably linear and distance-based methods. Often used as a first step before feeding into an embedding layer.
    • Output: Dense (single integer), but often input to a sparse lookup or dense embedding layer.
  • Embedding Layers (for High Cardinality): Maps each category level (especially high-cardinality IDs like user_id, item_id) to a low-dimensional dense vector (embedding). These embeddings are typically learned jointly with the main task (e.g., predicting clicks) during model training. This is the standard approach in modern deep learning-based recommendation systems.
    • Pros: Captures semantic relationships between categories (similar users/items get similar embeddings). Handles high cardinality efficiently. State-of-the-art for personalization.
    • Cons: Less interpretable than OHE. Requires sufficient data per category to learn good embeddings.
    • Output: Dense representation.
  • Categoricals that are Language: Some features are inherently text (e.g., tags, short_keywords, brand_name). While they can be treated as high-cardinality categories with learned embeddings, they can often benefit from language models (as discussed in the NLP post) to generate pre-trained text embeddings (like from Sentence Transformers or CLIP). These capture richer semantic meaning from the text itself.
    • Pros: Leverages powerful pre-trained knowledge. Can understand nuances in the text labels.
    • Cons: Computationally more expensive than simple embedding lookups.
    • Output: Dense representation.
  • Features from Other Models: Embeddings for categories (especially IDs) can sometimes be generated by separate, specialized models (e.g., graph embedding models like Node2Vec on user-item interaction graphs, or embeddings from a pre-trained product catalog model). These pre-computed embeddings are then fed as features into the final ranking model.
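
To ground the first three options above, here is a minimal sketch on toy data, using scikit-learn for the one-hot and index encodings and a PyTorch nn.Embedding for the learned-embedding case; the vocabulary size and embedding dimension are illustrative:

import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

colors = np.array([["Red"], ["Green"], ["Blue"], ["Green"]])

# One-hot: one binary column per level (fine for low cardinality).
ohe = OneHotEncoder(handle_unknown="ignore")
X_sparse = ohe.fit_transform(colors)          # shape (4, 3), sparse matrix

# Index encoding: a single integer per level (no real order is implied).
idx = OrdinalEncoder().fit_transform(colors)  # e.g. Blue=0, Green=1, Red=2

# Embedding lookup: map integer ids to dense, learnable vectors.
# Typical for high-cardinality ids such as user_id / item_id.
num_items, emb_dim = 1_000_000, 32            # illustrative sizes
item_embedding = nn.Embedding(num_items, emb_dim)
item_ids = torch.tensor([42, 1337, 42])
dense_vectors = item_embedding(item_ids)      # shape (3, 32), trained jointly with the model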

The Challenge: Selecting the optimal encoding based on cardinality, model type, and computational budget. Managing embedding layers and vocabularies, especially for new categories. Deciding when to treat a categorical feature as language.
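
For the categoricals-as-language case, a pre-trained text encoder can supply the vectors instead of a learned lookup. A minimal sketch with the sentence-transformers library (the model name is just one reasonable choice):

from sentence_transformers import SentenceTransformer

# Any compact sentence-embedding model works; this one is only an example.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

tags = ["wireless earbuds", "noise cancelling", "running shoes"]
tag_vectors = encoder.encode(tags)   # numpy array, one dense vector per tag

# These vectors can be stored as item features and fed to the ranking model,
# so semantically similar tags land close together even if never co-observed.
print(tag_vectors.shape)             # (3, 384) for this particular model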

Step 4: Handling Ordinal Features

These are categorical features with a meaningful inherent order.

  • Examples: size ('S', 'M', 'L', 'XL'), star_rating (1 to 5), building_floor (1, 2, 3...).
  • Encoding: Simple label/index encoding can work if the model can interpret the order (some tree models might). Alternatively, custom numerical mapping (e.g., 'S': 1, 'M': 2, 'L': 3) or thermometer encoding can be used.

The Challenge: Ensuring the model correctly interprets the ordered nature, not just treating it as distinct unordered categories.
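
A minimal sketch of an explicit ordinal mapping and a thermometer encoding for a hypothetical size feature:

import numpy as np
import pandas as pd

sizes = pd.Series(["S", "L", "M", "XL", "S"])
order = ["S", "M", "L", "XL"]

# Explicit ordinal mapping: preserves the known order instead of an arbitrary one.
size_rank = sizes.map({level: i + 1 for i, level in enumerate(order)})

# Thermometer encoding: rank k becomes k leading ones, e.g. 'L' -> [1, 1, 1, 0].
thermometer = np.array([[1 if i < r else 0 for i in range(len(order))] for r in size_rank])
print(size_rank.tolist())   # [1, 3, 2, 4, 1]
print(thermometer[1])       # [1 1 1 0] for 'L'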

Step 5: Binning & Dimensionality Reduction

Techniques to manage complexity, especially for high-cardinality features when not using embeddings, or to simplify low-cardinality features.

  • Combining Similar Categories: Manually group related categories based on domain knowledge (e.g., mapping various "smartphone" sub-categories to a single "Smartphone" category).
  • Handling Low Frequency / Rare Categories: Group infrequent levels into a single "Other" or "RARE" category. Reduces noise and dimensionality, especially useful before one-hot encoding.
  • Dimensionality Reduction (Less Common for Categoricals directly):
    • PCA: Can be applied to embeddings derived from categorical features, but is rarely applied directly to sparse one-hot representations.
    • Tree Models: Feature importance scores from models like Random Forest or Gradient Boosting can help select the most predictive categorical features, if dimensionality is a major concern.

The Challenge: Defining appropriate thresholds for rare categories. Ensuring meaningful groupings when combining levels.
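
A common pattern is to collapse rare levels before encoding. A pandas sketch, where the frequency threshold is an arbitrary illustration:

import pandas as pd

brands = pd.Series(["Acme", "Acme", "Globex", "Initech", "Acme", "Hooli"])

# Collapse levels seen fewer than min_count times into a single 'RARE' bucket.
min_count = 2
counts = brands.value_counts()
rare = counts[counts < min_count].index
brands_binned = brands.where(~brands.isin(rare), "RARE")
# Acme stays; Globex, Initech, and Hooli are folded into 'RARE'.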

Step 6: Integration & Usage Context

Categorical features play roles at various stages:

  • At Retrieval Time: Crucial for filtering candidates. Use exact matches on key categories (e.g., content_type = 'article', brand = 'Acme') often via inverted indexes or database queries.
  • At Scoring Time: Feed encoded features (OHE, embeddings) into the ranking ML model to influence the score based on learned category preferences or attributes.
  • At Ordering Time: Apply post-scoring rules like boosting/burying items based on category membership (e.g., boost "featured" category, filter out based on user's negative preference for a category).

The Challenge: Ensuring consistent encoding and availability across retrieval and scoring systems. Managing the complexity of combining multiple categorical filters.
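
To make these usage points concrete, here is a hedged sketch of a category filter (retrieval-style) followed by a rule-based boost (ordering-style); the candidate structure and boost value are hypothetical:

# Hypothetical post-scoring step: filter by category, then boost a featured category.
FEATURED_BOOST = 0.15

def order_candidates(candidates, allowed_categories, featured_category):
    # Retrieval-style filter: keep only candidates in the allowed categories.
    filtered = [c for c in candidates if c["category"] in allowed_categories]
    # Ordering-style rule: add a fixed boost to the model score for featured items.
    for c in filtered:
        boost = FEATURED_BOOST if c["category"] == featured_category else 0.0
        c["final_score"] = c["score"] + boost
    return sorted(filtered, key=lambda c: c["final_score"], reverse=True)

candidates = [
    {"item_id": "i1", "category": "article", "score": 0.62},
    {"item_id": "i2", "category": "video", "score": 0.71},
    {"item_id": "i3", "category": "article", "score": 0.60},
]
print(order_candidates(candidates, {"article", "video"}, featured_category="article"))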

Streamlining Categorical Feature Engineering

The DIY path for categorical features involves careful consideration of cardinality, choosing appropriate encoding methods, managing vocabularies, and ensuring consistency. Platforms and tools aim to abstract much of this complexity.

How a Streamlined Approach Can Help:

  1. Automated Type Inference & Encoding: Automatically detect categorical columns. Apply sensible default encoding strategies based on inferred cardinality (e.g., learnable embeddings for high-cardinality IDs, potentially OHE or index + embedding for low-cardinality).
  2. Native Embedding Management: Seamlessly manage the creation, training, and serving of embedding layers for high-cardinality features like user_id and item_id as an integral part of the platform's models.
  3. Integrated Language Handling: Automatically leverage built-in language models when categorical features are identified as text (e.g., tags, brand_names), generating rich semantic embeddings.
  4. Robust Null & New Category Handling: Provide default strategies for missing values (e.g., dedicated null embedding) and gracefully handle new category levels encountered during inference (out-of-vocabulary handling).
  5. Managed Infrastructure: Abstract away the complexity of building encoding pipelines, managing embedding tables, and ensuring low-latency serving.

Leveraging Categoricals with Shaped

Shaped helps streamline categorical feature engineering:

Goal: Automatically use category, brand, user ID, and item ID for recommendations.

1. Ensure Data is Available: Assume item_metadata (with item_id, category, brand) and user_events (with user_id, item_id, event_type) are accessible.

2. Define Model Configuration:

categorical_model.yaml
model:
  name: category_recs_platform
schema_override: # Optionally explicitly define data types, or let Shaped infer them
  item:
    id: item_id
    features:
      - name: title
        type: Text
      - name: category
        type: Category
      - name: brand
        type: Category
    created_at: created_at
  interaction:
    label:
      name: label
      type: BinaryLabel
    created_at: created_at
    features:
      - name: event_value
        type: Category
connectors:
  - name: items
    type: database
    id: items_source
  - name: event_stream
    type: database
    id: interactions_source
fetch:
  items: |
    SELECT
      item_id,    # <-- Platform identifies as high-cardinality ID
      title,      # <-- Likely treated as text
      category,   # <-- Platform identifies as low/medium cardinality categorical
      brand,      # <-- Platform identifies as low/medium cardinality categorical
      created_at
    FROM items_source
  events: |
    SELECT
      user_id,    # <-- Platform identifies as high-cardinality ID
      item_id,    # <-- Links to items data
      event_type, # <-- Low cardinality categorical
      timestamp AS created_at,
      1 AS label  # Example: Binary label for positive interactions
    FROM interactions_source

3. Create the Model:

shaped create-model --file categorical_model.yaml

4. Monitor Training: Wait for the model to reach the Active state. It will progress through the Fetching, Tuning, Training, and Deploying stages before finally becoming Active.

shaped view-model --model-name category_recs_platform

5. Use Shaped Recommendation and Search APIs: Call standard rank or similar_items APIs. The relevance scores now deeply incorporate user and item identities via embeddings, along with preferences learned from lower-cardinality features like category and brand.

from shaped import Shaped

# Initialize the Shaped client
client = Shaped()
response = client.rank(
    model_name='category_recs_platform',
    user_id='USER_XYZ',
    limit=10
)

for item in response.metadata:
    print(f"- {item['title']} (Category: {item['category']}, Brand: {item['brand']})")

Conclusion: Categories are Key, Handle Them Wisely

Categorical features are fundamental building blocks for context and personalization in search and recommendation. Effectively transforming them from simple labels into powerful numerical signals requires careful consideration of cardinality, choosing the right encoding strategy (often involving embeddings for high-cardinality IDs), and managing complexities like null values and new categories.

Streamlined platforms and MLOps tools can drastically simplify this process by automating type inference, encoding, embedding management, and infrastructure concerns. This allows teams to leverage the full power of their categorical data—from basic filtering to deep personalization via learned embeddings—without getting bogged down in the intricate implementation details, ultimately leading to more relevant, structured, and personalized user experiences.

Ready to streamline your feature engineering process?

Request a demo today to see Shaped in action for your feature types. Or, start exploring immediately with our free trial sandbox.