Numericals
Beyond Simple Values: The Power of Understanding Numerical Data for Relevance
In search and recommendation systems, numerical data like prices, ratings, view counts, inventory levels, dimensions, distances, and timestamps are ubiquitous. While seemingly straightforward, simply feeding raw numerical values into models often fails to capture their full potential for driving relevance. Deeper understanding and transformation of these numbers allow systems to grasp:
- Relative Importance: Is a $10 price difference significant for a $20 item but negligible for a $1000 item?
- Non-Linear Effects: Does user interest plateau after a certain number of views, or drop off sharply below a specific rating threshold?
- User Sensitivity: Are certain users highly price-sensitive, while others prioritize high ratings?
- Contextual Meaning: Does a "5" mean "5 stars" (good) or "5 items left" (urgency)?
- Temporal Patterns: Does popularity fluctuate based on the time of day or day of the week?
Transforming raw numerical data into meaningful signals, or features, that machine learning models can effectively utilize is a critical aspect of feature engineering. Get it right, and you unlock nuanced personalization and improved ranking. Neglect it, and models may misinterpret signals or fail to capture important patterns. The standard path involves careful statistical analysis and manual transformation techniques.
The Standard Approach: Building Your Own Numerical Feature Pipeline
Leveraging numerical data effectively requires cleaning, transforming, and sometimes combining numbers to create structured inputs that better reflect their underlying meaning and relationships. Doing this yourself typically involves several steps:
Step 1: Gathering and Initial Cleaning
- Collection: Aggregate numerical data from various sources – product databases, user profiles, event logs, third-party APIs.
- Validation & Error Handling: Check for inconsistencies (e.g., negative prices, ratings outside expected range). Handle outliers that might skew analysis or model training (e.g., cap extreme values).
- Missing Value Imputation: Decide how to handle missing numbers. Common strategies include filling with the mean, median, mode, a constant (like zero), or using more complex model-based imputation. The choice depends heavily on the feature's meaning and distribution.
- Type Consistency: Ensure numerical data is stored and processed using appropriate data types (integer, float).
The Challenge: Data quality issues are common. Choosing the right imputation strategy requires understanding the data and potential downstream impacts. Handling outliers requires careful judgment.
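To make this concrete, here is a minimal sketch of a cleaning and imputation pass using pandas and scikit-learn. The column names (price, average_rating), thresholds, and capping percentile are illustrative assumptions, not a prescribed pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical item table with common quality issues.
items = pd.DataFrame({
    "price": [19.99, -5.00, np.nan, 1200.00, 35.50],
    "average_rating": [4.2, np.nan, 3.8, 6.5, 4.9],
})

# Validation: treat impossible values as missing rather than trusting them.
items.loc[items["price"] < 0, "price"] = np.nan
items.loc[~items["average_rating"].between(1, 5), "average_rating"] = np.nan

# Outlier handling: cap extreme prices at the 99th percentile.
items["price"] = items["price"].clip(upper=items["price"].quantile(0.99))

# Imputation: the median is robust to the remaining skew in price.
imputer = SimpleImputer(strategy="median")
items[["price", "average_rating"]] = imputer.fit_transform(
    items[["price", "average_rating"]]
)

# Type consistency: both columns should now be clean floats.
print(items.dtypes)
print(items)
```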
Step 2: Feature Transformation & Creation
This is where raw numbers are reshaped to be more informative for ML models.
Scaling and Normalization
Bring features to a similar scale. Essential for distance-based algorithms (like k-NN) and models trained with gradient descent (like neural networks).
- Min-Max Scaling: Rescales features to a specific range (e.g., [0, 1]). Sensitive to outliers.
- Standardization (Z-score Scaling): Rescales features to have zero mean and unit variance. Less sensitive to outliers.
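As a quick illustration of the difference, here is a minimal comparison of the two scalers with scikit-learn; the view_count values, including the deliberate outlier, are made up:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical view counts, including one large outlier.
view_count = np.array([[10.0], [50.0], [200.0], [10000.0]])

# Min-Max: squeezes everything into [0, 1]; the outlier dominates the range.
minmax = MinMaxScaler().fit_transform(view_count)

# Standardization: zero mean, unit variance; less distorted by the outlier.
zscore = StandardScaler().fit_transform(view_count)

print(minmax.ravel())  # e.g. [0.    0.004 0.019 1.   ]
print(zscore.ravel())
```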
Advanced Transformations
- Discretization (Binning): Convert continuous numerical features into discrete categorical bins (e.g., price ranges: $0-$10, $10-$50, $50+). Helps models capture non-linear relationships. Requires choosing appropriate bin boundaries (equal width, equal frequency, domain knowledge).
- Polynomial Features & Interactions: Create new features by combining existing ones (e.g., price × rating, width × height) or raising them to a power (views^2). Helps models capture interaction effects and more complex patterns.
- Log, Root, or Power Transforms: Apply mathematical functions (like log(x+1)) to handle highly skewed distributions (common with counts like views or sales). Makes the distribution more symmetrical.
- Ratio Features: Create features representing relative values (e.g., discount_percentage = (original_price - sale_price) / original_price, click_through_rate = clicks / views). Often carry more business meaning than raw numbers.
- Time-Based Features: Extract components from timestamps (hour of day, day of week, month) or calculate durations (time since last purchase, age of account).
The Challenge: Requires statistical knowledge and domain expertise to choose the right transformations. Feature explosion can occur if creating many interaction/polynomial terms. Determining optimal bin boundaries requires experimentation.
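The sketch below shows several of these transformations side by side with pandas and NumPy; the column names, bin boundaries, and example values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

items = pd.DataFrame({
    "original_price": [20.0, 45.0, 80.0, 300.0],
    "sale_price": [18.0, 45.0, 60.0, 210.0],
    "view_count": [12, 340, 8800, 150],
    "last_purchase": pd.to_datetime(
        ["2024-05-01 09:30", "2024-05-03 18:10", "2024-05-04 23:55", "2024-05-02 12:00"]
    ),
})

# Discretization: bucket price into ranges chosen from domain knowledge.
items["price_bin"] = pd.cut(
    items["original_price"], bins=[0, 10, 50, np.inf], labels=["0-10", "10-50", "50+"]
)

# Log transform: tame the heavy right tail of view counts.
items["log_views"] = np.log1p(items["view_count"])

# Ratio feature: discount percentage carries more meaning than either raw price.
items["discount_pct"] = (
    items["original_price"] - items["sale_price"]
) / items["original_price"]

# Interaction feature: a simple product of two signals.
items["price_x_log_views"] = items["original_price"] * items["log_views"]

# Time-based features: extract calendar components from the timestamp.
items["purchase_hour"] = items["last_purchase"].dt.hour
items["purchase_dow"] = items["last_purchase"].dt.dayofweek

print(items.head())
```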
Step 3: Feature Selection
After potentially creating many new features, select the most impactful ones to avoid overfitting, reduce computational cost, and improve model interpretability.
Methods: Use statistical tests (correlation, ANOVA), model-based importance scores (from tree or linear models), or dimensionality reduction techniques (like PCA, though reduced components no longer preserve the meaning of individual numerical features).
The Challenge: Requires careful validation to ensure important signals aren't discarded. Can be computationally intensive.
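As a rough example, the snippet below applies a mutual-information filter and a tree-based importance ranking with scikit-learn; the synthetic dataset stands in for a table of engineered numerical features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data standing in for engineered numerical features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Filter method: keep the 5 features with the highest mutual information.
selector = SelectKBest(mutual_info_classif, k=5).fit(X, y)
print("Selected feature indices:", selector.get_support(indices=True))

# Model-based importance: rank features by how much a forest relies on them.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:5]
print("Top features by importance:", top)
```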
Step 4: Feature Storage and Serving
Consistent application of these transformations during both training and real-time inference is crucial.
Feature Stores: Increasingly popular for managing feature definitions, computation logic, and serving features with low latency. They help ensure train-serve skew is minimized.
The Challenge: Implementing and managing a feature store adds infrastructure complexity and cost. Ensuring feature freshness and consistency requires robust pipelines.
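Even without a full feature store, the core idea behind avoiding train-serve skew can be sketched by persisting a fitted transformer at training time and reusing it at inference time. This is a minimal illustration, not a substitute for a managed feature store; the file name is arbitrary.

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Training time: fit the scaler on historical data and persist it.
train_prices = np.array([[19.99], [35.50], [120.00], [8.25]])
scaler = StandardScaler().fit(train_prices)
joblib.dump(scaler, "price_scaler.joblib")

# Serving time: load the exact same fitted transformer, never re-fit on live data.
serving_scaler = joblib.load("price_scaler.joblib")
scaled = serving_scaler.transform(np.array([[49.99]]))
print(scaled)  # Uses the training-time mean/variance, so no train-serve skew.
```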
Step 5: Integration into Models
Feed the final set of engineered numerical features into downstream machine learning models (ranking models, recommendation algorithms, classifiers).
The Challenge: Ensuring the model architecture can effectively utilize the engineered features. Debugging issues related to incorrect feature values at inference time.
Step 6: Maintenance and Monitoring
- Distribution Monitoring: Track the statistical properties of numerical features over time to detect data drift, which might require retraining models or updating transformation logic (e.g., imputation values, bin boundaries).
- Pipeline Updates: Maintain and update the feature generation pipelines as source data changes or new features are needed.
The Challenge: Requires ongoing monitoring and maintenance effort. Adapting pipelines without disrupting live systems can be complex.
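For example, a simple distribution monitor might compare a recent serving window of a feature against a training-time reference using a two-sample Kolmogorov-Smirnov test; the synthetic data and alert threshold below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Reference distribution captured at training time vs. a recent serving window.
training_prices = rng.lognormal(mean=3.0, sigma=0.5, size=5000)
recent_prices = rng.lognormal(mean=3.4, sigma=0.5, size=5000)  # prices have drifted upward

statistic, p_value = ks_2samp(training_prices, recent_prices)
if p_value < 0.01:  # assumed alert threshold
    print(f"Drift detected (KS statistic={statistic:.3f}); consider retraining or updating bins.")
```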
Streamlining Numerical Feature Engineering
The DIY path for numerical features requires careful statistical consideration, domain knowledge, and pipeline engineering. Platforms and tools are emerging that aim to simplify this by integrating best practices for numerical feature handling.
How a Streamlined Approach Can Help:
- Automated Preprocessing: Handle common tasks like missing value imputation (using sensible defaults or learned strategies) and scaling/normalization automatically when numerical columns are detected.
- Native Integration: Seamlessly combine processed numerical features with behavioral signals, text features, image features, and other metadata within unified ranking or recommendation models.
- Implicit Optimization & Interaction Handling: Leverage model architectures (especially deep learning models) that can implicitly learn non-linearities and feature interactions without extensive manual creation of polynomial or binned features. The platform's training process optimizes the use of numerical inputs for the specific business objective.
- Flexibility for Custom Features: Allow users to provide pre-engineered numerical features (e.g., custom ratios, domain-specific scores) alongside raw numerical columns if specific transformations are critical.
- Managed Infrastructure & Scale: Abstract away the complexity of scaling computations, storing intermediate values, and serving features consistently.
- Graceful Handling of Edge Cases: Robustly manage missing values, outliers (potentially through learned clipping or robust scaling), and varying data types.
Leveraging Numerical Features in a Shaped Workflow
Let's see an example of how you can incorporate numerical features into Shaped.
Goal: Automatically use price, rating, and view counts to improve recommendations.
1. Ensure Data is Available:
Assume item_metadata (with price, average_rating, view_count) and user_interactions are accessible.
2. Define Model Configuration (YAML):
```yaml
model:
  name: numerical_recs_model

connectors:
  - name: items
    type: database
    id: items_source
  - name: events
    type: event_stream
    id: events_source

fetch:
  items: |
    SELECT
      item_id, title, category,
      price,           -- <-- Raw numerical feature
      average_rating,  -- <-- Raw numerical feature
      view_count       -- <-- Raw numerical feature
    FROM items_source
  events: |
    SELECT
      user_id, item_id, event_type,
      event_timestamp
    FROM events_source
```
3. Trigger Model Training:
Initiate the Shaped model creation and training process. Shaped handles the necessary preprocessing (e.g., scaling, imputation) and integrates these features into its learning process.
```bash
shaped create-model --file numerical_model.yaml

# Monitor the model until it reaches the ACTIVE state
shaped view-model --model-name numerical_recs_model
```
4. Use Standard Shaped APIs:
Call Shaped’s standard APIs (rank, similar_items, etc.). The API interaction remains simple, but the model's relevance calculations are now informed by the properly processed numerical features, learning their impact alongside other signals.
Python:
```python
from shaped import Shaped

# Initialize the Shaped client
shaped_client = Shaped()

# Get recommendations using the numerically-enhanced model
response = shaped_client.rank(
    model_name='numerical_recs_model',
    user_id='USER_1',
    limit=10
)

# Print the recommendations
if response and response.metadata:
    print("Recommended Items:")
    for item in response.metadata:
        print(f"- {item['title']} (Price: {item['price']}, Rating: {item['average_rating']}, Views: {item['view_count']})")
else:
    print("No recommendations found.")
```
JavaScript:

```javascript
const { Shaped } = require('@shaped/shaped');

// Initialize the Shaped client
const shapedClient = new Shaped();

// Get recommendations using the numerically-enhanced model
shapedClient.rank({
  modelName: 'numerical_recs_model',
  userId: 'USER_1',
  limit: 10
}).then(response => {
  if (response && response.metadata) {
    console.log("Recommended Items:");
    response.metadata.forEach(item => {
      console.log(`- ${item.title} (Price: ${item.price}, Rating: ${item.average_rating}, Views: ${item.view_count})`);
    });
  } else {
    console.log("No recommendations found.");
  }
}).catch(error => {
  console.error("Error fetching recommendations:", error);
});
```
Conclusion: Harness Numerical Power, Minimize Statistical Pain
Numerical data holds significant potential for improving relevance, but extracting this value traditionally requires careful statistical transformations, domain knowledge, robust data pipelines, and ongoing maintenance. Skipping these steps often leads to suboptimal model performance.
Emerging platforms and MLOps tools aim to simplify numerical feature engineering by automating common preprocessing steps like scaling and imputation, and by using model architectures capable of learning complex relationships implicitly. They integrate numerical data with other feature types while managing the underlying infrastructure. By also letting users incorporate custom-engineered features, these streamlined approaches allow teams to focus more on core business logic and less on the intricacies of manual numerical feature manipulation, ultimately delivering more relevant and personalized user experiences, faster.
Ready to streamline your feature engineering process?
Request a demo of Shaped today to see Shaped in action for your feature types. Or, start exploring immediately with our free trial sandbox.