Images
Beyond Tags: The Power of Understanding Visuals for Relevance
In today's visually driven online world, relying solely on filenames, manually assigned tags, or interaction history for search and recommendation systems leaves significant value untapped. The rich, unstructured visual information within your platform's images – product photos, user-uploaded content, article illustrations, banners – holds the key to unlocking deeper relevance. Understanding these visuals allows systems to grasp:
- True Visual Content: What objects, scenes, or styles are actually depicted in this image, beyond basic labels?
- Aesthetic Similarity: Are these two items visually complementary or stylistically aligned, even if their metadata differs?
- Visual Nuance: Can we identify subtle visual attributes like color palettes, textures, or composition that influence user preference?
- Cross-Modal Understanding: How does the visual content relate to accompanying text descriptions or user queries?
- Latent Visual Preferences: Can we infer user tastes based on the visual characteristics of items they interact with?
Transforming raw pixels into meaningful signals, or features, that machine learning models can use is a crucial yet demanding part of feature engineering, one that draws heavily on the field of Computer Vision (CV). Nail it, and you enable powerful visual search, style-based recommendations, and richer user profiles. Neglect it, and you miss a fundamental dimension of user experience and relevance. The standard path involves building complex, resource-intensive CV pipelines.
The Standard Approach: Building Your Own Visual Understanding Pipeline

Leveraging visual data requires turning unstructured images into structured numerical representations (embeddings) that capture visual meaning. Doing this yourself typically involves a multi-stage, expert-driven process:
Step 1: Gathering and Preprocessing Image Data
- Collection: Aggregate image assets from diverse sources – CDNs, product databases, user upload storage, content management systems. Often involves handling URLs or binary data.
- Cleaning & Normalization: This is critical for consistent model input. Resize images to uniform dimensions, handle different file formats (JPG, PNG, WEBP), normalize pixel values (e.g., scale to [0, 1] or standardize based on ImageNet stats), potentially apply data augmentation (rotations, flips, color jitter) during training. Address corrupted or missing images.
- Pipelines: Build and maintain robust data pipelines to reliably ingest, validate, and preprocess potentially millions or billions of images.
The Challenge: Image data is diverse in size, quality, format, and content. Building reliable preprocessing pipelines requires significant data engineering and CV domain knowledge. Storage costs can also be substantial.
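As a rough illustration of the cleaning and normalization work above, here is a minimal sketch using PIL and torchvision; the target resolution, the ImageNet normalization statistics, and the directory path are illustrative assumptions rather than a prescribed pipeline.

```python
from pathlib import Path

import torch
from PIL import Image, UnidentifiedImageError
from torchvision import transforms

# Common preprocessing: resize to a uniform resolution, convert to a tensor,
# and normalize with ImageNet statistics (a frequent, but not universal, choice).
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),                      # scales pixel values to [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def load_images(image_dir: str) -> torch.Tensor:
    """Load every readable image under image_dir into one batch tensor,
    skipping corrupted or unreadable files."""
    batch = []
    for path in Path(image_dir).glob("*"):
        try:
            with Image.open(path) as img:
                batch.append(preprocess(img.convert("RGB")))  # handles JPG/PNG/WEBP alike
        except (UnidentifiedImageError, OSError):
            print(f"Skipping corrupted or unreadable file: {path}")
    return torch.stack(batch) if batch else torch.empty(0)

# Example usage (directory path is hypothetical):
# images = load_images("data/product_photos")
```

During training, augmentation transforms such as RandomHorizontalFlip or ColorJitter would typically be inserted before ToTensor.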
Step 2: Choosing the Right Vision Model Architecture
Selecting the appropriate CV model to generate embeddings is vital and requires navigating a rapidly evolving landscape.
- The Ecosystem (Hugging Face Hub, TIMM): Platforms like Hugging Face and libraries like timm (PyTorch Image Models) offer thousands of pre-trained vision models.
- Convolutional Neural Networks (CNNs): Architectures like ResNet and EfficientNet were long dominant and remain strong baselines. They excel at capturing local patterns.
- Vision Transformers (ViTs): Increasingly the state of the art, models like ViT, Swin Transformer, and DeiT treat image patches as a sequence of tokens, often capturing more global context, though they generally require more data and compute.
- Multimodal Models (e.g., CLIP, BLIP): Models like openai/clip-vit-base-patch32 or Salesforce's BLIP variants embed both images and text into a shared space. This is crucial for text-to-image search, image-to-text search, and zero-shot classification based on textual descriptions (see the sketch after this list).
- Task-Specific Models: Models trained for specific tasks like object detection (YOLO, DETR) or segmentation (U-Net) generate different kinds of features; these are less commonly used for general similarity embeddings but are vital for specific applications.
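To make the multimodal bullet concrete, the sketch below uses the Hugging Face transformers implementation of openai/clip-vit-base-patch32 to place an image and two text queries in one embedding space; the file name and query strings are hypothetical.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("shoe.jpg")          # hypothetical product photo
texts = ["red running shoes", "leather office chair"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Cosine similarity between the image and each text query: because CLIP embeds
# both modalities in one space, this directly powers text-to-image search.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # higher score = better visual/textual match
```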
The Challenge: Selecting the appropriate architecture and pre-trained weights requires deep CV expertise: the right choice depends on your data characteristics, the desired embedding properties (local vs. global features, text alignment), the downstream task (similarity, classification, search), and your computational budget (ViTs can be heavy).
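As a rough illustration of how that choice plays out in code, the sketch below loads a backbone via timm as a pure feature extractor; swapping the model name (e.g., a ResNet for a ViT) changes the embedding's size, character, and cost while the surrounding code stays the same. The model names and image path are assumptions.

```python
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

# num_classes=0 removes the classification head, so the model returns
# pooled feature embeddings instead of class logits.
model_name = "resnet50"  # try e.g. "vit_base_patch16_224" to compare a ViT
model = timm.create_model(model_name, pretrained=True, num_classes=0).eval()

# Each pretrained model ships with its own expected preprocessing.
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

image = Image.open("item.jpg").convert("RGB")   # hypothetical item image
with torch.no_grad():
    embedding = model(transform(image).unsqueeze(0))

print(embedding.shape)  # e.g. (1, 2048) for resnet50, (1, 768) for vit_base
```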
Step 3: Fine-tuning Models for Your Task and Data
Pre-trained models often need adaptation to perform optimally on your specific visual domain and business goals.
- Domain Adaptation: Further pre-train a model on your own large image corpus (e.g., all product photos) to help it learn the nuances of your specific visual style and object types.
- Metric Learning / Similarity Fine-tuning: Train the model using triplets (anchor, positive, negative examples) or pairs of images based on known similarity (e.g., same product, different angle vs. different product) to optimize embeddings for visual similarity search. This requires curated labeled data (see the triplet-loss sketch after this list).
- Personalization Fine-tuning: Train models (e.g., Two-Tower architectures) where one tower processes user features/history and the other processes item image features, optimizing embeddings so that their similarity predicts user engagement (clicks, add-to-carts). This requires pairing interaction data with image data (a Two-Tower sketch follows the challenge note below).
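Here is a compressed sketch of the metric-learning idea: a backbone is nudged so that an anchor image lands closer to a known-similar positive than to a negative. The choice of backbone, margin, and learning rate, and the assumption that curated triplet batches of preprocessed image tensors already exist, are all illustrative.

```python
import timm
import torch
from torch import nn

# Backbone that outputs embeddings (classification head removed).
backbone = timm.create_model("resnet50", pretrained=True, num_classes=0)
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-5)

def training_step(anchor: torch.Tensor,
                  positive: torch.Tensor,
                  negative: torch.Tensor) -> float:
    """One optimization step over a batch of preprocessed image tensors,
    shaped (batch, 3, H, W), drawn from curated triplets."""
    emb_a = backbone(anchor)
    emb_p = backbone(positive)
    emb_n = backbone(negative)
    # Pull anchor/positive together, push anchor/negative apart.
    loss = criterion(emb_a, emb_p, emb_n)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```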
The Challenge: Fine-tuning vision models is computationally expensive (often requiring multi-GPU setups), needs significant ML/CV expertise, access to relevant labeled data (which can be hard to acquire for visual tasks), and extensive experimentation.
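And a minimal sketch of the Two-Tower setup mentioned in the personalization bullet: one tower encodes user features, the other encodes precomputed item image embeddings, and their dot product is trained to predict engagement. The dimensions, layer sizes, and synthetic batch are assumptions.

```python
import torch
from torch import nn

class TwoTowerModel(nn.Module):
    def __init__(self, user_dim: int = 64, image_dim: int = 512, joint_dim: int = 128):
        super().__init__()
        # User tower: maps user features/history into the joint space.
        self.user_tower = nn.Sequential(
            nn.Linear(user_dim, 256), nn.ReLU(), nn.Linear(256, joint_dim))
        # Item tower: maps item image embeddings into the same joint space.
        self.item_tower = nn.Sequential(
            nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, joint_dim))

    def forward(self, user_feats: torch.Tensor, item_img_emb: torch.Tensor) -> torch.Tensor:
        u = self.user_tower(user_feats)
        v = self.item_tower(item_img_emb)
        return (u * v).sum(dim=-1)  # dot-product affinity score per (user, item) pair

model = TwoTowerModel()
criterion = nn.BCEWithLogitsLoss()  # labels: 1 = clicked / added to cart, 0 = not
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a synthetic batch of paired interaction data.
users = torch.randn(32, 64)        # user feature vectors
items = torch.randn(32, 512)       # image embeddings from the vision model
labels = torch.randint(0, 2, (32,)).float()

loss = criterion(model(users, items), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```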
Step 4: Generating and Storing Embeddings
Once a model is ready, run inference on your image dataset to get the embedding vectors.
- Inference at Scale: Set up efficient batch inference pipelines, almost always GPU-accelerated, to process large volumes of images (see the sketch after this list).
- Vector Storage: Store these high-dimensional vectors. As with text embeddings, Vector Databases (Pinecone, Weaviate, Milvus, Qdrant, etc.) are essential for efficient storage and fast Approximate Nearest Neighbor (ANN) search to power visual similarity lookups (a sketch follows at the end of this step).
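A rough sketch of what GPU-accelerated batch inference can look like, reusing a timm backbone and its matching transform; the image paths, batch size, and worker count are placeholders.

```python
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform
from torch.utils.data import DataLoader, Dataset

class ImageDataset(Dataset):
    """Wraps a list of image file paths and applies a preprocessing transform."""
    def __init__(self, paths, transform):
        self.paths, self.transform = paths, transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with Image.open(self.paths[idx]) as img:
            return self.transform(img.convert("RGB"))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = timm.create_model("resnet50", pretrained=True, num_classes=0).to(device).eval()
transform = create_transform(**resolve_data_config({}, model=model))

image_paths = ["img_001.jpg", "img_002.jpg"]   # placeholder list of asset paths
loader = DataLoader(ImageDataset(image_paths, transform),
                    batch_size=256, num_workers=8)  # tune for your hardware

embeddings = []
with torch.no_grad():
    for batch in loader:
        embeddings.append(model(batch.to(device)).cpu())
embeddings = torch.cat(embeddings)             # shape: (num_images, embedding_dim)
```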
The Challenge: Large-scale image inference is computationally demanding and costly. Deploying, managing, scaling, and securing a Vector Database adds significant operational overhead.
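Finally, a hedged sketch of pushing those vectors into one of the vector databases listed above, here Qdrant via its Python client against a local instance; the collection name, vector size, and payload fields are illustrative, and `embeddings` and `image_paths` are assumed to come from the batch-inference sketch.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(url="http://localhost:6333")   # local Qdrant instance (assumed)

# Create a collection sized to the embedding dimensionality (2048 for the resnet50 above).
client.recreate_collection(
    collection_name="item_images",
    vectors_config=VectorParams(size=2048, distance=Distance.COSINE),
)

# Upsert the embeddings computed by the batch-inference sketch.
client.upsert(
    collection_name="item_images",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"path": path})
        for i, (vec, path) in enumerate(zip(embeddings, image_paths))
    ],
)

# Visual similarity lookup: find the nearest neighbours of one item's embedding.
hits = client.search(
    collection_name="item_images",
    query_vector=embeddings[0].tolist(),
    limit=5,
)
print([hit.payload["path"] for hit in hits])
```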