Multi-modal enrichment
Multi-modal enrichment uses vision-language models to analyze images and generate text descriptions. This is useful when you have product images, user-generated content, or other visual data that needs semantic text representations for search and recommendation.
How it works
Multi-modal enrichment processes images through a vision-language model that understands both visual and textual information. The model analyzes image content and generates descriptions based on your prompt instructions.
To enable image processing, include an image_url column in your source_columns. The column must contain publicly accessible URLs pointing to image files.
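Before creating the view, it can help to screen the image_url values for obviously unusable entries. The sketch below is a minimal pre-check, not part of the product: the function name looks_like_public_image_url is hypothetical, and it only verifies that a value is an absolute http(s) URL with a host. It does not confirm that the URL is actually reachable or that it points to an image file.

```python
from urllib.parse import urlparse


def looks_like_public_image_url(value):
    """Return True if value is an absolute http(s) URL with a host.

    Hypothetical helper: a syntactic check only. It makes no network
    request, so it cannot prove the URL is publicly accessible.
    """
    if not isinstance(value, str):
        return False
    parsed = urlparse(value.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


# Illustrative rows; only the first URL comes from the example in this page.
candidate_urls = [
    "https://example.com/images/jacket-12345.jpg",
    "jacket-12345.jpg",          # relative path, no scheme or host
    "file:///tmp/jacket.jpg",    # local file, not reachable by a remote model
    None,                        # missing value
]

rejected = [u for u in candidate_urls if not looks_like_public_image_url(u)]
```

Rows that land in rejected would need to be fixed or filtered out before enrichment.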
If your image column has a different name, use an SQL view to alias it as image_url.
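The renaming step amounts to a column alias in a SELECT. The sketch below demonstrates the idea with Python's built-in sqlite3 module purely for illustration; the source column name photo_link and the view name products_for_enrichment are assumptions, and the exact way SQL views are declared on the platform is not shown here.

```python
import sqlite3

# In-memory database standing in for the real source table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (item_id TEXT, product_name TEXT, photo_link TEXT)"
)
conn.execute(
    "INSERT INTO products VALUES (?, ?, ?)",
    ("12345", "Classic Denim Jacket", "https://example.com/images/jacket-12345.jpg"),
)

# The view exposes the hypothetical photo_link column under the name image_url.
conn.execute(
    """
    CREATE VIEW products_for_enrichment AS
    SELECT item_id, product_name, photo_link AS image_url
    FROM products
    """
)

cursor = conn.execute("SELECT * FROM products_for_enrichment")
view_columns = [col[0] for col in cursor.description]
view_rows = cursor.fetchall()
```

The enrichment view would then read from the renamed view instead of the original table.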
Configuration
Create an AI enrichment view with view_type: "AI_ENRICHMENT" and include image_url in the source_columns array. Write a prompt that specifies what visual information to extract from the images.
Example: Product image analysis
{
  "name": "products_with_image_descriptions",
  "view_type": "AI_ENRICHMENT",
  "source_table": "products",
  "source_columns": [
    "item_id",
    "product_name",
    "image_url"
  ],
  "source_columns_in_output": [
    "item_id",
    "product_name"
  ],
  "enriched_output_columns": [
    "image_description"
  ],
  "prompt": "Analyze the product image and describe the item's visual characteristics, including color, style, pattern, material texture, and any visible text or branding. Focus on factual details that would help customers understand what the product looks like."
}
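A configuration like this can be sanity-checked locally before it is submitted. The following is a rough sketch: check_image_enrichment_config is a hypothetical helper, not a platform API, and the rule that source_columns_in_output must be a subset of source_columns is an assumption inferred from the field names rather than a documented constraint.

```python
def check_image_enrichment_config(config):
    """Return a list of problems found in an image enrichment view config.

    Hypothetical local check; an empty list means nothing was flagged.
    """
    problems = []

    if config.get("view_type") != "AI_ENRICHMENT":
        problems.append('view_type must be "AI_ENRICHMENT"')

    source_columns = config.get("source_columns", [])
    if "image_url" not in source_columns:
        problems.append("source_columns must include image_url")

    if not str(config.get("prompt", "")).strip():
        problems.append("prompt must not be empty")

    # Assumption: pass-through columns have to be selected as source columns.
    passthrough = config.get("source_columns_in_output", [])
    unknown = [c for c in passthrough if c not in source_columns]
    if unknown:
        problems.append(f"source_columns_in_output not in source_columns: {unknown}")

    return problems


# Same fields as the example above, with the prompt shortened for brevity.
example_config = {
    "name": "products_with_image_descriptions",
    "view_type": "AI_ENRICHMENT",
    "source_table": "products",
    "source_columns": ["item_id", "product_name", "image_url"],
    "source_columns_in_output": ["item_id", "product_name"],
    "enriched_output_columns": ["image_description"],
    "prompt": "Analyze the product image and describe its visual characteristics.",
}

problems = check_image_enrichment_config(example_config)
```

An empty problems list only means the structural checks passed; it says nothing about whether the prompt will produce useful descriptions.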
Input and output:
Given an input row:
item_id: "12345"
product_name: "Classic Denim Jacket"
image_url: "https://example.com/images/jacket-12345.jpg"
The model generates this description in the image_description column:
Blue denim jacket with a classic fit. Features a button-front closure, two chest pockets, and long sleeves with button cuffs. Light wash finish with subtle fading. Brand logo visible on the left chest pocket.
Prompt guidelines
When writing prompts for image enrichment, be specific about what visual information to extract. Images contain many visual elements, so clearly state which aspects matter for your use case:
- Product attributes: Color, style, pattern, material, dimensions
- Scene composition: Layout, background, context
- Text content: Visible text, branding, labels
- Visual details: Texture, finish, condition
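One way to keep prompts specific is to assemble them from an explicit list of the aspects you care about. The sketch below is only an illustration of that idea: build_image_prompt and the ASPECTS mapping are hypothetical, and the wording of the generated prompt is an assumption, not a recommended template.

```python
# Aspect names and details taken from the guideline list; keys are hypothetical.
ASPECTS = {
    "product_attributes": "color, style, pattern, material, and dimensions",
    "scene_composition": "layout, background, and context",
    "text_content": "visible text, branding, and labels",
    "visual_details": "texture, finish, and condition",
}


def build_image_prompt(subject, aspects):
    """Build a prompt that names only the selected visual aspects."""
    if not aspects:
        raise ValueError("select at least one aspect")
    unknown = [a for a in aspects if a not in ASPECTS]
    if unknown:
        raise ValueError(f"unknown aspects: {unknown}")
    details = "; ".join(ASPECTS[a] for a in aspects)
    return (
        f"Analyze the {subject} image and describe only the following: "
        f"{details}. Report factual, visible details only."
    )


prompt = build_image_prompt("product", ["product_attributes", "text_content"])
```

The resulting string can be used as the prompt value in the view configuration; aspects left out of the list are simply not mentioned, which steers the model away from describing them.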
The enriched output is materialized as a persistent table, which engines can then use for semantic search and recommendation.