Image-to-Code Conversion Pipeline

<aside>

This document details the core image processing pipeline of the backend system, which transforms an uploaded image into corresponding code. The process involves several key stages, each employing specific technologies and strategies to analyse the image and extract meaningful information for code generation.

</aside>

I. Image Reception and Initial Setup

<aside>

Image Ingestion:
- The process begins when an image file is uploaded to the backend via an HTTP POST request. The system is designed to handle these uploads, associating a unique identifier (request ID) with each conversion task. This ID is crucial for organising files and tracking the process.
- The uploaded image is then saved to a designated temporary storage location on the server. This storage is usually structured using the request ID to keep files from different requests isolated (e.g., within a directory like uploads/<request_id>/images/).
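
The per-request layout described above can be sketched as follows. This is a minimal illustration, not the backend's actual code: the `uploads` root, the use of a UUID as the request ID, and the helper names `new_request_id`, `target_path`, and `save_upload` are all assumptions.

```python
import uuid
from pathlib import Path

UPLOAD_ROOT = Path("uploads")  # assumed storage root, taken from the example path


def new_request_id() -> str:
    """Generate a unique request ID (assumption: a random UUID in hex form)."""
    return uuid.uuid4().hex


def target_path(filename: str, request_id: str, root: Path = UPLOAD_ROOT) -> Path:
    """Build uploads/<request_id>/images/<filename> for an uploaded file."""
    # Keep only the final path component so a crafted filename
    # cannot escape the request's own directory.
    return root / request_id / "images" / Path(filename).name


def save_upload(data: bytes, filename: str, request_id: str,
                root: Path = UPLOAD_ROOT) -> Path:
    """Write the uploaded bytes into the request's isolated image directory."""
    target = target_path(filename, request_id, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target
```

Keeping the path construction in its own function makes the isolation rule easy to check without touching the file system.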
Processing Mode Determination:
- The system can operate in different modes, which the user can select. This choice alters the strategies and tools used in subsequent image analysis and description phases, allowing for a trade-off between processing speed and analysis accuracy.
- Two independent choices are available: streaming versus non-streaming output, and heuristic versus non-heuristic analysis. Heuristic processing is faster, while non-heuristic processing is more accurate. </aside>

II. Image Segmentation (Layout Analysis)

<aside>

For images that represent long web pages or complex layouts, an initial segmentation step may be performed to break them down into more manageable sections.

Segmentation Necessity Check:
- The system first checks whether the input image requires segmentation. The decision is based on the image height, using a threshold of 1430px.
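
As a rough sketch, the height check reduces to a single comparison. The constant and function names are hypothetical, and whether the real comparison is strict (`>`) or inclusive (`>=`) is an assumption here.

```python
SEGMENTATION_HEIGHT_THRESHOLD_PX = 1430  # threshold stated in this document


def needs_segmentation(image_height_px: int,
                       threshold_px: int = SEGMENTATION_HEIGHT_THRESHOLD_PX) -> bool:
    """Return True when the image is tall enough to be split into sections."""
    # Assumption: only images strictly taller than the threshold are segmented.
    return image_height_px > threshold_px
```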
Separation Line Detection:
- If segmentation is needed, the system identifies horizontal lines that visually separate distinct sections of the image. This involves:
  - Variance and Edge Analysis: Analysing rows of pixels to find areas with low visual variance (indicating blank spaces) that are adjacent to areas with significant changes in pixel values (indicating borders or edges of content blocks).
  - Color Transition Analysis: Detecting abrupt changes in color profiles across horizontal regions of the image, which often signify boundaries between different UI sections.
  - Combined Heuristics with Library Support: The system integrates results from multiple line detection approaches. For instance, it uses an external image splitting library to identify blank spaces and analyze color-based splits. Combining the custom variance analysis with the library has been effective in identifying separation lines accurately.
  - Confidence Scoring: Potential separation lines are assigned confidence scores. Lines detected by multiple methods or those that strongly delineate clear visual breaks receive higher scores.
  - Filtering and Selection: A final set of separation lines is chosen by filtering candidates based on their confidence scores and ensuring a minimum vertical distance between selected lines to avoid over-segmentation.
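
The detection, scoring, and filtering steps above can be illustrated with a small pure-Python sketch. It assumes the image is available as rows of grayscale values, scores a line by the fraction of detection methods that agree on it, and uses hypothetical names and defaults (`max_variance`, `tolerance`, `min_gap`) that are not taken from the real system.

```python
from statistics import pvariance


def blank_row_runs(gray_rows, max_variance=4.0):
    """Return inclusive (start, end) ranges of consecutive low-variance rows."""
    runs, start = [], None
    for y, row in enumerate(gray_rows):
        is_blank = pvariance(row) <= max_variance
        if is_blank and start is None:
            start = y
        elif not is_blank and start is not None:
            runs.append((start, y - 1))
            start = None
    if start is not None:
        runs.append((start, len(gray_rows) - 1))
    return runs


def score_candidates(detections, tolerance=5):
    """Score each candidate line by the share of methods that agree on it.

    `detections` maps a method name to the y positions it reported.
    """
    all_ys = sorted({y for ys in detections.values() for y in ys})
    scored = []
    for y in all_ys:
        votes = sum(
            any(abs(y - other) <= tolerance for other in ys)
            for ys in detections.values()
        )
        scored.append((y, votes / len(detections)))
    return scored


def select_lines(scored, min_confidence=0.5, min_gap=100):
    """Keep the most confident lines while enforcing a minimum vertical gap."""
    kept = []
    for y, confidence in sorted(scored, key=lambda c: (-c[1], c[0])):
        if confidence >= min_confidence and all(abs(y - k) >= min_gap for k in kept):
            kept.append(y)
    return sorted(kept)
```

Processing candidates in descending confidence order means that when two lines fall within `min_gap` of each other, the better supported one survives.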
Image Splitting:
- Once the separation lines are finalized, the original image is divided into multiple smaller image segments along these lines.
- Each segment is saved as an individual image file in a structured output directory (e.g., uploads/<request_id>/images/). These segments will then be processed individually in subsequent stages. If no segmentation is performed, the original image proceeds to the next stage. </aside>
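
The splitting step can be sketched as turning the selected separation lines into row ranges, one per segment. The function names and the `segment_<n>.png` naming scheme are hypothetical. Each `(top, bottom)` pair could then be passed to an image library's crop call, for example Pillow's `Image.crop((0, top, width, bottom))`, assuming Pillow is used.

```python
def segment_bounds(image_height, separation_lines):
    """Return (top, bottom) row ranges for each segment, top to bottom."""
    # Ignore lines that fall on or outside the image edges, and drop duplicates.
    cuts = sorted({y for y in separation_lines if 0 < y < image_height})
    edges = [0, *cuts, image_height]
    return list(zip(edges[:-1], edges[1:]))


def segment_filenames(count, stem="segment", suffix=".png"):
    """Hypothetical naming scheme for the saved segment files."""
    return [f"{stem}_{index}{suffix}" for index in range(count)]
```

With no separation lines the function returns a single range covering the whole image, which matches the case where the original image proceeds unchanged.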

III. Component Detection (Bounding Box Generation)

<aside>

This stage focuses on identifying individual UI elements within the image (or its segments) and determining their precise locations.

Element Identification:
- The system analyzes the image to detect various common UI components such as headers, logos, navigation links, text blocks, images, icons, buttons, and sidebars.
Bounding Box Creation:
- For each identified UI element, a bounding box (a rectangular region) is calculated to define its spatial extent on the image. This process leverages the Gemini 2.5 Pro model, a multimodal AI model capable of understanding and processing visual information to identify and locate objects within an image.
- Each bounding box is expressed as four coordinates (x_min, y_min, x_max, y_max).
- To aid visualization and debugging, an image with these bounding boxes drawn directly onto it is generated.
Component Data Storage:
- The information about each detected component—including its type (label, e.g., "button") and its bounding box coordinates—is systematically recorded. This data is typically saved in a structured format, like a JSON file, associated with the image being processed (e.g., uploads/<request_id>/labels/<image_name>.json). </aside>
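
A label file of this kind might look like the sketch below. The field names and the flat list layout are assumptions made for illustration. The source only states that each record holds a label and bounding box coordinates.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Component:
    """One detected UI element (hypothetical record shape)."""
    label: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def components_to_json(components) -> str:
    """Serialize detected components to a JSON array."""
    return json.dumps([asdict(component) for component in components], indent=2)


def components_from_json(text: str):
    """Load components back from the JSON produced above."""
    return [Component(**item) for item in json.loads(text)]
```

The resulting string could be written to a path such as uploads/<request_id>/labels/<image_name>.json.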

IV. Description Generation

<aside>

This phase examines each detected component more closely to understand its content, purpose, and visual attributes.

Individual Component Processing:
- The system processes each component identified in the previous stage.
Content and Attribute Extraction:
- Heuristic Approach:
  - Hierarchy Generation: The labels from the detection stage are converted into a hierarchy that reflects the parent-child relationships between components.
  - Image Asset Handling: The media elements are categorised according to their label names and added to this hierarchy as assets for each particular component. Later, these assets are cropped from the original image and saved (e.g., in uploads/<request_id>/final/assets/).
  - Styling and Layout: The styling and layout calculation transforms spatial data from UI elements into usable CSS and positioning information.
    
    1. Coordinate System Conversion
    
    The system works with two coordinate systems:
    - Normalized coordinates (0-1000 range): Used internally for device independence
    - Pixel coordinates: Used when extracting media or performing actual rendering
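
The conversion between the two coordinate systems is a simple scaling. The sketch below assumes boxes are ordered (x_min, y_min, x_max, y_max) and that results are rounded to whole pixels. The function names are hypothetical.

```python
NORMALIZED_MAX = 1000  # upper bound of the normalized range described above


def to_pixels(box, image_width, image_height):
    """Convert a normalized (0-1000) box to pixel coordinates."""
    x_min, y_min, x_max, y_max = box
    return (
        round(x_min * image_width / NORMALIZED_MAX),
        round(y_min * image_height / NORMALIZED_MAX),
        round(x_max * image_width / NORMALIZED_MAX),
        round(y_max * image_height / NORMALIZED_MAX),
    )


def to_normalized(box, image_width, image_height):
    """Convert a pixel box back to the normalized (0-1000) range."""
    x_min, y_min, x_max, y_max = box
    return (
        round(x_min * NORMALIZED_MAX / image_width),
        round(y_min * NORMALIZED_MAX / image_height),
        round(x_max * NORMALIZED_MAX / image_width),
        round(y_max * NORMALIZED_MAX / image_height),
    )
```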
    2. Relational Positioning
    
    The code calculates positioning for both:
    - Relative to parent: How a component is positioned within its immediate parent
    - Relative to viewport: How a component is positioned on the entire screen
    For each component, it calculates these positions as percentages.
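
One way to express both calculations is a single helper that measures a box against any container, with the viewport treated as the container that spans the full normalized range. This is a sketch under that assumption, and the names are hypothetical.

```python
VIEWPORT = (0, 0, 1000, 1000)  # the whole image in normalized coordinates


def percent_within(box, container=VIEWPORT):
    """Express a box's offset and size as percentages of a container box."""
    cx_min, cy_min, cx_max, cy_max = container
    x_min, y_min, x_max, y_max = box
    width, height = cx_max - cx_min, cy_max - cy_min
    if width <= 0 or height <= 0:
        raise ValueError("container must have a positive width and height")
    return {
        "left": 100 * (x_min - cx_min) / width,
        "top": 100 * (y_min - cy_min) / height,
        "width": 100 * (x_max - x_min) / width,
        "height": 100 * (y_max - y_min) / height,
    }
```

Calling `percent_within(child, parent)` gives the position relative to the parent, and `percent_within(child)` gives the position relative to the viewport.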
    
    3. Positioning Strategy Selection
    
    The system determines whether components should use absolute or relative positioning based on:
    - Component type (e.g., tooltips typically need absolute positioning)
    - Parent type (e.g., cards often contain absolutely positioned children)
    - Whether siblings overlap (overlapping elements need absolute positioning)
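
These three criteria could be combined as in the sketch below. The label sets are hypothetical examples, and treating a parent type as a hard rule is a simplification, since the text only says such parents often contain absolutely positioned children.

```python
ABSOLUTE_TYPES = {"tooltip"}       # hypothetical: component types that float
ABSOLUTE_PARENT_TYPES = {"card"}   # hypothetical: parents with floating children


def boxes_overlap(a, b):
    """Return True when two (x_min, y_min, x_max, y_max) boxes share area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def choose_position(label, parent_label, box, sibling_boxes):
    """Pick 'absolute' or 'relative' from type, parent type, and overlap."""
    if label in ABSOLUTE_TYPES or parent_label in ABSOLUTE_PARENT_TYPES:
        return "absolute"
    if any(boxes_overlap(box, sibling) for sibling in sibling_boxes):
        return "absolute"
    return "relative"
```

Boxes that only touch along an edge are not counted as overlapping in this sketch.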
    4. Special Handling for Media Elements
    
    Media elements (images, icons) are treated differently:
    1. They use pixel-perfect positioning rather than percentages
    2. They are almost always absolutely positioned within their parent
    3. Their physical dimensions are preserved rather than made responsive
    5. Tailwind CSS Class Generation
    
    For each component, the system generates appropriate Tailwind CSS classes:
    1. Positioning classes: relative, absolute
    2. Dimension classes: w-[30%], h-[50%]
    3. Spacing classes: top-[10%], left-[5%]
    The arbitrary values in these classes are derived from the percentage calculations.
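
A minimal sketch of the class generation, assuming percentages have already been computed and that values are rounded to two decimal places. The helper names are hypothetical. The bracket syntax is Tailwind's arbitrary-value notation, as in the examples above.

```python
def format_percent(value):
    """Format a number with at most two decimals and no trailing zeros."""
    return f"{value:.2f}".rstrip("0").rstrip(".")


def tailwind_classes(position, left, top, width, height):
    """Build the positioning, dimension, and spacing classes for a component."""
    return " ".join([
        position,                           # "relative" or "absolute"
        f"w-[{format_percent(width)}%]",
        f"h-[{format_percent(height)}%]",
        f"top-[{format_percent(top)}%]",
        f"left-[{format_percent(left)}%]",
    ])
```

For example, `tailwind_classes("absolute", left=5, top=10, width=30, height=50)` yields `absolute w-[30%] h-[50%] top-[10%] left-[5%]`.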
    
    6. Responsive Design Considerations
    
    The percentage-based calculations ensure the layout can be responsive.
  - Color Analysis: The system creates a "medialess" version of the image by removing all media elements. For each component, it:
    - Extracts the dominant colors from its area in the image using the color-thief-py library (https://github.com/fengsp/color-thief-py).
    - Updates the component's style information with this color palette.
    - Falls back to an alternative method if the color-thief-py library isn't available.
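
The optional-dependency pattern described above might look like the following sketch. The fallback shown here, a coarse color histogram, is an assumption, since the source does not say how the real fallback works. `dominant_colors` and `fallback_palette` are hypothetical names, while `ColorThief(...).get_palette(color_count=...)` is the library's documented call.

```python
from collections import Counter

try:
    from colorthief import ColorThief  # provided by the color-thief-py project
except ImportError:
    ColorThief = None  # library missing: use the fallback below


def fallback_palette(pixels, color_count=3, bucket=32):
    """Approximate dominant colors by counting pixels in coarse RGB buckets."""
    counts = Counter((r // bucket, g // bucket, b // bucket) for r, g, b in pixels)
    half = bucket // 2
    # Report the centre of each of the most populated buckets.
    return [
        tuple(min(255, channel * bucket + half) for channel in key)
        for key, _ in counts.most_common(color_count)
    ]


def dominant_colors(image_path, pixels, color_count=3):
    """Use ColorThief when installed, otherwise the histogram fallback.

    `image_path` points at the cropped component image and `pixels` holds the
    same region as (r, g, b) tuples. Both parameters are assumptions.
    """
    if ColorThief is not None:
        return ColorThief(image_path).get_palette(color_count=color_count)
    return fallback_palette(pixels, color_count)
```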
- Non-Heuristic (AI-Driven) Approach:
  - Vision-Language Models (VLMs): For each component, the full image is sent with that component highlighted, and Claude 3.7 Sonnet is prompted to provide a detailed description.
  - The prompt asks the VLM to identify the component type, extract any text, describe its visual appearance (colors, shapes, styles), and infer its likely function or purpose within the UI.
  - For image components, VLMs can provide descriptive alt-text.
Consolidated Component Data: </aside>

V. Code Generation

<aside>

</aside>

Core Technologies and Strategies Employed:

<aside>

</aside>

This pipeline converts visual designs into code by blending traditional image processing techniques with AI-driven analysis to produce a functional and representative output.