1. Component Detection:
2. Description Generation:
<aside>
- For each component, Claude 3.7 Sonnet is prompted to provide a detailed description. We tested several approaches to description generation:
    - Structured JSON responses with predefined fields for color, font size, font family, and other attributes
    - Varying levels of description detail, from concise to comprehensive
    - Freeform descriptions without predefined fields, which produced better results but required more tokens
- We settled on the third option because freeform descriptions better capture the nuances of each component; a minimal prompt sketch follows this section.
</aside>
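As a rough illustration of this step, the sketch below shows how a cropped component image might be sent to Claude for a freeform description using the Anthropic Python SDK. The model identifier, prompt wording, and the `describe_component` helper are illustrative assumptions, not the exact production setup.

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def describe_component(image_path: str) -> str:
    """Ask Claude for a freeform, detailed description of one cropped UI component."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-7-sonnet-20250219",  # assumed model identifier
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": "image/png",
                            "data": image_b64,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            "Describe this UI component in detail: layout, text "
                            "content, colors, spacing, typography, and any icons "
                            "or imagery. Write freeform prose, not JSON."
                        ),
                    },
                ],
            }
        ],
    )
    return response.content[0].text
```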
3. Code Generation:
<aside>
- In this final stage, we translate the structured, detailed information about UI components into functional code using Gemini 2.5 Pro.
- We evaluated several models for code generation:
    - Gemini 2.5 Pro: Delivers stable and consistent results
    - Claude 3.7 Sonnet: Excels at layout structure in some cases
    - GPT-4.1: Shows inconsistent performance
- Ultimately, we chose Gemini 2.5 Pro for code generation due to its superior visual understanding; a minimal call sketch follows this section.
</aside>
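The sketch below illustrates what the code-generation call might look like with the `google-genai` Python SDK, assuming the per-component descriptions and the full screenshot are passed together. The model identifier, prompt wording, and the `generate_code` helper are illustrative assumptions, not the exact production prompt.

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment


def generate_code(component_descriptions: list[str], screenshot_bytes: bytes) -> str:
    """Turn per-component descriptions plus the full screenshot into HTML/CSS."""
    prompt = (
        "Generate a single self-contained HTML/CSS page that reproduces the "
        "screenshot. Use the component descriptions below as the source of "
        "truth for content, colors, and typography.\n\n"
        + "\n\n".join(component_descriptions)
    )
    response = client.models.generate_content(
        model="gemini-2.5-pro",  # assumed model identifier
        contents=[
            types.Part.from_bytes(data=screenshot_bytes, mime_type="image/png"),
            prompt,
        ],
    )
    return response.text
```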
4. Color Palette:
<aside>
Currently, the system relies on the LLM to suggest colors. We also experimented with Color Thief (https://github.com/fengsp/color-thief-py) to extract the color palette of a component (a minimal usage sketch follows this section), but faced several limitations:
- It captures colors from the entire component, including image accents, which adds noise.
- It often returns slightly lighter or darker shades, making the palette less precise.
- When passing Color Thief’s palette to the LLM, it sometimes negatively impacted the layout.
- We explored traditional methods such as clustering to extract colors, but since Color Thief already uses a similar approach internally, we continued with Color Thief rather than building a separate clustering step.
- We also experimented with computer vision techniques such as superpixel segmentation, combining the segmented images with the original input, but this did not improve the results.
</aside>
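For reference, the sketch below shows the kind of Color Thief usage we experimented with, assuming the `colorthief` package from the repository above; the file path, palette size, and the `extract_palette` helper are illustrative.

```python
from colorthief import ColorThief


def extract_palette(component_path: str, color_count: int = 6) -> list[str]:
    """Return the dominant colors of a cropped component as hex strings."""
    thief = ColorThief(component_path)
    # quality trades accuracy for speed; 1 inspects every pixel
    palette = thief.get_palette(color_count=color_count, quality=1)
    return ["#{:02x}{:02x}{:02x}".format(r, g, b) for r, g, b in palette]


# Example: the returned colors include accents from embedded images (the
# "noise" issue noted above), and shades can drift lighter or darker than
# the true UI colors.
print(extract_palette("component_crop.png"))
```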
5. Layout:
<aside>
- Few-shot prompting: Provided the LLM with example website code, images, and description samples to improve layout handling and image understanding. This increased token usage but did not significantly improve accuracy.
- Canny edge detection: Tested to capture layout boundaries, but it had no meaningful impact on results.
- Sobel edge detection: Also tried to highlight horizontal and vertical structures, but it similarly showed no noticeable improvement. A minimal sketch of both edge-detection passes follows this section.
</aside>
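For illustration, the sketch below shows both edge-detection passes with OpenCV; the thresholds, kernel size, and file paths are placeholder assumptions rather than tuned values from our experiments.

```python
import cv2
import numpy as np

img = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)

# Canny: binary edge map intended to expose layout boundaries
canny_edges = cv2.Canny(img, threshold1=100, threshold2=200)

# Sobel: separate horizontal and vertical gradients to highlight row/column
# structure, then combine their magnitudes into a single edge image
sobel_x = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
sobel_y = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)
sobel_mag = cv2.convertScaleAbs(np.sqrt(sobel_x**2 + sobel_y**2))

# These maps were passed alongside the original screenshot to the model,
# but neither produced a noticeable improvement in layout accuracy.
cv2.imwrite("canny_edges.png", canny_edges)
cv2.imwrite("sobel_edges.png", sobel_mag)
```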
6. Initial Thought Process:
<aside>
- From the start, it was clear that a divide-and-conquer approach was needed, as one-shot generation using VLMs was not producing good results.
- We initially tried generating code for each individual component and then merging them, but this approach consumed a lot of tokens, increased computational load, and still failed to deliver accurate results.
- Eventually, we settled on generating a comprehensive description for each component before moving to code generation.
</aside>
7. Font Detection:
<aside>
Font Recognition Challenges
- Current models have limited font-identification capabilities; accurate font matching would require extensive additional training data and model development (see https://huggingface.co/gaborcselle/font-identifier). For now, we rely on font approximation, sketched below.
</aside>
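As a purely hypothetical illustration of font approximation, the sketch below maps coarse traits from the LLM's description (serif vs. sans-serif, monospace) onto common CSS font stacks; the trait names and the mapping are assumptions, not the exact rules we use.

```python
def approximate_font_stack(serif: bool, monospace: bool = False) -> str:
    """Pick a reasonable CSS font-family stack from coarse font traits."""
    if monospace:
        return "'Courier New', Courier, monospace"
    if serif:
        return "Georgia, 'Times New Roman', serif"
    return "'Helvetica Neue', Arial, sans-serif"


# Example: a description mentioning "clean sans-serif headings"
print(approximate_font_stack(serif=False))  # 'Helvetica Neue', Arial, sans-serif
```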