The Problem
Generating images with AI sounds attractive for professional contexts. It is cost-efficient, fast, and independent of many external constraints. But simple prompts rarely produce results that meet serious aesthetic standards. The outputs are probabilistic by nature: without careful preparation, the model has no idea what makes a specific brand look like itself.
Fig. 1 — Four outputs of the naive prompt: "Generate a photo for Horizn Studios, a premium travel brand."
In practice this showed up immediately. Early generations included Lufthansa logos on aircraft in the background, ICE trains pulling into frame, hotel lobbies and airport terminals. The model fills in the gaps with whatever it associates with "travel", which is far too generic to describe any specific brand.
The challenge is to communicate a brand's visual identity to the model as concretely as possible. Not just a mood, but specific signals: how scenes are lit, what colors appear where, how people are framed, what kind of story the image is telling.
The Idea
A brand's website already contains most of this information: the words the brand uses, the emotions they evoke, the imagery it chooses. All of it is there, just unstructured. The question is whether it can be extracted and made usable automatically.
This project attempts exactly that. Using the Berlin-based luggage brand Horizn Studios as a test case, text and images from selected pages of their website were scraped and analysed to build a structured brand profile: a machine-readable description of how the brand looks and feels. That profile then drives image generation prompts instead of a human writing them from scratch.
Fig. 2 — Pipeline overview: from URL to structured brand prompt.
The pipeline draws on several analytical frameworks. Kansei Engineering maps emotional positioning across twelve bipolar dimensions such as premium vs. accessible or minimal vs. ornate. Alongside that, the pipeline extracts primary settings, target-audience archetypes based on Sinus-Milieus, a structured color palette with role assignments, and photographic signals from lifestyle imagery: shot sizes, lighting conditions, composition styles, and people presence. These signals are sampled together and assembled into a structured prompt that gives the model concrete, role-specific instructions rather than a vague description.
Building the Pipeline
Scraping
The starting point is a scraper built with Playwright and BeautifulSoup. Playwright renders each page in a headless Chromium browser, which is necessary for Shopify storefronts where most content is injected dynamically. For Horizn Studios, six pages were scraped: the homepage, the about-us page, three editorial pages covering quality, sustainability efforts, and partners, and a product spotlight.
Text is filtered during collection: navigation elements, footers, size guides, and other non-editorial content are excluded. What remains is grouped into semantic chunks, each built around a heading and its associated body text. The result is a single corpus.json containing 58 text chunks and all usable image URLs found across those pages.
$ python pipeline.py
corpus.json — 6 pages scraped
────────────────────────────────────────────────
about_us 10 chunks 21 images
homepage 12 chunks 18 images
more_about_quality 8 chunks 6 images
more_about_our_efforts 9 chunks 14 images
more_about_our_partners 11 chunks 19 images
spotlight_sofo_backpack 8 chunks 9 images
────────────────────────────────────────────────
Total 58 chunks 87 images
Sample chunk [0] · about_us
heading : Introduction
text : "We seek to build bridges and make connections.
Our mission is to enable curious minds to
travel the world consciously and seamlessly."
Fig. 3 — corpus.json structure after scraping six Horizn Studios pages.
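The chunk-grouping step described above can be sketched in plain Python. This is a hypothetical simplification, assuming the filtered editorial content arrives as (tag, text) pairs; the function name and input format are illustrative, not the pipeline's actual API.

```python
def group_into_chunks(elements):
    """Group (tag, text) pairs into heading-anchored semantic chunks.

    Every heading starts a new chunk; body text accumulates under
    the most recent heading, mirroring the corpus.json structure.
    """
    chunks = []
    current = None
    for tag, text in elements:
        if tag in ("h1", "h2", "h3"):
            current = {"heading": text, "body": []}
            chunks.append(current)
        elif current is not None:
            current["body"].append(text)
    # Join each chunk's body paragraphs into one text field
    return [
        {"heading": c["heading"], "text": " ".join(c["body"])}
        for c in chunks
    ]

elements = [
    ("h2", "Introduction"),
    ("p", "We seek to build bridges and make connections."),
    ("p", "Our mission is to enable curious minds to travel."),
    ("h2", "Our Story"),
    ("p", "Founded in Berlin."),
]
# → two chunks, each with a heading and its joined body text
```

Non-editorial elements (navigation, footers, size guides) would be dropped before this step ever sees them.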
Text Analysis
Each text chunk is passed individually to Gemma 4 running via Ollama. The model analyses each chunk against four established frameworks: Kansei Engineering for product perception across twelve bipolar dimensions, the Geneva Emotion Wheel for emotional impact, IPTC NewsCodes for setting classification, and Sinus-Milieus for target-audience archetypes. For each chunk and dimension, the model returns a confidence-weighted vote; neutral responses are treated as abstentions. The winning signal per dimension represents the brand's position on that axis.
brand_profile.json — Kansei profile
────────────────────────────────────────────────────────
dimension result signal strength
────────────────────────────────────────────────────────
modern_traditional modern ████████████ 36.5
minimal_ornate minimal ████████░░░░ 12.7
functional_decorative functional ████████████ 26.6
bold_subtle bold ██░░░░░░░░░░ 5.3
urban_natural urban ██░░░░░░░░░░ 5.3
premium_accessible premium ████████████ 38.3
serious_playful serious ░░░░░░░░░░░░ 1.8
timeless_trendy timeless ████░░░░░░░░ 10.6
individual_collective collective █░░░░░░░░░░░ 3.5
rough_refined refined █████████░░░ 23.4
dynamic_static dynamic █████░░░░░░░ 12.1
transparent_mysterious transparent ██░░░░░░░░░░ —
────────────────────────────────────────────────────────
profile_confidence 0.844
Fig. 4 — Kansei profile extracted from 58 text chunks. Signal strength reflects accumulated confidence-weighted votes.
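The per-dimension vote aggregation can be sketched as follows. This is a minimal illustration, assuming each chunk contributes a (pole, confidence) vote and signal strength is the winner's accumulated confidence, as Fig. 4's caption describes; the function name and sample values are invented.

```python
from collections import defaultdict

def aggregate_dimension(votes):
    """Aggregate (pole, confidence) votes for one bipolar dimension.

    Neutral responses abstain; the pole with the highest accumulated
    confidence wins. Returns (winning_pole, signal_strength).
    """
    totals = defaultdict(float)
    for pole, confidence in votes:
        if pole == "neutral":
            continue  # neutral responses are treated as abstentions
        totals[pole] += confidence
    if not totals:
        return None, 0.0
    winner = max(totals, key=totals.get)
    return winner, totals[winner]

# Example votes for the premium_accessible dimension (illustrative values)
votes = [("premium", 0.5), ("premium", 0.75),
         ("accessible", 0.25), ("neutral", 1.0)]
# → "premium" wins with accumulated strength 1.25
```

Running this over all 58 chunks and all twelve dimensions would yield a profile shaped like Fig. 4.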
Vision Analysis
Not every image on a brand website is useful: product packshots, icons, and UI elements carry no lifestyle signal. The pipeline filters these out in two passes: first by URL pattern, then by asking the vision model directly whether the image shows the product in a real-life context. Only images that pass both filters are analysed in full for shot size, lighting, composition, color, and narrative style.
Fig. 5 — URL-based packshot filter. Top row: removed. Bottom row: passed to vision analysis.
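The two-pass filter might look like this. The URL patterns below are illustrative guesses at what a packshot filter could match, and the second pass is shown as a pluggable predicate standing in for the actual vision-model call.

```python
import re

# Pass 1: URL substrings that typically indicate packshots, icons,
# or UI assets (hypothetical patterns, not the pipeline's real list).
PACKSHOT_PATTERNS = re.compile(r"(packshot|icon|logo|badge|swatch)", re.IGNORECASE)

def passes_url_filter(url: str) -> bool:
    """Pass 1: reject URLs whose path suggests a non-lifestyle asset."""
    return not PACKSHOT_PATTERNS.search(url)

def is_lifestyle_image(url: str, ask_vision_model) -> bool:
    """Pass 2: only if the URL filter passes, ask the vision model
    whether the image shows the product in a real-life context.
    `ask_vision_model` abstracts the actual model call."""
    return passes_url_filter(url) and ask_vision_model(url)
```

Checking the cheap URL pattern first means the expensive vision call only runs on candidates that survive pass one.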
Aggregation
All text and vision signals flow into a single aggregation step. The model synthesises the raw votes and visual distributions into a final brand profile, including natural-language image prompt components ready for use.
A note on color extraction: during development the pipeline was also tested with a smaller local model. The general approach worked, but color values were significantly off. A color that should have been deep navy blue was returned as a bright saturated blue. The larger cloud model performed considerably better, though even that does not achieve full color accuracy. For a production context this would need a more reliable solution: either defining brand colors explicitly or replacing semantic extraction with direct pixel sampling.
Fig. 6 — Color values extracted per image by the vision model, with role assignments.
Generating the Prompts
The prompt generator reads the latest brand profile and decision log and draws a random combination of visual parameters, each weighted by how frequently it appeared in the analysed images. Setting, shot size, camera angle, lighting, composition, depth of field, narrative style: each is a slot filled from the observed data. Colors are drawn from the full pool of raw vision analysis entries rather than the aggregated palette, which gives 28 distinct values instead of five and preserves the actual variety of the brand's imagery.
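The frequency-weighted slot sampling can be sketched with the standard library. The slot names follow the text above; the option strings and weights are invented for illustration, not values from the real brand profile.

```python
import random

def draw_slots(distributions, rng=random):
    """Draw one value per slot, weighted by observed frequency.

    `distributions` maps each slot to {option: observed_count};
    more frequent options are proportionally more likely to be drawn.
    """
    return {
        slot: rng.choices(list(options), weights=list(options.values()), k=1)[0]
        for slot, options in distributions.items()
    }

# Illustrative distributions, as if counted from the analysed images
distributions = {
    "shot_size": {"medium shot": 14, "wide shot": 6, "close-up": 3},
    "lighting":  {"soft natural daylight": 18, "golden hour": 5},
    "setting":   {"urban street": 9, "minimalist interior": 7},
}
```

Each call produces a different combination, but always one skewed toward what the brand's imagery actually shows.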
The prompt is assembled as a deterministic template following the BFL prompting structure: Subject, Environment, Lighting, Composition, Colors, Mood. No second LLM call is needed, and every element is guaranteed to appear in the right order, every time.
Fig. 7 — Prompt template with randomly drawn slot values. Each run produces a different brand-consistent combination.
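The deterministic assembly step amounts to joining the drawn slot values in a fixed order. A minimal sketch, assuming each BFL section is a ready-made sentence fragment; the field contents are illustrative, not actual pipeline output.

```python
# Fixed BFL section order: Subject, Environment, Lighting,
# Composition, Colors, Mood.
BFL_ORDER = ["subject", "environment", "lighting", "composition", "colors", "mood"]

def assemble_prompt(slots: dict) -> str:
    """Join slot values into one prompt, always in BFL order.

    Raises if a section is missing, so every element is guaranteed
    to appear; no second LLM call is involved.
    """
    missing = [k for k in BFL_ORDER if k not in slots]
    if missing:
        raise ValueError(f"missing prompt slots: {missing}")
    return " ".join(slots[k] for k in BFL_ORDER)

slots = {
    "subject": "A traveller with a matte hard-shell suitcase.",
    "environment": "A minimalist urban street at street level.",
    "lighting": "Soft natural daylight, slightly overcast.",
    "composition": "Medium shot, rule-of-thirds framing, shallow depth of field.",
    "colors": "Deep navy, warm grey, and muted off-white tones.",
    "mood": "Calm, premium, quietly confident.",
}
```

Because the template is deterministic, two runs with the same drawn slots produce byte-identical prompts, which makes outputs reproducible and debuggable.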
The Results
Fig. 8 — Four generated images with their corresponding structured prompts. Model: Nano Banana 2 via ComfyUI.
Conclusion
Running the structured prompts through Nano Banana 2 produced results that are visibly more brand-consistent than the naive baseline. The model does not know Horizn Studios, but the prompt encodes enough of what the brand looks like that the output moves clearly in the right direction.
That said, the results are far from production-ready. Color accuracy remains approximate despite the role-based assignment. Shot size and composition instructions are followed inconsistently. Several details that would never appear in Horizn's actual imagery keep surfacing: people visible in the background, static poses that feel staged in the wrong way, framing that does not match the brand's energy. These are not just prompt engineering problems. They point to gaps in what the pipeline currently captures. How people move, how they carry themselves, the specific relationship between subject and background: none of this is encoded yet.
There are also more fundamental limitations. The suitcase in every prompt is described as a generic hard-shell. A real production pipeline would inject an actual product asset instead. The dataset of six pages is small. And while the prompt structure follows BFL guidelines, serious prompt engineering would go considerably further.
What the project does show is that the approach works in principle. Structured brand data produces meaningfully better outputs than a naive description. The more the pipeline knows about a brand, its visual grammar and not just its mood, the more precisely it can steer the model. The natural next step would be larger scraping scope, direct pixel sampling for color, and asset integration for the actual product.