On understanding how machines perceive vibes, aesthetics, and meaning
Lately, I've been thinking about how images are represented in feature space:
Where in a neural network does the vibe of an image actually reside?
This started while playing with same.energy, a visual search engine powered by CLIP embeddings. Compared to Pinterest, its results feel... more semantically fixated. Less stylistic drift, more concept lock-in. And that got me wondering:
Why does one model generalize toward style, and another toward meaning?
Is it how CLIP encodes images? Or something about the final similarity search?
This small difference pulled me deeper into the mechanics of visual representation in diffusion models—and eventually into studies of human vision, LoRA fine-tuning, and latent space disentanglement.
Inspiration
Same.energy is a visual search engine that uses CLIP and vector similarity search to find and display images based on learned embeddings of images and text.
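For concreteness, here's a minimal sketch of how a CLIP-plus-nearest-neighbor pipeline like this can work: embed each image once, normalize, and rank candidates by cosine similarity. The model checkpoint, the tiny in-memory "index", and the file paths are my own assumptions for illustration; same.energy's actual stack is not public.

```python
# Minimal sketch: CLIP image embeddings + cosine-similarity search.
# Assumes the openai/clip-vit-base-patch32 checkpoint and a toy in-memory index.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    # Unit-normalize so a dot product equals cosine similarity.
    return feats / feats.norm(dim=-1, keepdim=True)

# The "index" here is just a matrix of normalized embeddings;
# a real system would use an ANN library (FAISS, ScaNN, etc.).
index = embed_images(["img_001.jpg", "img_002.jpg", "img_003.jpg"])

def search(query_path, top_k=2):
    q = embed_images([query_path])
    scores = (q @ index.T).squeeze(0)           # cosine similarities
    return scores.topk(top_k).indices.tolist()  # positions of nearest neighbors

print(search("query.jpg"))
```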
How is this different from Pinterest?
Functionally, they're similar: both surface visual results and let users explore related imagery. Pinterest has been evolving its visual search algorithm around object recognition and product discovery since 2014. But after spending time with same.energy—and as an avid Pinterest user—I noticed something different:
Same.energy seems to fixate on semantic meaning. Pinterest tends to generalize better to stylistic variation.
Here's what I mean:
Pinterest returns images with diverse subjects—fish, boats, ballerinas—while same.energy focuses on a single type of composition: mostly a solitary figure. I saw this across many queries.
Of course, Pinterest's model is proprietary, so it's possible their search algorithm explicitly diversifies results at the last stage. But this behavior got me thinking:
How exactly are images embedded and organized in latent space?
And where—if anywhere—does a vibe live inside the model?
Related Work
Recent research by Frenkel et al. explores a LoRA-based fine-tuning method for disentangling style and semantic content in SDXL. By training lightweight adapters (LoRAs) in specific layers of Stable Diffusion, they found that certain layers specialize in texture and color, while others focus on structure.
Though their experiments focused on SDXL, their findings raise questions for other diffusion architectures too, such as Diffusion Transformers (DiTs). Are there "modules" inside these models that correspond to artistic concepts? Can we trace which layers encode color grading, or spatial rhythm, or emotional tone?
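To make the mechanism concrete, here's a hedged sketch of attaching LoRA adapters to only a chosen subset of the SDXL UNet's attention projections with diffusers and peft, leaving everything else frozen. The targeted block and the rank are illustrative assumptions on my part, not the specific layers or settings the paper identifies.

```python
# Sketch: layer-specific LoRA fine-tuning on SDXL.
# Assumption: up_blocks.0 is just an illustrative target; training would
# reveal what that block actually learns (style-like vs. content-like).
import torch
from diffusers import StableDiffusionXLPipeline
from peft import LoraConfig

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)

# Target only the to_q/to_k/to_v/to_out projections inside up_blocks.0.
lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    init_lora_weights="gaussian",
    target_modules=r"up_blocks\.0\..*attn\d\.to_(q|k|v|out\.0)",
)
pipe.unet.add_adapter(lora_config)

# Only the injected LoRA parameters are trainable; the base UNet stays frozen.
trainable = [n for n, p in pipe.unet.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable LoRA tensors, e.g. {trainable[0]}")
```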
Human Vision and Interpretability: A Parallel
I find this kind of layer-wise specialization fascinating because it parallels how the human visual system works.
- The retina captures light and detail
- Cones specialize in color
- Higher brain regions interpret structure and semantics
If someone suffers a stroke in the occipital lobe, they may still receive visual input, but lose the ability to process what they're seeing.
It reminds me that perception isn't just about what's in front of us; it's also shaped by memory, attention, and learned priors. The same is true for generative models.
What is a vibe?
A distinctive feeling or quality that can be sensed—but not easily defined.
In visual art—especially in 2D media like painting or photography—artists use techniques like color theory, rhythm, composition, balance, form, and texture to create emotional effects. Yet the final experience is always subjective. There's a gap between the artist's intention and the viewer's perception.
That ambiguity is kind of the point. But it also makes controlling a model's output harder.
The Problem I See
As a visual thinker, I've always struggled to express emotion purely through words, especially during my teenage years as an immigrant, when language itself felt limiting. I've long believed that text alone can't fully convey mood.
Modern text-to-image models ask users to verbalize their visual intent. That's a mismatch.
There's a cold-start problem baked into the process: you imagine an image in your head, but the moment you translate it into words, you've already distorted it. You've lost something.
Even with tools like ControlNet, the gap between what you intend and what you get remains wide.
So maybe it's time to ask:
Can we create new interfaces—or new embeddings—that allow people to search, steer, and create visually, without needing to describe everything in language first?
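One speculative sketch of what that could look like, building on the CLIP index from the earlier example: start from an example image's embedding, nudge it toward a second reference image, and retrieve neighbors of the blend. The helper and its blend weight are hypothetical, just one way to "steer" visually without writing a prompt.

```python
# Speculative sketch: visual steering in embedding space, no text prompt.
# Assumes unit-normalized CLIP image embeddings like those produced by
# embed_images() and the `index` matrix in the first sketch.
import torch

def steer_and_search(anchor_emb, reference_emb, index, weight=0.3, top_k=3):
    """Blend two (1, d) image embeddings and rank the (n, d) index by cosine similarity.

    `weight` acts as a dial: 0 stays at the anchor image, 1 moves fully
    toward the reference image.
    """
    blended = (1 - weight) * anchor_emb + weight * reference_emb
    blended = blended / blended.norm(dim=-1, keepdim=True)
    scores = (blended @ index.T).squeeze(0)
    top_k = min(top_k, index.shape[0])
    return scores.topk(top_k).indices.tolist()
```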