Researchers examined how pairs of models behave when they alternately generate an image from a text description and then describe that image in words, in a setup similar to the children's game of telephone.
The team began with 100 short prompts produced by a search algorithm to cover a wide range of themes, such as a person alone in nature finding an old book written in a forgotten language.
For each prompt, the image generator Stable Diffusion XL created an image, which was then passed to the vision-language model LLaVA to produce a new textual description that guided the next image, forming a closed loop of alternating image and text.
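The loop is simple to reproduce in outline. The Python sketch below shows one way such an alternating text-to-image and image-to-text loop could be wired together with off-the-shelf libraries; the model checkpoints, the captioning prompt, and the loop length are illustrative assumptions, not the authors' exact configuration.

    # Illustrative sketch of the alternating image-text loop described above.
    # Model IDs and the captioning prompt are assumptions, not the paper's setup.
    import torch
    from diffusers import StableDiffusionXLPipeline
    from transformers import AutoProcessor, LlavaForConditionalGeneration

    device = "cuda"

    # Text-to-image model (Stable Diffusion XL)
    sdxl = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to(device)

    # Image-to-text model (LLaVA)
    processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
    llava = LlavaForConditionalGeneration.from_pretrained(
        "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
    ).to(device)

    def describe(image):
        # Ask LLaVA for a caption that becomes the next generation prompt.
        query = "USER: <image>\nDescribe this image in one detailed sentence. ASSISTANT:"
        inputs = processor(images=image, text=query, return_tensors="pt").to(device, torch.float16)
        out = llava.generate(**inputs, max_new_tokens=80)
        text = processor.decode(out[0], skip_special_tokens=True)
        return text.split("ASSISTANT:")[-1].strip()

    prompt = "A person alone in nature finds an old book written in a forgotten language."
    history = [prompt]
    for step in range(100):              # roughly where the loops were observed to settle
        image = sdxl(prompt).images[0]   # text -> image
        prompt = describe(image)         # image -> text, closing the loop
        history.append(prompt)

Logging every intermediate prompt and image, as the history list does here, is what lets one inspect how far the loop has drifted from the seed description.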
Lead author Arend Hintze of Dalarna University in Sweden expected the loops might fluctuate while staying close to the original description, for example by repeatedly rendering a mountain with a village on it.
Instead, after about 100 back-and-forth steps, the systems routinely drifted away from the initial prompts and settled into one of 12 recurring categories, including gothic cathedrals, natural landscapes, sports scenes, urban night views, stormy lighthouses, and rustic interiors.
This convergence proved robust: it persisted when the researchers used longer, more detailed seed prompts and when they increased randomness in the models' decision-making settings.
In one example starting from the prompt about a prime minister working to sell a fragile peace deal under the pressure of impending military action, the first generated image showed a suited man overlaid on newsprint, but by the 34th iteration the scene had shifted to a classical library and by the 100th to a plush sitting room with red furnishings.
To test how general this tendency is, the team repeated the experiment with four different image-generation models and four different image-description models, yet the loops still converged on the same limited set of generic visual themes.
Hintze argues that these tendencies likely arise from biases in the training data, which contain large numbers of photographs of familiar subjects such as sports, architecture, and scenic landscapes that people frequently choose to capture.
When the researchers extended the loops up to 1,000 cycles, the images usually stabilized after roughly 100 iterations but could suddenly jump to a different broad motif hundreds of steps later, indicating that the systems can transition between a small number of attractor states.
Hintze notes that once a motif is established it appears stable for long stretches, but it remains unclear whether certain types of imagery consistently emerge before others, that is, whether the loops follow a preferred progression between motifs.
The results indicate that, without human input, current generative systems tend to reduce visual diversity rather than expand it, reinforcing common patterns instead of exploring unusual or conceptually driven imagery.
The authors argue that keeping humans actively involved in the creative loop will be important if artificial intelligence is to support cultural variety rather than amplify conformity.
They also point to a need for model-level mechanisms that counteract this convergence and better separate the processes of generating novel content and evaluating its interest or aesthetic value.
Hintze describes creativity as the combination of producing something new and then judging whether it is interesting, beautiful, or stimulating, and concludes that existing systems currently perform the generative part well but lack the evaluative filtering that humans apply.
Research Report: Autonomous language-image generation loops converge to generic visual motifs
Related Links
Cell Press journal Patterns