Science With Stable Diffusion, you may never believe what you see online again

  • 🏰 The Fediverse is up. If you know, you know.
  • Want to keep track of this thread?
    Accounts can bookmark posts, watch threads for updates, and jump back to where you stopped reading.
    Create account

AI image generation is here in a big way. A newly released open source image synthesis model called Stable Diffusion allows anyone with a PC and a decent GPU to conjure up almost any visual reality they can imagine. It can imitate virtually any visual style, and if you feed it a descriptive phrase, the results appear on your screen like magic.

Some artists are delighted by the prospect, others aren't happy about it, and society at large still seems largely unaware of the rapidly evolving tech revolution taking place through communities on Twitter, Discord, and Github. Image synthesis arguably brings implications as big as the invention of the camera—or perhaps the creation of visual art itself. Even our sense of history might be at stake, depending on how things shake out. Either way, Stable Diffusion is leading a new wave of deep learning creative tools that are poised to revolutionize the creation of visual media.

The rise of deep learning image synthesis​

Stable Diffusion is the brainchild of Emad Mostaque, a London-based former hedge fund manager whose aim is to bring novel applications of deep learning to the masses through his company, Stability AI. But the roots of modern image synthesis date back to 2014, and Stable Diffusion wasn't the first image synthesis model (ISM) to make waves this year.

In April 2022, OpenAI announced DALL-E 2, which shocked social media with its ability to transform a scene written in words (called a “prompt”) into myriad visual styles that can be fantastic, photorealistic, or even mundane. People with privileged access to the closed-off tool generated astronauts on horseback, teddy bears buying bread in ancient Egypt, novel sculptures in the style of famous artists, and much more.

Not long after DALL-E 2, Google and Meta announced their own text-to-image AI models. MidJourney, available as a Discord server since March 2022 and open to the public a few months later, charges for access and achieves similar effects but with a more painterly and illustrative quality as the default.

Then there's Stable Diffusion. On August 22, Stability AI released its open source image generation model that arguably matches DALL-E 2 in quality. It also launched its own commercial website, called DreamStudio, that sells access to compute time for generating images with Stable Diffusion. Unlike DALL-E 2, anyone can use it, and since the Stable Diffusion code is open source, projects can build off it with few restrictions.

In the past week alone, dozens of projects that take Stable Diffusion in radical new directions have sprung up. And people have achieved unexpected results using a technique called "img2img" that has "upgraded" MS-DOS game art, converted Minecraft graphics into realistic ones, transformed a scene from Aladdin into 3D, translated childlike scribbles into rich illustrations, and much more. Image synthesis may bring the capacity to richly visualize ideas to a mass audience, lowering barriers to entry while also accelerating the capabilities of artists that embrace the technology, much like Adobe Photoshop did in the 1990s.

You can run Stable Diffusion locally yourself if you follow a series of somewhat arcane steps. For the past two weeks, we've been running it on a Windows PC with an Nvidia RTX 3060 12GB GPU. It can generate 512×512 images in about 10 seconds. On a 3090 Ti, that time goes down to four seconds per image. The interfaces keep evolving rapidly, too, going from crude command-line interfaces and Google Colab notebooks to more polished (but still complex) front-end GUIs, with much more polished interfaces coming soon. So if you're not technically inclined, hold tight: Easier solutions are on the way. And if all else fails, you can try a demo online.

How stable diffusion works​

Broadly speaking, most of the recent wave of ISMs use a technique called latent diffusion. Basically, the model learns to recognize familiar shapes in a field of pure noise, then gradually brings those elements into focus if they match the words in the prompt.

To get started, a person or group training the model gathers images with metadata (such as alt tags and captions found on the web) and forms a large data set. In Stable Diffusion's case, Stability AI uses a subset of the LAION-5B image set, which is basically a huge image scrape of 5 billion publicly accessible images on the Internet. Recent analysis of the data set shows that many of the images come from sites such as Pinterest, DeviantArt, and even Getty Images. As a result, Stable Diffusion has absorbed the styles of many living artists, and some of them have spoken out forcefully against the practice. More on that below.

Next, the model trains itself on the image data set using a bank of hundreds of high-end GPUs such as the Nvidia A100. According to Mostaque, Stable Diffusion cost $600,000 to train so far (estimates of training costs for other ISMs typically range in the millions of dollars). During the training process, the model associates words with images thanks to a technique called CLIP (Contrastive Language–Image Pre-training), which was invented by OpenAI and announced just last year.

Through training, an ISM using latent diffusion learns statistical associations about where certain colored pixels usually belong in relation to each other for each subject. So it doesn't necessarily "understand" their relationship at a high level, but the results can still be stunning and surprising, making inferences and style combinations that seem very intelligent. After the training process is complete, the model never duplicates any images in the source set but can instead create novel combinations of styles based on what it has learned. The results can be delightful and wildly fun.

At the moment, Stable Diffusion doesn't care if a person has three arms, two heads, or six fingers on each hand, so unless you're a wizard at crafting the text prompts necessary to generate great results (which AI artists sometimes call "prompt engineering"), you'll probably need to generate lots of images and cherry-pick the best ones. Keep in mind that the more a prompt matches captions for known images in the data set, the more likely you'll get the result you want. In the future, it's likely that models will improve enough to reduce the need for cherry-picking—or some kind of internal filter will do the picking for you.

Ethical and legal concerns abound​

As hinted above, Stable Diffusion's public release has raised alarm bells among people who fear its cultural and economic impact. Unlike DALL-E 2, Stable Diffusion's training data (the "weights") are available for anyone to use without any hard restrictions. The official Stable Diffusion release (and DreamStudio) includes automatic "NSFW" filters (nudity) and an invisible tracking watermark embedded in the images, but these restrictions can easily be circumvented in the open source code. This means Stable Diffusion can be used to create images that OpenAI currently blocks with DALL-E 2: propaganda, violent imagery, pornography, images that potentially violate corporate copyright, celebrity deepfakes, and more. In fact, there are already some private Discord servers dedicated to pornographic output from the model.

To be clear, Stable Diffusion's license officially forbids many of these uses, but with the code and weights out in the wild, enforcement will prove very difficult, if not impossible. When presented with these concerns, Mostaque said that he feels the benefits of having this kind of tool out in the open where it can be scrutinized outweigh the potential drawbacks. In a short interview, he told us, "We believe in individual responsibility and agency. We included an ethical use policy and tools to mitigate harm."

Also, Stable Diffusion has drawn the ire of artists on Twitter due to the model's ability to imitate the style of living artists, as mentioned above. (And despite the claims of some viral tweets, Stability AI has never advertised this ability. One of the most shared tweets mistakenly pulled from an independent study done by an AI researcher.) In the quest for data, the image set used to train Stable Diffusion includes millions of pieces of art gathered from living artists without consultation with the artists, which raises profound ethical questions about authorship and copyright. Scraping the data appears lawful by US legal precedent, but one could argue that the law might be lagging behind rapidly evolving technology that upends previous assumptions about how public data might be utilized.

As a result, if image synthesis technology becomes adopted by major corporations in the future (which may be coming soon—"We have a collaborative relationship with Adobe," says Mostaque), companies might train their own models based on a "clean" data set that includes licensed content, opt-in content, and public domain imagery to avoid some of these ethical issues, even if using an Internet scrape is technically legal. We asked Mostaque if he had any plans along these lines, and he replied, "Stability is working on a range of models. All models by ourselves and our collaborators are legal within their jurisdictions."

Another issue with diffusion models from all vendors is cultural bias. Since these ISMs currently work by scraping the Internet for images and their related metadata, they learn social and cultural stereotypes present in the data set. For example, early on in the Stable Diffusion beta on its Discord server, testers found that almost every request for a "beautiful woman" involved unintentional nudity of some kind, which reflects how Western society often depicts women on the Internet. Other cultural and racist stereotypes abound in ISM training data, so researchers caution that it should not be used in a production environment without significant safeguards in place, which is likely one reason why other powerful models such as DALLE-2 and Google's Imagen are still not broadly available to the public.

While concerns about data set quality and bias echo strongly among some AI researchers, the Internet remains the largest source of images with metadata attached. This trove of data is freely accessible, so it will always be a tempting target for developers of ISMs. Attempting to manually write descriptive captions for millions or billions of images for a brand-new ethical data set is probably not economically feasible at the moment, so it's the heavily biased data on the Internet that is currently making this technology possible. Since there is no universal worldview across cultures, to what degree image synthesis models filter or interpret certain ideas will likely remain a value judgment among the different communities that use the technology in the future.

What comes next​

If historical trends in computing are any suggestion, odds are high that what now takes a beefy GPU will eventually be possible on a pocket smartphone. "It is likely that Stable Diffusion will run on a smartphone within a year," Mostaque told us. Also, new techniques will allow training these models on less expensive equipment over time. We may soon be looking at an explosion in creative output fueled by AI.

Stable Diffusion and other models are already starting to take on dynamic video generation and manipulation, so expect photorealistic video generation via text prompts before too long. From there, it's logical to extend these capabilities to audio and music, real-time video games, and 3D VR experiences. Soon, advanced AI may do most of the creative heavy lifting with just a few suggestions. Imagine unlimited entertainment generated in real-time, on demand. "I expect it to be fully multi-modal," said Mostaque, "so you can create anything you can imagine, like the Star Trek holodeck experience."

ISMs are also a dramatic form of image compression: Stable Diffusion takes hundreds of millions of images and squeezes knowledge about them into a 4.2GB weights file. With the correct seed and settings, certain generated images can be reproduced deterministically. One could imagine using a variation of this technology in the future to compress, say, an 8K feature film into a few megabytes of text. Once that's the case, anyone could compose their own feature films that way as well. The implications of this technology are only just beginning to be explored, so it may take us in wild new directions we can't foresee at the moment.

Realistic image synthesis models are potentially dangerous for reasons already mentioned, such as the creation of propaganda or misinformation, tampering with history, accelerating political division, enabling character attacks and impersonation, and destroying the legal value of photo or video evidence. In the AI-powered future, how will we know if any remotely produced piece of media came from an actual camera, or if we are actually communicating with a real human? On these questions, Mostaque is broadly hopeful. "There will be new verification systems in place, and open releases like this will shift the public debate and development of these tools," he said.

That's easier said than done, of course. But it's also easy to be scared of new things. Despite our best efforts, it's difficult to know exactly how image synthesis and other AI-powered technologies will affect us on a societal scale without seeing them in wide use. Ultimately, humanity will adapt, even if our cultural frameworks end up changing radically in the process. It has happened before, which is why the Ancient Greek philosopher Heraclitus reportedly said, "The only constant is change."

In fact, there's a photo of him saying that now, thanks to Stable Diffusion.
 
Check this shit out. StableDiffusion photoshop plugin in action:
 
Check this shit out. StableDiffusion photoshop plugin in action:
5z9DXAzPIkAgf3Rp.mp4
I'm usually cynical about AI generated art, since DALLE-2 stuff usually looks kinda weird and uncanny, but being able to point to a Japanese castle and basically say "turn this into a blue house" and it looks alright, or just the idea of roughly MS Painting a hill onto an already existing picture and telling the computer to make it look nice and it sorta kinda does that, is fucking crazy to me.

I think the coolest shit is going to be coming from artists who get over themselves and use AI generation as another tool, and don't listen to the contingent which will undoubtedly grow of """real artists""" who refuse to use it at all.
 
Last edited:
I already don't.

I only believe what @CatParty tells me.

Yes.

Screenshot 2022-09-10 at 02.45.15.png
 
They are just upset that I can generate endless skeletons smoking cigars, and they can do nothing about it!
 

Attachments

  • Skeleton_smoking_a_cigar_while_playing_poker_paint_3dade96f.png
    Skeleton_smoking_a_cigar_while_playing_poker_paint_3dade96f.png
    438.1 KB · Views: 16
  • Skeleton_smoking_a_cigar_while_playing_poker_paint_66489190.png
    Skeleton_smoking_a_cigar_while_playing_poker_paint_66489190.png
    429.7 KB · Views: 15
It's again not quite there yet. The user has to do a brunt of the work, alone trying to find a prompt that works, then doing photo editing to make it just right. Stable diffusion and it's branches won't accomplish photoreal fakery for a few years minimum. Then you can make articles saying this.
 
I like the weird shit anon's are making on 4chan(yes it's all porn).

I'm waiting for a hardrive to come and the first thing im gonna do is set one of those stable diffusion shit's up and use it to make Pepe's

/g has a setup guide for it if you want to try it out

This is a pretty good guide as well.


Learning a lot on how to actually guide it to give you something resembling what you want, it's finnicky but it can be very impressive. Output speed is crazy compared to what used to be available. 512x512 images every few seconds, and you can have it set to generate several from several different prompts.
 
I like the weird shit anon's are making on 4chan(yes it's all porn).

I'm waiting for a hardrive to come and the first thing im gonna do is set one of those stable diffusion shit's up and use it to make Pepe's

/g has a setup guide for it if you want to try it out

I knew things were changing but I didn't imagine they'd change this fast. Incredible.
hkjh.png

Edit: if the video doesn't load please let me know.
 
Looks like art is going back to the time where knowing how to make it look good and technically feasible is necessary. Much like those old gothic buildings that not only looked good, but also stable.

And the only ones whining about this are hacks who weren' t artists in the first place.
 
Looks like art is going back to the time where knowing how to make it look good and technically feasible is necessary. Much like those old gothic buildings that not only looked good, but also stable.

And the only ones whining about this are hacks who weren' t artists in the first place.
This is going to be bad for real artists too, especially artists just starting out now. I think the worst hit will be anybody selling stuff to corporations, the only positive might be the death of the algeria artstyle. If it's effortless to make actual good art with a few words, why make shitty clipart? That's me being an optimist though.
 
This is going to be bad for real artists too, especially artists just starting out now. I think the worst hit will be anybody selling stuff to corporations, the only positive might be the death of the algeria artstyle. If it's effortless to make actual good art with a few words, why make shitty clipart? That's me being an optimist though.
Its deliberate. Seen how the pepsi logo got made? Its alot of pseudo-science shit that was no doubt used to sway investor minds.

But the Alegria shit is deliberate. Part of the reason why the characters are ugly nonsensical blobs is to enforce the communist shit present in Tumblr and outward (no races and shiet). While I can see some companies going right back classic Coca-Cola ads, there's gonna be plenty who stick with Alegria. There's also the simple fact that Alegria is so simple that they can hire anyone (usually their friends) to fill in the roles. Sadly, it also applies to this too.

Which sadly means, we may just see AI art dominate art museums like the Louvre. If anything, AI tools like this should be used to improve one's own skillset and as a guide. Not as a crutch. A robot after all, is a tool that can do more than one action. And if the one behind the tool is an absolute retard, there's only so much the tool can help.
 
Its deliberate. Seen how the pepsi logo got made? Its alot of pseudo-science shit that was no doubt used to sway investor minds.

But the Alegria shit is deliberate. Part of the reason why the characters are ugly nonsensical blobs is to enforce the communist shit present in Tumblr and outward (no races and shiet). While I can see some companies going right back classic Coca-Cola ads, there's gonna be plenty who stick with Alegria. There's also the simple fact that Alegria is so simple that they can hire anyone (usually their friends) to fill in the roles. Sadly, it also applies to this too.

Which sadly means, we may just see AI art dominate art museums like the Louvre. If anything, AI tools like this should be used to improve one's own skillset and as a guide. Not as a crutch. A robot after all, is a tool that can do more than one action. And if the one behind the tool is an absolute retard, there's only so much the tool can help.
I'm pretty sure the real reason why limb proportions don't matter in Alegria is so any new hire can do it regardless of skill. Faster to learn and faster to make than more complex approaches. And if proportions and colors are always off, it can always be passed off as the "style."
Edit: Reread your post and realized you said the style is intentionally dumbed down so that just about anyone can make it. I feel dumb now.
 
Regarding the 'replacement of artists'. I've been using it as a form of inspiration, then I plan to do my own take on that 'idea' later in my own style. For example, I was trying to get it to generate something like those Morrowind concept pieces for ESO, and as a result I combined mountain and tree at some point, but the AI saw that as "mountain that is a tree" and now I have this entire new niche of cool looking inspiration to use.

1662787635854.png1662787710716.png1662787747553.png

It also at the same time reflects the prompts art style, so by using Morrowind as one of the words, it creates these warm ashy landscapes. But for one of them I replaced Morrowind with Skyrim and its very clear in the final image

1662787862586.png

and as a final bonus here is one where I put Twilight Princess, you can see how the entire colour palette shifts to match that seen in the game
1662787979228.png

Basically, what I'm trying to say is that any artist who is actually an artist would simply use this as a new tool, something to work from, to use as inspiration. Anyone who is actually afraid this is going to replace them is probably not a very good artist in the first place. AI is not going to replace creativity, I had to do a bunch of work to get a prompt that outputs these kinds of results, and even then they fail to hold up when upscaled.
 
Regarding the 'replacement of artists'. I've been using it as a form of inspiration, then I plan to do my own take on that 'idea' later in my own style. For example, I was trying to get it to generate something like those Morrowind concept pieces for ESO, and as a result I combined mountain and tree at some point, but the AI saw that as "mountain that is a tree" and now I have this entire new niche of cool looking inspiration to use.


It also at the same time reflects the prompts art style, so by using Morrowind as one of the words, it creates these warm ashy landscapes. But for one of them I replaced Morrowind with Skyrim and its very clear in the final image


and as a final bonus here is one where I put Twilight Princess, you can see how the entire colour palette shifts to match that seen in the game

Basically, what I'm trying to say is that any artist who is actually an artist would simply use this as a new tool, something to work from, to use as inspiration. Anyone who is actually afraid this is going to replace them is probably not a very good artist in the first place. AI is not going to replace creativity, I had to do a bunch of work to get a prompt that outputs these kinds of results, and even then they fail to hold up when upscaled.
But the work done to find a good prompt is still a hell of a lot less work than it takes to make a piece of art like this yourself, and certainly cheaper than paying someone to do it. As far as the quality argument, ai generation looked like complete shit a few years ago and at way lower resolutions. Not to mention how uncreative these were originally, just essentially being made to try and generate faces of cats and dogs and failing miserably, and at lower resolutions. Now they almost effortlessly can generate headshots of people that are not real and with a few words make almost anything you want based off of decades of work put on the internet by real humans. Even if the current models fail under scrutiny when upscaled, future hardware will make it possible to generate them at higher resolutions in the first place. I also agree that ai work will never fully replace artists, but is sure as hell going to make it harder for them.
 
This was already true with deepfakes and Photoshop to a lesser extent. Fake photos are far from new.
 
This was already true with deepfakes and Photoshop to a lesser extent. Fake photos are far from new.
Deepfakes were still only useful on mostly still, human subjects. This technique later improved to work better on moving subjects and more cartoonish stuff, though I'd argue not that impressively yet. This is generating a decent resolution image that can clone any piece of art you could want, and it will only get better. It takes mere seconds for something to be made from a few words of your choice. I haven't been impressed by this stuff in a while but this being able to be ran off of my pc with decent specs from a few years ago and compete with the results of a cloud based model is crazy.
 
Back
Top Bottom