Automation and scale are key to AI image generation
Generative image models offer limitless possibilities for content creation, creative expression, and the automation of costly, time-consuming tasks. Models such as Stable Diffusion and Midjourney are already transforming content creation in many industries, including movies, video games, and marketing, and they will permeate many more over the coming year.
However, the current process of interactively generating a handful of images at a time* is inefficient and caps the value of these models, because it reduces the chance that you ever see the best ones. For every good image posted on Reddit or Twitter, thousands have been generated and discarded. Who knows how good the images were in the unexplored parts of the model, the ten thousand images that were never seen?
We believe that to fully realize the potential of these generative models, they need to be paired with a platform that enables fast iteration over thousands of images, scaling human feedback with further AI tools and traditional automation. This includes automated training, inference-parameter-space search, ML-enabled image search, filtering, and comparison tools, and integrated assistive models such as resolution upscaling and detail correction (e.g. "face-fix").
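As a rough illustration of what scaling human feedback with further AI tools can look like, here is a minimal generate-then-filter sketch built on the open-source diffusers and transformers libraries. The model IDs, prompt, batch size, and the CLIP-based scoring step are illustrative assumptions, not a description of our platform's internals.

```python
# Minimal generate-then-filter sketch: generate many candidates, pre-filter
# with CLIP, and surface only the best few for human review.
# Assumes a CUDA GPU; all model names and values are placeholders.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "a cinematic photo of a lighthouse at dusk"  # placeholder prompt
candidates = []

# Generate many candidates with fixed seeds so the good ones are reproducible.
for seed in range(64):
    generator = torch.Generator(device=device).manual_seed(seed)
    image = pipe(prompt, generator=generator).images[0]
    candidates.append((seed, image))

# Score every candidate against the prompt with CLIP and keep the top few;
# this automated pre-filter only narrows the field, the survivors still go
# to human review.
inputs = clip_proc(
    text=[prompt], images=[img for _, img in candidates],
    return_tensors="pt", padding=True,
).to(device)
with torch.no_grad():
    scores = clip(**inputs).logits_per_image.squeeze(1)

top = sorted(zip(candidates, scores.tolist()), key=lambda x: x[1], reverse=True)[:8]
for (seed, image), score in top:
    image.save(f"candidate_seed{seed}_score{score:.1f}.png")
```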
We are a stealth-mode startup building a platform to solve these challenges. If you are interested in trying a private-alpha version of this platform, please contact us at scaled-diffusion@proton.me
We have been generating tens of thousands of images on our platform to serve a particular vertical. This experience, and the requirements it has exposed, has driven our development. It has also revealed a number of factors that significantly affect the quality of generated images and the rate at which exceptional ones occur.
Key learnings
- Gold in a haystack. Many images are good, many are bad, and some are phenomenal. To find the exceptional ones, you need to generate a large number of images.
- Fine-tune on high-quality images. Training new models by fine-tuning on high-quality datasets enables a step change in output quality. These datasets need not be very large, but they must be high quality. We have found that datasets as small as 5-10 images are enough to successfully train a model to generate images in their style (without replicating the input images).
- Training parameters are key and non-linear. Some training parameters, such as the number of training steps, have a large impact on image quality. We have found it essential to save checkpoints at training-step intervals and generate a substantial number of images from each in order to identify the best results. Much of the time the effect of a training parameter is unstable, so exploring the inference space of multiple versions of a model is key.
- Good inference parameters change with each model. You have to explore the inference-parameter space for each new model you train or target you define. Given the scale of the search space, automated exploration is essential to productivity; a sketch of such a sweep follows this list. Over time you also start to identify common bounds on parameters (e.g. 10 inference steps will almost never be enough).
- Prior preservation with real images is key for creativity. When fine-tuning a model for a particular object, style, or target, it is highly beneficial to include high-quality real images for prior preservation of the target's category (e.g. dogs, if you are training a model to generate new images of your dog). Including prior preservation in training substantially improves the diversity of results (i.e. the model's creativity). A key finding here is that using real images for this prior preservation significantly improves results, as opposed to using images generated by the base model that you are fine-tuning; the training-objective sketch after this list shows where those real images enter the loss.
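We have not named our fine-tuning method above, but a DreamBooth-style objective is one common way to implement prior preservation. The sketch below is a conceptual illustration with dummy tensors standing in for the denoising model's outputs; `prior_loss_weight` and the tensor shapes are assumptions, not our production values.

```python
# Conceptual sketch of a prior-preservation training objective.
# Dummy tensors stand in for the denoising UNet's noise predictions
# and targets so the snippet runs standalone.
import torch
import torch.nn.functional as F

prior_loss_weight = 1.0  # assumed weighting; 1.0 is a common default

def dreambooth_loss(noise_pred_instance, noise_instance,
                    noise_pred_class, noise_class):
    # Standard denoising loss on the 5-10 instance images (e.g. your dog).
    instance_loss = F.mse_loss(noise_pred_instance, noise_instance)
    # The same loss on a batch drawn from the target's broader category
    # (e.g. generic dogs). Per the finding above, this batch should come
    # from real photos, not from samples generated by the base model.
    prior_loss = F.mse_loss(noise_pred_class, noise_class)
    return instance_loss + prior_loss_weight * prior_loss

# Dummy latent-noise tensors in place of real model outputs.
shape = (4, 4, 64, 64)
loss = dreambooth_loss(torch.randn(shape), torch.randn(shape),
                       torch.randn(shape), torch.randn(shape))
print(loss.item())
```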
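And here is a minimal sketch of the kind of automated sweep over checkpoints and inference parameters described above, again using the open-source diffusers library. The checkpoint paths, parameter grid, prompt, and file naming are illustrative placeholders, not our platform's actual search strategy.

```python
# Sweep the inference-parameter grid across several training checkpoints,
# with multiple seeds per grid point. Assumes a CUDA GPU.
import itertools
import torch
from diffusers import StableDiffusionPipeline

# Hypothetical checkpoints saved at different training-step counts.
checkpoints = ["./model-step-800", "./model-step-1200", "./model-step-1600"]
guidance_scales = [5.0, 7.5, 10.0]
step_counts = [20, 30, 50]  # per the bound above, ~10 steps is almost never enough
seeds = range(8)            # several seeds per grid point

for ckpt in checkpoints:
    pipe = StableDiffusionPipeline.from_pretrained(
        ckpt, torch_dtype=torch.float16
    ).to("cuda")
    for gs, steps, seed in itertools.product(guidance_scales, step_counts, seeds):
        generator = torch.Generator(device="cuda").manual_seed(seed)
        image = pipe(
            "a photo of sks dog in a meadow",  # placeholder prompt
            guidance_scale=gs,
            num_inference_steps=steps,
            generator=generator,
        ).images[0]
        name = ckpt.split("/")[-1]
        image.save(f"{name}_gs{gs}_steps{steps}_seed{seed}.png")
```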
As you would expect, we have built our platform to address all of these points and continue to improve it daily.
Bring automation and scale to the problem. You get better results.
* In more technical terms, manually exploring the parameter space is inefficient.