Shorts: Concepts
In this section, we explain some basic concepts about Generative AI processes. If you have questions about a specific topic, you can get in touch with us and we will try to help!
Core AI Concepts
Prediction
Every time you run a model in an AI-powered app, you create a prediction. A prediction is an object that represents a single result of running the model, from an input to an output. The input can include the prompt you provided, an image, or the generation settings. The output is the text, image, sound, or other kind of result, along with metadata such as the model version, the user who created it, and timestamps.
Depending on the model and the server, a prediction can take a few milliseconds, several minutes, or even hours.
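A prediction can be pictured as a simple record. The field names below are purely illustrative and not tied to any particular platform's API:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical prediction record; field names are illustrative,
# not any specific service's schema.
@dataclass
class Prediction:
    model_version: str
    input: dict    # prompt, settings, optional input image reference
    output: list   # generated text, image URLs, etc.
    created_by: str
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

pred = Prediction(
    model_version="sdxl-1.0",
    input={"prompt": "a watercolor lighthouse", "steps": 30},
    output=["image_0.png"],
    created_by="user-42",
)
```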
Temperature or Guidance Scale
The temperature, or guidance scale (sometimes referred to as CFG, classifier-free guidance), is a parameter that controls how closely the image generation process follows the text prompt. The higher the value, the more closely the image sticks to the given text input, but the less diversity, and sometimes quality, you will get.
Dataset
A machine learning dataset is a collection of data used to train a model. A dataset acts as a set of examples that teach the machine learning algorithm how to make predictions. Common types of data include text and images.
AI Biases
Machine learning bias, also known as algorithm bias or AI bias, occurs when an algorithm produces systemically prejudiced results due to erroneous assumptions in the machine learning (ML) process. Deep learning models tend to reproduce or exacerbate biases that are already present in the training dataset.
Model Architecture & Training
Diffusion Models
Diffusion models are a class of generative models that learn to create images by gradually removing noise from random data. They work by adding noise to training images and then learning to reverse this process. Popular examples include Midjourney, Stable Diffusion, DALL-E, and Krea, which have revolutionized AI image generation.
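The "adding noise" half of the process fits in a few lines. This is a minimal NumPy sketch of the standard forward step, x_t = √(ᾱ_t)·x_0 + √(1−ᾱ_t)·ε, not any library's implementation:

```python
import numpy as np

def add_noise(x0, t, alphas_cumprod, rng):
    """Forward diffusion: blend a clean sample x0 with Gaussian noise.

    alphas_cumprod[t] is the cumulative product of (1 - beta) up to step t;
    as t grows, the signal fades and the noise dominates.
    """
    eps = rng.standard_normal(x0.shape)
    abar = alphas_cumprod[t]
    return np.sqrt(abar) * x0 + np.sqrt(1.0 - abar) * eps, eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)     # simple linear noise schedule
alphas_cumprod = np.cumprod(1.0 - betas)

x0 = rng.standard_normal((8, 8))          # stand-in for a clean image
x_noisy, eps = add_noise(x0, t=999, alphas_cumprod=alphas_cumprod, rng=rng)
# at t=999, alphas_cumprod[t] is close to 0, so x_noisy is almost pure noise
```

Training then teaches a network to predict `eps` from `x_noisy`, which is exactly the reversal the model needs at generation time.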
Latent Space
Latent space is a compressed representation of data where similar concepts are positioned close together. In image generation, models work in this compressed space rather than directly with pixels, making the process more efficient and allowing for smoother interpolations between different concepts.
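Interpolating between concepts is just moving between two latent vectors. A minimal sketch using spherical interpolation (slerp), which is often preferred over a straight line because diffusion latents are roughly Gaussian:

```python
import numpy as np

def slerp(a, b, t):
    """Spherical interpolation between latent vectors a and b, t in [0, 1]."""
    a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
    omega = np.arccos(np.clip(np.dot(a_n, b_n), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return (1 - t) * a + t * b   # vectors are parallel; fall back to lerp
    return (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)

rng = np.random.default_rng(1)
z0, z1 = rng.standard_normal(16), rng.standard_normal(16)
midpoint = slerp(z0, z1, 0.5)   # a latent "between" the two concepts
```

Decoding a series of such intermediate latents is what produces the smooth morphing animations between two images.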
VAE (Variational Autoencoder)
A VAE is a component that compresses images into latent representations and reconstructs them back to pixel space. In Stable Diffusion, the VAE handles the conversion between the latent space where generation happens and the final image output.
U-Net
U-Net is the core neural network architecture used in diffusion models for image generation. It predicts the noise to be removed at each step of the denoising process, gradually transforming random noise into coherent images based on text prompts.
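A single denoising step can be sketched as below. The U-Net is replaced by a dummy noise predictor, since the point here is the DDPM-style update rule, not the network itself:

```python
import numpy as np

def ddpm_step(x_t, t, eps_pred, betas, alphas_cumprod, rng):
    """One reverse (denoising) step: subtract the predicted noise from x_t."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    abar_t = alphas_cumprod[t]
    mean = (x_t - beta_t / np.sqrt(1.0 - abar_t) * eps_pred) / np.sqrt(alpha_t)
    if t > 0:
        return mean + np.sqrt(beta_t) * rng.standard_normal(x_t.shape)
    return mean  # final step: no extra noise is injected

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alphas_cumprod = np.cumprod(1.0 - betas)

x = rng.standard_normal((8, 8))      # start from pure noise
for t in reversed(range(1000)):
    eps_pred = np.zeros_like(x)      # dummy "U-Net": a real one predicts the noise
    x = ddpm_step(x, t, eps_pred, betas, alphas_cumprod, rng)
```

In a real pipeline, `eps_pred` comes from the U-Net conditioned on the text prompt, which is what steers the noise toward a coherent image.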
Fine-tuning
Fine-tuning involves training a pre-trained model on specific data to adapt it for particular styles, subjects, or use cases. This process allows customization of AI models for specialized applications without training from scratch.
LoRA (Low-Rank Adaptation)
LoRA is an efficient fine-tuning technique that modifies only a small subset of model parameters, allowing for quick adaptation to new styles or concepts while maintaining the original model’s capabilities. LoRA models are lightweight and can be easily shared and combined.
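The core idea can be shown with plain NumPy: instead of updating a full weight matrix W, LoRA trains two small matrices A and B whose product forms a low-rank update. The sizes below are illustrative:

```python
import numpy as np

d, k, r = 512, 512, 8           # original weight is d×k; LoRA rank r is tiny
alpha = 16                      # LoRA scaling factor

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))          # frozen pre-trained weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, r×k
B = np.zeros((d, r))                     # trainable, d×r (zero-init: no effect at start)

# Effective weight at inference time: the original plus a scaled low-rank delta.
W_adapted = W + (alpha / r) * (B @ A)

# Parameter count: full fine-tune vs LoRA
full_params = d * k              # 262,144 trainable values
lora_params = r * (d + k)        # 8,192 — about 3% of the full matrix
```

This is why LoRA files are small and easy to share: only A and B need to be stored, and several LoRAs can be summed onto the same base weights.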
Control & Guidance Systems
Adapters
Adapters are modules added to pre-trained text-to-image models. They let us feed visual information into large text-to-image models to guide the generation process. Examples include ControlNet and T2I-Adapter.
ControlNet
ControlNet is a powerful adapter that enables precise control over image generation using various input conditions like edge maps, depth maps, pose estimation, or sketches. It allows users to maintain structural consistency while generating images, making it invaluable for architectural visualization and design work.
T2I-Adapter
T2I-Adapter provides an alternative to ControlNet for guiding image generation with visual inputs. It offers similar functionality but with different implementation approaches, often requiring less computational resources while maintaining good control over the generation process.
CFG Scale
Classifier Free Guidance (CFG) Scale determines how closely the AI follows your text prompt. Lower values (1-7) create more creative, diverse images but may stray from the prompt. Higher values (8-20) stick closely to the prompt but may reduce image quality and creativity.
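Numerically, CFG combines two noise predictions, one conditioned on the prompt and one unconditional, and the scale pushes the result toward the conditional one. A minimal sketch:

```python
import numpy as np

def apply_cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the prompt-conditioned one."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

rng = np.random.default_rng(0)
eps_uncond = rng.standard_normal((4, 4))   # model output without the prompt
eps_cond = rng.standard_normal((4, 4))     # model output with the prompt

guided = apply_cfg(eps_uncond, eps_cond, scale=7.5)
# scale=1.0 reproduces the conditional prediction exactly;
# larger scales overshoot past it, following the prompt more aggressively
```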
Generation Parameters
Seeds
A seed is a number that determines the initial random state for image generation. Using the same seed with identical settings will produce the same image, allowing for reproducible results and systematic experimentation with different parameters.
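Reproducibility comes directly from seeded random number generation. A tiny sketch of how the same seed yields the same initial noise:

```python
import numpy as np

def generate_latent(seed, shape=(4, 64, 64)):
    """Same seed → same initial noise → same image (with identical settings)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal(shape)

a = generate_latent(42)
b = generate_latent(42)   # identical to a: the run is reproducible
c = generate_latent(43)   # different starting noise → a different image
```

Fixing the seed while varying one parameter at a time (prompt wording, CFG scale, sampler) is the standard way to see what that parameter actually does.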
Sampling Methods
Sampling methods (like DPM++, Euler, DDIM) determine how the AI removes noise during image generation. Different samplers can affect image quality, generation speed, and the final aesthetic, with some being better suited for specific types of content.
Noise Scheduling
Noise scheduling controls how noise is added and removed during the diffusion process. Different schedules can affect the quality and characteristics of generated images, with some schedules being optimized for faster generation or better quality.
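Two common schedules can be compared directly. The linear schedule is the original DDPM choice; the cosine-style schedule (after Nichol and Dhariwal's improved-DDPM work) keeps more signal alive in the middle steps:

```python
import numpy as np

T = 1000

# Linear schedule: betas grow linearly over the T steps.
betas_linear = np.linspace(1e-4, 0.02, T)
abar_linear = np.cumprod(1.0 - betas_linear)

# Cosine-style schedule: define the cumulative signal level directly,
# then back out the per-step betas from consecutive ratios.
s = 0.008
steps = np.arange(T + 1)
f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
abar_cosine = f / f[0]
betas_cosine = np.clip(1 - abar_cosine[1:] / abar_cosine[:-1], 0, 0.999)

# Both schedules destroy almost all signal by the last step,
# but they distribute the destruction differently over time.
```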
Prompt Engineering
Prompt engineering is the art and science of crafting effective text prompts to guide AI image generation. It involves understanding how different words, phrases, and structures influence the output, including the use of negative prompts to avoid unwanted elements.
Advanced Techniques
Inpainting
Inpainting allows you to modify specific parts of an existing image by masking areas and generating new content to fill them. This technique is useful for editing, removing objects, or adding elements to existing images while maintaining visual coherence.
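At its simplest, inpainting composites newly generated pixels into the masked region and keeps the original everywhere else. A toy sketch with arrays standing in for images:

```python
import numpy as np

def composite(original, generated, mask):
    """Keep the original where mask == 0; use generated content where mask == 1."""
    return mask * generated + (1 - mask) * original

original = np.ones((8, 8))          # stand-in for the source image
generated = np.full((8, 8), 5.0)    # stand-in for freshly generated content
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0                # region the user wants replaced

result = composite(original, generated, mask)
# result is 5.0 inside the masked square and 1.0 everywhere else
```

Real inpainting pipelines go further: they re-noise and denoise the masked region together with its surroundings so the new content blends in, rather than pasting it in as this sketch does.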
Outpainting
Outpainting extends images beyond their original boundaries, generating new content that seamlessly continues the existing image. This technique is valuable for expanding compositions or creating panoramic views from smaller images.
Upscaling
AI upscaling enhances image resolution and quality using machine learning algorithms. Modern upscaling techniques can increase image size while adding realistic details, making them essential for preparing AI-generated images for high-resolution applications.
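For contrast with learned upscalers, the simplest possible baseline is pure interpolation, which enlarges pixels without inventing any new detail. A nearest-neighbour sketch in NumPy:

```python
import numpy as np

def upscale_nearest(img, factor):
    """Nearest-neighbour upscaling: repeat each pixel factor× in both axes.
    This only enlarges; learned upscalers additionally synthesize new detail."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

img = np.arange(16, dtype=float).reshape(4, 4)   # stand-in for a small image
big = upscale_nearest(img, 4)                    # 4×4 → 16×16
```

An AI upscaler produces output of the same shape as `big`, but fills the enlarged pixels with plausible texture learned from training data instead of flat blocks.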
Multimodal AI
Multimodal AI combines multiple input types (text, images, sketches, audio) to create more sophisticated and controllable generation processes. This approach enables more intuitive interaction with AI systems and better creative control.
Real-time Generation
Real-time generation refers to AI systems that can produce images almost instantaneously as you type or draw. This capability enables interactive creative workflows and immediate feedback, revolutionizing how designers and artists work with AI tools.