DiffuseNet

A collection of generative models trained from scratch, including DDPM, DiT, and VAE, for learning and research

I implemented a number of generative models from scratch in PyTorch and trained them on server-grade hardware. In the process, I conducted an extensive literature survey of state-of-the-art papers on these models. All the code is available on GitHub. You can also check out the project report.

Variational Autoencoder (VAE)

Implemented a VAE from scratch inspired by SD-VAE. It was trained on both MNIST and Minecraft images. The model uses a convolutional autoencoder with upsampling and downsampling blocks along with residual attention layers.

The model was trained with a combination of adversarial loss, KL-divergence (KLD) loss, and LPIPS perceptual loss computed with a pretrained VGG16 network.
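Of these loss terms, the KLD term has a simple closed form when the posterior is a diagonal Gaussian. The sketch below is plain Python rather than PyTorch so it runs without dependencies. It assumes the encoder outputs a mean and a log-variance per latent element and that the prior is a standard normal. The function name `kld_loss` and the sum reduction are illustrative choices, not taken from the project code.

```python
import math

def kld_loss(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, 1)), summed over latent elements."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))

# A posterior that matches the prior incurs no penalty.
print(kld_loss([0.0, 0.0], [0.0, 0.0]))
# Moving the mean away from zero increases the penalty.
print(kld_loss([1.0, -1.0], [0.0, 0.0]))
```

In practice this term is usually given a small weight relative to the reconstruction-oriented losses so that the latents stay informative; the weight used in this project is not stated here.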

Interpolations

Linear interpolations between two latents
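A minimal sketch of linear interpolation between two latents, written in plain Python on flattened lists so it runs without dependencies. The helper name `lerp_latents` and the example vectors are hypothetical. In the actual pipeline each interpolated latent would be reshaped and passed through the VAE decoder to produce an image.

```python
def lerp_latents(z_a, z_b, num_steps):
    """Return num_steps latents evenly spaced on the line from z_a to z_b."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    path = []
    for i in range(num_steps):
        t = i / (num_steps - 1)  # 0.0 at z_a, 1.0 at z_b
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)])
    return path

z_a = [0.0, 2.0, -1.0]  # hypothetical flattened latent A
z_b = [4.0, 0.0, 1.0]   # hypothetical flattened latent B
for z in lerp_latents(z_a, z_b, num_steps=5):
    print(z)
```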

Reconstructions

The VAE was trained on 256x256 Minecraft images and produces latents of shape 64x8x8, a 48x compression of the input.
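The 48x figure can be checked by counting values, assuming the inputs are 3-channel RGB images (the channel count is not stated above, but it is the one that reproduces the quoted ratio):

```python
from math import prod

image_shape = (3, 256, 256)  # assumed RGB input: channels, height, width
latent_shape = (64, 8, 8)    # latent dimensions quoted above

compression = prod(image_shape) / prod(latent_shape)
print(compression)  # 48.0
```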

Diffusion Transformer (DiT)

An implementation of the Diffusion Transformer, inspired by the original DiT paper. The model uses transformer blocks that are conditioned on the timestep through adaptive layer normalization (adaLN). I tested both small (76M) and large (608M) variants on the Minecraft dataset, using the pre-trained VAE described above. All models were trained on an NVIDIA A100.
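To illustrate the adaLN conditioning, here is a minimal sketch for a single token vector in plain Python. It follows the modulation form used in the DiT paper, where the normalized activations are scaled by `1 + scale` and then shifted. The `shift` and `scale` values are hypothetical stand-ins for what a small MLP on the timestep embedding would produce, and the gating term of adaLN-Zero is omitted for brevity.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_modulate(x, shift, scale):
    """Apply a timestep-dependent scale and shift after normalization."""
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]

token = [1.0, 2.0, 3.0, 4.0]
# Hypothetical outputs of an MLP applied to the timestep embedding.
shift = [0.1, 0.1, 0.1, 0.1]
scale = [0.5, 0.5, 0.5, 0.5]
print(adaln_modulate(token, shift, scale))

# With zero shift and scale, adaLN reduces to plain layer normalization.
print(adaln_modulate(token, [0.0] * 4, [0.0] * 4) == layer_norm(token))
```

Because the shift and scale depend on the timestep rather than being fixed learned parameters, the same block can behave differently at high and low noise levels.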

Minecraft (left) and Pokemon (right) images generated from noise by the DiT-large model

Text Conditioning with Classifier-Free Guidance (CFG)

Experimented with text conditioning through cross-attention in the DiTBlock, using text embeddings from a pre-trained CLIP model.

Used Qwen-2.5VL to generate captions for ~36,000 Pokemon frames and trained a 1.2B-parameter DiT model with CFG and dropout. Here are some examples for selected prompts:

Red-haired character standing beside a vibrant blue water body, soft daylight, serene atmosphere
Red-haired character walking through dense forest, overcast day, pixelated art style, serene atmosphere, lush greenery surrounding the path
Small orange lizard-like creature with flames on its tail, battling with a yellow mouse Pokémon with lightning and thunder
A red-roofed healing center in a vibrant green field, daytime, close-up, peaceful

All of these generations were sampled with a CFG guidance scale in the range 5-20.
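As a rough sketch of how a guidance scale enters sampling, the snippet below combines a conditional and an unconditional noise prediction. It assumes the common convention in which a scale of 1.0 returns the conditional prediction unchanged. It also assumes that the dropout mentioned above refers to randomly replacing captions with an empty prompt during training, which is what lets one model produce both predictions. The function names and the example values are hypothetical.

```python
import random

def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def maybe_drop_caption(caption, drop_prob, rng):
    """Training-time caption dropout: swap the caption for an empty prompt."""
    return "" if rng.random() < drop_prob else caption

eps_uncond = [0.0, 1.0]  # hypothetical model output for the empty prompt
eps_cond = [1.0, 1.5]    # hypothetical model output for the text prompt
print(cfg_combine(eps_uncond, eps_cond, guidance_scale=5.0))  # [5.0, 3.5]

rng = random.Random(0)
print(maybe_drop_caption("A red-roofed healing center", drop_prob=0.1, rng=rng))
```

Larger scales push samples harder toward the prompt at the cost of diversity, which is why the usable range is typically found by experiment.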

DDPM Generations

U-Net generations on MNIST (left) and CIFAR-10 (right)

Timestep Sampling (T=500)
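For reference, a minimal sketch of DDPM ancestral sampling over T=500 steps in plain Python. The linear beta schedule endpoints and the choice of `beta_t` as the reverse-step variance are assumptions borrowed from the original DDPM paper, not confirmed settings of this project. The trained U-Net is replaced by a hypothetical `oracle_noise` function, which is the exact noise predictor when the dataset is a single point, so the sampler should land back on that point.

```python
import math
import random

T = 500                            # number of diffusion steps, as in the caption
BETA_START, BETA_END = 1e-4, 0.02  # assumed linear schedule endpoints

betas = [BETA_START + (BETA_END - BETA_START) * t / (T - 1) for t in range(T)]
alpha_bars = []                    # cumulative products of (1 - beta)
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def ddpm_sample(predict_noise, dim, rng):
    """Ancestral sampling: start from Gaussian noise and denoise for T steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(T)):
        eps = predict_noise(x, t)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        inv_sqrt_alpha = 1.0 / math.sqrt(1.0 - betas[t])
        mean = [inv_sqrt_alpha * (v - coef * e) for v, e in zip(x, eps)]
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no noise on the final step
        x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return x

# Stand-in for the U-Net: exact noise prediction for a one-point "dataset".
TARGET = [0.5, -0.25]

def oracle_noise(x, t):
    a = alpha_bars[t]
    return [(v - math.sqrt(a) * c) / math.sqrt(1.0 - a) for v, c in zip(x, TARGET)]

sample = ddpm_sample(oracle_noise, dim=2, rng=random.Random(0))
print([round(v, 6) for v in sample])
```

With a real model, `predict_noise` would be the network's forward pass on the current image and timestep, and intermediate values of `x` are what a timestep-by-timestep visualization would show.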