Experiments in diffusion-based upscaling of low-resolution images using tiled pipelines, a custom MultiDiffusion path, and Flux.2 reference-conditioned tiles.
See upscalers.md for the current upscaler architecture guide, engine tradeoffs, and result-informed next experiments.
uv sync
uv sync --extra ml
The first real generation downloads model weights from Hugging Face unless they are already cached. Set HF_TOKEN for higher Hub rate limits.
uv run find-alan-refine --help
find-alan-refine runs the full-coverage iterative refinement pass from full_coverage_v5.py, packaged as an importable module and CLI.
The refinement pass:
- Opens an existing image.
- Builds shifted patch grids with edge-aware writable masks.
- Randomly packs non-overlapping patches into mini-batches.
- Runs Flux inpainting over each mini-batch.
- Blends the refined patch interiors back into the working image.
- Saves per-iteration outputs, a final image, a before/after comparison, and a patch progression GIF.
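The packing step in the list above can be sketched as a greedy pass over a shuffled patch list. This is a minimal illustration, not the code in find_alan.refinement: the names `overlaps` and `pack_batches`, the box layout, and the greedy strategy are all assumptions.

```python
import random

# Hypothetical box layout: (left, top, right, bottom), right/bottom exclusive.
Box = tuple[int, int, int, int]


def overlaps(a: Box, b: Box) -> bool:
    """True when two boxes share at least one pixel."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def pack_batches(patches: list[Box], max_batch_size: int, seed: int = 0) -> list[list[Box]]:
    """Randomly pack patches into batches whose members never overlap."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    rng = random.Random(seed)
    remaining = list(patches)
    rng.shuffle(remaining)
    batches = []
    while remaining:
        batch, deferred = [], []
        for box in remaining:
            fits = len(batch) < max_batch_size
            if fits and not any(overlaps(box, other) for other in batch):
                batch.append(box)
            else:
                deferred.append(box)  # try again in a later batch
        batches.append(batch)
        remaining = deferred
    return batches
```

Because every batch takes at least the first remaining patch, the loop always terminates, and patches inside one batch can be written back without touching each other's pixels.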
uv run find-alan-refine input.png outputs/refined \
--iterations 4 \
--max-batch-size 12 \
--strength 0.2
The old environment variables still work as defaults for the CLI:
- `INPUT_IMAGE`
- `OUTPUT_DIR`
- `NUM_ITERS`
- `MAX_BATCH_SIZE`
- `STRENGTH`
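One common way to let environment variables act as CLI defaults is to read them when the argument parser is built, so explicit arguments still win. The sketch below assumes that pattern; `build_parser` is a hypothetical name, and the fallback values (4, 12, 0.2) are taken from the example command above rather than from the real CLI defaults.

```python
import argparse
import os


def build_parser(env=None) -> argparse.ArgumentParser:
    """Build a parser whose defaults come from the old environment variables."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="find-alan-refine")
    # Positionals are optional so the environment can supply them.
    parser.add_argument("input", nargs="?", default=env.get("INPUT_IMAGE"))
    parser.add_argument("output_dir", nargs="?", default=env.get("OUTPUT_DIR"))
    # Fallback numbers below are assumptions, not the documented defaults.
    parser.add_argument("--iterations", type=int,
                        default=int(env.get("NUM_ITERS", "4")))
    parser.add_argument("--max-batch-size", type=int,
                        default=int(env.get("MAX_BATCH_SIZE", "12")))
    parser.add_argument("--strength", type=float,
                        default=float(env.get("STRENGTH", "0.2")))
    return parser
```

With this layout, `NUM_ITERS=8 find-alan-refine input.png out` would run 8 iterations, while adding `--iterations 2` on the command line would override the variable.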
For scripts that combine stages, use the package API:
from pathlib import Path
from find_alan.refinement import TiledRefinementConfig, run_tiled_refinement
result = run_tiled_refinement(
TiledRefinementConfig(
input_path=Path("input.png"),
output_dir=Path("outputs/refined"),
)
)
print(result.final_path)
Run static checks when dev dependencies are installed:
uv run ty check
uv run find-alan-upscale --help
uv run find-alan-crop-plan --help
The upscaler has a few separate jobs that are easy to mix together:
- Image-to-image denoising starts from a resized version of the low-resolution input. `--denoising-strength` controls how much noise is added before the model redraws it. Higher values give the model more room to invent new detail.
- ControlNet feeds the source image back into the model as a spatial constraint. It says: keep this composition, these edges, this local texture, and this rough object placement. `--controlnet-strength` controls how strongly that constraint is enforced.
- Tiling/MultiDiffusion is about scale and seams. Large images are too big to denoise in one pass, so the model denoises overlapping latent crops and blends them into one canvas.
- Flux.2 reference conditioning is different from ControlNet. Flux.2 accepts image inputs as reference/context tokens, so the `flux2-tile` engine gives each target crop to Flux.2 as a reference image and asks it to redraw that crop at the same pixel size.
ControlNet and MultiDiffusion solve different problems. ControlNet controls what the generated image should stay aligned to. MultiDiffusion controls how many overlapping windows are combined into a seamless large image.
For more hallucinated detail, raise --denoising-strength and lower --controlnet-strength. For a faithful upscale, lower denoising and raise ControlNet strength.
For flux2-tile, there is no SDXL ControlNet and no global latent canvas. Faithfulness comes from the per-tile reference image plus the prompt. Seam reduction comes from overlap and gaussian blending.
flowchart TD
Base["Low-resolution crowd image"] --> Resize["Resize to target scale"]
Resize --> Engine{"Choose upscale engine"}
Engine --> ModTile["mod-tile"]
Engine --> Multi["multidiffusion"]
Engine --> Flux["flux2-tile"]
Engine --> FluxMD["flux2-multidiffusion"]
Engine --> SD3["sd3-tile"]
ModTile --> Blend["Tile, condition, and blend"]
Multi --> Blend
Flux --> Blend
FluxMD --> Blend
SD3 --> Blend
Blend --> BaseOut["Upscaled crowd base"]
BaseOut --> Review["Review seams, detail, and layout"]
Review --> Local["Optional local object insertion or repair"]
Local --> Final["Final image"]
mod-tile is the default engine. It uses the Diffusers community tiled super-resolution pipeline with SDXL and ControlNet Tile/Union.
The flow is:
- Resize the input image to the target scale.
- Use ControlNet Tile/Union to keep the resized image structure visible to SDXL.
- Let the community tiled SR pipeline split work into tiles and blend the output.
Best for: quick baselines, stable 4x results, preserving the source layout. It is the safer first pass when you want to check whether the source and prompt are reasonable.
uv run find-alan-upscale input.png output.png --scale 4
The multidiffusion engine is experimental. It runs overlapping latent crops at each denoising step, fuses their noise predictions, and advances the whole latent canvas once per step. This is closer to the original MultiDiffusion idea.
The flow is:
- Resize the input image to the target scale.
- Encode that resized image into one large latent canvas.
- For each denoising step, generate a crop grid over the latent canvas.
- Run ControlNet and the UNet on every overlapping crop.
- Blend the predicted noise from all crops with soft weights.
- Advance the whole latent canvas once with the fused noise prediction.
- Decode the final latent canvas back to pixels.
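The fusion step in the flow above is a per-position weighted average of the crop predictions. The sketch below shows only that step, on a 1D canvas with plain lists; the real engine works on latent tensors, and `fuse_crop_predictions`, `triangular_weights`, and the triangular weighting are illustrative assumptions rather than the engine's code.

```python
def triangular_weights(size: int) -> list[float]:
    """Soft weights that peak in the middle of a crop (an assumed shape)."""
    return [float(min(i + 1, size - i)) for i in range(size)]


def fuse_crop_predictions(canvas_len, crops, predict, weight=triangular_weights):
    """Blend per-crop predictions into one prediction for the whole canvas.

    crops:   list of (start, size) windows over the canvas
    predict: callable (start, size) -> list of floats, one per crop position
    weight:  callable size -> list of positive floats, one per crop position
    """
    total = [0.0] * canvas_len
    norm = [0.0] * canvas_len
    for start, size in crops:
        prediction = predict(start, size)
        weights = weight(size)
        for i in range(size):
            total[start + i] += weights[i] * prediction[i]
            norm[start + i] += weights[i]
    if any(n == 0.0 for n in norm):
        raise ValueError("crop grid does not cover the whole canvas")
    return [t / n for t, n in zip(total, norm)]
```

The scheduler would then take one step on the whole canvas using the fused result, which is what keeps neighbouring crops in agreement from step to step.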
ControlNet still runs inside each crop. That means the custom MultiDiffusion engine can still be source-faithful if --controlnet-strength is high. The MultiDiffusion part makes the crop fusion more coherent; it does not, by itself, make the model more imaginative.
Best for: testing stronger hallucinated detail, jittered crop fusion, and high-overlap seamlessness. It is much slower than mod-tile.
uv run find-alan-upscale input.png output.png \
--engine multidiffusion \
--scale 2 \
--steps 28 \
--denoising-strength 0.92 \
--controlnet-strength 0.45 \
--guidance-scale 6 \
--md-tile-size 768 \
--md-overlap 384 \
--md-jitter 256
flux2-tile is an experimental engine for Flux.2. It does not reuse the SDXL ControlNet or latent MultiDiffusion loop, because Flux.2 is a DiT pipeline with image/reference conditioning. Instead, it:
- Resizes the input image to the target scale.
- Splits the resized image into overlapping pixel crops.
- Sends each crop to Flux.2 as the reference image.
- Prompts Flux.2 to faithfully redraw that reference crop.
- Blends the generated crops back together with gaussian weights.
Best for: A100 trials where Flux.2 quality is more important than strict SDXL ControlNet-style fidelity. --denoising-strength and --controlnet-strength are not used by this engine.
Detailed flow:
- Open the source image as RGB.
- Compute the scaled output size and round it to a multiple of 16, matching Flux.2/VAE packing constraints.
- Resize the source image to that final output size with Lanczos filtering.
- Round `--flux2-tile-size` up to a multiple of 16 and clamp `--flux2-overlap` so it is smaller than the tile.
- Build a full-cover crop grid. The grid always includes the top-left and bottom-right bounds, so edge pixels are covered even when the image size is not an exact multiple of the stride.
- For each crop, cut the resized image and pass that crop to Flux.2 as `image=...` with `height` and `width` set to the crop size.
- Add a faithfulness instruction to the user prompt, including a reminder to preserve composition, linework, colors, viewpoint, and crowd layout.
- Generate one tile independently. The engine uses bf16 on CUDA and fp32 on CPU.
- Convert the generated tile to float RGB and multiply it by a gaussian weight map. The center of the tile contributes more strongly than the edges.
- Accumulate weighted tile pixels into one output canvas and accumulate the matching weights.
- Normalize `canvas / weights`, clip to RGB, and save the final image.
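The rounding, clamping, and full-cover grid steps above can be sketched in a few lines. This is a minimal illustration under assumptions: the helper names are hypothetical, the overlap is clamped to at most `tile - 1`, and the final tile on each axis is pinned to the far edge, which may not match the engine's exact rules.

```python
def round_up(value: int, multiple: int = 16) -> int:
    """Round value up to the next multiple (1000 -> 1008 for multiple=16)."""
    return -(-value // multiple) * multiple


def clamp_overlap(tile: int, overlap: int) -> int:
    """Keep the overlap strictly smaller than the tile so the stride stays positive."""
    return max(0, min(overlap, tile - 1))


def axis_starts(length: int, tile: int, overlap: int) -> list[int]:
    """Tile start offsets along one axis, always covering both ends."""
    tile = min(tile, length)
    if tile == length:
        return [0]
    stride = tile - clamp_overlap(tile, overlap)
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # pin the last tile to the far edge
    return starts


def crop_grid(width: int, height: int, tile: int, overlap: int):
    """Boxes as (left, top, right, bottom), row by row."""
    tile = round_up(tile)
    tile_w, tile_h = min(tile, width), min(tile, height)
    return [
        (left, top, left + tile_w, top + tile_h)
        for top in axis_starts(height, tile, overlap)
        for left in axis_starts(width, tile, overlap)
    ]
```

Pinning the last tile to the edge means the final pair of tiles can overlap by more than the requested amount, which is harmless for blending and guarantees that no edge pixel is left uncovered.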
The tradeoff is important: because Flux.2 tiles are sampled independently, this engine is simpler than latent MultiDiffusion but has less global coordination. Increase --flux2-overlap when seams are visible. Increase --flux2-tile-size when objects need more surrounding context. Use --flux2-jitter to change the crop alignment in a deterministic seed-controlled way.
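To make the role of overlap concrete, here is a small sketch of a gaussian weight map built as the outer product of two 1D profiles. The function names and the `sigma_fraction` parameter are assumptions for illustration; the engine's actual falloff may differ.

```python
import math


def gaussian_profile(size: int, sigma_fraction: float = 0.25) -> list[float]:
    """1D gaussian falloff centred on the middle of the tile."""
    center = (size - 1) / 2
    sigma = max(size * sigma_fraction, 1e-6)
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(size)]


def gaussian_weight_map(width: int, height: int, sigma_fraction: float = 0.25):
    """2D weight map as rows of floats; the centre weighs more than the edges."""
    xs = gaussian_profile(width, sigma_fraction)
    ys = gaussian_profile(height, sigma_fraction)
    return [[wy * wx for wx in xs] for wy in ys]
```

Every weight is strictly positive, so dividing the accumulated canvas by the accumulated weights is safe wherever at least one tile landed. A wider overlap gives each tile's low-weight border more pixels over which to hand off to its neighbour.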
Model selection:
- Default: `black-forest-labs/FLUX.2-dev`.
- `--flux2-pipeline auto` selects the Diffusers pipeline from the model id.
- Use `--flux2-pipeline dev` for `Flux2Pipeline`.
- Use `--flux2-pipeline klein` for `Flux2KleinPipeline`.
- Use `--flux2-pipeline klein-kv` for `Flux2KleinKVPipeline`.
uv run find-alan-upscale input.png output.png \
--engine flux2-tile \
--scale 2 \
--steps 50 \
--guidance-scale 4 \
--flux2-tile-size 1024 \
--flux2-overlap 256 \
--no-cpu-offload
Debug helper. Prints the jittered crop grids used by the custom MultiDiffusion scheduler.
uv run find-alan-crop-plan --width 320 --height 240 --scale 10 --steps 4
--denoising-strength: higher means more imagined changes; lower means more faithful to the upscaled input.
--controlnet-strength: higher pins structure and local texture to the source; lower gives the model more freedom. ControlNet is one of the main reasons outputs stay close to the source.
--guidance-scale: higher follows the prompt harder, but can overcook details.
--md-overlap: higher improves seam consistency but increases runtime.
--md-jitter: changes crop alignment between denoising steps, which can reduce repeated tile artifacts.
--flux2-tile-size: pixel crop size for flux2-tile. Larger tiles give Flux.2 more context, but require more VRAM and time.
--flux2-overlap: pixel overlap between Flux.2 tiles. Higher overlap gives the gaussian blend more room to hide seams.
--flux2-jitter: optional maximum random tile-grid offset for Flux.2. This is deterministic with --seed.
--flux2-pipeline: selects the Diffusers Flux.2 pipeline class. auto is usually enough unless the model id does not clearly name the variant.
--flux2-caption-upsample-temperature: optional Flux.2 prompt upsampling temperature. Leave unset for the local prompt as written.
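The jitter options describe a seed-controlled shift of the tile grid. How the engines apply the offset is not spelled out here, so the following is only a sketch of one plausible scheme: draw the offset from a seeded generator, shift the interior tile starts, and keep both image bounds in the grid. All names are hypothetical.

```python
import random


def jitter_offset(seed: int, max_jitter: int) -> tuple[int, int]:
    """Draw a reproducible (x, y) grid offset in [0, max_jitter]."""
    rng = random.Random(seed)
    return rng.randint(0, max_jitter), rng.randint(0, max_jitter)


def jittered_starts(length: int, tile: int, stride: int, offset: int) -> list[int]:
    """Shift interior tile starts by offset while still covering both ends."""
    last = max(length - tile, 0)
    starts = {0, last}  # the two bounds are always present
    position = offset % stride
    while position < last:
        starts.add(position)
        position += stride
    return sorted(starts)
```

Because the generator is seeded, repeating a run with the same seed reproduces the same grid, while a different seed moves the seams to different places.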
Faithful baseline:
uv run find-alan-upscale input.png output.png --engine mod-tile --scale 4 --denoising-strength 0.45
More imagined 2x MultiDiffusion:
uv run find-alan-upscale input.png output.png \
--engine multidiffusion \
--scale 2 \
--steps 28 \
--denoising-strength 0.92 \
--controlnet-strength 0.45 \
--guidance-scale 6 \
--md-tile-size 768 \
--md-overlap 384 \
--md-jitter 256
Detailed 4x MultiDiffusion trial:
uv run find-alan-upscale input.png output.png \
--engine multidiffusion \
--scale 4 \
--steps 24 \
--denoising-strength 0.85 \
--controlnet-strength 0.75 \
--guidance-scale 5 \
--md-tile-size 1024 \
--md-overlap 512 \
--md-jitter 256
Use a separate local pass for final object insertion and local corrections.
The current upscaling engines should stay focused on making a strong base image. Do not put object-specific language into the global upscale prompt, because it can create false positives or repeated motifs across the crowd.
The object insertion stage should stay separate from global upscaling, but the exact approach is intentionally not fixed yet. It might use masked inpainting, local img2img repair, compositing, or a small engine-specific workflow once the base image quality is clear.
For now, treat the global output as an engine-flexible base image. After choosing the best base, use a local pass around the target region so the object can be inserted or repaired without changing the whole crowd scene.
Avoid running a full-image upscale or redraw after inserting the object, because that could smear it, duplicate it, or change the hiding location.
The Flux2-style source images in data/examples/lr/flux2 are being upscaled for large-screen review with SDXL ControlNet MultiDiffusion, not Flux2. The queued batch writes to data/examples/out/flux2.
Queued batch suffix:
c31d3a8a_rb826e0f
Output naming pattern:
data/examples/out/flux2/<source_stem>_sdxl_md_3x_upscaleprompt_d075_c035_tile1024_c31d3a8a_rb826e0f.png
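A small helper can keep these names consistent across a batch. This is a sketch of the pattern shown above, not a function from the package: `strength_tag` and `upscale_output_path` are hypothetical names, and the two-decimal encoding (0.75 becomes `d075`) is inferred from the example file name.

```python
from pathlib import Path


def strength_tag(prefix: str, value: float) -> str:
    """Encode a 0-1 strength as prefix plus three digits, e.g. d075."""
    return f"{prefix}{round(value * 100):03d}"


def upscale_output_path(source, out_dir, *, scale, denoise, controlnet,
                        tile, suffix, prompt_tag="upscaleprompt") -> Path:
    """Build an output path that follows the naming pattern above."""
    name = "_".join([
        Path(source).stem,
        "sdxl_md",
        f"{scale}x",
        prompt_tag,
        strength_tag("d", denoise),
        strength_tag("c", controlnet),
        f"tile{tile}",
        suffix,
    ])
    return Path(out_dir) / f"{name}.png"
```

For example, calling it with a source named `crowd.png` (a made-up file name), `scale=3`, `denoise=0.75`, `controlnet=0.35`, `tile=1024`, and the queued batch suffix yields a file name ending in `_sdxl_md_3x_upscaleprompt_d075_c035_tile1024_c31d3a8a_rb826e0f.png`.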
Settings:
uv run find-alan-upscale input.png output.png \
--engine multidiffusion \
--scale 3 \
--steps 28 \
--denoising-strength 0.75 \
--controlnet-strength 0.35 \
--guidance-scale 4.5 \
--md-tile-size 1024 \
--md-overlap 512 \
--md-jitter 256
Rationale:
- `3x` turns the `1920x1072` sources into roughly `5760x3216`, which is better suited to large-screen display than `2x`.
- `1024` tiles with `512` overlap keep the setting consistent with the best conference SDXL MultiDiffusion runs and prioritize seam control.
- `denoising-strength 0.75` and `controlnet-strength 0.35` are the current preferred balance for adding detail while keeping the crowd layout anchored.
- Refinement is intentionally not queued for this batch yet; inspect the 3x bases first, then refine selected outputs.
Use find-alan-refine only after a base upscale has been selected or when a specific comparison needs polishing. The refinement stage is a tiled Flux inpaint pass: each patch sees a 512x512 context window and writes only the inner 256x256 region when --inner-ratio 0.5 is used.
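The context-window geometry works out as follows: with an outer size of 512 and an inner ratio of 0.5, the writable region is 256 pixels wide and sits 128 pixels in from each side. The sketch below computes that box plus a 1D feathered write profile. The linear feather ramp is an assumption about how `--feather` softens the edge, and both function names are hypothetical.

```python
def inner_box(outer_left: int, outer_top: int,
              outer_size: int = 512, inner_ratio: float = 0.5):
    """Writable (left, top, right, bottom) box inside a square context window."""
    inner = round(outer_size * inner_ratio)
    margin = (outer_size - inner) // 2
    left, top = outer_left + margin, outer_top + margin
    return left, top, left + inner, top + inner


def writable_profile(outer_size: int = 512, inner_ratio: float = 0.5,
                     feather: int = 4) -> list[float]:
    """Per-pixel write weight along one axis of the context window."""
    inner = round(outer_size * inner_ratio)
    margin = (outer_size - inner) // 2
    profile = []
    for i in range(outer_size):
        # Depth inside the inner region; negative means context only.
        depth = min(i - margin, margin + inner - 1 - i)
        if depth < 0:
            profile.append(0.0)
        elif feather > 0:
            profile.append(min(1.0, (depth + 1) / (feather + 1)))
        else:
            profile.append(1.0)
    return profile
```

A 2D mask would be the product of the horizontal and vertical profiles, so the model sees the full 512x512 context but only the inner region is blended back into the working image.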
Default comparison refinement settings:
uv run find-alan-refine base.png output_dir \
--iterations 4 \
--strength 0.2 \
--steps 28 \
--guidance-scale 3.5 \
--outer-size 512 \
--inner-ratio 0.5 \
--feather 4 \
--max-batch-size 12
Output directory naming pattern:
data/examples/out/<set>/refined/<base_stem>_refine_default_i4_s020_steps28_c<commit7>_r<run7>
For a heavier refinement stress test, increase iterations while keeping strength fixed:
uv run find-alan-refine base.png output_dir \
--iterations 12 \
--strength 0.2 \
--steps 28 \
--guidance-scale 3.5 \
--outer-size 512 \
--inner-ratio 0.5 \
--feather 4 \
--max-batch-size 12
--max-batch-size 24 can improve throughput on an otherwise empty 80 GB GPU, but it is more fragile. Use 12 as the reliable default and only raise it for throughput experiments.
Operational notes:
- Keep `pueue parallel 1` for these runs unless jobs are pinned to separate GPUs.
- If refinement fails with CUDA OOM while loading the pipeline, lowering `--max-batch-size` usually will not help; that failure happens before patch batches start and usually means another process is already occupying VRAM.
- If refinement fails during patch processing, retry with a smaller batch size such as `--max-batch-size 6`.
- For 3x Flux2-source bases, inspect the base upscales first and refine only selected outputs.
There are three ways to insert a figure into a crowd scene, differing in model requirements and how much control you want over placement.
Uses FLUX.2-klein-4B with a dedicated image_reference parameter. No mask needed — the model places the figure based on the prompt and scene context.
Before running, accept the model licence at huggingface.co/black-forest-labs/FLUX.2-klein-4B.
uv run find-alan-insert \
--scene <base image>.png \
--figure <figure image>.png \
--output <result filename>.png \
--seed 42
Key options:
| Flag | Default | Notes |
|---|---|---|
| `--strength` | `0.85` | How much the scene is allowed to change (0–1). Lower = preserve more. |
| `--guidance-scale` | `8.0` | Prompt adherence. |
| `--steps` | `50` | Inference steps. |
| `--prompt` | (see code) | Text describing placement and blending. |
uv run find-alan-insert --help
Uses FLUX.1-Redux-dev + FLUX.1-Fill-dev. You supply a mask that marks exactly where the figure is inserted; the Redux prior encodes the reference figure as visual tokens that condition the fill.
Both models are gated — accept the licence at huggingface.co/black-forest-labs/FLUX.1-Fill-dev and huggingface.co/black-forest-labs/FLUX.1-Redux-dev.
With a mask file (white = inpaint, black = keep):
uv run find-alan-inpaint \
--scene <base image>.png \
--figure <figure image>.png \
--mask <mask image>.png \
--output <result filename>.png \
--seed 42
Or with a bounding box instead:
uv run find-alan-inpaint \
--scene examples/crowd_scene.png \
--figure examples/figure.png \
--bbox 210 330 90 150 \
--output examples/result.png
uv run find-alan-inpaint --help
Uses YOLOv8 to detect people in the scene, selects one as the target, then inpaints the reference figure into that region with FLUX.1-Redux-dev + FLUX.1-Fill-dev. Because the mask is sized to a real crowd member, the inserted figure automatically matches the correct scale and perspective.
Both FLUX models are gated — accept the licences at huggingface.co/black-forest-labs/FLUX.1-Fill-dev and huggingface.co/black-forest-labs/FLUX.1-Redux-dev before running.
uv run find-alan-insert-detected \
--scene <base image>.png \
--figure <figure image>.png \
--output <result filename>.png \
--seed 42
Key options:
| Flag | Default | Notes |
|---|---|---|
| `--strategy` | `random` | Which detected person to replace: random, largest, smallest, center. |
| `--conf` | `0.3` | YOLO confidence threshold; lower it to detect more people. |
| `--yolo-model` | `yolov8n` | YOLOv8 variant (yolov8n/s/m/l/x). Larger = more accurate, slower. |
| `--padding` | `0.15` | Fraction to expand the detected bbox for edge blending. |
| `--guidance-scale` | `30.0` | CFG scale. |
| `--steps` | `50` | Inference steps. |
| `--save-mask` | (none) | Optional path to save the generated mask for inspection. |
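The four `--strategy` values can be read as simple rules over the detected boxes. The sketch below is one plausible interpretation, not the command's implementation: it assumes boxes are `(x1, y1, x2, y2)` tuples, that "center" means closest to the image centre, and that `select_target` and `box_area` are hypothetical helper names.

```python
import random


def box_area(box) -> int:
    """Pixel area of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def select_target(boxes, strategy="random", image_size=None, seed=None):
    """Pick one detected box according to the chosen strategy."""
    if not boxes:
        raise ValueError("no detections to choose from")
    if strategy == "random":
        return random.Random(seed).choice(boxes)
    if strategy == "largest":
        return max(boxes, key=box_area)
    if strategy == "smallest":
        return min(boxes, key=box_area)
    if strategy == "center":
        if image_size is None:
            raise ValueError("center strategy needs image_size=(width, height)")
        cx, cy = image_size[0] / 2, image_size[1] / 2

        def squared_distance(box):
            bx, by = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
            return (bx - cx) ** 2 + (by - cy) ** 2

        return min(boxes, key=squared_distance)
    raise ValueError(f"unknown strategy: {strategy}")
```

Passing a seed to the random strategy keeps the choice reproducible, which matters when the same scene is rerun to compare settings.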
uv run find-alan-insert-detected --help
The command detects people in a scene image using YOLO, then uses FLUX (FLUX.2-Klein) inpainting to replace one of them with a given figure image.
For example, to generate the Alan in Venice example:
uv run find-alan-insert-detected --scene examples/final_result.png --figure examples/alan_cartoon.png --yolo-model yolov8s-worldv2 --detection-classes person --seed 111
What this specific command does:
--scene examples/final_result.png # the background/scene to modify
--figure examples/alan_cartoon.png # the person/figure to insert
--yolo-model yolov8s-worldv2 # YOLO model for detection
--detection-classes person # detect people
Pipeline steps:
1. Detect — runs YOLOv8 (yolov8s-worldv2) on the scene to find all bounding boxes matching person
2. Select — picks one target person (at random)
3. Pad bbox — expands the bounding box by 20% on all sides to give inpainting context
4. Crop — extracts that padded region from the scene
5. Inpaint — loads FLUX.2-Klein (~13 GB) and runs inpainting on the crop, using the reference image to condition what gets inserted
6. Composite — resizes the inpainted crop back and blends it into the original scene with a feathered mask at the edges for smooth transitions
7. Save — writes the final image to `examples/<figure>_<scene>_<seed>.png`
Net effect: one person in final_result.png is swapped out for Alan (the cartoon figure), seamlessly composited back into the original scene.
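The pad step in the pipeline can be sketched as expanding the box by a fraction of its own size and clamping it to the image. The sketch treats the fraction as a parameter and assumes padding is measured relative to the box width and height; `pad_bbox` is a hypothetical name.

```python
def pad_bbox(box, fraction, image_size):
    """Expand an (x1, y1, x2, y2) box by a fraction of its size on every side.

    The result is clamped to the image so the crop never leaves the scene.
    """
    x1, y1, x2, y2 = box
    width, height = image_size
    pad_x = round((x2 - x1) * fraction)
    pad_y = round((y2 - y1) * fraction)
    return (
        max(0, x1 - pad_x),
        max(0, y1 - pad_y),
        min(width, x2 + pad_x),
        min(height, y2 + pad_y),
    )
```

Clamping matters for people detected near the border of the scene: the padded crop shrinks on the clipped side rather than reading outside the image, so the crop handed to the inpainting step may be slightly off-centre there.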