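We have two broad types of upscaling available to us: single-image super-resolution (SISR), which upscales each frame independently, and video super-resolution (VSR), which also uses neighboring frames to keep the output temporally consistent.
Think of it as StreamDiffusion v1 (SISR) vs. StreamDiffusion v2 (VSR).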
For the specific use case of real-time AI video streaming, open-source solutions fall into two camps: high fidelity (too slow or expensive) or high speed (low stability).
The Bottleneck: Real-time video generation models already consume massive compute and are not fast; the LongLive model reaches only 20 FPS on an H100. Adding a heavy upscaler on top adds significant latency.
The Trap: Using standard image upscalers (SISR) on video causes flickering and a pixelated look, so the AI video appears unstable and falls well short of the quality the model would deliver if it generated at the higher resolution directly.
Decision Matrix:
For Scope running locally, on Windows only, we can integrate NVIDIA upscalers the way TouchDesigner does. From my research, these upscalers appear to be available for free only on Windows; on Linux they require enterprise software. It is also not clear whether NVIDIA upscalers perform well on AI-generated video.
For Scope cloud on H100 SXM, the only option for good quality is FlashVSR, but it comes at the cost of an FPS drop (the sections below cover how much). If an FPS drop is not acceptable, RealESRGANx2 is the best option. Note: RealESRGANx2 is the model we currently use in StreamDiffusion v1 (daydream.live), and users report that its output quality isn't great.
For Scope cloud on RTX 5090, we are stuck with RealESRGANx2.
Note: Most of the research and models out there target 4x scaling rather than 2x. In addition, BasicVSR++, RealBasicVSR, and other real-time VSR solutions are no longer maintained and have proven extremely difficult to set up and install, since their last commits are roughly 3-4 years old.
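As a compact restatement, the decision matrix above can be sketched as a small lookup. This is only an illustration: the function name `pick_upscaler`, its parameters, and the string labels are hypothetical and do not correspond to any existing Scope API.

```python
def pick_upscaler(deployment: str, gpu: str = "", allow_fps_drop: bool = False) -> str:
    """Return the upscaler suggested by the decision matrix.

    All names and labels here are hypothetical; they only encode the
    three cases listed in the notes.
    """
    if deployment == "local":
        # Local Scope: NVIDIA upscalers, free only on Windows per the notes,
        # and unverified on AI-generated video.
        return "NVIDIA upscaler (Windows only)"
    if deployment == "cloud" and gpu == "H100 SXM":
        # FlashVSR gives better quality but costs FPS.
        return "FlashVSR" if allow_fps_drop else "RealESRGANx2"
    if deployment == "cloud" and gpu == "RTX 5090":
        return "RealESRGANx2"
    raise ValueError(f"No recommendation for {deployment!r} on {gpu!r}")


print(pick_upscaler("cloud", "H100 SXM", allow_fps_drop=True))
print(pick_upscaler("cloud", "RTX 5090"))
```

A case not covered by the matrix raises an error instead of silently defaulting, since the notes make no recommendation for other setups.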
If we have to upscale with quality, we would ideally have two options (for this comparison, take 512x512 → 1024x1024 with the LongLive pipeline):
For 512x512 generation on an H100 with the LongLive pipeline we get about 25 FPS, which for ease of understanding we treat as a latency of 40 ms per frame.
FlashVSR upscales from 512x512 to 1024x1024 at 31.2 FPS, which we likewise treat as about 32 ms per frame.
Running the two stages back to back, we end up with about 72 ms of latency per frame, which is 13.9 FPS at 1024x1024.
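The arithmetic above can be reproduced with a short sketch. It assumes the two stages run strictly one after the other on the same GPU, so per-frame latencies simply add; the helper names `fps_to_ms` and `sequential_fps` are illustrative only.

```python
def fps_to_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a stage running at `fps`."""
    return 1000.0 / fps


def sequential_fps(*stage_fps: float) -> float:
    """Effective FPS when stages run back to back (latencies add up)."""
    total_ms = sum(fps_to_ms(f) for f in stage_fps)
    return 1000.0 / total_ms


generation_fps = 25.0   # LongLive at 512x512 on an H100 (from the notes)
upscale_fps = 31.2      # FlashVSR 512x512 -> 1024x1024 (from the notes)

print(f"generation: {fps_to_ms(generation_fps):.0f} ms")
print(f"upscaling:  {fps_to_ms(upscale_fps):.0f} ms")
print(f"combined:   {sequential_fps(generation_fps, upscale_fps):.1f} FPS")
```

With the figures from the notes this gives 40 ms and about 32 ms per stage, and roughly 13.9 FPS combined.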
[LongLive 512x512 output](attachment:4fed76ca-00d2-474c-a480-844dec3db79c:output_512x512.mp4)