Partition GPU - Search Videos

My MIG (Multi-Instance GPU) setup came just in time for testing Gemma 4 with MTP. The nice part of MIG is that I can run two isolated inference tenants on the same A100: one Gemma 4 baseline, one Gemma 4 with multi-token prediction (MTP), each pinned to its own MIG instance. Same physical GPU. Separate memory and compute partitions. Cleaner comparison. Here is my first MTP test running on MIG. MTP is twice as fast as regular Gemma 4. @googlegemma Also thanks to @vllm_project for day-0 MTP support.

1.2K views · 3 days ago

x.com · Michael Guo
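The pinning described in the post above can be sketched roughly as follows. This is a minimal illustration, not the author's actual setup: it assumes each tenant is a separate process and that each process is restricted to one MIG instance through the `CUDA_VISIBLE_DEVICES` environment variable. The UUIDs and the `tenant_env` helper are hypothetical placeholders; real MIG identifiers are listed by `nvidia-smi -L`.

```python
import os

# Hypothetical MIG instance identifiers (placeholders, not real devices).
# Real values are printed by `nvidia-smi -L` on a MIG-enabled GPU.
MIG_INSTANCES = {
    "baseline": "MIG-00000000-0000-0000-0000-000000000000",
    "mtp": "MIG-11111111-1111-1111-1111-111111111111",
}


def tenant_env(tenant, base_env=None):
    """Return a copy of the environment that exposes only one MIG instance."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA applications started with this environment see only this instance.
    env["CUDA_VISIBLE_DEVICES"] = MIG_INSTANCES[tenant]
    return env


baseline_env = tenant_env("baseline")
mtp_env = tenant_env("mtp")
```

Each environment could then be passed as the `env` argument of `subprocess.Popen` when launching the corresponding inference server, so the two tenants never share a partition.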

Sharding a massive AI model onto a microchip cluster often forces a tradeoff between saving memory on weights and saving memory on prompts. When training a system on an 8-GPU node, the network cannot fit on one chip. You must partition it. Tensor parallelism divides the model's learned weights across chips, while sequence parallelism divides the input text. Applying both at once usually requires grid layouts that waste bandwidth. In "Folding Tensor and Sequence Parallelism," Zyphra researchers mer

69 views · 3 days ago

x.com · AI Explainer Videos
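The two partitioning styles named in the snippet above can be illustrated with a toy, single-process sketch. It assumes one linear layer applied independently to each token and uses nested lists in place of real tensors and devices; the function names are illustrative only. It does not show the folding method from the Zyphra paper, which the snippet truncates before describing.

```python
def matmul(x, w):
    """Multiply a (tokens x k) matrix by a (k x cols) matrix, both nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def tensor_parallel(x, w, n_shards):
    """Split the weight columns across shards; every shard sees all tokens."""
    step = len(w[0]) // n_shards  # assumes the column count divides evenly
    parts = []
    for s in range(n_shards):
        w_shard = [row[s * step:(s + 1) * step] for row in w]
        parts.append(matmul(x, w_shard))
    # Concatenate the per-shard output columns for each token.
    return [sum((p[i] for p in parts), []) for i in range(len(x))]


def sequence_parallel(x, w, n_shards):
    """Split the tokens across shards; every shard holds the full weights."""
    step = len(x) // n_shards  # assumes the token count divides evenly
    out = []
    for s in range(n_shards):
        out.extend(matmul(x[s * step:(s + 1) * step], w))
    return out


x = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 tokens, 2 features each
w = [[1, 0, 2, 1], [0, 1, 1, 3]]      # 2 inputs, 4 outputs
full = matmul(x, w)
```

Both `tensor_parallel(x, w, 2)` and `sequence_parallel(x, w, 2)` reproduce `full`. The difference is what each shard must store: a slice of the weights in the first case, a slice of the input in the second.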