This notebook provides a step-by-step guide to optimizing gpt-oss models with NVIDIA's TensorRT-LLM for high-performance inference. TensorRT-LLM provides an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that orchestrate inference execution in a performant way.
TensorRT-LLM supports both models:
gpt-oss-20b
gpt-oss-120b
In this guide, we will run gpt-oss-20b. If you want to try the larger model or want more customization, refer to this deployment guide.
Note: Your input prompts should use the harmony response format for the model to work properly, although the simple examples in this guide do not strictly require it.
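For reference, one way to produce a harmony-formatted prompt is with the openai-harmony Python package. The snippet below is a minimal sketch; the package, class, and method names come from that library and are independent of TensorRT-LLM.

# Minimal sketch: rendering a harmony-formatted prompt with the openai-harmony
# package (pip install openai-harmony). Names below come from that library.
from openai_harmony import (
    Conversation,
    HarmonyEncodingName,
    Message,
    Role,
    load_harmony_encoding,
)

encoding = load_harmony_encoding(HarmonyEncodingName.HARMONY_GPT_OSS)

conversation = Conversation.from_messages([
    Message.from_role_and_content(Role.USER, "Summarize TensorRT-LLM in one sentence."),
])

# Token IDs to feed the model as the prefill for an assistant completion.
prefill_ids = encoding.render_conversation_for_completion(conversation, Role.ASSISTANT)
print(prefill_ids[:16])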
You can simplify the environment setup by using NVIDIA Brev. Click the button below to launch this project on a Brev instance with the necessary dependencies pre-configured.
Once deployed, click the "Open Notebook" button to get started with this guide.
To run the gpt-oss-20b model, you will need an NVIDIA GPU with at least 20 GB of VRAM.
Recommended GPUs: NVIDIA Hopper (e.g., H100, H200), NVIDIA Blackwell (e.g., B100, B200), NVIDIA RTX PRO, NVIDIA RTX 50 Series (e.g., RTX 5090).
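As a quick sanity check before you start, you can query the visible GPU from Python. This is a minimal sketch using PyTorch's CUDA utilities; it is not required by TensorRT-LLM, and the 20 GB threshold simply mirrors the requirement above.

# Sketch: check the visible GPU and its memory with PyTorch.
import torch

assert torch.cuda.is_available(), "No CUDA-capable GPU visible to PyTorch."
props = torch.cuda.get_device_properties(0)
vram_gb = props.total_memory / (1024 ** 3)
print(f"GPU: {props.name}, VRAM: {vram_gb:.1f} GB")

# gpt-oss-20b needs roughly 20 GB of VRAM.
if vram_gb < 20:
    print("Warning: less than 20 GB of VRAM; gpt-oss-20b may not fit on this GPU.")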
There are multiple ways to install TensorRT-LLM. In this guide, we'll cover using a pre-built Docker container from NVIDIA NGC as well as building from source.
If you're using NVIDIA Brev, you can skip this section.
Pull the pre-built TensorRT-LLM container for GPT-OSS from NVIDIA NGC. This is the easiest way to get started and ensures all dependencies are included.
docker pull nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
docker run --gpus all -it --rm -v $(pwd):/workspace nvcr.io/nvidia/tensorrt-llm/release:gpt-oss-dev
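Once you are inside the container, running gpt-oss-20b with the TensorRT-LLM Python LLM API might look like the sketch below. Treat the model identifier and sampling settings as illustrative assumptions and see the deployment guide for the options that apply to your setup.

# Minimal sketch of the TensorRT-LLM Python LLM API (run inside the container).
# The model identifier and sampling settings are illustrative assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-20b")  # downloads weights and prepares the engine on first run

prompts = ["Explain what TensorRT-LLM does in one sentence."]
sampling_params = SamplingParams(max_tokens=128, temperature=0.7)

for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)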
Alternatively, you can build the TensorRT-LLM container from source. This approach is useful if you want to modify the source code or use a custom branch. For detailed instructions, see the official documentation.
TensorRT-LLM will also be available through pip soon.
Note on GPU Architecture: The first time you run the model, TensorRT-LLM will build an optimized engine for your specific GPU architecture (e.g., Hopper, Ada, or Blackwell). If you see warnings about your GPU's CUDA capability (e.g., sm_90, sm_120) not being compatible with the PyTorch installation, ensure you have the latest NVIDIA drivers and a matching CUDA Toolkit version for your version of PyTorch.
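If you want to confirm that your PyTorch build actually supports your GPU's compute capability, a quick check like the sketch below can help; the list of supported architectures depends on how your PyTorch wheel was built.

# Sketch: compare the GPU's compute capability with the architectures
# this PyTorch build was compiled for.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Device compute capability: sm_{major}{minor}")
print(f"PyTorch CUDA version: {torch.version.cuda}")
print(f"Architectures compiled into this PyTorch build: {torch.cuda.get_arch_list()}")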