Instructions to use arcee-ai/Trinity-Nano-Preview-NVFP4 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use arcee-ai/Trinity-Nano-Preview-NVFP4 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="arcee-ai/Trinity-Nano-Preview-NVFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Trinity-Nano-Preview-NVFP4", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Trinity-Nano-Preview-NVFP4", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use arcee-ai/Trinity-Nano-Preview-NVFP4 with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "arcee-ai/Trinity-Nano-Preview-NVFP4"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "arcee-ai/Trinity-Nano-Preview-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/arcee-ai/Trinity-Nano-Preview-NVFP4
- SGLang
How to use arcee-ai/Trinity-Nano-Preview-NVFP4 with SGLang:
Install from pip and serve model
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "arcee-ai/Trinity-Nano-Preview-NVFP4" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "arcee-ai/Trinity-Nano-Preview-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker images
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "arcee-ai/Trinity-Nano-Preview-NVFP4" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "arcee-ai/Trinity-Nano-Preview-NVFP4",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
- Docker Model Runner
How to use arcee-ai/Trinity-Nano-Preview-NVFP4 with Docker Model Runner:
docker model run hf.co/arcee-ai/Trinity-Nano-Preview-NVFP4
Trinity Nano Preview NVFP4
Trinity Nano Preview is a preview of Arcee AI's 6B MoE model with 1B active parameters. It is the small model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.
This is a chat-tuned model with a delightful personality and charm that we think users will love. It pushes the limits of sparsity in small language models, with only 800M non-embedding parameters active per token, and as such may be unstable in certain use cases, especially in this preview.
This is an experimental release: it's fun to talk to, but it will not be hosted anywhere, so download it and try it out yourself!
Trinity Nano Preview is trained on 10T tokens gathered and curated through a key partnership with Datology, building upon the excellent dataset we used on AFM-4.5B with additional math and code.
Training was performed on a cluster of 512 H200 GPUs powered by Prime Intellect using HSDP parallelism.
More details, including key architecture decisions, can be found on our blog.
This repository contains the NVFP4 quantized weights of Trinity-Nano-Preview for deployment on NVIDIA Blackwell GPUs.
Model Details
- Model Architecture: AfmoeForCausalLM
- Parameters: 6B, 1B active
- Experts: 128 total, 8 active, 1 shared
- Context length: 128k
- Training Tokens: 10T
- License: OpenMDW-1.1
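To make the sparsity figures above concrete, here is a small back-of-envelope calculation that uses only the numbers listed in this card. The variable names are illustrative, and counting the shared expert as always active on top of the 8 routed experts is an assumption based on the usual meaning of "shared expert", not something this card states.

```python
# Back-of-envelope sparsity figures from the Model Details list.
TOTAL_PARAMS = 6e9          # "6B"
ACTIVE_PARAMS = 1e9         # "1B active"
ROUTED_EXPERTS = 128        # "128 total"
ACTIVE_ROUTED_EXPERTS = 8   # "8 active"
SHARED_EXPERTS = 1          # "1 shared" (assumed to run for every token)

# Fraction of all parameters that take part in one forward pass per token.
active_param_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Fraction of routed experts selected per token.
routed_expert_fraction = ACTIVE_ROUTED_EXPERTS / ROUTED_EXPERTS

# Experts evaluated per token under the assumption above.
experts_per_token = ACTIVE_ROUTED_EXPERTS + SHARED_EXPERTS

print(f"active parameter fraction: {active_param_fraction:.1%}")
print(f"routed experts used per token: {routed_expert_fraction:.2%}")
print(f"experts evaluated per token: {experts_per_token}")
```

Roughly one sixth of the parameters are active per token, and only 8 of the 128 routed experts are selected, which is the sparsity the description above refers to.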
Quantization Details
- Scheme: NVFP4 (nvfp4_mlp_only: MLP/expert weights only, attention remains BF16)
- Tool: NVIDIA ModelOpt
- Calibration: 512 samples, seq_length=2048, all-expert calibration enabled
- KV cache: Not quantized
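For intuition about what NVFP4 does to the MLP/expert weights, the following is a simplified pure-Python sketch of block-scaled 4-bit rounding. It is not the ModelOpt implementation: it assumes 16-element blocks and the E2M1 value grid, and it keeps each block scale in full precision, whereas the real format stores block scales in FP8 alongside a per-tensor scale. The function names `quantize_block` and `fake_quantize` are hypothetical.

```python
import math

# Simplified NVFP4-style fake quantization (illustrative only).
# Assumptions: 16-element blocks, E2M1 magnitudes, full-precision block scales.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16


def quantize_block(block):
    """Round one block to the signed E2M1 grid using a single shared scale."""
    max_abs = max(abs(x) for x in block)
    if max_abs == 0.0:
        return [0.0] * len(block)
    scale = max_abs / E2M1_GRID[-1]  # map the largest magnitude onto 6.0
    out = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(weights):
    """Quantize and immediately dequantize a flat list of weights, block by block."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        out.extend(quantize_block(weights[start:start + BLOCK_SIZE]))
    return out


weights = [math.sin(i) for i in range(32)]
dequantized = fake_quantize(weights)
max_error = max(abs(a - b) for a, b in zip(weights, dequantized))
print(f"max abs error: {max_error:.4f}")
```

Because each block carries its own scale, one large weight only coarsens the rounding of its 15 neighbours rather than the whole tensor. Attention weights and the KV cache skip this step entirely in this checkpoint, as listed above.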
Running with vLLM
Requires vLLM >= 0.18.0. Native FP4 compute requires Blackwell GPUs; older GPUs fall back to Marlin weight decompression automatically.
Blackwell GPUs (B200/B300/GB300) — Docker (recommended)
docker run --runtime nvidia --gpus all -p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:v0.18.0-cu130 \
arcee-ai/Trinity-Nano-Preview-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--max-model-len 8192
Hopper GPUs (H100/H200) and others
vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
--trust-remote-code \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--host 0.0.0.0 \
--port 8000
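Once a server started with one of the commands above is listening on port 8000, it can be queried from Python as well as from curl. The sketch below uses only the standard library and the OpenAI-compatible `/v1/chat/completions` route; `BASE_URL`, `build_chat_request`, and `chat` are illustrative names, and the host, port, and `max_tokens` value are assumptions to adjust for your setup.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: server is local and uses --port 8000
MODEL = "arcee-ai/Trinity-Nano-Preview-NVFP4"


def build_chat_request(prompt, max_tokens=128):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the request does not touch the network; chat() does.
request = build_chat_request("Who are you?")
print(request.get_method(), request.full_url)
```

With the server running, `chat("Who are you?")` returns the assistant's reply as a string.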
Note (Blackwell pip installs): If installing vLLM via pip on Blackwell rather than using Docker, native FP4 kernels may produce incorrect output due to package version mismatches. As a workaround, force the Marlin backend:
export VLLM_NVFP4_GEMM_BACKEND=marlin
vllm serve arcee-ai/Trinity-Nano-Preview-NVFP4 \
--trust-remote-code \
--moe-backend marlin \
--gpu-memory-utilization 0.90 \
--max-model-len 8192 \
--host 0.0.0.0 \
--port 8000
Marlin decompresses FP4 weights to BF16 for compute, providing the full memory compression benefit (~3.7× vs BF16) but not native FP4 compute speedup. On Hopper GPUs (H100/H200), Marlin is selected automatically and no extra flags are needed.
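If you launch the server from a script, the two variants above (default backend selection versus the forced Marlin workaround) can be assembled from one helper. This is a minimal sketch that uses only the flags and the environment variable shown in this card; the helper name `build_vllm_launch` and its parameters are hypothetical.

```python
import os
import shlex

MODEL = "arcee-ai/Trinity-Nano-Preview-NVFP4"


def build_vllm_launch(force_marlin=False, max_model_len=8192, port=8000):
    """Return (env, argv) for `vllm serve`, mirroring the commands in this card."""
    env = dict(os.environ)
    argv = ["vllm", "serve", MODEL, "--trust-remote-code"]
    if force_marlin:
        # Workaround for Blackwell pip installs described in the note above.
        env["VLLM_NVFP4_GEMM_BACKEND"] = "marlin"
        argv += ["--moe-backend", "marlin"]
    argv += [
        "--gpu-memory-utilization", "0.90",
        "--max-model-len", str(max_model_len),
        "--host", "0.0.0.0",
        "--port", str(port),
    ]
    return env, argv


env, argv = build_vllm_launch(force_marlin=True)
print(shlex.join(argv))
# To actually start the server: subprocess.run(argv, env=env, check=True)
```

On Hopper, or on Blackwell with the Docker image, calling `build_vllm_launch()` without `force_marlin` matches the plain command, since no extra flags are needed there.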
License
Trinity-Nano-Preview-NVFP4 is released under the OpenMDW-1.1 license.
Model tree for arcee-ai/Trinity-Nano-Preview-NVFP4
Base model
arcee-ai/Trinity-Nano-Base-Pre-Anneal