Mar 5 - 'Final' Update: iMatrix + Benchmarks + New quant algo

#31
by danielhanchen - opened
  • All GGUFs now use our new imatrix data. You should see improvements in chat, coding, long-context, and tool-calling use cases.

  • GGUFs updated with an improved quantization algorithm.

  • The rest of the variants, such as Q8_0, Q4_K_M, and BF16, are now uploaded.

  • Updated with a fixed chat template for improved tool-calling & coding performance!

  • Replaced BF16 layers with F16 for faster inference on devices without BF16 support.

  • See our new benchmarks for 122B-A10B here.
    (122B-A10B final benchmark chart)

  • The think toggle for Qwen3.5 is now available in LM Studio. See our guide for instructions.

  • Please follow the correct instructions / settings in our guide here.

Fine-tuning and RL Qwen3.5

danielhanchen pinned discussion

Thank you for your awesome work.

Thank you for your work. I have the previous UD-Q8_K_XL with file size at 37GB. Given the massive increase in file size for the latest UD-Q8_K_XL, are there any benefits in upgrading? I have ~100GB VRAM. Would it be possible for you to add a line for your previous quants so users can decide whether they need to upgrade or not?

I mean add a data line to your graph above

Sadly, the increase in size means that Q4_K_XL no longer fits on a single RTX 3090.

It technically does, but my CTX dropped from 150K to 50K at FP16.

Edit:
standard Q4_K_M = 57K CTX on my system. Looks like we'd have to drop all the way down to Q4_K_S.
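
For anyone wanting to sanity-check the context drop on their own card, the arithmetic is just: whatever VRAM the weights and runtime overhead don't use is what's left for the KV cache. A back-of-the-envelope sketch (every number below is a placeholder, not a measured value for this model):

GIB = 1024**3

total_vram       = 24 * GIB      # e.g. a single RTX 3090
runtime_overhead = 1.5 * GIB     # CUDA context, compute buffers, etc. (rough guess)
kv_bytes_per_tok = 96 * 1024     # hypothetical FP16 KV-cache cost per token

def max_ctx(weights_gib: float) -> int:
    """Tokens of context that fit after loading a quant of the given size."""
    budget = total_vram - weights_gib * GIB - runtime_overhead
    return int(budget // kv_bytes_per_tok)

print(max_ctx(17.5))  # smaller (old) quant -> more room for context
print(max_ctx(20.5))  # larger (new) quant  -> less room for context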

Amazing work. Are there planned MLX releases for all of the models up to 397B? Should MLX be used, or is GGUF better overall?

Sadly, the increase in size means that Q4_K_XL no longer fits on a single RTX 3090.

You can load a smaller 4-bit quant.

That is why I was asking for comparisons between the prior quants and the latest ones, so people can decide whether they should upgrade or drop down a quant tier.

UD-Q4_K_L was not updated for some reason, and it's kind of lame having to drop down to a standard quant just because the size has increased.

UD_Q8_XL is 12 GB larger than the usual Q8?? It's not Q8 anymore, it's Q12 ^^
Meanwhile, Q4_K_M is the same size as UD_Q4_XL... hmmm

Great work.
Very notable increase in quality.

UD_Q8_XL is 12 GB larger than the usual Q8?? It's not Q8 anymore, it's Q12 ^^
Meanwhile, Q4_K_M is the same size as UD_Q4_XL... hmmm

Yeah, I hate to say it but increasing quality by practically shifting to a whole different quant bracket is a mixed bag.
I went from running UD_Q4_K_XL at 150K CTX to regular Q4_K_S at 94K.

Unfortunately, I also accidentally saved the new K_XL version over the old one, or I'd probably go back.

I think the previous version of UD-Q8_K_XL was working better for me. The new UD-Q8_K_XL is substantially slower (I was managing 50 tokens/second despite offloading) and seems less stable, for me.

I wasn't sure with files of these sizes, but it looks like you can actually still download the old versions.

"Commit history"
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commits/main

Find it in the confusingly named list of "Upload folder using huggingface_hub" commits.
Clicking one will show something like:

Files changed (1)
Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf

(Remember to go back to when the old version was uploaded, not the new one)
Then "Browse files" and you can re-download from there.

That's brilliant, thank you.

The new version is 50% slower on my Apple machine... perhaps this is the trade-off we deal with every day: speed or accuracy.


Is this only for Qwen3.5, or is there an optimization like this we can apply to the other models too?

Thank you for your work. I have the previous UD-Q8_K_XL with file size at 37GB. Given the massive increase in file size for the latest UD-Q8_K_XL, are there any benefits in upgrading? I have ~100GB VRAM. Would it be possible for you to add a line for your previous quants so users can decide whether they need to upgrade or not?

I'm also searching the whole internet for this. I can't fully understand what problem was solved, such that to keep the same footprint as before (fitting on 48 GB VRAM systems) we need to drop down to Q6_K_XL, possibly losing accuracy. Also, the graph has no old Q8 to compare against the updated Q6_K_XL and Q8_K_XL.

The answer is right there in the first post. I am not sure what you are looking for across the whole internet.

Accuracy is increased, and XL now represents the highest accuracy at that particular bit width. They have reduced the worst-case KLD in each quant, making it more accurate at the cost of size. Assuming the new 6-bit delivers the same accuracy as the old 8-bit did, what is the harm in using 6 bits? The 4-bit ones now have many more UD weights as well.
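
For anyone unfamiliar with KLD: it measures how far the quantized model's next-token probabilities drift from the full-precision model's, so lower is better. A toy sketch of the calculation (the probabilities are made up for illustration, not measurements from these quants):

import numpy as np

# Next-token probabilities for the same prompt; values are invented for illustration.
p_ref   = np.array([0.70, 0.20, 0.07, 0.03])   # full-precision reference model
q_quant = np.array([0.65, 0.23, 0.08, 0.04])   # quantized model

# KL divergence D(P || Q): 0 means the quant reproduces the reference exactly.
kld = np.sum(p_ref * np.log(p_ref / q_quant))
print(f"KLD = {kld:.5f}")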

The problem is that we don't know whether the new 6-bit one is equivalent to the prior 8-bit one.

You just made an argument quoting a chart that doesn't compare against the older revision, only against quants provided by other teams. What we are asking for is a comparison, or a statement that a given quant is equal to, better than, or worse than the corresponding one from the old revision. There is a statement that accuracy has increased; yes, we can believe and understand that, but if there is nothing to compare it against, the argument quickly falls apart. Maybe it's just me. Who knows...


Hello! I still find that the model does weird stuff. For example, if I ask it "What's the V4 address assigned to MID-123 in PREPROD?", it tries:

Assistant sent this info to MTAuMTAzLj
{
  "repo": "Binaries_PREPROD",
  "filename": "docsAndTags.csv",
  "term": "MID-123_123_123_123_300158654701786317483507641347805623419"
}
Result
{
  "results": []
}

The previous search didn't find anything. Let me try a simpler search term for MID-123. I should search just "MID-123" in the docsAndTags.csv file.

However, looking at the instructions more carefully, I see that when searching for MID-to-Package mapping, I need to use the exact format from the CSV. Let me try searching with just "MID-123".


Ran get_search
Assistant sent this info to MTAuMTAzLj
{
  "repo": "Binaries_PREPROD",
  "filename": "docsAndTags.csv",
  "term": "MID-29_29_29_29_300158654701786317483507641347805623419"
}
Result
{
  "results": []
}

As you can see, it's not able to keep MID-123 in the search term; it substitutes random digits.

I'm using Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

  • llama.cpp server config:
[Unit]
Description=llama.cpp Qwen3-35B Server
After=network.target

[Service]
User=root
Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=0
Environment=GGML_CUDA_GRAPH_OPT=0
WorkingDirectory=/var/opt/lib/co/llama.cpp.cuda
ExecStart=/var/opt/lib/co/llama.cpp.cuda/build/bin/llama-server \
  --threads 22 \
  --threads-batch 8 \
  --jinja \
  --flash-attn on \
  --model /root/models/qwen3-35b/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 70000 \
  --host 0.0.0.0 \
  --n-cpu-moe 5 \
  --batch-size 8192 \
  --ubatch-size 4096 \
  --port 8050 \
  --cache-ram 0 \
  --temp 0.6 \
  --top-p 0.90 \
  --top-k 20 \
  --min-p 0.00

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
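
In case it helps with reproducing, here is a minimal sketch that sends a single tool-enabled request to the OpenAI-compatible endpoint of the server above (port 8050, with --jinja enabled). The tool schema is reconstructed from the log above, so treat it as an approximation rather than the real tool definition:

import json
import requests

# Minimal tool-calling request against the llama-server started above.
# The get_search schema below is inferred from the posted log, not the actual definition.
payload = {
    "model": "qwen3.5-35b",
    "messages": [
        {"role": "user", "content": "What's the V4 address assigned to MID-123 in PREPROD?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_search",
            "description": "Search a file in a repo for a term.",
            "parameters": {
                "type": "object",
                "properties": {
                    "repo":     {"type": "string"},
                    "filename": {"type": "string"},
                    "term":     {"type": "string"},
                },
                "required": ["repo", "filename", "term"],
            },
        },
    }],
    "temperature": 0.6,
}

resp = requests.post("http://localhost:8050/v1/chat/completions", json=payload, timeout=300)
# Inspect the returned message (including any tool_calls the model produced).
print(json.dumps(resp.json()["choices"][0]["message"], indent=2))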

For some reason, the sizes in the benchmark table (https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-4-march-5th-2026-update-more-robustness) do not match the currently uploaded GGUFs.

Is the new Q4_K_XL better than the old Q5_K_XL?

I feel I've posted too much on this topic, and I've spent more time than I'd like testing models... but one more update:

After the autoparser PR was merged into llama.cpp the other day, my scores all jumped a bit higher, and Q4_K_S is now performing quite well.
It still feels weird to "downgrade" from UD-Q4_K_XL, but I've hit a happy medium for the time being that combines the performance I was getting with being up to date.

The only real downside is that my cache has dropped below what I had before, but it's still 80K-100K, and accuracy usually diminishes at high context levels anyway.

@usbphone

The only real downside is that my cache has dropped below what I had before, but it's still 80K-100K, and accuracy usually diminishes at high context levels anyway.

What KV cache quantization do you use?

Noticed that general.file_type=Q8_0 for Qwen3.5-35B-A3B-UD-Q6_K_S.gguf

I am using Zed with llama.cpp (Windows + ROCm + 7900 XTX), and I still get this weird <tool_call> inside of thinking blocks, which stops agentic coding interactions. Is anyone else seeing the same error, or does anyone have ideas on how to fix it?

upd: Latest llama.cpp release from here https://github.com/ggml-org/llama.cpp/releases/tag/b8533 and the latest version of this model.
upd2: Qwen3.5-35B-A3B.Q4_K_M has the issue.
upd3: Qwen3.5-27B.Q5_K_M seems to work fine.
upd4: I see, it seems this is a bug that was only partially fixed; see the discussion here: https://github.com/ggml-org/llama.cpp/issues/20837
