Mar 5 - 'Final' Update: iMatrix + Benchmarks + New quant algo

#31
by danielhanchen - opened
  • All GGUFs now use our new imatrix data. You should see improvements in chat, coding, long-context, and tool-calling use cases.

  • GGUFs updated with an improved quantization algorithm.

  • The rest of the variants, such as Q8_0, Q4_K_M, and BF16, are now uploaded.

  • Updated with a fixed chat template for improved tool-calling & coding performance!

  • Replaced BF16 layers with F16 for faster inference on devices without BF16 support.

  • See our new benchmarks for 122B-A10B here.
    (122B-A10B final benchmark chart)

  • The think toggle for Qwen3.5 is now available in LM Studio. See our guide for instructions.

  • Please follow the correct instructions / settings in our guide here.

Fine-tuning and RL Qwen3.5

danielhanchen pinned discussion

Thank you for your awesome work.

Thank you for your work. I have the previous UD-Q8_K_XL with file size at 37GB. Given the massive increase in file size for the latest UD-Q8_K_XL, are there any benefits in upgrading? I have ~100GB VRAM. Would it be possible for you to add a line for your previous quants so users can decide whether they need to upgrade or not?

I mean add a data line to your graph above

Sadly, the increase in size means that Q4_K_XL no longer fits on a single RTX 3090.

It technically does, but my CTX dropped from 150K to 50K at FP16.

Edit:
standard Q4_K_M = 57K CTX on my system. Looks like we'd have to drop all the way down to Q4_K_S.
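
For anyone wanting to sanity-check the context drop on their own card, the arithmetic is just: whatever VRAM the weights and runtime overhead don't use is what's left for the KV cache. A back-of-the-envelope sketch (every number below is a placeholder, not a measured value for this model):

GIB = 1024**3

total_vram       = 24 * GIB      # e.g. a single RTX 3090
runtime_overhead = 1.5 * GIB     # CUDA context, compute buffers, etc. (rough guess)
kv_bytes_per_tok = 96 * 1024     # hypothetical FP16 KV-cache cost per token

def max_ctx(weights_gib: float) -> int:
    """Tokens of context that fit after loading a quant of the given size."""
    budget = total_vram - weights_gib * GIB - runtime_overhead
    return int(budget // kv_bytes_per_tok)

print(max_ctx(17.5))  # smaller (old) quant -> more room for context
print(max_ctx(20.5))  # larger (new) quant  -> less room for context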

Amazing work. Are there planned MLX releases for all of the models up to 397B? Should MLX be used, or is GGUF better overall?

Sadly, the increase in size means that Q4_K_XL no longer fits on a single RTX 3090.

You can load a smaller 4-bit quant.

That is why I was asking for comparisons between the prior quants and the latest ones, so people can decide whether they should upgrade or drop down a quant tier.

UD-Q4_K_L was not updated for some reason, and it's kind of lame having to drop down to a standard quant just because the size has increased.

UD_Q8_XL is 12 GB larger than the usual Q8?? It's not Q8 anymore, it's Q12 ^^
Meanwhile, Q4_K_M is the same size as UD_Q4_XL... hmmm

Great work.
Very notable increase in quality.

UD_Q8_XL is 12 GB larger than the usual Q8?? It's not Q8 anymore, it's Q12 ^^
Meanwhile, Q4_K_M is the same size as UD_Q4_XL... hmmm

Yeah, I hate to say it but increasing quality by practically shifting to a whole different quant bracket is a mixed bag.
I went from running UD_Q4_K_XL at 150K CTX to regular Q4_K_S at 94K.

Unfortunately, I also accidentally saved the new K_XL version over the old one, or I'd probably go back.

I think the previous version of UD-Q8_K_XL was working better for me. The new UD-Q8_K_XL is substantially slower (I was managing 50 tokens/second despite offloading) and seems less stable, for me.

I wasn't sure with files of these sizes, but it looks like you can actually still download the old versions.

"Commit history"
https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF/commits/main

Find it in the confusingly named list of "Upload folder using huggingface_hub" commits.
Clicking one will show something like:

Files changed (1)
Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf

(Remember to go back to when the old version was uploaded, not the new one)
Then "Browse files" and you can re-download from there.

That's brilliant, thank you.

The new version is 50% slower on my Apple machine... perhaps this is the trade-off we deal with every day: speed or accuracy.


Is this only for Qwen3.5, or is there an optimization like this we can apply to the other models too?

Thank you for your work. I have the previous UD-Q8_K_XL with file size at 37GB. Given the massive increase in file size for the latest UD-Q8_K_XL, are there any benefits in upgrading? I have ~100GB VRAM. Would it be possible for you to add a line for your previous quants so users can decide whether they need to upgrade or not?

I'm also searching the whole internet for this. I can't fully understand what problem was solved, such that to keep the same footprint as before (fitting on 48 GB VRAM systems) we need to drop down to Q6_K_XL, possibly losing accuracy. Also, the graph has no old Q8 to compare against the updated Q6_K_XL and Q8_K_XL.

The answer is right there in the first post. I am not sure what you are looking for across the whole internet.

Accuracy is increased, and XL now represents the highest accuracy at that particular bit width. They have reduced the worst-case KLD in each quant, making it more accurate at the cost of size. Assuming the new 6-bit delivers the same accuracy as the old 8-bit did, what is the harm in using 6 bits? The 4-bit ones now have many more UD weights as well.
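
For anyone unfamiliar with KLD: it measures how far the quantized model's next-token probabilities drift from the full-precision model's, so lower is better. A toy sketch of the calculation (the probabilities are made up for illustration, not measurements from these quants):

import numpy as np

# Next-token probabilities for the same prompt; values are invented for illustration.
p_ref   = np.array([0.70, 0.20, 0.07, 0.03])   # full-precision reference model
q_quant = np.array([0.65, 0.23, 0.08, 0.04])   # quantized model

# KL divergence D(P || Q): 0 means the quant reproduces the reference exactly.
kld = np.sum(p_ref * np.log(p_ref / q_quant))
print(f"KLD = {kld:.5f}")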

The problem is that we don't know whether the new 6-bit one is equivalent to the prior 8-bit one.

You just made an argument quoting a chart that doesn't compare against the older revision, only against quants provided by other teams. What we are asking for is a comparison, or a statement that a given quant is equal to, better than, or worse than the corresponding one from the old revision. There is a statement that accuracy has increased; yes, we can believe and understand that, but if there is nothing to compare it against, the argument quickly falls apart. Maybe it's just me. Who knows...


Hello! I still find that the model does weird stuff. For example, if I ask it "What's the V4 address assigned to MID-123 in PREPROD?", it tries:

Assistant sent this info to MTAuMTAzLj
{
  "repo": "Binaries_PREPROD",
  "filename": "docsAndTags.csv",
  "term": "MID-123_123_123_123_300158654701786317483507641347805623419"
}
Result
{
  "results": []
}

The previous search didn't find anything. Let me try a simpler search term for MID-123. I should search just "MID-123" in the docsAndTags.csv file.

However, looking at the instructions more carefully, I see that when searching for MID-to-Package mapping, I need to use the exact format from the CSV. Let me try searching with just "MID-123".


Ran get_search
Assistant sent this info to MTAuMTAzLj
{
  "repo": "Binaries_PREPROD",
  "filename": "docsAndTags.csv",
  "term": "MID-29_29_29_29_300158654701786317483507641347805623419"
}
Result
{
  "results": []
}

As you can see, it's not able to keep MID-123 in the search term; it substitutes random digits.

I'm using Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf

  • llama.cpp server config:
[Unit]
Description=llama.cpp Qwen3-35B Server
After=network.target

[Service]
User=root
Environment=GGML_CUDA_ENABLE_UNIFIED_MEMORY=0
Environment=GGML_CUDA_GRAPH_OPT=0
WorkingDirectory=/var/opt/lib/co/llama.cpp.cuda
ExecStart=/var/opt/lib/co/llama.cpp.cuda/build/bin/llama-server \
  --threads 22 \
  --threads-batch 8 \
  --jinja \
  --flash-attn on \
  --model /root/models/qwen3-35b/Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --ctx-size 70000 \
  --host 0.0.0.0 \
  --n-cpu-moe 5 \
  --batch-size 8192 \
  --ubatch-size 4096 \
  --port 8050 \
  --cache-ram 0 \
  --temp 0.6 \
  --top-p 0.90 \
  --top-k 20 \
  --min-p 0.00

Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
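
In case it helps with reproducing, here is a minimal sketch that sends a single tool-enabled request to the OpenAI-compatible endpoint of the server above (port 8050, with --jinja enabled). The tool schema is reconstructed from the log above, so treat it as an approximation rather than the real tool definition:

import json
import requests

# Minimal tool-calling request against the llama-server started above.
# The get_search schema below is inferred from the posted log, not the actual definition.
payload = {
    "model": "qwen3.5-35b",
    "messages": [
        {"role": "user", "content": "What's the V4 address assigned to MID-123 in PREPROD?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_search",
            "description": "Search a file in a repo for a term.",
            "parameters": {
                "type": "object",
                "properties": {
                    "repo":     {"type": "string"},
                    "filename": {"type": "string"},
                    "term":     {"type": "string"},
                },
                "required": ["repo", "filename", "term"],
            },
        },
    }],
    "temperature": 0.6,
}

resp = requests.post("http://localhost:8050/v1/chat/completions", json=payload, timeout=300)
# Inspect the returned message (including any tool_calls the model produced).
print(json.dumps(resp.json()["choices"][0]["message"], indent=2))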

For some reason, the sizes in the benchmark table (https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks#id-4-march-5th-2026-update-more-robustness) do not match the currently uploaded GGUFs.

Is the new Q4_K_XL better than the old Q5_K_XL?

I feel I've posted too much on this topic, and I've spent more time than I'd like testing models... but one more update:

After the autoparser PR was merged into llama.cpp the other day, my scores all jumped a bit higher, and Q4_K_S is now performing quite well.
It still feels weird to "downgrade" from UD-Q4_K_XL, but I've hit a happy medium for the time being that combines the performance I was getting with being up to date.

The only real downside is that my cache has dropped below what I had before, but it's still 80K-100K, and accuracy usually diminishes at high context levels anyway.

@usbphone

The only real downside is that my cache has dropped below what I had before, but it's still 80K-100K, and accuracy usually diminishes at high context levels anyway.

What KV cache quantization do you use?

Noticed that general.file_type=Q8_0 for Qwen3.5-35B-A3B-UD-Q6_K_S.gguf

I am using Zed with llama.cpp (Windows + ROCm + 7900 XTX), and I still get this weird <tool_call> inside of thinking blocks, which stops agentic coding interactions. Is anyone else seeing the same error, or does anyone have ideas on how to fix it?

upd: Latest llama.cpp release from here https://github.com/ggml-org/llama.cpp/releases/tag/b8533 and the latest version of this model.
upd2: Qwen3.5-35B-A3B.Q4_K_M has the issue.
upd3: Qwen3.5-27B.Q5_K_M seems to work fine.
upd4: I see, it seems this is a bug that was only partially fixed; see the discussion here: https://github.com/ggml-org/llama.cpp/issues/20837
