Maybe an Ollama + Qwen3.5-series-specific issue?
What I would look at first is model compatibility, not disk contents and not raw VRAM. Your symptom pattern is: the Hugging Face repo resolves, the GGUF blob downloads, Ollama stores it under a SHA256 blob path, and then the model fails during the load/init phase with a local 500 Internal Server Error. Recent Ollama issues show the same pattern for Hugging Face Qwen3.5 GGUFs, including reports on 0.17.5 and 0.17.6. (GitHub)
What the error means
That blob path error does not mean “the file is missing.” It means Ollama found the local blob and then failed to initialize it. In other words:
- download worked
- manifest write worked
- model open/decode/load failed
That distinction matters because it points away from Hugging Face transport problems and toward runtime incompatibility inside Ollama. The live Qwen3.5 issue thread in Ollama says HF-downloaded qwen35/qwen35moe models can fail because their metadata layout differs from Ollama’s packaged variants, which causes decode failure and then fallback failure. (GitHub)
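One cheap way to rule out a corrupt transfer entirely: Ollama names each blob file after its SHA-256 digest, so re-hashing the file and comparing it to its own filename settles the question. A minimal sketch, assuming the default per-user store location (systemd installs often use /usr/share/ollama/.ollama/models instead):

```shell
# Re-hash every stored blob and compare against the digest in its filename.
# STORE path is an assumption -- override via OLLAMA_MODELS if yours differs.
STORE="${OLLAMA_MODELS:-$HOME/.ollama/models}"
found=0
for blob in "$STORE"/blobs/sha256-*; do
  [ -e "$blob" ] || continue
  found=1
  expected="${blob##*sha256-}"
  actual="$(sha256sum "$blob" | cut -d' ' -f1)"
  if [ "$expected" = "$actual" ]; then
    echo "OK       $blob"
  else
    echo "CORRUPT  $blob"
  fi
done
[ "$found" -eq 1 ] || echo "no blobs found under $STORE"
```

If every blob prints OK, a bad download is off the table and the 500 is squarely a load-time problem.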
Why your exact repo is a higher-risk case
Your target repo is not a simple text-only single-file package. The Hugging Face file tree labels it Image-Text-to-Text, includes mmproj-BF16.gguf, and ships separate quantized GGUF files like Q4_K_M, Q5_K_M, Q6_K, and Q8_0. The Q8_0 file is listed as 28.6 GB and the projector file is about 931 MB. That means you are dealing with a multimodal-flavored package, not just “one plain text GGUF.” (Hugging Face)
That matters because current Ollama issue reports around HF Qwen3.5 are not just about “large model too big.” They are about Qwen3.5 architecture handling and, in neighboring reports, split text/vision style packaging that does not cleanly load through the current HF import path. (GitHub)
Why I do not think your main problem is “two 4090s are not enough”
Ollama’s FAQ says that when a model fits on one GPU, it prefers one GPU. If it does not fit on one GPU, it can spread the model across all available GPUs. So dual 4090s are a valid setup for larger models, and your hardware does not immediately point to “this should hard-fail before starting.” (Ollama)
Also, your failure is happening at the load step, not after a long prompt or a giant context window. That makes a metadata or architecture problem more likely than ordinary context-memory pressure. Context settings still matter later, but they are probably not the first thing breaking here. Ollama documents that parallelism and context length scale memory use, and that Flash Attention plus KV-cache quantization are later levers for memory reduction. (Ollama)
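For reference, those later levers are just server environment settings. The variable names below are the ones Ollama documents; treat the values as a starting point, not tuned advice, and do not expect them to fix a load-time 500:

```shell
# In a systemd override for the ollama service (later memory levers, not a
# fix for a load-time failure):
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"   # options: f16 (default), q8_0, q4_0
```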
The strongest technical clue
The most important current Ollama issue for your case is the one stating that HF qwen35/qwen35moe models can have a different attention.head_count_kv representation from the Ollama library models. The issue says that causes NewTextProcessor() to fail during decode, and then the fallback path fails because support there was not merged yet. That is a very direct explanation for “download succeeds, then load fails.” (GitHub)
There is also a March 2026 duplicate report showing actual loader output that ends with:
error loading model architecture: unknown model architecture: 'qwen35'
on Ollama 0.17.6, again from a Hugging Face Qwen3.5 GGUF run. That lines up closely with your symptom family. (GitHub)
One subtle background point
Ollama’s official import docs say GGUF import is supported and show both local GGUF import and adapter workflows, but the architecture list on that page names Llama, Mistral, Gemma, and Phi3. At the same time, Ollama’s own model library now has an official qwen3.5:27b page whose metadata shows arch qwen35. That combination suggests something important: Ollama’s own packaged qwen3.5 models may work before arbitrary HF-imported qwen3.5 GGUFs do. In other words, “Ollama supports qwen3.5” and “Ollama supports every HF qwen3.5 GGUF repo through hf.co/...” are not the same statement. (Ollama)
What I think is happening in your case
My ranking would be:
- Most likely: current Ollama incompatibility with this HF Qwen3.5 GGUF import path. (GitHub)
- Also likely: the repo’s multimodal mmproj packaging makes the load path more fragile. (Hugging Face)
- Less likely: a true VRAM-size problem. (Ollama)
- Much less likely: bad filename or failed download, because the loader got far enough to attempt model initialization from the local blob. (GitHub)
What to check next
1. Look at the actual Ollama server log
On Ubuntu with systemd, Ollama’s troubleshooting page says to use:
journalctl -u ollama --no-pager --follow --pager-end
That is the single highest-value next step, because it will tell you whether the internal failure is unknown model architecture: 'qwen35', a decode error, or a GPU-init problem. (Ollama)
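To skim the log for the loader's actual complaint rather than scrolling the full stream, a quick filter like this can help (the grep patterns are my suggestion, not from the Ollama docs; widen them if your failure string differs):

```shell
# Pull the last 200 lines of the ollama unit log and keep only likely
# failure lines. Unit name "ollama" is the default systemd install name.
journalctl -u ollama --no-pager -n 200 2>/dev/null \
  | grep -iE 'error|unknown model architecture|decode' \
  || echo "no matching log lines (is the ollama systemd unit present?)"
```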
2. Turn on debug logging
Ollama’s Linux docs say you can add this systemd override:
[Service]
Environment="OLLAMA_DEBUG=1"
Then restart the service and re-run the model. That will give you better loader diagnostics than the CLI’s short 500 message. (Ollama)
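The full systemd workflow, assuming the default unit name from the Linux installer, looks like this:

```shell
sudo systemctl edit ollama      # paste the [Service] override shown above
sudo systemctl daemon-reload
sudo systemctl restart ollama
# then re-run the failing model and re-read the journalctl output
```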
3. Sanity-check Ollama with the official packaged Qwen3.5
Try:
ollama run qwen3.5:27b
Ollama’s official qwen3.5:27b page shows it is a packaged qwen35 model, updated recently, with a 17 GB Q4_K_M artifact. If this works on your machine, then your CUDA path, service, and multi-GPU environment are probably fine, and the problem is specifically the Hugging Face imported model packaging. (Ollama)
4. Do not use Q8_0 as your first test
Your target repo offers Q4_K_M at 16.5 GB, Q5_K_M at 19.2 GB, Q6_K at 22.1 GB, and Q8_0 at 28.6 GB. Even if size is not the root cause, starting with Q4_K_M or Q5_K_M removes one extra variable, and the 16.5 GB Q4_K_M should fit comfortably within a single 4090’s 24 GB. With the hf.co syntax you can usually select a quant by tag, for example hf.co/<user>/<repo>:Q4_K_M. If the smaller quant fails with the same architecture-style error, that is stronger evidence that the core problem is compatibility, not memory pressure. (Hugging Face)
5. Separate “HF import path” from “GGUF loader path”
Ollama’s import docs say you can import a local GGUF via a Modelfile using:
FROM /path/to/file.gguf
and then build it with ollama create my-model. I do not expect this to magically fix a true qwen35 loader incompatibility, because the same runtime still has to parse the model, but it is a useful diagnostic because it removes the HF fetch layer from the equation. (Ollama)
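A minimal sketch of that diagnostic, assuming you point FROM at wherever your GGUF actually lives (the path and model name below are placeholders):

```shell
# Build a local model from an on-disk GGUF, bypassing the hf.co fetch layer.
cat > Modelfile <<'EOF'
FROM /path/to/qwen3.5-q4_k_m.gguf
EOF

# Guarded so the sketch degrades gracefully where the ollama CLI is absent.
if command -v ollama >/dev/null 2>&1; then
  ollama create qwen35-local -f Modelfile
  ollama run qwen35-local "hello"
else
  echo "ollama CLI not found; install it before running the create step"
fi
```

If this fails with the same architecture error as the hf.co pull, the fetch layer is exonerated and the runtime is the problem.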
What not to chase first
I would not start by deleting and redownloading the model repeatedly. The evidence points away from a bad transfer and toward a load-time compatibility problem. (GitHub)
I would also not spend much time on the broken BF16 multipart discussion in that repo unless you specifically need BF16. Your current attempt is Q8_0, which is a different file. There is a repo discussion showing BF16 part 2 problems, but that is a separate issue and probably not your immediate blocker. (Hugging Face)
My bottom line
Your setup is probably fine. Your command syntax was accepted, the download worked, and the file landed where Ollama stores model blobs. The most likely problem is that this particular Hugging Face Qwen3.5 GGUF package is hitting a current Ollama loader/runtime gap, and the repo’s multimodal-style packaging increases the odds of that. (GitHub)
So the best next move is:
- inspect journalctl logs
- enable OLLAMA_DEBUG=1
- test ollama run qwen3.5:27b
- retry with Q4_K_M or Q5_K_M
- if you need this exact Jackrong model and Ollama still fails, move it to a backend with stronger direct GGUF support, such as llama.cpp itself