Maybe an Ollama + Qwen3.5-series-specific issue?
What I would look at first is model compatibility, not disk contents and not raw VRAM. Your symptom pattern is: the Hugging Face repo resolves, the GGUF blob downloads, Ollama stores it under a SHA256 blob path, and then the model fails during the load/init phase with a local 500 Internal Server Error. Recent Ollama issues show the same pattern for Hugging Face Qwen3.5 GGUFs, including reports on 0.17.5 and 0.17.6. (GitHub)
What the error means
That blob path error does not mean “the file is missing.” It means Ollama found the local blob and then failed to initialize it. In other words:
- download worked
- manifest write worked
- model open/decode/load failed
That distinction matters because it points away from Hugging Face transport problems and toward runtime incompatibility inside Ollama. The live Qwen3.5 issue thread in Ollama says HF-downloaded qwen35/qwen35moe models can fail because their metadata layout differs from Ollama’s packaged variants, which causes decode failure and then fallback failure. (GitHub)
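One cheap way to rule out a corrupt transfer entirely: Ollama names each blob file after its SHA-256 digest, so re-hashing the file and comparing it to its own filename settles the question. A minimal sketch, assuming the default per-user store location (systemd installs often use /usr/share/ollama/.ollama/models instead):

```shell
# Re-hash every stored blob and compare against the digest in its filename.
# STORE path is an assumption -- override via OLLAMA_MODELS if yours differs.
STORE="${OLLAMA_MODELS:-$HOME/.ollama/models}"
found=0
for blob in "$STORE"/blobs/sha256-*; do
  [ -e "$blob" ] || continue
  found=1
  expected="${blob##*sha256-}"
  actual="$(sha256sum "$blob" | cut -d' ' -f1)"
  if [ "$expected" = "$actual" ]; then
    echo "OK       $blob"
  else
    echo "CORRUPT  $blob"
  fi
done
[ "$found" -eq 1 ] || echo "no blobs found under $STORE"
```

If every blob prints OK, a bad download is off the table and the 500 is squarely a load-time problem.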
Why your exact repo is a higher-risk case
Your target repo is not a simple text-only single-file package. The Hugging Face file tree labels it Image-Text-to-Text, includes mmproj-BF16.gguf, and ships separate quantized GGUF files like Q4_K_M, Q5_K_M, Q6_K, and Q8_0. The Q8_0 file is listed as 28.6 GB and the projector file is about 931 MB. That means you are dealing with a multimodal-flavored package, not just “one plain text GGUF.” (Hugging Face)
That matters because current Ollama issue reports around HF Qwen3.5 are not just about “large model too big.” They are about Qwen3.5 architecture handling and, in neighboring reports, split text/vision style packaging that does not cleanly load through the current HF import path. (GitHub)
Why I do not think your main problem is “two 4090s are not enough”
Ollama’s FAQ says that when a model fits on one GPU, it prefers one GPU. If it does not fit on one GPU, it can spread the model across all available GPUs. So dual 4090s are a valid setup for larger models, and your hardware does not immediately point to “this should hard-fail before starting.” (Ollama)
Also, your failure is happening at the load step, not after a long prompt or a giant context window. That makes a metadata or architecture problem more likely than ordinary context-memory pressure. Context settings still matter later, but they are probably not the first thing breaking here. Ollama documents that parallelism and context length scale memory use, and that Flash Attention plus KV-cache quantization are later levers for memory reduction. (Ollama)
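For reference, those later levers are just server environment settings. The variable names below are the ones Ollama documents; treat the values as a starting point, not tuned advice, and do not expect them to fix a load-time 500:

```shell
# In a systemd override for the ollama service (later memory levers, not a
# fix for a load-time failure):
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"   # options: f16 (default), q8_0, q4_0
```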
The strongest technical clue
The most important current Ollama issue for your case is the one stating that HF qwen35/qwen35moe models can have a different attention.head_count_kv representation from the Ollama library models. The issue says that causes NewTextProcessor() to fail during decode, and then the fallback path fails because support there was not merged yet. That is a very direct explanation for “download succeeds, then load fails.” (GitHub)
There is also a March 2026 duplicate report showing actual loader output that ends with:
error loading model architecture: unknown model architecture: 'qwen35'
on Ollama 0.17.6, again from a Hugging Face Qwen3.5 GGUF run. That lines up closely with your symptom family. (GitHub)
One subtle background point
Ollama’s official import docs say GGUF import is supported and show both local GGUF import and adapter workflows, but the architecture list on that page names Llama, Mistral, Gemma, and Phi3. At the same time, Ollama’s own model library now has an official qwen3.5:27b page whose metadata shows arch qwen35. That combination suggests something important: Ollama’s own packaged qwen3.5 models may work before arbitrary HF-imported qwen3.5 GGUFs do. In other words, “Ollama supports qwen3.5” and “Ollama supports every HF qwen3.5 GGUF repo through hf.co/...” are not the same statement. (Ollama)
What I think is happening in your case
My ranking would be:
- Most likely: current Ollama incompatibility with this HF Qwen3.5 GGUF import path. (GitHub)
- Also likely: the repo’s multimodal mmproj packaging makes the load path more fragile. (Hugging Face)
- Less likely: a true VRAM-size problem. (Ollama)
- Much less likely: bad filename or failed download, because the loader got far enough to attempt model initialization from the local blob. (GitHub)
What to check next
1. Look at the actual Ollama server log
On Ubuntu with systemd, Ollama’s troubleshooting page says to use:
journalctl -u ollama --no-pager --follow --pager-end
That is the single highest-value next step, because it will tell you whether the internal failure is unknown model architecture: 'qwen35', a decode error, or a GPU-init problem. (Ollama)
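To skim the log for the loader's actual complaint rather than scrolling the full stream, a quick filter like this can help (the grep patterns are my suggestion, not from the Ollama docs; widen them if your failure string differs):

```shell
# Pull the last 200 lines of the ollama unit log and keep only likely
# failure lines. Unit name "ollama" is the default systemd install name.
journalctl -u ollama --no-pager -n 200 2>/dev/null \
  | grep -iE 'error|unknown model architecture|decode' \
  || echo "no matching log lines (is the ollama systemd unit present?)"
```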
2. Turn on debug logging
Ollama’s Linux docs say you can add this systemd override:
[Service]
Environment="OLLAMA_DEBUG=1"
Then restart the service and re-run the model. That will give you better loader diagnostics than the CLI’s short 500 message. (Ollama)
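The full systemd workflow, assuming the default unit name from the Linux installer, looks like this:

```shell
sudo systemctl edit ollama      # paste the [Service] override shown above
sudo systemctl daemon-reload
sudo systemctl restart ollama
# then re-run the failing model and re-read the journalctl output
```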
3. Sanity-check Ollama with the official packaged Qwen3.5
Try:
ollama run qwen3.5:27b
Ollama’s official qwen3.5:27b page shows it is a packaged qwen35 model, updated recently, with a 17 GB Q4_K_M artifact. If this works on your machine, then your CUDA path, service, and multi-GPU environment are probably fine, and the problem is specifically the Hugging Face imported model packaging. (Ollama)
4. Do not use Q8_0 as your first test
Your target repo offers Q4_K_M at 16.5 GB, Q5_K_M at 19.2 GB, Q6_K at 22.1 GB, and Q8_0 at 28.6 GB. Even if size is not the root cause, starting with Q4_K_M or Q5_K_M removes one extra variable, and the 16.5 GB Q4_K_M should fit comfortably within a single 4090’s 24 GB. With the hf.co syntax you can usually select a quant by tag, for example hf.co/<user>/<repo>:Q4_K_M. If the smaller quant fails with the same architecture-style error, that is stronger evidence that the core problem is compatibility, not memory pressure. (Hugging Face)
5. Separate “HF import path” from “GGUF loader path”
Ollama’s import docs say you can import a local GGUF via a Modelfile using:
FROM /path/to/file.gguf
and then build it with ollama create my-model. I do not expect this to magically fix a true qwen35 loader incompatibility, because the same runtime still has to parse the model, but it is a useful diagnostic because it removes the HF fetch layer from the equation. (Ollama)
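A minimal sketch of that diagnostic, assuming you point FROM at wherever your GGUF actually lives (the path and model name below are placeholders):

```shell
# Build a local model from an on-disk GGUF, bypassing the hf.co fetch layer.
cat > Modelfile <<'EOF'
FROM /path/to/qwen3.5-q4_k_m.gguf
EOF

# Guarded so the sketch degrades gracefully where the ollama CLI is absent.
if command -v ollama >/dev/null 2>&1; then
  ollama create qwen35-local -f Modelfile
  ollama run qwen35-local "hello"
else
  echo "ollama CLI not found; install it before running the create step"
fi
```

If this fails with the same architecture error as the hf.co pull, the fetch layer is exonerated and the runtime is the problem.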
What not to chase first
I would not start by deleting and redownloading the model repeatedly. The evidence points away from a bad transfer and toward a load-time compatibility problem. (GitHub)
I would also not spend much time on the broken BF16 multipart discussion in that repo unless you specifically need BF16. Your current attempt is Q8_0, which is a different file. There is a repo discussion showing BF16 part 2 problems, but that is a separate issue and probably not your immediate blocker. (Hugging Face)
My bottom line
Your setup is probably fine. Your command syntax was accepted, the download worked, and the file landed where Ollama stores model blobs. The most likely problem is that this particular Hugging Face Qwen3.5 GGUF package is hitting a current Ollama loader/runtime gap, and the repo’s multimodal-style packaging increases the odds of that. (GitHub)
So the best next move is:
- inspect journalctl logs
- enable OLLAMA_DEBUG=1
- test ollama run qwen3.5:27b
- retry with Q4_K_M or Q5_K_M
- if you need this exact Jackrong model and Ollama still fails, move it to a backend with stronger direct GGUF support, such as llama.cpp itself