Instructions to use DESUCLUB/Qwen3-NoThinkEmbed with libraries, inference providers, notebooks, and local apps.

Libraries

How to use DESUCLUB/Qwen3-NoThinkEmbed with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="DESUCLUB/Qwen3-NoThinkEmbed")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("DESUCLUB/Qwen3-NoThinkEmbed")
model = AutoModelForCausalLM.from_pretrained("DESUCLUB/Qwen3-NoThinkEmbed")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use DESUCLUB/Qwen3-NoThinkEmbed with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "DESUCLUB/Qwen3-NoThinkEmbed"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DESUCLUB/Qwen3-NoThinkEmbed",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
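
The curl call above can also be issued from Python. The following is a minimal sketch using only the standard library; `build_chat_request` and `chat` are hypothetical helper names, and the response layout (`choices[0].message.content`) is assumed to follow the OpenAI-compatible chat completions format that the server advertises.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build the same POST request as the curl example (hypothetical helper)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, user_message: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant text.

    Assumption: the server replies in the OpenAI chat completions layout.
    """
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (requires a running server, so it is left commented out):
# print(chat("http://localhost:8000", "DESUCLUB/Qwen3-NoThinkEmbed",
#            "What is the capital of France?"))
```

The same helper should work against the SGLang server by passing `http://localhost:30000` as `base_url`.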

Use Docker

docker model run hf.co/DESUCLUB/Qwen3-NoThinkEmbed

SGLang

How to use DESUCLUB/Qwen3-NoThinkEmbed with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "DESUCLUB/Qwen3-NoThinkEmbed" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DESUCLUB/Qwen3-NoThinkEmbed",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "DESUCLUB/Qwen3-NoThinkEmbed" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "DESUCLUB/Qwen3-NoThinkEmbed",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use DESUCLUB/Qwen3-NoThinkEmbed with Docker Model Runner:
```
docker model run hf.co/DESUCLUB/Qwen3-NoThinkEmbed
```

Model Details

Model Description

This model is based on Qwen3. It is an iterative set of experiments that try to remove thinking mode from Qwen3 architecturally, rather than by providing an empty <think>\n\n</think> block.
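
For context, the token-based workaround mentioned above can be sketched as follows. This is a simplified illustration, assuming the ChatML-style layout of the upstream Qwen3 chat template; `build_prompt` is a hypothetical helper and not part of this repo or of Transformers.

```python
# Hypothetical helper: mimics, in simplified form, how a Qwen3-style chat
# template renders a single user turn. ASSUMPTION: ChatML-style markers.
EMPTY_THINK_BLOCK = "<think>\n\n</think>\n\n"


def build_prompt(user_message: str, prefill_empty_think: bool) -> str:
    prompt = (
        "<|im_start|>user\n"
        f"{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if prefill_empty_think:
        # Token-based approach: pre-fill an empty thinking block so the
        # model is nudged to go straight to its answer.
        prompt += EMPTY_THINK_BLOCK
    return prompt


with_block = build_prompt("Who are you?", prefill_empty_think=True)
without_block = build_prompt("Who are you?", prefill_empty_think=False)
print(repr(with_block))
print(repr(without_block))
```

The two prompts differ only in the trailing empty block, which is the part this model aims to make unnecessary.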

Current edits to Qwen3 model

The thinking tokens have been stripped from the tokenizer.json and tokenizer_config.json files.
The embedding matrix has also been truncated from 151936 to 151667 rows, which removes the lookup entries for the thinking tokens.
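
As a rough worked example of what this truncation amounts to, the arithmetic can be sketched as below. The two vocabulary sizes come from this card and the 2 bytes per parameter follow from the BF16 tensor type. The hidden size of 2560 is an assumption (check `hidden_size` in config.json), and the figures count a single embedding matrix only.

```python
OLD_VOCAB_ROWS = 151936  # from this model card
NEW_VOCAB_ROWS = 151667  # from this model card
HIDDEN_SIZE = 2560       # ASSUMPTION: verify against hidden_size in config.json
BYTES_PER_PARAM = 2      # BF16 stores each parameter in 2 bytes


def embedding_savings(old_rows: int, new_rows: int, hidden_size: int, bytes_per_param: int):
    """Return (rows removed, parameters removed, bytes removed) for one matrix."""
    removed_rows = old_rows - new_rows
    removed_params = removed_rows * hidden_size
    return removed_rows, removed_params, removed_params * bytes_per_param


rows, params, nbytes = embedding_savings(
    OLD_VOCAB_ROWS, NEW_VOCAB_ROWS, HIDDEN_SIZE, BYTES_PER_PARAM
)
print(f"rows removed: {rows}")          # 269
print(f"parameters removed: {params}")  # 688640 under the assumed hidden size
print(f"bytes removed: {nbytes}")       # 1377280 under the assumed hidden size
```

Under these assumptions the size reduction is small relative to a 4B-parameter model, so the change is mainly about removing the lookup entries rather than shrinking the checkpoint.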

Usage:

You can use this model with the code below:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "DESUCLUB/Qwen3-NoThinkEmbed"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("content:", content)

Reproducing NoThink model

The code used to produce this model can also be found in this repo, in think_remover.py.

Note that to reproduce this model you will need to either edit the Qwen3-4B tokenizer.json yourself or use the tokenizer files provided here, which have been modified to remove all thinking tokens.
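
One way the tokenizer.json edit could look is sketched below. This is not the repo's think_remover.py; `drop_added_tokens` is a hypothetical helper, and the token ids in the toy data are illustrative assumptions. The sketch only filters the top-level `added_tokens` list on a toy dictionary, whereas the real tokenizer files contain more fields that may also need updating.

```python
import json


def drop_added_tokens(tokenizer_json: dict, new_vocab_size: int) -> dict:
    """Return a copy without added tokens whose id is >= new_vocab_size."""
    result = json.loads(json.dumps(tokenizer_json))  # simple deep copy
    result["added_tokens"] = [
        tok for tok in result.get("added_tokens", [])
        if tok["id"] < new_vocab_size
    ]
    return result


# Toy stand-in for a tokenizer.json file. ASSUMPTION: ids are illustrative.
toy = {
    "added_tokens": [
        {"id": 151666, "content": "</tool_response>", "special": False},
        {"id": 151667, "content": "<think>", "special": False},
        {"id": 151668, "content": "</think>", "special": False},
    ]
}

trimmed = drop_added_tokens(toy, new_vocab_size=151667)
print([tok["content"] for tok in trimmed["added_tokens"]])
```

Filtering by id rather than by token text keeps the remaining ids contiguous with the truncated embedding size, which is why the cutoff is passed in as the new vocabulary size.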

Credits:

Credit goes to the Qwen Team for developing the Qwen3 suite of models and for providing the baseline for the inference code above.

Safetensors

Model size: 4B params
Tensor type: BF16

Model tree for DESUCLUB/Qwen3-NoThinkEmbed

Quantizations: 1 model