Model Details
Model Description
This model is based on Qwen3. It is part of an iterative set of experiments that try to remove thinking mode from Qwen3 architecturally, instead of suppressing it by providing empty <think>\n\n</think> tokens in the prompt.
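For context, the prompt-level approach that this model tries to avoid can be sketched as follows. This is a minimal illustration, assuming the ChatML-style markers (`<|im_start|>`, `<|im_end|>`) used by the stock Qwen3 chat template; `build_prompt` is a hypothetical helper, not part of this repo or of Transformers.

```python
# Sketch of the prompt-level way to skip thinking: pre-fill an empty
# <think> block so the model continues straight into the answer.
# Assumption: ChatML-style markers as in the stock Qwen3 chat template.

EMPTY_THINK_BLOCK = "<think>\n\n</think>\n\n"


def build_prompt(user_message, prefill_empty_think):
    """Build a single-turn prompt, optionally ending in an empty think block."""
    prompt = "<|im_start|>user\n" + user_message + "<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"
    if prefill_empty_think:
        prompt += EMPTY_THINK_BLOCK
    return prompt


if __name__ == "__main__":
    # repr() makes the newlines inside the prompt visible.
    print(repr(build_prompt("Who are you?", prefill_empty_think=True)))
```

The experiments described here aim to make that pre-filled block unnecessary by removing the thinking tokens from the model itself.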
Current edits to Qwen3 model
- The current model has been stripped of thinking tokens in the tokenizer.json and tokenizer_config.json files
- The embedding matrix has also been truncated from 151936 to 151667 rows, which removes the lookup entries for the thinking tokens
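The effect of the truncation can be illustrated with a toy sketch. The two sizes come from the list above; the token ids in `ASSUMED_THINK_IDS` are an assumption about the stock Qwen3 tokenizer, and `truncate_rows` and `is_reachable` are hypothetical helpers working on plain Python lists rather than the actual model weights.

```python
# Toy sketch: dropping the trailing rows of an embedding table makes the
# ids that pointed at those rows impossible to look up.

OLD_VOCAB = 151936  # embedding rows before truncation (from the text above)
NEW_VOCAB = 151667  # embedding rows after truncation (from the text above)

# Assumption: ids the stock Qwen3 tokenizer assigns to the thinking tokens.
ASSUMED_THINK_IDS = {"<think>": 151667, "</think>": 151668}


def truncate_rows(table, new_size):
    """Keep only the first `new_size` rows of a row-indexed table."""
    if new_size > len(table):
        raise ValueError("new_size exceeds the current number of rows")
    return table[:new_size]


def is_reachable(token_id, num_rows):
    """A token id can be looked up only if it indexes an existing row."""
    return 0 <= token_id < num_rows


rows_removed = OLD_VOCAB - NEW_VOCAB

# Miniature stand-in for the real matrix: 10 rows of width 4, cut to 7 rows.
toy_table = [[float(i)] * 4 for i in range(10)]
toy_truncated = truncate_rows(toy_table, 7)

if __name__ == "__main__":
    print("rows removed:", rows_removed)
    for name, token_id in ASSUMED_THINK_IDS.items():
        print(name, "reachable:", is_reachable(token_id, NEW_VOCAB))
```

Note that slicing at 151667 drops every id at or above that index, not only the two thinking tokens. On a real checkpoint the same slice would presumably have to be applied to the output projection as well, unless its weights are tied to the input embedding.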
Usage:
- You can use this model via the code below
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "DESUCLUB/Qwen3-NoThinkEmbed"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language model."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("content:", content)
Reproducing NoThink model
The code used for reproducing this model can also be found in this repo, under think_remover.py
- Note that to reproduce this model, you will need to either edit the Qwen3-4B tokenizer.json yourself or use the tokenizer files provided here
- The tokenizer has been modified to remove all thinking tokens
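The tokenizer edit can be sketched roughly as below. This is not the contents of think_remover.py, which is not shown here. It is a minimal illustration that assumes the usual Hugging Face layout, where tokenizer.json carries an `added_tokens` list and tokenizer_config.json carries an `added_tokens_decoder` mapping. `strip_think_tokens` is a hypothetical helper, and the toy dictionaries stand in for the real files.

```python
import copy

THINK_TOKENS = {"<think>", "</think>"}


def strip_think_tokens(tokenizer_json, tokenizer_config):
    """Return copies of both structures without the thinking-token entries."""
    tok = copy.deepcopy(tokenizer_json)
    cfg = copy.deepcopy(tokenizer_config)
    # tokenizer.json: a list of {"id": ..., "content": ..., ...} entries.
    tok["added_tokens"] = [
        entry for entry in tok.get("added_tokens", [])
        if entry["content"] not in THINK_TOKENS
    ]
    # tokenizer_config.json: a mapping from id (as a string) to an entry.
    cfg["added_tokens_decoder"] = {
        token_id: entry
        for token_id, entry in cfg.get("added_tokens_decoder", {}).items()
        if entry["content"] not in THINK_TOKENS
    }
    return tok, cfg


# Toy stand-ins for the real files; ids and entries are illustrative only.
toy_tokenizer_json = {
    "added_tokens": [
        {"id": 5, "content": "<|im_end|>", "special": True},
        {"id": 6, "content": "<think>", "special": False},
        {"id": 7, "content": "</think>", "special": False},
    ]
}
toy_tokenizer_config = {
    "added_tokens_decoder": {
        "5": {"content": "<|im_end|>", "special": True},
        "6": {"content": "<think>", "special": False},
        "7": {"content": "</think>", "special": False},
    }
}

clean_json, clean_config = strip_think_tokens(toy_tokenizer_json, toy_tokenizer_config)

if __name__ == "__main__":
    print([entry["content"] for entry in clean_json["added_tokens"]])
    print(sorted(clean_config["added_tokens_decoder"]))
```

The sketch only filters these two structures. Other places that mention the thinking tokens, such as the chat template, are not handled and may need their own edits.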
Credits:
Credit goes to the Qwen Team for developing the Qwen3 suite of models and for providing the baseline for the inference code above.