Multi-task DistilBERT classifier for conversational AI pipelines in interactive fiction and games. Performs two classification tasks in a single forward pass:
| Task | Output | Notes |
|---|---|---|
| Dialogue act | 21-class label | Classifies player utterance type |
| Manipulation detection | Binary probability | Detects prompt injection / NPC takeover attempts |
Dialogue act labels: accusation, acknowledgment, action, agree, command, conditional, confession, disagree, emote, farewell, flirt, greeting, hedge, hostile, intent, offer, opinion, out_of_character, question, statement, yes_no_question
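As a sanity check, the 21 labels above can be compared against a loaded label map. The sketch below assumes the map has the shape used in this card's loading code (integer class index to label name). `EXPECTED_LABELS` and `validate_label_map` are hypothetical names introduced here, not part of the model repo.

```python
# The 21 dialogue act labels listed in this card.
EXPECTED_LABELS = [
    "accusation", "acknowledgment", "action", "agree", "command",
    "conditional", "confession", "disagree", "emote", "farewell",
    "flirt", "greeting", "hedge", "hostile", "intent", "offer",
    "opinion", "out_of_character", "question", "statement",
    "yes_no_question",
]


def validate_label_map(labels):
    """Raise ValueError if `labels` (index -> name) does not cover the 21 classes."""
    # Every argmax index from a 21-class output must be a valid key.
    if sorted(labels.keys()) != list(range(len(EXPECTED_LABELS))):
        raise ValueError("label indices must be exactly 0..20")
    unknown = set(labels.values()) - set(EXPECTED_LABELS)
    missing = set(EXPECTED_LABELS) - set(labels.values())
    if unknown or missing:
        raise ValueError(
            f"unknown labels: {sorted(unknown)}, missing labels: {sorted(missing)}"
        )
```

The check deliberately says nothing about which index maps to which label; that ordering comes from the label map file itself.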
Inference with onnxruntime:

```python
import json

import numpy as np
import onnxruntime
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the model repo (ONNX weights, tokenizer files, label map).
snapshot_download(repo_id="myemfar/distilbert-multitask", local_dir="./distilbert_multitask")

session = onnxruntime.InferenceSession("./distilbert_multitask/model.onnx")
tokenizer = AutoTokenizer.from_pretrained("./distilbert_multitask")

# JSON keys are strings; convert them to integer class indices.
with open("./distilbert_multitask/label_map_da.json") as f:
    labels = {int(k): v for k, v in json.load(f).items()}

inputs = tokenizer("Where is the tavern?", return_tensors="np")
logits_da, logits_manip = session.run(None, dict(inputs))

da_label = labels[int(np.argmax(logits_da))]  # "question"
manip_prob = float(1 / (1 + np.exp(-logits_manip[0][0])))  # sigmoid
```
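The snippet above decodes a single utterance, and its plain sigmoid can emit overflow warnings for large negative logits. Here is a minimal sketch of batch post-processing, assuming `logits_da` has shape `(batch, 21)` and `logits_manip` has shape `(batch, 1)`, as the indexing above suggests. `stable_sigmoid`, `decode_outputs`, and the 0.5 threshold are illustrative choices, not part of the model repo.

```python
import numpy as np


def stable_sigmoid(x):
    """Sigmoid that avoids overflow by only exponentiating non-positive values."""
    x = np.asarray(x, dtype=np.float64)
    e = np.exp(-np.abs(x))
    return np.where(x >= 0, 1.0 / (1.0 + e), e / (1.0 + e))


def decode_outputs(logits_da, logits_manip, labels, threshold=0.5):
    """Turn raw model outputs into one result dict per utterance.

    `labels` maps integer class index -> dialogue act name.
    `threshold` is an assumed cut-off; tune it for your own pipeline.
    """
    da_ids = np.argmax(np.asarray(logits_da), axis=-1)
    manip_probs = stable_sigmoid(np.asarray(logits_manip).reshape(len(da_ids)))
    return [
        {
            "dialogue_act": labels[int(i)],
            "manipulation_prob": float(p),
            "is_manipulation": bool(p >= threshold),
        }
        for i, p in zip(da_ids, manip_probs)
    ]
```

With the session and tokenizer from the snippet above, a batch could be decoded as `decode_outputs(*session.run(None, dict(tokenizer(texts, return_tensors="np", padding=True))), labels)`, where `texts` is a list of player utterances.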
The dialogue act head was trained on synthetic data generated with Claude across the 21 conversational categories and curated for interactive fiction and RPG dialogue contexts: approximately 2,000 labeled examples, with targeted augmentation at category boundaries.
The manipulation detection head was fine-tuned on a combination of three public datasets plus domain-specific negative examples (in-character RPG dialogue):
| Dataset | License | Description |
|---|---|---|
| deepset/prompt-injections | CC BY 4.0 | Benign queries + prompt injection examples |
| hackaprompt/hackaprompt-dataset | Apache 2.0 | Red-teaming competition submissions |
| lakera-ai/gandalf_ignore_instructions | CC BY 4.0 | Instruction-override attempts from Lakera's Gandalf challenge |
Base model
distilbert/distilbert-base-uncased