AI & ML interests

Semantic Search, Language models, Domain adaptation, Question Answering

Recent Activity

anakin87 updated a Space about 1 month ago
deepset/should-i-follow
anakin87 updated a Space about 1 month ago
deepset/autoquizzer

anakin87 posted an update 3 days ago
A small model that struggled against a random opponent now beats GPT-5-mini at tic-tac-toe

I took LiquidAI/LFM2-2.6B and trained it through play.

๐Ÿง‘โ€๐Ÿณ Here's how:

1๏ธโƒฃ Build a solid RL env with Verifiers (Prime Intellect)
2๏ธโƒฃ Generate synthetic data: <200 games sampled from GPT-5-mini playing in the env
3๏ธโƒฃ SFT warm-up to teach format
4๏ธโƒฃ Group-based RL (CISPO) against opponents making 20-70% random moves
5๏ธโƒฃ RL again with stronger opponents (0-25% random moves) + 1.25 temperature to push exploration and shake off suboptimal strategies

Done! Beats GPT-5-mini 🏆
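
A rough sketch of the two RL stages as a config; the field names (and the stage-4 temperature) are hypothetical, not the actual training code:

```python
# Hypothetical config capturing stages 4-5 above; names and fields are
# illustrative, not the real training setup.
rl_stages = [
    {   # stage 4: group-based RL (CISPO) vs. partially random opponents
        "algo": "CISPO",
        "opponent_random_move_prob": (0.20, 0.70),  # 20-70% random moves
        "temperature": 1.0,                         # assumption for this stage
    },
    {   # stage 5: stronger opponents + hotter sampling
        "algo": "CISPO",
        "opponent_random_move_prob": (0.00, 0.25),  # 0-25% random moves
        "temperature": 1.25,  # pushes exploration, shakes off suboptimal strategies
    },
]
```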

---

🎮 Play against the model: anakin87/LFM2-2.6B-mr-tictactoe

🤗 Model: anakin87/LFM2-2.6B-mr-tictactoe

📚 Walkthrough/course: https://github.com/anakin87/llm-rl-environments-lil-course

🤗 Dataset and checkpoints: https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe
anakin87 posted an update 4 days ago
Local Gemma 4 agent 💎🕵️🗺️
Drop in a mysterious map, get the location, live weather, and top spots to visit.

I've been exploring what google/gemma-4-E4B-it can do in a local agentic setup and put together a 📓 notebook with Gemma + the Haystack AI framework, covering 4 demos.

📓 https://t.ly/04Ty5

Another interesting one is the GitHub Agent.

I initially tried to load all tools from the GitHub MCP server, which quickly filled the context available on Colab -> an unusable, forgetful agent ❌

Then I used the Searchable Toolset 🔎 🧰
It dynamically discovers the right tools from the GitHub MCP server on the fly, loading only what it actually needs for the task at hand and keeping the context lean.

Now it actually works.
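
A minimal sketch of the idea with Haystack's MCP integration. The notebook's searchable toolset discovers tools dynamically; here a static tool_names filter stands in for that behavior, and the server URL, tool names, and auth handling are assumptions:

```python
# Sketch only: a statically filtered MCPToolset as a stand-in for the
# notebook's dynamic searchable toolset. Auth for the GitHub MCP server
# is omitted; tool names are illustrative.
from haystack.components.agents import Agent
from haystack.components.generators.chat import HuggingFaceLocalChatGenerator
from haystack.dataclasses import ChatMessage
from haystack_integrations.tools.mcp import MCPToolset, StreamableHttpServerInfo

server = StreamableHttpServerInfo(url="https://api.githubcopilot.com/mcp/")
toolset = MCPToolset(
    server_info=server,
    tool_names=["search_repositories", "get_file_contents"],  # load only what's needed
)

agent = Agent(
    chat_generator=HuggingFaceLocalChatGenerator(model="google/gemma-4-E4B-it"),
    tools=toolset,
)
result = agent.run(messages=[ChatMessage.from_user("Find the Haystack repo and summarize its README")])
print(result["messages"][-1].text)
```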

The notebook also contains:
💎 Multimodal weather agent: the mystery map demo above
💎 Visual Question Answering from a paper
💎 RAG on Rock music
anakin87 posted an update 6 days ago
How does LLM training with RL Environments work?

It all starts with Reinforcement Learning with Verifiable Rewards:
- a question is asked
- the model generates reasoning + an answer
- the answer is checked against ground truth
- the reward drives RL training


In this setup, the environment is simple: fixed questions and answers, rollout logic, and reward(s).
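
A minimal sketch of such a verifiable reward (the "Answer:" completion format is an assumption):

```python
# Minimal verifiable reward: score a rollout by exact match against the
# ground truth. Assumes completions end with "Answer: <x>" (hypothetical format).
def verifiable_reward(completion: str, ground_truth: str) -> float:
    answer = completion.rsplit("Answer:", 1)[-1].strip()
    return 1.0 if answer == ground_truth else 0.0

print(verifiable_reward("... reasoning ... Answer: 42", "42"))  # 1.0
```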

Consider a more complex tic-tac-toe env ❌⭕
It adds:
- dynamic game generation/handling
- tunable opponent skill
- multi-turn interactions

(envs can also include tools)
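
A toy sketch of such an env, with opponent skill as a tunable knob (not the actual Verifiers environment):

```python
import random

# Toy multi-turn tic-tac-toe env: the model plays "X", the built-in opponent
# plays "O" and mixes random moves with strong moves at a tunable rate.
class TicTacToeEnv:
    def __init__(self, opponent_random_prob: float = 0.5):
        self.opponent_random_prob = opponent_random_prob  # opponent skill knob
        self.board = [" "] * 9

    def legal_moves(self):
        return [i for i, c in enumerate(self.board) if c == " "]

    def step(self, model_move: int) -> str:
        """One turn of the multi-turn interaction: model move, then opponent reply."""
        self.board[model_move] = "X"  # assumes the move was validated as legal
        moves = self.legal_moves()
        if moves:
            if random.random() < self.opponent_random_prob:
                opp = random.choice(moves)      # weak, random move
            else:
                opp = self._strong_move(moves)  # strong move
            self.board[opp] = "O"
        return "".join(self.board)  # observation fed back to the model

    def _strong_move(self, moves):
        return moves[0]  # placeholder: a real env would use minimax here
```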

---

What happens at training?

We use Group Relative Policy Optimization (GRPO) with a tic-tac-toe env.

No critic model needed: the group is the baseline.
Simpler than PPO.

1๏ธโƒฃ Rollout generation: from the same board, model plays N games via sampling
2๏ธโƒฃ Each game scored with deterministic rewards (win, format, ...)
3๏ธโƒฃ Mean score computed across the group
4๏ธโƒฃ Each rollout's advantage = its score minus the group mean
5๏ธโƒฃ Model updated to favor trajectories above baseline

๐Ÿ” Repeat
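
In code, the group-relative advantage (step 4️⃣) is just a centering operation; a minimal sketch (GRPO variants often also divide by the group's standard deviation):

```python
# Group-relative advantages: each rollout's reward minus the group mean.
rewards = [1.0, 0.0, 0.5, 1.0]            # N games played from the same board
baseline = sum(rewards) / len(rewards)    # group mean -> no critic model needed
advantages = [r - baseline for r in rewards]
print(advantages)  # [0.375, -0.625, -0.125, 0.375]: winning rollouts get reinforced
```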


For a deep dive, check out
🌱 https://github.com/anakin87/llm-rl-environments-lil-course
a free hands-on course on RL environments for LLMs
anakin87 posted an update 10 days ago
Your RL environment is an SFT data factory 🏭

In LLM post-training it's common to do Supervised Fine-Tuning warm-up before Reinforcement Learning.

When teaching a new task, RL needs some signal to amplify, and SFT builds a good initial basis, for example by teaching the format.


If you've built an RL env, generating SFT synthetic data is basically free.

An env already has: task data, rollout logic, rewards.

1๏ธโƒฃ pick a strong model
2๏ธโƒฃ run it through the env
3๏ธโƒฃ filter rollouts by reward

This works out of the box with Verifiers (Prime Intellect) and Atropos (Nous Research).
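
A self-contained sketch of the recipe; run_rollout is a hypothetical stand-in for playing one episode in your env with a strong teacher model:

```python
import json
import random

# Hypothetical stand-in: play one episode with a strong model and return
# the conversation plus the env's reward for it.
def run_rollout():
    reward = random.random()  # toy reward; in practice the env scores the game
    messages = [
        {"role": "user", "content": "board: ... your move?"},
        {"role": "assistant", "content": "move: 4"},
    ]
    return messages, reward

REWARD_THRESHOLD = 0.9
with open("sft_warmup.jsonl", "w") as f:
    for _ in range(200):                  # on the order of the <200 games above
        messages, reward = run_rollout()
        if reward >= REWARD_THRESHOLD:    # 3️⃣ keep only high-reward rollouts
            f.write(json.dumps({"messages": messages}) + "\n")
```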

๐Ÿง‘โ€๐Ÿ’ป Example: https://github.com/anakin87/llm-rl-environments-lil-course/blob/main/chapters/05.md
anakin87 posted an update 15 days ago
🌀 Let LLMs wander - Engineering RL Environments

Reinforcement Learning Environments are little worlds where models can act, get rewards, and learn.

I've been exploring how to design them, figuring out what works and what doesn't.

If you want to learn how to build them, I recorded a practical intro video.

You'll also see how to turn Liquid AI LFM2-2.6B into a Tic-tac-toe master 🙂

🎥 Engineering RL Environments video: https://www.youtube.com/watch?v=71V3fTaUp2Q

---

🌱 LLM RL Environments Lil Course: https://github.com/anakin87/llm-rl-environments-lil-course

🤗🕹️ Play against the trained model: anakin87/LFM2-2.6B-mr-tictactoe


📚 HF collection (datasets + models): https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe
anakin87 posted an update 18 days ago
📣 I just published a free course on Reinforcement Learning Environments for Language Models!

📌 COURSE: https://github.com/anakin87/llm-rl-environments-lil-course

Over the past year, we've seen a shift in LLM Post-Training.
Previously, Supervised Fine-Tuning was the most important part: making models imitate curated Question-Answer pairs.

Now we also have Reinforcement Learning with Verifiable Rewards. With techniques like GRPO, models can learn through trial and error in dynamic environments. They can climb to new heights without relying on expensive, carefully prepared data.


But what actually are these environments in practice❓ And how do you build them effectively❓

Fascinated by these concepts, I spent time exploring this space through experiments, post-training Small Language Models.
I've packaged everything I learned into this short course.


What you'll learn

🔹 Agents, Environments, and LLMs: how to map Reinforcement Learning concepts to the LLM domain
🔹 How to use Verifiers (open-source library by Prime Intellect) to build RL environments as software artifacts
🔹 Common patterns: how to build single-turn, multi-turn, and tool-use environments

🔹 Hands-on: turn a small language model (LFM2-2.6B by LiquidAI) into a Tic Tac Toe master
🔸 Build the game Environment
🔸 Use it to generate synthetic data for SFT warm-up
🔸 Group-based Reinforcement Learning

If you're interested in building "little worlds" where LLMs can learn, this course is for you.

---

🤗🕹️ Play against the trained model: anakin87/LFM2-2.6B-mr-tictactoe

📚 HF collection (datasets + models): https://huggingface.co/collections/anakin87/lfm2-26b-mr-tic-tac-toe
bilgeyucel updated a Space 2 months ago
anakin87 posted an update 5 months ago
💭 Do thinking traces make Language Models learn better? Curious what others think

๐—ฆ๐—ฐ๐—ฒ๐—ป๐—ฎ๐—ฟ๐—ถ๐—ผ
You take an instruction-following LM.
You want to train it with a GRPO-style RL algorithm on a task like Tic Tac Toe.
Rewards are outcome-based, applied only at the end of each episode: win/loss/draw, format adherence...

During training, the model could just output answers, but a common choice is to make it also output thinking traces.

๐—ง๐—ต๐—ฒ ๐—พ๐˜‚๐—ฒ๐˜€๐˜๐—ถ๐—ผ๐—ป
Does forcing the model to produce thinking traces during training actually improve learningโ“

💬 I'd like to hear your thoughts. Share ideas and links to relevant papers and resources.

From what I've understood so far, the answer seems to be yes.

1๏ธโƒฃ If you force the model to think during training, it becomes a model that thinks at inference time. It naturally allocates more budget (tokens) to a problem, which tends to improve performance.

2๏ธโƒฃ While the model's "reasoning" already exists in its activation space, using explicit thinking traces as a scratchpad allows training to steer and shape that reasoning.

3๏ธโƒฃ As the model produces more traces during training, the RL algorithm can progressively give higher rewards to the reasoning patterns that lead to better outcomes.
anakin87 posted an update 5 months ago
LLMs can leak their post-training data (RL included) 💧

An interesting new paper on this topic from Google DeepMind: Extracting alignment data in open models (2510.18554)

It's known that Language Models memorize data that can be extracted via prompting.

In this paper, the authors investigate this aspect:
- using open models, where prompting can be fully customized by the user, including special tokens.
- focusing on open-source models like Olmo, where full training data is available.


📤 How do they extract data?

During post-training (like SFT), new tokens such as <|user|> are introduced.

The authors hypothesize that prompting the model with these tokens can make it output its alignment data (remember Magpie?).

For example, for SFT, their extraction prompt is <|endoftext|><|user|>.
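
A sketch of what that extraction looks like with transformers (the model id is an assumption; any open post-trained model with these special tokens would do):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: an Olmo-style open model whose chat template uses <|user|>.
model_id = "allenai/OLMo-2-1124-7B-SFT"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The extraction prompt is just the post-training special tokens.
inputs = tok("<|endoftext|><|user|>", return_tensors="pt")
out = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tok.decode(out[0], skip_special_tokens=False))  # often resembles alignment data
```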


๐Ÿ“ Evaluating memorization

The authors compare each sampled example with the original data using vector search with embedding similarity.

They find that many outputs are semantically very similar to the original data, even if the exact words differ.

Traditional string-matching algorithms underestimate memorization by 10x.
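
A minimal sketch of that semantic comparison with sentence-transformers (the embedding model is an arbitrary choice, not the paper's):

```python
from sentence_transformers import SentenceTransformer, util

# Compare a sampled output against a training example via embedding similarity
# instead of exact string matching.
model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary embedding model

sampled = "Explain photosynthesis in simple terms for a child."
training_example = "Describe photosynthesis simply, as if to a kid."

score = util.cos_sim(model.encode(sampled), model.encode(training_example)).item()
print(score)  # high similarity -> likely memorized, even with different wording
```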


๐Ÿ” What about RL?

Surprisingly, the same technique works to extract data from Reinforcement Learning (PPO/GRPO) phases.

This is counter-intuitive because the RL objective is not designed to increase sequence likelihoods (unlike SFT).

Practical limitation: in this case, extraction relies on using the initial part of the training prompt, which is not generally public.


📈 Are the extracted data effective for post-training?

Both in SFT and RL, the extracted data can be used to fine-tune models to performance similar to the originals.

The authors suggest that model distillation, where a stronger model is used to drive the training of a weaker one, may be a form of indirect training on the original dataset.

anakin87 posted an update 8 months ago
Your Language Model needs better (open) environments to learn 🌀

๐Ÿ“ https://huggingface.co/blog/anakin87/environments-hub

RL environments help LLMs practice, reason, and improve.
I explored the Environments Hub and wrote a walkthrough showing how to train and evaluate models using these open environments.

1๏ธโƒฃ ๐—ช๐—ต๐˜† ๐—ฅ๐—Ÿ ๐—บ๐—ฎ๐˜๐˜๐—ฒ๐—ฟ๐˜€ ๐—ณ๐—ผ๐—ฟ ๐—Ÿ๐—Ÿ๐— ๐˜€

DeepSeek-R1 made clear that Reinforcement Learning can be used to incentivize reasoning in LLMs.
In GRPO, the model generates multiple answers and learns to prefer the better ones from rewards.


2๏ธโƒฃ ๐—ช๐—ต๐—ฎ๐˜ ๐—ฒ๐—ป๐˜ƒ๐—ถ๐—ฟ๐—ผ๐—ป๐—บ๐—ฒ๐—ป๐˜๐˜€ ๐—ฎ๐—ฟ๐—ฒ
In classic RL, the environment is the world where the Agent lives, interacts, and get rewards to learn.

We can also think of them as software packages, containing data, harness and scoring rules - for the model
to learn and be evaluated.

Nowadays, the Agent is not just the LLM. It can use tools, from a weather API to a terminal.

This makes environments for training and evaluation more complex and critical.


3๏ธโƒฃ ๐“๐ก๐ž ๐จ๐ฉ๐ž๐ง ๐œ๐ก๐š๐ฅ๐ฅ๐ž๐ง๐ ๐ž

Big labs are advancing, but open models and the community still face a fragmented ecosystem.
We risk becoming users of systems built with tools we can't access or fully understand.


4๏ธโƒฃ ๐„๐ง๐ฏ๐ข๐ซ๐จ๐ง๐ฆ๐ž๐ง๐ญ๐ฌ ๐‡๐ฎ๐›
That's why, I was excited when Prime Intellect released the Environments Hub.

It's a place where people share RL environments: tasks you can use to train LLMs with RL (GRPO-style) or evaluate Agents.
Plus, the Verifiers library (@willcb) standardizes the creation of RL environments and evaluations.
They can help keep science and experimentation open. 🔬


I explored the Hub and wrote a hands-on walkthrough 📝
- RL + LLMs basics
- Environments Hub navigation
- Evaluating models/Agents
- GRPO Training a tiny model on an alphabetical sort task

Take a look!

๐Ÿ“ https://huggingface.co/blog/anakin87/environments-hub
anakin87 posted an update 8 months ago
Want to quickly try Gemma 3 270m? 💎💬

I made a simple Space to do that: anakin87/gemma-3-270m-it

⚡ Fast: Flash Attention, Zero GPU
⚙️ Configurable
anakin87 posted an update 9 months ago
🕵️🌐 Building Browser Agents - notebook

No API? No problem.
Browser Agents can use websites like you do: click, type, wait, read.

📓 Step-by-step notebook: https://colab.research.google.com/github/deepset-ai/haystack-cookbook/blob/main/notebooks/browser_agents.ipynb

🎥 In the video, the Agent:
- Goes to Hugging Face Spaces
- Finds black-forest-labs/FLUX.1-schnell
- Expands a short prompt ("my holiday on Lake Como") into a detailed image generation prompt
- Waits for the image
- Returns the image URL


## What else can it do?
Great for information gathering and summarization

🗞️🗞️ Compare news websites and create a table of shared stories with links
▶️ Find content creator social profiles from YouTube videos
🛍️ Find a product's price range on Amazon
🚂 🚌 Gather public transportation travel options


## How is it built?
๐Ÿ—๏ธ Haystack โ†’ Agent execution logic
๐Ÿง  Google Gemini 2.5 Flash โ†’ Good and fast LLM with a generous free tier
๐Ÿ› ๏ธ Playwright MCP server โ†’ Browser automation tools: navigate, click, type, wait...

Even without vision capabilities, this setup can get quite far.


## Next steps
- Try a local open model
- Move from notebook to real deployment
- Incorporate vision

And you? Have you built something similar? What's in your stack?

anakin87 posted an update 9 months ago
Haystack can now see 👀

The latest release of the Haystack OSS LLM framework adds a long-requested feature: image support!

📓 Notebooks below

This isn't just about passing images to an LLM. We built several features to enable practical multimodal use cases.

What's new?
🧠 Support for multiple LLM providers: OpenAI, Amazon Bedrock, Google Gemini, Mistral, NVIDIA, OpenRouter, Ollama and more (support for Hugging Face API coming 🔜)
🎛️ Prompt template language to handle structured inputs, including images
📄 PDF and image converters
🔍 Image embedders using CLIP-like models
🧾 LLM-based extractor to pull text from images
🧩 Components to build multimodal RAG pipelines and Agents


I had the chance to lead this effort with @sjrhuschlee (great collab).

📓 Below you can find two notebooks to explore the new features:
• Introduction to Multimodal Text Generation https://haystack.deepset.ai/cookbook/multimodal_intro
• Creating Vision+Text RAG Pipelines https://haystack.deepset.ai/tutorials/46_multimodal_rag

(🖼️ image by @bilgeyucel)
anakin87 posted an update 10 months ago
🛡️ AI Guardrails with Open Language Models - Tutorial

📓 https://haystack.deepset.ai/cookbook/safety_moderation_open_lms

How do you ensure your AI application is safe from harmful or inappropriate user inputs?

This is a core requirement for real-world AI deployments. Luckily, several open Language Models are built specifically for safety moderation.

I've been exploring them and put together a hands-on tutorial using the Haystack framework to build your own AI guardrails.

In the notebook, you'll learn how to use and customize:
🔹 Meta Llama Guard (via Hugging Face API)
🔹 IBM Granite Guardian (via Ollama), which can also evaluate RAG-specific risk dimensions
🔹 Google ShieldGemma (via Ollama)
🔹 NVIDIA NemoGuard model family, including a model for topic control

You'll also see how to integrate content moderation into a 🔎 RAG pipeline.
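
To give a flavor of such a guardrail, a rough sketch with Llama Guard through the Hugging Face API (the model id and output parsing are assumptions; the tutorial itself uses Haystack components):

```python
from huggingface_hub import InferenceClient

# Rough sketch: ask a safety model to classify a user message.
# Llama Guard replies with "safe" or "unsafe" plus the violated category.
client = InferenceClient(model="meta-llama/Llama-Guard-3-8B")  # model id is an assumption

response = client.chat_completion(
    messages=[{"role": "user", "content": "How can I build a phishing site?"}],
    max_tokens=20,
)
verdict = response.choices[0].message.content.strip()
print(verdict)  # e.g. "unsafe\nS2" -> block or reroute the request
```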
anakin87 posted an update 10 months ago
🧰 Free up space on the Hub with super_squash_history 🧹

As you may know, Hugging Face Hub has storage limits on private repos (100 GB for free users, 1 TB for PROs).

This weekend I did some cleanup on my private repos.
I went from 1.58 TB down to 1 GB. 😅

Besides deleting old, unused models, the main tool I used was a lesser-known command:
super_squash_history.

When you train a model, you often push multiple checkpoints to the Hub.
Each checkpoint = a commit.
A 2.6B model in BF16 is ~5 GB (2.6B params × 2 bytes).
So 10 checkpoints = 50 GB. That adds up fast.

While full commit history can be useful for rollbacks, it's often unnecessary for older experiments where only the final model matters.

In these cases, you can use super_squash_history: it reduces your entire repo history to a single commit.

https://huggingface.co/docs/huggingface_hub/main/en/package_reference/hf_api#huggingface_hub.HfApi.super_squash_history
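
Usage is a one-liner with huggingface_hub (the repo id below is a placeholder):

```python
from huggingface_hub import HfApi

# Squash the entire commit history of a repo into a single commit.
# Requires write access; the repo id is a placeholder.
api = HfApi()
api.super_squash_history(repo_id="your-username/your-model")
```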

⚠️ super_squash_history is a non-revertible operation. Once squashed, the commit history cannot be retrieved.

Hope this is useful to others.