{"any-to-any":{"datasets":[{"description":"A dataset with multiple modality input and output pairs.","id":"PKU-Alignment/align-anything"}],"demo":{"inputs":[{"filename":"any-to-any-input.jpg","type":"img"},{"label":"Text Prompt","content":"What is the significance of this place?","type":"text"}],"outputs":[{"label":"Generated Text","content":"The place in the picture is Osaka Castle, located in Osaka, Japan. Osaka Castle is a historic castle that was originally built in the 16th century by Toyotomi Hideyoshi, a powerful warlord of the time. It is one of the most famous landmarks in Osaka and is known for its distinctive white walls and black roof tiles. The castle has been rebuilt several times over the centuries and is now a popular tourist attraction, offering visitors a glimpse into Japan's rich history and culture.","type":"text"},{"filename":"any-to-any-output.wav","type":"audio"}]},"metrics":[],"models":[{"description":"Strong model that can take in video, audio, image, text and output text and natural speech.","id":"Qwen/Qwen2.5-Omni-7B"},{"description":"Robust model that can take in image and text and generate image and text.","id":"OmniGen2/OmniGen2"},{"description":"Any-to-any model with speech, video, audio, image and text understanding capabilities.","id":"openbmb/MiniCPM-o-2_6"},{"description":"A model that can understand image and text and generate image and text.","id":"ByteDance-Seed/BAGEL-7B-MoT"}],"spaces":[{"description":"An application to chat with an any-to-any (image & text) model.","id":"OmniGen2/OmniGen2"}],"summary":"Any-to-any models can understand two or more modalities and output two or more modalities.","widgetModels":[],"youtubeId":"","id":"any-to-any","label":"Any-to-Any","libraries":["transformers"]},"audio-classification":{"datasets":[{"description":"A benchmark of 10 different audio tasks.","id":"s3prl/superb"},{"description":"A dataset of YouTube clips and their sound 
categories.","id":"agkphysics/AudioSet"}],"demo":{"inputs":[{"filename":"audio.wav","type":"audio"}],"outputs":[{"data":[{"label":"Up","score":0.2},{"label":"Down","score":0.8}],"type":"chart"}]},"metrics":[{"description":"","id":"accuracy"},{"description":"","id":"recall"},{"description":"","id":"precision"},{"description":"","id":"f1"}],"models":[{"description":"An easy-to-use model for command recognition.","id":"speechbrain/google_speech_command_xvector"},{"description":"An emotion recognition model.","id":"ehcalabres/wav2vec2-lg-xlsr-en-speech-emotion-recognition"},{"description":"A language identification model.","id":"facebook/mms-lid-126"}],"spaces":[{"description":"An application that can classify music into different genres.","id":"kurianbenoy/audioclassification"}],"summary":"Audio classification is the task of assigning a label or class to a given audio. It can be used for recognizing which command a user is giving or the emotion of a statement, as well as identifying a speaker.","widgetModels":["MIT/ast-finetuned-audioset-10-10-0.4593"],"youtubeId":"KWwzcmG98Ds","id":"audio-classification","label":"Audio Classification","libraries":["speechbrain","transformers","transformers.js"]},"audio-to-audio":{"datasets":[{"description":"512-element X-vector embeddings of speakers from the CMU ARCTIC dataset.","id":"Matthijs/cmu-arctic-xvectors"}],"demo":{"inputs":[{"filename":"input.wav","type":"audio"}],"outputs":[{"filename":"label-0.wav","type":"audio"},{"filename":"label-1.wav","type":"audio"}]},"metrics":[{"description":"The Signal-to-Noise ratio is the relationship between the target signal level and the background noise level. 
It is calculated as the logarithm of the target signal divided by the background noise, in decibels.","id":"snri"},{"description":"The Signal-to-Distortion ratio is the relationship between the target signal and the sum of noise, interference, and artifact errors","id":"sdri"}],"models":[{"description":"A speech enhancement model.","id":"ResembleAI/resemble-enhance"},{"description":"A model that can change the voice in a speech recording.","id":"microsoft/speecht5_vc"}],"spaces":[{"description":"An application for speech separation.","id":"younver/speechbrain-speech-separation"},{"description":"An application for audio style transfer.","id":"nakas/audio-diffusion_style_transfer"}],"summary":"Audio-to-Audio is a family of tasks in which the input is an audio and the output is one or multiple generated audios. Some example tasks are speech enhancement and source separation.","widgetModels":["speechbrain/sepformer-wham"],"youtubeId":"iohj7nCCYoM","id":"audio-to-audio","label":"Audio-to-Audio","libraries":["asteroid","fairseq","speechbrain"]},"audio-text-to-text":{"datasets":[{"description":"A dataset containing audio conversations with question–answer pairs.","id":"nvidia/AF-Think"},{"description":"A more advanced and comprehensive dataset that contains characteristics of the audio as well","id":"tsinghua-ee/QualiSpeech"}],"demo":{"inputs":[{"filename":"audio.wav","type":"audio"},{"label":"Text Prompt","content":"What is the gender of the speaker?","type":"text"}],"outputs":[{"label":"Generated Text","content":"The gender of the speaker is female.","type":"text"}]},"metrics":[],"models":[{"description":"A lightweight model that has capabilities of taking both audio and text as inputs and generating responses.","id":"fixie-ai/ultravox-v0_5-llama-3_2-1b"},{"description":"A multimodal model that supports voice chat and audio analysis.","id":"Qwen/Qwen2-Audio-7B-Instruct"},{"description":"A model for audio understanding, speech translation, and 
transcription.","id":"mistralai/Voxtral-Small-24B-2507"},{"description":"A new model capable of audio question answering and reasoning.","id":"nvidia/audio-flamingo-3"}],"spaces":[{"description":"A space that takes input as both audio and text and generates answers.","id":"iamomtiwari/ATTT"},{"description":"A web application that demonstrates chatting with the Qwen2Audio Model.","id":"freddyaboulton/talk-to-qwen-webrtc"}],"summary":"Audio-text-to-text models take both an audio clip and a text prompt as input, and generate natural language text as output. These models can answer questions about spoken content, summarize meetings, analyze music, or interpret speech beyond simple transcription. They are useful for applications that combine speech understanding with reasoning or conversation.","widgetModels":[],"youtubeId":"","id":"audio-text-to-text","label":"Audio-Text-to-Text","libraries":["transformers"]},"automatic-speech-recognition":{"datasets":[{"description":"31,175 hours of multilingual audio-text dataset in 108 languages.","id":"mozilla-foundation/common_voice_17_0"},{"description":"Multilingual and diverse audio dataset with 101k hours of audio.","id":"amphion/Emilia-Dataset"},{"description":"A dataset with 44.6k hours of English speaker data and 6k hours of other language speakers.","id":"parler-tts/mls_eng"},{"description":"A multilingual audio dataset with 370K hours of audio.","id":"espnet/yodas"}],"demo":{"inputs":[{"filename":"input.flac","type":"audio"}],"outputs":[{"label":"Transcript","content":"Going along slushy country roads and speaking to damp audiences in...","type":"text"}]},"metrics":[{"description":"","id":"wer"},{"description":"","id":"cer"}],"models":[{"description":"A powerful ASR model by OpenAI.","id":"openai/whisper-large-v3"},{"description":"A good generic speech model by MetaAI for fine-tuning.","id":"facebook/w2v-bert-2.0"},{"description":"An end-to-end model that performs ASR and Speech Translation by 
MetaAI.","id":"facebook/seamless-m4t-v2-large"},{"description":"A powerful multilingual ASR and Speech Translation model by Nvidia.","id":"nvidia/canary-1b"},{"description":"Powerful speaker diarization model.","id":"pyannote/speaker-diarization-3.1"}],"spaces":[{"description":"A powerful general-purpose speech recognition application.","id":"hf-audio/whisper-large-v3"},{"description":"Latest ASR model from Useful Sensors.","id":"mrfakename/Moonshinex"},{"description":"A high quality speech and text translation model by Meta.","id":"facebook/seamless_m4t"},{"description":"A powerful multilingual ASR and Speech Translation model by Nvidia","id":"nvidia/canary-1b"}],"summary":"Automatic Speech Recognition (ASR), also known as Speech to Text (STT), is the task of transcribing a given audio to text. It has many applications, such as voice user interfaces.","widgetModels":["openai/whisper-large-v3"],"youtubeId":"TksaY_FDgnk","id":"automatic-speech-recognition","label":"Automatic Speech Recognition","libraries":["espnet","nemo","speechbrain","transformers","transformers.js"]},"depth-estimation":{"datasets":[{"description":"NYU Depth V2 Dataset: Video dataset containing both RGB and depth sensor data.","id":"sayakpaul/nyu_depth_v2"},{"description":"Monocular depth estimation benchmark based without noise and errors.","id":"depth-anything/DA-2K"}],"demo":{"inputs":[{"filename":"depth-estimation-input.jpg","type":"img"}],"outputs":[{"filename":"depth-estimation-output.png","type":"img"}]},"metrics":[],"models":[{"description":"Cutting-edge depth estimation model.","id":"depth-anything/Depth-Anything-V2-Large"},{"description":"A strong monocular depth estimation model.","id":"jingheya/lotus-depth-g-v1-0"},{"description":"A depth estimation model that predicts depth in videos.","id":"tencent/DepthCrafter"},{"description":"A robust depth estimation model.","id":"apple/DepthPro-hf"}],"spaces":[{"description":"An application that predicts the depth of an image and then 
reconstructs the 3D model as voxels.","id":"radames/dpt-depth-estimation-3d-voxels"},{"description":"An application for bleeding-edge depth estimation.","id":"akhaliq/depth-pro"},{"description":"An application for cutting-edge depth estimation in videos.","id":"tencent/DepthCrafter"},{"description":"A human-centric depth estimation application.","id":"facebook/sapiens-depth"}],"summary":"Depth estimation is the task of predicting the depth of the objects present in an image.","widgetModels":[""],"youtubeId":"","id":"depth-estimation","label":"Depth Estimation","libraries":["transformers","transformers.js"]},"document-question-answering":{"datasets":[{"description":"Largest document understanding dataset.","id":"HuggingFaceM4/Docmatix"},{"description":"Dataset from the 2020 DocVQA challenge. The documents are taken from the UCSF Industry Documents Library.","id":"eliolio/docvqa"}],"demo":{"inputs":[{"label":"Question","content":"What is the idea behind the consumer relations efficiency team?","type":"text"},{"filename":"document-question-answering-input.png","type":"img"}],"outputs":[{"label":"Answer","content":"Balance cost efficiency with quality customer service","type":"text"}]},"metrics":[{"description":"The evaluation metric for the DocVQA challenge is the Average Normalized Levenshtein Similarity (ANLS). This metric is tolerant of character recognition errors and compares the predicted answer with the ground truth answer.","id":"anls"},{"description":"Exact Match is a metric based on the strict character match of the predicted answer and the right answer. For answers predicted correctly, the Exact Match will be 1. 
Even if only one character is different, Exact Match will be 0","id":"exact-match"}],"models":[{"description":"A robust document question answering model.","id":"impira/layoutlm-document-qa"},{"description":"A document question answering model specialized in invoices.","id":"impira/layoutlm-invoices"},{"description":"A special model for OCR-free document question answering.","id":"microsoft/udop-large"},{"description":"A powerful model for document question answering.","id":"google/pix2struct-docvqa-large"}],"spaces":[{"description":"A robust document question answering application.","id":"impira/docquery"},{"description":"An application that can answer questions from invoices.","id":"impira/invoices"},{"description":"An application to compare different document question answering models.","id":"merve/compare_docvqa_models"}],"summary":"Document Question Answering (also known as Document Visual Question Answering) is the task of answering questions on document images. Document question answering models take a (document, question) pair as input and return an answer in natural language. Models usually rely on multi-modal features, combining text, position of words (bounding-boxes) and image.","widgetModels":["impira/layoutlm-invoices"],"youtubeId":"","id":"document-question-answering","label":"Document Question Answering","libraries":["transformers","transformers.js"]},"visual-document-retrieval":{"datasets":[{"description":"A large dataset used to train visual document retrieval models.","id":"vidore/colpali_train_set"}],"demo":{"inputs":[{"filename":"input.png","type":"img"},{"label":"Question","content":"Is the model in this paper the fastest for inference?","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"Page 10","score":0.7},{"label":"Page 11","score":0.06},{"label":"Page 9","score":0.003}]}]},"isPlaceholder":false,"metrics":[{"description":"NDCG@k scores ranked recommendation lists for top-k results. 
0 is the worst, 1 is the best.","id":"Normalized Discounted Cumulative Gain at K"}],"models":[{"description":"Very accurate visual document retrieval model for multilingual queries and documents.","id":"vidore/colqwen2-v1.0"},{"description":"Very fast and efficient visual document retrieval model that can also take in other modalities like audio.","id":"Tevatron/OmniEmbed-v0.1"}],"spaces":[{"description":"A leaderboard of visual document retrieval models.","id":"vidore/vidore-leaderboard"},{"description":"Visual retrieval augmented generation demo based on ColQwen2 model.","id":"vidore/visual-rag-tool"}],"summary":"Visual document retrieval is the task of searching for relevant image-based documents, such as PDFs. These models take a text query and multiple documents as input and return the top-most relevant documents and relevancy scores as output.","widgetModels":[""],"youtubeId":"","id":"visual-document-retrieval","label":"Visual Document Retrieval","libraries":["transformers"]},"feature-extraction":{"datasets":[{"description":"Wikipedia dataset containing cleaned articles of all languages. 
Can be used to train `feature-extraction` models.","id":"wikipedia"}],"demo":{"inputs":[{"label":"Input","content":"India, officially the Republic of India, is a country in South Asia.","type":"text"}],"outputs":[{"table":[["Dimension 1","Dimension 2","Dimension 3"],["2.583383083343506","2.757075071334839","0.9023529887199402"],["8.29393482208252","1.1071064472198486","2.03399395942688"],["-0.7754912972450256","-1.647324562072754","-0.6113331913948059"],["0.07087723910808563","1.5942802429199219","1.4610432386398315"]],"type":"tabular"}]},"metrics":[],"models":[{"description":"A powerful feature extraction model for natural language processing tasks.","id":"thenlper/gte-large"},{"description":"A strong feature extraction model for retrieval.","id":"Alibaba-NLP/gte-Qwen1.5-7B-instruct"}],"spaces":[{"description":"A leaderboard to rank text feature extraction models based on a benchmark.","id":"mteb/leaderboard"},{"description":"A leaderboard to rank best feature extraction models based on human feedback.","id":"mteb/arena"}],"summary":"Feature extraction is the task of extracting features learnt in a model.","widgetModels":["facebook/bart-base"],"id":"feature-extraction","label":"Feature Extraction","libraries":["sentence-transformers","transformers","transformers.js"]},"fill-mask":{"datasets":[{"description":"A common dataset that is used to train models for many languages.","id":"wikipedia"},{"description":"A large English dataset with text crawled from the web.","id":"c4"}],"demo":{"inputs":[{"label":"Input","content":"The <mask> barked at me","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"wolf","score":0.487},{"label":"dog","score":0.061},{"label":"cat","score":0.058},{"label":"fox","score":0.047},{"label":"squirrel","score":0.025}]}]},"metrics":[{"description":"Cross Entropy is a metric that calculates the difference between two probability distributions. 
Each probability distribution is the distribution of predicted words","id":"cross_entropy"},{"description":"Perplexity is the exponential of the cross-entropy loss. It evaluates the probabilities assigned to the next word by the model. Lower perplexity indicates better performance","id":"perplexity"}],"models":[{"description":"State-of-the-art masked language model.","id":"answerdotai/ModernBERT-large"},{"description":"A multilingual model trained on 100 languages.","id":"FacebookAI/xlm-roberta-base"}],"spaces":[],"summary":"Masked language modeling is the task of masking some of the words in a sentence and predicting which words should replace those masks. These models are useful when we want to get a statistical understanding of the language on which the model was trained.","widgetModels":["distilroberta-base"],"youtubeId":"mqElG5QJWUg","id":"fill-mask","label":"Fill-Mask","libraries":["transformers","transformers.js"]},"image-classification":{"datasets":[{"description":"Benchmark dataset used for image classification with images that belong to 100 classes.","id":"cifar100"},{"description":"Dataset consisting of images of garments.","id":"fashion_mnist"}],"demo":{"inputs":[{"filename":"image-classification-input.jpeg","type":"img"}],"outputs":[{"type":"chart","data":[{"label":"Egyptian cat","score":0.514},{"label":"Tabby cat","score":0.193},{"label":"Tiger cat","score":0.068}]}]},"metrics":[{"description":"","id":"accuracy"},{"description":"","id":"recall"},{"description":"","id":"precision"},{"description":"","id":"f1"}],"models":[{"description":"A strong image classification model.","id":"google/vit-base-patch16-224"},{"description":"A robust image classification model.","id":"facebook/deit-base-distilled-patch16-224"},{"description":"A strong image classification model.","id":"facebook/convnext-large-224"}],"spaces":[{"description":"A leaderboard to evaluate different image classification models.","id":"timm/leaderboard"}],"summary":"Image classification is 
the task of assigning a label or class to an entire image. Each image is expected to have only one class. Image classification models take an image as input and return a prediction about which class the image belongs to.","widgetModels":["google/vit-base-patch16-224"],"youtubeId":"tjAIM7BOYhw","id":"image-classification","label":"Image Classification","libraries":["keras","timm","transformers","transformers.js"]},"image-feature-extraction":{"datasets":[{"description":"ImageNet-1K is an image classification dataset in which images are used to train image-feature-extraction models.","id":"imagenet-1k"}],"demo":{"inputs":[{"filename":"mask-generation-input.png","type":"img"}],"outputs":[{"table":[["Dimension 1","Dimension 2","Dimension 3"],["0.21236686408519745","1.0919708013534546","0.8512550592422485"],["0.809657871723175","-0.18544459342956543","-0.7851548194885254"],["1.3103108406066895","-0.2479034662246704","-0.9107287526130676"],["1.8536205291748047","-0.36419737339019775","0.09717650711536407"]],"type":"tabular"}]},"metrics":[],"models":[{"description":"A powerful image feature extraction model.","id":"timm/vit_large_patch14_dinov2.lvd142m"},{"description":"A strong image feature extraction model.","id":"nvidia/MambaVision-T-1K"},{"description":"A robust image feature extraction model.","id":"facebook/dino-vitb16"},{"description":"Cutting-edge image feature extraction model.","id":"apple/aimv2-large-patch14-336-distilled"},{"description":"Strong image feature extraction model that can be used on images and documents.","id":"OpenGVLab/InternViT-6B-448px-V1-2"}],"spaces":[{"description":"A leaderboard to evaluate different image-feature-extraction models on classification performance","id":"timm/leaderboard"}],"summary":"Image feature extraction is the task of extracting features learnt in a computer vision model.","widgetModels":[],"id":"image-feature-extraction","label":"Image Feature 
Extraction","libraries":["timm","transformers"]},"image-segmentation":{"datasets":[{"description":"Scene segmentation dataset.","id":"scene_parse_150"}],"demo":{"inputs":[{"filename":"image-segmentation-input.jpeg","type":"img"}],"outputs":[{"filename":"image-segmentation-output.png","type":"img"}]},"metrics":[{"description":"Average Precision (AP) is the Area Under the PR Curve (AUC-PR). It is calculated for each semantic class separately","id":"Average Precision"},{"description":"Mean Average Precision (mAP) is the overall average of the AP values","id":"Mean Average Precision"},{"description":"Intersection over Union (IoU) is the overlap of segmentation masks. Mean IoU is the average of the IoU of all semantic classes","id":"Mean Intersection over Union"},{"description":"APα is the Average Precision at the IoU threshold of a α value, for example, AP50 and AP75","id":"APα"}],"models":[{"description":"Solid panoptic segmentation model trained on COCO.","id":"tue-mps/coco_panoptic_eomt_large_640"},{"description":"Background removal model.","id":"briaai/RMBG-1.4"},{"description":"A multipurpose image segmentation model for high resolution images.","id":"ZhengPeng7/BiRefNet"},{"description":"Powerful human-centric image segmentation model.","id":"facebook/sapiens-seg-1b"},{"description":"Panoptic segmentation model trained on the COCO (common objects) dataset.","id":"facebook/mask2former-swin-large-coco-panoptic"}],"spaces":[{"description":"A semantic segmentation application that can predict unseen instances out of the box.","id":"facebook/ov-seg"},{"description":"One of the strongest segmentation applications.","id":"jbrinkma/segment-anything"},{"description":"A human-centric segmentation model.","id":"facebook/sapiens-pose"},{"description":"An instance segmentation application to predict neuronal cell types from microscopy images.","id":"rashmi/sartorius-cell-instance-segmentation"},{"description":"An application that segments 
videos.","id":"ArtGAN/Segment-Anything-Video"},{"description":"A panoptic segmentation application built for outdoor environments.","id":"segments/panoptic-segment-anything"}],"summary":"Image Segmentation divides an image into segments where each pixel in the image is mapped to an object. This task has multiple variants such as instance segmentation, panoptic segmentation and semantic segmentation.","widgetModels":["nvidia/segformer-b0-finetuned-ade-512-512"],"youtubeId":"dKE8SIt9C-w","id":"image-segmentation","label":"Image Segmentation","libraries":["transformers","transformers.js"]},"image-to-image":{"datasets":[{"description":"Synthetic dataset, for image relighting","id":"VIDIT"},{"description":"Multiple images of celebrities, used for facial expression translation","id":"huggan/CelebA-faces"},{"description":"12M image-caption pairs.","id":"Spawning/PD12M"}],"demo":{"inputs":[{"filename":"image-to-image-input.jpeg","type":"img"}],"outputs":[{"filename":"image-to-image-output.png","type":"img"}]},"isPlaceholder":false,"metrics":[{"description":"Peak Signal to Noise Ratio (PSNR) is an approximation of the human perception, considering the ratio of the absolute intensity with respect to the variations. Measured in dB, a high value indicates a high fidelity.","id":"PSNR"},{"description":"Structural Similarity Index (SSIM) is a perceptual metric which compares the luminance, contrast and structure of two images. 
The values of SSIM range between -1 and 1, and higher values indicate closer resemblance to the original image.","id":"SSIM"},{"description":"Inception Score (IS) is an analysis of the labels predicted by an image classification model when presented with a sample of the generated images.","id":"IS"}],"models":[{"description":"An image-to-image model to improve image resolution.","id":"fal/AuraSR-v2"},{"description":"Powerful image editing model.","id":"black-forest-labs/FLUX.1-Kontext-dev"},{"description":"Virtual try-on model.","id":"yisol/IDM-VTON"},{"description":"Image re-lighting model.","id":"kontext-community/relighting-kontext-dev-lora-v3"},{"description":"Strong model for inpainting and outpainting.","id":"black-forest-labs/FLUX.1-Fill-dev"},{"description":"Strong model for image editing using depth maps.","id":"black-forest-labs/FLUX.1-Depth-dev-lora"}],"spaces":[{"description":"Image editing application.","id":"black-forest-labs/FLUX.1-Kontext-Dev"},{"description":"Image relighting application.","id":"lllyasviel/iclight-v2-vary"},{"description":"An application for image upscaling.","id":"jasperai/Flux.1-dev-Controlnet-Upscaler"}],"summary":"Image-to-image is the task of transforming an input image through a variety of possible manipulations and enhancements, such as super-resolution, image inpainting, colorization, and more.","widgetModels":["Qwen/Qwen-Image"],"youtubeId":"","id":"image-to-image","label":"Image-to-Image","libraries":["diffusers","transformers","transformers.js"]},"image-text-to-text":{"datasets":[{"description":"Instructions composed of image and text.","id":"liuhaotian/LLaVA-Instruct-150K"},{"description":"Collection of image-text pairs on scientific topics.","id":"DAMO-NLP-SG/multimodal_textbook"},{"description":"A collection of datasets made for model fine-tuning.","id":"HuggingFaceM4/the_cauldron"},{"description":"Screenshots of websites with their HTML/CSS 
codes.","id":"HuggingFaceM4/WebSight"}],"demo":{"inputs":[{"filename":"image-text-to-text-input.png","type":"img"},{"label":"Text Prompt","content":"Describe the position of the bee in detail.","type":"text"}],"outputs":[{"label":"Answer","content":"The bee is sitting on a pink flower, surrounded by other flowers. The bee is positioned in the center of the flower, with its head and front legs sticking out.","type":"text"}]},"metrics":[],"models":[{"description":"Small and efficient yet powerful vision language model.","id":"HuggingFaceTB/SmolVLM-Instruct"},{"description":"Cutting-edge reasoning vision language model.","id":"zai-org/GLM-4.5V"},{"description":"Cutting-edge small vision language model to convert documents to text.","id":"rednote-hilab/dots.ocr"},{"description":"Small yet powerful model.","id":"Qwen/Qwen2.5-VL-3B-Instruct"},{"description":"Image-text-to-text model with agentic capabilities.","id":"microsoft/Magma-8B"}],"spaces":[{"description":"Leaderboard to evaluate vision language models.","id":"opencompass/open_vlm_leaderboard"},{"description":"An application that compares object detection capabilities of different vision language models.","id":"sergiopaniego/vlm_object_understanding"},{"description":"An application to compare different OCR models.","id":"prithivMLmods/Multimodal-OCR"}],"summary":"Image-text-to-text models take in an image and text prompt and output text. These models are also called vision-language models, or VLMs. 
The difference from image-to-text models is that these models take an additional text input, not restricting the model to certain use cases like image captioning, and may also be trained to accept a conversation as input.","widgetModels":["zai-org/GLM-4.5V"],"youtubeId":"IoGaGfU1CIg","id":"image-text-to-text","label":"Image-Text-to-Text","libraries":["transformers"]},"image-text-to-image":{"datasets":[],"demo":{"inputs":[{"filename":"image-text-to-image-input.jpeg","type":"img"},{"label":"Input","content":"A city above clouds, pastel colors, Victorian style","type":"text"}],"outputs":[{"filename":"image-text-to-image-output.png","type":"img"}]},"metrics":[{"description":"The Fréchet Inception Distance (FID) calculates the distance between distributions between synthetic and real samples. A lower FID score indicates better similarity between the distributions of real and generated images.","id":"FID"},{"description":"CLIP Score measures the similarity between the generated image and the text prompt using CLIP embeddings. A higher score indicates better alignment with the text prompt.","id":"CLIP"}],"models":[{"description":"A powerful model for image-text-to-image generation.","id":"black-forest-labs/FLUX.2-dev"}],"spaces":[{"description":"An application for image-text-to-image generation.","id":"black-forest-labs/FLUX.2-dev"}],"summary":"Image-text-to-image models take an image and a text prompt as input and generate a new image based on the reference image and text instructions. 
These models are useful for image editing, style transfer, image variations, and guided image generation tasks.","widgetModels":["black-forest-labs/FLUX.2-dev"],"id":"image-text-to-image","label":"Image-Text-to-Image","libraries":["diffusers"]},"image-text-to-video":{"datasets":[],"demo":{"inputs":[{"filename":"image-text-to-video-input.jpg","type":"img"},{"label":"Input","content":"Darth Vader is surfing on the waves.","type":"text"}],"outputs":[{"filename":"image-text-to-video-output.gif","type":"img"}]},"metrics":[{"description":"Frechet Video Distance uses a model that captures coherence for changes in frames and the quality of each frame. A smaller score indicates better video generation.","id":"fvd"},{"description":"CLIPSIM measures similarity between video frames and text using an image-text similarity model. A higher score indicates better video generation.","id":"clipsim"}],"models":[{"description":"A powerful model for image-text-to-video generation.","id":"Lightricks/LTX-Video"}],"spaces":[{"description":"An application for image-text-to-video generation.","id":"Lightricks/ltx-video-distilled"}],"summary":"Image-text-to-video models take a reference image and text instructions as input and generate a video based on them. 
These models are useful for animating still images, creating dynamic content from static references, and generating videos with specific motion or transformation guidance.","widgetModels":["Lightricks/LTX-Video"],"id":"image-text-to-video","label":"Image-Text-to-Video","libraries":["diffusers"]},"image-to-text":{"datasets":[{"description":"Dataset of 12M image-text pairs from Reddit","id":"red_caps"},{"description":"Dataset from 3.3M images of Google","id":"datasets/conceptual_captions"}],"demo":{"inputs":[{"filename":"savanna.jpg","type":"img"}],"outputs":[{"label":"Detailed description","content":"a herd of giraffes and zebras grazing in a field","type":"text"}]},"metrics":[],"models":[{"description":"Strong OCR model.","id":"allenai/olmOCR-7B-0725"},{"description":"Powerful image captioning model.","id":"fancyfeast/llama-joycaption-beta-one-hf-llava"}],"spaces":[{"description":"SVG generator app from images.","id":"multimodalart/OmniSVG-3B"},{"description":"An application that converts documents to markdown.","id":"numind/NuMarkdown-8B-Thinking"},{"description":"An application that can caption images.","id":"fancyfeast/joy-caption-beta-one"}],"summary":"Image-to-text models output text from a given image. 
Image captioning or optical character recognition can be considered as the most common applications of image to text.","widgetModels":["Salesforce/blip-image-captioning-large"],"youtubeId":"","id":"image-to-text","label":"Image-to-Text","libraries":["transformers","transformers.js"]},"image-to-video":{"datasets":[{"description":"A benchmark dataset for reference image controlled video generation.","id":"ali-vilab/VACE-Benchmark"},{"description":"A dataset of video generation style preferences.","id":"Rapidata/sora-video-generation-style-likert-scoring"},{"description":"A dataset with videos and captions throughout the videos.","id":"BestWishYsh/ChronoMagic"}],"demo":{"inputs":[{"filename":"image-to-video-input.jpg","type":"img"},{"label":"Optional Text Prompt","content":"This penguin is dancing","type":"text"}],"outputs":[{"filename":"image-to-video-output.gif","type":"img"}]},"metrics":[{"description":"Fréchet Video Distance (FVD) measures the perceptual similarity between the distributions of generated videos and a set of real videos, assessing overall visual quality and temporal coherence of the video generated from an input image.","id":"fvd"},{"description":"CLIP Score measures the semantic similarity between a textual prompt (if provided alongside the input image) and the generated video frames. 
It evaluates how well the video's generated content and motion align with the textual description, conditioned on the initial image.","id":"clip_score"},{"description":"First Frame Fidelity, often measured using LPIPS (Learned Perceptual Image Patch Similarity), PSNR, or SSIM, quantifies how closely the first frame of the generated video matches the input conditioning image.","id":"lpips"},{"description":"Identity Preservation Score measures the consistency of identity (e.g., a person's face or a specific object's characteristics) between the input image and throughout the generated video frames, often calculated using features from specialized models like face recognition (e.g., ArcFace) or re-identification models.","id":"identity_preservation"},{"description":"Motion Score evaluates the quality, realism, and temporal consistency of motion in the video generated from a static image. This can be based on optical flow analysis (e.g., smoothness, magnitude), consistency of object trajectories, or specific motion plausibility assessments.","id":"motion_score"}],"models":[{"description":"LTX-Video, a 13B parameter model for high quality video generation","id":"Lightricks/LTX-Video-0.9.7-dev"},{"description":"A 14B parameter model for reference image controlled video generation","id":"Wan-AI/Wan2.1-VACE-14B"},{"description":"An image-to-video generation model using FramePack F1 methodology with Hunyuan-DiT architecture","id":"lllyasviel/FramePack_F1_I2V_HY_20250503"},{"description":"A distilled version of the LTX-Video-0.9.7-dev model for faster inference","id":"Lightricks/LTX-Video-0.9.7-distilled"},{"description":"An image-to-video generation model by Skywork AI, 14B parameters, producing 720p videos.","id":"Skywork/SkyReels-V2-I2V-14B-720P"},{"description":"Image-to-video variant of Tencent's HunyuanVideo.","id":"tencent/HunyuanVideo-I2V"},{"description":"A 14B parameter model for 720p image-to-video generation by 
Wan-AI.","id":"Wan-AI/Wan2.1-I2V-14B-720P"},{"description":"A Diffusers version of the Wan2.1-I2V-14B-720P model for 720p image-to-video generation.","id":"Wan-AI/Wan2.1-I2V-14B-720P-Diffusers"}],"spaces":[{"description":"An application to generate videos fast.","id":"Lightricks/ltx-video-distilled"},{"description":"Generate videos with the FramePack-F1","id":"linoyts/FramePack-F1"},{"description":"Generate videos with the FramePack","id":"lisonallen/framepack-i2v"},{"description":"Wan2.1 with CausVid LoRA","id":"multimodalart/wan2-1-fast"},{"description":"A demo for Stable Video Diffusion","id":"multimodalart/stable-video-diffusion"}],"summary":"Image-to-video models take a still image as input and generate a video. These models can be guided by text prompts to influence the content and style of the output video.","widgetModels":[],"id":"image-to-video","label":"Image-to-Video","libraries":["diffusers"]},"keypoint-detection":{"datasets":[{"description":"A dataset of hand keypoints of over 500k examples.","id":"Vincent-luo/hagrid-mediapipe-hands"}],"demo":{"inputs":[{"filename":"keypoint-detection-input.png","type":"img"}],"outputs":[{"filename":"keypoint-detection-output.png","type":"img"}]},"metrics":[],"models":[{"description":"A robust keypoint detection model.","id":"magic-leap-community/superpoint"},{"description":"A robust keypoint matching model.","id":"magic-leap-community/superglue_outdoor"},{"description":"Strong keypoint detection model used to detect human pose.","id":"qualcomm/RTMPose-Body2d"},{"description":"Powerful keypoint matching model.","id":"ETH-CVG/lightglue_disk"}],"spaces":[{"description":"An application that detects hand keypoints in real-time.","id":"datasciencedojo/Hand-Keypoint-Detection-Realtime"},{"description":"An application for keypoint detection and matching.","id":"ETH-CVG/LightGlue"}],"summary":"Keypoint detection is the task of identifying meaningful distinctive points or features in an 
image.","widgetModels":[],"youtubeId":"","id":"keypoint-detection","label":"Keypoint Detection","libraries":["transformers"]},"mask-generation":{"datasets":[{"description":"Widely used benchmark dataset for multiple Vision tasks.","id":"merve/coco2017"},{"description":"Medical Imaging dataset of the Human Brain for segmentation and mask generating tasks","id":"rocky93/BraTS_segmentation"}],"demo":{"inputs":[{"filename":"mask-generation-input.png","type":"img"}],"outputs":[{"filename":"mask-generation-output.png","type":"img"}]},"metrics":[{"description":"IoU is used to measure the overlap between predicted mask and the ground truth mask.","id":"Intersection over Union (IoU)"}],"models":[{"description":"Small yet powerful mask generation model.","id":"Zigeng/SlimSAM-uniform-50"},{"description":"Very strong mask generation model.","id":"facebook/sam2-hiera-large"}],"spaces":[{"description":"An application that combines a mask generation model with a zero-shot object detection model for text-guided image segmentation.","id":"merve/OWLSAM2"},{"description":"An application that compares the performance of a large and a small mask generation model.","id":"merve/slimsam"},{"description":"An application based on an improved mask generation model.","id":"SkalskiP/segment-anything-model-2"},{"description":"An application to remove objects from videos using mask generation models.","id":"SkalskiP/SAM_and_ProPainter"}],"summary":"Mask generation is the task of generating masks that identify a specific object or region of interest in a given image. 
Masks are often used in segmentation tasks, where they provide a precise way to isolate the object of interest for further processing or analysis.","widgetModels":[],"youtubeId":"","id":"mask-generation","label":"Mask Generation","libraries":["transformers"]},"object-detection":{"datasets":[{"description":"Widely used benchmark dataset for multiple vision tasks.","id":"merve/coco2017"},{"description":"Multi-task computer vision benchmark.","id":"merve/pascal-voc"}],"demo":{"inputs":[{"filename":"object-detection-input.jpg","type":"img"}],"outputs":[{"filename":"object-detection-output.jpg","type":"img"}]},"metrics":[{"description":"The Average Precision (AP) metric is the Area Under the PR Curve (AUC-PR). It is calculated for each class separately","id":"Average Precision"},{"description":"The Mean Average Precision (mAP) metric is the overall average of the AP values","id":"Mean Average Precision"},{"description":"The APα metric is the Average Precision at the IoU threshold of a α value, for example, AP50 and AP75","id":"APα"}],"models":[{"description":"Solid object detection model pre-trained on the COCO 2017 dataset.","id":"facebook/detr-resnet-50"},{"description":"Accurate object detection model.","id":"IDEA-Research/dab-detr-resnet-50"},{"description":"Fast and accurate object detection model.","id":"PekingU/rtdetr_v2_r50vd"},{"description":"Object detection model for low-lying objects.","id":"StephanST/WALDO30"}],"spaces":[{"description":"Real-time object detection demo.","id":"Roboflow/RF-DETR"},{"description":"An application that contains various object detection models to try from.","id":"Gradio-Blocks/Object-Detection-With-DETR-and-YOLOS"},{"description":"A cutting-edge object detection application.","id":"sunsmarterjieleaf/yolov12"},{"description":"An object tracking, segmentation and inpainting application.","id":"VIPLab/Track-Anything"},{"description":"Very fast object tracking application based on object 
detection.","id":"merve/RT-DETR-tracking-coco"}],"summary":"Object Detection models allow users to identify objects of certain defined classes. Object detection models receive an image as input and output the images with bounding boxes and labels on detected objects.","widgetModels":["facebook/detr-resnet-50"],"youtubeId":"WdAeKSOpxhw","id":"object-detection","label":"Object Detection","libraries":["transformers","transformers.js","ultralytics"]},"video-classification":{"datasets":[{"description":"Benchmark dataset used for video classification with videos that belong to 400 classes.","id":"kinetics400"}],"demo":{"inputs":[{"filename":"video-classification-input.gif","type":"img"}],"outputs":[{"type":"chart","data":[{"label":"Playing Guitar","score":0.514},{"label":"Playing Tennis","score":0.193},{"label":"Cooking","score":0.068}]}]},"metrics":[{"description":"","id":"accuracy"},{"description":"","id":"recall"},{"description":"","id":"precision"},{"description":"","id":"f1"}],"models":[{"description":"Strong Video Classification model trained on the Kinetics 400 dataset.","id":"google/vivit-b-16x2-kinetics400"},{"description":"Strong Video Classification model trained on the Kinetics 400 dataset.","id":"microsoft/xclip-base-patch32"}],"spaces":[{"description":"An application that classifies video at different timestamps.","id":"nateraw/lavila"},{"description":"An application that classifies video.","id":"fcakyon/video-classification"}],"summary":"Video classification is the task of assigning a label or class to an entire video. Videos are expected to have only one class for each video. 
Video classification models take a video as input and return a prediction about which class the video belongs to.","widgetModels":[],"youtubeId":"","id":"video-classification","label":"Video Classification","libraries":["transformers"]},"question-answering":{"datasets":[{"description":"A famous question answering dataset based on English articles from Wikipedia.","id":"squad_v2"},{"description":"A dataset of aggregated anonymized actual queries issued to the Google search engine.","id":"natural_questions"}],"demo":{"inputs":[{"label":"Question","content":"Which name is also used to describe the Amazon rainforest in English?","type":"text"},{"label":"Context","content":"The Amazon rainforest, also known in English as Amazonia or the Amazon Jungle","type":"text"}],"outputs":[{"label":"Answer","content":"Amazonia","type":"text"}]},"metrics":[{"description":"Exact Match is a metric based on a strict character-level match between the predicted answer and the correct answer. For answers predicted correctly, the Exact Match will be 1. Even if only one character is different, Exact Match will be 0","id":"exact-match"},{"description":"The F1-Score metric is useful when false positives and false negatives matter equally. The F1-Score is calculated over the words of the predicted sequence against those of the correct answer","id":"f1"}],"models":[{"description":"A robust baseline model for most question answering domains.","id":"deepset/roberta-base-squad2"},{"description":"Small yet robust model that can answer questions.","id":"distilbert/distilbert-base-cased-distilled-squad"},{"description":"A special model that can answer questions from tables.","id":"google/tapas-base-finetuned-wtq"}],"spaces":[{"description":"An application that can answer a long question from Wikipedia.","id":"deepset/wikipedia-assistant"}],"summary":"Question Answering models can retrieve the answer to a question from a given text, which is useful for searching for an answer in a document. 
Some question answering models can generate answers without context!","widgetModels":["deepset/roberta-base-squad2"],"youtubeId":"ajPx5LwJD-I","id":"question-answering","label":"Question Answering","libraries":["adapter-transformers","allennlp","transformers","transformers.js"]},"reinforcement-learning":{"datasets":[{"description":"A curation of widely used datasets for Data Driven Deep Reinforcement Learning (D4RL)","id":"edbeeching/decision_transformer_gym_replay"}],"demo":{"inputs":[{"label":"State","content":"Red traffic light, pedestrians are about to pass.","type":"text"}],"outputs":[{"label":"Action","content":"Stop the car.","type":"text"},{"label":"Next State","content":"Yellow light, pedestrians have crossed.","type":"text"}]},"metrics":[{"description":"Accumulated reward across all time steps discounted by a factor that ranges between 0 and 1 and determines how much the agent optimizes for future relative to immediate rewards. Measures how good is the policy ultimately found by a given algorithm considering uncertainty over the future.","id":"Discounted Total Reward"},{"description":"Average return obtained after running the policy for a certain number of evaluation episodes. As opposed to total reward, mean reward considers how much reward a given algorithm receives while learning.","id":"Mean Reward"},{"description":"Measures how good a given algorithm is after a predefined time. Some algorithms may be guaranteed to converge to optimal behavior across many time steps. 
However, an agent that reaches an acceptable level of optimality after a given time horizon may be preferable to one that ultimately reaches optimality but takes a long time.","id":"Level of Performance After Some Time"}],"models":[{"description":"A Reinforcement Learning model trained on expert data from the Gym Hopper environment","id":"edbeeching/decision-transformer-gym-hopper-expert"},{"description":"A PPO agent playing seals/CartPole-v0 using the stable-baselines3 library and the RL Zoo.","id":"HumanCompatibleAI/ppo-seals-CartPole-v0"}],"spaces":[{"description":"An application for a cute puppy agent learning to catch a stick.","id":"ThomasSimonini/Huggy"},{"description":"An application to play Snowball Fight with a reinforcement learning agent.","id":"ThomasSimonini/SnowballFight"}],"summary":"Reinforcement learning is the computational approach of learning from action by interacting with an environment through trial and error and receiving rewards (negative or positive) as feedback","widgetModels":[],"youtubeId":"q0BiUn5LiBc","id":"reinforcement-learning","label":"Reinforcement Learning","libraries":["transformers","stable-baselines3","ml-agents","sample-factory"]},"sentence-similarity":{"datasets":[{"description":"Bing queries with relevant passages from various web sources.","id":"microsoft/ms_marco"}],"demo":{"inputs":[{"label":"Source sentence","content":"Machine learning is so easy.","type":"text"},{"label":"Sentences to compare to","content":"Deep learning is so straightforward.","type":"text"},{"label":"","content":"This is so difficult, like rocket science.","type":"text"},{"label":"","content":"I can't believe how much I struggled with this.","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"Deep learning is so straightforward.","score":0.623},{"label":"This is so difficult, like rocket science.","score":0.413},{"label":"I can't believe how much I struggled with this.","score":0.256}]}]},"metrics":[{"description":"Reciprocal Rank is a 
measure used to rank the relevancy of documents given a set of documents. Reciprocal Rank is the reciprocal of the rank of the document retrieved, meaning, if the rank is 3, the Reciprocal Rank is 0.33. If the rank is 1, the Reciprocal Rank is 1","id":"Mean Reciprocal Rank"},{"description":"The similarity of the embeddings is evaluated mainly on cosine similarity. It is calculated as the cosine of the angle between two vectors. It is particularly useful when your texts are not the same length","id":"Cosine Similarity"}],"models":[{"description":"This model works well for sentences and paragraphs and can be used for clustering/grouping and semantic searches.","id":"sentence-transformers/all-mpnet-base-v2"},{"description":"A multilingual robust sentence similarity model.","id":"BAAI/bge-m3"},{"description":"A robust sentence similarity model.","id":"HIT-TMG/KaLM-embedding-multilingual-mini-instruct-v1.5"}],"spaces":[{"description":"An application that leverages sentence similarity to answer questions from YouTube videos.","id":"Gradio-Blocks/Ask_Questions_To_YouTube_Videos"},{"description":"An application that retrieves relevant PubMed abstracts for a given online article which can be used as further references.","id":"Gradio-Blocks/pubmed-abstract-retriever"},{"description":"An application that leverages sentence similarity to summarize text.","id":"nickmuchi/article-text-summarizer"},{"description":"A guide that explains how Sentence Transformers can be used for semantic search.","id":"sentence-transformers/Sentence_Transformers_for_semantic_search"}],"summary":"Sentence Similarity is the task of determining how similar two texts are. Sentence similarity models convert input texts into vectors (embeddings) that capture semantic information and calculate how close (similar) they are between them. 
This task is particularly useful for information retrieval and clustering/grouping.","widgetModels":["sentence-transformers/all-MiniLM-L6-v2"],"youtubeId":"VCZq5AkbNEU","id":"sentence-similarity","label":"Sentence Similarity","libraries":["sentence-transformers","spacy","transformers.js"]},"summarization":{"canonicalId":"text-generation","datasets":[{"description":"News articles in five different languages along with their summaries. Widely used for benchmarking multilingual summarization models.","id":"mlsum"},{"description":"English conversations and their summaries. Useful for benchmarking conversational agents.","id":"samsum"}],"demo":{"inputs":[{"label":"Input","content":"The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. It was the first structure to reach a height of 300 metres. Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct.","type":"text"}],"outputs":[{"label":"Output","content":"The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building. It was the first structure to reach a height of 300 metres.","type":"text"}]},"metrics":[{"description":"The generated sequence is compared against a reference summary, and the overlapping tokens are counted. ROUGE-N refers to the overlap of N consecutive tokens: ROUGE-1 is the overlap of single tokens and ROUGE-2 is the overlap of two consecutive tokens.","id":"rouge"}],"models":[{"description":"A strong summarization model trained on English news articles. 
Excels at generating factual summaries.","id":"facebook/bart-large-cnn"},{"description":"A summarization model trained on medical articles.","id":"Falconsai/medical_summarization"}],"spaces":[{"description":"An application that can summarize long paragraphs.","id":"pszemraj/summarize-long-text"},{"description":"A much needed summarization application for terms and conditions.","id":"ml6team/distilbart-tos-summarizer-tosdr"},{"description":"An application that summarizes long documents.","id":"pszemraj/document-summarization"},{"description":"An application that can detect errors in abstractive summarization.","id":"ml6team/post-processing-summarization"}],"summary":"Summarization is the task of producing a shorter version of a document while preserving its important information. Some models can extract text from the original input, while other models can generate entirely new text.","widgetModels":["facebook/bart-large-cnn"],"youtubeId":"yHnr5Dk2zCI","id":"summarization","label":"Summarization","libraries":["transformers","transformers.js"]},"table-question-answering":{"datasets":[{"description":"The WikiTableQuestions dataset is a large-scale dataset for the task of question answering on semi-structured tables.","id":"wikitablequestions"},{"description":"WikiSQL is a dataset of 80654 hand-annotated examples of questions and SQL queries distributed across 24241 tables from Wikipedia.","id":"wikisql"}],"demo":{"inputs":[{"table":[["Rank","Name","No.of reigns","Combined days"],["1","lou Thesz","3","3749"],["2","Ric Flair","8","3103"],["3","Harley Race","7","1799"]],"type":"tabular"},{"label":"Question","content":"What is the number of reigns for Harley Race?","type":"text"}],"outputs":[{"label":"Result","content":"7","type":"text"}]},"metrics":[{"description":"Checks whether the predicted answer(s) is the same as the ground-truth answer(s).","id":"Denotation Accuracy"}],"models":[{"description":"A table question answering model that is capable of neural SQL 
execution, i.e., it employs TAPEX to execute a SQL query on a given table.","id":"microsoft/tapex-base"},{"description":"A robust table question answering model.","id":"google/tapas-base-finetuned-wtq"}],"spaces":[{"description":"An application that answers questions based on table CSV files.","id":"katanaml/table-query"}],"summary":"Table Question Answering (Table QA) is the task of answering a question about information in a given table.","widgetModels":["google/tapas-base-finetuned-wtq"],"id":"table-question-answering","label":"Table Question Answering","libraries":["transformers"]},"tabular-classification":{"datasets":[{"description":"A comprehensive curation of datasets covering all benchmarks.","id":"inria-soda/tabular-benchmark"}],"demo":{"inputs":[{"table":[["Glucose","Blood Pressure ","Skin Thickness","Insulin","BMI"],["148","72","35","0","33.6"],["150","50","30","0","35.1"],["141","60","29","1","39.2"]],"type":"tabular"}],"outputs":[{"table":[["Diabetes"],["1"],["1"],["0"]],"type":"tabular"}]},"metrics":[{"description":"","id":"accuracy"},{"description":"","id":"recall"},{"description":"","id":"precision"},{"description":"","id":"f1"}],"models":[{"description":"Breast cancer prediction model based on decision trees.","id":"scikit-learn/cancer-prediction-trees"}],"spaces":[{"description":"An application that can predict defective products on a production line.","id":"scikit-learn/tabular-playground"},{"description":"An application that compares various tabular classification techniques on different datasets.","id":"scikit-learn/classification"}],"summary":"Tabular classification is the task of classifying a target category (a group) based on a set of attributes.","widgetModels":["scikit-learn/tabular-playground"],"youtubeId":"","id":"tabular-classification","label":"Tabular Classification","libraries":["sklearn"]},"tabular-regression":{"datasets":[{"description":"A comprehensive curation of datasets covering all 
benchmarks.","id":"inria-soda/tabular-benchmark"}],"demo":{"inputs":[{"table":[["Car Name","Horsepower","Weight"],["ford torino","140","3,449"],["amc hornet","97","2,774"],["toyota corolla","65","1,773"]],"type":"tabular"}],"outputs":[{"table":[["MPG (miles per gallon)"],["17"],["18"],["31"]],"type":"tabular"}]},"metrics":[{"description":"","id":"mse"},{"description":"Coefficient of determination (or R-squared) is a measure of how well the model fits the data. Higher R-squared is considered a better fit.","id":"r-squared"}],"models":[{"description":"Fish weight prediction based on length measurements and species.","id":"scikit-learn/Fish-Weight"}],"spaces":[{"description":"An application that can predict weight of a fish based on set of attributes.","id":"scikit-learn/fish-weight-prediction"}],"summary":"Tabular regression is the task of predicting a numerical value given a set of attributes.","widgetModels":["scikit-learn/Fish-Weight"],"youtubeId":"","id":"tabular-regression","label":"Tabular Regression","libraries":["sklearn"]},"text-classification":{"datasets":[{"description":"A widely used dataset used to benchmark multiple variants of text classification.","id":"nyu-mll/glue"},{"description":"A text classification dataset used to benchmark natural language inference models","id":"stanfordnlp/snli"}],"demo":{"inputs":[{"label":"Input","content":"I love Hugging Face!","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"POSITIVE","score":0.9},{"label":"NEUTRAL","score":0.1},{"label":"NEGATIVE","score":0}]}]},"metrics":[{"description":"","id":"accuracy"},{"description":"","id":"recall"},{"description":"","id":"precision"},{"description":"The F1 metric is the harmonic mean of the precision and recall. 
It can be calculated as: F1 = 2 * (precision * recall) / (precision + recall)","id":"f1"}],"models":[{"description":"A robust model trained for sentiment analysis.","id":"distilbert/distilbert-base-uncased-finetuned-sst-2-english"},{"description":"A sentiment analysis model specialized in financial sentiment.","id":"ProsusAI/finbert"},{"description":"A sentiment analysis model specialized in analyzing tweets.","id":"cardiffnlp/twitter-roberta-base-sentiment-latest"},{"description":"A model that can classify languages.","id":"papluca/xlm-roberta-base-language-detection"},{"description":"A model that can classify text generation attacks.","id":"meta-llama/Prompt-Guard-86M"}],"spaces":[{"description":"An application that can classify financial sentiment.","id":"IoannisTr/Tech_Stocks_Trading_Assistant"},{"description":"A dashboard that contains various text classification tasks.","id":"miesnerjacob/Multi-task-NLP"},{"description":"An application that analyzes user reviews in healthcare.","id":"spacy/healthsea-demo"}],"summary":"Text Classification is the task of assigning a label or class to a given text. 
Some use cases are sentiment analysis, natural language inference, and assessing grammatical correctness.","widgetModels":["distilbert/distilbert-base-uncased-finetuned-sst-2-english"],"youtubeId":"leNG9fN9FQU","id":"text-classification","label":"Text Classification","libraries":["adapter-transformers","setfit","spacy","transformers","transformers.js"]},"text-generation":{"datasets":[{"description":"Multilingual dataset used to evaluate text generation models.","id":"CohereForAI/Global-MMLU"},{"description":"High quality multilingual data used to train text-generation models.","id":"HuggingFaceFW/fineweb-2"},{"description":"Truly open-source, curated and cleaned dialogue dataset.","id":"HuggingFaceH4/ultrachat_200k"},{"description":"A reasoning dataset.","id":"open-r1/OpenThoughts-114k-math"},{"description":"A multilingual instruction dataset with preference ratings on responses.","id":"allenai/tulu-3-sft-mixture"},{"description":"A large synthetic dataset for alignment of text generation models.","id":"HuggingFaceTB/smoltalk"},{"description":"A dataset made for training text generation models solving math questions.","id":"HuggingFaceTB/finemath"}],"demo":{"inputs":[{"label":"Input","content":"Once upon a time,","type":"text"}],"outputs":[{"label":"Output","content":"Once upon a time, we knew that our ancestors were on the verge of extinction. The great explorers and poets of the Old World, from Alexander the Great to Chaucer, are dead and gone. A good many of our ancient explorers and poets have","type":"text"}]},"metrics":[{"description":"Cross Entropy is a metric that calculates the difference between two probability distributions. Each probability distribution is the distribution of predicted words","id":"Cross Entropy"},{"description":"The Perplexity metric is the exponential of the cross-entropy loss. It evaluates the probabilities assigned to the next word by the model. 
Lower perplexity indicates better performance","id":"Perplexity"}],"models":[{"description":"A text-generation model trained to follow instructions.","id":"google/gemma-2-2b-it"},{"description":"Powerful text generation model for coding.","id":"Qwen/Qwen3-Coder-480B-A35B-Instruct"},{"description":"Great text generation model with top-notch tool calling capabilities.","id":"openai/gpt-oss-120b"},{"description":"Powerful text generation model.","id":"zai-org/GLM-4.5"},{"description":"A powerful small model with reasoning capabilities.","id":"Qwen/Qwen3-4B-Thinking-2507"},{"description":"Strong conversational model that supports very long instructions.","id":"Qwen/Qwen2.5-7B-Instruct-1M"},{"description":"Text generation model used to write code.","id":"Qwen/Qwen2.5-Coder-32B-Instruct"},{"description":"Powerful reasoning-based open large language model.","id":"deepseek-ai/DeepSeek-R1"}],"spaces":[{"description":"An application that writes and executes code from text instructions and supports many models.","id":"akhaliq/anycoder"},{"description":"An application that builds websites from natural language prompts.","id":"enzostvs/deepsite"},{"description":"A leaderboard for comparing chain-of-thought performance of models.","id":"logikon/open_cot_leaderboard"},{"description":"A text generation application based on a very powerful LLaMA2 model.","id":"ysharma/Explore_llamav2_with_TGI"},{"description":"A text generation application to converse with the Zephyr model.","id":"HuggingFaceH4/zephyr-chat"},{"description":"A leaderboard that ranks text generation models based on blind votes from people.","id":"lmsys/chatbot-arena-leaderboard"},{"description":"A chatbot to converse with a very powerful text generation model.","id":"mlabonne/phixtral-chat"}],"summary":"Text generation is the task of generating new text given another text. 
These models can, for example, fill in incomplete text or paraphrase.","widgetModels":["mistralai/Mistral-Nemo-Instruct-2407"],"youtubeId":"e9gNEAlsOvU","id":"text-generation","label":"Text Generation","libraries":["transformers","transformers.js"]},"text-ranking":{"datasets":[{"description":"Bing queries with relevant passages from various web sources.","id":"microsoft/ms_marco"}],"demo":{"inputs":[{"label":"Source sentence","content":"Machine learning is so easy.","type":"text"},{"label":"Sentences to compare to","content":"Deep learning is so straightforward.","type":"text"},{"label":"","content":"This is so difficult, like rocket science.","type":"text"},{"label":"","content":"I can't believe how much I struggled with this.","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"Deep learning is so straightforward.","score":2.2006407},{"label":"This is so difficult, like rocket science.","score":-6.2634873},{"label":"I can't believe how much I struggled with this.","score":-10.251488}]}]},"metrics":[{"description":"Discounted Cumulative Gain (DCG) measures the gain, or usefulness, of search results discounted by their position. The normalization is done by dividing the DCG by the ideal DCG, which is the DCG of the perfect ranking.","id":"Normalized Discounted Cumulative Gain"},{"description":"Reciprocal Rank is a measure used to rank the relevancy of documents given a set of documents. Reciprocal Rank is the reciprocal of the rank of the document retrieved, meaning, if the rank is 3, the Reciprocal Rank is 0.33. 
If the rank is 1, the Reciprocal Rank is 1","id":"Mean Reciprocal Rank"},{"description":"Mean Average Precision (mAP) is the overall average of the Average Precision (AP) values, where AP is the Area Under the PR Curve (AUC-PR)","id":"Mean Average Precision"}],"models":[{"description":"An extremely efficient text ranking model trained on a web search dataset.","id":"cross-encoder/ms-marco-MiniLM-L6-v2"},{"description":"A strong multilingual text reranker model.","id":"Alibaba-NLP/gte-multilingual-reranker-base"},{"description":"An efficient text ranking model that punches above its weight.","id":"Alibaba-NLP/gte-reranker-modernbert-base"}],"spaces":[],"summary":"Text Ranking is the task of ranking a set of texts based on their relevance to a query. Text ranking models are trained on large datasets of queries and relevant documents to learn how to rank documents based on their relevance to the query. This task is particularly useful for search engines and information retrieval systems.","widgetModels":["cross-encoder/ms-marco-MiniLM-L6-v2"],"youtubeId":"","id":"text-ranking","label":"Text Ranking","libraries":["sentence-transformers","transformers"]},"text-to-image":{"datasets":[{"description":"RedCaps is a large-scale dataset of 12M image-text pairs collected from Reddit.","id":"red_caps"},{"description":"Conceptual Captions is a dataset consisting of ~3.3M images annotated with captions.","id":"conceptual_captions"},{"description":"12M image-caption pairs.","id":"Spawning/PD12M"}],"demo":{"inputs":[{"label":"Input","content":"A city above clouds, pastel colors, Victorian style","type":"text"}],"outputs":[{"filename":"image.jpeg","type":"img"}]},"metrics":[{"description":"The Inception Score (IS) measure assesses diversity and meaningfulness. It uses a generated image sample to predict its label. 
A higher score signifies more diverse and meaningful images.","id":"IS"},{"description":"The Fréchet Inception Distance (FID) calculates the distance between distributions between synthetic and real samples. A lower FID score indicates better similarity between the distributions of real and generated images.","id":"FID"},{"description":"R-precision assesses how the generated image aligns with the provided text description. It uses the generated images as queries to retrieve relevant text descriptions. The top 'r' relevant descriptions are selected and used to calculate R-precision as r/R, where 'R' is the number of ground truth descriptions associated with the generated images. A higher R-precision value indicates a better model.","id":"R-Precision"}],"models":[{"description":"One of the most powerful image generation models that can generate realistic outputs.","id":"black-forest-labs/FLUX.1-Krea-dev"},{"description":"A powerful image generation model.","id":"Qwen/Qwen-Image"},{"description":"Powerful and fast image generation model.","id":"ByteDance/SDXL-Lightning"},{"description":"A powerful text-to-image model.","id":"ByteDance/Hyper-SD"}],"spaces":[{"description":"A powerful text-to-image application.","id":"stabilityai/stable-diffusion-3-medium"},{"description":"A text-to-image application to generate comics.","id":"jbilcke-hf/ai-comic-factory"},{"description":"An application to match multiple custom image generation models.","id":"multimodalart/flux-lora-lab"},{"description":"A powerful yet very fast image generation application.","id":"latent-consistency/lcm-lora-for-sdxl"},{"description":"A gallery to explore various text-to-image models.","id":"multimodalart/LoraTheExplorer"},{"description":"An application for `text-to-image`, `image-to-image` and image inpainting.","id":"ArtGAN/Stable-Diffusion-ControlNet-WebUI"},{"description":"An application to generate realistic images given photos of a person and a 
prompt.","id":"InstantX/InstantID"}],"summary":"Text-to-image is the task of generating images from input text. These pipelines can also be used to modify and edit images based on text prompts.","widgetModels":["black-forest-labs/FLUX.1-dev"],"youtubeId":"","id":"text-to-image","label":"Text-to-Image","libraries":["diffusers"]},"text-to-speech":{"canonicalId":"text-to-audio","datasets":[{"description":"10K hours of multi-speaker English dataset.","id":"parler-tts/mls_eng_10k"},{"description":"Multi-speaker English dataset.","id":"mythicinfinity/libritts_r"},{"description":"Multi-lingual dataset.","id":"facebook/multilingual_librispeech"}],"demo":{"inputs":[{"label":"Input","content":"I love audio models on the Hub!","type":"text"}],"outputs":[{"filename":"audio.wav","type":"audio"}]},"metrics":[{"description":"The Mel Cepstral Distortion (MCD) metric is used to calculate the quality of generated speech.","id":"mel cepstral distortion"}],"models":[{"description":"Small yet powerful TTS model.","id":"KittenML/kitten-tts-nano-0.1"},{"description":"Bleeding edge TTS model.","id":"ResembleAI/chatterbox"},{"description":"A massively multi-lingual TTS model.","id":"fishaudio/fish-speech-1.5"},{"description":"A text-to-dialogue model.","id":"nari-labs/Dia-1.6B-0626"}],"spaces":[{"description":"An application to generate high-quality speech in different languages.","id":"hexgrad/Kokoro-TTS"},{"description":"A multilingual text-to-speech application.","id":"fishaudio/fish-speech-1"},{"description":"Performant TTS application.","id":"ResembleAI/Chatterbox"},{"description":"An application to compare different TTS models.","id":"TTS-AGI/TTS-Arena-V2"},{"description":"An application that generates podcast episodes.","id":"ngxson/kokoro-podcast-generator"}],"summary":"Text-to-Speech (TTS) is the task of generating natural sounding speech given text input. 
TTS models can be extended to have a single model that generates speech for multiple speakers and multiple languages.","widgetModels":["suno/bark"],"youtubeId":"NW62DpzJ274","id":"text-to-speech","label":"Text-to-Speech","libraries":["espnet","tensorflowtts","transformers","transformers.js"]},"text-to-video":{"datasets":[{"description":"Microsoft Research Video to Text is a large-scale dataset for open domain video captioning","id":"iejMac/CLIP-MSR-VTT"},{"description":"UCF101 Human Actions dataset consists of 13,320 video clips from YouTube, with 101 classes.","id":"quchenyuan/UCF101-ZIP"},{"description":"A high-quality dataset for human action recognition in YouTube videos.","id":"nateraw/kinetics"},{"description":"A dataset of video clips of humans performing pre-defined basic actions with everyday objects.","id":"HuggingFaceM4/something_something_v2"},{"description":"This dataset consists of text-video pairs and contains noisy samples with irrelevant video descriptions","id":"HuggingFaceM4/webvid"},{"description":"A dataset of short Flickr videos for the temporal localization of events with descriptions.","id":"iejMac/CLIP-DiDeMo"}],"demo":{"inputs":[{"label":"Input","content":"Darth Vader is surfing on the waves.","type":"text"}],"outputs":[{"filename":"text-to-video-output.gif","type":"img"}]},"metrics":[{"description":"Inception Score uses an image classification model that predicts class labels and evaluates how distinct and diverse the images are. A higher score indicates better video generation.","id":"is"},{"description":"Frechet Inception Distance uses an image classification model to obtain image embeddings. The metric compares mean and standard deviation of the embeddings of real and generated images. A smaller score indicates better video generation.","id":"fid"},{"description":"Frechet Video Distance uses a model that captures coherence for changes in frames and the quality of each frame. 
A smaller score indicates better video generation.","id":"fvd"},{"description":"CLIPSIM measures similarity between video frames and text using an image-text similarity model. A higher score indicates better video generation.","id":"clipsim"}],"models":[{"description":"A strong model for consistent video generation.","id":"tencent/HunyuanVideo"},{"description":"A text-to-video model with high fidelity motion and strong prompt adherence.","id":"Lightricks/LTX-Video"},{"description":"A text-to-video model focusing on physics-aware applications like robotics.","id":"nvidia/Cosmos-1.0-Diffusion-7B-Text2World"},{"description":"Very fast model for video generation.","id":"Lightricks/LTX-Video-0.9.8-13B-distilled"}],"spaces":[{"description":"An application that generates video from text.","id":"VideoCrafter/VideoCrafter"},{"description":"Consistent video generation application.","id":"Wan-AI/Wan2.1"},{"description":"A cutting edge video generation application.","id":"Pyramid-Flow/pyramid-flow"}],"summary":"Text-to-video models can be used in any application that requires generating a consistent sequence of images from text. 
","widgetModels":["Wan-AI/Wan2.2-TI2V-5B"],"id":"text-to-video","label":"Text-to-Video","libraries":["diffusers"]},"token-classification":{"datasets":[{"description":"A widely used dataset useful to benchmark named entity recognition models.","id":"eriktks/conll2003"},{"description":"A multilingual dataset of Wikipedia articles annotated for named entity recognition in over 150 different languages.","id":"unimelb-nlp/wikiann"}],"demo":{"inputs":[{"label":"Input","content":"My name is Omar and I live in Zürich.","type":"text"}],"outputs":[{"text":"My name is Omar and I live in Zürich.","tokens":[{"type":"PERSON","start":11,"end":15},{"type":"GPE","start":30,"end":36}],"type":"text-with-tokens"}]},"metrics":[{"description":"","id":"accuracy"},{"description":"","id":"recall"},{"description":"","id":"precision"},{"description":"","id":"f1"}],"models":[{"description":"A robust performance model to identify people, locations, organizations and names of miscellaneous entities.","id":"dslim/bert-base-NER"},{"description":"A strong model to identify people, locations, organizations and names in multiple languages.","id":"FacebookAI/xlm-roberta-large-finetuned-conll03-english"},{"description":"A token classification model specialized on medical entity recognition.","id":"blaze999/Medical-NER"},{"description":"Flair models are typically the state of the art in named entity recognition tasks.","id":"flair/ner-english"}],"spaces":[{"description":"An application that can recognizes entities, extracts noun chunks and recognizes various linguistic features of each token.","id":"spacy/gradio_pipeline_visualizer"}],"summary":"Token classification is a natural language understanding task in which a label is assigned to some tokens in a text. Some popular token classification subtasks are Named Entity Recognition (NER) and Part-of-Speech (PoS) tagging. 
NER models could be trained to identify specific entities in a text, such as dates, individuals and places; and PoS tagging would identify, for example, which words in a text are verbs, nouns, and punctuation marks.","widgetModels":["FacebookAI/xlm-roberta-large-finetuned-conll03-english"],"youtubeId":"wVHdVlPScxA","id":"token-classification","label":"Token Classification","libraries":["adapter-transformers","flair","spacy","span-marker","stanza","transformers","transformers.js"]},"translation":{"canonicalId":"text-generation","datasets":[{"description":"A dataset of copyright-free books translated into 16 different languages.","id":"Helsinki-NLP/opus_books"},{"description":"An example of translation between programming languages. This dataset consists of functions in Java and C#.","id":"google/code_x_glue_cc_code_to_code_trans"}],"demo":{"inputs":[{"label":"Input","content":"My name is Omar and I live in Zürich.","type":"text"}],"outputs":[{"label":"Output","content":"Mein Name ist Omar und ich wohne in Zürich.","type":"text"}]},"metrics":[{"description":"BLEU score is calculated by counting the number of shared single or subsequent tokens between the generated sequence and the reference. Subsequent n tokens are called “n-grams”. Unigram refers to a single token while bi-gram refers to token pairs and n-grams refer to n subsequent tokens. 
The score ranges from 0 to 1, where 1 means the translation perfectly matches the reference and 0 means it does not match at all","id":"bleu"},{"description":"","id":"sacrebleu"}],"models":[{"description":"A very powerful model that can translate between many languages, especially low-resource languages.","id":"facebook/nllb-200-1.3B"},{"description":"A general-purpose Transformer that can be used to translate from English to German, French, or Romanian.","id":"google-t5/t5-base"}],"spaces":[{"description":"An application that can translate between 100 languages.","id":"Iker/Translate-100-languages"},{"description":"An application that can translate between many languages.","id":"Geonmo/nllb-translation-demo"}],"summary":"Translation is the task of converting text from one language to another.","widgetModels":["facebook/mbart-large-50-many-to-many-mmt"],"youtubeId":"1JvfrvZgi6c","id":"translation","label":"Translation","libraries":["transformers","transformers.js"]},"unconditional-image-generation":{"datasets":[{"description":"The CIFAR-100 dataset consists of 60000 32x32 colour images in 100 classes, with 600 images per class.","id":"cifar100"},{"description":"Multiple images of celebrities, used for facial expression translation.","id":"CelebA"}],"demo":{"inputs":[{"label":"Seed","content":"42","type":"text"},{"label":"Number of images to generate:","content":"4","type":"text"}],"outputs":[{"filename":"unconditional-image-generation-output.jpeg","type":"img"}]},"metrics":[{"description":"The inception score (IS) evaluates the quality of generated images. 
It measures the diversity of the generated images (the model predictions are evenly distributed across all possible labels) and their 'distinction' or 'sharpness' (the model confidently predicts a single label for each image).","id":"Inception score (IS)"},{"description":"The Fréchet Inception Distance (FID) evaluates the quality of images created by a generative model by calculating the distance between feature vectors for real and generated images.","id":"Frećhet Inception Distance (FID)"}],"models":[{"description":"High-quality image generation model trained on the CIFAR-10 dataset. It synthesizes images of the ten classes presented in the dataset using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.","id":"google/ddpm-cifar10-32"},{"description":"High-quality image generation model trained on the 256x256 CelebA-HQ dataset. It synthesizes images of faces using diffusion probabilistic models, a class of latent variable models inspired by considerations from nonequilibrium thermodynamics.","id":"google/ddpm-celebahq-256"}],"spaces":[{"description":"An application that can generate realistic faces.","id":"CompVis/celeba-latent-diffusion"}],"summary":"Unconditional image generation is the task of generating images with no condition in any context (like a prompt text or another image). 
Once trained, the model will create images that resemble its training data distribution.","widgetModels":[""],"youtubeId":"","id":"unconditional-image-generation","label":"Unconditional Image Generation","libraries":["diffusers"]},"video-text-to-text":{"datasets":[{"description":"Multiple-choice questions and answers about videos.","id":"lmms-lab/Video-MME"},{"description":"A dataset of instructions and question-answer pairs about videos.","id":"lmms-lab/VideoChatGPT"},{"description":"Large video understanding dataset.","id":"HuggingFaceFV/finevideo"}],"demo":{"inputs":[{"filename":"video-text-to-text-input.gif","type":"img"},{"label":"Text Prompt","content":"What is happening in this video?","type":"text"}],"outputs":[{"label":"Answer","content":"The video shows a series of images showing a fountain with water jets and a variety of colorful flowers and butterflies in the background.","type":"text"}]},"metrics":[],"models":[{"description":"A robust video-text-to-text model.","id":"Vision-CAIR/LongVU_Qwen2_7B"},{"description":"Strong video-text-to-text model with reasoning capabilities.","id":"GoodiesHere/Apollo-LMMs-Apollo-7B-t32"},{"description":"Strong video-text-to-text model.","id":"HuggingFaceTB/SmolVLM2-2.2B-Instruct"}],"spaces":[{"description":"An application to chat with a video-text-to-text model.","id":"llava-hf/video-llava"},{"description":"A leaderboard for various video-text-to-text models.","id":"opencompass/openvlm_video_leaderboard"},{"description":"An application to generate highlights from a video.","id":"HuggingFaceTB/SmolVLM2-HighlightGenerator"}],"summary":"Video-text-to-text models take in a video and a text prompt and output text. 
These models are also called video-language models.","widgetModels":[""],"youtubeId":"","id":"video-text-to-text","label":"Video-Text-to-Text","libraries":["transformers"]},"video-to-video":{"datasets":[{"description":"Dataset with detailed annotations for training and benchmarking video instance editing.","id":"suimu/VIRESET"},{"description":"Dataset to evaluate models on long video generation and understanding.","id":"zhangsh2001/LongV-EVAL"},{"description":"Collection of 104 demo videos from the SeedVR/SeedVR2 series showcasing model outputs.","id":"Iceclear/SeedVR_VideoDemos"}],"demo":{"inputs":[{"filename":"input.gif","type":"img"}],"outputs":[{"filename":"output.gif","type":"img"}]},"metrics":[],"models":[{"description":"Model for editing outfits, character, and scenery in videos.","id":"decart-ai/Lucy-Edit-Dev"},{"description":"Framework that uses 3D mesh proxies for precise, consistent video editing.","id":"LeoLau/Shape-for-Motion"},{"description":"Model for generating physics-aware videos from input videos and control conditions.","id":"nvidia/Cosmos-Transfer2.5-2B"},{"description":"A model to upscale videos at input, designed for seamless use with ComfyUI.","id":"numz/SeedVR2_comfyUI"}],"spaces":[{"description":"Interactive demo space for Lucy-Edit-Dev video editing.","id":"decart-ai/lucy-edit-dev"},{"description":"Demo space for SeedVR2-3B showcasing video upscaling and restoration.","id":"ByteDance-Seed/SeedVR2-3B"}],"summary":"Video-to-video models take one or more videos as input and generate new videos as output. 
They can enhance quality, interpolate frames, modify styles, or create new motion dynamics, enabling creative applications, video production, and research.","widgetModels":[],"youtubeId":"","id":"video-to-video","label":"Video-to-Video","libraries":["diffusers"]},"visual-question-answering":{"datasets":[{"description":"A widely used dataset containing questions (with answers) about images.","id":"Graphcore/vqa"},{"description":"A dataset to benchmark visual reasoning based on text in images.","id":"facebook/textvqa"}],"demo":{"inputs":[{"filename":"elephant.jpeg","type":"img"},{"label":"Question","content":"What is in this image?","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"elephant","score":0.97},{"label":"elephants","score":0.06},{"label":"animal","score":0.003}]}]},"isPlaceholder":false,"metrics":[{"description":"","id":"accuracy"},{"description":"Measures how much a predicted answer differs from the ground truth based on the difference in their semantic meaning.","id":"wu-palmer similarity"}],"models":[{"description":"A visual question answering model trained to convert charts and plots to text.","id":"google/deplot"},{"description":"A visual question answering model trained for mathematical reasoning and chart derendering from images.","id":"google/matcha-base"},{"description":"A strong visual question answering model that answers questions from book covers.","id":"google/pix2struct-ocrvqa-large"}],"spaces":[{"description":"An application that compares visual question answering models across different tasks.","id":"merve/pix2struct"},{"description":"An application that can answer questions based on images.","id":"nielsr/vilt-vqa"},{"description":"An application that can caption images and answer questions about a given image.","id":"Salesforce/BLIP"},{"description":"An application that can caption images and answer questions about a given image. 
","id":"vumichien/Img2Prompt"}],"summary":"Visual Question Answering is the task of answering open-ended questions based on an image. They output natural language responses to natural language questions.","widgetModels":["dandelin/vilt-b32-finetuned-vqa"],"youtubeId":"","id":"visual-question-answering","label":"Visual Question Answering","libraries":["transformers","transformers.js"]},"zero-shot-classification":{"datasets":[{"description":"A widely used dataset used to benchmark multiple variants of text classification.","id":"nyu-mll/glue"},{"description":"The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information.","id":"nyu-mll/multi_nli"},{"description":"FEVER is a publicly available dataset for fact extraction and verification against textual sources.","id":"fever/fever"}],"demo":{"inputs":[{"label":"Text Input","content":"Dune is the best movie ever.","type":"text"},{"label":"Candidate Labels","content":"CINEMA, ART, MUSIC","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"CINEMA","score":0.9},{"label":"ART","score":0.1},{"label":"MUSIC","score":0}]}]},"metrics":[],"models":[{"description":"Powerful zero-shot text classification model.","id":"facebook/bart-large-mnli"},{"description":"Cutting-edge zero-shot multilingual text classification model.","id":"MoritzLaurer/ModernBERT-large-zeroshot-v2.0"},{"description":"Zero-shot text classification model that can be used for topic and sentiment classification.","id":"knowledgator/gliclass-modern-base-v2.0-init"}],"spaces":[],"summary":"Zero-shot text classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes.","widgetModels":["facebook/bart-large-mnli"],"id":"zero-shot-classification","label":"Zero-Shot 
Classification","libraries":["transformers","transformers.js"]},"zero-shot-image-classification":{"datasets":[{"description":"","id":""}],"demo":{"inputs":[{"filename":"image-classification-input.jpeg","type":"img"},{"label":"Classes","content":"cat, dog, bird","type":"text"}],"outputs":[{"type":"chart","data":[{"label":"Cat","score":0.664},{"label":"Dog","score":0.329},{"label":"Bird","score":0.008}]}]},"metrics":[{"description":"Computes the number of times the correct label appears in top K labels predicted","id":"top-K accuracy"}],"models":[{"description":"Multilingual image classification model for 80 languages.","id":"visheratin/mexma-siglip"},{"description":"Strong zero-shot image classification model.","id":"google/siglip2-base-patch16-224"},{"description":"Robust zero-shot image classification model.","id":"intfloat/mmE5-mllama-11b-instruct"},{"description":"Powerful zero-shot image classification model supporting 94 languages.","id":"jinaai/jina-clip-v2"},{"description":"Strong image classification model for biomedical domain.","id":"microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"}],"spaces":[{"description":"An application that leverages zero-shot image classification to find best captions to generate an image. ","id":"pharma/CLIP-Interrogator"},{"description":"An application to compare different zero-shot image classification models. 
","id":"merve/compare_clip_siglip"}],"summary":"Zero-shot image classification is the task of classifying previously unseen classes during training of a model.","widgetModels":["google/siglip-so400m-patch14-224"],"youtubeId":"","id":"zero-shot-image-classification","label":"Zero-Shot Image Classification","libraries":["transformers","transformers.js"]},"zero-shot-object-detection":{"datasets":[],"demo":{"inputs":[{"filename":"zero-shot-object-detection-input.jpg","type":"img"},{"label":"Classes","content":"cat, dog, bird","type":"text"}],"outputs":[{"filename":"zero-shot-object-detection-output.jpg","type":"img"}]},"metrics":[{"description":"The Average Precision (AP) metric is the Area Under the PR Curve (AUC-PR). It is calculated for each class separately","id":"Average Precision"},{"description":"The Mean Average Precision (mAP) metric is the overall average of the AP values","id":"Mean Average Precision"},{"description":"The APα metric is the Average Precision at the IoU threshold of a α value, for example, AP50 and AP75","id":"APα"}],"models":[{"description":"Solid zero-shot object detection model.","id":"openmmlab-community/mm_grounding_dino_large_all"},{"description":"Cutting-edge zero-shot object detection model.","id":"fushh7/LLMDet"}],"spaces":[{"description":"A demo to compare different zero-shot object detection models per output and latency.","id":"ariG23498/zero-shot-od"},{"description":"A demo that combines a zero-shot object detection and mask generation model for zero-shot segmentation.","id":"merve/OWLSAM"}],"summary":"Zero-shot object detection is a computer vision task to detect objects and their classes in images, without any prior training or knowledge of the classes. 
Zero-shot object detection models receive an image as input, as well as a list of candidate classes, and output the bounding boxes and labels where the objects have been detected.","widgetModels":[],"youtubeId":"","id":"zero-shot-object-detection","label":"Zero-Shot Object Detection","libraries":["transformers","transformers.js"]},"text-to-3d":{"datasets":[{"description":"A large dataset of over 10 million 3D objects.","id":"allenai/objaverse-xl"},{"description":"Descriptive captions for 3D objects in Objaverse.","id":"tiange/Cap3D"}],"demo":{"inputs":[{"label":"Prompt","content":"a cat statue","type":"text"}],"outputs":[{"label":"Result","content":"text-to-3d-3d-output-filename.glb","type":"text"}]},"metrics":[],"models":[{"description":"Text-to-3D mesh model by OpenAI","id":"openai/shap-e"},{"description":"Generative 3D gaussian splatting model.","id":"ashawkey/LGM"}],"spaces":[{"description":"Text-to-3D demo with mesh outputs.","id":"hysts/Shap-E"},{"description":"Text/image-to-3D demo with splat outputs.","id":"ashawkey/LGM"}],"summary":"Text-to-3D models take in text input and produce 3D output.","widgetModels":[],"youtubeId":"","id":"text-to-3d","label":"Text-to-3D","libraries":["diffusers"]},"image-to-3d":{"datasets":[{"description":"A large dataset of over 10 million 3D objects.","id":"allenai/objaverse-xl"},{"description":"A dataset of isolated object images for evaluating image-to-3D models.","id":"dylanebert/iso3d"}],"demo":{"inputs":[{"filename":"image-to-3d-image-input.png","type":"img"}],"outputs":[{"label":"Result","content":"image-to-3d-3d-output-filename.glb","type":"text"}]},"metrics":[],"models":[{"description":"Fast image-to-3D mesh model by Tencent.","id":"TencentARC/InstantMesh"},{"description":"3D world generation model.","id":"tencent/HunyuanWorld-1"},{"description":"A scaled up image-to-3D mesh model derived from TripoSR.","id":"hwjiang/Real3D"},{"description":"Consistent image-to-3d generation 
model.","id":"stabilityai/stable-point-aware-3d"}],"spaces":[{"description":"Leaderboard to evaluate image-to-3D models.","id":"dylanebert/3d-arena"},{"description":"Image-to-3D demo with mesh outputs.","id":"TencentARC/InstantMesh"},{"description":"Image-to-3D demo.","id":"stabilityai/stable-point-aware-3d"},{"description":"Image-to-3D demo with mesh outputs.","id":"hwjiang/Real3D"},{"description":"Image-to-3D demo with splat outputs.","id":"dylanebert/LGM-mini"}],"summary":"Image-to-3D models take in image input and produce 3D output.","widgetModels":[],"youtubeId":"","id":"image-to-3d","label":"Image-to-3D","libraries":["diffusers"]}}