Usage
Load the tokenizer and model with the transformers library:
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("EMBEDDIA/sloberta")
model = AutoModelForMaskedLM.from_pretrained("EMBEDDIA/sloberta")
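Once loaded, the model can be tried on a masked-word task. The following is a minimal sketch, not an official example from the authors: it uses the `fill-mask` pipeline from transformers, and the helper names `build_masked_sentence` and `predict_masked_word`, the `___` placeholder convention, and the sample sentence are all hypothetical choices made for illustration. The default `<mask>` string is an assumption based on the CamemBERT-style tokenizer; the prediction helper reads the actual mask token from the tokenizer instead of relying on it.

```python
def build_masked_sentence(template: str, mask_token: str = "<mask>") -> str:
    """Replace the (hypothetical) ___ placeholder with the model's mask token."""
    if template.count("___") != 1:
        raise ValueError("template must contain exactly one ___ placeholder")
    return template.replace("___", mask_token)


def predict_masked_word(template: str, top_k: int = 5):
    """Return (token, score) pairs for the masked position in the template."""
    # Imported lazily so build_masked_sentence works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="EMBEDDIA/sloberta")
    # Use the tokenizer's own mask token rather than a hard-coded string.
    sentence = build_masked_sentence(template, fill_mask.tokenizer.mask_token)
    predictions = fill_mask(sentence, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

Calling `predict_masked_word("Ljubljana je glavno ___ Slovenije.")` downloads the model weights on first use and returns the highest-scoring candidates for the missing word together with their scores.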
SloBERTa
SloBERTa is a monolingual Slovene BERT-like model. It is closely related to the French CamemBERT model (https://camembert-model.fr/). The training corpora contain 3.47 billion tokens in total, and the subword vocabulary contains 32,000 tokens. The scripts and programs used for data preparation and model training are available at https://github.com/clarinsi/Slovene-BERT-Tool
SloBERTa was trained for 200,000 iterations, or about 98 epochs.
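The training figures above imply a rough scale for each training step. The following back-of-the-envelope sketch uses only the numbers stated in this card. It assumes the 3.47 billion figure counts tokens in the same units the model consumes during training; if it counts words rather than subwords, the per-iteration value is only a loose estimate and should not be read as the batch size used by the authors.

```python
# Figures taken from the model card.
TOTAL_ITERATIONS = 200_000
APPROX_EPOCHS = 98
CORPUS_TOKENS = 3.47e9

# One epoch is one pass over the corpus, so this is the number of steps per pass.
iterations_per_epoch = TOTAL_ITERATIONS / APPROX_EPOCHS

# Assumption: corpus tokens are counted as the model sees them.
tokens_per_iteration = CORPUS_TOKENS / iterations_per_epoch

print(f"{iterations_per_epoch:,.0f} iterations per epoch")
print(f"{tokens_per_iteration:,.0f} corpus tokens per iteration")
```

Under that assumption, one epoch corresponds to roughly 2,041 iterations, and each iteration covers about 1.7 million corpus tokens.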
Corpora
The following corpora were used for training the model:
- Gigafida 2.0
- Kas 1.0
- Janes 1.0 (only the Janes-news, Janes-forum, Janes-blog, and Janes-wiki subcorpora)
- Slovenian parliamentary corpus siParl 2.0
- slWaC