The names of the models are as clever as they are descriptive. For example, DistilRoBERTa is a variant of BERT, a model designed to understand human language by considering the context of words and sentences in both directions (left to right and right to left).
RoBERTa is a more robustly pretrained version of the original BERT model (hence the "Ro"). Knowledge distillation is a common technique for reducing a model's size while preserving most of its performance; DistilBERT, for example, is distilled during pretraining and is 40% smaller than BERT. DistilRoBERTa combines the robustly optimized pretraining approach of RoBERTa with the pretraining distillation of DistilBERT, yielding a smaller yet still robust model.
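To make the idea of distillation concrete, here is a minimal PyTorch sketch of a distillation loss: the student is trained to match the teacher's softened output distribution as well as the true labels. The temperature and weighting values are illustrative placeholders, not the exact recipe used to train DistilBERT.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft-target loss: push the student's distribution toward the teacher's.
    # The temperature softens both distributions so smaller logits still carry signal.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard-target loss: standard cross-entropy against the true labels.
    hard_loss = F.cross_entropy(student_logits, labels)
    # Blend the two objectives; alpha is an illustrative weighting.
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example with random tensors: a batch of 8 examples and 3 classes.
student_logits = torch.randn(8, 3)
teacher_logits = torch.randn(8, 3)
labels = torch.randint(0, 3, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```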
We can also gather from the model's name that the base DistilRoBERTa was fine-tuned for sentiment analysis on financial news.
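Such a model can be loaded directly through the transformers pipeline API. The repo id below is illustrative; substitute the exact model name from your search results.

```python
from transformers import pipeline

# Illustrative repo id for a DistilRoBERTa model fine-tuned on financial news sentiment.
classifier = pipeline(
    "text-classification",
    model="mrm8488/distilroberta-finetuned-financial-news-sentiment-analysis",
)

# The pipeline returns a label and a confidence score for each input string.
print(classifier("Quarterly revenue rose 12%, beating analyst expectations."))
```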
Some of the models in the search results, such as heBERT for Hebrew text, Roberta-base-Indonesian for Indonesian text, and CAMeLBERT for Arabic text, have been trained on specific languages; you can eliminate these if your data isn't in those languages. Note, however, that such language-specific models typically outperform multilingual ones on their target language.
Clicking on a model name leads to its model card, the documentation page that describes how the model was trained, its reported accuracy, the training dataset and hyperparameters, and the framework versions used.
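Model cards can also be read programmatically with the huggingface_hub library. A brief sketch, using distilroberta-base as an illustrative repo id:

```python
from huggingface_hub import ModelCard

# Substitute the repo id of the model you are evaluating.
card = ModelCard.load("distilroberta-base")

print(card.data)         # structured metadata: license, datasets, tags, etc.
print(card.text[:500])   # the documentation body shown on the model page
```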