We use MT-bench, a set of challenging multi-turn, open-ended questions, to evaluate models. To automate the evaluation process, we prompt strong LLMs such as GPT-4 to act as judges and assess the quality of the models' responses. See the instructions for running MT-bench at fastchat/llm_judge. MT-bench is the new recommended way to benchmark your models.
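To give a sense of how LLM-as-judge grading works, here is a minimal sketch of single-answer grading using the openai Python client. The judge prompt wording and the 1-10 rating scale below are illustrative assumptions; the actual judge prompts and grading pipeline ship with fastchat/llm_judge.

```python
# Minimal sketch of LLM-as-judge scoring, assuming the `openai` Python
# client (>=1.0). The prompt and rating scale are illustrative; the real
# judge prompts live in fastchat/llm_judge.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = """You are an impartial judge. Rate the quality of the
assistant's answer to the user's question on a scale of 1 to 10.
Reply with the rating only.

[Question]
{question}

[Assistant's Answer]
{answer}"""

def judge(question: str, answer: str) -> str:
    """Ask GPT-4 to grade a single model response."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(question=question, answer=answer),
        }],
        temperature=0,  # deterministic grading
    )
    return resp.choices[0].message.content

print(judge("What is the capital of France?", "Paris."))
```

In the full MT-bench setup, each of the two turns in a conversation is graded, and pairwise comparison between two models' answers is also supported.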