DeepSeek-R1 and Llama 3.2 - Judge Ratings


Our evaluation framework relies on an experimental judge system that is still being developed and refined. The judge can offer useful insight into the performance of LLMs such as DeepSeek-R1, but it is not yet mature or reliable enough to support definitive judgments about their capabilities. The results presented here should therefore be treated as preliminary and are subject to revision as the judge system evolves.
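
The numbers below are mean ratings assigned by this judge per question category. As a rough illustration only, the sketch below shows what a single LLM-as-judge rating call could look like, assuming the judge is also served locally through Ollama's /api/chat endpoint; the judge model name, prompt wording, and JSON response format are placeholders, not the project's actual implementation.

```python
# Illustrative sketch only: one LLM-as-judge rating call against a local
# Ollama server. The judge model name, prompt wording, and JSON response
# format are placeholders, not the project's actual judge implementation.
import json
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint


def judge_answer(question: str, answer: str, judge_model: str = "your-judge-model") -> float:
    """Ask the judge model for a numeric total_rating of a candidate answer."""
    prompt = (
        "You are an impartial judge. Rate the answer to the question below and "
        'reply with JSON only, e.g. {"total_rating": 17.5}.\n\n'
        f"Question:\n{question}\n\nAnswer:\n{answer}"
    )
    response = requests.post(
        OLLAMA_CHAT_URL,
        json={
            "model": judge_model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=300,
    )
    response.raise_for_status()
    content = response.json()["message"]["content"]
    return float(json.loads(content)["total_rating"])
```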

Mean total judge rating (total_rating_mean) per model and question category:

model             coding  extraction  humanities   math  reasoning  roleplay   stem  writing
deepseek-r1:1.5b   17.50       19.62       14.86  19.30      20.50     18.70  20.00    19.71
deepseek-r1:14b    20.60       19.69       19.27  19.20      19.40     19.50  19.80    18.77
deepseek-r1:32b    18.50       19.87       20.10  18.00      19.10     20.46  20.86    21.77
deepseek-r1:7b     22.33       20.33       19.00  18.00      18.40     21.58  20.00    18.93
deepseek-r1:8b     19.00       21.47       34.40  18.60      18.92     18.40  18.62    19.38
llama3.1           18.62       19.08       20.80  19.33      17.69     20.20  20.60    19.07
llama3.2:1b        19.27       19.87       22.67  18.85      19.87     18.67  21.00    18.40
llama3.2:3b        18.75       20.60       21.53  19.69      18.93     19.13  19.80    20.93
phi4:14b           18.58       20.87       18.46  18.73      19.87     19.93  21.33    20.60
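
The layout of the table above corresponds to a pandas pivot of per-question judge scores. Below is a minimal sketch of that aggregation, assuming the raw scores sit in a DataFrame with one row per judged answer and columns named model, question_category, and total_rating (names taken from the table header; the input file path is a placeholder).

```python
# Minimal sketch: aggregate per-question judge scores into the
# total_rating_mean table shown above. Column names follow the table header;
# the input file path is a placeholder.
import pandas as pd

scores = pd.read_csv("judge_scores.csv")  # one row per judged answer (assumed)

rating_means = scores.pivot_table(
    index="model",                # rows: one per evaluated model
    columns="question_category",  # columns: coding, extraction, humanities, ...
    values="total_rating",
    aggfunc="mean",               # mean total rating per (model, category) cell
).round(2)

print(rating_means.to_string())
```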