DeepSeek-R1 and Llama3.2 - Judge ratings
Our evaluation framework relies on an experimental LLM-as-a-judge system that is still being developed and refined. While it can provide useful insight into the relative performance of models like DeepSeek-R1, it is not yet mature or reliable enough to yield definitive judgments about their capabilities. The results below should therefore be considered preliminary and subject to revision as the judge system evolves.
Mean total judge rating (total_rating_mean) per model and question category:

| model | coding | extraction | humanities | math | reasoning | roleplay | stem | writing |
|---|---|---|---|---|---|---|---|---|
| deepseek-r1:1.5b | 17.50 | 19.62 | 14.86 | 19.30 | 20.50 | 18.70 | 20.00 | 19.71 |
| deepseek-r1:14b | 20.60 | 19.69 | 19.27 | 19.20 | 19.40 | 19.50 | 19.80 | 18.77 |
| deepseek-r1:32b | 18.50 | 19.87 | 20.10 | 18.00 | 19.10 | 20.46 | 20.86 | 21.77 |
| deepseek-r1:7b | 22.33 | 20.33 | 19.00 | 18.00 | 18.40 | 21.58 | 20.00 | 18.93 |
| deepseek-r1:8b | 19.00 | 21.47 | 34.40 | 18.60 | 18.92 | 18.40 | 18.62 | 19.38 |
| llama3.1 | 18.62 | 19.08 | 20.80 | 19.33 | 17.69 | 20.20 | 20.60 | 19.07 |
| llama3.2:1b | 19.27 | 19.87 | 22.67 | 18.85 | 19.87 | 18.67 | 21.00 | 18.40 |
| llama3.2:3b | 18.75 | 20.60 | 21.53 | 19.69 | 18.93 | 19.13 | 19.80 | 20.93 |
| phi4:14b | 18.58 | 20.87 | 18.46 | 18.73 | 19.87 | 19.93 | 21.33 | 20.60 |
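A table like the one above can be produced by aggregating per-question judge scores into a model × category grid. The sketch below shows one way to do this with a pandas pivot table; the column names (`model`, `question_category`, `total_rating`) and the sample scores are assumptions for illustration, not the actual evaluation schema or data used here.

```python
import pandas as pd

# Hypothetical per-question judge scores (schema and values are
# illustrative assumptions, not the real evaluation data).
ratings = pd.DataFrame({
    "model": ["deepseek-r1:7b", "deepseek-r1:7b",
              "llama3.2:3b", "llama3.2:3b"],
    "question_category": ["coding", "math", "coding", "math"],
    "total_rating": [22.0, 18.0, 19.0, 20.0],
})

# Mean judge rating per model and question category,
# mirroring the total_rating_mean layout shown above.
summary = ratings.pivot_table(
    index="model",
    columns="question_category",
    values="total_rating",
    aggfunc="mean",
)
print(summary.round(2))
```

With multiple questions per cell, `aggfunc="mean"` averages all judge scores a model received in that category, which is what a `total_rating_mean` column suggests.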