DeepSeek-R1 and Llama 3.2 - Judge Ratings


Our evaluation framework relies on an experimental judge system that is still being developed and refined. The judge can offer useful insight into the performance of LLMs such as DeepSeek-R1, but it is not yet mature or reliable enough to support definitive judgments about their capabilities. The results presented here should therefore be treated as preliminary and are subject to revision as the judge system evolves.
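
The numbers below are mean ratings assigned by this judge per question category. As a rough illustration only, the sketch below shows what a single LLM-as-judge rating call could look like, assuming the judge is also served locally through Ollama's /api/chat endpoint; the judge model name, prompt wording, and JSON response format are placeholders, not the project's actual implementation.

```python
# Illustrative sketch only: one LLM-as-judge rating call against a local
# Ollama server. The judge model name, prompt wording, and JSON response
# format are placeholders, not the project's actual judge implementation.
import json
import requests

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # default Ollama endpoint


def judge_answer(question: str, answer: str, judge_model: str = "your-judge-model") -> float:
    """Ask the judge model for a numeric total_rating of a candidate answer."""
    prompt = (
        "You are an impartial judge. Rate the answer to the question below and "
        'reply with JSON only, e.g. {"total_rating": 17.5}.\n\n'
        f"Question:\n{question}\n\nAnswer:\n{answer}"
    )
    response = requests.post(
        OLLAMA_CHAT_URL,
        json={
            "model": judge_model,
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=300,
    )
    response.raise_for_status()
    content = response.json()["message"]["content"]
    return float(json.loads(content)["total_rating"])
```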

Mean total judge rating (total_rating_mean) per model and question category:

model             coding  extraction  humanities   math  reasoning  roleplay   stem  writing
deepseek-r1:1.5b   17.50       19.62       14.86  19.30      20.50     18.70  20.00    19.71
deepseek-r1:14b    20.60       19.69       19.27  19.20      19.40     19.50  19.80    18.77
deepseek-r1:32b    18.50       19.87       20.10  18.00      19.10     20.46  20.86    21.77
deepseek-r1:7b     22.33       20.33       19.00  18.00      18.40     21.58  20.00    18.93
deepseek-r1:8b     19.00       21.47       34.40  18.60      18.92     18.40  18.62    19.38
llama3.1           18.62       19.08       20.80  19.33      17.69     20.20  20.60    19.07
llama3.2:1b        19.27       19.87       22.67  18.85      19.87     18.67  21.00    18.40
llama3.2:3b        18.75       20.60       21.53  19.69      18.93     19.13  19.80    20.93
phi4:14b           18.58       20.87       18.46  18.73      19.87     19.93  21.33    20.60
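
The layout of the table above corresponds to a pandas pivot of per-question judge scores. Below is a minimal sketch of that aggregation, assuming the raw scores sit in a DataFrame with one row per judged answer and columns named model, question_category, and total_rating (names taken from the table header; the input file path is a placeholder).

```python
# Minimal sketch: aggregate per-question judge scores into the
# total_rating_mean table shown above. Column names follow the table header;
# the input file path is a placeholder.
import pandas as pd

scores = pd.read_csv("judge_scores.csv")  # one row per judged answer (assumed)

rating_means = scores.pivot_table(
    index="model",                # rows: one per evaluated model
    columns="question_category",  # columns: coding, extraction, humanities, ...
    values="total_rating",
    aggfunc="mean",               # mean total rating per (model, category) cell
).round(2)

print(rating_means.to_string())
```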