Many standard benchmarks such as MMLU, BBH, and GSM8K are becoming saturated, with top models scoring above 90%. Once scores cluster that close to the ceiling, the remaining gaps are too small to meaningfully differentiate between models. The developers of the latest generation of models, such as GPT-4o, Claude 3.5 Sonnet, and DeepSeek-V3, each claim superiority in different areas.
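To make the saturation problem concrete, here is a rough sketch of how much statistical noise sits behind a single benchmark score. It assumes a benchmark the size of the GSM8K test split (1,319 problems) and two hypothetical models scoring 95% and 96%. Those scores are illustrative, not measured results, and the helper names `accuracy_interval` and `intervals_overlap` are made up for this example.

```python
import math
from typing import Tuple


def accuracy_interval(accuracy: float, n_items: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation (Wald) interval for an accuracy measured on n_items questions."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_items)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True if two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_ITEMS = 1319  # size of the GSM8K test split

# Hypothetical scores, chosen only to illustrate the point.
model_a = accuracy_interval(0.95, N_ITEMS)
model_b = accuracy_interval(0.96, N_ITEMS)

print(f"model A: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B: {model_b[0]:.3f} to {model_b[1]:.3f}")
print("intervals overlap:", intervals_overlap(model_a, model_b))
```

With these assumed numbers, each interval is roughly plus or minus one percentage point wide, so a one-point lead on a benchmark of this size is not strong evidence on its own. Overlapping intervals are only a rough heuristic; a paired comparison on the same questions would be a more careful check.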