Apple researchers have published a study detailing key limitations of large language models (LLMs) from major AI labs such as OpenAI. The study, conducted by scientists at the company and published this month, introduces a new benchmark for evaluating LLMs’ mathematical reasoning. That benchmark has exposed weaknesses in some of the world’s top LLMs, including OpenAI’s GPT-4o and o1 models. Specifically, the paper found that rewording questions or inserting irrelevant phrases could drastically change a model’s results.
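To make the idea concrete, here is a minimal sketch (a hypothetical example, not taken from the study) of the kind of perturbation described: appending a detail that is irrelevant to the arithmetic leaves the correct answer unchanged, even though, per the paper's findings, it can change what an LLM outputs.

```python
# Hypothetical illustration of question perturbation: the distractor
# clause is irrelevant to the arithmetic, so the ground-truth answer
# is the same for both phrasings.

base = ("Oliver picks 44 apples on Friday and 58 apples on Saturday. "
        "How many apples does he have?")

# An irrelevant detail inserted before the actual question.
distractor = "Five of the apples are a bit smaller than average. "
perturbed = base.replace("How many", distractor + "How many")

# The correct answer does not depend on the distractor.
answer = 44 + 58
print(answer)  # 102 for both the base and perturbed question
```

A benchmark built this way can generate many variants of each problem and check whether a model's answers stay consistent across them.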