The data on LLM API pricing comes from both Artificial Analysis and our own database of API prices. Our database contributed prices for GPT-3, GPT-3.5, and Llama 2 models, because prices for those models were not publicly available from Artificial Analysis. These prices made up 9 of the 36 unique observations shown in this data insight. If the price of an LLM changed after its initial release, we recorded the change as a new observation with the same model name but a new price and “release” date. Where prices differed by maximum allowed context length, we included only the price for the shortest context.
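As a rough illustration of these rules, here is a minimal sketch in pandas; the column names, model names, dates, and prices are hypothetical placeholders rather than our actual schema or data:

```python
import pandas as pd

# Hypothetical observations: one model whose price changed after release and
# which was priced differently for a longer context window.
prices = pd.DataFrame([
    {"model": "example-model", "price_date": "2023-01-01", "context_length": 4_096, "usd_per_million_tokens": 2.0},
    # A later price change is kept as a separate observation: same model name,
    # but a new price and "release" date.
    {"model": "example-model", "price_date": "2023-06-01", "context_length": 4_096, "usd_per_million_tokens": 1.0},
    # A longer-context variant of the same model, priced higher.
    {"model": "example-model", "price_date": "2023-06-01", "context_length": 16_384, "usd_per_million_tokens": 2.0},
])

# Keep only the shortest-context price for each (model, date) observation.
shortest_context = prices.loc[
    prices.groupby(["model", "price_date"])["context_length"].idxmin()
]
```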
We aggregated prices over different LLM API providers and token types according to Artificial Analysis’ methodology, which is as follows. All prices were a 3:1 weighted average of input and output token prices. If the LLM had a first-party API (e.g. OpenAI for o1), we used the prices from that API. If a first-party API was not available (e.g. Meta’s Llama models), we used the median of prices across providers.
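A minimal sketch of this aggregation rule (the function, argument, and provider names are illustrative, and we assume the weight of 3 applies to the input-token price):

```python
from statistics import median


def blended_price(input_price: float, output_price: float) -> float:
    """3:1 weighted average of input and output token prices."""
    return (3 * input_price + output_price) / 4


def aggregated_price(provider_prices: dict[str, float], first_party: str | None = None) -> float:
    """Use the first-party API's blended price if one exists; otherwise take
    the median of blended prices across providers."""
    if first_party is not None and first_party in provider_prices:
        return provider_prices[first_party]
    return median(provider_prices.values())


# Example: a model with no first-party API, offered by three providers.
print(aggregated_price({"provider-a": 0.5, "provider-b": 0.7, "provider-c": 0.9}))
```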
We excluded reasoning models from our analysis of per-token prices. Reasoning models tend to generate many more tokens than other models, so they cost more in total to evaluate on a benchmark. This makes price-per-token comparisons between reasoning models and other models misleading at a given performance level.
The benchmark data covers six benchmarks: GPQA Diamond (PhD-level science questions), MMLU (general knowledge), MATH-500 (math problems), MATH level 5 (advanced math problems), HumanEval (coding), and Chatbot Arena Elo (head-to-head comparisons of chatbots, judged by humans). Like the price data, the benchmark data is sourced from both Artificial Analysis and our benchmark database. In the combined dataset, most of the benchmark scores are as reported by Artificial Analysis. The methodology used to obtain those scores is described by Artificial Analysis (except that we use MMLU rather than MMLU-Pro).
We used data from our own evaluations for MATH level 5 (a benchmark not reported by Artificial Analysis) and for cases where Artificial Analysis did not have scores for GPQA Diamond. For other benchmarks where Artificial Analysis had missing scores, we used scores from Papers with Code or scores reported by the model’s developer. Our methodology for the MATH level 5 and GPQA Diamond evaluations is described here. For GPT-4-0314, direct evaluation on GPQA Diamond was unavailable. However, GPT-4-0613 performed similarly to GPT-4-0314 across many other benchmarks (see here, p.5), so we assumed that GPT-4-0314 had the same GPQA Diamond score as GPT-4-0613.
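A simplified sketch of this score-sourcing order, including the GPT-4-0314 special case; the source dictionaries, keys, and the example value are hypothetical placeholders rather than our actual pipeline or data:

```python
# Each source maps (model, benchmark) -> score.
artificial_analysis: dict[tuple[str, str], float] = {}
our_evaluations: dict[tuple[str, str], float] = {}
papers_with_code: dict[tuple[str, str], float] = {}
developer_reported: dict[tuple[str, str], float] = {}


def benchmark_score(model: str, benchmark: str) -> float | None:
    """Prefer Artificial Analysis, then our own evaluations, then Papers with
    Code, then developer-reported scores; return None if no source has one."""
    for source in (artificial_analysis, our_evaluations, papers_with_code, developer_reported):
        score = source.get((model, benchmark))
        if score is not None:
            return score
    return None


# Special case: GPT-4-0314 inherits GPT-4-0613's GPQA Diamond score
# (0.42 is a placeholder value, not the actual score).
artificial_analysis[("gpt-4-0613", "GPQA Diamond")] = 0.42
our_evaluations[("gpt-4-0314", "GPQA Diamond")] = benchmark_score("gpt-4-0613", "GPQA Diamond")
```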