The four families that matter
For practical AU production work in 2026, four families cover essentially every workload:
OpenAI GPT (closed-weight, commercial API)
GPT-5 and GPT-4o are the current commercial workhorses. Available via OpenAI directly, Azure OpenAI (with AU-region routing through Australia East), and various aggregators. Strong general capability, particularly on code generation, structured output, and tool use. Mature ecosystem (function calling, JSON mode, vision, embeddings, fine-tuning, batch APIs). The default for "this needs to work, ship now" workloads.
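The structured-output pattern that makes these models shippable looks like this. A minimal sketch using the OpenAI Python client; the model name and extraction schema are placeholder assumptions, and Azure OpenAI exposes the same interface behind a different base URL:

```python
# Minimal JSON-mode extraction sketch. Model name and schema are
# illustrative placeholders; reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # swap for whichever deployment you actually run
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system",
         "content": "Extract invoice fields as JSON with keys: vendor, amount_aud, due_date."},
        {"role": "user",
         "content": "Invoice from Acme Pty Ltd for $4,200 AUD, due 30 June."},
    ],
)
print(response.choices[0].message.content)
```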
Anthropic Claude (closed-weight, commercial API)
Claude Opus 4.7 and Claude Sonnet 4.6 are the current frontier. Available via Anthropic directly, AWS Bedrock (ap-southeast-2 for AU-region inference), and Google Vertex. Particularly strong on long-context reasoning, code refactoring, agentic loops, and following complex multi-step instructions. The default for "this needs to be reliable across a hundred-step plan".
Google Gemini (closed-weight, commercial API)
Gemini 2.5 series via Vertex AI in australia-southeast1. Useful for businesses already on Google Workspace because the integration story is built in. Strong on multimodal input (image, video, audio) and on long context. Less mature ecosystem than OpenAI/Anthropic for tool use and agent patterns, but improving fast.
Open-weight models (Llama, Mistral, GPT-OSS, Gemma, Qwen)
Weights downloaded, deployed inside your own VPC or on-premises. Llama 3.x and 4.x lead on capability. Mistral and Mixtral lead on efficiency. GPT-OSS sits in the strong-default bucket. Gemma covers small-resource deployments. Qwen leads on multilingual work. None match the frontier closed models on the hardest reasoning, but all are sufficient for most operational workloads. Lower cost per token, but real operational cost in GPU capacity and serving operations.
The five questions that decide
Pick your model by walking these five questions in order. They cut the space down fast.
1. Where does the data have to live?
If your data can route to AU-region commercial endpoints, you have all four families available. If it can't (PROTECTED workloads, sovereign requirements, some healthcare and legal contexts), you need open-weight inside your VPC. This single question cuts the candidate list roughly in half for any deployment touching sensitive AU data.
2. How hard is the task?
Most operational workloads (summarisation, retrieval, structured extraction, drafting, classification, routine code generation) are well within the capability of any current frontier model and most strong open-weight models. The "best model" choice rarely matters for these.
Harder workloads (long-horizon agentic plans, complex multi-step reasoning, novel code generation against unfamiliar codebases, hard math, niche domain reasoning) do separate the models. For these, GPT-5 and Claude Opus 4.7 are typically meaningfully ahead of open-weight options. If your workload sits here and sovereignty isn't a hard constraint, the cost premium of a frontier closed model is justified.
Be honest with yourself about which bucket your actual workload is in. Most "we need the best model" conversations are actually "we need any model that works reliably".
3. What's the cost shape?
For workloads with low token volume per query and moderate query rate, commercial APIs are cheaper than running your own GPU instances. The crossover happens around 5-15 million tokens per day depending on model size, hardware choice, and instance utilisation.
Above the crossover, in-VPC open-weight deployment is meaningfully cheaper. We've seen mid-sized Bedstone OS deployments save 60-75% on model API costs by moving from commercial to open-weight, with the operational overhead amortising in under three months.
Below the crossover, commercial APIs are almost always the right call. Operational overhead of running GPU inference doesn't pay back at low volume.
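To make the crossover concrete, a back-of-envelope sketch. Every number below is an assumed placeholder, not a quote; substitute your actual API pricing and instance rates:

```python
# Back-of-envelope crossover estimate. All prices are assumed
# placeholders -- plug in your real API and GPU instance rates.

api_cost_per_mtok = 10.00          # assumed blended $/1M tokens, commercial API
gpu_cost_per_day = 2 * 2.00 * 24   # 2 GPUs at an assumed $2/hr = $96/day

# Daily token volume at which the self-hosted cluster matches the API bill:
crossover = gpu_cost_per_day / api_cost_per_mtok
print(f"crossover ≈ {crossover:.1f}M tokens/day")  # 9.6M/day at these rates

# Idle capacity pushes the crossover up: at 50% utilisation you pay for
# twice the hardware per useful token, so break-even volume doubles.
print(f"at 50% utilisation ≈ {crossover / 0.5:.1f}M tokens/day")
```

At these assumed rates the break-even lands inside the 5-15 million token band above; the spread in that band is mostly hardware pricing and utilisation.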
4. What's the lock-in risk?
Commercial APIs lock you to a vendor's pricing, terms, and capability roadmap. Open-weight gives you portability but locks you to the operational overhead of running your own inference.
For most AU mid-market operators, vendor lock-in is real but manageable: model APIs are increasingly interchangeable at the interface level (OpenAI-compatible APIs are the de facto standard), and a well-designed retrieval and prompt layer can swap models in hours.
The lock-in that actually hurts is at the prompt engineering and tool design level. Models behave differently enough that prompts tuned for one don't always transfer cleanly. Design with model abstraction in mind from day one.
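What "model abstraction from day one" can look like in practice. A minimal sketch; the class and method names are illustrative, not taken from any particular library:

```python
# Minimal model-abstraction layer: application code calls complete(),
# never a vendor SDK directly. Names here are illustrative.
from typing import Protocol

from openai import OpenAI


class ChatModel(Protocol):
    def complete(self, system: str, user: str) -> str: ...


class OpenAICompatModel:
    """Wraps any OpenAI-compatible endpoint: OpenAI, Azure, or vLLM in your VPC."""

    def __init__(self, client: OpenAI, model: str):
        self.client, self.model = client, model

    def complete(self, system: str, user: str) -> str:
        resp = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content


# Swapping providers becomes a config change, not a rewrite:
# model = OpenAICompatModel(OpenAI(base_url="http://inference.internal:8000/v1"),
#                           "llama-3.1-70b-instruct")
```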
5. How fast do you need it to change?
Commercial frontier models improve every quarter. Open-weight catches up about 6-12 months later. If your business depends on absolute frontier capability — fastest reasoning, longest context, best agentic loops — commercial is the only option.
If you can sit one generation behind and still meet your business need, open-weight is cost-competitive and removes vendor risk. Most operational workloads are comfortable one generation behind.
Common AU operator picks by scenario
Internal AI workspace (Bedstone OS-style)
Default: Claude Sonnet 4.6 via Bedrock ap-southeast-2 for general use, with Claude Opus 4.7 routed for the hardest queries. AU-region inference, enterprise terms, strong on the long-context retrieval that internal workspaces depend on.
Sovereign variant: Llama 3.x 70B in your VPC on 2-4 A100 instances. ~70% of the capability of Sonnet 4.6 at most operational tasks; the gap is real but acceptable for most internal-search workloads. An Operate retainer covers ongoing model updates.
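A minimal serving sketch for that sovereign variant, using vLLM's offline API and assuming 4 GPUs; the model ID and sizing are illustrative, so check licence terms and memory budget against your actual hardware:

```python
# In-VPC inference sketch with vLLM, sharding an assumed 70B model
# across 4 GPUs via tensor parallelism. Model ID is illustrative.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # illustrative model ID
    tensor_parallel_size=4,                     # shard across 4 GPUs
)
params = SamplingParams(temperature=0.2, max_tokens=512)

outputs = llm.generate(
    ["Summarise this incident report for the weekly ops digest: ..."],
    params,
)
print(outputs[0].outputs[0].text)
```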
Customer-facing AI agent
Default: GPT-5 via Azure OpenAI Australia East. Best balance of capability, latency, and ecosystem maturity for agent loops. Function calling and JSON mode are battle-tested.
Cost-sensitive variant: GPT-4o-mini or Claude Sonnet 4.6 for most queries, with escalation to GPT-5 or Claude Opus 4.7 for harder ones. Routing logic adds engineering complexity but saves 60-80% on token costs.
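The routing logic can start simple. An illustrative sketch, where the hardness heuristics and model names are placeholders to be tuned against your own traffic:

```python
# Illustrative escalation router: cheap model first, frontier model for
# queries that look hard or after a failed cheap attempt. Heuristics and
# model names are placeholders -- tune them on your own traffic and log
# every routing decision for later evaluation.

HARD_SIGNALS = ("step by step", "multi-file", "refactor", "prove", "plan")

def pick_model(query: str, failed_cheap_attempt: bool = False) -> str:
    if failed_cheap_attempt:
        return "frontier-model"   # e.g. the GPT-5 / Opus tier
    if len(query) > 4000 or any(s in query.lower() for s in HARD_SIGNALS):
        return "frontier-model"
    return "cheap-model"          # e.g. the GPT-4o-mini / Sonnet tier

print(pick_model("Classify this support ticket: printer offline"))  # cheap-model
print(pick_model("Refactor the billing module step by step"))       # frontier-model
```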
Document analysis / legal / contract review
Default: Claude Opus 4.7 or Sonnet 4.6 via Bedrock. Long context, strong reasoning over structured-document content, and the most reliable at following complex instruction sets across long inputs.
Sovereign variant: Llama 4 70B+ in VPC with strong retrieval scaffolding. The retrieval design matters more than the model for document workloads.
Code generation / engineering productivity
Default: Claude Opus 4.7 or GPT-5. The frontier models lead on code quality and the gap to open-weight is meaningful here. Worth paying for if engineering productivity is the business case.
Healthcare / regulated PHI workloads
Default: Llama 3.x 70B or GPT-OSS in your VPC. Most healthcare engagements can't send PHI to commercial APIs even with enterprise terms. The capability gap is acceptable because most healthcare AI workloads (summarisation, structured extraction, retrieval) sit in the easier bucket.
Government / PROTECTED workloads
Default: Open-weight in IRAP-assessed sovereign cloud. Pattern C from the sovereign AI piece. Model selection inside the assessed boundary, typically Llama 3.x or GPT-OSS depending on resource budget.
What not to optimise for
- Benchmark scores. Public benchmarks rarely match the workload you actually have. Evaluate against your own data on your own task.
- The cheapest token price. Token cost is one input, not the only one. Operational cost of running open-weight, opportunity cost of weaker reasoning, and engineering cost of swapping providers all matter.
- The most-talked-about model. By the time a model has hype, it is often only 6-12 months from being superseded as the best choice for production. Stability and tooling maturity matter more than novelty.
- Single-vendor commitment. Even if you start on one provider, design the deployment so you can swap models in hours. The frontier moves; your architecture shouldn't bet on a specific provider winning.
- The "open source AI Australia" narrative. Open-weight is real and useful, but it isn't a magic phrase that solves vendor risk by itself. The operations are real, the cost is real, and the gap to frontier closed models is real.
The integration layer is where the work lives
Across every model-selection conversation we have, the most expensive single decision is not which model to pick. It is how the retrieval, prompt, evaluation, and routing layers are built. A mid-tier model with a well-designed retrieval layer outperforms a frontier model with a poor one, on almost every business workload we've measured.
Concretely: spend more design budget on retrieval scoping, prompt templates, evaluation harnesses, and the routing logic that picks the right model for the right query. The model decision is reversible in hours. The retrieval and evaluation design takes months to rebuild.
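The evaluation harness doesn't need to be elaborate to be useful. A skeleton sketch, where the cases and the containment check are placeholders for your own task data and scoring rule:

```python
# Skeleton evaluation harness: a fixed set of task-specific cases, scored
# per model, re-run on every prompt or model change. Cases and the
# scoring rule are placeholders -- build them from your own data.

CASES = [
    {"input": "Invoice from Acme Pty Ltd for $4,200 AUD, due 30 June", "expect": "Acme"},
    {"input": "PO #881 from Bunnora Pty Ltd, net 14 days", "expect": "Bunnora"},
]

def score(complete, cases) -> float:
    """complete: callable str -> str, e.g. the ChatModel interface above."""
    hits = sum(1 for c in cases if c["expect"] in complete(c["input"]))
    return hits / len(cases)

# Run the same harness against every candidate and compare like for like:
# for name, model in candidates.items():
#     print(name, score(lambda q: model.complete(EXTRACTION_PROMPT, q), CASES))
```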
How we approach the decision in client work
For most Bedstone client engagements, the model decision runs through these steps; a compressed code sketch follows the list:
- Sovereignty filter — does this workload require AU-only inference and what's the classification? Filters out roughly half the option space if sovereign.
- Capability bucket — operational, hard reasoning, or frontier? Most workloads are operational and most models can handle them.
- Cost shape — high-volume or low-volume? Sets the commercial-vs-open-weight decision.
- Default pick — usually Claude Sonnet 4.6 or GPT-5 for commercial deployments, Llama 3.x 70B or GPT-OSS for sovereign ones.
- Design for swap — abstract the model behind a thin interface so the decision can be revisited in three months without rebuilding the deployment.
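Compressed into code, the walk looks roughly like this. The thresholds and model names mirror the prose above and are illustrative defaults, not fixed rules:

```python
# The five-step decision walk as an illustrative default-pick function.
# Thresholds and names mirror the prose above; treat them as defaults.
# Step 5 (design for swap) is architectural, not a branch here.

def default_pick(sovereign: bool, hard_reasoning: bool,
                 mtok_per_day: float) -> str:
    if sovereign:
        return "open-weight in VPC (e.g. Llama 3.x 70B / GPT-OSS)"
    if hard_reasoning:
        return "frontier commercial (e.g. Claude Opus 4.7 / GPT-5)"
    if mtok_per_day > 15:  # above the crossover band from question 3
        return "open-weight in VPC, with commercial escalation for hard queries"
    return "commercial default (e.g. Claude Sonnet 4.6 / GPT-5)"

print(default_pick(sovereign=False, hard_reasoning=False, mtok_per_day=2.0))
# -> commercial default (e.g. Claude Sonnet 4.6 / GPT-5)
```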