Which LLMs Are Currently the Best? Dr. Maik Bunzel of mabucon Puts the AI Market in Perspective


The question sounds simple, but it isn't: Which language model is currently the best? ChatGPT? Claude Opus? Gemini? Grok? DeepSeek? Or perhaps an open-source model like Llama?
Dr. Maik Bunzel, founder of mabucon, has little time for blanket rankings. For him, what matters is not the name of the model, but the specific use case. "An LLM is not a magic bullet. It's a tool. And like any tool, you need to know what you want to use it for," explains Bunzel.
This is precisely where mabucon's work begins. The company develops AI agents that do not merely write texts or answer questions, but understand, plan, and execute business processes. The focus is on Agentic Coding, LLM orchestration, RAG pipelines, Tool-Calling, MCP servers, Guardrails, Evals, and Human-in-the-Loop. In short: AI systems that do not sit alongside a company like a chatbot, but are deeply integrated into its operations.
From model comparisons to real process questions
Many companies ask the wrong first question. They ask: "Which model should we use?" Bunzel would start differently: "Which process is costing you unnecessary time every single day?"
Only the process itself reveals which model actually makes sense. A company that creates quotes from emails and PDF attachments every day requires different capabilities than a company that wants to search internal knowledge bases, automate customer support, or generate up-to-date daily reports.
At mabucon, every project therefore begins with a potential analysis. Workflows are examined in detail, bottlenecks are made visible, and everything is evaluated by effort and impact. Only after that does the focus shift to architecture: which model handles which part? Where are fast responses needed? Where is deep reasoning required? Where does a human need to approve? Where are data protection, traceability, and logging critical?
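The effort-and-impact evaluation described above can be pictured as a simple ranking. The following is only a minimal sketch, not mabucon's actual method: the class `ProcessCandidate`, the two metrics, and all numbers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessCandidate:
    """A workflow under review (all fields are illustrative assumptions)."""
    name: str
    hours_saved_per_week: float  # estimated impact
    effort_days: float           # estimated implementation effort

    @property
    def priority(self) -> float:
        # Higher is better: estimated impact per unit of effort.
        return self.hours_saved_per_week / self.effort_days


def rank_candidates(candidates: list[ProcessCandidate]) -> list[ProcessCandidate]:
    """Return candidates sorted from highest to lowest impact per effort."""
    return sorted(candidates, key=lambda c: c.priority, reverse=True)


# Made-up estimates for the example workflows named in the text.
candidates = [
    ProcessCandidate("quotes from emails and PDFs", hours_saved_per_week=12, effort_days=15),
    ProcessCandidate("internal knowledge search", hours_saved_per_week=6, effort_days=10),
    ProcessCandidate("daily reports", hours_saved_per_week=4, effort_days=3),
]

ranked = rank_candidates(candidates)
for candidate in ranked:
    print(f"{candidate.name}: {candidate.priority:.2f}")
```

In a real analysis the score would likely include further factors such as risk, data protection, and approval requirements; a single ratio is only the simplest possible starting point.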
"The best model is rarely a single model. In practice, the right orchestration almost always wins."
ChatGPT: the powerful all-rounder among AI models
For many users, OpenAI's ChatGPT has become synonymous with modern artificial intelligence. The current GPT models rank among the strongest all-rounders on the market. They excel particularly when tasks are wide-ranging: texts, analyses, strategy, coding, summaries, creative ideas, research preparation, and structured communication.
Bunzel sees ChatGPT as especially strong wherever companies need a versatile AI that can be deployed productively right away. For law firms, consultancies, agencies, and knowledge workers, this is a significant advantage. A model that can structure legal briefs, develop marketing copy, explain spreadsheets, review code, and outline processes delivers immediate value.
From Bunzel's perspective, the weakness lies not in the model's capabilities, but in the temptation to treat ChatGPT as a standalone solution.
"If you simply open a chat and leave employees to their own devices, you don't get process automation. You get better individual work. That's useful — but it's not scaling yet."
- Strengths: highly versatile, excels at writing, analysis, strategy, coding and content production.
- Weaknesses: without clean integration, it often remains isolated, single-task work.
- Typical use cases: knowledge work, law firm workflows, SEO content, internal assistants, marketing and process planning.
ChatGPT is an excellent core model for knowledge work, content production, strategic analysis and internal assistants. In real agent systems, however, it should be combined with company knowledge, clearly defined tool access, approval workflows and evaluation loops.
Claude Opus: strong at language, code and extended reasoning
Anthropic's Claude, particularly the Opus models, is frequently perceived as especially strong at complex writing, coding and longer reasoning processes. Its responses are often elegant, well structured and natural in tone. For large documents, legal analyses, technical specifications and extended chains of argumentation, this is a clear advantage.
Bunzel views Claude as the model for demanding tasks where precision, style and endurance are required.
"Claude is strong when you want to cleanly process long, complex subject matter. Especially for documents, concepts and software projects, this can be extremely valuable."
- Strengths: high-quality language, long-document analysis, coding, structured argumentation.
- Weaknesses: not automatically the best choice for every workflow; availability and integration must be evaluated.
- Typical use cases: code reviews, technical concepts, legal analyses, long-form content and document processing.
For mabucon, Claude is therefore a candidate for demanding document work, code reviews, structured analyses and high-quality text production. In multi-agent systems, Claude can serve as the "thinker," while other models handle fast, routine tasks.
Gemini: strong in the Google ecosystem and at multimodality
Google Gemini plays to its strengths wherever Google services, search, documents, spreadsheets, emails, YouTube and multimodal data are involved. Gemini can be particularly interesting for companies that already rely heavily on Google Workspace, or wherever text, images, video and search converge.
Bunzel describes Gemini as a model with great potential for work environments where information is distributed across many Google-adjacent systems.
"When a company organizes its daily operations in Gmail, Drive, Docs, Sheets and Meet, Gemini becomes strategically relevant. Not just because of the model itself, but because of the ecosystem."
- Strengths: Google integration, search, multimodality, Workspace affinity, processing of various media formats.
- Weaknesses: quality can vary depending on the interface, model variant and integration.
- Typical use cases: Google Workspace, YouTube workflows, multimodal research, internal knowledge search and operational automation.
For very precise legal or highly regulated processes, Bunzel would not deploy Gemini blindly on its own, but always combine it with validation, source verification, and human approvals. In agent architectures, Gemini can be particularly powerful when it comes to searching large information spaces, processing multimodal content, and automating Google-adjacent workflows.
DeepSeek: compelling on cost, technology, and self-hosted deployments
DeepSeek has established itself as a serious contender, primarily due to strong reasoning and coding capabilities at an often attractive cost profile. For technical teams, DeepSeek is interesting when a high volume of model calls is required or when cost per request plays a decisive role.
Bunzel sees an important point for real-world practice here:
"In genuine automation, benchmarks aren't everything. When an agent processes thousands of operations per month, cost, speed, and stability suddenly become strategic factors."
- Strengths: strong cost-to-value ratio, solid technical capabilities, well-suited for high volumes of model calls.
- Weaknesses: data privacy, governance, and trust must be examined with particular care.
- Typical use cases: technical prototypes, cost-sensitive automation, coding tasks, and internal testing.
The governance questions are concrete: companies need to examine carefully where data is processed, which compliance requirements apply, and whether the model is suitable for sensitive information. Especially in law firms, medicine, finance, or when handling internal corporate data, a low-cost model alone is not sufficient.
Mistral: European alternative with enterprise potential
Mistral is particularly interesting for organizations that place greater emphasis on European providers, data privacy, and controllable deployments. The models are capable, the ecosystem is growing, and for many enterprise applications Mistral can be a strategically smart alternative.
Bunzel highlights the advantage of European AI strategies:
"Not every company wants to — or can — bind its core processes entirely to US platforms. Especially when it comes to sensitive data, regulatory requirements, and long-term independence, alternatives deserve to be taken seriously."
- Strengths: EU alignment, enterprise focus, controllable deployments, data privacy perspective.
- Weaknesses: not at the level of the absolute top models in every benchmark.
- Typical use cases: internal assistants, privacy-conscious automation, enterprise AI, and specialized workflows.
Mistral is not the strongest model in every benchmark. But in practice, it is not always about using the most powerful model available. Often, a very good model is sufficient when architecture, data foundation, process understanding, and control are right.
Llama and open-weight models: control over convenience
Meta Llama and other open-weight models are particularly relevant for companies that want maximum control over their AI infrastructure. They can be self-hosted, customized, and embedded into internal systems. This is technically more demanding, but offers strategic independence.
Bunzel does not see open-weight models as a replacement for all cloud models, but rather as an important building block.
"If a company has its own data spaces, internal knowledge systems, or particularly sensitive processes, a self-controlled model can make sense. But one has to be honest: operating it requires expertise."
- Strengths: control, adaptability, data sovereignty, and proprietary infrastructure.
- Weaknesses: hosting, security, monitoring, updates, and evaluation all require significant effort.
- Typical use cases: internal knowledge systems, custom RAG systems, data privacy projects, and specialized enterprise AI.
Control, adaptability, and data sovereignty come at a price: hosting, monitoring, security, updates, evaluation, and fine-tuning all need to be handled professionally.
Grok: strong on real-time, trends, and social media
Grok by xAI is particularly interesting when current debates, social media dynamics, and rapid trend analysis come into play. For companies that work heavily with X, public discourse, memes, or day-to-day sentiment, Grok can be a valuable asset.
Bunzel would not, however, deploy Grok as the first choice for highly precise specialist work.
"For a feel for trends and public debates, Grok can be exciting. For legal, medical, or business-critical processes, you need more control."
- Strengths: real-time awareness, social media dynamics, trend analysis, and public discourse.
- Weaknesses: less suited as a foundation for precise specialist work and regulated business processes.
- Typical use cases: social media monitoring, trend radar, public discourse, content ideas, and rapid market sentiment.
This makes Grok more of a radar than a foundation: strong when it comes to speed, culture, and public discussion; weaker when it comes to reliable expert decisions.
Why mabucon does not believe in a single model
The key point in Bunzel's analysis: the future belongs not to one best model, but to the intelligent combination of several. An agent can use a fast model for classification, a powerful reasoning model for difficult decisions, a cost-efficient model for routine tasks, and a particularly secure model for sensitive data.
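The division of labor between models can be sketched as a small routing function. This is a minimal illustration under stated assumptions: the tier names, the `Task` fields, and the rule that sensitive data always takes precedence are hypothetical and do not describe a specific mabucon implementation or any vendor API.

```python
from dataclasses import dataclass
from enum import Enum


class ModelTier(Enum):
    """Hypothetical model tiers; the values are placeholder labels, not real model names."""
    FAST = "fast-classifier"
    REASONING = "deep-reasoning"
    ECONOMY = "cost-efficient"
    SECURE = "self-hosted-secure"


@dataclass(frozen=True)
class Task:
    kind: str  # e.g. "classification", "decision", "routine"
    contains_sensitive_data: bool = False


def route(task: Task) -> ModelTier:
    """Pick a model tier for a task; sensitivity is checked before anything else."""
    if task.contains_sensitive_data:
        return ModelTier.SECURE
    if task.kind == "classification":
        return ModelTier.FAST
    if task.kind == "decision":
        return ModelTier.REASONING
    # Everything else is treated as routine work.
    return ModelTier.ECONOMY


print(route(Task("classification")).value)
print(route(Task("decision", contains_sensitive_data=True)).value)
```

Checking sensitivity first is a deliberate design choice in this sketch: a difficult decision on confidential data still goes to the secure tier rather than to the strongest model.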
Added to this are RAG pipelines, which retrieve the relevant corporate knowledge for each request. Tool-Calling connects the agent to CRM, ERP, accounting, inboxes, and internal systems. Guardrails set boundaries. Evals verify quality. Human-in-the-Loop ensures that humans make the decisions at critical junctures.
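How Tool-Calling, Guardrails, and Human-in-the-Loop fit together can be shown in a few lines. The sketch below is an assumption-laden illustration, not a real framework: `AgentRuntime`, `ToolCall`, both example tools, and the approval rule are invented for this example, and retrieval (RAG) and evals are left out for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    tool: str
    args: dict


@dataclass
class AgentRuntime:
    """Hypothetical runtime that gates every tool call an agent requests."""
    tools: dict[str, Callable[..., str]]
    needs_approval: set[str]
    approve: Callable[[ToolCall], bool]  # stand-in for a human reviewer
    audit_log: list[str] = field(default_factory=list)

    def execute(self, call: ToolCall) -> str:
        # Guardrail: only explicitly registered tools may run.
        if call.tool not in self.tools:
            self.audit_log.append(f"blocked:{call.tool}")
            return "blocked"
        # Human-in-the-loop: critical tools require an explicit approval.
        if call.tool in self.needs_approval and not self.approve(call):
            self.audit_log.append(f"rejected:{call.tool}")
            return "rejected"
        result = self.tools[call.tool](**call.args)
        # Logging: every executed call leaves a trace for later review.
        self.audit_log.append(f"executed:{call.tool}")
        return result


def lookup_customer(customer_id: str) -> str:
    return f"customer {customer_id} found"


def send_invoice(customer_id: str, amount: float) -> str:
    return f"invoice over {amount:.2f} sent to {customer_id}"


runtime = AgentRuntime(
    tools={"lookup_customer": lookup_customer, "send_invoice": send_invoice},
    needs_approval={"send_invoice"},
    # Simulated reviewer: approves only small invoices (illustrative rule).
    approve=lambda call: call.args["amount"] <= 1000,
)

print(runtime.execute(ToolCall("lookup_customer", {"customer_id": "C1"})))
print(runtime.execute(ToolCall("send_invoice", {"customer_id": "C1", "amount": 5000.0})))
print(runtime.execute(ToolCall("delete_database", {})))
```

In production the `approve` callback would hand the request to an actual person, and the audit log would feed the evals that check whether the agent behaves as intended.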
The difference between a chatbot and a productive AI agent is simple: the chatbot responds. The agent delivers.
Dr. Maik Bunzel: from building a law firm to process intelligence
Bunzel's focus on processes doesn't come from theory. Over the years, he built a nationally operating specialist law firm with offices in Cottbus, Berlin, and Kiel, and handled several thousand client matters. Anyone working in a regulated, document-heavy, and communication-intensive environment quickly learns where time is lost: in capturing, sorting, reviewing, routing, tracking, and documenting.
This experience gave rise to mabucon. The company translates the thinking behind law firm development, legal practice, structure, and scaling into autonomous AI systems for businesses. Precision is not a marketing buzzword here. For Bunzel, it is a craft. An AI agent cannot afford to be only approximately right. It must be transparent, compliant, and verifiable.
The most powerful LLM is only as good as the system behind it
Anyone asking today which LLM is the best won't find a simple answer. ChatGPT is the strong all-rounder. Claude excels at language, code, and long-form analysis. Gemini shines within the Google ecosystem and in multimodality. DeepSeek is compelling in terms of cost and technology. Mistral offers European enterprise perspectives. Llama stands for control and self-managed deployments. Grok is interesting for trends and real-time debates.
But for Dr. Maik Bunzel, that is only the surface. What matters is what businesses build from it. A single model might save a few minutes. A cleanly orchestrated agent can transform entire workflows.
The real question, therefore, is not: which LLM is the best? The better question is: which process in your business should no longer be running manually tomorrow?