How LLMs help us with programming

We were recently asked by a global chemical company if we could give their engineers an introduction to programming with Python and Large Language Models (LLMs). Their expectations of what they wanted to achieve with an LLM in a short space of time were almost limitless.

However, programming with LLMs is often difficult and not very intuitive: it usually takes considerable effort to discover their limitations, and there are hardly any instructions on how best to use them. We have been working with LLMs in programming for two and a half years now, and in the following we pass on some of that experience.

What can you expect from LLMs?

The hype surrounding Artificial General Intelligence (AGI) raises expectations that LLMs cannot fulfil: If you assume that LLMs will realise your software project perfectly without you having to do anything yourself, you will soon be disappointed.

Instead, you can expand your skills with the help of an LLM. Think of an LLM as a pair programming assistant that can look things up at lightning speed, provide relevant examples and complete complex tasks without complaint. However, it will definitely make minor or major mistakes, for example hallucinating a library or method. We make a note of such examples so that we can later check whether newer and larger LLMs provide more useful results for the same task.

Vibe coding

Andrej Karpathy coined the term vibe coding in February 2025:

There’s a new kind of coding I call “vibe coding”, where you fully give in to the vibes, embrace exponentials, and forget that the code even exists … I ask for the dumbest things like “decrease the padding on the sidebar by half” because I’m too lazy to find it. I “Accept All” always, I don’t read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it … Sometimes the LLMs can’t fix a bug so I just work around it or ask for random changes until it goes away. It’s not too bad for throwaway weekend projects, but still quite amusing. [1]

This is a good way to explore the LLMs – and it’s really fun. You’ll also develop a feel for what works and what doesn’t with LLMs.

For our professional work, however, we use a different approach to get robust, maintainable code.

1. Prototyping

Most of our projects start with open questions for a prototype:

  • Is what we have in mind possible?
  • What options are there for realising it?
  • Which of these options are the best?

So we use LLMs as part of this initial research phase.

The next step is then a prototype that proves that the main requirements of the project can be met. We often find that, with an LLM, a Minimum Viable Product (a working prototype) can be created within a few minutes.
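As a hypothetical illustration of such a one-prompt prototype: asked for “a script that plots one column of a CSV file against another”, an LLM will typically return something of roughly this size within seconds (the argument handling is illustrative):

    import sys

    import matplotlib.pyplot as plt
    import pandas as pd

    # Read the CSV file named on the command line and plot its second
    # column against its first.
    df = pd.read_csv(sys.argv[1])
    df.plot(x=df.columns[0], y=df.columns[1])
    plt.title(sys.argv[1])
    plt.show()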

Take the training cut-off date into account

A key feature of any model is its training cut-off date: the date up to which the data it was trained on was collected. For the OpenAI models, this is October 2023; Anthropic and Gemini models usually have more recent data.

The OpenAI models will therefore hardly know anything about uv, whose first commits date from October 2023. The same applies to the extensive changes in version 2 of pandas or NumPy.

LLMs can still help you work with libraries that exist outside of their training data, but you will need to do significantly more work and provide up-to-date examples of how these libraries should be used as part of your prompt.
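As a sketch of what this can look like, assuming the OpenAI Python client (the model name and the pasted uv commands are illustrative), the up-to-date usage examples are simply made part of the prompt:

    from openai import OpenAI

    client = OpenAI()

    # uv postdates the model's training cut-off, so we paste current
    # usage examples directly into the prompt.
    uv_examples = """
    uv venv                      # create a virtual environment in .venv
    uv pip install pandas        # install a package into it
    uv pip compile pyproject.toml -o requirements.txt
    """

    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system",
             "content": "You are a Python packaging assistant. Use only the "
                        "uv commands shown in these examples:\n" + uv_examples},
            {"role": "user",
             "content": "Set up a reproducible environment for a pandas "
                        "project with uv."},
        ],
    )
    print(response.choices[0].message.content)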

2. Specification

Once the initial prototype has been created, we change the mode drastically. For production code, we are much more nitpicky with the LLM: based on our detailed instructions, the LLM writes the code much faster than we could ourselves. LLMs handle such specific instructions very well, and we can focus on the software design.
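A hypothetical instruction at this stage might read: “Write a function clean_sensor_readings(df) that takes a pandas DataFrame with the columns timestamp and value, drops rows with missing values, raises a ValueError for negative values and returns the cleaned DataFrame, with type hints and a Sphinx-style docstring.” The more precisely the design decisions are spelled out, the less the LLM has to guess.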

This brings us to the most important point you need to understand when working with LLMs:

Context is crucial

To achieve good results with an LLM, the biggest effort is to manage the context, in other words the text that is part of your current conversation.

This context is not just limited to the prompt you have fed the LLM with, but encompasses the entire thread of the conversation; starting a new conversation resets this context.

Tools like Cursor and VS Code Copilot automatically pull in context from the current editor session and file layout, and you can sometimes use mechanisms like Cursor’s @ commands to pull in additional files or documentation. Claude Projects, on the other hand, allows you to pre-populate the context with a large amount of text – including a new option to import code directly from a GitHub repository (see Using the GitHub Integration).

One of the reasons we mostly work directly with ChatGPT and Claude is to better understand what goes into the context; LLM tools that hide this context from us are less helpful. This also lets us take advantage of the fact that, in complex programming tasks, previous answers are part of the context:

  1. First, we have a simplified variant written,
  2. then we check whether it works,
  3. and finally we work iteratively towards the more complex implementation.

We often start a new chat by inserting existing code to provide context, and then we work through the changes with the LLM. Or we bring in several complete examples and then ask the LLM to use them as inspiration for a new project.
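To make this concrete: in a minimal sketch, again assuming the OpenAI Python client (the model name is illustrative), the context is nothing more than the list of messages sent with every request, and a new conversation simply starts with an empty list:

    from openai import OpenAI

    client = OpenAI()
    history = []  # the context: every prompt and answer so far

    def ask(prompt: str) -> str:
        history.append({"role": "user", "content": prompt})
        response = client.chat.completions.create(
            model="gpt-4o",        # illustrative model name
            messages=history,      # the entire thread is sent every time
        )
        answer = response.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        return answer

    ask("Write a simplified CSV parser as a single function.")
    ask("Does it handle quoted fields? If not, extend it.")  # sees the first answer

    history.clear()  # a new conversation starts with an empty context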

3. Code review and refactoring

The one thing you must not outsource to the LLM is the testing and code review of what the LLM has delivered. Our job as software engineers is to deliver working systems; if we haven’t seen the code work, we can’t rely on it working at all. This quality assurance is reserved for us humans.

If we don’t like what an LLM has written, we simply have it rework the code: “Use vectorisation with NumPy for the code”. The code an LLM produces the first time is rarely the final implementation, but it can rewrite it dozens of times for you without getting frustrated or bored. If the first attempt does not produce the desired result, this is not a failure, but the beginning of a process of pushing the code in the desired direction.
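As an illustration, a first loop-based draft and its rework after the NumPy instruction might look like this (a sketch with our own example function, not actual LLM output):

    import numpy as np

    # First draft: an explicit Python loop
    def root_mean_square(values):
        total = 0.0
        for v in values:
            total += v * v
        return (total / len(values)) ** 0.5

    # After “Use vectorisation with NumPy for the code”:
    def root_mean_square_np(values):
        a = np.asarray(values, dtype=float)
        return np.sqrt(np.mean(a * a))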

Tools for code execution

A growing number of LLMs for programming now also offer the option of executing the code for you. However, you should be careful with some of them, as the wrong command can cause real damage. Therefore, we are currently sticking with those that execute the code in a sandbox:

ChatGPT Code Interpreter
allows you to write Python code with ChatGPT and then execute it directly in a Kubernetes sandbox VM managed by OpenAI.
Claude Artifacts
can create a complete HTML, JavaScript and CSS web application that is displayed in the Claude interface. This web application is displayed in an iframe sandbox, which severely limits the possibilities, but prevents problems such as accidental disclosure of your private Claude data.
ChatGPT Canvas
is a newer ChatGPT feature with similar capabilities to Claude Artifacts, but we have not yet been able to test it sufficiently.

4. Documentation and tests

Finally, the LLM can also write the docstrings, the Sphinx documentation and the tests for pytest. Good LLMs for programming are excellent at catching possible exceptions, adding accurate documentation and annotating the code with the relevant type hints.
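What we ask for at this stage might look like the following sketch (the function and tests are illustrative examples): a typed function with a Sphinx-style docstring and matching pytest tests:

    import pytest

    def celsius_to_kelvin(celsius: float) -> float:
        """Convert a temperature from degrees Celsius to Kelvin.

        :param celsius: temperature in degrees Celsius
        :raises ValueError: if the temperature is below absolute zero
        :return: temperature in Kelvin
        """
        if celsius < -273.15:
            raise ValueError("temperature below absolute zero")
        return celsius + 273.15

    # In practice the tests live in a separate file, e.g. test_conversion.py
    def test_celsius_to_kelvin():
        assert celsius_to_kelvin(0.0) == 273.15

    def test_below_absolute_zero():
        with pytest.raises(ValueError):
            celsius_to_kelvin(-300.0)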


[1] https://x.com/karpathy/status/1886192184808149383