LLM evaluation for SAP coding

I came across a good question from Stephan Heinberg 🔗. He asks: what’s better for coding? Joule, or one of the alternative AIs you can use for coding, like the ones from AWS 🔗, Google 🔗, Microsoft 🔗, or any other alternative?

Which LLM is good at solving SAP problems?

That is a very good question.

The answer, however, is almost impossible to give. First, the quality of the LLMs changes over time. These LLMs are constantly being optimized. If you use Google today and it performs better than anything else you compared, next month another LLM might be better. When Google first entered the space, the AI-generated code was good for the “normal” programming languages like TypeScript or Java and for the usual frameworks. The generated ABAP code, however, lagged behind other offerings. Today, Google is able to produce good enough ABAP code. Microsoft is still returning code like sy-udate-month. Of course, this will change again over time.

Benchmarks

To know whether a given LLM produces good results, benchmarking is needed. This is common practice for LLMs, and several websites publish the results: 1 🔗, 2 🔗, 2 🔗, 3 🔗, 4 🔗, 5 🔗. Benchmarks are run by letting the LLM solve problems and then validating the solutions. The dataset for BigCodeBench 🔗 shows that each task consists of a prompt describing what to do and a test to validate the result. These kinds of benchmarks might also try to solve GitHub issues or run those tests against several programming languages.
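To make the prompt-plus-test idea concrete, here is a minimal sketch in Python. It is not how BigCodeBench or any other benchmark is actually implemented. The names BenchmarkTask and run_task, the demo task and the two "model answers" are all made up for illustration, and the answers are hand-written strings standing in for LLM output.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class BenchmarkTask:
    """One benchmark item: a prompt for the LLM plus a test for its answer."""
    task_id: str
    prompt: str
    test: Callable[[Dict[str, object]], bool]


def run_task(task: BenchmarkTask, generated_code: str) -> bool:
    """Execute the generated code and apply the task's test to the result.

    NOTE: exec() on untrusted model output is unsafe. A real harness would
    run this in a sandbox; it is used here only to keep the sketch short.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(generated_code, namespace)
        return bool(task.test(namespace))
    except Exception:
        # Syntax errors, missing functions and crashing tests count as "not solved".
        return False


# Hypothetical task and two hand-written answers standing in for LLM output.
task = BenchmarkTask(
    task_id="demo/add",
    prompt="Write a function add(a, b) that returns the sum of a and b.",
    test=lambda ns: ns["add"](2, 3) == 5 and ns["add"](-1, 1) == 0,
)

good_answer = "def add(a, b):\n    return a + b\n"
bad_answer = "def add(a, b):\n    return a - b\n"

print(run_task(task, good_answer))
print(run_task(task, bad_answer))
```

The point of the design is that the test, not a human reader, decides whether the answer counts as solved. That is what makes results comparable across LLMs and repeatable over time.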

This approach works well enough for the “normal” developer to find out whether a given LLM is good at solving problems in a given programming language. The problem is that the validation is done using popular languages like Python, JavaScript, C, Java, etc. You won’t find ABAP there. There is a benchmark for ABAP for LLMs 🔗. The researchers 🔗 created a dataset for ABAP 🔗. It is good for testing ABAP coding capability: can the LLM generate valid ABAP code / classes? It does not, however, cover questions like: can it refactor a report? Can it create an ABAP Unit test from a requirements document? And it is “limited” to ABAP. You also won’t find complex problems that demand a working UI5, Fiori Elements, CAP or RAP app, or that include other SAP solutions like API Hub, SAC, anything BTP, or Fiori app extensibility. This is a gap the SAP user groups should invest in.

Customer-defined SAP benchmark for LLMs

Create a customer-focused LLM benchmark for SAP

A common set of tasks should be curated and used to evaluate LLMs for SAP coding tasks. For each task, the provided solution must be evaluated against one (or more?) sample solutions. The challenges to be solved could include topics like:

code analysis
unit test generation
code optimization
refactor code
explanation of features
migration of legacy code, reports, interfaces to modern SAP
generate modern apps with RAP, CAP, Fiori Elements
write technical documentation and architectural diagrams
analyze functional requirements
implement solutions, on premise, hybrid and cloud
extend apps
…

The list can easily be extended with more tasks. Clustering those tasks into topic areas makes it easier to identify the sweet spot of an LLM. One might be better at writing unit tests or explaining code, another better at writing a Fiori Elements app with RAP, or at providing the correct architecture for a BTP-based solution.
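As a rough sketch of what such clustering could look like in practice, the Python snippet below computes a pass rate per model and topic area and then picks the best model per area. Everything in it is an assumption for illustration: the function names, the model names "model_a" and "model_b", the two topic areas and the pass/fail values are invented, not real measurements.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Result = Tuple[str, str, bool]  # (model, topic area, task passed?)


def pass_rate_by_area(results: Iterable[Result]) -> Dict[Tuple[str, str], float]:
    """Share of passed tasks for every (model, topic area) pair."""
    totals = defaultdict(lambda: [0, 0])  # [passed, attempted]
    for model, area, passed in results:
        bucket = totals[(model, area)]
        bucket[0] += int(passed)
        bucket[1] += 1
    return {key: p / n for key, (p, n) in totals.items()}


def best_model_per_area(rates: Dict[Tuple[str, str], float]) -> Dict[str, Tuple[str, float]]:
    """For each topic area, the model with the highest pass rate."""
    best: Dict[str, Tuple[str, float]] = {}
    for (model, area), rate in rates.items():
        if area not in best or rate > best[area][1]:
            best[area] = (model, rate)
    return best


# Invented example data, NOT real benchmark results.
results = [
    ("model_a", "unit tests", True),
    ("model_a", "unit tests", True),
    ("model_a", "RAP apps", False),
    ("model_a", "RAP apps", True),
    ("model_b", "unit tests", True),
    ("model_b", "unit tests", False),
    ("model_b", "RAP apps", True),
    ("model_b", "RAP apps", True),
]

rates = pass_rate_by_area(results)
best = best_model_per_area(rates)
for area, (model, rate) in best.items():
    print(f"{area}: {model} ({rate:.0%})")
```

With the invented data, one model leads on unit tests and the other on RAP apps. A single overall score would hide exactly that kind of difference.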

Since solutions in the SAP world can get very complex very fast, this test should serve as the gold standard for any LLM that wants to enter the SAP universe. It can help identify which LLM currently performs best in which area, or maybe across all topics.

This is too big a task for one person. Just putting together a large enough sample base is a challenge, not to mention evaluating the results, even with automated tests. To know that the AI-generated solution works, someone first needs to come up with the solution and implement it. Good news: there are many SAP user groups around the world, representing a large number of SAP customers. It should be easy for them to come up with something like a common test scenario to benchmark LLMs on how well they solve SAP-specific problems. Nice side effect: SAP’s Joule must also demonstrate its capabilities. And with this, customers will get an objective way to evaluate and choose the LLM(s) that work best for them.