Training your LLM with SAP

Published by Tobias Hofmann on

13 min read

The latest SAP DEVELOPER LICENSE AGREEMENT 3.2 makes it unnecessarily hard to use software released under it together with AI. Software released under the license cannot be used to train or refine an AI model. As the license states: “You are expressly prohibited from using the Software, Tools or APIs as well as any Customer Applications or any part thereof for the purpose of training (developing) artificial intelligence models or systems (“AI Training”).”

Any software developed using anything under this license, most famously any CAP 9.x based app, is now off limits for AI usage that has something to do with training, refinement or model development. But what about other SAP resources? What if you want to use them to train or refine your AI model? What about SAP Help or the SAP homepage as input for an AI model? And what about the common scenario of asking your AI a question and having an AI agent search the internet, for example SAP Help, for the answer?

Crawlers

Officially, OpenAI follows robots.txt rules. Therefore, it makes sense to investigate the SAP Help robots.txt file to find out if SAP is blocking OpenAI or other AI tools. ChatGPT-User is the bot that visits a web page when a user asks ChatGPT to find out more information. GPTBot is the crawler that collects content to add to the AI model. The token Google uses for its AI models is Google-Extended.
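
How a crawler finds "its" rules can be sketched in a few lines of Python. This is a minimal sketch and not how any particular bot is implemented: the helper names parse_groups and rules_for are made up for illustration, SAMPLE is an abbreviated copy of the SAP Help robots.txt quoted in this post, and user agents are matched by exact, case-insensitive name only (real crawlers are more lenient).

```python
# Abbreviated copy of the SAP Help robots.txt discussed in this post.
SAMPLE = """\
User-agent: *
Disallow: /

User-agent: GPTBot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: ChatGPT-User
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$
"""

def parse_groups(text):
    """Return {lowercased user-agent: [(directive, pattern), ...]}."""
    groups = {}
    agents = []        # user-agent names of the group currently being read
    seen_rule = False  # True once the current group has at least one rule
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.lower()
        if key == "user-agent":
            if seen_rule:  # a rule was already read, so a new group starts
                agents, seen_rule = [], False
            agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif key in ("allow", "disallow"):
            seen_rule = True
            for agent in agents:
                groups[agent].append((key, value))
    return groups

def rules_for(groups, agent):
    """Use the group named after the agent, otherwise fall back to '*'."""
    return groups.get(agent.lower(), groups.get("*", []))

groups = parse_groups(SAMPLE)
print(rules_for(groups, "GPTBot"))
print(rules_for(groups, "SomeOtherBot"))  # hypothetical bot without its own group
```

A bot without its own entry ends up in the `*` group, which in this file means Disallow: / and therefore no access at all.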

The robots.txt file from SAP Help looks like this:

# robots.txt for https://help.sap.com
#
User-agent: *
Disallow: /

# Search Engines
User-agent: Googlebot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: Bingbot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

# AI and Chat Agents
User-agent: Google-Extended
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: GPTBot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: ChatGPT-User
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

Sitemap: https://help.sap.com/http.svc/sitemapxml/sitemaps/sitemap_index.xml

The rules follow the same schema:

Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

This is good: everything is allowed, except paths matching /*/frameset.htm and every file ending in .htm in the folder /doc. Frameset pages are SAP Help pages with the old design, so these are older pages.
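
The way these three rules combine can be illustrated with a small Python sketch. It assumes the commonly documented matching semantics: `*` matches any sequence of characters, a trailing `$` anchors the end of the URL, the longest matching pattern wins and Allow wins a tie. As far as I know, Python's built-in urllib.robotparser only does prefix matching, so the wildcard handling is hand-rolled here. The function names and the frameset path are invented for illustration.

```python
import re

def to_regex(pattern):
    """Translate a robots.txt pattern: '*' = any characters, trailing '$' = end of URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_allowed(path, rules):
    """The longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if pattern and to_regex(pattern).match(path):
            is_allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed

# The schema SAP Help uses for every named bot.
RULES = [
    ("allow", "/"),
    ("disallow", "/*/frameset.htm"),
    ("disallow", "/doc/*.htm$"),
]

for path in (
    "/docs/SAP_ERP",                                          # product documentation
    "/doc/abapdocu_latest_index_htm/latest/en-US/index.htm",  # ends in .htm under /doc
    "/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENABAP.html",   # ends in .html under /doc
    "/some/old/frameset.htm",                                 # hypothetical frameset page
):
    print(path, "->", "allowed" if is_allowed(path, RULES) else "blocked")
```

Only the second and the fourth path come back as blocked. The .html page under /doc is allowed because the `$` requires the URL to end exactly in .htm.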

But the files in the doc folder? This looks strange. Why .htm and not also .html files? The doc folder is not where the normal SAP Help documentation pages live. Those are stored under docs, not doc, and follow a different schema: https://help.sap.com/docs/<product>/<id>.html

SAP Documentation

The typical URL of an SAP product's documentation is /docs/<PRODUCT NAME>/… For SAP S/4HANA Cloud Private Edition it is /docs/SAP_S4HANA_CLOUD_PE

Other examples:

  • SAP ERP: /docs/SAP_ERP
  • SAP HANA Cloud: /docs/hana-cloud
  • SAP Business ByDesign: /docs/SAP_BUSINESS_BYDESIGN
  • SAP S/4HANA Cloud Public Edition: /docs/SAP_S4HANA_CLOUD

As the folder is named docs and not doc, AI bots are allowed to crawl these pages. You can use the product documentation as input for your AI project. But when all pages start with docs, why disallow doc?

Doc folder

A Google search for site:https://help.sap.com/doc/ suggests that PDFs are stored in the doc folder.

Refining the search to look for HTML files shows that there are also SAP Help documentation pages under the folder doc.

These are older pages as you can see from the design. It makes sense to disallow them for AI, as the frameset.htm sites with the same older design are also blocked.

But is this all that is stored under doc? What else can you find there? Again, an internet search helps to find content.

JavaDoc

JavaDoc is hosted in the path /doc/. The page itself ends in .html and is therefore not blocked.

SAP Mobile Services

The SDK documentation for the SAP Mobile Services SDK for

These pages are under doc, but end in .html and not .htm, so they are not blocked. If you want an AI to learn how to write code, having access to the SDK and API information is useful. When thinking about coding, AI and SAP, ABAP is the top priority. Can the AI access the ABAP documentation and use it to learn?

ABAP Documentation

I still fail to find the ABAP keyword documentation as an easy-to-click link on the SAP Help homepage. I must search for it or go through the documentation. Here are the links to the documentation for cloud, on premise and private cloud.

ABAP BTP

For ABAP in BTP, the SAP Help page referring to the ABAP Keyword Documentation is here. The ABAP Keyword Documentation for ABAP Cloud: https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENABAP.html

S/4HANA on premise

For ABAP programming in an on-premise S/4HANA system, SAP Help points to: https://help.sap.com/doc/abapdocu_latest_index_htm/latest/en-US/index.htm

S/4HANA private cloud

SAP S/4HANA Private Cloud: https://help.sap.com/doc/abapdocu_latest_index_htm/latest/en-US/index.htm

The ABAP keyword documentation is stored under the folder doc: /doc/abapdocu… The pages end in .htm, with the exception of the latest ABAP Cloud documentation, whose pages end in .html.

You shall not pass

The folder doc is marked as disallowed for AI bots in robots.txt.

Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

AI bots are not allowed to access pages under the folder doc whose URL ends in .htm. Looking at the file names of the ABAP docu, the scheme used to forbid access now makes sense. The ABAP docu uses .htm and not .html, except for the latest version, the one that uses UI5, which ends in .html. The rule appears to be made specifically to keep AI away from the non-Cloud ABAP documentation.
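
For URLs without a query string, the rule /doc/*.htm$ boils down to "path starts with /doc/ and ends with .htm". A minimal Python sketch, with a made-up function name, applied to the ABAP documentation links listed above:

```python
from urllib.parse import urlsplit

def blocked_by_doc_htm_rule(url):
    """True if the URL path lies under /doc/ and ends in .htm (pattern /doc/*.htm$).

    Simplification: only the path is checked, a query string is ignored.
    """
    path = urlsplit(url).path
    return path.startswith("/doc/") and path.endswith(".htm")

ABAP_DOCS = {
    "ABAP Cloud": "https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENABAP.html",
    "S/4HANA on premise / private cloud": "https://help.sap.com/doc/abapdocu_latest_index_htm/latest/en-US/index.htm",
}

for name, url in ABAP_DOCS.items():
    print(name, "->", "blocked" if blocked_by_doc_htm_rule(url) else "allowed")
# ABAP Cloud -> allowed
# S/4HANA on premise / private cloud -> blocked
```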

This makes it harder to use AI with the ABAP information. If you want to get information about ABAP features, SAP Help might be out of reach for your release. If you are chatting with your AI and want it to look up information about CDS table entities for your S/4HANA on-premise or private edition release, the robots.txt rule tells the chatbot that the site should not be accessed. An alternative is to check the documentation available in your SAP system, in case SAP, or your contract, allows AI to access it. You can rely on books or other sources, but these might not be as up to date as the official documentation.

You talking to me?

Let’s try this out using ChatGPT. Let’s ask it some questions and see if the ABAP documentation is accessed or not.

Answer

The latest ABAP documentation is given as 7.58 from October 2023 and the sources used are Wikipedia or GitHub. Let’s try to find more about a recent feature like CDS table entity.

The bot searches the internet and accesses some SAP pages.

Answer

The sources contain SAP pages. Let’s take a look at those.

The first two entries point to the ABAP Keyword Documentation:

The links are:

The links go to the latest version of the ABAP Cloud keyword documentation.

The chat bot searches the internet, goes to SAP Help and the ABAP Keyword Documentation and accesses the latest version.

ChatGPT offers to give the direct deep links to the official examples.

The referenced links go to the ABAP Keyword documentation.

It goes further and asks if it should get the code block examples directly from the SAP page.

Let’s ask ChatGPT to get the source code from the site.

Answer

ChatGPT does respect the robots.txt rule. The content is not crawled. Access to the latest version is allowed, but not to earlier versions. When ChatGPT tried to get the information directly from the ABAP keyword documentation, where the site restricts access, it followed the robots.txt rule.

You can try this out by asking ChatGPT to “go to the abap keyword documentation for the on premise release 2023 and look up an example for CDS view entity”. The sources used cover a wide range, like developers, learning and product documentation. In my case, the sources also contained a reference to the ABAP docu:

https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/index.htm?file=abencds_define_view_entity.htm&utm_source=chatgpt.com

This is in the blocked list: doc and htm. Was it crawled by OpenAI before SAP blocked access, or was it referenced by another source? I have no idea. But this shows that, for the most part, the results you get back do not depend on the ABAP docu.
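
One detail about this URL can be checked with a short Python sketch. It assumes that robots.txt patterns are matched against the path plus the query string, and that utm_source=chatgpt.com is a tracking parameter ChatGPT appends to outgoing links, so it is not part of the address a crawler would have requested. The function name robots_target is made up for illustration.

```python
import re
from urllib.parse import urlsplit, parse_qsl, urlencode

DOC_HTM = re.compile(r"/doc/.*\.htm$")  # regex form of the pattern /doc/*.htm$

def robots_target(url, drop_tracking=True):
    """Return path plus query string, optionally without utm_* parameters."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if drop_tracking:
        pairs = [(key, value) for key, value in pairs if not key.startswith("utm_")]
    query = urlencode(pairs)
    return parts.path + ("?" + query if query else "")

URL = ("https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/index.htm"
       "?file=abencds_define_view_entity.htm&utm_source=chatgpt.com")

# Tracking parameter removed: the target ends in .htm, so the rule matches.
print(bool(DOC_HTM.match(robots_target(URL))))                       # True
# URL exactly as cited: it ends in chatgpt.com, so the '$' rule does not match.
print(bool(DOC_HTM.match(robots_target(URL, drop_tracking=False))))  # False
```

Under these assumptions, the address without the tracking parameter falls under the Disallow rule.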

What now

The rules defined by the site owner are not legally binding (says AI, consult your lawyer). Their intention is to indicate what to crawl and what not. If a third-party site links to a URL marked as disallowed, crawlers are going to index it anyway. The rules are a way of nicely asking bots to behave. It is impolite to ignore them, but there is not much a site owner can do about it. However, the rules can be enforced by power. I would not ignore them, especially when you depend on doing business with SAP and you sell a product trained with forbidden resources. This might get you into some discussions. Just know what you are doing.

If you want to train your LLM or use it with SAP resources: follow the robots.txt rules. SAP Help is (still) very open, as is sap.com. As long as SAP is not restricting the usage of their resources, the information available there combined with other resources should be good enough. The CAP documentation is open, and so is the SAPUI5 documentation. The Fiori design is now hosted under sap.com, so it can be used by AI. The Fiori App Library does not have a robots.txt file. There are many LLMs out there that know how to code in Java, JavaScript and TypeScript and how to do UX design. These do not know very well what ABAP is. I guess SAP wants developers to use/buy Joule for ABAP development. What you should not use for your AI is the blocked versions of the ABAP documentation. ABAP is SAP’s proprietary language and it makes sense for SAP to control it. That is annoying for everyone else. I guess the ABAP resources available outside the ABAP Keyword Documentation are more than enough and good enough to train an LLM to be helpful for any ABAP developer. It’s just a matter of time. It will be interesting to see if SAP will try to limit access to ABAP for AI stuff that is not from SAP. I guess a stricter robots.txt rule is not what SAP will have in mind.

Tobias Hofmann

Doing stuff with SAP since 1998. Open, web, UX, cloud. I am not a Basis guy, but very knowledgeable about Basis stuff, as it's the foundation of everything I do (DevOps). Performance is king, and unit tests is something I actually do. Developing HTML5 apps when HTML5 wasn't around. HCP/SCP user since 2012, NetWeaver since 2002, ABAP since 1998.
