Training your LLM with SAP
The latest SAP DEVELOPER LICENSE AGREEMENT 3.2 makes it unnecessarily hard to use software released under it together with AI. The software released under the license cannot be used to train or refine an AI model. As the license states: “You are expressly prohibited from using the Software, Tools or APIs as well as any Customer Applications or any part thereof for the purpose of training (developing) artificial intelligence models or systems (“AI Training”).”
Any software developed using anything under this license, most famously any CAP 9.x based app, is therefore off limits for AI usage that involves training, refinement or model development. But what about other SAP resources? What if you want to use them to train or refine your AI model? What about SAP Help or the SAP homepage as input for an AI model? And what about the common scenario of asking your AI a question and having an AI agent search the internet, including SAP Help, for the answer?
Crawlers
Officially, OpenAI follows robots.txt rules. Therefore, it makes sense to investigate the SAP Help robots.txt file to find out if SAP is blocking OpenAI or other AI tools. ChatGPT-User is the bot that visits a web page when asked by the user to find out more information. GPTBot is the bot that adds information to the AI model. The agent used by Google is Google-Extended.
The robots.txt file from SAP Help looks like this:
# robots.txt for https://help.sap.com
#
User-agent: *
Disallow: /

# Search Engines
User-agent: Googlebot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: Bingbot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

# AI and Chat Agents
User-agent: Google-Extended
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: GPTBot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: ChatGPT-User
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

Sitemap: https://help.sap.com/http.svc/sitemapxml/sitemaps/sitemap_index.xml
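To see which rules a given bot ends up with, the file can be split into groups per user agent. The following is a minimal sketch, not a full robots.txt parser: the file content is abbreviated to three groups, the function names `parse_groups` and `rules_for` are made up for illustration, `SomeOtherBot` is a hypothetical agent name, and matching the agent by exact name is a simplification of what real crawlers do.

```python
from collections import defaultdict

# Abbreviated copy of the SAP Help robots.txt shown above (assumption: only
# three of the groups are kept to make the example short).
ROBOTS_TXT = """\
# robots.txt for https://help.sap.com
#
User-agent: *
Disallow: /

# AI and Chat Agents
User-agent: GPTBot
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$

User-agent: ChatGPT-User
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$
"""


def parse_groups(text):
    """Map each lower-cased user agent to its list of (directive, pattern) rules."""
    groups = defaultdict(list)
    current_agents = []
    seen_rule = False  # a User-agent line that follows a rule starts a new group
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return dict(groups)


def rules_for(groups, agent):
    """Return the agent's own group, falling back to the '*' group."""
    return groups.get(agent.lower(), groups.get("*", []))


groups = parse_groups(ROBOTS_TXT)
print(rules_for(groups, "GPTBot"))        # the three Allow/Disallow rules
print(rules_for(groups, "SomeOtherBot"))  # hypothetical bot: falls back to '*'
```

In this sketch, a bot without its own group falls back to the `*` group and is therefore asked to stay away from the whole site, while the explicitly listed agents get the more permissive rules.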
The rules follow the same schema:
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$
This is good, as everything is allowed except URLs matching /*/frameset.htm and every file ending in .htm under /doc/. Frameset pages are SAP Help pages with the old design. These are older pages.
And the files in the doc folder? This looks strange. Why htm and not also html files? The affected files are not the normal SAP Help documentation pages. Those are stored under docs, not doc, and follow a different schema: https://help.sap.com/docs/<product>/<id>.html
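How the three rules interact can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the matcher any particular crawler uses: `*` is treated as "any characters", a trailing `$` as "end of the URL", and the longest matching pattern wins, with Allow winning a tie. The helper names `pattern_to_regex` and `is_allowed` and the frameset path in the usage lines are made up for the example. As far as I know, Python's standard `urllib.robotparser` has traditionally used plain prefix matching, which is why the wildcard handling is hand-rolled here.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: '*' = any characters, trailing '$' = end of URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Longest matching pattern wins; Allow wins a tie; no match means allowed."""
    best_len, allowed = -1, True
    for directive, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            is_allow = directive == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len, allowed = len(pattern), is_allow
    return allowed


# The rule set shared by the search engine and AI agent groups.
RULES = [
    ("allow", "/"),
    ("disallow", "/*/frameset.htm"),
    ("disallow", "/doc/*.htm$"),
]

print(is_allowed(RULES, "/docs/SAP_ERP"))
print(is_allowed(RULES, "/doc/abapdocu_latest_index_htm/latest/en-US/index.htm"))
print(is_allowed(RULES, "/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENABAP.html"))
print(is_allowed(RULES, "/some_product/helpdata/en/frameset.htm"))  # hypothetical path
```

Under these assumptions, a path below /docs/ is never caught by the /doc/ rule, because the pattern requires the slash directly after doc.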
SAP Documentation
The typical URL of an SAP product's documentation is /docs/<PRODUCT NAME>/… For SAP S/4HANA Cloud Private Edition it is /docs/SAP_S4HANA_CLOUD_PE
Other examples:
- SAP ERP: /docs/SAP_ERP
- SAP HANA Cloud: /docs/hana-cloud
- SAP Business ByDesign: /docs/SAP_BUSINESS_BYDESIGN
- SAP S/4HANA Cloud Public Edition: /docs/SAP_S4HANA_CLOUD
As the folder is named docs and not doc, AI bots are allowed to crawl these pages. You can use the product documentation as input for your AI project. But when all pages start with /docs, why disallow /doc?
Doc folder
A Google search for site:https://help.sap.com/doc/ suggests that the doc folder stores PDFs.
Refining the search to look for HTML files shows that there are SAP Help documentation files under the folder doc.
- https://help.sap.com/doc/saphelp_em900/9.0/en-US/4c/4e606921cf4d4ce10000000a15822b/content.htm
- https://help.sap.com/doc/saphelp_gbt10/1.0/en-US/4c/5bd7ac97817511e10000000a42189b/content.htm
- https://help.sap.com/doc/saphelp_nw73ehp1/7.31.19/en-US/4d/78548291c1262ae10000000a42189b/content.htm
- https://help.sap.com/doc/saphelp_nw75/7.5.5/DE-DE/c6/af09673ed74143b48e57f85a03f2f5/content.htm
These are older pages, as you can see from the design. It makes sense to disallow them for AI, as the frameset.htm pages with the same older design are also blocked.
But is this all that is stored under doc? What else can you find there? Again, an internet search helps to find content.
JavaDoc
JavaDoc is hosted under the path /doc/. The pages themselves end in .html and are therefore not blocked.
- https://help.sap.com/doc/javadocs_nw74_sps04/7.4.4/en-US/index.html
- https://help.sap.com/doc/javadocs_nw75_sps00/7.5.0/en-US/index.html
- Hybris JavaDoc: https://help.sap.com/doc/9fef7037b3304324b8891e84f19f2bf3/2211/en-US/index.html
SAP Mobile Services
The SDK documentation for the SAP Mobile Services SDK for
- Android: https://help.sap.com/doc/f53c64b93e5140918d676b927a3cd65b/Cloud/en-US/docs-en/guides/getting-started/android/overview.html
- iOS: https://help.sap.com/doc/f53c64b93e5140918d676b927a3cd65b/Cloud/en-US/docs-en/guides/getting-started/ios/introduction.html
- Mobile service documentation: https://help.sap.com/doc/f53c64b93e5140918d676b927a3cd65b/Cloud/en-US/docs-en/guides/index.html
These pages are under doc, but end in .html and not .htm, so they are not blocked. If you want an AI to learn how to write code, having access to the SDK and API information is useful. When thinking about coding, AI and SAP, ABAP is the top priority. Can the AI access the ABAP documentation and use it to learn?
ABAP Documentation
I continue to fail to find the ABAP keyword documentation as an easy-to-click link on the SAP Help homepage. I must search for it or go through the documentation. Here are the links to the documentation for cloud, on premise and private cloud.
ABAP BTP
For ABAP in BTP, the SAP Help page referring to the ABAP Keyword Documentation is here. The ABAP Keyword Documentation for ABAP Cloud: https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENABAP.html
S/4HANA on premise
For ABAP programming in the on-premise S/4HANA system, SAP Help points to: https://help.sap.com/doc/abapdocu_latest_index_htm/latest/en-US/index.htm
S/4HANA private cloud
SAP S/4HANA Private Cloud: https://help.sap.com/doc/abapdocu_latest_index_htm/latest/en-US/index.htm
The ABAP keyword documentation is stored under the folder doc: /doc/abapdocu… The pages end in .htm, with the exception of the latest ABAP Cloud documentation, whose pages end in .html.
You shall not pass
In robots.txt, .htm files in the folder doc are marked as disallowed for AI agents.
Allow: /
Disallow: /*/frameset.htm
Disallow: /doc/*.htm$
AI bots are not allowed to access files under the folder doc that end in .htm. Looking at the file names of the ABAP docu, the scheme used to forbid access now makes sense. The ABAP docu uses .htm and not .html, except for the latest version, the one that uses UI5. Its pages end in .html. This rule is explicitly made to forbid AI from accessing the non-Cloud ABAP documentation.
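The effect on the URLs quoted in this post can be checked with a short Python sketch. The regular expression is my own hand translation of `Disallow: /doc/*.htm$` (`*` becomes `.*`, the trailing `$` anchors the end of the path), and the function name `is_blocked` is made up for illustration. It only looks at this one rule and at the path part of the URL, so it is a simplification of what a real crawler evaluates.

```python
import re
from urllib.parse import urlparse

# Hand-translated from "Disallow: /doc/*.htm$" (assumption: '*' means any
# characters and '$' anchors the end of the path).
DOC_HTM = re.compile(r"/doc/.*\.htm$")

# URLs taken from this post.
URLS = [
    "https://help.sap.com/doc/abapdocu_latest_index_htm/latest/en-US/index.htm",
    "https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENABAP.html",
    "https://help.sap.com/doc/javadocs_nw75_sps00/7.5.0/en-US/index.html",
    "https://help.sap.com/doc/saphelp_nw75/7.5.5/DE-DE/c6/af09673ed74143b48e57f85a03f2f5/content.htm",
]


def is_blocked(url):
    """True if the URL's path matches the /doc/*.htm$ rule."""
    return DOC_HTM.match(urlparse(url).path) is not None


for url in URLS:
    print("blocked" if is_blocked(url) else "allowed", url)
```

The on-premise ABAP keyword documentation and the old saphelp page match the rule, while the ABAP Cloud documentation and the JavaDoc pages, which end in .html, do not.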
This makes it harder to use AI with the ABAP information. If you want to get information about ABAP features, SAP Help might be out of reach for your release. If you are chatting with your AI and want it to look up information about CDS table entities for your S/4HANA on-premise or private cloud release, the robots.txt rule tells the chat bot that the site should not be accessed. An alternative is to check the documentation available in your SAP system, provided SAP, or your contract, allows AI to access it. You can rely on books or other sources, but these might not be as up to date as the official documentation.
You talking to me?
Let’s try this out using ChatGPT. Let’s ask it some questions and see if the ABAP documentation is accessed or not.
Answer
The latest ABAP documentation is given as 7.58 from October 2023 and the sources used are Wikipedia or GitHub. Let’s try to find more about a recent feature like CDS table entity.
The bot searches the internet and is accessing some SAP pages.
Answer
The sources contain SAP pages. Let’s take a look at those.
The first two entries point to the ABAP Keyword Documentation:
The links are:
- https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENCDS_TABLE_ENTITY_GLOSRY.html?utm_source=chatgpt.com
- https://help.sap.com/doc/abapdocu_cp_index_htm/CLOUD/en-US/ABENCDS_TABLE_ENTITIES.html?utm_source=chatgpt.com
The links go to the latest version of the ABAP Cloud keyword documentation.
The chat bot searches the internet, goes to SAP Help and the ABAP Keyword Documentation and accesses the latest version.
ChatGPT offers to give the direct deep links to the official examples.
The referenced links go to the ABAP Keyword documentation.
It goes further and asks if it should get the code block examples directly from the SAP page.
Let’s ask ChatGPT to get the source code from the site.
Answer
ChatGPT does respect the robots.txt rule. The content is not crawled. Access to the latest version is allowed; access to earlier versions is not. When ChatGPT tried to get the information directly from the ABAP keyword documentation, where the site restricts access, it followed the robots.txt rule.
You can try this out by asking ChatGPT to “go to the abap keyword documentation for the on premise release 2023 and look up an example for CDS view entity”. The answer draws on a wide range of sources, like developers, learning and product documentation. In my case, the sources also contained a reference to the ABAP docu:
This is in the blocked list: doc and htm. Was it crawled by OpenAI before SAP blocked access, or was it referenced by another source? I have no idea. But this shows that the results you get back mostly do not depend on the ABAP docu.
What now
The rules defined by the site owner are not legally binding (says AI, consult your lawyer). The intention of the rules is to indicate what to crawl and what not. If a 3rd party site contains a link to a URL marked as disallowed, crawlers are going to index it anyway. The rules are a way of nicely asking bots to behave. It is impolite to ignore them, but there is not much that can be done. However, the rules can be enforced by power. I’d not ignore the rules, especially when you depend on doing business with SAP and you sell a product trained with forbidden resources. This might get you into some discussions. Just know what you are doing.
If you want to train your LLM or use it with SAP resources: follow the robots.txt rules. SAP Help is (still) very open, as is sap.com. As long as SAP is not restricting the usage of their resources, the information available there combined with other resources should be good enough. The CAP documentation is open, and so is the SAPUI5 documentation. The Fiori design guidelines are now hosted under sap.com, so they can be used by AI. The Fiori App Library does not have a robots.txt file.

There are many LLMs out there that know how to code in Java, JavaScript and TypeScript and how to do UX design. These do not know very well what ABAP is. I guess SAP wants developers to use/buy Joule for ABAP development. What you should not use for your AI is the blocked versions of the ABAP documentation. ABAP is SAP’s proprietary language and it makes sense for SAP to control it. That is annoying for everyone else.

I guess the ABAP resources available outside the ABAP Keyword Documentation are more than enough and good enough to train an LLM to be helpful for any ABAP developer. It’s just a matter of time. It will be interesting to see if SAP will try to limit access to ABAP for AI tools that are not from SAP. I guess a stricter robots.txt rule is not what SAP will have in mind.