
Outdated methods of SDS data extraction

Automation ain’t automation


In discussions with potential clients, we often hear that they have already introduced automated SDS read-in and that the problem is therefore solved. On closer inspection, it quickly becomes clear that the implemented data extraction involves very little actual automation and does not deliver satisfying results. The following explains the different methods. Above all, it should show that much more is possible than one might think.

OCR (Optical Character Recognition) and Safety Data Sheets

Yes, I know how your solution works; I also use OCR for SDS reading.

OCR refers to text recognition in images: an uploaded image is converted into a text format. There are OCR solutions that additionally provide a lot of meta information about the texts, but in the end only text is delivered. With safety data sheets, there are always users who apply OCR to make data sheets from which no text can be copied readable again, then manually mark and transfer the data from the resulting texts. After all, this is better than nothing.
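To make that limitation concrete, here is a minimal sketch of what plain OCR delivers. It assumes the Tesseract binary plus the pytesseract and Pillow packages are installed; the file name is hypothetical.

```python
# A minimal OCR sketch: raw text out, no structure.
from PIL import Image
import pytesseract

page = Image.open("sds_page_1.png")        # a scanned SDS page (hypothetical file)
text = pytesseract.image_to_string(page)   # OCR: image -> raw text

# The result is one unstructured string. Which line holds the product
# name, the CAS numbers, or the H-statements still has to be worked
# out by other means.
print(text[:500])
```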

As part of the data extraction process, we use many different methods, including OCR to determine certain quality aspects. OCR alone, however, is by far not sufficient to extract data from a safety data sheet in a structured manner.

Depending on the document quality, individual characters may not be recognized cleanly. For example, the digit “1” can become the lowercase letter “l”, and the same happens with numerous other characters and words.
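As a hedged illustration (a toy heuristic, not our actual pipeline), such confusions can be repaired after OCR in fields whose format is known, for example CAS numbers:

```python
import re

# Map common OCR lookalike letters back to digits, but only inside
# tokens that match the CAS number shape (digits-digits-digit).
CONFUSIONS = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})
CAS_LIKE = re.compile(r"\b[\dlIoO]{2,7}-[\dlIoO]{2}-[\dlIoO]\b")

def repair_cas_candidates(text: str) -> str:
    return CAS_LIKE.sub(lambda m: m.group().translate(CONFUSIONS), text)

print(repair_cas_candidates("CAS No.: 64-l7-5"))  # -> "CAS No.: 64-17-5"
```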

For professional applications, we at Datalyxt use either off-the-shelf solutions or our own OCR system, depending on the requirements. For SDSs, for example, we have our own solution, built on open source components, that is specially trained for the character sets found in SDSs.

Data Scraping and Safety Data Sheets

That is data scraping. I don’t need it; there are countless free alternatives on the web.

Data scraping is often used in the context of web data (web scraping). Certain properties are captured from a website in a structured way. We’ve been doing this since 2015 as part of our web search engine SonarBox, supplying web data to numerous companies in an automated way. SdbHub actually emerged from the context of SonarBox. The underlying technology platform is very similar.

Web scraping also requires the definition of models or rules that capture data cleanly from web pages. Depending on the case, learning a model may be necessary, while in other cases a few heuristics and regular expressions may be sufficient. Especially when information needs to be captured across web pages, developing appropriate cross-domain AI models with web page understanding is definitely necessary.
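To illustrate what “a few heuristics” can look like, here is a minimal rule-based scraping sketch; the URL and CSS selectors are invented for illustration, and every real site needs its own rules, which is exactly where the maintenance effort comes from.

```python
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/product/123", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

record = {
    # heuristic selectors, valid only for this one (hypothetical) layout
    "product_name": soup.select_one("h1.product-title").get_text(strip=True),
    "sds_pdf_url": soup.select_one("a[href$='.pdf']")["href"],
}
print(record)
```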

This is exactly the case with SDSs as well. There are no ready-made data scraper solutions, not even non-commercial ones (since those only deliver data from web pages), that you can use for SDS digitization and that deliver high-quality structured data.

GitHub and Safety Data Sheets

My colleagues said there are ready-made solutions on GitHub.

Yes, there are a couple of implementations on GitHub. One of them comes from my former students at KIT: together with FZI and KIT, we ran a seminar in 2019 in which the students developed an approach and uploaded it to GitHub. We always enjoy collaborating with students. They like to work on exciting, concrete, practice-relevant tasks from industry, and through the collaboration we receive valuable new impulses and ideas for innovative extraction and analysis methods.

No question, GitHub projects are exciting and we ourselves are glad they exist. However, the projects related to SDSs are far from reliable and professional.

Robotic Process Automation (RPA) and Safety Data Sheets

Do you use Robotic Process Automation (RPA) for data extraction from safety data sheets?

We do not use RPA directly for automated data extraction from SDSs. First, you have to be clear about what RPA even means and how it is used. This is what Wikipedia says:

…In traditional workflow automation tools, a software developer produces a list of actions to automate a task and interface to the back end system using internal application programming interfaces (APIs) or dedicated scripting language. In contrast, RPA systems develop the action list by watching the user perform that task in the application’s graphical user interface (GUI), and then perform the automation by repeating those tasks directly in the GUI. …

So in summary, it’s this: I have a task that I perform on the screen; I do it once in front of the machine, and then the computer repeats it automatically. Of course, you can have such a solution implemented for your SDSs at great cost. However, first, all SDSs must then look nearly identical (length of texts, position, page, etc.), and second, you must define at least one rule for each supplier.

If the SDSs are nearly identical in appearance, these approaches work very well and provide high-quality data. In all other cases, anything between 0% and 100% is possible. The approach is an interesting thought, but it has more in common with roulette: it quickly reaches its limits, it does not scale across suppliers, and maintenance is not easy.
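To see why this is layout roulette, consider a sketch of the kind of per-supplier, fixed-position rule such an approach bakes in. The supplier name, coordinates, and file name are hypothetical, and pdfplumber stands in for the RPA tool; the rule fails silently as soon as the supplier changes the layout.

```python
import pdfplumber

RULES = {
    # supplier -> (page index, bounding box in PDF points: x0, top, x1, bottom)
    "ACME Chemicals": (0, (50, 120, 550, 160)),
}

def extract_product_name(path: str, supplier: str) -> str:
    page_no, bbox = RULES[supplier]
    with pdfplumber.open(path) as pdf:
        # crop to the fixed region and read whatever text happens to be there
        return pdf.pages[page_no].within_bbox(bbox).extract_text() or ""

print(extract_product_name("acme_sds.pdf", "ACME Chemicals"))
```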

We do use a kind of “RPA” for annotating new SDSs. This is usually called document annotation, and that is also the term we use, both internally in our development team and externally with our clients. In other words, we annotate documents with tools that include RPA techniques in order to provide our algorithms with new training data.
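As an illustration (the field names and labels are ours for the example, not SdbHub’s actual schema), such an annotation can boil down to labeled character spans over the document text:

```python
import json

annotation = {
    "doc_id": "sds_0001",
    "text": "Product name: Acetone\nCAS No.: 67-64-1",
    "spans": [
        {"start": 14, "end": 21, "label": "PRODUCT_NAME"},  # "Acetone"
        {"start": 31, "end": 38, "label": "CAS_NUMBER"},    # "67-64-1"
    ],
}
print(json.dumps(annotation, indent=2))
```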

What makes SdbHub different?

How is SdbHub different, and how does it extract data from SDSs?

In the context of SDSs, we use the term data extraction. However, the terms data capturing, data acquisition, data read-out, data retrieval, and data analytics are closely related and, in a narrower sense, equivalent.

So don’t let the technocratic word salad confuse you: it’s all about whether you get SDS data in a structured, clean, and simple way or not. Which “cutting-edge” technology is used, and how, is initially irrelevant to you as a user. The only thing that matters is the results. This is exactly what we deliver with SdbHub through the different ways in which we combine machine learning techniques.

Is importing SDSs the same thing you do with SdbHub?

It depends on what is understood by importing. There are several possible meanings:

  • It can mean that the SDS is available in the software as a PDF file. The PDF file has then been imported into the software, but no data has been read from it.
  • It can mean that the SDS is read in and it is verified whether the supplier has embedded any structured information, e.g. XML, in the PDF file (a rough sketch of such a check follows after this list). According to our own statistics, this currently occurs in about one in 200 SDSs (approx. 0.5%) and is therefore practically irrelevant.
  • It can mean that data fields are cleanly cut out of the SDS in a structured way and read out. This data is added to your SDS and made accessible, for example, in your hazardous substance register, so you do not have to enter the data manually.

SdbHub enables the latter.
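For the second bullet above, checking a PDF for embedded structured data can be sketched roughly as follows; the file name is hypothetical, and the attachments view assumes a recent pypdf release.

```python
from pypdf import PdfReader

reader = PdfReader("supplier_sds.pdf")

# reader.attachments maps attachment names to lists of payload bytes
for name, payloads in reader.attachments.items():
    if name.lower().endswith(".xml"):
        print("embedded XML found:", name, f"({len(payloads[0])} bytes)")
```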

In addition, we intend to release a future version with which you can independently train the system specifically for your SDSs. In this way, you extend the model we pre-trained with your own data, giving you additional, optional, highly individualized extraction options for your safety data sheets. You can use the annotation feature, but you don’t have to; SdbHub is very accurate even without your customizations.
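As a generic sketch of what such user-side training could involve (an illustration with spaCy as a stand-in, not SdbHub’s actual mechanism):

```python
import spacy
from spacy.training import Example

# A blank pipeline with a fresh NER component stands in for the
# pre-trained model, so the sketch runs without any model download.
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("CAS_NUMBER")

# One hand-annotated example; a real extension would use many.
TRAIN_DATA = [
    ("CAS No.: 64-17-5 (ethanol)", {"entities": [(9, 16, "CAS_NUMBER")]}),
]

optimizer = nlp.initialize()
for _ in range(20):
    for text, ann in TRAIN_DATA:
        nlp.update([Example.from_dict(nlp.make_doc(text), ann)], sgd=optimizer)

print([(ent.text, ent.label_) for ent in nlp("CAS No.: 64-17-5").ents])
```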