Data extraction from SDS (safety data sheet) based on machine learning techniques

Within the scope of our seminar organised together with FZI and KIT, several alternative approaches to our existing ones are currently being investigated by students (Adrian, Gerrit, Gregor) in order to evaluate to what extent they can be of additional help for the SDS data extraction platform SdbHub.

Enclosed is a very fresh excerpt from the draft paper (read at your own risk). More info about the final paper will follow as soon as it is finished and polished.

“The goal of this paper is twofold. On the one hand, we replicate an existing machine-learning-based (ML-based) approach to extracting information from text documents for our use case. On the other hand, we extend this approach by designing new features (ST) and incorporating them into our models.
In detail, we examine how document element extraction can be automated, using the example of safety data sheets (SDSs) stored in the Portable Document Format (PDF). For this purpose, we provide a new benchmark data set of hundreds of SDSs labeled with 13 different elements of interest. This data set can be used to test our SDS extraction algorithms and to train other models. To represent the PDF data in an appropriate format, we retrieve the horizontal and vertical location of each symbol.

Due to the lack of existing labels in our data set, we annotate every token of interest with the corresponding label using rule-based scripts. Finally, we train 46 separate classifiers on different feature combinations for various label groups in order to assess which features work best for the extraction.

Additionally, we assess how our models perform when context is incorporated.

The results of our models show three major findings:

1) Context incorporation leads to improvements for all feature sets.

2) Even without a context window, all models containing our self-developed features already achieve strong scores.

3) Without context incorporation, HC features are not sufficient.

Overall, the main contributions of this paper are the following:

1) This paper focuses on automatic document information extraction using the example of SDSs, which is of high practical value.

2) This paper adds to the existing literature on automated information extraction by introducing a new type of feature set.

3) This paper shows that replicating an existing ML-based approach can also work for a different use case to a certain extent.

4) This paper highlights the number of possible applications for information extraction from documents and the importance of finding appropriate solutions for them.”
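
To make the steps in the excerpt more concrete, here is a minimal sketch of how position-aware tokens, rule-based labeling, and a context window could fit together. It is not the students' implementation: the `Token` class, the feature names, the `CAS_NUMBER` label, and the window size are all assumptions made for illustration, and the excerpt does not say which 13 elements are labeled.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float  # horizontal position on the page (units are an assumption)
    y: float  # vertical position on the page


# Hypothetical rule: a CAS registry number has 2-7 digits, 2 digits, 1 digit.
CAS_PATTERN = re.compile(r"\d{2,7}-\d{2}-\d")


def rule_label(tok: Token) -> str:
    """Assign a label with a hand-written rule; 'O' means 'not of interest'."""
    return "CAS_NUMBER" if CAS_PATTERN.fullmatch(tok.text) else "O"


def token_features(tok: Token) -> dict:
    """Features of a single token: its text plus its location on the page."""
    return {
        "lower": tok.text.lower(),
        "is_digit": tok.text.isdigit(),
        "x": tok.x,
        "y": tok.y,
    }


def windowed_features(tokens: list, i: int, window: int = 1) -> dict:
    """Features of token i plus those of its neighbours within the window."""
    feats = {f"0:{k}": v for k, v in token_features(tokens[i]).items()}
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        if 0 <= j < len(tokens):
            for k, v in token_features(tokens[j]).items():
                feats[f"{off:+d}:{k}"] = v
        else:
            # Mark positions that fall outside the document.
            feats[f"{off:+d}:pad"] = True
    return feats


tokens = [Token("CAS", 50.0, 700.0), Token("64-17-5", 120.0, 700.0)]
labels = [rule_label(t) for t in tokens]
features = [windowed_features(tokens, i, window=1) for i in range(len(tokens))]
```

Setting `window=0` yields the "no window" setting, with features from the token itself only. The resulting feature dictionaries and labels could then be passed to any standard classifier.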
