As part of the seminar we are organizing jointly with the FZI and KIT, students (Adrian, Gerrit, Gregor) are currently investigating several alternatives to our existing approaches, in order to evaluate to what extent they can provide additional help for the SDS data extraction platform SdbHub.
Below is a very fresh excerpt from the draft of the paper (read at your own risk). More information about the final paper will follow once it is finished and polished.
„The goal of this paper is twofold. On the one hand, we replicate an existing machine-learning-based (ML-based) approach for extracting information from text documents and apply it to our use case. On the other hand, we extend this approach by designing new features (ST) and incorporating them into our models.
In detail, we examine how document element extraction can be automated, using the example of safety data sheets (SDSs) stored in the Portable Document Format (PDF). For this purpose, we provide a new benchmark data set of hundreds of SDSs labeled with 13 different elements of interest. This data set can be used to test our SDS extraction algorithms and to train other models. To represent the PDF data in an appropriate format, we obtain the horizontal and vertical location of each symbol. Due to the lack of existing labels in our data set, we annotate every token of interest with the corresponding label using rule-based scripts. Finally, we train 46 separate classifiers on different feature combinations for various label groups in order to assess which features work best for the extraction.
Additionally, we assess how our models perform when context is incorporated.
The results of our models show three major findings:
1) Context incorporation leads to improvements for all feature sets.
2) Even without a context window, we already achieve strong scores with all models containing our self-developed features.
3) Without context incorporation, HC features are not sufficient.
Overall, the main contributions of this paper are the following:
1) This paper focuses on automatic document information extraction using the example of SDSs, which is of high practical value.
2) This paper adds to the existing literature on automated information extraction by introducing a new type of feature set.
3) This paper shows that replicating an existing ML-based approach can also work for a different use case to a certain extent.
4) This paper highlights the number of possible applications for information extraction from documents and the importance of finding appropriate solutions for them.“
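
To give a rough idea of what "context incorporation" with a window could look like, here is a minimal sketch in Python. It is an illustration under our own assumptions, not the implementation from the paper: the function name `add_context`, the zero padding at the document boundaries, and the symmetric window are all hypothetical choices. Each token is represented by a small feature vector (for example its horizontal and vertical position), and the features of its neighbours are concatenated onto it.

```python
from typing import List, Sequence


def add_context(features: Sequence[Sequence[float]], window: int) -> List[List[float]]:
    """Concatenate each token's features with those of its neighbours.

    `features` holds one feature vector per token, in reading order.
    `window` is the number of neighbours taken on each side (0 = no context).
    Positions outside the document are filled with zeros (an assumption).
    """
    if window < 0:
        raise ValueError("window must be non-negative")
    if not features:
        return []

    width = len(features[0])
    padding = [0.0] * width  # stand-in for missing neighbours at the edges

    result: List[List[float]] = []
    for i in range(len(features)):
        row: List[float] = []
        # Walk from the left-most neighbour to the right-most neighbour.
        for j in range(i - window, i + window + 1):
            if 0 <= j < len(features):
                row.extend(features[j])
            else:
                row.extend(padding)
        result.append(row)
    return result


# Three tokens, each described by a hypothetical (x, y) position.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
no_context = add_context(tokens, window=0)
with_context = add_context(tokens, window=1)
```

With `window=0` the feature vectors are returned unchanged, which corresponds to the "no window" case in the excerpt. With `window=1` every token additionally sees the features of its direct predecessor and successor, so a classifier trained on these rows can use the surroundings of a token when predicting its label.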