PDF parsing relies largely on text recognition. Digitally generated text can be extracted reliably, but scanned text is stored as an image and cannot be read directly. For this reason, PDF parsers also use OCR (Optical Character Recognition) to recognize such text. Equally important is the reliable recognition of headings, paragraphs, sections, tables, and other structural elements, and existing parsers struggle with this to varying degrees. The overarching task is to compile a comprehensive overview of the libraries that function as PDF parsers, identify their weaknesses, and develop an algorithm that addresses one of these weaknesses.


The main idea of this thesis is the development of a PDF parser that is capable of identifying headings, paragraphs, tables, sections, and other structural elements.

Contact person: Thorsten Wittkopp (t.wittkopp@tu-berlin.de)

Start: immediately