PDF-Parsing-Algorithm

Assignment:

PDF parsing depends largely on text recognition. Digitally generated text can usually be extracted reliably, whereas scanned text cannot, which is why PDF parsers also use OCR (Optical Character Recognition). Equally important is the reliable recognition of headings, paragraphs, sections, tables, and other structural elements, and this is where different parsers run into difficulties. The overarching task is to create a comprehensive overview of libraries that function as PDF parsers, identify their weaknesses, and develop an algorithm that addresses one of these weaknesses.
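
To illustrate the split between digitally generated and scanned pages, here is a minimal sketch of how a parser might decide when to fall back to OCR. The function name `needs_ocr` and the threshold `min_chars` are hypothetical and chosen only for illustration. The sketch assumes the text layer of each page has already been extracted as a string by some PDF library.

```python
def needs_ocr(extracted_text: str, min_chars: int = 20) -> bool:
    """Return True when a page's text layer looks empty or near-empty.

    Assumption: a scanned page yields little or no extractable text,
    so a low count of non-whitespace characters suggests OCR is needed.
    The threshold is an arbitrary illustrative value.
    """
    visible = [ch for ch in extracted_text if not ch.isspace()]
    return len(visible) < min_chars


# Hypothetical text layers: one digital page, one scanned page.
pages = ["Introduction\nThis thesis compares PDF parsers.", "  \n "]
print([needs_ocr(page) for page in pages])  # [False, True]
```

A real parser would likely combine such a check with other signals, for example whether the page consists of a single full-page image.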

Tasks:

The main idea of this thesis is the development of a PDF parser that is capable of identifying headings, paragraphs, tables, sections, and other structural elements.

In-depth research, identification, and definition of quality metrics in the context of PDF parsing.
Systematic comparison of existing PDF parsers in terms of accuracy and their ability to recognize different text segments (headings, tables, etc.).
Identification of the weaknesses of existing parsers.
Development of a solution that addresses one of the identified weaknesses.
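
As a starting point for the quality metrics mentioned above, the following sketch shows two common candidates: precision, recall, and F1 for detected structural elements, and a character error rate based on edit distance for extracted text. These metrics are not prescribed by the assignment, and all function names and the sample data are hypothetical. The sketch assumes each structural element is represented as a `(type, text)` pair and that a hand-labeled reference exists for comparison.

```python
def precision_recall_f1(predicted, expected):
    """Compare predicted and expected elements as sets of (type, text) pairs."""
    pred, gold = set(predicted), set(expected)
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(parsed: str, reference: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        return 0.0 if not parsed else 1.0
    return edit_distance(parsed, reference) / len(reference)


# Hypothetical example: the parser mislabels a table as a heading.
gold = [("heading", "1 Introduction"), ("table", "Table 1"), ("heading", "2 Methods")]
pred = [("heading", "1 Introduction"), ("heading", "Table 1"), ("heading", "2 Methods")]
print(precision_recall_f1(pred, gold))
print(character_error_rate("helo", "hello"))  # 0.2
```

Exact matching on `(type, text)` pairs is strict. A comparison across parsers may need a more tolerant matching rule, for example one based on text similarity or bounding-box overlap.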

Contact person: Thorsten Wittkopp (t.wittkopp@tu-berlin.de)

Start: immediately

Resources:

An example parser: https://github.com/VikParuchuri/marker