DHLAB paper wins top prize at CHR conference

Sven Najem-Meyer and Matteo Romanello © CDH / 2023 EPFL

Sven Najem-Meyer (Digital Humanities Lab at EPFL) and Matteo Romanello (Institute of Archeology and Classical Studies at UNIL) won the “Best Paper” award at the Third Conference on Computational Humanities Research (CHR) for their paper, “Page Layout Analysis of Text-heavy Historical Documents: a Comparison of Textual and Visual Approaches”.

The award-winning paper, authored by Najem-Meyer and his PhD advisor Romanello, compared three computational ways to process the layout of classical documents. At the origin of this research lies a recurrent problem in digital humanities. With most automatic text recognition programs, what comes out is a block of text without the context of its layout. Researchers, however, are seldom interested in the entirety of the extracted text. Regions such as headers, page numbers or footnotes are often filtered out. Layout analysis programs allow the desired text to be extracted without the unnecessary extras.

However, classical texts like Greek literature pose a specific challenge for layout analysis, as they have very unusual characteristics. For example, a page might contain some Greek words, then a few words of English for translation, then Greek again, and then some comments in a different block. In addition, there has not been a lot of research on these kinds of documents.

“These are very peculiar texts,” Najem-Meyer explains. “These are not the kind of texts where you would have an off-the-shelf algorithm to work with.”

Figure from the paper showing the main layout elements of a scholarly commentary page © Najem-Meyer and Romanello / 2023 EPFL

Najem-Meyer and Romanello compared three approaches to analyzing the layout of classical Greek texts to see which worked best: one that looked only at the page image and extracted blocks by region, one that analyzed only the text, and a hybrid model that used both image and text. The hybrid model might seem likely to perform best because it draws on the most data, but they found that the image alone was the most effective. In fact, the image was so much more informative that the hybrid model barely looked at the text.

“Surprisingly, more information did not mean better results in this case,” Najem-Meyer explains.

When awarding the prize, the jury noted that the comparison in the paper was rigorous, and that the authors were transparent about the limitations of their results. “As a young researcher, I take it as a great encouragement,” Najem-Meyer says.

Author: Stephanie Parker

Source: College of Humanities | CDH

This content is distributed under a Creative Commons CC BY-SA 4.0 license. You may freely reproduce the text, videos and images it contains, provided that you indicate the author’s name and place no restrictions on the subsequent use of the content. If you would like to reproduce an illustration that does not contain the CC BY-SA notice, you must obtain approval from the author.