EPFL student creates a new language-analysis program

Yale’s Beinecke Rare Book and Manuscript Library © iStock

Jonathan Besomi, a Master’s student at EPFL, has developed a program called Texthero that lets users generate representations of textual data with just a few lines of code, thereby simplifying the analysis of natural languages.

We now live in a data-filled age that has ushered in its own distinct challenges. One of the biggest is how to analyze vast reams of information. In response, Besomi, a Master’s student in data science, has developed Texthero, a program that simplifies the task of analyzing textual data. It was created in the spring of 2020 under the supervision of Kenneth Younge, Chair of Technology and Innovation Strategy at EPFL’s Management of Technology & Entrepreneurship Institute. Designed as open-source software and written in the Python programming language, Texthero swiftly won over developers around the world.

“Texthero has been downloaded over 23,000 times so far, and has been awarded 2,000 stars on the GitHub platform,” says Besomi. “It got a lot of attention as soon as we released it – people even began sharing it on social media, primarily Twitter and LinkedIn. This indicates that there was strong demand for such a program in the Python/NLP [Natural Language Processing] community.”

Rapid visual representations

Using Texthero, developers can quickly visualize and understand text-based datasets. “Our program takes a text made up of unstructured data, cleans it up, generates a representation of it by converting it into numerical form, and finally visualizes it. In other words, Texthero gives users an overall idea of the structure of a completely unfamiliar text,” explains Besomi.
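The clean-then-represent sequence Besomi describes can be illustrated with a minimal sketch in plain Python. This is not Texthero's code or API: the function names (`clean`, `tfidf`, `top_terms`), the tiny stop-word list and the sample sentences are all assumptions made for illustration, and printing each text's most distinctive terms stands in for the plotting step.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def clean(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf(docs):
    """Turn tokenized documents into {term: weight} dictionaries.

    Weight = (term count / document length) * log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return vectors


def top_terms(vector, k=3):
    """Return the k highest-weighted terms, ties broken alphabetically."""
    return sorted(vector, key=lambda term: (-vector[term], term))[:k]


# Made-up sample texts.
texts = [
    "The court ruled in favor of the plaintiff.",
    "The court dismissed the appeal.",
    "Texthero is a Python library.",
]

cleaned = [clean(t) for t in texts]   # step 1: clean
vectors = tfidf(cleaned)              # step 2: numerical representation

# Step 3 (stand-in for a plot): show the most distinctive terms per text.
for text, vector in zip(texts, vectors):
    print(top_terms(vector, 2), "<-", text)
```

In this sketch, a word shared by several texts (here "court") receives a lower weight than words unique to one text, which is what lets the representation highlight what distinguishes each document.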

The idea for Texthero first came to Besomi when he was working with Professor Younge on Fastlaw, a program for analyzing legal texts. “Fastlaw is a ‘word-embedding’ tool that was trained on a large corpus of legal data provided by Harvard University’s Caselaw Access Project (CAP) – a project to make every ruling published by US courts freely available,” says Besomi. He and Younge presented their program to the Harvard Law School Library.

“As I was developing Fastlaw, I realized there was a need for software that could quickly pre-process, represent and visualize textual data,” says Besomi. Before Texthero, developers who wanted to process natural language had to combine several separate libraries, such as spaCy, scikit-learn, Gensim and NLTK. The process was both time-consuming and complex. “Now, with Texthero, just a few lines of code are enough to plot a text to be processed.”

A new version

To date, 16 developers have contributed to Texthero through pull requests on GitHub. They’ve fixed bugs, introduced new features and improved the documentation. “We’re about to release a new version (1.1) that will boost text processing speeds even further,” says Besomi.

Besomi now wants to consolidate and expand the Texthero community through blog posts and tutorials, in order to increase uptake of his program. “When I think about the billions of pieces of data around us that we can’t assimilate, it would seem that text analysis – in all its forms – is the wave of the future,” says Besomi, who is currently completing an internship at IBM Research Zurich and writing a thesis on text analysis. “I’m fascinated by these issues and pleased to have created a simple, straightforward program that makes natural language processing easier.”

Author: Leila Ueberschlag

Source: EPFL