Your favorite TV series? Find them thanks to subtitles!
What is your favorite TV series: Desperate Housewives, The Mentalist, Game of Thrones or Big Bang Theory? EPFL students have now developed a tool that indexes and classifies the words that frequently appear in the dialogues of TV series to identify themes. Instead of focusing on what other people like, this novel tool suggests TV series on the basis of themes associated with the scenario.
Do you enjoy TV series where doctors and nurses become romantically involved (but not too much!), where there is some police action (but not too much violence!) and a bit of humor? How does one identify TV series that have just the right dose of all of these ingredients?
For a course on Big Data given by the School of Computer and Communication Sciences, a group of students developed their own unique project to do exactly that. Raphaël Von Aarburg, student leader, noticed that the usual tools used to recommend TV series all worked using the same formula: comparing the choices of Internet users among themselves. Based on the shared tastes of fans, these recommendations did not take the scenario into account. The reason is that, unlike textual information, the content of TV series is difficult to analyze because it consists of two different formats: a soundtrack and a video track.
The solution? Using subtitles to analyze TV series
How can these large information flows comprising the content of TV series be processed and analyzed? “Our initiative was to analyze the subtitles contained in the audio track of TV series. Our group of eight students then got together to work on this new idea,” explains Raphaël Von Aarburg.
The students decided to write a software program to analyze the ingredients that go into a given scenario, such as humor, romance, suspense or drama. More ambitiously, the software program was also intended to identify narrative aspects such as drugs, crime or power through dialogues. For example, the TV series South Park breaks down into the following themes: 58% cartoon, 17% vice, 4% (counter-)terrorism, 4% politics and 17% less important themes. The next step was to use this data to recommend a given TV series to Internet users on the basis of the percentage of ingredients that they choose from among the 25 themes proposed by the software: College Life, Sexy, Family, Science, Science Fiction, Crime, Medical, Magic, Supernatural, Action, War, Investigation, etc.
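To make the idea of recommending by "percentage of ingredients" concrete, here is a minimal sketch in Python. It is not the students' implementation: the scoring rule (the share of the requested mix that a series also contains), the function names and every profile except the South Park figures quoted above are assumptions made for illustration.

```python
# Hypothetical theme profiles, expressed as fractions of the dialogue.
# Only the South Park figures come from the article; the others are invented.
PROFILES = {
    "South Park": {"cartoon": 0.58, "vice": 0.17, "terrorism": 0.04,
                   "politics": 0.04, "other": 0.17},
    "Hospital Show": {"medical": 0.60, "romance": 0.25, "humor": 0.15},
    "Cop Show": {"crime": 0.55, "investigation": 0.35, "humor": 0.10},
}


def overlap(wanted, profile):
    """Share of the requested theme mix that the series profile also covers."""
    return sum(min(weight, profile.get(theme, 0.0))
               for theme, weight in wanted.items())


def recommend(wanted, profiles, top_n=3):
    """Return up to top_n series names, best match first, skipping zero overlap."""
    scored = sorted(((overlap(wanted, p), name) for name, p in profiles.items()),
                    reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]


# A viewer asking for mostly medical drama with some romance and humor.
picks = recommend({"medical": 0.5, "romance": 0.3, "humor": 0.2}, PROFILES)
```

With these invented profiles, the hospital series ranks first because it covers most of the requested mix, while South Park is dropped because it shares none of the requested themes.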
Using mathematics to interpret words
The first challenge for the students was to find a sequence of instructions (i.e. an adequate algorithm) that could group words into themes. “The difficulty was to find mathematical equations capable of categorizing words! This algorithm already existed in part: the Latent Dirichlet Allocation (LDA). However, this algorithm still needed to be decoded, adapted and implemented,” explains Simon-Pierre Genot, one of the students who found this phase of the project exciting. Adaptation meant adjusting the different operations in this algorithm so that several networked computers could perform them simultaneously. In IT terms, this is referred to as "running the algorithm in parallel". Doing so combines the power of several computers and delivers results far more quickly than would be possible with a single machine.
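For readers curious about what LDA does, the following Python sketch shows one common way to fit it, collapsed Gibbs sampling, on a single machine. The article does not say which inference method the students used, and their version ran in parallel, so this is only an illustration under those assumptions; the function name, the default parameters and the toy documents are all hypothetical.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling; docs is a list of word lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]   # words of doc d in topic k
    topic_word = [{} for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                 # total words in topic k
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic given all other assignments.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Theme mix of each document, and the most frequent words of each theme.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in counts]
        for doc, counts in zip(docs, doc_topic)
    ]
    top_words = [
        sorted(vocab, key=lambda w, k=k: -topic_word[k].get(w, 0))[:3]
        for k in range(num_topics)
    ]
    return theta, top_words


toy_docs = [
    ["vampire", "blood", "night", "vampire"],
    ["surgery", "doctor", "nurse", "surgery"],
    ["vampire", "night", "blood"],
    ["doctor", "nurse", "surgery"],
]
theta, top_words = lda_gibbs(toy_docs, num_topics=2)
```

Each row of `theta` is the estimated theme mix of one document and sums to 1, which is the kind of percentage breakdown the article quotes per series. The inner loop updates shared counters one word at a time, which hints at why distributing the algorithm over several machines takes real adaptation work.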
Second challenge: the eight students had to retrieve the data (subtitles of TV series in English posted on various websites). To do this, the students wrote a series of scripts (i.e. programming code) to launch and coordinate execution of their software program. These scripts run in sequence to automatically search for all subtitles on the web. Once the data had been collected, the students began the cleaning process, i.e. automatically correcting errors such as spelling mistakes, removing duplicates, and identifying onomatopoeias (Crash!, Bang!, etc.) and information for the hearing impaired. In addition, the students had to eliminate all sorts of words such as connecting words (however, because, but, therefore, etc.) and articles and determiners (the, a, an, some, any, etc.), keeping only essential words. “The power of this algorithm is that it does not merely sort words according to themes. It looks for significant words such as "vampire" or "surgery", for example, not words like "Hello", which can be found in all TV series,” explains Khalil Hajji, one of the students who devoted considerable time to the project. Preference is given to meaningful nouns and adjectives that allow one to characterize and categorize a given TV series. For The Big Bang Theory, for example, the top words are: earth, school, date, class, planet, party, mom, sex, universe, cool, kiss, fun, etc.
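The cleaning steps described above might look roughly like the Python sketch below. It is an assumption-laden illustration rather than the students' scripts: the word lists are tiny hypothetical samples, "duplicates" is interpreted here as a subtitle line repeated immediately after itself, and spelling correction is left out.

```python
import re

# Tiny illustrative lists; real ones would be far longer (assumption).
STOP_WORDS = {"the", "a", "an", "some", "any", "however", "because", "but",
              "therefore", "and", "is", "to", "of"}
ONOMATOPOEIAS = {"crash", "bang"}

TAGS = re.compile(r"<[^>]+>")                     # markup such as <i>...</i>
BRACKETED = re.compile(r"\[[^\]]*\]|\([^)]*\)")   # cues such as [DOOR SLAMS]
WORD = re.compile(r"[a-z']+")


def clean_subtitles(lines):
    """Turn raw subtitle lines into a list of lowercase content words."""
    tokens = []
    previous = None
    for line in lines:
        text = TAGS.sub(" ", line)
        text = BRACKETED.sub(" ", text).strip().lower()
        # Skip empty lines and a line repeated right after itself.
        if not text or text == previous:
            continue
        previous = text
        for word in WORD.findall(text):
            word = word.strip("'")
            if word and word not in STOP_WORDS and word not in ONOMATOPOEIAS:
                tokens.append(word)
    return tokens


sample = [
    "[DOOR SLAMS] The vampire is here!",
    "[DOOR SLAMS] The vampire is here!",
    "<i>Bang!</i> However, the surgery went well.",
]
words = clean_subtitles(sample)
# words == ['vampire', 'here', 'surgery', 'went', 'well']
```

The output of a step like this, one list of content words per episode or series, is the kind of input a topic model such as LDA expects.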
A minuscule error in the code is enough
“About a week before the deadline, nothing worked! The results produced by the algorithm were not relevant!” exclaims Khalil Hajji. Working hard, the students went through all of the project phases with a fine-toothed comb and found an error in one of the lines of code drawn from a publication. It was a small error that nevertheless prevented the algorithm from running in parallel. Although the algorithm was perfectly correct, the error in the simultaneous deployment on several computers skewed the students’ results. “We wrote to the author of the publication to report the error and he was very thankful,” recalls Raphaël Von Aarburg.
To make the results of this project available, the students designed a website (www.submetrics.org). When you enter the TV series of your choice in the search engine, the tool displays the themes associated with this TV series, their degree of importance, and the words most frequently used in the dialogues. The results are often rather funny. For example, the top words for Game of Thrones are: sir, power, death, brother, lord, magic and king. For comparison, the top words for Breaking Bad are: car, cop, gun, detective, police, shoot and drug.
“We used graphs to present data in a clear and appealing way,” explains Claire Musso, who greatly contributed to the final phase of the project. “It’s a bit like a map of the various TV series: each point represents a TV series. Connecting lines between TV series show their similarities. This allows us to see that TV series that take place at a hospital are "spatially and thematically" distant from the other TV series. In contrast, TV series set in police, criminal or political contexts tend to be very close on the map.”
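One plausible way to obtain such connecting lines is to link two series whenever their theme profiles are similar enough. The Python sketch below does this with cosine similarity and a fixed threshold. The article does not describe how the students built their map, so the measure, the threshold and the profiles are all assumptions chosen for illustration.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two theme profiles stored as dicts."""
    dot = sum(weight * b.get(theme, 0.0) for theme, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_edges(profiles, threshold=0.5):
    """Return (series, series, similarity) for every pair above the threshold."""
    edges = []
    for first, second in combinations(sorted(profiles), 2):
        score = cosine(profiles[first], profiles[second])
        if score >= threshold:
            edges.append((first, second, round(score, 3)))
    return edges


# Invented profiles for four imaginary series.
series_profiles = {
    "Hospital A": {"medical": 0.7, "romance": 0.3},
    "Hospital B": {"medical": 0.6, "romance": 0.2, "humor": 0.2},
    "Police A": {"crime": 0.6, "investigation": 0.4},
    "Politics A": {"politics": 0.5, "crime": 0.3, "investigation": 0.2},
}
edges = similarity_edges(series_profiles)
```

With these invented profiles, the two hospital series are linked to each other and the police series is linked to the political one, but no line joins the hospital cluster to the rest, which mirrors the separation described in the quote.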
British versus American English
“A very funny aspect of the results,” says Simon-Pierre Genot, “is that the algorithm is able to distinguish between British and American English with words such as "mate", "lad", "blimey", "o'clock" or even by paying attention to the number of repetitions of the word "tea". We therefore have a theme called "British" that includes such TV series as Doctor Who or Downton Abbey.”
The website also recommends similar TV series in terms of content preferred by Internet users. “Ideally,” explains Claire Musso, “we would like to find a way to combine our tool with typical recommendation tools. This should help to optimize the results of both…”
And the students all concur: “There were eight of us but we managed to work together in an excellent atmosphere and the support and guidance given to us by the PhD students was outstanding! We would therefore like to thank Professor Christoph Koch and his assistants for their help, Mohammed El Seidy and Amir Shaikha.”
Project website: www.submetrics.org
IC Students: Claire Musso, Florian Simond, Grigory Rozhdestvenskiy, Khalil Hajji, Nassim Drissi El Kamili, Nils Bouchardon, Simon-Pierre Génot and Raphaël von Aarburg