Spring clean your data to simplify your life and help the environment
A team headed by Prof. Pierre Gönczy at EPFL’s School of Life Sciences (SV) took advantage of a system migration to delete old and redundant data from their database – slashing their data storage needs by a third. Their move will not only cut carbon emissions going forward, but will also make life easier for the scientists themselves.
In the run-up to Digital Cleanup Day on 16 March, you may be thinking about how you can best declutter your digital environment – sort through your inbox, for example, or dig through your photo archives. No matter what method you choose, you’ll certainly feel better after your virtual spring cleaning. At EPFL, Prof. Gönczy along with Léo Burgy, a bioinformatics researcher at the Gönczy Lab of Cell and Developmental Biology, and Nicolas Argento, a team leader at the SV’s IT department, embarked on a digital cleanup of their research database. By restructuring it and eliminating redundancies and superfluous entries, they were able to shrink its size from some 150 terabytes (TB) to less than 100.
“Our approach was analogous to initiatives to restore Lake Geneva’s shoreline – if there wasn’t any trash lying around, there wouldn’t be a need to clean it up,” says Gönczy. Over the past 20-plus years, his research group had built up a huge collection of images and other forms of data. Some of their files were near-duplicates with only minuscule differences, some were exactly the same but had been renamed, and some had been backed up without much thought.
We knew that our filing system was disorganized, but we were surprised by the scope of the problem.
The cleanup effort started in the summer of 2023 when two servers that the research group had been using for hot data storage were migrated to the research computing platform (RCP) at EPFL’s newly built data center. “We took that opportunity to standardize our data architecture and then apply the approach to our cold data storage system as well,” says Burgy. The cold data had been stored on EPFL’s Simple Storage System (S3), where the files were poorly documented and hard to access.
New rules
Gönczy, Burgy and Argento set to work combing through the group’s vast digital archives – a process that took a number of weeks. By the end of the year, they had sorted through, cleansed and restructured all the data. The files they decided to keep are now stored on RCP servers with new classification rules and clear user authorizations. Nothing was left in the depths of the S3.
“Our research group wasn’t very good at handling administrative data separately from research data,” says Burgy, referring to files he identified in cold storage. “We found full file-system backups along with some vacation pictures!”
Argento points out that Gönczy’s data purging initiative resulted in new rules for storing data and enabled his scientists to make better use of the data processing and storage systems available at EPFL.
“Given the very nature of a research lab, it’s normal that researchers would find themselves in this situation,” says Argento. “When you first set up a lab, you tend to focus on the scientific procedures. You don’t really think about data management until you’ve collected so much data that it becomes an issue. And research methods have changed considerably since the early 2000s – today scientists collect reams of complex data. EPFL has continually upgraded its storage capabilities over the past ten years but the network of systems isn’t always clear. What’s more, in a given experiment researchers can collect many different kinds of data, meaning they have to use different storage formats. They need to invest a fair amount of time to learn how to use the different systems and apply best practices in this area.”
The good news is that if you set up a research lab with the right procedures in place from the get-go, you won’t have so much of a problem.
“Now that our data are better organized, our systems have been streamlined and everyone understands how it all works, I hope we’ll never again find our files in such disarray,” says Gönczy. “We even developed a procedure for linking our microscopes directly to the servers. If people stick to our new way of working, everything should run smoothly.”
Training and guidance
Gönczy hopes that his experience will serve as an example for his colleagues at SV. He’d like to see EPFL introduce a training course on the basics of data management, much like the training that’s already offered on lab safety.
Argento agrees, and believes research labs should hire data managers to oversee their systems for generating, processing and storing data. But Gönczy points out that this would eat into their research budgets. “I don’t think the Swiss National Science Foundation, for instance, will magically increase the size of its grants to pay for data managers.”
“Ideally we would be able to measure the salary-equivalent cost of a scientist who has to spend time finding a single file, learning how to use a new data storage system or developing a new data management procedure,” says Gönczy. “Unfortunately, that’s not so easy to quantify. Another aspect to consider is how this affects scientists. We talk a lot about mental health, and I think scientists would be less stressed if they could locate their research files easily and receive support on data management in all phases of their work at EPFL.” Argento notes this would free up more time for the research itself. “We’d shift to a paradigm where scientists don’t have to think too much about data storage.”
In Gönczy’s research group, one of the most noticeable benefits has been greater peace of mind. From an environmental perspective, according to Manuel Cubero-Castan, a digital project manager at EPFL’s Sustainability Unit, the 53 TB of needless data removed from Gönczy’s database will save roughly 600 kg CO2e of emissions – approximately the same as a round-trip flight in Europe for one person.
“The problem we had in our research group was tiny compared with the types of data generated at EPFL,” says Gönczy. “But by pooling our efforts, we could make a measurable impact in terms of saving our planet. Some research centers produce so much data that the emissions generated in just 48 hours of processing time are more than the entire amount we’ve saved!”
That said, when it comes to sustainability, the simple fact of reducing the mental load on scientists is already a step in the right direction.
Tips for cutting back on data at your unit
- Develop a data storage and archiving procedure for your unit.
- Contact your IT department for advice on hot and cold data storage.
- Before starting a new research project, think about the life cycle of the data you’ll generate.
- Group files by research project rather than by individual. That way, when a project is finished, you can transfer the entire project folder into your archives. That will also make it easier to manage read/write permissions.
- If special software is required to read the data, don’t forget to add it to the archive too.
- When sorting through your personal or work-related data, focus on the biggest files first. You can use a directory-scanning program like TreeSize or Gemini2 to help you.
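
If you would rather script the last tip than install a scanning tool, the short Python sketch below illustrates the idea. It is not the procedure the Gönczy Lab used; the function names `largest_files` and `find_duplicates` are made up for this example, and it assumes that files only count as duplicates when their contents are byte-for-byte identical (for instance, a copy that was simply renamed). Near-duplicates with small differences will not be caught.

```python
import hashlib
import os
from collections import defaultdict


def _iter_files(root):
    """Yield the path of every regular file under root, skipping symlinks."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                yield path


def largest_files(root, top_n=20):
    """Return up to top_n (size_in_bytes, path) pairs, biggest first."""
    sizes = []
    for path in _iter_files(root):
        try:
            sizes.append((os.path.getsize(path), path))
        except OSError:
            continue  # file vanished or is unreadable; skip it
    sizes.sort(reverse=True)
    return sizes[:top_n]


def _sha256(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups (lists of paths) of files with identical contents."""
    # Step 1: group by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for path in _iter_files(root):
        try:
            by_size[os.path.getsize(path)].append(path)
        except OSError:
            continue

    # Step 2: hash only the files that share a size with another file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue
        for path in paths:
            try:
                by_hash[(size, _sha256(path))].append(path)
            except OSError:
                continue

    return [sorted(paths) for paths in by_hash.values() if len(paths) > 1]
```

Calling `largest_files("/path/to/project", top_n=10)` lists the ten biggest files to review first, and `find_duplicates("/path/to/project")` returns the groups of identical copies. The sketch only reports candidates and deletes nothing, so the decision about what to remove stays with you and your unit's archiving rules.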