Distributed processing of DNA sequencing data

Optimizing time and computing resource consumption while processing a petabyte of data

May 31, 2019 from Activeeon

Founded in 1946, the Institut National de la Recherche Agronomique (INRA) is the leading agricultural research institute in Europe and the second largest in the world in terms of the number of projects carried out by its researchers and the number of scientific publications. Its teams work in research areas that range from food quality and agricultural sustainability to the preservation of the environment, biodiversity and ecosystems. To carry out its missions, INRA uses state-of-the-art technologies.

Sequencing the intestinal microbiota

MetaGenoPolis brings together researchers, engineers, laboratory technicians, bioinformaticians, bio-analysts, statisticians, mathematicians, microbiologists and a physician. Using advanced metagenomic technologies, this INRA platform aims to understand the impact of the intestinal microbiota, i.e. all the microorganisms (bacteria, archaea, viruses, fungi) found in the intestine, on human and animal health.

MetaGenoPolis extracts microbial DNA from human stool samples and sequences it. Each sample yields 20 million short sequences, which must then be assembled like a puzzle to reconstruct genes and genomes and, finally, to establish the sample's microbial profile (i.e. the microbial species present and their abundances).

“Our database now includes the sequencing results of almost 20,000 samples,” explains Nicolas Pons, research engineer at INRA and head of the MetaGenoPolis bioinformatics platform. “This represents a total of 1 petabyte of data that we must store and process locally, i.e. 1 million billion bytes. That’s considerable!”
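To give a sense of scale, the figures quoted above can be combined in a short back-of-the-envelope calculation. This is only a rough sketch: "almost 20,000" is rounded to 20,000, and the petabyte presumably covers derived results as well as raw sequences, so the per-sample figure is an average storage footprint rather than the size of the raw reads. The function names are illustrative.

```python
# Back-of-the-envelope figures derived from the numbers quoted in the article.
# "Almost 20,000" is rounded to 20,000, so the results are approximate.
PETABYTE = 10**15               # 1 million billion bytes
SAMPLES = 20_000                # approximate number of sequenced samples
READS_PER_SAMPLE = 20_000_000   # short sequences obtained per sample


def average_bytes_per_sample(total_bytes: int, samples: int) -> float:
    """Average storage footprint per sample, all data types combined."""
    return total_bytes / samples


def total_reads(samples: int, reads_per_sample: int) -> int:
    """Total number of short sequences across all samples."""
    return samples * reads_per_sample


gb_per_sample = average_bytes_per_sample(PETABYTE, SAMPLES) / 10**9
print(f"~{gb_per_sample:.0f} GB per sample on average")
print(f"~{total_reads(SAMPLES, READS_PER_SAMPLE):.1e} short sequences in total")
```

Under these assumptions, each sample accounts for roughly 50 GB on average, and the database as a whole corresponds to about 400 billion short sequences.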

From genes to microbial profile characterization: a very large amount of data to be processed

“To build microbial profiles for each individual, we rely on catalogues of genes and microbial species representative of ecosystems,” continues Nicolas Pons. “In the human intestine alone, there are nearly 10 million genes.”

Depending on the studies entrusted to MetaGenoPolis, the number of samples to be processed at the same time can reach several hundred or even several thousand.

Ensuring reliability and optimizing data processing

MetaGenoPolis therefore needs a digital infrastructure and storage solutions that are particularly reliable and suited to these complex operations. To this end, the platform relies on the ProActive solution developed by ActiveEon to orchestrate the processing of data from the analysis of microbiota samples. ProActive distributes processing jobs in a way that optimizes not only execution time but also computing resource consumption.

“ProActive allows us to organize the different computing tasks on a cluster, our group of servers on the network, while providing us with a workflow engine that makes it easier for bio-analysts to implement certain processes by optimizing the workflow and access to specific resources. It is fully suited to our bioinformatics and biostatistics processing needs.”
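The general idea of running one independent task per sample, with a cap on how many run at once, can be illustrated with a minimal sketch. This is not ProActive's API or the MetaGenoPolis pipeline: it uses a local thread pool from Python's standard library as a stand-in for cluster nodes, and `TOY_CATALOGUE`, `profile_sample` and `run_batch` are hypothetical names invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical miniature "catalogue": marker sequence -> species name.
# Real catalogues hold millions of genes; this is purely illustrative.
TOY_CATALOGUE = {"ACGT": "species_A", "GGTA": "species_B"}


def profile_sample(sample_id: str, reads: list[str]) -> dict:
    """Toy profile: share of marker hits attributed to each species."""
    counts: dict[str, int] = {}
    for read in reads:
        for marker, species in TOY_CATALOGUE.items():
            if marker in read:
                counts[species] = counts.get(species, 0) + 1
    total = sum(counts.values())
    abundance = {s: n / total for s, n in counts.items()} if total else {}
    return {"sample": sample_id, "abundance": abundance}


def run_batch(samples: dict[str, list[str]], max_workers: int = 4) -> dict[str, dict]:
    """Run one independent task per sample and collect results by sample id.

    max_workers caps how many tasks run concurrently, a simple way to bound
    resource consumption; a cluster scheduler applies the same idea across
    machines rather than threads.
    """
    results: dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(profile_sample, sid, reads): sid
                   for sid, reads in samples.items()}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    batch = {"S1": ["TTACGTAA", "ACGTCC", "GGTAAC"], "S2": ["GGTATT"]}
    for sid, profile in sorted(run_batch(batch, max_workers=2).items()):
        print(sid, profile["abundance"])
```

Because each sample is processed independently, the batch can grow from a few samples to several thousand without changing the task itself; only the number of workers and the scheduling policy need to scale.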

To learn more about INRA MetaGenoPolis research, visit the following website:

INRA MetaGenoPolis
