High Performance Computing for Distributed Indexing of Scientific Data

Searching for information within large sets of unstructured, heterogeneous scientific data is very challenging unless an inverted index has been built in advance. Several solutions, mainly based on the Hadoop ecosystem, have been proposed to accelerate index construction. These solutions perform well when the data are already distributed across the cluster nodes that take part in the processing. Otherwise, the cost of distributing the data can introduce noticeable overhead. We propose ISODAC, a new approach aimed at improving efficiency without sacrificing reliability. Our solution keeps the number of I/O operations to a minimum by using a stream of in-memory operations to extract and index heterogeneous data. We further improve performance by using GPUs and POSIX Threads for the most computationally intensive tasks of the indexing procedure. ISODAC indexes heterogeneous documents up to 10.6x faster than other widely adopted solutions, such as Apache Spark.
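For readers unfamiliar with the term, the following is a minimal sketch of what an inverted index is: a map from each term to the set of documents that contain it, built in a single in-memory pass over a stream of documents. It is not ISODAC's implementation; the function names `build_inverted_index` and `search`, the tokenization rule, and the sample documents are all assumptions made for illustration, and no GPU or thread-level parallelism is shown.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed tokenization rule: lowercase runs of letters and digits.
TOKEN = re.compile(r"[a-z0-9]+")


def build_inverted_index(docs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Consume (doc_id, text) pairs once and map each term to its doc ids."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs:
        for term in TOKEN.findall(text.lower()):
            index[term].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return the ids of documents that contain every term of the query."""
    terms = TOKEN.findall(query.lower())
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical documents, used only to exercise the sketch.
sample_docs = [
    ("doc1", "GPU indexing of scientific data"),
    ("doc2", "Spark indexing on a cluster"),
]
sample_index = build_inverted_index(sample_docs)
matches = search(sample_index, "gpu indexing")
```

Once the index exists, a query only touches the posting sets of its own terms instead of rescanning every document, which is why building the index in advance pays off for large collections.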

Giuseppe Totaro, High Performance Computing for Distributed Indexing of Scientific Data

Advanced Concepts Team

science coffee
