2024-06-11 –, Palais Atelier
This talk presents an experimental serverless Solr indexer where the records from multiple database tables are merged using the map-reduce approach. The merge is implemented in AWS using Step Function Distributed Map to orchestrate many lambda functions.
Indexing data from databases to Apache Solr has always been an open problem: for a while, the data import handler was used even if it was not recommended for production environments. Traditional indexing processes often encounter scalability challenges, especially with large datasets.
In this talk, we explore the architecture and implementation of a serverless MapReduce indexer designed for Apache Solr but extendable to any search engine. By embracing a serverless approach, we can take advantage of the elasticity and scalability offered by cloud services like AWS Lambda, enabling efficient indexing without needing to manage infrastructure.
We dig into the principles of MapReduce, a programming model for processing large datasets, and discuss how it can be adapted for indexing documents into Apache Solr. Using AWS Step Functions to orchestrate multiple Lambdas, we demonstrate how to distribute indexing tasks across multiple resources, achieving parallel processing and significantly reducing indexing times.
Through practical examples, we address key considerations such as data partitioning, fault tolerance, concurrency, and cost.
We also cover integration points with other AWS services such as Amazon S3 for data storage and retrieval, as well as DynamoDB for distributed lock between the lambda instances.
Daniele Antuzi is a software engineer passionate about high-performance data structures and algorithms. He has been working for 4 years in finance (List Spa) and 2 years in cloud services (Amazon Web Services) but his curiosity to learn more about information retrieval brings him to join Sease Ltd.
He likes studying and experimenting with new technologies trying to reduce the gap between academia and industry.