Berlin Buzzwords 2024

Open Web Search - a platform for a free European Web Index
2024-06-11, Palais Atelier

The Open Web Search Initiative (OWS.eu) is a European initiative providing an open platform to foster innovation in search and AI applications. Join us for an introduction to OWS and its public datasets, followed by a look at two of the open-source projects that underpin it: StormCrawler and URLFrontier.


In recent years, the Internet and the digital world have become extremely important in our everyday life, as almost every aspect of our daily routine is tied to digital resources. Those resources are currently curated by a small number of non-European gatekeepers. In web search, this results in limited access to information, provided by companies that focus primarily on commercial success rather than on individual needs and societal values. The dominance of a few can lead to polarisation and disinformation. Consequently, information as a public good, with freely accessible and transparent content, is no longer under public control.

The objective of the Open Web Search Initiative (OWS.eu) is to create an Open Web Index (OWI) – open in contrast to the proprietary, closed web indices of the large commercial providers. The forthcoming web index is planned as a public infrastructure. The underlying software will be shared as open-source code, and its content will undergo public and transparent moderation. The Open Web Index is aimed specifically at companies, NGOs, scientists and other organisations in Europe that want to develop new search and AI applications.

In our session, we will first introduce the project, its objectives, and the open data it generates. To bring the project idea closer to the audience in a practical way, we plan to give a live demonstration of the public dashboard. This web service enables users to configure custom indices and download them to their own systems at regular intervals. User-customised indices can encompass the entire Open Web Index or focus only on specific parts of the web, allowing for countless use cases. For instance, researchers interested in structured web data can download documents containing JSON-LD metadata, whereas companies may want to create topic-specific search engines within the scope of their particular domain. In this way, the data we collect becomes easy to use without significant technical overhead. With the live demo, we want to foster a hands-on understanding of the Open Web Index's potential.

After presenting the “what we are doing”, we will delve into the technical details and the “how we are doing it”. Our focus lies on data collection with OWLer (Open Web Crawler), a distributed crawling system built on the open-source projects StormCrawler and URLFrontier. StormCrawler is an SDK that allows developers to build their own web crawlers on top of an Apache Storm cluster. The resulting crawlers are distributed, scalable and high-performing. Moreover, they integrate seamlessly with the URLFrontier framework, which offers a crawler-agnostic API and abstracts the web frontier data structure, that is, the queue of URLs to crawl next. Together, StormCrawler and URLFrontier form the backbone of the technical setup of our Open Web Crawler.
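
To give a rough idea of what a web frontier abstracts, here is a minimal in-memory sketch in Python. It is not the URLFrontier API and not how OWLer is implemented; the class `SimpleFrontier`, its `put` and `get` methods, and the example URLs are hypothetical names chosen for illustration. The sketch assumes one FIFO queue per host, served round-robin so that no single site monopolises the crawl, and it ignores persistence, politeness delays and distribution.

```python
from collections import deque
from urllib.parse import urlsplit


class SimpleFrontier:
    """Toy in-memory frontier (illustrative only, not the URLFrontier API)."""

    def __init__(self):
        self._queues = {}      # host -> deque of URLs waiting to be fetched
        self._hosts = deque()  # hosts that currently have pending URLs
        self._seen = set()     # every URL ever added, to avoid re-queueing

    def put(self, url):
        """Queue a discovered URL; return False if it was already known."""
        if url in self._seen:
            return False
        self._seen.add(url)
        host = urlsplit(url).netloc
        if host not in self._queues:
            self._queues[host] = deque()
            self._hosts.append(host)
        self._queues[host].append(url)
        return True

    def get(self):
        """Return the next URL to crawl, or None if the frontier is empty."""
        if not self._hosts:
            return None
        host = self._hosts.popleft()
        queue = self._queues[host]
        url = queue.popleft()
        if queue:
            self._hosts.append(host)  # host still has work: back of the line
        else:
            del self._queues[host]    # nothing left for this host
        return url


frontier = SimpleFrontier()
for u in ["https://a.example/1", "https://a.example/2", "https://b.example/1"]:
    frontier.put(u)

# URLs come back interleaved by host rather than in insertion order.
order = [frontier.get() for _ in range(3)]
```

A crawler built this way only ever calls `put` for newly discovered links and `get` for the next URL to fetch, which is the kind of separation that lets the frontier be swapped out independently of the crawler.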

See also: Slides (2.8 MB)

Hello, I'm a PhD student at the University of Passau. As my work is related to web crawling, I'm interested in all new things related to Data Science and Information Retrieval.