<?xml version='1.0' encoding='utf-8' ?>
<!-- Made with love by pretalx v2026.1.1. -->
<schedule>
    <generator name="pretalx" version="2026.1.1" />
    <version>0.18</version>
    <conference>
        <title>Berlin Buzzwords 2023</title>
        <acronym>berlin-buzzwords-2023</acronym>
        <start>2023-06-18</start>
        <end>2023-06-20</end>
        <days>3</days>
        <timeslot_duration>00:05</timeslot_duration>
        <base_url>https://program.berlinbuzzwords.de</base_url>
        <logo>https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/img/bbuzz_logo_500x500-transparent_TKOnJhA.png</logo>
        <time_zone_name>Europe/Berlin</time_zone_name>
        
        
    </conference>
    <day index='1' date='2023-06-18' start='2023-06-18T04:00:00+02:00' end='2023-06-19T03:59:00+02:00'>
        <room name='Palais Atelier' guid='2fd45b3f-34a5-5cdc-83ff-c5179ae5fb09'>
            <event guid='27f888a5-481d-553c-ab46-d63bdb35583c' id='30538' code='K7BZXH'>
                <room>Palais Atelier</room>
                <title>Barcamp</title>
                <subtitle></subtitle>
                <type>Workshop</type>
                <date>2023-06-18T15:00:00+02:00</date>
                <start>15:00</start>
                <duration>03:00</duration>
                <abstract>Barcamps are informal sessions, a kind of &quot;un-conference&quot;, with a schedule decided on the day.</abstract>
                <slug>berlin-buzzwords-2023-30538-barcamp</slug>
                <track></track>
                
                <persons>
                    <person id='32682'>Nick Burch</person>
                </persons>
                <language>en</language>
                <description>Barcamps are informal sessions, a kind of &quot;un-conference&quot;, with a schedule decided on the day. It is all driven by the interests and expertise of those who attend so each one is different, but ours are always great!

Although the barcamp doesn&apos;t have a strict schedule, it won&apos;t be completely devoid of structure! #bbuzz barcamps are dynamic events, focused on the overall Berlin Buzzwords topics, tackling the same challenges but in a different format. At the barcamp each session runs for 30 minutes giving enough time to get into the meat of a topic, but without a chance of anyone getting bored. These are participatory sessions and more inclusive than regular conference talks, with everyone taking part. You can help by leading the session, by giving some insights, by asking some great questions, or maybe just with your enthusiasm.

The barcamp will be coordinated and moderated by Nick Burch.

Registration starts at 2:30pm</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/K7BZXH/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/K7BZXH/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='2' date='2023-06-19' start='2023-06-19T04:00:00+02:00' end='2023-06-20T03:59:00+02:00'>
        <room name='Kesselhaus' guid='0eebb650-d6b8-53cb-9d33-48dd47df08c1'>
            <event guid='5d3b34dc-6ffb-52b1-afa0-81900c1fdd7e' id='31134' code='3WEQZF'>
                <room>Kesselhaus</room>
                <title>Opening Session</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T09:15:00+02:00</date>
                <start>09:15</start>
                <duration>00:15</duration>
                <abstract>Join us as we kick off Berlin Buzzwords 2023</abstract>
                <slug>berlin-buzzwords-2023-31134-opening-session</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/3WEQZF/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/3WEQZF/feedback/</feedback_url>
            </event>
            <event guid='4c1dfd06-f033-565d-94e0-9c091ea4f2d6' id='29798' code='BKUVLR'>
                <room>Kesselhaus</room>
                <title>What defines the &#8220;open&#8221; in &#8220;open AI&#8221;?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T09:35:00+02:00</date>
                <start>09:35</start>
                <duration>00:40</duration>
                <abstract>This talk focuses on unpacking this year&#8217;s big buzzwords of &#8220;open AI&#8221; and &#8220;responsible AI&#8221; to highlight the range of (sometimes contradictory) activities that exist under these umbrella terms and how</abstract>
                <slug>berlin-buzzwords-2023-29798-what-defines-the-open-in-open-ai</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/BKUVLR/Ding-Jennifer_wEkHrVL.png</logo>
                <persons>
                    <person id='34260'>Jennifer Ding</person>
                </persons>
                <language>en</language>
                <description>While the majority of AI production is concentrated within a few companies in even fewer countries, alternative pathways are emerging for more people to participate in the process of building, applying, and governing ML models. Open Artificial Intelligence (open AI) initiatives offer new spaces to reimagine how AI is developed and who can be part of the process. However, over the past year, the intensification of AI model development and hype has made the already nebulous term &#8220;AI&#8221; even more confusing when extended with terms like &#8220;open&#8221;, &#8220;responsible&#8221;, &#8220;trustworthy&#8221;, and &#8220;democratic&#8221;. This talk focuses on unpacking this year&#8217;s big buzzwords of &#8220;open AI&#8221; and &#8220;responsible AI&#8221; to highlight the range of (sometimes contradictory) activities that exist under these umbrella terms and how the AI field is expanding the practice of &#8220;open&#8221; beyond traditional FOSS contexts. Following an overview of the current open and responsible AI landscape, we will end with a discussion on community priorities for focus and intervention to build AI production pipelines that live up to aspirational attributes, like &#8220;open.&#8221;</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/BKUVLR/resources/Jennifer_Ding_-_What_Defines_the_Open_EzHjw87.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/BKUVLR/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/BKUVLR/feedback/</feedback_url>
            </event>
            <event guid='0233bf17-db89-57ba-8d3e-743f3dee468b' id='27856' code='VUGYME'>
                <room>Kesselhaus</room>
                <title>Vectorize Your Open Source Search Engine</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:20</duration>
                <abstract>Fascinated by vector search but don&apos;t know where to start?
Join us to crack the code and leverage the potential of vector search to delight your users.</abstract>
                <slug>berlin-buzzwords-2023-27856-vectorize-your-open-source-search-engine</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/VUGYME/Arora-Atita_BjKE0QQ.png</logo>
                <persons>
                    <person id='32470'>Atita Arora</person>
                </persons>
                <language>en</language>
                <description>Neural search (a.k.a. Vector search) has rewritten the standards of information retrieval in many different domains.
Vector search can help you gather a better understanding of the user query intent, drive product recommendations, search across different source data (text, images, audio, video), deliver better results, improve personalization and create a more successful user experience. Vector search goes beyond keywords to harvest the potential of graphs and embeddings to match users to the intended document, product, job, picture, song, or video.
As fascinating as this may sound it&apos;s easy to find ourselves lost in the deluge of new information.  
If you&apos;re struggling to get started, want to understand what vector search can bring to the party or how to add cool new models such as OpenAI models, and want to avoid common pitfalls, this talk is for you.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/VUGYME/resources/Atita_Arora_-_Vectorize_your_Opensour_ZHaysB1.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/VUGYME/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/VUGYME/feedback/</feedback_url>
            </event>
            <event guid='9c7c6d44-c4d5-5cac-9293-49ddae355412' id='27877' code='KPELMM'>
                <room>Kesselhaus</room>
                <title>Kaldb: serverless lucene at petabyte scale</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T11:10:00+02:00</date>
                <start>11:10</start>
                <duration>00:40</duration>
                <abstract>In this talk, we share our experiences, best practices, and lessons learned in designing and operating a serverless Lucene serving system at PB scale.</abstract>
                <slug>berlin-buzzwords-2023-27877-kaldb-serverless-lucene-at-petabyte-scale</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/KPELMM/Karumuri-Suman_kLs9ElG.png</logo>
                <persons>
                    <person id='32492'>Suman Karumuri</person>
                </persons>
                <language>en</language>
                <description>Running petabyte-scale columnar stores has become a routine operation in today&apos;s data-driven world. However, running a petabyte-scale search system is still a challenging task operationally. Enter Kaldb, an open-source, serverless Lucene serving system designed specifically for petabyte-scale Lucene workloads. We&apos;ve designed Kaldb to automate and reduce operational toil without sacrificing performance or reliability.

But designing a serverless Lucene system at this scale poses several unique challenges, such as ensuring durability of data, modifying replication and caching protocols for high availability, high fanout reads, managing ephemeral nodes, and more. 

In this talk, we&apos;ll delve into the details of how our redesigned Kaldb system overcomes these challenges. We&apos;ve separated durability of the data from storage, separated compute from storage, modified replication algorithms to handle ephemeral nodes, used Kafka as a write-ahead log, and developed a novel query execution layer to handle high-fanout queries. Our implementation not only reduces operational toil but also adds several self-healing properties to the system. We&apos;re proud to say that Kaldb currently runs on Kubernetes at petabyte scale with improved reliability and performance.

Join us in this talk to learn more about how Kaldb can help you overcome the challenges of running a petabyte-scale Lucene serving system. We&apos;ll share our experiences, best practices, and lessons learned in designing and operating a serverless Lucene serving system at this scale, and provide practical insights and techniques that you can use to optimize your own search systems.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/KPELMM/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/KPELMM/feedback/</feedback_url>
            </event>
            <event guid='fa98a059-5699-5ead-a6f4-b1b079de74f4' id='27943' code='YTLX8T'>
                <room>Kesselhaus</room>
                <title>Boosting Ranking Performance with Minimal Supervision</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:40</duration>
                <abstract>Using generative Large Language Models (LLMs) to generate synthetic labeled data to train in-domain ranking models. Distilling the knowledge and power of generative LLMs into effective ranking models.</abstract>
                <slug>berlin-buzzwords-2023-27943-boosting-ranking-performance-with-minimal-supervision</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/YTLX8T/Bergum-Jo-Kristian_eWlgmPY.png</logo>
                <persons>
                    <person id='32561'>Jo Kristian Bergum</person>
                </persons>
                <language>en</language>
                <description>Transformer language models are highly effective text rankers; however, training Transformer-based neural ranking models requires vast amounts of labeled supervised data, which is costly and time-consuming. What if you could teach a ranking model without behavioral click data or human annotations? Enter generative large language models (LLMs) such as GPT-3. 

This talk showcases a novel approach to generating labeled data with minimal human supervision. First, with just three human-labeled queries and document examples, an open-source LLM generates synthetic questions for all documents in the index. Then, the synthetic data trains a much smaller, cost-efficient Transformer ranking model, which outperforms a strong BM25 baseline by 10 nDCG@10 points on a popular relevance dataset. 

The innovative method saves on costly annotation efforts, enables faster adaptation to search ranking in new domains, and allows organizations to revolutionize their search capabilities without breaking the bank.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YTLX8T/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YTLX8T/feedback/</feedback_url>
            </event>
            <event guid='5213db91-9dcf-5e8e-b6fb-b4e63a1d5c17' id='27989' code='YPUPAA'>
                <room>Kesselhaus</room>
                <title>Introducing Multi-valued Vector Fields in Apache Lucene</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>Supporting multiple vectors in a field dedicated to K-nearest-neighbors search has long been a fundamental problem for Apache Lucene.
This talk describes how it has finally been designed and implemented.</abstract>
                <slug>berlin-buzzwords-2023-27989-introducing-multi-valued-vector-fields-in-apache-lucene</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/YPUPAA/Benedetti-Alessandro_Cg3KRSu.png</logo>
                <persons>
                    <person id='32602'>Alessandro Benedetti</person>
                </persons>
                <language>en</language>
                <description>Since the introduction of native vector-based search in Apache Lucene, many features have been developed, but support for multiple vectors in a dedicated KNN vector field remained to be explored.
Indexing (and searching) multiple values per field makes it possible to work with long textual documents by splitting them into paragraphs and encoding each paragraph as a separate vector, a scenario that many businesses encounter.
This talk explores the challenges, the technical design and the implementation work that went into this contribution to the Apache Lucene project.
The audience is expected to get an understanding of how multi-valued fields can work in a vector-based search use case and how this feature has been implemented.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/YPUPAA/resources/Alessandro_Benedetti_-_Multi_Valued_V_k7X51YQ.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YPUPAA/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YPUPAA/feedback/</feedback_url>
            </event>
            <event guid='934cbe59-3c42-50d3-92b4-9cea5d15b76c' id='27803' code='XEC7W3'>
                <room>Kesselhaus</room>
                <title>Privacy-Preserving Web Search</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:20</duration>
                <abstract>An ethical overview of how a privacy-focused search engine has to adapt its behavior from crawling to ranking web documents without knowing anything about the user and still be as relevant as possible</abstract>
                <slug>berlin-buzzwords-2023-27803-privacy-preserving-web-search</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/XEC7W3/Perinetti-Lara_cITESO7.png</logo>
                <persons>
                    <person id='32364'>Lara Perinetti</person>
                </persons>
                <language>en</language>
                <description>Our ubiquitous connection to the internet triggered awareness and concerns regarding privacy preservation issues.
However, while privacy is an increasingly well-known subject, a few points remain to be clarified. A privacy-preserving web search engine is expected not to track you via your query and click history, nor to sell your personal data. Nevertheless, it can use non-personal data to improve search relevance.
Moreover, using a privacy-focused web search engine means being ready to adapt the way you query it, adding the information the search engine does not have about you.
This talk will focus on how we can create a web search engine that preserves its users&apos; privacy while keeping its results relevant.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/XEC7W3/resources/Lara_Perinetti_-_Privacy-Preserving_W_wHGjPIY.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/XEC7W3/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/XEC7W3/feedback/</feedback_url>
            </event>
            <event guid='b70901fb-02b3-50a6-ba1b-d3a33aae8bba' id='28095' code='QUQFBY'>
                <room>Kesselhaus</room>
                <title>Towards a decentralized and collaborative search engine</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T15:20:00+02:00</date>
                <start>15:20</start>
                <duration>00:40</duration>
                <abstract>In this session we will share our vision towards an alternative, decentralized and collaborative search engine, from social considerations to technical implementation.</abstract>
                <slug>berlin-buzzwords-2023-28095-towards-a-decentralized-and-collaborative-search-engine</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/QUQFBY/Paponaud-Aline_Precup-Lucian_foOAT1B.png</logo>
                <persons>
                    <person id='32646'>Aline Paponaud</person><person id='32647'>Lucian Precup</person>
                </persons>
                <language>en</language>
                <description>There are many alternatives to traditional search engines, culminating in the recent breakthrough of ChatGPT. Furthermore, the trend is to move towards decentralized, peer-to-peer and community-driven architectures, as the recent movement towards Mastodon implementations testifies.

We wanted to bring these concepts from GitHub, Wikipedia and Mastodon to search engines and build the all.site platform - the collaborative search engine. We will show how developer communities can help organize information and build a new relevance model.

We will share with you the experience of this adventure: what we tried, what we learned, the limits encountered and the challenges to come. We will present the internal functioning of a search engine with its different modules, the architecture and the infrastructure, the notions of security and ethics, ending with the economic model and the prototypes currently in place.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/QUQFBY/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/QUQFBY/feedback/</feedback_url>
            </event>
            <event guid='08080b5f-c68a-5100-9247-db1329f30041' id='28170' code='JXSJB8'>
                <room>Kesselhaus</room>
                <title>How to not kill people</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T16:30:00+02:00</date>
                <start>16:30</start>
                <duration>00:40</duration>
                <abstract>As AI grows, software manages more risks to humans. Moving fast and breaking things won&apos;t do. We will look at aviation to learn how successful risk management structures might look in software &amp; AI.</abstract>
                <slug>berlin-buzzwords-2023-28170-how-to-not-kill-people</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/JXSJB8/Albertson-Lars_pvylmac.png</logo>
                <persons>
                    <person id='32681'>Lars Albertsson</person>
                </persons>
                <language>en</language>
                <description>With the rise of artificial intelligence, we give more control of our lives to software. We thereby introduce new risks, and the fatal Uber crash in 2018 is the first example of AI causing an accidental death. It will be up to us as software engineers to build systems safe and reliable enough to entrust with important decisions.  Our culture, however, includes praising companies that move fast and break things (Facebook), celebrate principled confrontation (Uber), fake self-driving demonstrations (Tesla), and are right, a lot (Amazon). As an industry, we need to radically improve to meet the challenge, or more people will die.

In this presentation, we will look at aviation - the industry most successful at continuously improving safety - and attempt to learn. We will look at aviation safety principles, compare with similar practices in software engineering, and see how we can translate safety principles that have worked well in aviation to the software engineering domain.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/JXSJB8/resources/Lars_Albertson_-_How_to_not_kill_peop_mC8D8Qu.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JXSJB8/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JXSJB8/feedback/</feedback_url>
            </event>
            <event guid='529f5070-81dc-5d9b-8a6f-3ebef3556cf6' id='24838' code='73UNZD'>
                <room>Kesselhaus</room>
                <title>The Debate Returns (with more vectors) Which Search Engine?</title>
                <subtitle></subtitle>
                <type>Panel</type>
                <date>2023-06-19T17:20:00+02:00</date>
                <start>17:20</start>
                <duration>01:00</duration>
                <abstract>It&apos;s that old question - which search engine should I choose for my project? Elasticsearch, Solr, Opensearch (all based on Lucene), or Vespa, or maybe one of the new vector search engines?</abstract>
                <slug>berlin-buzzwords-2023-24838-the-debate-returns-with-more-vectors-which-search-engine</slug>
                <track></track>
                
                <persons>
                    <person id='29789'>Charlie Hull</person>
                </persons>
                <language>en</language>
                <description>What&apos;s best for a particular use case? What advantages does one approach have over another? How has vector search changed the picture? Does it even matter which one I choose?

Moderator Charlie Hull from OpenSource Connections and expert panellists representing various search engine platforms will offer a lively &amp; balanced debate and Q&amp;A session to help you figure out the big question: Which Search Engine?</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/73UNZD/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/73UNZD/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Maschinenhaus' guid='c6fb8c6b-f3de-5975-8424-16bf52eead3a'>
            <event guid='4ec21e69-884b-5a51-aab6-62f33df54ebf' id='28152' code='WFSUKH'>
                <room>Maschinenhaus</room>
                <title>Declarative Data Collections for Portable Parallelism</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:20</duration>
                <abstract>This talk introduces a novel programming model: the user declares data collections with their properties, and these declarations can be transparently ported to multiple platforms including GPUs.</abstract>
                <slug>berlin-buzzwords-2023-28152-declarative-data-collections-for-portable-parallelism</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/WFSUKH/Li-Zhibo_hjNf5Bp.png</logo>
                <persons>
                    <person id='32673'>Zhibo Li</person>
                </persons>
                <language>en</language>
                <description>I would like to introduce Declarative Abstractions for Data Collections, which provides a novel, declarative approach to data collections for convenient, portable, and efficient parallel computation. Modern programming languages provide programmers with rich abstractions for data collections as part of their standard libraries, e.g., containers in the C++ STL, the Java Collections Framework, or the Scala Collections API. Typically, these collections frameworks are organized as hierarchies that provide programmers with common abstract data types (ADTs) like lists, queues, and stacks. While convenient, this approach introduces problems that ultimately affect application performance due to users over-specifying collection data types, limiting implementation flexibility.

With the introduced framework, programmers explicitly select properties for their collections, thereby truly decoupling specification from implementation. By making collection properties explicit, immediate benefits materialize in the form of reduced risk of over-specification and increased implementation flexibility. In terms of computational performance, our framework helps shield the application developer from parallel implementation details: the property-based data collection can be ported to multiple platforms, including GPU and FPGA, without modifying the declaration of the properties.

The framework provides a data-centric approach for high-performance computation, where users focus on what properties the container (collection) should have and do not need to work around implementation details. The framework has been developed based on C++ metaprogramming and provides a modern C++ API for its users. This framework will benefit the community as a convenient, high-performance programming model for parallel data processing in heterogeneous environments. The audience will get to know a practical programming model for data-centric parallelism, which is useful for everyday work involving parallel data analysis, data storage, filtering, etc.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/WFSUKH/resources/Zhibo_Li_-_Declarative_Data_Collectio_QSjRBtk.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WFSUKH/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WFSUKH/feedback/</feedback_url>
            </event>
            <event guid='d416bad1-3076-5be2-8b03-0b849a9882f9' id='27932' code='3GPYJQ'>
                <room>Maschinenhaus</room>
                <title>How to train your general purpose document retriever model</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T11:10:00+02:00</date>
                <start>11:10</start>
                <duration>00:40</duration>
                <abstract>A practical guide for training learned sparse models to outperform BM25 on zero-shot document retrieval tasks</abstract>
                <slug>berlin-buzzwords-2023-27932-how-to-train-your-general-purpose-document-retriever-model</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/3GPYJQ/Herreros-Quentin_Vaesey-Tom_LKemSHi.png</logo>
                <persons>
                    <person id='32548'>Tom Veasey</person><person id='32556'>Quentin Herreros</person>
                </persons>
                <language>en</language>
                <description>Large language models augment traditional information retrieval (IR) approaches with both high quality language parsing skills and knowledge external to the corpus. However, training a state of the art general purpose model for document retrieval is challenging. This talk is motivated by our experiences training a high quality retriever model for use alone or together with BM25 to improve relevance out-of-the-box in Elasticsearch.

We chose to focus on the learned sparse model (LSM) architecture. LSMs for IR were recently popularised by SPLADE [1] and have various attractive properties for our purpose. They enable retrieval via inverted indices, for which Elasticsearch has a high quality implementation in Lucene. They provide tuneable parameters which allow one to trade off accuracy against index size and query latency. They enable word level highlighting to explain matches. And they perform well in zero-shot settings.

In this talk we survey LSMs and discuss how they fit into the IR landscape. We describe some challenges in training language models effectively. We briefly survey some techniques which have been studied previously and found to improve performance both in and out of domain. These include downstream task aware pre-training and knowledge distillation. Finally, we give an overview of the key ingredients of our full training pipeline and useful lessons we learned along the way.

Our goal was to consistently improve on BM25 relevance in a zero-shot setting. In particular, we set out to beat BM25 across a suite of diverse IR tasks gathered together in the BEIR benchmark [2] without using any in-domain supervision. We survey other published results on this benchmark and discuss how we compare.

[1] SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking, Formal et al

[2] BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models, Thakur et al</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/3GPYJQ/resources/Tom_Veasey__Quentin_Herreros_-_How_to_ctHFNBC.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/3GPYJQ/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/3GPYJQ/feedback/</feedback_url>
            </event>
            <event guid='3e6b31e6-7e07-5261-81af-bb3dc0ef0e0f' id='27635' code='YPNYCD'>
                <room>Maschinenhaus</room>
                <title>ClickHouse: what is behind the fastest columnar database</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:40</duration>
                <abstract>Columnar databases seem to be full of mysteries and confusion. In this introduction to ClickHouse, we&apos;ll take apart its building blocks to see how it achieves its remarkable performance.</abstract>
                <slug>berlin-buzzwords-2023-27635-clickhouse-what-is-behind-the-fastest-columnar-database</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/YPNYCD/Kutsenko-Olena_fIN8I8F.png</logo>
                <persons>
                    <person id='32271'>Olena Kutsenko</person>
                </persons>
                <language>en</language>
                <description>ClickHouse, an open source columnar database, is in many ways exceptional - it is exceptionally fast, exceptionally efficient, but also, at times, exceptionally confusing.

Its approach to handling data goes against many principles and concepts that we use in other databases. To give some examples: its primary index doesn&apos;t index each row and doesn&apos;t guarantee uniqueness; a secondary index is used to skip data and doesn&apos;t point to specific rows; JOINs are a complex topic and transactions are only partially supported, not to mention that its SQL dialect has a couple of surprises up its sleeve.

But, all that said, if used correctly, ClickHouse is a superb solution for online analytical processing (OLAP).

The goal of this talk is to help you get the most out of ClickHouse and avoid the pitfalls. We&apos;ll talk about OLAP and columnar databases. We&apos;ll touch on indexing, searching and disk storage. We&apos;ll look at the reasons behind the most puzzling concepts of ClickHouse, so that by the end of the talk you find them not only logical, but maybe even fascinating.

If your challenge is analysing terabytes of data - this talk is for you. If you&apos;re a data scientist looking for tools to work with big data - this talk is for you. And, of course, if you are just curious about what makes ClickHouse crazy fast - this talk is for you as well.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/YPNYCD/resources/Olena_Kutsenko-clickhouse-slides-to-s_d951kuX.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YPNYCD/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YPNYCD/feedback/</feedback_url>
            </event>
            <event guid='cc22919b-9fad-52c9-bb76-0dbfba03d552' id='27965' code='8EQS3K'>
                <room>Maschinenhaus</room>
                <title>Tip of the Iceberg</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>Apache Iceberg is an open table format that has wide support among open-source and cloud vendors. After this talk, you&apos;ll be comfortable with all the concepts and how to use Iceberg.</abstract>
                <slug>berlin-buzzwords-2023-27965-tip-of-the-iceberg</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/8EQS3K/Driesprong-Fokko_mU2N8Z4.png</logo>
                <persons>
                    <person id='32582'>Fokko Driesprong</person>
                </persons>
                <language>en</language>
                <description>Apache Iceberg is a high-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data while making it possible for engines to work with the same tables, at the same time. Iceberg is a layer on top of your traditional Parquet tables with all the best practices from the database world. Using this you can do ACID operations on a table that solely lives in cloud storage.
In the talk, I&apos;ll first introduce Iceberg and its history, and the companies that are using and actively contributing to it. We&apos;ll take a peek under the hood and I&apos;ll explain the different concepts such as metadata, manifest lists, and the manifests themselves, and how Iceberg uses them to help the query engine and maintain correctness. Next, I&apos;ll go through schema, partition, and sort order evolution and how this is done in a lazy fashion so you don&apos;t have to rewrite your multi-petabyte table. Finally, I&apos;ll do a quick demo using PyIceberg.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/8EQS3K/resources/Fokko_Driesprong_-_Tip_of_the_Iceberg_wNjgvMC.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/8EQS3K/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/8EQS3K/feedback/</feedback_url>
            </event>
            <event guid='e157f6f5-6e8a-5c77-babc-eaf14c554bc9' id='27947' code='CUTBT7'>
                <room>Maschinenhaus</room>
                <title>No Mean Feat: Upgrading a Customized Solr to Upstream Solr</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:20</duration>
                <abstract>Learn how the News Search Infrastructure Team at Bloomberg migrated from a customized implementation of Apache Solr to the upstream Apache Solr</abstract>
                <slug>berlin-buzzwords-2023-27947-no-mean-feat-upgrading-a-customized-solr-to-upstream-solr</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/CUTBT7/Srivastava-Shikhar_030PmUK.png</logo>
                <persons>
                    <person id='32565'>Shikhar Srivastava</person>
                </persons>
                <language>en</language>
                <description>Technology upgrades are a key pillar of software infrastructure. However, upgrading search and information retrieval systems is a complex task. At Bloomberg, we had extended the open source Apache Solr implementation with in-house patches to customize it for our use cases. This made upgrading to a newer version quite challenging. But, when you have close to a billion documents that are used by major financial institutions across the world, you cannot afford any mistakes.

Learn how the News Search Infrastructure Team at Bloomberg migrated from a highly customized implementation of Apache Solr to the upstream Apache Solr, while also making sure that the quality, correctness, and performance of the system were not affected. You will learn about the different strategies we used before, during, and after the migration to make the upgrade transparent to our internal users, all while serving millions of requests every day.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/CUTBT7/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/CUTBT7/feedback/</feedback_url>
            </event>
            <event guid='9af68c2b-03b7-5f69-8aa6-1a53f5d6a5e5' id='27129' code='DDRGJG'>
                <room>Maschinenhaus</room>
                <title>Model Fine-tuning For Search: From Algorithms to Infra</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T15:20:00+02:00</date>
                <start>15:20</start>
                <duration>00:40</duration>
                <abstract>Deep learning for search has become a hot topic, but pre-trained neural nets often do not function as well as expected. We will discuss the algorithms behind model fine-tuning, and how to scale it up.</abstract>
                <slug>berlin-buzzwords-2023-27129-model-fine-tuning-for-search-from-algorithms-to-infra</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/DDRGJG/Wang-Bo_Werk-Maximilian_2Mp4WyM.png</logo>
                <persons>
                    <person id='34928'>Maximilian Werk</person><person id='31785'>Bo Wang</person>
                </persons>
                <language>en</language>
                <description>Deep learning for search has become a hot topic in recent years. It enables users to search based on semantics, search based on visual similarity, and conduct cross-modality and multi-modality searches.

Though promising, it is non-trivial to use deep neural nets inside your system and expect them to work out of the box. In fact, in most cases, they don&apos;t. The reasons fall into three categories: task shift, domain shift, and knowledge shift.

Firstly, most deep learning models are trained to minimize classification/regression/segmentation loss, rather than search loss. Secondly, the dataset on which the model was trained could be quite different from the data you&apos;re working on. Last but not least, we observed a notable knowledge gap between search engineers and machine learning engineers.

In this talk, we would like to gently guide the audience into the neural search world. We&apos;ll first discuss the motivation behind model tuning. Then, we&apos;ll discuss the algorithm frameworks behind model fine-tuning, such as deep metric learning, contrastive learning and self-supervised learning. Last but not least, we&apos;ll talk about the infrastructure behind a mature training service and how we could scale it up.

We believe the topic could be interesting for the Berlin Buzzwords audience since it covers several aspects of the tags: search, data science, and scale. After the 40-minute talk, the audience is expected to understand:
1. What neural search is and why it is important.
2. The algorithms to improve pre-trained neural nets for single-modality search/cross-modality search.
3. Our tech stack to scale the training platform up.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/DDRGJG/resources/Bo_Wang_-_Model_fine-tuning_for_searc_JILo8oI.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/DDRGJG/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/DDRGJG/feedback/</feedback_url>
            </event>
            <event guid='33bc9cb1-f730-5381-b9ed-e9f90516d594' id='27997' code='8WUWFL'>
                <room>Maschinenhaus</room>
                <title>Ingesting over 4 million rows a second on a single instance</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T16:30:00+02:00</date>
                <start>16:30</start>
                <duration>00:40</duration>
                <abstract>When we set out to write a fast open source time series database, we realised we would need every trick in the book to make it as performant as possible. This talk will show what&apos;s inside.</abstract>
                <slug>berlin-buzzwords-2023-27997-ingesting-over-4-million-rows-a-second-on-a-single-instance</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/8WUWFL/Ramirez-Javier_W9rQMdJ.png</logo>
                <persons>
                    <person id='32613'>Javier Ramirez</person>
                </persons>
                <language>en</language>
                <description>How would you build a database to support sustained ingestion of several hundred thousand rows per second while running near real-time queries on top?

In this session I will go over some of the technical decisions and trade-offs we made when building QuestDB, an open source time-series database developed mainly in Java, and how we can achieve over four million row writes per second on a single instance without blocking or slowing down the reads. There will be code and demos, of course.

We will also review the history of some of the changes we have gone through over the past two years to deal with late and unordered data, non-blocking writes, read-replicas, and faster batch ingestion.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/8WUWFL/resources/Javier_Ramirez_-_Ingesting-over-four-_tFj4GVs.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/8WUWFL/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/8WUWFL/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Palais Atelier' guid='2fd45b3f-34a5-5cdc-83ff-c5179ae5fb09'>
            <event guid='dbb33177-5550-5f34-8a64-8ea1e61cdd09' id='33356' code='ZNJLXG'>
                <room>Palais Atelier</room>
                <title>Migrate Data, &lt;Mesh&gt; in mind</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T10:40:00+02:00</date>
                <start>10:40</start>
                <duration>00:20</duration>
                <abstract>For quite some time, Hadoop served as the data warehouse for Kleinanzeigen. In this presentation, our objective is to provide an overview of our approach, which involves implementing a cloud-based data pipeline with the help of dbt and Airflow.</abstract>
                <slug>berlin-buzzwords-2023-33356-migrate-data-mesh-in-mind</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/ZNJLXG/Rende-Aydan_sWjaY9s.png</logo>
                <persons>
                    <person id='37495'>Aydan Rende</person>
                </persons>
                <language>en</language>
                <description>For quite some time, Hadoop served as the data warehouse for Kleinanzeigen. However, the central teams eventually decided to say goodbye to this old friend due to its outdated nature and high costs. This migration presented us with a valuable opportunity to embrace the Data Mesh strategy and establish a new data pipeline. In this presentation, our objective is to provide an overview of our approach, which involves implementing a cloud-based data pipeline with the help of dbt and Airflow. Furthermore, we will delve into the challenges we faced during the process, including the debugging of legacy data flows, the complexities of copying data to S3, and dealing with domain ownership issues. By sharing these experiences, we aim to provide valuable insights into our journey.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/ZNJLXG/resources/Aydan_Rende_-_Migrate_Data_Mesh_in_mi_0jyeP5d.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ZNJLXG/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ZNJLXG/feedback/</feedback_url>
            </event>
            <event guid='a6e6224d-5b95-5a1e-913f-39b21b9d74d3' id='33803' code='ZQ9CPX'>
                <room>Palais Atelier</room>
                <title>Supercharging your transformers with synthetic query generation and lexical search</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T11:10:00+02:00</date>
                <start>11:10</start>
                <duration>00:40</duration>
                <abstract>This talk will explore dramatic gains in ranking performance from small transformer models, fine-tuned with synthetic query generation and combined with lexical search, and will equip the audience to pursue the same approach using open-source tools.</abstract>
                <slug>berlin-buzzwords-2023-33803-supercharging-your-transformers-with-synthetic-query-generation-and-lexical-search</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/ZQ9CPX/Shyani-Milind_9d4jqVj.png</logo>
                <persons>
                    <person id='37902'>Milind Shyani</person>
                </persons>
                <language>en</language>
                <description>Pre-trained transformers have revolutionized search. However, off-the-shelf transformers, at a fixed model size, perform poorly on out-of-domain data. Larger models have better generalization capabilities but strict latency and cost requirements limit the size of production models.

In this talk, we will demonstrate how small transformer models can be fine-tuned on specific domains, even in the absence of labelled data, using the technique of synthetic query generation. Our process involves releasing a fine-tuned 1.5B parameter query generation model that, given a document, generates multiple questions that are answered by the document. These query-document combinations are then used to train a fine-tuned model. We combine the fine-tuned model with OpenSearch lexical search tools and benchmark them. Using these tools, we demonstrate a state-of-the-art, zero-shot nDCG@10 boost of 14.30% over BM25 on a benchmark of 10 public test datasets.

We elaborate upon lessons learned from training and using large language models for query generation. We also discuss some open questions around representation anisotropy, keyword filtering and index sizes of dense models. Ultimately, audiences will take away from the presentation an understanding of the processes used to fine-tune small transformer models and combine them with lexical search, along with step-by-step guidance with which to pursue their own improvements in search accuracy using open-source tools.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ZQ9CPX/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ZQ9CPX/feedback/</feedback_url>
            </event>
            <event guid='c890fc57-529d-5646-ada4-2091c5961f28' id='27590' code='TDWCGF'>
                <room>Palais Atelier</room>
                <title>Apache Airflow in Production - Bad vs Best Practices</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:40</duration>
                <abstract>This talk will explore the bad and best practices when deploying Apache Airflow in a production environment, from common pitfalls, such as misconfigured tasks and lack of scalability, to best practices, such as robust monitoring and proper security measures.</abstract>
                <slug>berlin-buzzwords-2023-27590-apache-airflow-in-production-bad-vs-best-practices</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/TDWCGF/Ravi-Bhavani_Zjdsflt.png</logo>
                <persons>
                    <person id='32221'>Bhavani Ravi</person>
                </persons>
                <language>en</language>
                <description>Apache Airflow has become a popular open-source platform for managing and orchestrating data pipelines. However, as with any technology, there are good and bad ways to use it. This talk will explore the bad and best practices when deploying Apache Airflow in a production environment. From common pitfalls, such as misconfigured tasks and lack of scalability, to best practices, such as robust monitoring and proper security measures, this talk will provide practical advice for anyone looking to implement Apache Airflow in their production environment.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/TDWCGF/resources/Bhavani_Ravi_-_Apache_Airflow_Bad_Vs__PmL5Ums.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/TDWCGF/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/TDWCGF/feedback/</feedback_url>
            </event>
            <event guid='31a974bb-eea1-5503-85e6-28484f57bca8' id='26844' code='XY8JRJ'>
                <room>Palais Atelier</room>
                <title>A Kafka Client&#8217;s Request: There and Back Again</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>Understand how data moves into and out of Apache Kafka&#174; by taking a look at the producer and consumer request life cycle. Follow a request from an initial call to send() or poll(), all the way to disk</abstract>
                <slug>berlin-buzzwords-2023-26844-a-kafka-client-s-request-there-and-back-again</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/XY8JRJ/Fine-Danica_GzqoRYg.png</logo>
                <persons>
                    <person id='30636'>Danica Fine</person>
                </persons>
                <language>en</language>
                <description>Do you know how your data moves into and out of your Apache Kafka&#174; instance? From the programmer&#8217;s point of view, it&#8217;s relatively simple. But under the hood, writing to and reading from Kafka is a complex process with a fascinating life cycle that&#8217;s worth understanding.

When you call producer.send() or consumer.poll(), those calls are translated into low-level requests which are sent along to the brokers for processing. In this session, we&#8217;ll dive into the world of Kafka producers and consumers to follow a request from an initial call to send() or poll(), all the way to disk, and back to the client via the broker&#8217;s final response. Along the way, we&#8217;ll explore a number of client and broker configurations that affect how these requests are handled and discuss the metrics that you can monitor to help you to keep track of every stage of the request life cycle.

By the end of this session, you&#8217;ll know the ins and outs of the read and write requests that your Kafka clients make, making your next debugging or performance analysis session a breeze.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/XY8JRJ/resources/Danica_Fine_-_A_Kafka_Clients_Request_c0zWlGI.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/XY8JRJ/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/XY8JRJ/feedback/</feedback_url>
            </event>
            <event guid='daa2e518-7f47-51f5-845f-3c2d6c872d92' id='31382' code='UNYQ83'>
                <room>Palais Atelier</room>
                <title>Joining Dozens of Data Streams in Distributed Stream Processing Systems</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:20</duration>
                <abstract>This talk will explore the techniques and best practices for joining dozens of data streams, focusing on different joining mechanisms, such as binary joins and delta joins, as well as their pros and cons.</abstract>
                <slug>berlin-buzzwords-2023-31382-joining-dozens-of-data-streams-in-distributed-stream-processing-systems</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/UNYQ83/Wu-Yingjun_upiPChu.png</logo>
                <persons>
                    <person id='35811'>Yingjun Wu</person>
                </persons>
                <language>en</language>
                <description>As real-time data processing becomes increasingly essential, organizations face the challenge of efficiently joining and correlating data from multiple streams to gain valuable insights using distributed stream processing systems. This talk will explore the techniques and best practices for joining dozens of data streams, focusing on different joining mechanisms, such as binary joins and delta joins, as well as their pros and cons. Attendees will gain an understanding of various stream join techniques, learn how to optimize performance in distributed environments and apply lessons from industry experiences. Furthermore, the talk will discuss leveraging decoupled compute-storage architecture to reduce join costs. This knowledge will enable participants to harness the full potential of their data, creating efficient and powerful distributed stream processing solutions for their organizations.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/UNYQ83/resources/Yingjun_Wu_-_joining_dozens_of_data_s_8bMIHY7.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/UNYQ83/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/UNYQ83/feedback/</feedback_url>
            </event>
            <event guid='5f2391d3-0c14-5bab-b42d-c88690336e6f' id='28166' code='JTD7GY'>
                <room>Palais Atelier</room>
                <title>Laptop-sized ML for Text, with Open Source</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T15:20:00+02:00</date>
                <start>15:20</start>
                <duration>00:40</duration>
                <abstract>Advanced ML models for text may need hundreds of machines, but with open source tools and pre-trained models, you can do a lot just on your laptop or docker container. Discover what and how!</abstract>
                <slug>berlin-buzzwords-2023-28166-laptop-sized-ml-for-text-with-open-source</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/JTD7GY/Burch-Nick_YqT4Xnn.png</logo>
                <persons>
                    <person id='32682'>Nick Burch</person>
                </persons>
                <language>en</language>
                <description>AI text models like GPT-3, ChatGPT, Bing AI and GitHub Copilot are getting a lot of buzz right now, both good and bad. Many of the training techniques are public, but the computational and data requirements mean most of us can&apos;t build our own. Using these big models typically involves cost or sharing your data. What if that&apos;s not an option?

Luckily, there are a number of open source language models out there, with pre-trained versions available to download! They won&apos;t let you compete with Google or OpenAI, but they&apos;re good enough for a number of real world problems.

We&apos;ll start with a quick introduction to the main open ML-for-text systems like Word2vec, GloVe, ELMo and BERT, along with how they differ from traditional text relevancy like TF-IDF. Then, we&apos;ll discover how open source ML frameworks let us easily work with those techniques, and how pre-trained models let us quickly get up and running.

With our ML-for-text model running on our laptop (or hefty docker container!), next it&apos;s time to see what kinds of problems we can solve! We&apos;ll look at embeddings for search, inference, semantic reasoning, prediction and more, all with (fairly) minimal coding. Finally, we&apos;ll see how we can improve the pre-trained models for specific use-cases with our own text.

It may not run on your phone and it probably won&apos;t hallucinate incorrect answers, but there are still a lot of text problems we can solve just with open source on our laptops. And we&apos;ll share the code you need to do so!</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/JTD7GY/resources/Nick_Burch_-_Laptop-sized_ML_for_Text_NugssWi.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JTD7GY/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JTD7GY/feedback/</feedback_url>
            </event>
            <event guid='82782ed4-fd80-501e-b66d-eb89b6245de8' id='26702' code='DPBSAR'>
                <room>Palais Atelier</room>
                <title>ML with Domain-Specific Ontology for IT Security Industry</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T16:30:00+02:00</date>
                <start>16:30</start>
                <duration>00:40</duration>
                <abstract>The BSI provides current data on acute IT threat situations. We developed a system for detecting threats: crawling, automatic analysis with NER, NEL, provision and use of dedicated tools for evaluating</abstract>
                <slug>berlin-buzzwords-2023-26702-ml-with-domain-specific-ontology-for-it-security-industry</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/DPBSAR/Wu-Qi_YtFtyWd.png</logo>
                <persons>
                    <person id='31377'>Qi Wu</person><person id='36867'>Bertram S&#228;ndig</person>
                </persons>
                <language>en</language>
                <description>The BSI monitors and assesses the current IT security situation and its long-term changes. This includes, for example, hacker groups or newly discovered security vulnerabilities. For this purpose, various news sources are monitored and important information is extracted to identify current trends and gain an overview.

To optimize this process, we are working with the BSI to develop a system that supports the work by subjecting documents to automatic analysis using methods such as Named Entity Recognition (NER) and Named Entity Linking (NEL). While NER refers to the mapping of text passages to given classes through machine learning (e.g., &quot;browser&quot; to software), NEL aims at mapping to concrete entities of an ontology (e.g., &quot;DOS&quot; to &quot;Disk Operating System&quot;). We explain how we deal with the particular challenge of conceptual ambiguities (&quot;DOS&quot; stands not only for &quot;Disk Operating System&quot; but also for &quot;Denial of Service&quot;). The talk gives an insight into our entity recognition system and how we create a powerful tool for analyzing IT security documents by combining ontology and machine learning.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/DPBSAR/resources/Qi_Wu__Bertram_Saendig_-_IT-Sec_NEL_wCdYsth.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/DPBSAR/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/DPBSAR/feedback/</feedback_url>
            </event>
            <event guid='dc0a716c-3dca-54a3-9f65-8528c0f8dd96' id='26832' code='JGCR9K'>
                <room>Palais Atelier</room>
                <title>Building Real-Time Applications: Cyclist Crash Detection</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T17:20:00+02:00</date>
                <start>17:20</start>
                <duration>00:40</duration>
                <abstract>In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes.</abstract>
                <slug>berlin-buzzwords-2023-26832-building-real-time-applications-cyclist-crash-detection</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/JGCR9K/Neubauer-Tomas_wcLQjFL.png</logo>
                <persons>
                    <person id='31507'>Tom&#225;&#353; Neubauer</person>
                </persons>
                <language>en</language>
                <description>As the demand for real-time data processing continues to grow, so too do the challenges associated with building production-ready applications that can handle large volumes of data quickly. In this talk, we will explore common problems faced when building real-time applications at scale, with a focus on a specific use case: detecting and responding to cyclist crashes.

Using telemetry data collected from a fitness app, we&#8217;ll demonstrate how we used a combination of Apache Kafka and Python-based microservices running on Kubernetes to build a pipeline for processing and analyzing this data in real time. We&apos;ll also discuss how we used machine learning techniques to build a model for detecting collisions and how we implemented notifications to alert family members of a crash.

Our ultimate goal is to help you navigate the challenges that come with building data-intensive, real-time applications that use ML models. By showcasing a real-world example, we aim to provide practical solutions and insights that you can apply to your own projects.
Key takeaways:
&#8226; An understanding of the common challenges faced when building real-time applications at scale
&#8226; Strategies for using Apache Kafka and Python-based microservices to process and analyze data in real time
&#8226; Tips for implementing machine learning models in a real-time application
&#8226; Best practices for responding to and handling critical events in a real-time application</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JGCR9K/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JGCR9K/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Frannz Salon' guid='53bc35d4-3bfa-5d32-b6f3-228e6d6dd639'>
            <event guid='f0fa6de6-c869-5bfc-8b95-fedbef42c12e' id='26706' code='79AVEA'>
                <room>Frannz Salon</room>
                <title>Using TensorFlow in a Solr Query Parser</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T11:10:00+02:00</date>
                <start>11:10</start>
                <duration>00:40</duration>
                <abstract>Tutorial for writing a Solr query parser that uses TensorFlow for Java to augment queries.</abstract>
                <slug>berlin-buzzwords-2023-26706-using-tensorflow-in-a-solr-query-parser</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/79AVEA/Gheorghe-Radu_Kuc-Rafal_GLnqTiW.png</logo>
                <persons>
                    <person id='31241'>Radu Gheorghe</person><person id='31477'>Rafa&#322; Ku&#263;</person>
                </persons>
                <language>en</language>
                <description>Typically, when you need to expand a query through a model - for example, to do entity recognition or query tagging - you&apos;d use a separate service. While this architecture is perfectly valid, the extra network hops to the &quot;query expansion microservices&quot; will impact query latency.

For autocomplete and other low-latency use-cases, you might want to trade some complexity for speed by implementing a custom query parser. In this talk, we&apos;ll show a working example:
- we&apos;ll build a model using TensorFlow in Python that does query expansion
- we&apos;ll load it with TensorFlow for Java in a Solr query parser
- now we can run queries and get them expanded directly in Solr

One can use this talk and the resources we&apos;ll share in order to implement a query parser for their own use-case. We&apos;ll also expand on the architecture trade-offs. For example, as you add more nodes and replicas to handle more query throughput, you&apos;ll expand the capacity for query expansion. Should you need to scale these separately, you can use coordinator nodes.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/79AVEA/resources/Radu_Gheorghe__Rafal_Kuc_-_Using_Tens_lUNhggh.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/79AVEA/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/79AVEA/feedback/</feedback_url>
            </event>
            <event guid='8f4a91a7-3535-5158-90d9-712c7f153b77' id='27866' code='UFSE8P'>
                <room>Frannz Salon</room>
                <title>When Probably is Good Enough</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:40</duration>
                <abstract>Probabilistic data structures give developers room to massively cut down on space requirements while sacrificing a bit of accuracy, so when is probably good enough?</abstract>
                <slug>berlin-buzzwords-2023-27866-when-probably-is-good-enough</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/UFSE8P/Norem-Savannah_rqbx521.png</logo>
                <persons>
                    <person id='32483'>Savannah Norem</person>
                </persons>
                <language>en</language>
                <description>Examining the probabilistic data structures that come built into Redis Stack will allow us to fully understand how, why and when they work best. We&apos;ll examine each of the following: Count-Min Sketch, Top-K, and Bloom and Cuckoo filters. Each of these has a distinct structure that we&apos;ll start with so we can see how they work. We&apos;ll then look at why each one is probabilistic and what the consequences of that are. Then we&apos;ll look at use cases for each to see when they would best be used in the wild. We&apos;ll wrap up with a demonstration of the space-saving capabilities, for example the size difference between a Bloom filter and a set with the same items added to each.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/UFSE8P/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/UFSE8P/feedback/</feedback_url>
            </event>
            <event guid='a7c16bc9-dce2-543c-a4d2-2fd5df6c0628' id='28158' code='PN9VJK'>
                <room>Frannz Salon</room>
                <title>Cooking up a new search system: Recipe search at Cookpad</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>How we successfully transitioned the search system for the world&apos;s largest recipe sharing platform to a modern stack &#8211; including the successes, failures, team structures, and processes along the way.</abstract>
                <slug>berlin-buzzwords-2023-28158-cooking-up-a-new-search-system-recipe-search-at-cookpad</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/PN9VJK/Williams-Matt_LjYEqVF.png</logo>
                <persons>
                    <person id='32679'>Matt Williams</person>
                </persons>
                <language>en</language>
                <description>Cookpad is the largest recipe sharing platform in the world. Our mission is to make everyday cooking fun, and central to that is our search product. Our search engine helps cooks everywhere find tasty dishes to cook from our ever-growing catalogue of six million recipes created by everyday cooks. As a global recipe search &#8211; available in 70+ countries, 30+ languages, and to over 50 million monthly users &#8211; delivering this is no mean feat.

In order to prepare for a substantial new iteration of our search product, we realised that our existing search system was not suited to our goals. Over the course of two years, we embarked on a technological and cultural transition, with the aim of giving product teams and engineers (including search engineers, data scientists, and ML engineers) greater ownership over the search experience. This included a shift to a Python-based stack and a data-driven approach to search relevance improvement.

We embarked on a transition to a new system, along with new team structures and team composition, and new strategies for improving the search experience. Over two years we delivered a new system, without halting product development along the way, and without disruption to the user experience.

Our starting point was a team and system with capacity limited to maintenance and bug fixes, where relevance enhancement was delivered through incremental knowledge base tuning by SMEs (non-engineer subject matter experts). Our end point was multiple search teams who have greater ownership over the search experience and relevance improvement, assisted by SMEs, and following a process for rigorously tested hypothesis-driven experimentation.

This change involved transitioning to a new event-driven architecture, along with technologies that were new to Cookpad search, such as Kubernetes, Kafka, Python, and machine learning. In addition &#8211; and just as importantly &#8211; it also involved a transformation in team structures and team composition, for which we borrowed many concepts and practices learned from the search community, and also ideas from the Team Topologies movement.

This talk will cover our journey, why we did it, as well as the trials, tribulations, and successes along the way. Hopefully, it will give others who are in a similar position new ideas on how to reinvent their own search system and search function, while minimising disruption to product delivery, in order to deliver proven improvements at pace.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/PN9VJK/resources/Matt_J_Williams_-_Cooking_up_a_new_Se_psIl09j.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/PN9VJK/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/PN9VJK/feedback/</feedback_url>
            </event>
            <event guid='c6a05577-3bae-5ac6-aad1-a0a91b8d7da8' id='27991' code='DFQK8E'>
                <room>Frannz Salon</room>
                <title>Big data in the service of reliable news</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-19T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:20</duration>
                <abstract>Data vs. fake news: using available data to offer a critical view of the world</abstract>
                <slug>berlin-buzzwords-2023-27991-big-data-in-the-service-of-reliable-news</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/DFQK8E/Pop-Radu_Dauvissat-Benjamin_423ULmT.png</logo>
                <persons>
                    <person id='32558'>Radu Pop</person><person id='32607'>Benjamin Dauvissat</person>
                </persons>
                <language>en</language>
                <description>Brandolini&apos;s law states that debunking misinformation consumes more energy than spreading it.  
With tools that organize, transform and present data, we can reduce this energy and offer a way to view the world skeptically, as an alternative to short messages on social media or reinterpreted news headlines.
These tools need to query a large base of data and to select heterogeneous sources of information.  
The main challenge lies in the ability to harvest, aggregate and present the emerging facts concisely.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/DFQK8E/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/DFQK8E/feedback/</feedback_url>
            </event>
            <event guid='51aa2d2e-e86b-5505-8a92-19fab7386c2b' id='28154' code='VEQHVW'>
                <room>Frannz Salon</room>
                <title>Building On-Ramps for Non-Code Contributors in Open Source</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T15:20:00+02:00</date>
                <start>15:20</start>
                <duration>00:40</duration>
                <abstract>Open source software is so much more than code &#8211; docs, community and infra need maintaining. How do you attract and keep non-code contributors? Let two experienced practitioners show you the way!</abstract>
                <slug>berlin-buzzwords-2023-28154-building-on-ramps-for-non-code-contributors-in-open-source</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/VEQHVW/Vlatko-Natali_Horgan-Celeste_RCygfpy.png</logo>
                <persons>
                    <person id='32675'>Natali Vlatko</person><person id='35311'>Celeste Horgan</person>
                </persons>
                <language>en</language>
                <description>&#8220;Contributions welcome!&#8221; We&#8217;ve all seen this standard disclosure across open source projects, whether they&#8217;re actively looking for contributors or not. We build working groups, backlogs and enhancement proposal processes designed to bring in developers to help us work on our project, and with any luck it succeeds. But supporting open source means supporting the entire project: documentation, community meetings, events and infrastructure. One of the most common questions smaller projects have is how to attract these non-code contributors, but there aren&#8217;t any easy answers out there. 

In this talk, experienced non-technical contributors in the Kubernetes and cloud native ecosystems go through some of the ways Kubernetes has built contributor on-ramps for non-code contributions, and how you can adapt them to your projects. The talk features practical examples of what to do to support the non-code aspects of your open source project and how to attract &#8211; and retain &#8211; contributors.

This talk will cover the most common question we continually face in the cloud native community from non-k8s projects: how to attract non-technical contributors and get things done. This is clearly seen as a scalability issue for many OSS projects. We&apos;ll also share a bit about our individual stories: how we, two non-code contributors, got involved with the Kubernetes project and the challenges we faced.

Next, what we know has worked: mentorship programs, pairing/shadowing, clear role documentation, easy to understand backlogs (good first issues), and other approaches we&#8217;ve seen that we haven&#8217;t applied personally. Finally, we&apos;ll cover why non-code contribution is both exciting and important, especially regarding governance and policy, plus how open source maintainers and companies with OSPOs can help non-code contributors get involved.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/VEQHVW/resources/Natali-Vlatko_Celeste-Horgan-Building_wewfGLh.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/VEQHVW/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/VEQHVW/feedback/</feedback_url>
            </event>
            <event guid='2ef89655-81f1-555d-80d0-2156c0b5ea99' id='27533' code='VFWKWM'>
                <room>Frannz Salon</room>
                <title>Who broke the build? - Using Kuttl to test and release faster</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T16:30:00+02:00</date>
                <start>16:30</start>
                <duration>00:40</duration>
                <abstract>No one wants to be responsible for breaking the build. But what can you do as a developer to avoid being the bad guy? How can project leads enable their teams to reduce the occurrence of broken builds?</abstract>
                <slug>berlin-buzzwords-2023-27533-who-broke-the-build-using-kuttl-to-test-and-release-faster</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/VFWKWM/chukka-ram_0UmILew.png</logo>
                <persons>
                    <person id='32181'>Ram Mohan Rao Chukka</person>
                </persons>
                <language>en</language>
                <description>Description:

No one wants to be responsible for breaking the build. But what can you do as a developer to avoid being the bad guy? How can project leads enable their teams to reduce the occurrence of broken builds?
In talking within our own teams, we discovered that many developers weren&#8217;t running sufficient integration and end-to-end tests in their local environments because it&#8217;s too difficult to set up and administer test environments efficiently.
That&#8217;s why we decided to rethink our entire local testing process in hopes of cutting down on the headaches, heartaches, and valuable time wasted. Enter Kuttl. Connecting Kuttl to CI builds has empowered our developers to easily configure a development environment locally that accurately matches the final test environment &#8212; without needing to become an expert CI admin themselves.
These days, we hear, &#8220;Who broke the build?&#8221; far less often &#8212; and you can too!

Session Outline:

In this session, we&#8217;ll discuss how we use Kuttl to achieve more streamlined testing and fewer broken builds. We&#8217;ll cover:
&#9679; A quick history of our testing challenges and what led us to Kuttl
&#9679; The benefits of our new testing approach &#8212; easy to configure and minimal investment
&#9679; How we combine Kuttl and CI pipelines for more streamlined testing and fewer broken builds

Session Key Takeaways:

1. When and why we decided to rethink our e2e testing practices and our subsequent discovery of Kuttl.
2. Why Kuttl has been the perfect tool for our developers to perform better local integration/e2e testing without the burden of becoming their own CI administrators.
3. A detailed account of how we utilize Kuttl to set up development environments locally that match our final test environment in order to reduce unnecessary commits and minimize CI build breaks.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/VFWKWM/resources/Ram_Chukka_-_Who_broke_the_build_fina_5wjkVo8.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/VFWKWM/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/VFWKWM/feedback/</feedback_url>
            </event>
            <event guid='cf8adf2e-3c5b-5f63-8d55-216d4537f9bd' id='27845' code='9YHQK8'>
                <room>Frannz Salon</room>
                <title>Creating chaos in containers</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-19T17:20:00+02:00</date>
                <start>17:20</start>
                <duration>00:40</duration>
                <abstract>Chaos engineering is hard; in containers it is even harder.
This session will show attendees the considerations and get them started on their way to making more resilient applications in the cloud.</abstract>
                <slug>berlin-buzzwords-2023-27845-creating-chaos-in-containers</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/9YHQK8/Saidel-Keesing-Maish_1Wrlxdb.png</logo>
                <persons>
                    <person id='32457'>Maish Saidel-Keesing</person>
                </persons>
                <language>en</language>
                <description>Chaos engineering is not a new concept; it has been around since 2011. Knowing the weak spots of your application before it actually breaks is extremely valuable.
But with containers, this becomes a bit more complicated. There are many layers of possible failure running under your application.

In this session you will learn more about the different layers you should be releasing your chaos experiments on, the considerations you need to take into account while testing a shared platform, and also learn about the tooling available to accomplish this.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/9YHQK8/resources/Maish_Saidel-Keesing_-_Creating_chaos_jOGYFzb.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/9YHQK8/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/9YHQK8/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    <day index='3' date='2023-06-20' start='2023-06-20T04:00:00+02:00' end='2023-06-21T03:59:00+02:00'>
        <room name='Kesselhaus' guid='0eebb650-d6b8-53cb-9d33-48dd47df08c1'>
            <event guid='50d50f17-bb11-5b11-8009-472c460532b0' id='28105' code='NNNZ8W'>
                <room>Kesselhaus</room>
                <title>What&apos;s coming next with Apache Lucene?</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T09:30:00+02:00</date>
                <start>09:30</start>
                <duration>00:20</duration>
                <abstract>This talk will discuss the directions Apache Lucene might take in the coming years. From the perspective of a full-text search engine, it looks like it is feature-complete. So what comes next?</abstract>
                <slug>berlin-buzzwords-2023-28105-what-s-coming-next-with-apache-lucene</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/NNNZ8W/Schindler-Uwe_Gzxeoma.png</logo>
                <persons>
                    <person id='32467'>Uwe Schindler</person>
                </persons>
                <language>en</language>
                <description>Around Lucene 8, most people thought &quot;There&apos;s not much that can be done anymore&quot;. In contrast to that, if you look at Apache Lucene&apos;s list of new features after each release, you will see mostly two areas of improvement: vector search and performance.
Is this the end of development? Definitely not! This talk will look at how ongoing optimizations in the Java ecosystem might be implemented in Apache Lucene. As an example, it will present the new vector incubator module in recent JDKs and how it helps to make indexing and searching much faster starting with Java 20.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/NNNZ8W/resources/WhatsComingNextWithApacheLucene2023_a2J1CDa.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/NNNZ8W/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/NNNZ8W/feedback/</feedback_url>
            </event>
            <event guid='b1c4a591-f0fb-5cd4-b518-72565a01dfe9' id='27962' code='YLCZP8'>
                <room>Kesselhaus</room>
                <title>Building MLOps Infrastructure at Japan&apos;s Largest C2C E-Commerce Site</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T10:00:00+02:00</date>
                <start>10:00</start>
                <duration>00:40</duration>
                <abstract>The MLOps infrastructure we built to support ML in search at Mercari, Japan&#8217;s largest C2C e-commerce platform.</abstract>
                <slug>berlin-buzzwords-2023-27962-building-mlops-infrastructure-at-japan-s-largest-c2c-e-commerce-site</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/YLCZP8/Ginstrom-Ryan_Narboneta-Zosa-Teo_4htELvp.png</logo>
                <persons>
                    <person id='32576'>Ryan Ginstrom</person><person id='32578'>Teo Narboneta Zosa</person>
                </persons>
                <language>en</language>
                <description>We describe the system we built to support ML in search at Mercari, Japan&#8217;s largest C2C e-commerce platform. We start by describing the journey to enable the use of ML in a &#8220;traditional&#8221; term-based search infrastructure with high throughput and strict latency requirements. We also discuss the mixed blessing of rushing a successful proof of concept into production and the technical challenges this posed on the infrastructure side.

Next, we discuss the nuts and bolts of data engineering, ETLs, training pipelines, and serving/monitoring our ML model in production. We also discuss some of the weaknesses of our initial homegrown system, including A/B testing and model monitoring. Finally, we discuss our efforts to evolve our homegrown system into a more modern MLOps infrastructure using an A/B testing framework and Seldon for traffic routing and model serving.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YLCZP8/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YLCZP8/feedback/</feedback_url>
            </event>
            <event guid='16cf1689-ebfa-502d-805d-0ba0cb5fdcfa' id='28088' code='TVXR9Q'>
                <room>Kesselhaus</room>
                <title>Synthetic data:  when, why, and how</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T11:00:00+02:00</date>
                <start>11:00</start>
                <duration>00:40</duration>
                <abstract>This talk will cover several use cases in which generating synthetic data is useful (or even essential) and introduce a toolbox of practical techniques for synthesizing data in these situations.</abstract>
                <slug>berlin-buzzwords-2023-28088-synthetic-data-when-why-and-how</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/TVXR9Q/Benton-William_JsoGt5G.png</logo>
                <persons>
                    <person id='32575'>William Benton</person>
                </persons>
                <language>en</language>
                <description>Data is essential to today&apos;s most interesting applications and systems, which learn from data, act autonomously in response to data, and make data digestible via search. Somewhat counterintuitively, as the importance of real data has increased, the importance of synthetic data has increased as well. In this talk, you&apos;ll learn when it&apos;s appropriate to use synthetic data (and when it isn&apos;t likely to help).  You&apos;ll also learn about several circumstances in which synthetic data is especially useful, including dealing with personally-identifying information, load testing, and simulating system response to unlikely scenarios.  The talk will conclude by providing brief, actionable introductions to several practical approaches to generating synthetic tabular data, each of which is appropriate for particular kinds of synthetic data use cases:  we&apos;ll cover a simple way to simulate data-generating processes from first principles, basic and more sophisticated statistical techniques, and approaches based on machine learning models.  You&apos;ll leave with a better understanding of the role of synthetic data in today&apos;s systems and a concrete toolbox of ways to exploit it in your own programs.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/TVXR9Q/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/TVXR9Q/feedback/</feedback_url>
            </event>
            <event guid='9f412721-a447-5047-b5d6-6ab723309191' id='28003' code='ASWYUN'>
                <room>Kesselhaus</room>
                <title>Search saves lives: solving healthcare problems with search</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T11:50:00+02:00</date>
                <start>11:50</start>
                <duration>00:40</duration>
                <abstract>During covid the pressure was on for search. I&#8217;ll discuss the challenges of building a search engine matching people to covid test facilities and how the lessons learned can solve healthcare issues.</abstract>
                <slug>berlin-buzzwords-2023-28003-search-saves-lives-solving-healthcare-problems-with-search</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/ASWYUN/Hutchinson-Chris_MZrHbaK.png</logo>
                <persons>
                    <person id='37900'>Chris Hutchinson</person>
                </persons>
                <language>en</language>
                <description>At its peak, the UK&#8217;s National Health Service (NHS) was covid testing almost half a million people per week. When demand for these appointments began to outstrip supply, search result relevance suffered. In some extreme cases, people were recommended to cross a body of water to get a test. This was a risk to public health as the NHS wanted to avoid anyone that had covid using public transport.  

To solve this, the NHS needed to switch the way they filtered search results. Instead of using straight-line (Euclidean) distance, they wanted to filter results based on travel times. They also needed a way to tailor results based on whether the searcher had access to a car and ensure public transport was avoided.

There were many technical challenges to delivering this kind of search. 
- High user demand - needed to be able to handle 100,000 users searching concurrently
- Response times - deliver test centre locations in under 50 milliseconds
- User data privacy - ensuring no customer data will ever be at risk
- Security - ensuring no tampering with data

There was no room for months-long stress tests. It needed to deliver on performance instantly. In my presentation I&#8217;ll walk through how we built this search under a super tight deadline. 

I&#8217;ll also walk through many other applications of search in healthcare, including:
- Managing Europe&#8217;s nursing shortages
- Improving the efficiency of emergency services 
- Matching mobile doctors to patients</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/ASWYUN/resources/Chris_Hutchinson_-_Search_saves_lives_mF8UArv.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ASWYUN/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ASWYUN/feedback/</feedback_url>
            </event>
            <event guid='a426f2e9-a299-5acb-9d08-94fae1f61ed8' id='27972' code='W7YXCM'>
                <room>Kesselhaus</room>
                <title>ChatGPT is lying, how can we fix it?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>Large Language Models are great at grammar but tend to confabulate. Building a reliable knowledge base might be a way to solve this. Here is how.</abstract>
                <slug>berlin-buzzwords-2023-27972-chatgpt-is-lying-how-can-we-fix-it</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/W7YXCM/Lukawski-Kacper_ocVG9uU.png</logo>
                <persons>
                    <person id='32590'>Kacper &#321;ukawski</person>
                </persons>
                <language>en</language>
                <description>ChatGPT was a revolution nobody was ready for. All the social channels have been flooded with prompts and answers which look OK at first glance but turn out to be counterfeit. Factuality is the biggest concern about Large Language Models, not only the OpenAI product. If you build an app with LLMs, you need to be aware of this.

Retrieval Augmented Language Models seem to be the solution to overcome that issue. They combine LLMs&apos; language capabilities and the knowledge base&apos;s accuracy. The talk will review possible ways to implement it with humans in the loop.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/W7YXCM/resources/Kacper_Lukawski_-_ChatGPT_is_lying_ho_DgMwFKH.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/W7YXCM/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/W7YXCM/feedback/</feedback_url>
            </event>
            <event guid='e3e2bd13-7627-5177-bb2c-540886ee0739' id='28002' code='JYNNE9'>
                <room>Kesselhaus</room>
                <title>Cross Data Center Replication in Solr - A new approach</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:40</duration>
                <abstract>Learn about the motivation that led to the development of the new Cross Data Center (XDC) Replication module in Apache Solr and discover the capabilities it offers making it disaster ready.</abstract>
                <slug>berlin-buzzwords-2023-28002-cross-data-center-replication-in-solr-a-new-approach</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/JYNNE9/Gupta-Anshum_gtsjLct.png</logo>
                <persons>
                    <person id='32442'>Anshum Gupta</person><person id='35142'>Mark Miller</person>
                </persons>
                <language>en</language>
                <description>Apache Solr is a critical piece of infrastructure for most companies dealing with data. The systems that Solr powers are critical, requiring high availability, low latency, and disaster recovery.

This talk introduces a new approach to cross data-center replication in Solr that allows for the feature to scale and ensure disaster readiness as well as lower latency at a scale that Solr is expected to support. 

The audience will be provided a design overview including the challenges and approaches we tried. We will also introduce the current capabilities and our plan for this newly added Solr XDC module.

At the end of this talk, attendees will have a better understanding of how and when to use the new module to ensure disaster readiness for the Solr cluster, as well as the avenues for them to participate in enhancing the solution.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JYNNE9/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/JYNNE9/feedback/</feedback_url>
            </event>
            <event guid='5ee20920-0cb7-5b7b-a2fd-5c6b33ef320d' id='26157' code='NPKZHP'>
                <room>Kesselhaus</room>
                <title>Column-level lineage is coming to the rescue</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T16:00:00+02:00</date>
                <start>16:00</start>
                <duration>00:40</duration>
                <abstract>How are the columns containing sensitive data used across the data ecosystem? What input columns were used to produce a given report field? OpenLineage can answer those questions automatically.</abstract>
                <slug>berlin-buzzwords-2023-26157-column-level-lineage-is-coming-to-the-rescue</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/NPKZHP/Leszczynski-Pawel_Obuchowski-Maciei_YpTxHsG.png</logo>
                <persons>
                    <person id='30912'>Pawe&#322; Leszczy&#324;ski</person><person id='30952'>Maciej Obuchowski</person>
                </persons>
                <language>en</language>
                <description>OpenLineage is a standard for metadata and lineage collection that is growing rapidly. Column-level lineage, developed recently, is one of the features most anticipated by the community. In this talk, we:
 * show the foundations for column lineage within the OpenLineage standard,
 * provide a real-life demo of how it is automatically extracted from Spark jobs,
 * describe and demo column lineage extraction from SQL queries,
 * show how the lineage can be consumed on the Marquez backend.

We aim to provide demos that focus on the practical aspects of column-level lineage, which are interesting to data practitioners all over the world.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/NPKZHP/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/NPKZHP/feedback/</feedback_url>
            </event>
            <event guid='0ea9e5f3-1b29-5342-a891-49b3def101ea' id='28111' code='BWNJZN'>
                <room>Kesselhaus</room>
                <title>Tiny Flink &#8212; Minimizing the memory footprint of Apache Flink</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T16:50:00+02:00</date>
                <start>16:50</start>
                <duration>00:20</duration>
                <abstract>We will explore options to run Apache Flink with a very low resource footprint, allowing users to run full streaming SQL queries or custom streaming applications on JVMs with less than 500 MB.</abstract>
                <slug>berlin-buzzwords-2023-28111-tiny-flink-minimizing-the-memory-footprint-of-apache-flink</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/BWNJZN/Metzger-Robert_Z9djZCG.png</logo>
                <persons>
                    <person id='32659'>Robert Metzger</person>
                </persons>
                <language>en</language>
                <description>Apache Flink has been designed for, and is mostly used with, large-scale real-time data processing use cases. Companies report TBs of data being processed per second, or TBs of state in huge clusters.

But what if you need to process low-throughput streams? Running a full, distributed Flink cluster might be overkill, as there&#8217;s quite a bit of overhead for distributed coordination.

In this talk, we&#8217;ll explore options to reduce your resource footprint. We&#8217;ll dive deeper into Flink&#8217;s MiniCluster, which allows you to run Flink in-JVM for integration tests, as a microservice or just a small processing your data in Kubernetes. We will also discuss lessons learned from running MiniCluster in production for a service offering Flink SQL in the cloud.

Attend this talk if you want to learn about Apache Flink and its various options to deploy and configure it.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/BWNJZN/resources/Robert_Metzger_-_Tiny_Flink_VThaNrK.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/BWNJZN/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/BWNJZN/feedback/</feedback_url>
            </event>
            <event guid='5a63b0a9-46a8-545b-9598-5f0c3d0e5bbb' id='31135' code='YCHTJK'>
                <room>Kesselhaus</room>
                <title>Closing Session</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T17:15:00+02:00</date>
                <start>17:15</start>
                <duration>00:15</duration>
                <abstract></abstract>
                <slug>berlin-buzzwords-2023-31135-closing-session</slug>
                <track></track>
                
                <persons>
                    
                </persons>
                <language>en</language>
                
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YCHTJK/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/YCHTJK/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Maschinenhaus' guid='c6fb8c6b-f3de-5975-8424-16bf52eead3a'>
            <event guid='154c79a1-6aa6-517b-a912-959b772e538c' id='28114' code='WS7LUL'>
                <room>Maschinenhaus</room>
                <title>Avoiding Anti-patterns in Technical Communication</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T09:30:00+02:00</date>
                <start>09:30</start>
                <duration>00:20</duration>
                <abstract>Communicating technical knowledge effectively is a core skill for practitioners, but one which is often neglected. We&#8217;ll give practical advice on how to (and not to!) communicate technical ideas.</abstract>
                <slug>berlin-buzzwords-2023-28114-avoiding-anti-patterns-in-technical-communication</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/WS7LUL/Watson-Sophie_k2Qy4yU.png</logo>
                <persons>
                    <person id='32660'>Sophie Watson</person>
                </persons>
                <language>en</language>
                <description>Practitioners and researchers alike share technical knowledge across a wide range of media, from blogs and conference talks to internal presentations and Slack messages. However, communicating technical information effectively is not an easy skill to learn, and every day we are bombarded with poorly communicated content. In this talk we&#8217;ll cover some common, but rarely recognised, anti-patterns in technical communication. We&#8217;ll dive into why they are an ineffective way to get a point across, and discuss how to avoid them in your content. You will leave the talk with a clear understanding of how to improve your technical communication, making your blogs, talks and day-to-day discussions more effective and impactful.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WS7LUL/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WS7LUL/feedback/</feedback_url>
            </event>
            <event guid='76e51085-5c5f-55bc-92a3-56e10b09f783' id='27982' code='Q9Y9Y3'>
                <room>Maschinenhaus</room>
                <title>Connect GPT with your data: Retrieval-augmented Generation</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T10:00:00+02:00</date>
                <start>10:00</start>
                <duration>00:40</duration>
                <abstract>Learn how to build with LLMs, like ChatGPT, and avoid typical pitfalls like hallucination and outdated information. Accompanied by practical code examples using the open source framework Haystack.</abstract>
                <slug>berlin-buzzwords-2023-27982-connect-gpt-with-your-data-retrieval-augmented-generation</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/Q9Y9Y3/Pietsch-Malte_1UVerm3.png</logo>
                <persons>
                    <person id='32598'>Malte Pietsch</person>
                </persons>
                <language>en</language>
                <description>Large Language Models (LLMs), like ChatGPT, became the poster child of AI overnight. They changed how people search the web, how they write content, and how they code. These models have billions of parameters they can use to effectively store some of the information they saw during pre-training. This enables them to show deep knowledge of a subject, even if they weren&apos;t explicitly trained on it.

Yet, it&#8217;s not straightforward to use LLMs in enterprise use cases and embed them successfully in your product.

The most common challenges with LLMs are:
1) They don&apos;t know anything about YOUR data
2) Their knowledge is not up-to-date
3) They hallucinate - it&apos;s hard to understand what sources they based their answers on
4) It&#8217;s hard to assess their performance

In this talk, you will learn how to deal with all of the above challenges. We will demonstrate how to connect LLMs to your data and how to keep them up-to-date using retrieval-augmented generation. We will show how to design prompts that minimize hallucination and how to evaluate the performance of your NLP application by collecting end-user feedback. We share best practices of development workflows and typical traps along the way. 

Each step will be accompanied by practical code examples using the open source framework Haystack. By the end of the talk, you will not only know the methods to overcome the above challenges but also have code examples at hand that let you kickstart the development of your own NLP features.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/Q9Y9Y3/resources/Malte_Pietsch_-_Connect_GPT_with_your_0PadQlq.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/Q9Y9Y3/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/Q9Y9Y3/feedback/</feedback_url>
            </event>
            <event guid='a19121ad-c62c-50c2-b949-e53f76ee1e00' id='27885' code='HG9XEL'>
                <room>Maschinenhaus</room>
                <title>A Fresh Start? The Path Toward Apache Solr&apos;s v2 API</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T11:00:00+02:00</date>
                <start>11:00</start>
                <duration>00:40</duration>
                <abstract>Modernization efforts face particular hurdles in large, established OSS projects.  Come learn about the community and technical challenges encountered on Apache Solr&apos;s path towards revamped HTTP APIs.</abstract>
                <slug>berlin-buzzwords-2023-27885-a-fresh-start-the-path-toward-apache-solr-s-v2-api</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/HG9XEL/Gerlowski-Jason_hPuunjX.png</logo>
                <persons>
                    <person id='32500'>Jason Gerlowski</person>
                </persons>
                <language>en</language>
                <description>Effecting broad changes in large, established open source projects is hard.  Their larger codebases make for more places to update.  Their age makes for more technical debt to overcome.  And their larger user-bases make for more stakeholder opinions to weigh and balance.  This talk will explore some of these ideas through the lens of Apache Solr&apos;s ongoing attempt to modernize its HTTP APIs and associated clients.  Some attention will be given to the state of Solr&apos;s APIs, but the primary focus will be the technical and community challenges encountered by the Solr community on the path towards its &quot;v2&quot; API.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/HG9XEL/resources/Jason_Gerlowski_-_A_fresh_start_Gt4igxh.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/HG9XEL/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/HG9XEL/feedback/</feedback_url>
            </event>
            <event guid='132f3275-7993-5e2d-823f-0a4b3c9a0d75' id='27890' code='ELVNYV'>
                <room>Maschinenhaus</room>
                <title>Rethinking Autoscaling for Apache Solr using Kubernetes</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T11:50:00+02:00</date>
                <start>11:50</start>
                <duration>00:40</duration>
                <abstract>Apache Solr&#8217;s built-in autoscaling is gone, but the need for autoscaling persists. Using Kubernetes&#8217; HPA, the Solr Operator and new Solr APIs, we re-introduce autoscaling for Solr on Kubernetes.</abstract>
                <slug>berlin-buzzwords-2023-27890-rethinking-autoscaling-for-apache-solr-using-kubernetes</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/ELVNYV/Putman-Houston_TGuAFN3.png</logo>
                <persons>
                    <person id='32502'>Houston Putman</person>
                </persons>
                <language>en</language>
                <description>SolrCloud clusters are often large and complex, with each organization using its own code to deploy and maintain these clusters. The Solr Operator was a first step in consolidating complexity through official deployment tooling to run Solr on Kubernetes. However, the Operator does not address scaling up and down based on demand.

Much like it provides generic deployments, Kubernetes provides generic ways of autoscaling applications, such as the HorizontalPodAutoscaler (HPA). This works especially well for stateless applications, much like deployments do. Solr is a stateful application that has specific state assigned to each pod (Solr node), therefore autoscaling SolrClouds with the HPA will not work by default.

The Solr Operator has already been built to extend Kubernetes&#8217; StatefulSets, Services and Ingresses to support Solr&#8217;s unique use-case. Therefore, it is the perfect mechanism to also bridge the gap between the HorizontalPodAutoscaler and Solr.

Through extending the functionality of the Solr Operator, and adding new APIs to Solr, we will show how autoscaling can be re-introduced to the Solr ecosystem.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/ELVNYV/resources/Houston_Putman_-_Solr_Autoscaling_on__hz6jpE7.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ELVNYV/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/ELVNYV/feedback/</feedback_url>
            </event>
            <event guid='14d54ed1-a01e-595b-a25a-c04ba397e117' id='25403' code='TMGNMF'>
                <room>Maschinenhaus</room>
                <title>Platform Engineering is All About Product</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>&#8220;Platform Engineering,&#8221; the latest buzzword, means building an internal platform to improve your SDLC in a way your developers will want to use. Can this be done with engineering skills alone?</abstract>
                <slug>berlin-buzzwords-2023-25403-platform-engineering-is-all-about-product</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/TMGNMF/Bashan-Gal_NQ9p40m.png</logo>
                <persons>
                    <person id='30212'>Gal Bashan</person>
                </persons>
                <language>en</language>
                <description>&quot;Platform Engineering&quot; is the latest buzzword in modern software engineering. It is the discipline of designing and building toolchains and workflows that enable self-service capabilities for software engineering organizations in the cloud-native era. Today&apos;s holy grail for platform engineering is to achieve the most effective &quot;Internal Developer Platform&quot; (IDP) that enables the rest of the developers in the company to be as effective as possible. Can this job be accomplished with engineering skills alone?

Platform intersects with product in two ways: first, the platform must be optimized for supporting the development of the company-specific product. Second, the platform must be built with a product mindset and practices for its users (the developers) to adopt it. In this session, we will discuss how to build an engineering platform your engineers want to use. We will go over standard product practices to use when creating the developer platform and the importance of making sure your IDP helps developers build the company&apos;s products faster and better. We will define the role of the platform product manager (PPM) and their importance in ensuring our platform is not a glorified Rube Goldberg machine.

In this session, you will learn:

- What is platform engineering? Is it just a new name for DevOps?
- What makes an IDP and a platform team successful?
- Who is the PPM? Why are they important? How do I convince my head of product we need one?
- Practices you can use to build a successful platform and pitfalls to avoid.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/TMGNMF/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/TMGNMF/feedback/</feedback_url>
            </event>
            <event guid='55292975-6ccc-53d2-a151-0201914cc915' id='26093' code='Q3MFKD'>
                <room>Maschinenhaus</room>
                <title>Hadoop Vectored IO: your data just got faster!</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:40</duration>
                <abstract>We are introducing a new Hadoop Filesystem API called &quot;vectored read&quot;, with which we can achieve significant speedups for all big data applications, especially on cloud storage like S3 and ABFS.</abstract>
                <slug>berlin-buzzwords-2023-26093-hadoop-vectored-io-your-data-just-got-faster</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/Q3MFKD/Loughran-Steve-2_eJYmhSG.png</logo>
                <persons>
                    <person id='32686'>Steve Loughran</person>
                </persons>
                <language>en</language>
                <description>Since 2006 the world of big data has moved from terabytes to hundreds of petabytes, from local clusters to remote cloud storage, yet the original Apache Hadoop POSIX-based file APIs have barely changed.

It is wonderful that these APIs have worked so well, but we can do a lot better with remote object stores, by providing new operations which suit them better, targeted at columnar data libraries such as ORC and Parquet. Only a few libraries need to migrate to these APIs for significant speedups of all big data applications.

This talk introduces a new Hadoop Filesystem API called &quot;vectored read&quot;, coming in Hadoop 3.4. An extension of the classic FSDataInputStream, it is automatically offered by all filesystem clients.
The S3A connector is the first object store connector to provide a custom implementation, reading different blocks of data in parallel. In Apache Hive benchmarks with a modified ORC library, we saw a 2x speedup compared to using the classic S3A connector through the POSIX APIs.

We will introduce the API spec, the S3A implementation, and the benchmarks, and show how to use it in your own applications. We will also cover our ongoing work on providing similar speedups with other object stores, and the use of the API in other applications.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/Q3MFKD/resources/Steve-Loughran-Hadoop_Vectored_IO_Mq2kLRW.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/Q3MFKD/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/Q3MFKD/feedback/</feedback_url>
            </event>
            <event guid='0f7a37de-1013-5d0c-8af7-7ff075445b70' id='27970' code='CWV3T3'>
                <room>Maschinenhaus</room>
                <title>Scalable distributed messaging&amp;streaming with Apache Pulsar</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T16:00:00+02:00</date>
                <start>16:00</start>
                <duration>00:40</duration>
                <abstract>In this session, you&apos;ll discover seven Apache Pulsar features that enable you to build amazing event-driven applications and how Apache Pulsar differs from traditional message brokers.</abstract>
                <slug>berlin-buzzwords-2023-27970-scalable-distributed-messaging-streaming-with-apache-pulsar</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/CWV3T3/jakubowski-julien_YBHxfGV.png</logo>
                <persons>
                    <person id='32589'>Julien Jakubowski</person>
                </persons>
                <language>en</language>
                <description>Today, when you think about building event-driven and real-time applications, the words that come to you spontaneously are probably: RabbitMQ, ActiveMQ, or Kafka. These are the solutions that dominate this landscape. But have you ever heard of Apache Pulsar?

After a brief presentation of the fundamental concepts of messaging, you&apos;ll discover the Pulsar features that enable you to build amazing event-driven applications. 
You&apos;ll learn the following:
- how Apache Pulsar architecture differs from other brokers
- how it enables scaling processing power &amp; data independently, quickly, and with no hassle
- how it guarantees high durability of messages
- how it can be relevant as a unified streaming &amp; messaging platform
- how to integrate Pulsar with your existing application portfolio that is compatible with Kafka or RabbitMQ
- some insight into the open-source community around Pulsar</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/CWV3T3/resources/Julien_Jakubowski_-_Scalable_Distribu_vS43NJZ.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/CWV3T3/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/CWV3T3/feedback/</feedback_url>
            </event>
            <event guid='b42653c2-f1ad-5f31-abb1-eea3a591faab' id='28165' code='HAKWWW'>
                <room>Maschinenhaus</room>
                <title>Catch the fraud &#8212; with observability and analytics</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T16:50:00+02:00</date>
                <start>16:50</start>
                <duration>00:20</duration>
                <abstract>This is the story of how to catch cheaters by combining observability and analytics data through the power of search.</abstract>
                <slug>berlin-buzzwords-2023-28165-catch-the-fraud-with-observability-and-analytics</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/HAKWWW/Krenn-Phillip_VD4J2lh.png</logo>
                <persons>
                    <person id='32676'>Philipp Krenn</person>
                </persons>
                <language>en</language>
                <description>Elastic (the company behind Elasticsearch, Kibana, and more) runs an annual competition to reward contributions like pull requests, blog posts, talks, etc. Once we started giving away MacBooks, we got a massive influx of fraud. This talk tells the tongue-in-cheek story of how people cheated and also how we caught them:
* Observability: Find the bots and trace everyone&apos;s actions to figure out what is a coincidence and what is not.
* Analytics: See how people are trying to exploit the system through fake accounts, shady content, or bending the rules.

While we initially hadn&apos;t planned for this scenario, having the power of search available across observability and analytics data let us do many interesting correlations to get a complete picture of the monster we had created.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/HAKWWW/resources/Philipp_Krenn_-_CatchTheFraud_9D1naqw.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/HAKWWW/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/HAKWWW/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Palais Atelier' guid='2fd45b3f-34a5-5cdc-83ff-c5179ae5fb09'>
            <event guid='1835b72b-a0bb-5228-88f6-afcfda0fcd82' id='27857' code='MDFQCE'>
                <room>Palais Atelier</room>
                <title>When ms matter: Maximizing query performance in CrateDB</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T09:30:00+02:00</date>
                <start>09:30</start>
                <duration>00:20</duration>
                <abstract>Achieving optimal execution plans in distributed databases is a challenging task. This talk will focus on CrateDB: a distributed SQL database, and key strategies for optimizing its query performance.</abstract>
                <slug>berlin-buzzwords-2023-27857-when-ms-matter-maximizing-query-performance-in-cratedb</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/MDFQCE/Selakovic-Marija_ya3li6V.png</logo>
                <persons>
                    <person id='32471'>Marija Selakovic</person>
                </persons>
                <language>en</language>
                <description>Distributed databases provide easy scaling, high performance, and availability, all crucial for handling large amounts of data. However, achieving optimal execution plans in these systems is often a challenge and requires special considerations. In this talk, we will explore the key concepts and best practices for optimizing query performance in the CrateDB database. CrateDB is a highly scalable and distributed SQL database that offers a unique blend of SQL and NoSQL capabilities. Although the focus of the talk is going to be on CrateDB, most of the techniques we are going to discuss apply to many distributed databases.

As a first step, we will go through query planning to better understand potential bottlenecks. Then, we will discuss the practical implications of indexing, sharding, and partitioning strategies, and provide advice on how to further tune CrateDB queries for optimal performance. All these topics will be illustrated with real-world examples and practical solutions to some of the most common issues. At the end of the talk, you will be equipped with practical tips and techniques for detecting performance issues and optimizing your queries.

Key learnings:
- Intro to CrateDB and query plans
- How different sharding, partitioning, and indexing strategies affect query performance
- Real-life examples and tips for debugging slow queries</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/MDFQCE/resources/Marija_Selakovic_-_When_miliseconds_m_mCVOMIw.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/MDFQCE/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/MDFQCE/feedback/</feedback_url>
            </event>
            <event guid='aab86b1b-277d-5a81-95a2-48d26ec24083' id='28012' code='EQMRJP'>
                <room>Palais Atelier</room>
                <title>Deep dive into an Elasticsearch plugin for query-time joins</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T10:00:00+02:00</date>
                <start>10:00</start>
                <duration>00:40</duration>
                <abstract>Siren Federate is an Elasticsearch plugin for joining inverted indices at query-time. Learn in this talk about its inner workings and how it complements features of Elasticsearch like runtime fields.</abstract>
                <slug>berlin-buzzwords-2023-28012-deep-dive-into-an-elasticsearch-plugin-for-query-time-joins</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/EQMRJP/campinas-stephane_7uN3txd.png</logo>
                <persons>
                    <person id='32626'>St&#233;phane Campinas</person>
                </persons>
                <language>en</language>
                <description>Data are often at the basis of critical decisions in many different sectors, ranging from e-commerce to cyber-security: machine and search logs, user information, metrics over transactions, etc. Data in such domains are by nature voluminous and inter-connected. Analytics systems are expected not only to search and aggregate those data, but also to join them: join operations are often necessary in order to explore inter-connected data and get insights from them. Analysts often interact with such systems by following an explorative and iterative process that represents their train of thought. Such systems must therefore have fast response times to avoid impeding the mental process of the analysts. Whilst Elasticsearch is a fantastic high-performance analytics engine, it presents some limitations in certain cases when it comes to joining data from different indices at query-time.

In this talk, we will present Siren&#8217;s ten-year-long effort in implementing distributed joins on top of Elasticsearch. We will introduce Siren Federate &#8211; our Elasticsearch plugin that provides query-time join capabilities over indices &#8211; and we will discuss some of the challenges we had to tackle during its development. We will begin by describing how joins are performed by Federate, from the reception of a query to the computation of its results. Then we will show the importance of caching join results for performance, and how a cache can be efficiently implemented. Talking about performance, we will explain the benefits of adopting a vectorized data processing model by showing some experimental results. To conclude, we will discuss the importance of the expressiveness of a query language by illustrating the Federate DSL and how it integrates with some advanced features of Elasticsearch such as runtime fields.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/EQMRJP/resources/Stephane_Campinas_-_Siren_Federate_wyKPlde.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/EQMRJP/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/EQMRJP/feedback/</feedback_url>
            </event>
            <event guid='9e867122-e8b2-5ea5-b676-f23f44dd2c03' id='32403' code='9SJGJ3'>
                <room>Palais Atelier</room>
                <title>From keyword to vector</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T11:00:00+02:00</date>
                <start>11:00</start>
                <duration>00:20</duration>
                <abstract>During this talk, I will take you on my over-a-decade-long journey in search. Starting from having witnessed the inception of Elasticsearch to my current endeavors with Weaviate, I will share my first-hand experience of the evolution, challenges, and lessons learned along the way.</abstract>
                <slug>berlin-buzzwords-2023-32403-from-keyword-to-vector</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/9SJGJ3/Voorbach-Byron_VB0KisD.png</logo>
                <persons>
                    <person id='36661'>Byron Voorbach</person>
                </persons>
                <language>en</language>
                <description>The future of search lies in machine learning-based approaches, a realization that has led me to the world of semantic search. As part of my current endeavors with Weaviate, I&#8217;ve come to understand the transformative potential of this advanced technology and how it&#8217;s reshaping our digital experiences.

My journey into this field began during an internship in the early stages of my programming career. A group of my then colleagues ventured out to form a new company called Elasticsearch. Recognizing their potential, my mentor recommended that I focus my personal development on search technologies. This advice sparked my exploration of Lucene, Solr, and Elasticsearch, among others.

In the subsequent years as a search consultant, I wrestled with the inherent challenges of keyword-based systems. The tasks were anything but straightforward, from managing semantics, synonyms, and typos to trying to decipher user intent. However, this endeavor was far from fruitless - it led to a deep understanding of the intricate workings of search technologies.

This talk will take you through significant advancements in search over the years, peppered with practical insights and hard-earned wisdom I&#8217;ve accumulated along the way. The goal is not to argue that vector search replaces keyword search but to illustrate how combining both can yield the best results. Attendees can look forward to a nuanced understanding of search technologies, their evolution, and their potential to shape our future digital experiences.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/9SJGJ3/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/9SJGJ3/feedback/</feedback_url>
            </event>
            <event guid='778be5a3-aa32-5d17-a1d0-fc8160064600' id='27940' code='KWFLKN'>
                <room>Palais Atelier</room>
                <title>Semantic vs keyword search as context for GPT</title>
                <subtitle></subtitle>
                <type>Short Talk</type>
                <date>2023-06-20T11:30:00+02:00</date>
                <start>11:30</start>
                <duration>00:20</duration>
                <abstract>If you want to build a chat bot like ChatGPT on your own data, you need to use search to provide the context. Usually semantic search is used, but we&apos;ve found that keyword search has some pros.</abstract>
                <slug>berlin-buzzwords-2023-27940-semantic-vs-keyword-search-as-context-for-gpt</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/KWFLKN/Golubenco-Tudor_tQut9dy.png</logo>
                <persons>
                    <person id='32560'>Tudor Golubenco</person>
                </persons>
                <language>en</language>
                <description>OpenAI&apos;s ChatGPT has taken the world by storm, and people want to be able to offer the same type of chatbot experience on their own data. Such a bot can answer questions based on your documentation or knowledge base.

This can be done with the OpenAI API by providing the right context, extracted from your data, to the model. You can do this in two steps:

* the search step: perform a search to select the documentation pages that are likely to contain the answer.
* the GPT step: provide these pages as context with a prompt like &quot;With this context: .... answer this question: ...&quot;.

For the search step, semantic search is often used, because it makes use of the LLM capabilities. However, we have found that in practice keyword search (e.g. BM25-based) has some advantages when it comes to tuning the search step, and it tends to be more &quot;explainable&quot;.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/KWFLKN/resources/Tudor_Golubenco_-_Semantic_vs_keyword_XmjmVSg.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/KWFLKN/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/KWFLKN/feedback/</feedback_url>
            </event>
            <event guid='2b542fdc-fe73-55a2-8f65-34df004f6d6e' id='28171' code='WFLLCY'>
                <room>Palais Atelier</room>
                <title>Alexa, is The Smart Home vision failing?</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T12:00:00+02:00</date>
                <start>12:00</start>
                <duration>00:40</duration>
                <abstract>Amazon&apos;s Alexa team has lost billions. Google&apos;s and Apple&apos;s hubs aren&apos;t great successes. Is the smart home failing? How can you keep your lights on when they depend on cloud infrastructure to work?</abstract>
                <slug>berlin-buzzwords-2023-28171-alexa-is-the-smart-home-vision-failing</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/WFLLCY/Loughran-Steve_Yopov2y.png</logo>
                <persons>
                    <person id='32686'>Steve Loughran</person>
                </persons>
                <language>en</language>
                <description>A harsh critique of the current state of the &quot;Smart Home&quot;. The vision was of a home full of smart devices: network-enabled, remotely controllable and managed via local applications, phones and cloud services.

We now have three competing vendors all trying to be the one ecosystem of the Smart Home: Apple, Google and Amazon. The financial disaster which is Alexa highlights how even for them it is a way to lose money, and raises the question &quot;how long will Amazon keep the blue light on your Alexa on?&quot;. Some of the problems are technical, but many are related to usability and integration.

Based on experiences of attempting to use devices through all the ecosystems, and even writing a basic Alexa skill, this talk highlights how broken the smart home currently is. For example, Alexa and Philips Hue light bulb groups need different names so Alexa knows which office lights to turn on. Or consider the way in which cloud-hosted platforms can change their speech recognition and pattern matching algorithms without any warning or control. What longevity can we expect of the hardware? Cloud-enhanced devices may have a lower purchase price, but they depend on VC cash to keep working.

What can we do? We must embrace platforms such as Home Assistant to stay in control. Yes, you get to add debug statements to Python modules to fix plug authentication, but if we developers do this, others will benefit. We also need to look at the survival of cloud integration: a subscription model is the only one which works. Finally, there is the promise of Thread, the low-power wireless mesh network, and Matter, the model and API for devices and applications: useful but insufficient.

A code-free talk; the audience will get the historical context of the early Ubicomp work and the experience of trying to get the modern platforms to achieve that vision from the turn of the century. While things have moved on from hacked-together hardware and rigged demos, they haven&apos;t moved on far enough.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WFLLCY/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WFLLCY/feedback/</feedback_url>
            </event>
            <event guid='efac0c79-0011-5ef8-9ad5-7ff98c010a22' id='27766' code='WCPQTA'>
                <room>Palais Atelier</room>
                <title>A Crash Course in Error Handling for Streaming Data Pipeline</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>Learn how to handle errors in streaming data pipelines using concepts such as dead-letter queues.</abstract>
                <slug>berlin-buzzwords-2023-27766-a-crash-course-in-error-handling-for-streaming-data-pipeline</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/WCPQTA/Sprenger-Stefan_PbErXUV.png</logo>
                <persons>
                    <person id='32382'>Stefan Sprenger</person>
                </persons>
                <language>en</language>
                <description>Streaming data pipelines pose unique requirements for the handling of errors and other malfunctions because they are executed continuously and cannot be manually supervised. As a consequence, we need to automate the handling of errors as much as possible.
This talk answers three critical questions in the context of data streaming: What are potential errors? How shall we handle the different kinds of errors? Which metrics help us to keep track of the health of streaming data pipelines?
We discuss (1) errors that happen when consuming Apache Kafka topics, e.g., when deserializing records, (2) errors that happen when producing records to Apache Kafka topics, e.g., when serializing data, (3) errors that happen when processing records, e.g., exceptions raised in data transformations, and (4) errors that are caused by external factors, e.g., when the streaming data pipeline exceeds available memory resources.
Once potential errors have been introduced, we show how to cope with them through design patterns, like dead-letter queues, or practical approaches, like log-based alerts.
Finally, we discuss important metrics for monitoring the health of streaming data pipelines, e.g., consumer lags, or producing rates for dead-letter topics.
While we use examples from Kafka Streams applications, the presented content can be easily transferred to other stream processing frameworks.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/WCPQTA/resources/Stefan_Sprenger-A_Crash_Course_in_Err_afEYEoA.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WCPQTA/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/WCPQTA/feedback/</feedback_url>
            </event>
            <event guid='88941d1d-46da-5897-91f6-21f8cb00ce33' id='27821' code='EAD8JD'>
                <room>Palais Atelier</room>
                <title>Fact Checking Rocks: how to build a fact-checking system</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:40</duration>
                <abstract>In this infodemic era, fact-checking is becoming a vital task.
In this talk, we&#8217;ll discover how to build a simple fact-checking system for rock music, leveraging the power of open-source libraries.</abstract>
                <slug>berlin-buzzwords-2023-27821-fact-checking-rocks-how-to-build-a-fact-checking-system</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/EAD8JD/Fiorucci-Stefan_Ao9Dv7j.png</logo>
                <persons>
                    <person id='32428'>Stefano Fiorucci</person>
                </persons>
                <language>en</language>
                <description>In this infodemic era, fact-checking is becoming a vital task. However, it is a complex and time-consuming activity.

In this talk, we will see how to combine Information Retrieval tools with modern Language Models to simply implement a fact-checking baseline with low human effort.

I will show you how to build a funny use case around rock music.

The application is based on several Python open-source libraries: Haystack, FAISS, Hugging Face Transformers, Sentence Transformers.
This step-by-step implementation will be an opportunity to learn more about Dense retrieval and Natural Language Inference models in a hands-on way. I will share some insights into developing modern Natural Language applications.

**Why it&apos;s relevant:**

Fact-checking is important to society, although it is still difficult to do automatically. Using modern NLP tools can help speed up and automate part of this task.

**What the audience will learn:**
- Dense retrieval for semantic search
- Natural Language Inference models
- How to build a fact-checking system using Haystack, FAISS, Hugging Face Transformers, Sentence Transformers.
- How to integrate powerful (Large) Language Models in your NLP applications, conditioning them to operate on your knowledge base
- How to efficiently combine tools from Information Retrieval, NLP, and Vector Search</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links>
                    <link href="https://github.com/anakin87/fact-checking-rocks">GitHub project</link>
                
                    <link href="https://huggingface.co/spaces/anakin87/fact-checking-rocks">Demo of the project</link>
                </links>
                <attachments></attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/EAD8JD/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/EAD8JD/feedback/</feedback_url>
            </event>
            <event guid='7f0be176-3bf5-5440-b483-dc38752c01b3' id='28144' code='FKKNBD'>
                <room>Palais Atelier</room>
                <title>Learning to hybrid search</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T16:00:00+02:00</date>
                <start>16:00</start>
                <duration>00:40</duration>
                <abstract>Combining BM25, neural embeddings and customer behavior with Learning-to-Rank into an ultimate ranking ensemble, with examples on the Amazon ESCI e-commerce search dataset.</abstract>
                <slug>berlin-buzzwords-2023-28144-learning-to-hybrid-search</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/FKKNBD/Grebennikov-Roman_Goloviznin-Vsevolod_P5Qvv1R.png</logo>
                <persons>
                    <person id='32664'>Roman Grebennikov</person><person id='32666'>Vsevolod Goloviznin</person>
                </persons>
                <language>en</language>
                <description>Traditional term search has good precision but lacks semantics. Modern neural search is good at semantics but can miss customer behavior. The learning-to-rank approach adapts to customer behavior, but only if your baseline retrieval is already good enough.

The current hype about neural search can give the impression that it&apos;s the ultimate solution to all the problems of legacy term search and LTR. You only need [disclaimer: irony ahead] to do the very simple thing of fine-tuning a giant neural network to notice all the dependencies between queries, documents and customer behavior on all the data you have. But what if, instead of replacing A with B, you could combine the strengths of all the approaches?

In this talk, we will take an e-commerce search example built on Amazon&apos;s open-source ESCI/ESCI-S dataset and compare traditional text matching and Learning-to-Rank approaches with modern neural search methods on real data. We will show how combining multiple old and new approaches in a single hybrid system can deliver an even better result than any of them separately.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/FKKNBD/resources/grebennikov_learning_to_hybrid_search_cU1eJKJ.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/FKKNBD/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/FKKNBD/feedback/</feedback_url>
            </event>
            
        </room>
        <room name='Frannz Salon' guid='53bc35d4-3bfa-5d32-b6f3-228e6d6dd639'>
            <event guid='3bc45cc8-0f2d-53e6-a224-6db489c9d52e' id='25783' code='FFNZSK'>
                <room>Frannz Salon</room>
                <title>Advanced Search Plays with GraphQL</title>
                <subtitle></subtitle>
                <type>Workshop</type>
                <date>2023-06-20T09:30:00+02:00</date>
                <start>09:30</start>
                <duration>01:10</duration>
                <abstract>This demo-heavy workshop scores a hat trick by combining Apache Lucene, MongoDB, and GraphQL to easily build search functionality across data collections and 3rd party APIs into applications.</abstract>
                <slug>berlin-buzzwords-2023-25783-advanced-search-plays-with-graphql</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/FFNZSK/Vlaeva-Stanimira_BveVwPy.png</logo>
                <persons>
                    <person id='35599'>Stanimira Vlaeva</person>
                </persons>
                <language>en</language>
                <description>GraphQL is rapidly growing in popularity as the new standard for working with APIs, and it&#8217;s easy to see why! This groundbreaking API query language gives developers a single endpoint to access exactly the data they need. This eliminates over-fetching, decreases the response payload, and avoids multiple costly round trips to the server and long page load times.
This could be a session, long or short, or a workshop. The application is a football-themed app (or a movie app if the organizers prefer) where we start small by exposing data via a GraphQL endpoint in minutes, but then we make the application very different and more fun by using GraphQL custom resolvers to add a third-party TikTok endpoint to the mix. The code is hosted in a code sandbox, so attendees will leave with the inspiration, best practices, and actual code to implement immediately in their workflow.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/FFNZSK/resources/Stanimira_Vlaeva_-_Advanced_Search_Pl_KL5uBfn.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/FFNZSK/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/FFNZSK/feedback/</feedback_url>
            </event>
            <event guid='1e276956-392f-516e-9e74-780fc97f3230' id='28093' code='PRQ7PV'>
                <room>Frannz Salon</room>
                <title>How to Implement Online Search Quality Evaluation with Kibana</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T11:00:00+02:00</date>
                <start>11:00</start>
                <duration>00:40</duration>
                <abstract>Conducting online testing is crucial for assessing a model&#8217;s performance in a real-world scenario. This talk explores a customized approach for evaluating ranking models using Kibana.</abstract>
                <slug>berlin-buzzwords-2023-28093-how-to-implement-online-search-quality-evaluation-with-kibana</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/PRQ7PV/Petreti-Ilaria_Ruggero-Anna_xYMLsg4.png</logo>
                <persons>
                    <person id='32645'>Ilaria Petreti</person><person id='35209'>Anna Ruggero</person>
                </persons>
                <language>en</language>
                <description>Online testing represents a fundamental method to assess the performance of a ranking model in practical applications, providing the information needed to improve and better understand its behavior.
Despite the advantages, the currently available evaluation tools have certain limitations. For this reason, we will present an alternative and customized approach to evaluate ranking models using Kibana.
The talk will begin with an overview of online testing, including its benefits and drawbacks. Then, we will provide an in-depth exploration of our Kibana implementation, detailing the reasons behind our approach. Attendees will learn about the various tools provided by Kibana, and with practical examples, we will show how to create visualizations and dashboards, complete with queries and code, to compare different rankers.
Participants will learn how to leverage Kibana to evaluate ranking models on custom metrics and in specific contexts, such as the most popular and &#8220;populous&#8221; queries.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/PRQ7PV/resources/Ilaria_Petrieti_-_Anna_Ruggero_-_How__Rbvvqck.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/PRQ7PV/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/PRQ7PV/feedback/</feedback_url>
            </event>
            <event guid='e3db037e-f2d5-5492-9839-d54f156f94f3' id='28056' code='N9JRVC'>
                <room>Frannz Salon</room>
                <title>Highly Available Search at Shopify</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T11:50:00+02:00</date>
                <start>11:50</start>
                <duration>00:40</duration>
                <abstract>This talk shares the story of how Shopify implemented seamless storage autoscaling for Elasticsearch that powers search for millions of merchants without data loss.</abstract>
                <slug>berlin-buzzwords-2023-28056-highly-available-search-at-shopify</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/N9JRVC/Ebrahimpour-Khosrow_41WBDnS.png</logo>
                <persons>
                    <person id='32638'>Khosrow Ebrahimpour</person>
                </persons>
                <language>en</language>
                <description>Millions of merchants rely on Shopify&#8217;s search infrastructure to sell their products and fulfill their orders. To be successful, merchants need their data to be highly available and searchable within seconds. Moreover, these merchants are spread across jurisdictions around the globe, where data residency regulations require them to keep sensitive data within their own jurisdiction. However, since their buyers are also spread across the globe, non-personal data such as store products should be available globally and close to buyers to provide a fast search experience.

This talk explains how the search platform team at Shopify built a highly available search infrastructure that indexes petabytes of data from traditional databases to Elasticsearch through Kafka in record time.
Since search is a critical service for a global commerce platform at Shopify&#8217;s scale, the indexing pipeline writing to Elasticsearch is implemented with high availability and disaster recovery as key requirements. That is, if one region becomes unavailable, the data replication mechanism allows the search infrastructure to keep providing service without impacting merchants and buyers.
Moreover, this infrastructure is distributed across the globe and designed in a way to follow data residency regulations of different jurisdictions while making sure buyers are able to search products with minimum delay.

Shopify&#8217;s search infrastructure has proven to be performant and capable of indexing millions of documents per minute while serving millions of queries at the same time. The lessons learned about building a highly available and performant search infrastructure, shared in this talk, will be of interest to attendees and will encourage them to solve similar challenges.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/N9JRVC/resources/Khosrow_Ebrahimpour_-_Highly_availabl_a8ucfeL.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/N9JRVC/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/N9JRVC/feedback/</feedback_url>
            </event>
            <event guid='fd550680-0b17-5313-a175-75ec188998b4' id='27976' code='K8AR9R'>
                <room>Frannz Salon</room>
                <title>Using Dense Vector search at the EU Publications Office</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:00:00+02:00</date>
                <start>14:00</start>
                <duration>00:40</duration>
                <abstract>How dense vector functionality was used to provide several &#8216;Google-like&#8217; capabilities such as Extractive Answers and knowledge graph search over a large dataset at the EU Publications Office.</abstract>
                <slug>berlin-buzzwords-2023-27976-using-dense-vector-search-at-the-eu-publications-office</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/K8AR9R/Bayton-Martin_KEqRHXJ.png</logo>
                <persons>
                    <person id='37969'>Martin Bayton</person>
                </persons>
                <language>en</language>
                <description>In this session you will discover how dense vector functionality was used to enhance traditional search and provide users with a &apos;Google-like&apos; search experience during a proof of concept over a large dataset of multilingual legal content curated by the European Union Publications Office in Luxembourg. The presentation will explain how a combination of Elasticsearch, Google BERT transformer models, and the Pureinsights Discovery Platform (PDP) was utilised during the project and discuss the results obtained. There will also be a live demonstration showing the power of semantic understanding across documents and search queries.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/K8AR9R/resources/Martin_Bayton_-_Using_Dense_Vector_se_VJtVNmW.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/K8AR9R/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/K8AR9R/feedback/</feedback_url>
            </event>
            <event guid='3fd3f61c-ada8-59fa-a66d-ffdc428977e1' id='27755' code='KDT73L'>
                <room>Frannz Salon</room>
                <title>Searching large data sets in (near) constant time</title>
                <subtitle></subtitle>
                <type>Talk</type>
                <date>2023-06-20T14:50:00+02:00</date>
                <start>14:50</start>
                <duration>00:40</duration>
                <abstract>Tackle large search results by estimating hit count, interpolating a first-phase ranking and limiting the returned result set to the most relevant documents in a multi-million document index.</abstract>
                <slug>berlin-buzzwords-2023-27755-searching-large-data-sets-in-near-constant-time</slug>
                <track></track>
                <logo>/media/berlin-buzzwords-2023/submissions/KDT73L/Bogh-K%C3%B6ster-Torsten_Berger-Dennis_uXP3UmH.png</logo>
                <persons>
                    <person id='32374'>Torsten B&#248;gh K&#246;ster</person><person id='35110'>Dennis Berger</person>
                </persons>
                <language>en</language>
                <description>In low-latency search environments, queries producing large result sets are a real pain. A proper ranking of large result sets burns a lot of CPU. Those queries have the potential to slow down or even brick your cluster. On the customer side, it is questionable whether it makes sense to return millions of documents, as the customer has to filter them afterwards anyway.

Those large result sets caused us serious headaches, as they significantly reduced the available compute headroom on the nodes of our Solr cluster. They even bricked the whole cluster when they hit it in high volume. In this project report we&apos;ll guide you through the steps (and the math) by which we:

- construct index-based random experiments,
- estimate the rough hit count of a query by extrapolating bucket search results,
- collect and apply static first-phase ranking information,
- use the information collected to filter the result set down to the most relevant documents, returning no more than a given number of documents,
- extrapolate hit and facet counts to mimic the original search result, and
- handle document collapsing and faceting.

In this talk we&apos;ll guide you through the software architecture as well as the math applied. Although we applied it to a Solr search system, the concept can be applied to other search engines as well.</description>
                <recording>
                    <license></license>
                    <optout>false</optout>
                </recording>
                <links></links>
                <attachments>
                    <attachment href="https://program.berlinbuzzwords.de/media/berlin-buzzwords-2023/submissions/KDT73L/resources/Torsten_Bogh_Koester__Dennis_Berger_-_BW0Xm0f.pdf">Slides</attachment>
                </attachments>

                <url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/KDT73L/</url>
                <feedback_url>https://program.berlinbuzzwords.de/berlin-buzzwords-2023/talk/KDT73L/feedback/</feedback_url>
            </event>
            
        </room>
        
    </day>
    
</schedule>
