Synthetic data: when, why, and how
2023-06-20, Kesselhaus

This talk will cover several use cases in which generating synthetic data is useful (or even essential) and introduce a toolbox of practical techniques for synthesizing data in these situations.


Data is essential to today's most interesting applications and systems, which learn from data, act autonomously in response to data, and make data digestible via search. Somewhat counterintuitively, as the importance of real data has increased, the importance of synthetic data has increased as well. In this talk, you'll learn when it's appropriate to use synthetic data (and when it isn't likely to help). You'll also learn about several circumstances in which synthetic data is especially useful, including handling personally identifiable information, load testing, and simulating system responses to unlikely scenarios. The talk will conclude with brief, actionable introductions to several practical approaches to generating synthetic tabular data, each suited to particular kinds of synthetic data use cases: we'll cover a simple way to simulate data-generating processes from first principles, basic and more sophisticated statistical techniques, and approaches based on machine learning models. You'll leave with a better understanding of the role of synthetic data in today's systems and a concrete toolbox of ways to exploit it in your own programs.
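
To give a flavor of the first of these approaches, here is a minimal sketch of simulating a data-generating process from first principles, using only the Python standard library. It is not taken from the talk: the `simulate_orders` function, the column names, and the distribution parameters are all hypothetical, chosen only for illustration.

```python
import random


def simulate_orders(n_rows, seed=0, rate_per_minute=2.0):
    """Simulate a hypothetical 'orders' table from first principles.

    All column names and distribution parameters are illustrative
    assumptions, not values from any real system.
    """
    rng = random.Random(seed)  # seeded so the table is reproducible
    elapsed = 0.0
    rows = []
    for order_id in range(n_rows):
        # Exponential gaps between events yield a Poisson arrival process.
        elapsed += rng.expovariate(rate_per_minute)
        rows.append({
            "order_id": order_id,
            "minutes_since_start": round(elapsed, 3),
            "customer_id": rng.randrange(100),
            # Log-normal amounts are always positive with a long right tail.
            "amount": round(rng.lognormvariate(3.0, 0.5), 2),
        })
    return rows


rows = simulate_orders(1000)
```

Because every value comes from an explicit model rather than from real records, a table like this contains no personal data, and raising `n_rows` or `rate_per_minute` is one simple way to produce heavier traffic for load testing.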

William Benton is passionate about making it easier for machine learning practitioners to benefit from advanced infrastructure and making it possible for organizations to manage machine learning systems. His recent roles have included defining product strategy and professional services offerings related to data science and machine learning, leading teams of data scientists and engineers, and contributing to many open source communities related to data, ML, and distributed systems. Will was an early advocate of building machine learning systems on Kubernetes and developed and popularized the “intelligent applications” idiom for machine learning systems in the cloud. He has also conducted research and development related to static program analysis, language runtimes, cluster configuration management, and music technology.