Disseminate: The Computer Science Research Podcast

Episodes

Liana Patel | ACORN: Performant and Predicate-Agnostic Hybrid Search | #60

Nov 11 2024
In this episode, we chat with Liana Patel about ACORN, a groundbreaking method for hybrid search in applications that use mixed-modality data. As more systems require simultaneous access to embedded images, text, video, and structured data, traditional search methods struggle to stay efficient and flexible. Liana explains how ACORN builds on Hierarchical Navigable Small Worlds (HNSW) to enable efficient, predicate-agnostic search through predicate subgraph traversal. This lets ACORN significantly outperform existing methods while supporting complex query semantics, achieving 2–1,000 times higher throughput on diverse datasets. Tune in to learn more!
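To give a feel for what traversing a predicate subgraph can look like, here is a minimal Python sketch. It is not the ACORN implementation: the brute-force nearest-neighbour graph stands in for a single HNSW layer, and the names build_knn_graph, filtered_search, passes, and ef are all assumptions made for illustration. The search only ever visits points that satisfy the predicate, and when a neighbour fails the predicate it looks one hop further so the filtered graph is less likely to fall apart.

```python
import heapq
import math


def build_knn_graph(points, m):
    """Brute-force m-nearest-neighbour graph (a toy stand-in for an HNSW layer)."""
    graph = {}
    for i, p in enumerate(points):
        by_distance = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in by_distance[:m]]
    return graph


def filtered_search(points, graph, passes, query, k, ef=16):
    """Best-first search restricted to nodes for which passes(node) is True.

    ef bounds the size of the result pool and should be at least k.
    """
    entry = next((i for i in graph if passes(i)), None)
    if entry is None:
        return []
    d0 = math.dist(points[entry], query)
    visited = {entry}
    frontier = [(d0, entry)]  # min-heap of nodes still to expand
    best = [(-d0, entry)]     # max-heap (negated distance) of the ef best so far
    while frontier:
        d, node = heapq.heappop(frontier)
        # Stop once the closest unexpanded node is worse than the whole pool.
        if len(best) >= ef and d > -best[0][0]:
            break
        for nb in graph[node]:
            # A neighbour that fails the predicate is never visited itself;
            # its own neighbours are considered instead (two-hop expansion).
            candidates = [nb] if passes(nb) else graph[nb]
            for c in candidates:
                if c in visited or not passes(c):
                    continue
                visited.add(c)
                dc = math.dist(points[c], query)
                if len(best) < ef or dc < -best[0][0]:
                    heapq.heappush(frontier, (dc, c))
                    heapq.heappush(best, (-dc, c))
                    if len(best) > ef:
                        heapq.heappop(best)
    ranked = sorted((-neg_d, node) for neg_d, node in best)
    return [node for _, node in ranked[:k]]
```

Because the predicate is an arbitrary callable checked at query time, nothing about the index depends on which predicates will be asked, which is the sense of "predicate-agnostic" used above. The search is approximate, so it may miss the true nearest matching point.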

Links:
ACORN: Performant and Predicate-Agnostic Search Over Vector Embeddings and Structured Data [SIGMOD'24]
Liana's LinkedIn
Liana's X

Hosted on Acast. See acast.com/privacy for more information.
53 mins

High Impact in Databases with... David Maier

Nov 4 2024
In this High Impact episode we talk to David Maier.

David is the Maseeh Professor Emeritus of Emerging Technologies at Portland State University. Tune in to hear David's story and learn about some of his most impactful work.

The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust.

You can find David on:
Homepage
Google Scholar

1 hr and 2 mins

Raunak Shah | R2D2: Reducing Redundancy and Duplication in Data Lakes | #59

Oct 28 2024
In this episode, Raunak Shah joins us to discuss the critical issue of data redundancy in enterprise data lakes, which can lead to soaring storage and maintenance costs. Raunak highlights how large-scale data environments, ranging from terabytes to petabytes, often contain duplicate and redundant datasets that are difficult to manage. He introduces the concept of "dataset containment" and explains its significance in identifying and reducing redundancy at the table level in these massive data lakes—an area where there has been little prior work.

Raunak then dives into the details of R2D2, a novel three-step hierarchical pipeline designed to efficiently tackle dataset containment. By utilizing schema containment graphs, statistical min-max pruning, and content-level pruning, R2D2 progressively reduces the search space to pinpoint redundant data. Raunak also discusses how the system, implemented on platforms like Azure Databricks and AWS, offers significant improvements over existing methods, processing TB-scale data lakes in just a few hours with high accuracy. He concludes with a discussion on how R2D2 optimally balances storage savings and performance by identifying datasets that can be deleted and reconstructed on demand, providing valuable insights for enterprises aiming to streamline their data management strategies.
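The three pruning stages can be pictured as a funnel in which each check is more expensive than the last and runs only on the pairs that survived the previous one. The Python sketch below is a simplified illustration under stated assumptions, not the R2D2 pipeline itself: tables are small in-memory dictionaries mapping column names to non-empty value lists, the content check compares full rows exactly, and all function names are hypothetical.

```python
def schema_contained(child, parent):
    """Stage 1: every column of child must also exist in parent."""
    return set(child) <= set(parent)


def minmax_contained(child, parent):
    """Stage 2: each child column's value range must lie inside the parent's.

    Assumes every column is non-empty and its values are mutually comparable.
    """
    return all(
        min(parent[c]) <= min(child[c]) and max(child[c]) <= max(parent[c])
        for c in child
    )


def rows(table, columns):
    """Project a column-oriented table onto columns as a set of row tuples."""
    return set(zip(*(table[c] for c in columns)))


def content_contained(child, parent):
    """Stage 3: every child row must appear in parent (exact check, toy scale)."""
    columns = sorted(child)
    return rows(child, columns) <= rows(parent, columns)


def containment_pairs(tables):
    """Return (child, parent) name pairs that survive all three stages."""
    names = list(tables)
    pairs = [(a, b) for a in names for b in names if a != b]
    for check in (schema_contained, minmax_contained, content_contained):
        pairs = [(a, b) for a, b in pairs if check(tables[a], tables[b])]
    return pairs
```

For example, a table holding ids 2 and 3 survives all three checks against a table holding ids 1 to 4 with the same values, while a table containing id 5 is discarded at the min-max stage without its rows ever being read.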

Materials:
SIGMOD'24 Paper - R2D2: Reducing Redundancy and Duplication in Data Lakes
ICDE'24 - Towards Optimizing Storage Costs in the Cloud

31 mins

High Impact in Databases with... Aditya Parameswaran

Oct 21 2024
In this High Impact episode we talk to Aditya Parameswaran about some of his most impactful work.

Aditya is an Associate Professor at the University of California, Berkeley. Tune in to hear Aditya's story!

The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust.

Links:
EPIC Data Lab
Answering Queries using Humans, Algorithms and Databases (CIDR'11)
Potter’s Wheel: An Interactive Data Cleaning System (VLDB'01)
Online Aggregation (SIGMOD'97)
Polaris: A System for Query, Analysis and Visualization of Multi-dimensional Relational Databases (INFOVIS'00)
Coping with Rejection
Ponder

You can find Aditya on:
Twitter
LinkedIn
Google Scholar

59 mins

Marco Costa | Taming Adversarial Queries with Optimal Range Filters | #58

Oct 14 2024

In this episode, we sit down with Marco Costa to discuss the fascinating world of range filters, focusing on how they help optimize queries in databases by determining whether a range intersects with a given set of keys. Marco explains how traditional range filters, like Bloom filters, often result in high false positives and slow query times, especially when dealing with adversarial inputs where queries are correlated with the keys. He walks us through the limitations of existing heuristic-based solutions and the common challenges they face in maintaining accuracy and speed under such conditions.
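To make the range-filter contract concrete, here is a toy Python sketch. It is a simple prefix-bucket heuristic written for illustration, not Grafite and not any filter discussed in the episode; the class name PrefixRangeFilter and the shift parameter are assumptions. The filter may answer "maybe" for a range that holds no key (a false positive) but never answers "no" for a range that does hold one. It also shows why queries that land right next to stored keys are the hard case.

```python
class PrefixRangeFilter:
    """Toy range filter: remembers only the high bits (bucket) of each key.

    Assumes keys and query bounds are non-negative integers.
    """

    def __init__(self, keys, shift):
        self.shift = shift
        self.buckets = {k >> shift for k in keys}

    def may_intersect(self, lo, hi):
        """True if [lo, hi] might contain a key; False means it definitely does not."""
        first, last = lo >> self.shift, hi >> self.shift
        return any(b in self.buckets for b in range(first, last + 1))


def truly_intersects(keys, lo, hi):
    """Ground truth, used here only to spot false positives."""
    return any(lo <= k <= hi for k in keys)


keys = [100, 500, 900]
toy_filter = PrefixRangeFilter(keys, shift=4)  # buckets of 16 consecutive values

# Far from every key: correctly rejected, so no disk access is needed.
far_query = toy_filter.may_intersect(300, 310)

# Right next to key 100 but containing no key: same bucket, so a false positive.
near_query = toy_filter.may_intersect(101, 105)
```

With uncorrelated queries most empty ranges fall into empty buckets and are rejected. A workload that keeps asking for ranges adjacent to existing keys hits occupied buckets almost every time, which is the adversarial behaviour described above.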

The highlight of our conversation is Grafite, a novel range filter introduced by Marco and his team. Unlike previous approaches, Grafite comes with clear theoretical guarantees and offers robust performance across various datasets, query sizes, and workloads. Marco dives into the technicalities, explaining how Grafite delivers faster query times and maintains predictable false positive rates, making it the most reliable range filter in scenarios where queries are correlated with keys. Additionally, he introduces a simple heuristic filter that excels in uncorrelated queries, pushing the boundaries of current solutions in the field.

SIGMOD'24 Paper - Grafite: Taming Adversarial Queries with Optimal Range Filters

37 mins

High Impact in Databases with... Ali Dasdan

Oct 8 2024
In this High Impact episode we talk to Ali Dasdan, CTO at ZoomInfo. Tune in to hear Ali's story and learn about some of his most impactful work, including "Map-Reduce-Merge".

The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust.

Materials mentioned in this episode:
Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters (SIGMOD'07)
The Art of Doing Science and Engineering: Learning to Learn, Richard Hamming
How to Solve It, George Polya
Systems Architecting: Creating & Building Complex Systems, Eberhardt Rechtin

You can find Ali on:
Twitter
LinkedIn

1 hr and 3 mins

Matt Perron | Analytical Workload Cost and Performance Stability With Elastic Pools | #57

Jul 22 2024
In this episode, we dive deep into the complexities of managing analytical query workloads with our guest, Matt Perron. Matt explains how the rapid and unpredictable fluctuations in resource demands present a significant challenge for provisioning. Traditional methods often lead to either over-provisioning, resulting in excessive costs, or under-provisioning, which causes poor query latency during demand spikes. However, there's a promising solution on the horizon. Matt shares insights from recent research that showcases the viability of using cloud functions to dynamically match compute supply with workload demand without the need for prior resource provisioning. While effective for low query volumes, this approach becomes cost-prohibitive as query volumes increase, highlighting the need for a more balanced strategy.

Matt introduces us to a novel strategy that combines the best of both worlds: the rapid scalability of cloud functions and the cost-effectiveness of virtual machines. This innovative approach leverages the fast but expensive cloud functions alongside slow-starting yet inexpensive virtual machines to provide elasticity without sacrificing cost efficiency. He elaborates on how their implementation, called Cackle, achieves consistent performance and cost savings across a wide range of workloads and conditions. Tune in to learn how Cackle avoids the pitfalls of traditional approaches, delivering stable query performance and minimizing costs even as demand fluctuates wildly.
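The trade-off between the two resource types can be shown with a little arithmetic. The Python sketch below is a deliberately simplified static model, not Cackle's actual policy: it assumes a fixed number of always-on VM slots, sends any demand above that to cloud functions, and uses made-up integer prices. The names hybrid_cost, best_vm_slots, vm_price, and fn_price are hypothetical.

```python
def hybrid_cost(demand, vm_slots, vm_price, fn_price):
    """Cost of serving demand with vm_slots always-on VM slots plus functions.

    demand is the number of task slots needed in each time step. VM slots are
    paid for in every step whether used or not; overflow is paid per slot-step
    at the higher cloud-function price.
    """
    vm_cost = vm_slots * vm_price * len(demand)
    overflow = sum(max(0, d - vm_slots) for d in demand)
    return vm_cost + overflow * fn_price


def best_vm_slots(demand, vm_price, fn_price):
    """Brute-force the VM slot count that minimises the hybrid cost."""
    return min(
        range(max(demand) + 1),
        key=lambda v: hybrid_cost(demand, v, vm_price, fn_price),
    )


# Hypothetical workload: steady demand of 2 slots with one spike to 10.
demand = [2, 2, 2, 10, 2, 2]
vm_price, fn_price = 1, 5  # illustrative prices per slot per time step

functions_only = hybrid_cost(demand, 0, vm_price, fn_price)
peak_provisioned = hybrid_cost(demand, max(demand), vm_price, fn_price)
slots = best_vm_slots(demand, vm_price, fn_price)
hybrid = hybrid_cost(demand, slots, vm_price, fn_price)
```

In this toy workload, covering the steady baseline with VMs and absorbing the spike with functions is cheaper than either pure strategy. A real system also has to react to demand it cannot see in advance and to VM start-up delay, which this static model ignores.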

Links:
Cackle: Analytical Workload Cost and Performance Stability With Elastic Pools [SIGMOD'24]
Matt's Homepage

52 mins

High Impact in Databases with... Andreas Kipf

Jul 15 2024
In this High Impact episode we talk to Andreas Kipf about his work on "Learned Cardinalities".

Andreas is the Professor of Data Systems at Technische Universität Nürnberg (UTN). Tune in to hear Andreas's story and learn about some of his most impactful work.

The podcast is proudly sponsored by Pometry, the developers behind Raphtory, the open-source temporal graph analytics engine for Python and Rust.

Papers mentioned in this episode:
Learned Cardinalities: Estimating Correlated Joins with Deep Learning (CIDR'19)
The Case for Learned Index Structures (SIGMOD'18)
Adaptive Optimization of Very Large Join Queries (SIGMOD'18)

You can find Andreas on:
Twitter
LinkedIn
Google Scholar
Data Systems Lab @ UTN

53 mins
