
Denny’s top session picks for Data + AI Summit 2025

Data + AI Summit 2025 is just a few weeks away! This year, we’re offering our largest selection of sessions ever, with more than 700 to choose from. Register to join us in person in San Francisco or virtually.

With a career rooted in open source, I’ve seen firsthand how open technologies and formats are increasingly central to enterprise strategy. As a long-time contributor to Apache Spark™ and MLflow, a maintainer and committer for Delta Lake and Unity Catalog, and most recently a contributor to Apache Iceberg™, I’ve had the privilege of working alongside some of the brightest minds in the industry.

For this year’s sessions, I’m focusing on the intersection of open source and AI, with a particular interest in multimodal AI. Specifically, how open table formats like Delta Lake and Iceberg, combined with unified governance through Unity Catalog, are powering the next wave of real-time, trustworthy AI and analytics.

My Top Picks

The upcoming Apache Spark 4.1: The Next Chapter in Unified Analytics

Apache Spark™ has long been recognized as the leading open-source unified analytics engine, combining a simple yet powerful API with a rich ecosystem and top-notch performance. In the upcoming Spark 4.1 release, the community reimagines Spark to excel at both large cluster deployments and local laptop development (see the minimal local-mode sketch after the speaker list). Listen in and ask questions of:

  • Xiao Li, an Engineering Director at Databricks, an Apache Spark Committer, and a PMC member.
  • DB Tsai, an engineering leader on the Databricks Spark team, and an Apache Spark Project Management Committee (PMC) member and committer.
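
As a taste of the laptop-development story, here is a minimal sketch of a purely local Spark session, assuming nothing beyond `pip install pyspark` (the DataFrame contents are illustrative):

```python
# A single-machine Spark session: everything runs in-process on local cores.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")   # use all local cores, no cluster manager needed
    .appName("laptop-dev")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "delta"), (2, "iceberg")], ["id", "table_format"])
df.show()

spark.stop()
```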

Iceberg Geo Type: Transforming Geospatial Data Management at Scale

Geospatial data is becoming increasingly important for lakehouse formats. Learn from Jia Yu, Co-founder and Chief Architect of Wherobots Inc., and Szehon Ho, Software Engineer at Databricks, about the latest and greatest around the geospatial data types in Apache Iceberg™.

Let’s Save Tons of Money with Cloud-native Data Ingestion!

R. Tyler Croy of Scribd, Delta Lake maintainer and shepherd of delta-rs since its inception, will dive into the cloud-native architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis Data Firehose, and more. By using off-the-shelf open-source tools like kafka-delta-ingest, oxbow, and Airbyte, Scribd has redefined its ingestion architecture to be more event-driven, reliable, and most importantly, cheaper. No jobs needed!

This session will dig into the value proposition of a lakehouse architecture and the cost efficiencies found across the Rust/Arrow/Python ecosystems.
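
To give a flavor of that Rust/Arrow/Python ecosystem, here is a minimal sketch of jobless ingestion using the `deltalake` (delta-rs) Python bindings; the table path and schema are hypothetical:

```python
# Append a batch of events to a Delta table without any Spark cluster.
import pyarrow as pa
from deltalake import DeltaTable, write_deltalake

# Stand-in for a micro-batch pulled off SQS/Kinesis.
events = pa.table({"event_id": [1, 2, 3], "payload": ["a", "b", "c"]})

# In production this path would be object storage, e.g. s3://bucket/events.
table_uri = "/tmp/events_delta"
write_deltalake(table_uri, events, mode="append")

# Read it back to verify the commit landed.
dt = DeltaTable(table_uri)
print(dt.to_pyarrow_table().num_rows)
```

Tools like kafka-delta-ingest and oxbow apply this same no-cluster pattern as small, long-running services rather than scheduled jobs.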

Daft and Unity Catalog: a multimodal/AI-native lakehouse

Multimodal AI will fundamentally change the landscape, as data is more than just tables. Workflows now often involve documents, images, audio, video, embeddings, URLs, and more.

In this session, Jay Chia, Co-founder of Eventual, will show how Daft, a popular multimodal data framework, combined with Unity Catalog can help unify authentication, authorization, and data lineage, providing a holistic view of governance.
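
As a rough sketch of what that pairing can look like, assuming Daft’s `daft.unity_catalog` integration (the endpoint, token, and table name are placeholders):

```python
import daft
from daft.unity_catalog import UnityCatalog

# Authenticate once against the catalog...
unity = UnityCatalog(
    endpoint="https://<your-workspace>.cloud.databricks.com",
    token="<personal-access-token>",
)

# ...then load a governed table and read it as a Daft DataFrame.
unity_table = unity.load_table("my_catalog.my_schema.my_table")
df = daft.read_deltalake(unity_table)
df.show()
```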

Bridging Big Data and AI: Empowering PySpark with Lance Format for Multi-Modal AI Data Pipelines

PySpark has long been a cornerstone of big data processing, but the rise of multimodal AI and vector search introduces challenges beyond its traditional capabilities. Spark’s new Python data source API enables integration with emerging AI data lakes built on the multi-modal Lance format.

This session will dive into how the Lance format works and why it is a critical component for multimodal AI data pipelines. Allison Wang, Apache Spark™ committer, and Li Qiu, LanceDB Database Engineer and Alluxio PMC member, will show how combining Apache Spark (PySpark) and LanceDB lets you advance multi-modal AI data pipelines.
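
For a sense of the mechanics, here is a hypothetical, heavily simplified reader built on Spark’s Python data source API (`pyspark.sql.datasource`, Spark 4.0+) and the `lance` package; the class names, schema, and path are illustrative assumptions, not the session’s actual connector (Lance also ships its own Spark integration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.datasource import DataSource, DataSourceReader

class LanceSketchReader(DataSourceReader):
    def __init__(self, options):
        self.path = options["path"]

    def read(self, partition):
        import lance  # open the dataset lazily on the executor
        for batch in lance.dataset(self.path).to_batches():
            for row in batch.to_pylist():
                yield (row["id"], row["vector"])

class LanceSketchDataSource(DataSource):
    @classmethod
    def name(cls):
        return "lance_sketch"

    def schema(self):
        # Assumed layout: an id column plus an embedding column.
        return "id int, vector array<float>"

    def reader(self, schema):
        return LanceSketchReader(self.options)

spark = SparkSession.builder.getOrCreate()
spark.dataSource.register(LanceSketchDataSource)
df = spark.read.format("lance_sketch").option("path", "/data/embeddings.lance").load()
df.show()
```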

Streamlining DSPy Development: Track, Debug and Deploy with MLflow

Chen Qian, Senior Software Engineer at Databricks, will show how to integrate MLflow with DSPy to bring full observability to your DSPy development.

You’ll get to see how to track DSPy module calls, evaluations, and optimizers using MLflow’s tracing and autologging capabilities. Combining these two tools makes it easier to debug, iterate on, and understand your DSPy workflows, and then deploy your DSPy program end to end.
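
A minimal sketch of what that wiring looks like, assuming `mlflow` 2.18+ (which ships the DSPy flavor) and `dspy`; the model and experiment names are placeholders:

```python
import dspy
import mlflow

mlflow.dspy.autolog()             # capture traces of DSPy module calls
mlflow.set_experiment("dspy-dev")

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

qa = dspy.ChainOfThought("question -> answer")
result = qa(question="What table format does Delta Lake use?")
print(result.answer)              # the full trace shows up in the MLflow UI
```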

From Code Completion to Autonomous Software Engineering Agents

Kilian Lieret, Research Software Engineer at Princeton University, was recently a guest on the Data Brew videocast for a fascinating discussion on new tools for evaluating and improving AI in software engineering.

This session is an extension of that conversation, where Kilian will dig into SWE-bench (a benchmark) and SWE-agent (an agent framework), the current frontier of agentic AI for developers, and how to experiment with AI agents.

Composing high-accuracy AI systems with SLMs and mini-agents

The always-amazing Sharon Zhou, CEO and Founder of Lamini, discusses how to utilize small language models (SLMs) and mini-agents to reduce hallucinations using Mixture of Memory Experts (i.e., MoME knows best)!

Find out a little bit more about MoME in this fun Data Brew by Databricks episode featuring Sharon: Mixture of Memory Experts.

Beyond the Tradeoff: Differential Privacy in Tabular Data Synthesis

Differential privacy is an essential tool for providing mathematical guarantees around protecting the privacy of the individuals behind the data. This talk by Lipika Ramaswamy of Gretel.ai (now part of NVIDIA) explores using Gretel Navigator to generate differentially private synthetic data that maintains high fidelity to the source data and high utility on downstream tasks across heterogeneous datasets.
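
For reference, the mathematical guarantee at the heart of the talk: a randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D′ that differ in a single individual’s record, and any set of outputs S:

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
```

Smaller ε means the presence or absence of any one person changes the output distribution less, which is exactly what makes the resulting synthetic data defensible.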


Building Knowledge Agents to Automate Document Workflows
One of the biggest promises of LLM agents is automating all knowledge work over unstructured data (we call these "knowledge agents"). Jerry Liu, Founder of LlamaIndex, dives into how to create knowledge agents to automate document workflows, showcasing how something that is often complex to implement can become a simplified flow for a common enterprise process.
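
As a starting point for this kind of workflow, here is a minimal sketch using LlamaIndex’s core document-indexing API, assuming `llama-index` is installed with an OpenAI key in the environment (the directory and question are placeholders):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load a folder of unstructured documents (PDFs, text files, etc.).
documents = SimpleDirectoryReader("./contracts").load_data()

# Build a vector index and query it in natural language.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("Summarize the termination clauses."))
```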
