How ChatSlide Leverages Apache Gravitino for Multi-Modal Data Catalog Management at Scale
With over 150,000 users and 500+ daily active users generating presentations, videos, and multi-modal content, ChatSlide faced a critical challenge: managing an increasingly complex and diverse data catalog. Here's how we solved it with Apache Gravitino.
The Challenge: Multi-Modal Data at Scale
At ChatSlide, we don't just handle text. Our platform processes and generates:
- Presentation Slides: PowerPoint, PDF, and web-based formats
- AI-Generated Videos: with avatars, voiceovers, and animations
- Image Assets: charts, diagrams, and AI-generated visuals
- Audio Content: voice cloning and text-to-speech outputs
- User Documents: uploaded PDFs, Word docs, and research papers
- Metadata & Analytics: usage patterns, performance metrics, and user data
With users generating thousands of presentations daily, our data infrastructure needed to support:
- Real-time stream processing for live collaboration and instant content generation
- Machine learning pipelines for AI avatar generation, content recommendations, and quality enhancement
- Data lake storage for long-term archival and analytics across structured and unstructured data
Enter Apache Gravitino: A Unified Metadata Lake
Apache Gravitino is a high-performance, geo-distributed, and federated metadata lake that provides a unified metadata abstraction layer across diverse data sources and AI assets.
- Unified metadata abstraction across heterogeneous sources
- Direct system integration with bidirectional sync
- Comprehensive data governance and access control
- Multi-region, geo-distributed architecture
- Multi-engine support (Trino, Spark, Flink)
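For readers who have not used Gravitino before, here is a minimal sketch of how the top-level metalake can be created through its REST API. It assumes a Gravitino server on its default port (8090); the metalake name and comment are placeholders, and the request payload follows Gravitino's OpenAPI spec but may vary slightly between releases.

```python
import requests

GRAVITINO_URL = "http://localhost:8090"  # assumes a Gravitino server on the default port

# Create the metalake: the top-level container that all catalogs hang off of.
resp = requests.post(
    f"{GRAVITINO_URL}/api/metalakes",
    json={
        "name": "chatslide",  # placeholder metalake name
        "comment": "Unified metadata lake for multi-modal content",
        "properties": {},
    },
)
resp.raise_for_status()
print(resp.json())
```

The catalog registrations sketched later in this post assume this metalake exists.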
Why Gravitino Was Perfect for ChatSlide
The combination of multi-modal data support, real-time processing capabilities, and federated metadata management made Gravitino the ideal solution for our complex data catalog challenges. Its active community and strong engineering team in the Bay Area also meant we could collaborate closely with experts who understand enterprise-scale data challenges.
ChatSlide's Data Architecture with Gravitino
Unified Metadata Lake
- Real-time streams: Gravitino manages metadata for our Kafka streams, enabling real-time collaboration features and instant content generation with full data lineage tracking.
- Machine learning pipelines: our models for avatar generation and content enhancement are registered as first-class assets in Gravitino, with complete versioning and lineage.
- Data lake: all user-generated content, from presentations to videos, is cataloged in our S3 data lake through Gravitino's unified interface, ensuring consistent governance.
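To show how these layers plug into the metalake from the earlier sketch, the snippet below registers a Kafka messaging catalog and a Hadoop-based fileset catalog pointing at S3. The broker address, bucket path, and catalog names are hypothetical, and the exact catalog properties (particularly S3 credential and filesystem settings) depend on the Gravitino version and deployment.

```python
import requests

GRAVITINO_URL = "http://localhost:8090"
METALAKE = "chatslide"  # placeholder metalake from the earlier sketch
CATALOGS_URL = f"{GRAVITINO_URL}/api/metalakes/{METALAKE}/catalogs"

# Messaging catalog: lets Gravitino track Kafka topic metadata for the streaming layer.
requests.post(CATALOGS_URL, json={
    "name": "collab_streams",
    "type": "MESSAGING",
    "provider": "kafka",
    "comment": "Topics backing live collaboration",
    "properties": {"bootstrap.servers": "kafka-1:9092"},  # placeholder broker address
}).raise_for_status()

# Fileset catalog: catalogs unstructured assets (slides, videos, audio) stored in S3.
requests.post(CATALOGS_URL, json={
    "name": "content_lake",
    "type": "FILESET",
    "provider": "hadoop",
    "comment": "User-generated content in the S3 data lake",
    # placeholder bucket; real deployments also need S3 credential/filesystem properties
    "properties": {"location": "s3a://chatslide-content/"},
}).raise_for_status()
```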
How We Use Gravitino: Real-World Use Cases
When a user searches for presentations on "climate change," Gravitino helps us discover not just text-based slides, but also related videos, audio clips, images, and data visualizations across all our storage systems.
Result: 3x faster search queries across multi-modal content with complete lineage tracking.
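As a simplified illustration of that discovery pattern, the sketch below walks the schemas and filesets of one catalog through the REST API and filters on a keyword client-side. It reuses the hypothetical names from the earlier sketches, and the response shapes follow our reading of Gravitino's OpenAPI spec; a production setup would lean on tags or a search index rather than client-side filtering.

```python
import requests

GRAVITINO_URL = "http://localhost:8090"
METALAKE, CATALOG = "chatslide", "content_lake"  # placeholder names from earlier sketches
SCHEMAS_URL = f"{GRAVITINO_URL}/api/metalakes/{METALAKE}/catalogs/{CATALOG}/schemas"

def find_assets(keyword: str):
    """Yield (qualified name, storage location) for filesets whose comment mentions the keyword."""
    for schema in requests.get(SCHEMAS_URL).json().get("identifiers", []):
        schema_name = schema["name"]
        filesets_url = f"{SCHEMAS_URL}/{schema_name}/filesets"
        for fs in requests.get(filesets_url).json().get("identifiers", []):
            detail = requests.get(f"{filesets_url}/{fs['name']}").json().get("fileset", {})
            if keyword.lower() in (detail.get("comment") or "").lower():
                yield f"{schema_name}.{fs['name']}", detail.get("storageLocation")

for name, location in find_assets("climate change"):
    print(name, "->", location)
```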
Every AI-generated avatar, voice clone, and visual asset is tracked through Gravitino. We know exactly which ML model version created what content, using which training data, and when.
Result: 100% audit compliance and the ability to reproduce any AI output for quality assurance.
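One way to record that provenance, sketched below, is to attach it as key-value properties when a generated asset is registered as a fileset. The property keys (model_version, training_dataset, generated_at) are our own convention rather than Gravitino built-ins, and the schema, asset name, and storage path are placeholders.

```python
import requests
from datetime import datetime, timezone

GRAVITINO_URL = "http://localhost:8090"
METALAKE, CATALOG, SCHEMA = "chatslide", "content_lake", "generated_videos"  # placeholders
FILESETS_URL = (
    f"{GRAVITINO_URL}/api/metalakes/{METALAKE}"
    f"/catalogs/{CATALOG}/schemas/{SCHEMA}/filesets"
)

# Register a generated video as an external fileset and record its provenance as properties.
requests.post(FILESETS_URL, json={
    "name": "avatar_video_8f2c",  # hypothetical asset id
    "type": "EXTERNAL",
    "comment": "AI avatar video for presentation 8f2c",
    "storageLocation": "s3a://chatslide-content/videos/8f2c/",  # placeholder path
    "properties": {
        "model_version": "avatar-gen-v3.2",        # our convention, not a Gravitino built-in
        "training_dataset": "voice-corpus-2024-06",
        "generated_at": datetime.now(timezone.utc).isoformat(),
    },
}).raise_for_status()
```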
With data spread across AWS S3, PostgreSQL, Redis, and Kafka, Gravitino provides a single control plane for access policies, ensuring GDPR and SOC 2 compliance across all systems.
Result: Centralized governance reduced our compliance overhead by 60%.
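As one hedged example of how that control plane can be expressed, recent Gravitino releases include tag management that can back this kind of policy labeling. The sketch below creates a "pii" tag and attaches it to a catalog; the endpoints and payloads follow our reading of the tag-management docs and may differ across versions, and all names are placeholders.

```python
import requests

GRAVITINO_URL = "http://localhost:8090"
METALAKE = "chatslide"  # placeholder metalake from the earlier sketches

# Create a governance tag (assumes the tag-management API in recent Gravitino releases).
requests.post(f"{GRAVITINO_URL}/api/metalakes/{METALAKE}/tags", json={
    "name": "pii",
    "comment": "Assets containing personal data (GDPR scope)",
    "properties": {},
}).raise_for_status()

# Associate the tag with the content catalog so access policies can key off it.
requests.post(
    f"{GRAVITINO_URL}/api/metalakes/{METALAKE}/objects/catalog/content_lake/tags",
    json={"tagsToAdd": ["pii"], "tagsToRemove": []},
).raise_for_status()
```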
When multiple users collaborate on a presentation, Gravitino tracks all changes, versions, and contributors in real time through our Kafka streams, maintaining a complete audit trail.
Result: Zero data conflicts and complete version history for team projects.
Measurable Impact on ChatSlide's Infrastructure
- Unified Data Discovery: single interface for all data assets
- Reduced Compliance Overhead: 60% less time on audit preparation
- Faster Feature Development: new data integrations in days, not weeks
- Complete Data Lineage: end-to-end tracking for all assets
Looking Forward: The Future of Data Catalogs
As ChatSlide continues to grow and Apache Gravitino evolves, we're excited about several upcoming capabilities:
Gravitino's upcoming AI asset management features will allow us to treat ML models, training datasets, and AI-generated content as first-class citizens with comprehensive versioning and governance.
As we expand globally, Gravitino's geo-distributed architecture will enable us to maintain unified metadata across AWS, GCP, and Azure deployments.
We're working with the Gravitino team to build rich visualization tools that help our engineers understand complex data flows across our multi-modal content pipelines.
Integration with Kafka and Flink will enable real-time metadata updates, critical for our live collaboration features and instant content generation.
Conclusion
Managing data catalogs at scale is one of the most challenging problems in modern data engineering, especially when dealing with multi-modal content across real-time streams, machine learning pipelines, and data lakes. Apache Gravitino has proven to be an invaluable solution for ChatSlide's complex data infrastructure.
The combination of unified metadata management, comprehensive governance, and active community support has enabled us to scale confidently from 100K to 150K+ users while maintaining the performance and reliability our users expect.
If you're facing similar challenges with multi-modal data catalogs, real-time processing, or machine learning metadata management, I highly recommend exploring Apache Gravitino.
The project has a strong team, active community, and clear roadmap that makes it a safe bet for production deployments. With 2.3K GitHub stars and growing, it's clear the industry recognizes the value of unified metadata management.
Want to see how ChatSlide handles multi-modal content?
Try ChatSlide Free

About the Author
Quanlai Li is the founder of ChatSlide, an AI-powered presentation platform serving 150,000+ users globally. With a background in distributed systems and machine learning, Quanlai is passionate about building scalable data infrastructure and democratizing content creation through AI.