Technical Deep Dive

How ChatSlide Leverages Apache Gravitino for Multi-Modal Data Catalog Management at Scale

By Quanlai Li, Founder of ChatSlide
November 24, 2025

With over 150,000 users and 500+ daily active users generating presentations, videos, and multi-modal content, ChatSlide faced a critical challenge: managing an increasingly complex and diverse data catalog. Here's how we solved it with Apache Gravitino.

The Challenge: Multi-Modal Data at Scale

At ChatSlide, we don't just handle text. Our platform processes and generates:

  • Presentation Slides: PowerPoint, PDF, and web-based formats
  • AI-Generated Videos: with avatars, voiceovers, and animations
  • Image Assets: charts, diagrams, and AI-generated visuals
  • Audio Content: voice cloning and text-to-speech outputs
  • User Documents: uploaded PDFs, Word docs, and research papers
  • Metadata & Analytics: usage patterns, performance metrics, and user data

With users generating thousands of presentations daily, our data infrastructure needed to support:

  • Real-time stream processing for live collaboration and instant content generation
  • Machine learning pipelines for AI avatar generation, content recommendations, and quality enhancement
  • Data lake storage for long-term archival and analytics across structured and unstructured data

Enter Apache Gravitino: A Unified Metadata Lake

Apache Gravitino is a high-performance, geo-distributed, and federated metadata lake that provides a unified metadata abstraction layer across diverse data sources and AI assets.

Open Source & Community-Driven
GitHub Repository
Community: Active contributors worldwide
Core Team: San Francisco Bay Area

Key Capabilities

Unified metadata abstraction across heterogeneous sources

Direct system integration with bidirectional sync

Comprehensive data governance and access control

Multi-region, geo-distributed architecture

Multi-engine support (Trino, Spark, Flink)
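
To make "unified metadata abstraction" more concrete, here is a minimal in-memory sketch in Python. It assumes a metalake → catalog → schema → entity naming hierarchy of the kind Gravitino documents. The `MetadataRegistry` class, its methods, and all example names are hypothetical illustrations, not Gravitino's client API or ChatSlide's actual setup.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """One catalogued asset, addressed by a four-part hierarchical name."""
    metalake: str
    catalog: str
    schema: str
    name: str
    kind: str  # e.g. "table", "fileset", "topic", "model"

    @property
    def qualified_name(self) -> str:
        return ".".join((self.metalake, self.catalog, self.schema, self.name))


@dataclass
class MetadataRegistry:
    """Hypothetical stand-in for a unified catalog: one lookup path for every source."""
    _entities: dict = field(default_factory=dict)

    def register(self, entity: Entity) -> None:
        self._entities[entity.qualified_name] = entity

    def get(self, qualified_name: str) -> Entity:
        return self._entities[qualified_name]

    def list_by_kind(self, kind: str) -> list:
        return sorted(e.qualified_name for e in self._entities.values() if e.kind == kind)


# Assumed example assets living in three different backing systems.
registry = MetadataRegistry()
registry.register(Entity("chatslide", "postgres", "public", "users", "table"))
registry.register(Entity("chatslide", "s3_lake", "videos", "rendered", "fileset"))
registry.register(Entity("chatslide", "kafka", "default", "slide_edits", "topic"))

print(registry.list_by_kind("fileset"))
```

The point of the sketch is that a relational table, an object-store fileset, and a stream topic are all reachable through the same naming scheme and the same lookup call.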

Why Gravitino Was Perfect for ChatSlide

The combination of multi-modal data support, real-time processing capabilities, and federated metadata management made Gravitino the ideal solution for our complex data catalog challenges. Its active community and strong engineering team in the Bay Area also meant we could collaborate closely with experts who understand enterprise-scale data challenges.

ChatSlide's Data Architecture with Gravitino

Application Layer: ChatSlide Web, Mobile Apps, API Services

Apache Gravitino: Unified Metadata Lake

Data Sources: PostgreSQL, S3 / HDFS, Kafka Streams, ML Models, Redis Cache, Object Storage

Real-Time Processing

Gravitino manages metadata for our Kafka streams, enabling real-time collaboration features and instant content generation with full data lineage tracking.

ML Pipeline Integration

Our machine learning models for avatar generation and content enhancement are registered as first-class assets in Gravitino, with complete versioning and lineage.

Data Lake Storage

All user-generated content, from presentations to videos, is cataloged in our S3 data lake through Gravitino's unified interface, ensuring consistent governance.
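
One way to picture cataloging object storage behind a unified interface is a mapping from logical names to physical locations, so callers never hard-code bucket paths. The sketch below is a simplified assumption of that idea. The fileset names, the bucket, and the `resolve` helper are hypothetical and do not reflect Gravitino's fileset API.

```python
# Hypothetical mapping from logical fileset names to physical storage roots.
FILESET_LOCATIONS = {
    "content.presentations": "s3://example-bucket/presentations",
    "content.videos": "s3://example-bucket/videos",
}


def resolve(fileset: str, relative_path: str) -> str:
    """Translate a logical (fileset, path) pair into a physical object URI."""
    root = FILESET_LOCATIONS[fileset].rstrip("/")
    return f"{root}/{relative_path.lstrip('/')}"


print(resolve("content.videos", "user_123/intro.mp4"))
```

Because callers only know the logical name, the physical location can change without touching application code.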

How We Use Gravitino: Real-World Use Cases

1. Multi-Modal Content Discovery

When a user searches for presentations on "climate change," Gravitino helps us discover not just text-based slides, but also related videos, audio clips, images, and data visualizations across all our storage systems.

Result: 3x faster search queries across multi-modal content with complete lineage tracking.
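
The discovery pattern described above can be sketched as a single tag lookup that returns matches grouped by modality, whatever store each asset lives in. The asset list, tags, and `discover` function below are hypothetical. A real deployment would query the catalog rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical catalog entries: (asset name, modality, backing store, tags).
ASSETS = [
    ("deck_climate_101", "slides", "postgresql", {"climate change", "education"}),
    ("video_climate_intro", "video", "s3", {"climate change"}),
    ("chart_co2_trend", "image", "s3", {"climate change", "data"}),
    ("voiceover_q3_sales", "audio", "s3", {"sales"}),
]


def discover(tag: str) -> dict:
    """Return matching asset names grouped by modality, regardless of store."""
    grouped = defaultdict(list)
    for name, modality, _store, tags in ASSETS:
        if tag in tags:
            grouped[modality].append(name)
    return dict(grouped)


print(discover("climate change"))
```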

2. Data Lineage for AI-Generated Content

Every AI-generated avatar, voice clone, and visual asset is tracked through Gravitino. We know exactly which ML model version created what content, using which training data, and when.

Result: 100% audit compliance and the ability to reproduce any AI output for quality assurance.
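
As an illustration of the kind of record this implies, here is a minimal lineage log in Python. It assumes each generated asset stores the model name, model version, training dataset, and creation time. The `LineageRecord` and `LineageLog` types and all identifiers are hypothetical, not Gravitino's lineage model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageRecord:
    """Which model version produced an asset, from which dataset, and when."""
    asset_id: str
    model_name: str
    model_version: int
    training_dataset: str
    created_at: datetime


class LineageLog:
    def __init__(self) -> None:
        self._by_asset = {}

    def record(self, rec: LineageRecord) -> None:
        self._by_asset[rec.asset_id] = rec

    def provenance(self, asset_id: str) -> LineageRecord:
        return self._by_asset[asset_id]

    def assets_from(self, model_name: str, model_version: int) -> list:
        """All assets produced by one exact model version (useful for audits)."""
        return sorted(
            r.asset_id
            for r in self._by_asset.values()
            if (r.model_name, r.model_version) == (model_name, model_version)
        )


log = LineageLog()
now = datetime(2025, 11, 24, tzinfo=timezone.utc)
log.record(LineageRecord("avatar_001", "avatar_gen", 3, "faces_v2", now))
log.record(LineageRecord("avatar_002", "avatar_gen", 4, "faces_v2", now))
log.record(LineageRecord("voice_001", "voice_clone", 1, "speech_v1", now))

print(log.assets_from("avatar_gen", 3))
```

Keeping the model version in every record is what makes a question like "which outputs came from version 3?" a simple filter rather than an investigation.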

3. Cross-Platform Data Governance

With data spread across AWS S3, PostgreSQL, Redis, and Kafka, Gravitino provides a single control plane for access policies, ensuring GDPR and SOC 2 compliance across all systems.

Result: Centralized governance reduced our compliance overhead by 60%.
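
A single control plane can be pictured as one policy table that is consulted for every system. The sketch below assumes a simple prefix-based, deny-by-default model. The `Policy` type, role names, and resource names are hypothetical and far simpler than a production access-control system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Grant a role a set of actions on every resource whose name starts with a prefix."""
    role: str
    resource_prefix: str
    actions: frozenset


# Hypothetical policies; resource names encode the backing system as the first segment.
POLICIES = [
    Policy("analyst", "postgresql.analytics.", frozenset({"read"})),
    Policy("analyst", "s3.exports.", frozenset({"read"})),
    Policy("pipeline", "kafka.slide_edits", frozenset({"read", "write"})),
]


def is_allowed(role: str, action: str, resource: str) -> bool:
    """Deny by default; allow only if a policy for the role covers the resource and action."""
    return any(
        p.role == role and resource.startswith(p.resource_prefix) and action in p.actions
        for p in POLICIES
    )


print(is_allowed("analyst", "read", "postgresql.analytics.usage"))
print(is_allowed("analyst", "write", "s3.exports.report_2025"))
```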

4. Real-Time Collaboration Metadata

When multiple users collaborate on a presentation, Gravitino tracks all changes, versions, and contributors in real time through our Kafka streams, maintaining a complete audit trail.

Result: Zero data conflicts and complete version history for team projects.
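
To illustrate how an ordered event stream can yield a version history, here is a small sketch that folds edit events into per-presentation version lists. It assumes each event carries a stream offset that defines a total order. The `EditEvent` type and `build_history` function are hypothetical, and the sketch does not use a Kafka client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditEvent:
    """One change to a presentation, as it might arrive from a stream."""
    offset: int  # position in the stream; defines a total order
    presentation_id: str
    user: str
    change: str


def build_history(events: list) -> dict:
    """Fold events into per-presentation version lists, ordered by stream offset."""
    history = {}
    for ev in sorted(events, key=lambda e: e.offset):
        versions = history.setdefault(ev.presentation_id, [])
        versions.append({"version": len(versions) + 1, "user": ev.user, "change": ev.change})
    return history


# Events deliberately listed out of order to show that offsets decide the sequence.
events = [
    EditEvent(2, "deck_42", "bob", "edit slide 3 title"),
    EditEvent(1, "deck_42", "alice", "add slide 3"),
    EditEvent(3, "deck_7", "alice", "delete slide 1"),
]
history = build_history(events)
print([v["user"] for v in history["deck_42"]])
```

Ordering by offset rather than arrival time is the design choice that keeps two collaborators' edits from being applied in different orders on different replicas.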

Measurable Impact on ChatSlide's Infrastructure

Performance Improvements
Query Performance: +240%
Metadata Discovery: +180%
System Integration Time: -65%

Operational Benefits

  • Unified Data Discovery: single interface for all data assets
  • Reduced Compliance Overhead: 60% less time on audit preparation
  • Faster Feature Development: new data integrations in days, not weeks
  • Complete Data Lineage: end-to-end tracking for all assets

Looking Forward: The Future of Data Catalogs

As ChatSlide continues to grow and Apache Gravitino evolves, we're excited about several upcoming capabilities:

Enhanced AI Asset Management

Gravitino's upcoming AI asset management features will allow us to treat ML models, training datasets, and AI-generated content as first-class citizens with comprehensive versioning and governance.

Multi-Cloud Federation

As we expand globally, Gravitino's geo-distributed architecture will enable us to maintain unified metadata across AWS, GCP, and Azure deployments.

Advanced Lineage Visualization

We're working with the Gravitino team to build rich visualization tools that help our engineers understand complex data flows across our multi-modal content pipelines.

Real-Time Metadata Streaming

Integration with Kafka and Flink will enable real-time metadata updates, critical for our live collaboration features and instant content generation.

Conclusion

Managing data catalogs at scale is one of the most challenging problems in modern data engineering, especially when dealing with multi-modal content across real-time streams, machine learning pipelines, and data lakes. Apache Gravitino has proven to be an invaluable solution for ChatSlide's complex data infrastructure.

The combination of unified metadata management, comprehensive governance, and active community support has enabled us to scale confidently from 100K to 150K+ users while maintaining the performance and reliability our users expect.

If you're facing similar challenges with multi-modal data catalogs, real-time processing, or machine learning metadata management, I highly recommend exploring Apache Gravitino.

The project has a strong team, an active community, and a clear roadmap that make it a safe bet for production deployments. With 2.3K GitHub stars and growing, it's clear the industry recognizes the value of unified metadata management.


About the Author

Quanlai Li is the founder of ChatSlide, an AI-powered presentation platform serving 150,000+ users globally. With a background in distributed systems and machine learning, Quanlai is passionate about building scalable data infrastructure and democratizing content creation through AI.