Distributed Real-time Data Store with Flexible Deduplication

Product

Jan 18, 2017

8 min read

Jin Hao Wan

Software Developer, Amplitude

In the world of “big data,” businesses that can quickly discover and act upon insights from their users’ events have a decisive advantage. It is no longer sufficient for analytics systems to rely solely on daily batch processing. This is why our new column store, Nova, continues to use a lambda architecture: in addition to a batch layer, it has a real-time layer that processes event data as they come in and only needs to maintain the last day’s events. In a previous post, we focused on the batch layer of Nova. Designing the real-time layer to support incremental updates for a column store creates a different set of requirements and challenges, which we discuss in this post.

Figure: Flow of data through a generic lambda architecture (source)

Design Requirements

Events are the units of data in our system. We think of an event as a database row with columns such as upload time, Amplitude ID, user ID, etc. As these events come in, they are partitioned and distributed across multiple real-time nodes.
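To make this concrete, here is a minimal sketch (in Java) of an event as a row and of hash-based routing to a real-time node. The field names and the partitioning function are illustrative assumptions, not the exact production schema.

```java
// Minimal sketch of an event "row" and how it might be routed to a
// real-time node. Field names and the hash-based partitioner are
// illustrative assumptions, not the actual implementation.
import java.time.Instant;
import java.util.Map;

public final class Event {
    final Instant uploadTime;   // when the event reached our servers
    final long amplitudeId;     // resolved user identity
    final String userId;        // customer-provided user ID
    final String eventType;
    final Map<String, Object> properties;

    Event(Instant uploadTime, long amplitudeId, String userId,
          String eventType, Map<String, Object> properties) {
        this.uploadTime = uploadTime;
        this.amplitudeId = amplitudeId;
        this.userId = userId;
        this.eventType = eventType;
        this.properties = properties;
    }

    /** Pick the real-time node responsible for this event. */
    static int partition(Event e, int numRealtimeNodes) {
        return Math.floorMod(Long.hashCode(e.amplitudeId), numRealtimeNodes);
    }
}
```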

There are four main requirements for Nova’s real-time layer:

  1. Store and support queries on events from the most recent day.
  2. Be optimized for both reads and writes. The write requirement is especially crucial because a real-time node typically updates thousands of times per second as new events come in.
  3. De-duplicate events from the last 10 minutes. This is necessary because a real-time node reads from a message bus such as Kafka, and the message bus could repeatedly send the same batch of events in case of failure.
  4. Be fault-tolerant. A real-time node should be able to recover from a reboot to its previous state.

Druid, a high-performance, real-time column store, inspired a lot of our final design. Most prominently, our real-time layer also has two main components: an in-memory store and a disk store. Druid’s real-time layer, however, does not satisfy the third requirement. For example, it would end up with duplicated events in any of the following scenarios:

  1. A failure happens between a persist (a Druid operation) and its corresponding offset update on the message bus.
  2. The message bus fails to stream events for more than 10 minutes (the interval in the Druid paper) and then resumes streaming; however, the first batch of incoming events was already ingested by the real-time layer before the failure.

The ability to de-duplicate events directly determines how reliable and actionable our real-time insights are. Because of this, we decided not to adopt Druid’s approach wholesale. Our final design differs in how long event data stay in the in-memory store and in how their format changes once they move into the disk store.

In-memory store

To satisfy the deduplication requirement without sacrificing write speed, we must maintain at least 10 minutes’ worth of recent events in memory. This requirement leads us to implement the in-memory store as an LRU cache. The entries in this cache are called “blocks”: groups of events based on their upload times. To be more exact, we divide the timeline into five-minute intervals and group events whose upload times fall in the same interval into a block. Note that blocks are row-oriented, so we can easily de-duplicate incoming events against them.

Figure: Event blocks
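A hedged sketch of such a block: events are keyed to the five-minute interval derived from their upload time, and a per-block set of event IDs filters out events we have already ingested. The event ID used for deduplication is an assumption; the post does not spell out how duplicates are identified.

```java
// Sketch of a row-oriented block covering one five-minute upload-time
// interval. The eventId used for deduplication is an assumption.
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class Block {
    static final long INTERVAL_MS = 5 * 60 * 1000L;    // five-minute blocks

    final long intervalStartMs;                         // block key
    final List<Event> rows = new ArrayList<>();         // row-oriented storage
    private final Set<String> seenEventIds = new HashSet<>();

    Block(long intervalStartMs) {
        this.intervalStartMs = intervalStartMs;
    }

    /** Map an upload time to the start of its five-minute interval. */
    static long keyFor(long uploadTimeMs) {
        return (uploadTimeMs / INTERVAL_MS) * INTERVAL_MS;
    }

    /** Add the event unless it was already ingested; returns true if added. */
    synchronized boolean addIfNew(String eventId, Event event) {
        if (!seenEventIds.add(eventId)) {
            return false;                                // duplicate, drop it
        }
        rows.add(event);
        return true;
    }
}
```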

Let’s go over the ingestion workflow of the in-memory store. Whenever new incoming events (after deduplication) belong to a block, we add them to that block. As its events continue to show up in the stream, the block stays in the cache and keeps updating. Once all its events have passed, the block will not be accessed again unless duplicated events arrive. The time between a block’s last update and its eviction from the cache must be no shorter than our real-time layer’s deduplication window (10 minutes). Therefore, we configure the cache to hold 15 minutes’ worth of event data, so that a block will not be evicted for at least another 10 (15 – 5) minutes.
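In code, the ingestion path might look roughly like the following sketch, where a plain concurrent map stands in for the LRU cache (a Caffeine-based cache is sketched further below) and the 10-minute deduplication window is noted as a constant. The class and method names are assumptions for illustration.

```java
// Rough sketch of the in-memory store's ingestion path. The concurrent map
// is a stand-in for the real LRU cache; retention and eviction are handled
// by the cache in practice.
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class InMemoryStore {
    // The cache must retain each block for at least the deduplication window
    // after its last update, hence 15 minutes of data for 5-minute blocks.
    static final long DEDUP_WINDOW_MS = 10 * 60 * 1000L;

    private final ConcurrentMap<Long, Block> blocks = new ConcurrentHashMap<>();

    /** Ingest one event: find (or create) its block and add it if unseen. */
    boolean ingest(String eventId, Event event) {
        long key = Block.keyFor(event.uploadTime.toEpochMilli());
        Block block = blocks.computeIfAbsent(key, Block::new);
        return block.addIfNew(eventId, event);           // false means duplicate
    }
}
```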


One of the biggest challenges of maintaining an in-memory store is recovery. To make our cache-based system fault-tolerant, each time new events are added to a block, we append them to a corresponding file on disk. This file grows as the block updates: it is essentially a write-ahead log for the block. When a node restarts, we can reconstruct the LRU cache by reading all the persisted files. We optimize these append operations so that they do not compromise the write performance of our store. Once the cache evicts a block, its corresponding file is moved to the disk store to be converted to a columnar format at a later stage.
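Here is a minimal sketch of that write-ahead logging, assuming a hypothetical one-file-per-block layout and a simple line-per-event serialization (the post does not describe the on-disk format): appends go to the block’s file as events arrive, and recovery replays every persisted file to rebuild the cache.

```java
// Minimal write-ahead-log sketch for one block. The one-file-per-block
// layout and line-oriented serialization are illustrative assumptions.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Stream;

final class BlockWal {
    private final Path walFile;

    BlockWal(Path walDir, long blockKey) {
        this.walFile = walDir.resolve("block-" + blockKey + ".wal");
    }

    /** Append newly ingested (already deduplicated) events to the log. */
    void append(List<String> serializedEvents) throws IOException {
        Files.write(walFile, serializedEvents, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** On restart, replay every persisted line to rebuild the block cache. */
    static void recover(Path walDir, Consumer<String> replay) throws IOException {
        try (Stream<Path> files = Files.list(walDir)) {
            for (Path file : (Iterable<Path>) files::iterator) {
                Files.readAllLines(file, StandardCharsets.UTF_8).forEach(replay);
            }
        }
    }
}
```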

Figure: Window TinyLFU scheme (source)

We chose to heavily utilize Caffeine Cache in our implementation because it has several nice properties:

  1. Caffeine is highly concurrent, so it easily satisfies our write-optimization requirement.
  2. We can use Caffeine as a write-through cache, which greatly simplifies our logic for keeping blocks and their persisted files in sync.
  3. Caffeine uses a sophisticated eviction policy called Window TinyLFU that maintains both a “window cache” and a “main cache”. In our event-data-streaming setting, blocks from the most recent five-minute interval will always be admitted into the main cache, so the eviction policy is effectively just Segmented LRU, which serves our needs.
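Putting these properties together, the block cache might be wired up roughly as below. Sizing by event count and handing evicted blocks to the disk store through a removal listener are assumptions about how the pieces fit, not the exact production configuration.

```java
// Sketch of the block cache built on Caffeine. Weighing entries by event
// count and the eviction handoff are illustrative assumptions.
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.RemovalCause;

final class BlockCache {
    /** Receives blocks (and their persisted files) once they are evicted. */
    interface DiskStoreHandoff {
        void acceptEvictedBlock(Long blockKey, Block block);
    }

    private final Cache<Long, Block> cache;

    BlockCache(long maxBufferedEvents, DiskStoreHandoff diskStore) {
        this.cache = Caffeine.newBuilder()
                // Weigh entries by event count so roughly "15 minutes' worth"
                // of data can be approximated with a weight budget (assumed).
                .maximumWeight(maxBufferedEvents)
                .weigher((Long key, Block block) -> block.rows.size())
                // When Window TinyLFU finally evicts a block, hand it (and its
                // write-ahead-log file) over to the disk store.
                .removalListener((Long key, Block block, RemovalCause cause) ->
                        diskStore.acceptEvictedBlock(key, block))
                .build();
    }

    /** Look up the block for a key, creating it on first use. */
    Block getOrCreate(long blockKey) {
        return cache.get(blockKey, Block::new);
    }
}
```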

Disk store

The disk store has two main tasks:

  1. Repeatedly merging a fixed number of block files into a single column-oriented file that we call a “chunk”.
  2. Periodically deleting chunks whose views are ready in the batch layer.

Chunk creation happens whenever enough blocks have been evicted from the in-memory store. By converting multiple small, row-oriented blocks into a single column-oriented chunk, we significantly improve the real-time layer’s query speed. To facilitate handoff between the real-time and batch layers, we label each chunk with a global batch number when it is created. This number increments every time the batch layer finishes processing one day’s events. The disk store runs an hourly task to delete chunks with old batch numbers. This simple handoff scheme lets the real-time layer store only recent events and transition fluidly to the batch layer.

Figure: Real-time data flow
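The two tasks might be shaped roughly like the sketch below, with assumed names (`writeColumnarChunk`, `currentBatchNumber`, `BLOCKS_PER_CHUNK`) and the columnar encoding itself elided: a fixed number of evicted block files becomes one chunk tagged with the current batch number, and an hourly job drops chunks whose batch number the batch layer has already covered.

```java
// Sketch of the disk store's two background tasks. Names and constants are
// assumptions; the columnar encoding itself is elided.
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

final class DiskStore {
    static final int BLOCKS_PER_CHUNK = 16;           // assumed constant

    record Chunk(Path file, long batchNumber) {}

    private final Queue<Path> evictedBlockFiles = new ArrayDeque<>();
    private final List<Chunk> chunks = new ArrayList<>();
    private volatile long currentBatchNumber;         // bumped when the batch layer finishes a day

    /** Merge a fixed number of row-oriented block files into one columnar chunk. */
    synchronized void maybeCreateChunk() {
        while (evictedBlockFiles.size() >= BLOCKS_PER_CHUNK) {
            List<Path> batch = new ArrayList<>();
            for (int i = 0; i < BLOCKS_PER_CHUNK; i++) {
                batch.add(evictedBlockFiles.poll());
            }
            Path chunkFile = writeColumnarChunk(batch);        // encoding elided
            chunks.add(new Chunk(chunkFile, currentBatchNumber));
        }
    }

    /** Hourly task: drop chunks whose day the batch layer has already processed. */
    void scheduleCleanup(ScheduledExecutorService scheduler) {
        scheduler.scheduleAtFixedRate(() -> {
            synchronized (this) {
                chunks.removeIf(c -> c.batchNumber() < currentBatchNumber);
            }
        }, 1, 1, TimeUnit.HOURS);
    }

    private Path writeColumnarChunk(List<Path> blockFiles) {
        throw new UnsupportedOperationException("columnar encoding elided");
    }
}
```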

Querying on the real-time layer

To compute a query result, we fetch data simultaneously from both the in-memory store and the disk store across all real-time nodes. The in-memory store has the drawback of being row-oriented, but it gains speed by keeping all its data in memory. The disk store, on the other hand, has to read from disk, which it makes up for by keeping all its data in an optimized columnar format. When reading from the in-memory store, we make sure to do so through a view of the LRU cache (which Caffeine supports); this way, queries do not affect the access history of the cache.
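As a rough sketch of this fan-out (the query-plan and partial-result types, and the per-node query methods, are assumptions): each node is asked for its in-memory data and its columnar chunks concurrently, and the partial results are merged into one answer.

```java
// Sketch of fanning a query out to both stores on every real-time node and
// merging partial results. All interfaces here are illustrative assumptions.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class RealtimeQueryExecutor {
    interface QueryPlan {}
    interface PartialResult { PartialResult merge(PartialResult other); }

    interface RealtimeNode {
        PartialResult queryInMemory(QueryPlan plan);   // reads a view of the cache
        PartialResult queryChunks(QueryPlan plan);     // scans columnar chunks on disk
    }

    /** Query both stores on every node concurrently and merge the partial results. */
    PartialResult query(QueryPlan plan, List<RealtimeNode> nodes) {
        List<CompletableFuture<PartialResult>> futures = nodes.stream()
                .flatMap(node -> Stream.of(
                        CompletableFuture.supplyAsync(() -> node.queryInMemory(plan)),
                        CompletableFuture.supplyAsync(() -> node.queryChunks(plan))))
                .collect(Collectors.toList());

        return futures.stream()
                .map(CompletableFuture::join)
                .reduce(PartialResult::merge)
                .orElseThrow(() -> new IllegalStateException("no real-time nodes"));
    }
}
```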

Towards faster insights

Our real-time layer strikes a balance among speed, robustness, and data integrity. These are the prerequisites for turning real-time analytics into fast, reliable insights. Here at Amplitude, we strive to reduce our customers’ time to insight, and we are excited to build new features on top of the real-time layer to further this goal. If you are interested in working on these types of technical challenges, we would love to speak with you!

About the author

Jin Hao Wan, Software Developer, Amplitude

Jin is on Amplitude's back-end engineering team, where he works on maintaining Amplitude's query engine and prototyping new features. He graduated from MIT with an MS in Computer Science.