Project Metamorphosis: Unveiling the next-gen event streaming platformLearn More

Apache Kafka for Real-Time Retail at Walmart Labs

This post is by Anil Kumar, Global eCommerce Engineer at Walmart Labs. Anil specializes in enabling stream processing, cloud computing, ad technology, search services, information retrieval, transport and application layer protocols, machine learning algorithms and firewall and NAT traversal mechanisms. Prior to working with the WalmartLabs team, Anil was Principal Software Architect for Adchemy, acquired by Walmart.This blog is re-posted with his permission from Anil’s post on Medium.

At in the U.S. and at Walmart’s 11 other websites around the world, we provide seamless shopping experience where products are sold by:

  1. Own Merchants for & Walmart Stores
  2. Suppliers for Online & Stores
  3. Sellers on Walmart’s marketplaces
Walmart Stores
Product sold on – Online, Stores by Walmart & by 3 marketplace sellers

The Process is referred to internally as “Item Setup” and the visitors to the sites see Product listings after data processing for Products, Offers, Price,Inventory & Logistics. These entities are comprised of data from multiple sources in different formats & schemas. They have different characteristics around data processing:

1. Products requires more of data preparation around:

  • Normalization — This is standardization of attributes & values, aids in search and discovery
  • Matching — This is a slightly complex problem to match duplicates with imperfect data
  • Classification — This involves classification against Categories & Taxonomies
  • Content — This involves scoring data quality on attributes like Title, Description, Specifications etc. , finding & filling the “gaps” through entity extraction techniques
  • Images — This involves selecting best resolution, deriving attributes, detecting watermark
  • Grouping — This involves matching, grouping products based on variations, like shoes varying on Colors & Sizes
  • Merging — This involves selection of the best sources and data aggregation from multiple sources
  • Reprocessing — The Catalog needs to be reprocessed to pickup daily changes

2. Offers are made by Multiple sellers for same products & need to checked for correctness on:

  • Identifiers
  • Price variance
  • Shipping
  • Anzahl
  • Condition
  • Start & End Dates

3. Pricing & Inventory adjustments many times of the day which need to be processed with very low latency & strict time constraints

4. Logistics has a strong requirement around data correctness to optimize cost & delivery

Products & Offers

Modified Original with permission from Neha Narkhede

This yields architecturally to lots of decentralized autonomous services, systems & teams which handle the data “Before & After” listing on the site. As part of redesign around 2014 we started looking into building scalable data processing systems. I was personally influenced by this famous blog post “The Log: What every software engineer should know about real-time data’s unifying abstraction where Kafka could provide good abstraction to connect hundreds of Microservices, Teams, and evolve to company-wide multi-tenant data hub. We started modeling changes as event streams recorded in Kafka before processing. The data processing is performed using a variety of technologies like:

  1. Stream Processing using Apache Storm, Apache Spark
  2. Plain Java Program
  3. Reactive Micro services
  4. Akka Streams

The new data pipelines which was rolled out in phases since 2015 has enabled business growth where we are on boarding sellers quicker, setting up product listings faster. Kafka is also the backbone for our New Near Real Time (NRT) Search Index, where changes are reflected on the site in seconds.

Message Rate
Message Rate filtered for a Day, split Hourly

The usage of Kafka continues to grow with new topics added everyday, we have many small clusters with hundreds of topics, processing billions of updates per day mostly driven by Pricing & Inventory adjustments. We built operational tools for tracking flows, SLA metrics, message send/receive latencies for producers and consumers, alerting on backlogs, latency and throughput. The nice thing of capturing all the updates in Kafka is that we can process the same data for Reprocessing of the catalog, sharing data between environments, A/B Testing, Analytics & Data warehouse.

The shift to Kafka enabled fast processing but has also introduced new challenges like managing many service topologies & their data dependencies, schema management for thousands of attributes, multi-DC data balancing, and shielding consumer sites from changes which may impact business.

The core tenant which drove Kafka adoption where “Item Setup” teams in different geographical locations can operate autonomously has definitely enabled agile development. I have personally witnessed this over the last couple of years since introduction. The next steps are to increase awareness of Kafka internally for New & (Re)architecting existing data processing applications, and evaluate exciting new streaming technologies like Kafka Streams and Apache Flink. We will also engage with the Kafka open source community and the surrounding ecosystem to make contributions.

Did you like this blog post? Share it now

Subscribe to the Confluent blog

More Articles Like This

Walmart’s Real-Time Inventory System Powered by Apache Kafka

Consumer shopping patterns have changed drastically in the last few years. Shopping in a physical store is no longer the only way. Retail shopping experiences have evolved to include multiple […]

Preventing Fraud and Fighting Account Takeovers with Kafka Streams

Many companies have recently started to take cybersecurity and data protection even more seriously, particularly driven by the recent General Data Protection Regulation (GDPR) legislation. They are increasing their investment […]

Running Kafka Streams Applications in AWS

This guest blog post is the second in a series about the use of Apache Kafka’s Streams API by Zalando, Europe’s largest online fashion retailer. See Ranking Websites in Real-time […]

Jetzt registrieren

Start your 3-month trial. Get up to $200 off on each of your first 3 Confluent Cloud monthly bills

Nur neue Registrierungen.

Wenn Sie oben auf „registrieren“ klicken, erklären Sie sich damit einverstanden, dass wir Ihre personenbezogenen Daten verarbeiten – gemäß unserer und bin damit einverstanden.

Indem Sie oben auf „Registrieren“ klicken, akzeptieren Sie die Nutzungsbedingungen und den gelegentlichen Erhalt von Marketing-E-Mails von Confluent. Zudem ist Ihnen bekannt, dass wir Ihre personenbezogenen Daten gemäß unserer und bin damit einverstanden.

Auf einem einzigen Kafka Broker unbegrenzt kostenlos verfügbar

Die Software ermöglicht die unbegrenzte Nutzung der kommerziellen Funktionen auf einem einzelnen Kafka Broker. Nach dem Hinzufügen eines zweiten Brokers startet automatisch ein 30-tägiger Timer für die kommerziellen Funktionen, der auch durch ein erneutes Herunterstufen auf einen einzigen Broker nicht zurückgesetzt werden kann.

Wählen Sie den Implementierungstyp aus
Manuelle Implementierung
  • tar
  • zip
  • deb
  • rpm
  • docker
Automatische Implementierung
  • kubernetes
  • ansible

Wenn Sie oben auf „kostenlos herunterladen“ klicken, erklären Sie sich damit einverstanden, dass wir Ihre personenbezogenen Daten verarbeiten – gemäß unserer Datenschutzerklärung zu.

Indem Sie oben auf „kostenlos herunterladen“ klicken, akzeptieren Sie die Confluent-Lizenzvertrag und den gelegentlichen Erhalt von Marketing-E-Mails von Confluent. Zudem erklären Sie sich damit einverstanden, dass wir Ihre personenbezogenen Daten gemäß unserer Datenschutzerklärung zu.

Diese Website verwendet Cookies zwecks Verbesserung der Benutzererfahrung sowie zur Analyse der Leistung und des Datenverkehrs auf unserer Website. Des Weiteren teilen wir Informationen über Ihre Nutzung unserer Website mit unseren Social-Media-, Werbe- und Analytics-Partnern.