Why Scale Matters More Than Ever In Retail
The US retail sector accounts for 16% of annual gross domestic product and employs 23% of the national workforce. In 2016, US retailers clocked $5.5 trillion in annual sales and employed 15+ million people. Here are some stats that demonstrate the scale of operations for leading retail chains:
Figure 1. Big Box Retailers Have Honed The Art Of Operating At Scale.
How do retailers manage operations at this scale? Modern retailers have invested billions of dollars in creating a reliable and agile IT infrastructure that lets them predict customer demand, handle millions of stock keeping units, and coordinate supply chains with thousands of partners. Retailers have mastered the art of building IT infrastructure that serves millions of customers every day while still delivering personalized experiences.
Figure 2. Walmart’s IT systems run 11,000 global stores that serve 250 million customers per week.
What Happens When Retail Systems Break?
Most retailers work in a centralized IT model where a majority of IT functions are delivered from the corporate headquarters. Each retail store operates with a small technology footprint that supports point of sale (POS), kiosks, and inventory management systems. Retail IT systems integrate real-time store data across hundreds of locations to decode seasonal trends, sales insights, and inventory holdings.
If any of the store systems stop working, retailers can lose millions of dollars per hour. In 2015, a daily system refresh failure at Starbucks resulted in three hours of POS disruption across 15,000 stores. Starbucks lost a significant amount of revenue and had to give away free cappuccinos as a goodwill gesture to customers. Every retail IT outage disrupts productivity, damages credibility, and impacts customer loyalty.
Figure 3. Point of Sale Outages Can Have A Huge Monetary Impact.
When Ransomware Strikes
Let’s take the case of a fictional retailer, Jason & Chang (J&C). J&C is a high-end fashion apparel retailer with a thousand stores in North America. Due to several acquisitions, J&C uses ten different POS systems for its store operations. Most of these POS systems run on older versions of the Windows operating system.
This May, J&C’s IT operations team decides to roll out emergency patches for securing its POS devices after the WannaCry ransomware attack strikes 300,000 machines in 150 countries. The IT team rolls out ten different patches for the ten POS systems after close of business hours. After initial testing, the IT team feels confident that all the POS devices are working well and the malware hasn't compromised any confidential data.
Figure 4. Why Ransomware Is An Existential Threat For Retail Store Operations.
J&C’s stores open for business the next day. Store managers start complaining that the POS terminals have stopped working. J&C has different monitoring tools for store systems, and a flood of POS failure alerts shows up at the NOC. After an initial analysis, it's clear that some of the WannaCry emergency patches are causing the POS outage. The challenge for J&C’s IT team is to figure out which patches to keep and which to roll back, and to complete the rollback before the POS outage becomes headline news!
Incident Management To The Rescue
The Standish Group reports that every minute of POS downtime costs retailers $4,700. One hour of POS downtime across 200 J&C stores can result in $50+ million of lost revenues. How does the IT team maximize availability, pinpoint root cause, and restore services during the POS outage?
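As a back-of-the-envelope check on these figures, the sketch below multiplies the per-minute cost by the outage duration and the number of stores. It assumes the $4,700 figure applies to each store independently, which the text does not state explicitly; the function name is illustrative only.

```python
def pos_downtime_cost(cost_per_minute, minutes, stores):
    """Estimate lost revenue, assuming the per-minute cost applies per store."""
    return cost_per_minute * minutes * stores


# One hour of downtime across 200 stores at $4,700 per store per minute.
estimate = pos_downtime_cost(4_700, 60, 200)
print(f"${estimate:,.0f}")  # $56,400,000
```

Under that assumption the estimate comes to roughly $56 million, in line with the $50+ million quoted above.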
A comprehensive incident management framework helps retailers drive operational agility, responsiveness, and adaptability during an outage. Retailers can reduce downtime and accelerate recovery with improved visibility, situational analysis, and collaboration. IT teams can drive faster mean time to resolution across hundreds of locations and thousands of endpoint devices with alert aggregation, correlation, and escalation.
Figure 5. An Incident Management System Provides Visibility and Control For Retail Operations.
- Alert Aggregation. With alert aggregation, J&C can centrally consolidate events, metrics, and logs from different IT tools. J&C gets visibility for all critical alerts across their stores in a single dashboard. The IT team can troubleshoot POS alerts with contextual data, without having to switch between different tools.
- Alert Correlation. Alert correlation helps the IT team separate signal from noise by combining alerts from different POS systems. J&C can reduce the staff and time required to investigate the POS outage using alert prioritization and deduplication. With alert correlation, J&C is able to pinpoint which two emergency patches caused the outage.
- Alert Escalation. After the root cause analysis is complete, alert escalation notifies the right subject matter experts for speedy resolution. If a team can't respond within the defined SLA, notifications move up the chain of command. The IT team can send alerts using SMS, voicemail, and email, and customize notifications with alert-related data.
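The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular incident management product works: the Alert fields, tool names, patch labels, and escalation chain are all hypothetical, and correlation is reduced to deduplicating alerts and grouping them by the patch last applied to the terminal.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    source: str      # monitoring tool that raised the alert (hypothetical)
    store_id: int
    pos_system: str  # which POS system the terminal runs
    patch: str       # last patch applied to the terminal
    message: str


def aggregate(*feeds):
    """Alert aggregation: merge feeds from separate tools into one list."""
    return [alert for feed in feeds for alert in feed]


def correlate(alerts):
    """Alert correlation: drop duplicates, then group by the patch involved."""
    seen = set()
    by_patch = defaultdict(list)
    for alert in alerts:
        key = (alert.store_id, alert.pos_system, alert.message)
        if key in seen:  # same failure reported by more than one tool
            continue
        seen.add(key)
        by_patch[alert.patch].append(alert)
    return dict(by_patch)


def escalate(minutes_unacknowledged, sla_minutes, chain):
    """Alert escalation: move one step up the chain per missed SLA window."""
    level = min(minutes_unacknowledged // sla_minutes, len(chain) - 1)
    return chain[level]


tool_a = [Alert("tool_a", 101, "pos_3", "patch_3", "terminal offline"),
          Alert("tool_a", 102, "pos_7", "patch_7", "terminal offline")]
tool_b = [Alert("tool_b", 101, "pos_3", "patch_3", "terminal offline")]  # duplicate

incidents = correlate(aggregate(tool_a, tool_b))
suspect_patches = sorted(incidents)
on_call = escalate(minutes_unacknowledged=20, sla_minutes=15,
                   chain=["pos_engineer", "noc_lead", "it_director"])
print(suspect_patches, on_call)
```

In this toy run, the duplicate report from the second tool is dropped, the remaining alerts point at two patches, and the notification moves to the second person in the chain because one SLA window has already been missed.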
Incident management has clear benefits in a retail context. J&C is able to roll back patches across two hundred POS devices and restore billing quickly. After analyzing the two patches, the IT team rolls out new fixes for the POS devices after business hours. All of this happens before any major financial impact or adverse publicity!