PeopleSoft Cloud

The 12-Week Challenge

Chapter 4: The Observability Awakening

Aaron
Feb 10, 2026

Monday morning of Week 4 started with Maya’s presentation to Robert Harrison.

The CIO arrived at 8 AM sharp, along with CFO Patricia Winters and VP of Business Operations James Chen. Maya had the conference room set up with her laptop connected to the projector, the infrastructure repository open in one window, a terminal in another.

“Good morning,” Harrison said, settling into his chair. “I understand you have a demonstration.”

“We do,” Maya said. “Three weeks ago, you gave us twelve weeks to prove PeopleSoft modernization could compete with SaaS migration. I want to show you what we’ve accomplished in the first three weeks.”

She pulled up her summary slide:

Weeks 1-3: Foundation Complete

  • Week 1: Honest assessment, baseline metrics, quick wins

  • Weeks 2-3: Infrastructure as Code implementation

  • Result: 99.7% faster environment provisioning, $178K annual savings (dev only)

“Show me the demo,” Harrison said—no preamble, straight to business.

Maya opened her terminal. “I’m going to provision a complete PeopleSoft environment from scratch. When I run this command, Terraform and Chef will build a database, application servers, web servers, networking, storage, and configuration—everything needed for a functional PeopleSoft instance.”
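The chapter never shows the repository itself, but the top level of such a setup typically boils down to one module call that fans out into the database, application-tier, web-tier, and networking resources. A purely illustrative sketch, with the module name and inputs invented rather than taken from Maya's actual code:

```hcl
# Hypothetical top-level configuration; module path, variables, and
# sizes are invented for illustration, not the team's real repository.
module "psft_dev" {
  source      = "./modules/peoplesoft-environment"
  environment = "dev-demo"
  app_servers = 2
  web_servers = 2
}
```

A single `terraform apply` against something shaped like this is what turns "eighteen days of tickets" into "one command and a wait."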

She typed the command and hit enter.

“How long will this take?” Winters asked.

“About seventy-five minutes,” Maya said. “Our baseline three weeks ago was eighteen days.”

Winters blinked. “Eighteen days to seventy-five minutes?”

“Correct. And I’m not doing anything during these seventy-five minutes except watching automated processes run. Previously, this would have required multiple engineers working across multiple teams for days.”

They watched as infrastructure was provisioned in real-time. Harrison asked sharp questions—What happens if it fails? How do you validate correctness? Can you rebuild production this way? Maya answered each question confidently, occasionally pulling in Jake or Tom for technical details on database configuration or application-tier setup.

Seventy-three minutes later, the PeopleSoft sign-in screen appeared.

Harrison was quiet for a moment. “That’s… impressive. Walk me through the business value.”

Maya pulled up her cost analysis. “This dev environment costs us $110 per month to run versus $14,000 in our data center. Admin time to maintain it drops from forty hours per quarter to near zero. Total savings for this one environment: $178,000 annually. We have four environments. Extrapolating—”

“Seven hundred thousand per year,” Winters finished, doing the math. “That’s significant.”

“And that’s before we account for improved velocity,” Maya added. “When our developers can spin up test environments in an hour instead of waiting weeks, they move faster. When we can rebuild from disaster in ninety minutes instead of hoping our untested backup scripts work, we reduce risk. When our infrastructure knowledge lives in version-controlled code instead of tribal memory, we eliminate key person dependencies.”
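The figures Maya and Winters trade can be checked in a few lines. Everything below comes from the dialogue except the admin labor rate, which is my assumption (a fully loaded $70/hour makes the quoted $178K line up):

```python
# Reproducing the savings figures from the demo. The $70/hour labor
# rate is an assumption; the other inputs are quoted in the scene.

cloud_monthly = 110          # dev environment in the cloud, per month
datacenter_monthly = 14_000  # same environment in the data center
admin_hours_saved = 40 * 4   # forty hours per quarter, now near zero
admin_rate = 70              # assumed fully loaded hourly cost

infra_savings = (datacenter_monthly - cloud_monthly) * 12
labor_savings = admin_hours_saved * admin_rate
per_env = infra_savings + labor_savings
print(per_env)      # → 177880, roughly the $178K Maya quotes
print(per_env * 4)  # → 711520, Winters' "seven hundred thousand"
```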

Harrison leaned back. “What’s your plan for the next nine weeks?”

“Week 4, which starts today: CI/CD pipelines for automated testing and deployment of customizations. Week 5: Database modernization using managed services. Weeks 6-7: Observability and monitoring infrastructure. Weeks 8-9: Security automation and integration modernization. Weeks 10-12: Cost validation, scaling tests, and final presentation.”

“You’re on track?” Harrison asked.

“We’re ahead of track,” Maya said honestly. “I expected the infrastructure as code work to take three weeks with lots of struggles. We completed it in two weeks, delivering high-quality results. The team is learning faster than I anticipated.”

Harrison stood. “Good. Keep going. I want weekly updates. And Maya—this is good work. Genuinely impressive. But you’re still proving a concept. I need to see this working in production before I can take it to the board as an alternative to the SaaS plan.”

“Understood,” Maya said. “By Week 12, we’ll have production evidence.”

After the executives left, Maya’s team emerged from their desks where they’d been nervously monitoring the demo.

“How’d it go?” Tom asked.

“Harrison called it genuinely impressive,” Maya said. “Which from him is practically a standing ovation.”

“So we’re good?” Jake asked.

“We’re good for now,” Maya said. “But he’s right—we’re still proving a concept. We need production validation. Which means everything we build in the next nine weeks needs to be production-ready, not just demos.”

She pulled up the Week 4 plan on the screen. “Speaking of which: CI/CD pipelines. This week, we’re going to automate the testing and deployment of PeopleSoft customizations. No more manual exports and imports. No more emailing project files around. No more deployments that take six hours. We’re building a pipeline.”

“Before we dive into that,” Sarah said carefully, “we need to talk about something that happened over the weekend.”

Maya’s stomach dropped. “What happened?”

“Production incident. Saturday at 2:47 AM. Integration Broker went down. It took forty-three minutes to detect, and only because a batch job failed and someone happened to check their email. It took another ninety minutes for Marcus to troubleshoot and fix it. Total outage: two hours, thirteen minutes.”

“What caused it?” Maya asked.

“Web server ran out of memory and crashed,” Marcus said. “Took the IB domain with it. We don’t have memory monitoring, so I didn’t know what was wrong until I SSH’d into each server, checked logs, found the out-of-memory errors, and restarted services.”

“Two hours to diagnose a memory issue,” Maya said flatly.

“In my defense, the logs are spread across seventeen servers with no aggregation,” Marcus said. “I had to check each one manually. And our monitoring only watches whether processes are running, not whether they’re healthy.”

Maya walked to the whiteboard and wrote in large letters: “WEEK 4 REVISED PLAN: OBSERVABILITY FIRST.”

“Here’s the thing,” Maya said, turning to face her team. “We can build the prettiest CI/CD pipeline in the world, but if we can’t see what’s happening in production, we’re still operating blind. Saturday’s incident proved that. We need observability before we need CI/CD.”

“What’s the difference between monitoring and observability?” Priya asked.

“Great question,” Maya said. “Monitoring tells you something is broken. Observability tells you why it’s broken and helps you understand system behavior. Right now, we have basic monitoring—we know when a process dies. But we don’t have observability. We can’t answer questions like ‘Why is the system slow?’ or ‘What changed before this error started?’ or ‘Which integration is causing database contention?’”

She drew three columns on the whiteboard: Logs, Metrics, Traces.

“These are the three pillars of observability,” Maya explained. “Logs tell you what happened—detailed records of events, errors, transactions. Metrics tell you how the system is performing, including CPU, memory, response times, throughput. Traces tell you the path a request takes through your system—from web server to app server to database and back.”
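To make the three pillars concrete: here is a minimal sketch (component and event names invented, not Maya's code) of how a single request can emit all three signals, tied together by one trace ID so they can later be correlated:

```python
import json
import time
import uuid

def handle_request():
    # Trace: one ID generated at the edge and passed to every downstream
    # call, so the request's full path can be reconstructed later.
    trace_id = uuid.uuid4().hex
    start = time.monotonic()
    # ... web server -> app server -> database work happens here ...
    elapsed = time.monotonic() - start

    # Log: a structured record of what happened, searchable once shipped
    # to a central store such as OpenSearch.
    log_record = json.dumps({
        "level": "INFO",
        "component": "webserver",    # invented component name
        "event": "signin_rendered",  # invented event name
        "trace_id": trace_id,
    })

    # Metric: a number describing how the system performed, in a form a
    # metrics system can aggregate across many requests.
    metric_line = (
        f'http_request_duration_seconds{{component="webserver"}} {elapsed:.6f}'
    )
    return log_record, metric_line

log_record, metric_line = handle_request()
print(log_record)
print(metric_line)
```

The point of the shared `trace_id` is exactly the drill-down Maya describes later: filter logs by the ID from a slow trace, and the "why" falls out.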

“We have logs,” Tom pointed out.

“We have logs scattered across seventeen servers in different formats with no way to search them efficiently,” Maya corrected. “Marcus spent ninety minutes manually grepping through log files on Saturday. That’s not observability. That’s archaeology.”

“So what does good observability look like?” Lisa asked.

Maya pulled up her laptop and opened a screenshot she’d saved from her previous role—a Grafana dashboard with colorful charts showing system metrics, error rates, and response times, all in real-time.

“This is what we’re building toward,” she said. “A single pane of glass where we can see everything happening in our PeopleSoft environment. Application server health. Database performance. Integration throughput. User experience metrics. Error rates. Everything.”

“And when something breaks?” Jake asked.

“We see it immediately,” Maya said. “The dashboard shows the anomaly. We can drill into logs filtered by timeframe and component. We can see what changed right before the problem started. We can correlate events across different systems. Instead of spending ninety minutes figuring out what’s wrong, we spend five minutes confirming what we already suspect and fifteen minutes fixing it.”

“That sounds expensive,” Tom said. “Enterprise monitoring tools cost a fortune.”

“It would be,” Maya agreed, “if we were buying commercial tools. But we’re going to build it using open source: OpenSearch for log aggregation and analysis, Prometheus for metrics collection, Grafana for visualization, and Tempo for distributed tracing. Total cost: mostly our time to implement, plus about $200/month in infrastructure to run it.”
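Prometheus works by pulling: each host runs a small exporter that serves its current values as plain text at `/metrics`, and the Prometheus server scrapes them on an interval. A stdlib-only sketch of what an exporter serves (the metric name and value are invented for illustration):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Minimal sketch of a Prometheus exporter. The metric below is invented;
# a real exporter would read live values from the host or application.

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# HELP psft_app_memory_used_bytes App server resident memory\n"
            "# TYPE psft_app_memory_used_bytes gauge\n"
            'psft_app_memory_used_bytes{domain="APPDOM1"} 6442450944\n'
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
print(f"exporter listening on port {server.server_port}")
```

With a gauge like this in place, Saturday's out-of-memory crash becomes a visible climb on a chart instead of a surprise at 2:47 AM.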

Sarah was nodding enthusiastically. “This is actually going to make our lives so much better. I’ve worked with observability stacks before. Once you have good observability, you can’t imagine working without it.”

“Okay,” Maya said. “Here’s the Week 4 plan, revised. Monday through Wednesday: we implement centralized logging with OpenSearch. Every log from every PeopleSoft component flows into a single location where we can search it. Thursday through Friday: we implement metrics collection with Prometheus and build our first Grafana dashboards. Next week, we’ll add distributed tracing and alerting.”

“That’s a lot for one week,” Marcus said.

“It is,” Maya admitted. “But we’re not building perfection. We’re building ‘better than Saturday night.’ If by Friday we have centralized logs and basic metrics, we’ve massively improved our ability to troubleshoot. The rest can evolve.”

“Who’s doing what?” Priya asked.

“Marcus, you’re leading the OpenSearch implementation since you already researched it in Week 1. Sarah, you’re helping Marcus with the technical architecture. Jake, you’re instrumenting the database to export metrics—Oracle has built-in monitoring views we can scrape. Tom, you’re configuring log shipping from all the application servers. Priya and Lisa, you’re documenting what we learn and building runbooks for using the observability tools.”

“And you?” Tom asked.

“I’m building the Grafana dashboards and figuring out what we need to measure to prove our system is healthy,” Maya said. “Plus, I’m talking to other teams who might want to consume our metrics. If we’re building an observability platform, we should make it useful beyond just PeopleSoft.”

She capped the marker. “Questions?”

“Yeah,” Jake said. “What do we do about Saturday’s incident? Do we need to file some kind of post-mortem or incident report?”

“We do,” Maya said. “And we’re going to use it as a teaching moment. Let me show you what a blameless post-mortem looks like.”
