The Good Tech Companies - Building Resilient Financial Systems With Explainable AI and Microservices
Episode Date: January 16, 2026. This story was originally published on HackerNoon at: https://hackernoon.com/building-resilient-financial-systems-with-explainable-ai-and-microservices. Explainable AI improves microservices resilience by making AI decisions auditable, reducing MTTR, and building trust in finance systems. Check more stories related to machine-learning at: https://hackernoon.com/c/machine-learning. You can also check exclusive content about #aiops, #microservices-architecture, #insurance-technology, #explainable-ai, #financial-systems, #system-resilience, #ai-governance, #good-company, and more. This story was written by: @jonstojanjournalist. Learn more about this writer by checking @jonstojanjournalist's about page, and for more stories, please visit hackernoon.com. AI-driven microservices often fail due to black-box decision-making. This IEEE award-winning research introduces a transparency-driven resilience framework using explainable AI to make automated actions interpretable and auditable. Tested on 38 services, it reduced MTTR by 42%, improved mitigation success by 35%, and accelerated incident triage, critical gains for regulated finance and insurance systems.
Transcript
This audio is presented by Hacker Noon, where anyone can learn anything about any technology.
Building resilient financial systems with explainable AI and microservices, by Jon Stojan, journalist.
In today's cloud-native and AI-driven enterprise landscape, system failures are no longer
caused by simple outages but by complex interactions between microservices, automation, and machine
learning models. To understand how explainable AI can transform reliability engineering,
we spoke with Adithia Jakaraja, author of the IEEE International Conference on Advances in Next Generation Computer Science (ICANCS) 2025 Best Paper, "Explainable AI for Resilient Microservices: A Transparency-Driven Approach," which presents a practical framework for building trustworthy, auditable AI-driven resilience in large-scale systems.
Q. Can you summarize the core idea behind your research?
Adithia. The central idea of the paper is that AI-driven resilience
systems fail not because they lack intelligence, but because they lack transparency.
Modern microservices platforms increasingly rely on AI for anomaly detection, predictive scaling,
and automated recovery. However, these decisions often operate as black boxes. When incidents
occur, engineers are left without clarity on why an action was taken. This research introduces
a transparency-driven resilience framework that embeds explainable AI directly into the resilience
life cycle so every AI-driven decision is interpretable, auditable, and operationally actionable.
Q. What specific problems do black-box AI systems create in production environments?
Adithia. Black-box AI introduces three major problems during high-severity incidents.
1. Unclear causality. Engineers cannot determine which service or metric triggered an action.
2. Delayed root-cause analysis. Time is lost validating whether an AI decision was correct.
3. Reduced trust. Teams hesitate to rely on automation when they cannot explain it to stakeholders or regulators.
In large microservices environments, these issues compound quickly, leading to cascading failures and longer recovery times.
Q. How does your framework address these challenges?
Adithia. The framework integrates explainability as a first-class architectural requirement.
It maps specific explainable AI techniques to resilience scenarios such as anomaly detection, failure propagation, and predictive scaling.
For example, SHAP and LIME are used to explain anomalous behavior at the feature level.
Bayesian networks are applied to identify probabilistic failure paths across service dependencies.
Counterfactual explanations justify scaling and remediation actions by showing what would have
prevented the failure.
This ensures that every AI action is accompanied by a clear and technically grounded explanation.
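To make the feature-level attribution concrete, here is a minimal sketch of how SHAP could explain why an anomaly detector flagged one service's telemetry. It is not taken from the paper; the metric names, the IsolationForest detector, and the synthetic data are illustrative assumptions.

```python
# Minimal sketch (not from the paper): attribute an anomaly score for one
# service's metrics to individual features with SHAP. Metric names,
# detector choice, and data are illustrative only.
import numpy as np
import shap
from sklearn.ensemble import IsolationForest

FEATURES = ["p99_latency_ms", "error_rate", "cpu_util", "queue_depth"]

# Train the detector on "normal" telemetry (random stand-in data here).
rng = np.random.default_rng(0)
normal = rng.normal(loc=[120, 0.01, 0.4, 30],
                    scale=[20, 0.005, 0.1, 10],
                    size=(500, 4))
detector = IsolationForest(random_state=0).fit(normal)

# A suspicious sample flagged during an incident.
incident = np.array([[480.0, 0.09, 0.95, 210.0]])

# Explain the anomaly score: which metrics pushed the decision?
explainer = shap.Explainer(detector.decision_function, normal[:100])
explanation = explainer(incident)

for name, value, contrib in zip(FEATURES, incident[0], explanation.values[0]):
    print(f"{name:>15} = {value:8.2f}  SHAP contribution {contrib:+.3f}")
```

In a setup like this, the attribution values can travel with the alert itself, so the on-call engineer immediately sees which metric drove the automated decision rather than just the fact that something fired.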
Q. Was this approach validated with real system data?
Adithia. Yes. The framework was validated using a production-like microservices environment with 38 services deployed across Kubernetes clusters.
Faults such as latency spikes, memory leaks,
and cascading dependency failures were intentionally injected.
The results showed a 42% reduction in mean time to recovery (MTTR), a 35% improvement in successful mitigation actions, and up to 53% faster incident triage due to explainability-driven diagnostics.
These results demonstrate that transparency directly improves operational outcomes.
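As an illustration of the kind of faults described above, the following is a minimal, hypothetical sketch of injecting a latency spike into a downstream service call. The probability, delay, and service name are illustrative and are not taken from the study's injection tooling.

```python
# Minimal sketch (not from the paper): wrap a service call so a configurable
# fraction of requests incur an artificial latency spike, simulating one of
# the injected fault types. All parameters are illustrative.
import random
import time
from functools import wraps

def with_latency_fault(probability: float, delay_s: float):
    """Decorator that delays a fraction of calls to simulate a latency spike."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            if random.random() < probability:
                time.sleep(delay_s)  # injected fault
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@with_latency_fault(probability=0.2, delay_s=1.5)
def call_payment_service(order_id: str) -> str:
    # Stand-in for a real downstream call.
    return f"payment processed for {order_id}"

if __name__ == "__main__":
    for i in range(5):
        start = time.time()
        call_payment_service(f"order-{i}")
        print(f"call {i}: {time.time() - start:.2f}s")
```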
Q. Many engineers worry that explainability adds performance overhead. How does your work address this?
Adithia. That concern is valid.
The study measured computational overhead carefully.
Real-time explanations introduced approximately 15 to 20% additional compute
cost, primarily due to SHAP calculations. However, this trade-off was justified by the substantial
reductions in downtime and escalation rates. The framework also supports tiered explainability,
using lightweight explanations for routine events and deeper analysis only during critical incidents,
keeping overhead controlled.
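A minimal sketch of how such tiering might be wired follows. The severity levels, thresholds, and explainer functions are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not from the paper) of tiered explainability: a cheap
# heuristic explanation for routine events, a full attribution pass only
# for critical incidents. All names and levels are illustrative.
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict

class Severity(Enum):
    ROUTINE = 1
    DEGRADED = 2
    CRITICAL = 3

@dataclass
class Explanation:
    summary: str
    details: Dict[str, float]

def explain_lightweight(metrics: Dict[str, float]) -> Explanation:
    # Cheap heuristic: report only the metric deviating most from baseline.
    worst = max(metrics, key=lambda k: abs(metrics[k]))
    return Explanation(f"dominant signal: {worst}", {worst: metrics[worst]})

def explain_deep(metrics: Dict[str, float]) -> Explanation:
    # Placeholder for the expensive path (e.g., full SHAP attribution);
    # here it simply returns every deviation.
    return Explanation("full feature attribution", dict(metrics))

# Route each event to the cheapest explainer its severity allows.
EXPLAINERS: Dict[Severity, Callable[[Dict[str, float]], Explanation]] = {
    Severity.ROUTINE: explain_lightweight,
    Severity.DEGRADED: explain_lightweight,
    Severity.CRITICAL: explain_deep,
}

def explain_event(severity: Severity, deviations: Dict[str, float]) -> Explanation:
    return EXPLAINERS[severity](deviations)

if __name__ == "__main__":
    deviations = {"p99_latency_ms": 3.2, "error_rate": 0.4, "cpu_util": 0.1}
    print(explain_event(Severity.ROUTINE, deviations))
    print(explain_event(Severity.CRITICAL, deviations))
```

The design choice is simply to pay the expensive attribution cost only when an incident's severity justifies it, which is what keeps the average overhead low.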
Q. How does this research translate to regulated industries like finance and insurance?
Adithia. Regulated industries require not only
resilience, but accountability. AI systems must explain their decisions to auditors, regulators,
and executive stakeholders. By producing cryptographically auditable explanation logs and
trace-aligned diagnostics, the framework enables organizations to meet governance requirements while
still benefiting from automation. This is especially critical in financial services,
where unexplained system behavior can have regulatory and economic consequences.
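As an illustration of what a cryptographically auditable explanation log could look like, here is a minimal hash-chained sketch. The field names, hashing scheme, and JSON layout are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (not from the paper) of a tamper-evident explanation log:
# each entry stores the SHA-256 of the previous entry, so any later edit
# breaks the chain. Field names are illustrative.
import hashlib
import json
import time
from typing import Any, Dict, List

class ExplanationAuditLog:
    def __init__(self) -> None:
        self.entries: List[Dict[str, Any]] = []

    def append(self, action: str, explanation: Dict[str, float], trace_id: str) -> Dict[str, Any]:
        prev_hash = self.entries[-1]["hash"] if self.entries else "0" * 64
        body = {
            "ts": time.time(),
            "trace_id": trace_id,        # ties the decision to distributed traces
            "action": action,            # what the automation did
            "explanation": explanation,  # feature attributions behind the decision
            "prev_hash": prev_hash,
        }
        body["hash"] = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append(body)
        return body

    def verify(self) -> bool:
        # Recompute every hash; a single altered entry invalidates the chain.
        prev = "0" * 64
        for e in self.entries:
            expected = {k: v for k, v in e.items() if k != "hash"}
            if e["prev_hash"] != prev:
                return False
            if hashlib.sha256(json.dumps(expected, sort_keys=True).encode()).hexdigest() != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

A log like this gives auditors a record of every automated action together with the explanation that justified it, and any tampering is detectable by re-verifying the chain.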
Q. Did the explainability layer change how engineers interacted with incidents?
Adithia. Yes, significantly. In controlled
evaluations with site reliability engineers, explainable diagnostics reduced uncertainty during
outages. Engineers were able to identify root causes faster and make confident remediation decisions without second-guessing the AI. Incident resolution confidence scores increased from 3.1 to 4.6 out of 5, and escalation tickets dropped by nearly 40% in complex failure scenarios.
Q. What makes this work different from existing AIOps approaches?
Adithia. Great question.
Most AIOps solutions focus on prediction accuracy but neglect interpretability. This work treats
explainability as a resilience property, not a visualization afterthought. It provides architectural
patterns, performance benchmarks, and measurable outcomes that show how explainable AI can be
deployed safely at scale, rather than remaining a research concept. Q. What is the broader
takeaway for system architects and engineering leaders? Adithia. The key takeaway is that reliable
AI systems must be understandable systems. Automation without transparency increases risk rather than
reducing it. By embedding explainability into AI-driven resilience, organizations can achieve
faster recovery, fewer escalations, and greater trust in autonomous systems. Transparency is not a
cost; it is a force multiplier for reliability.
Q. Last question. What's next for this area of research?
Adithia. Future work will focus on cross-cloud explainability,
reinforcement learning transparency, and standardizing explanation formats for enterprise
observability tools. As AI becomes more deeply embedded into critical infrastructure,
explainability will be essential for building systems that are not only intelligent,
but dependable. This story was published under Hackernoon's business blogging program.
Thank you for listening to this Hackernoon story, read by artificial intelligence.
Visit hackernoon.com to read, write, learn and publish.
