Volume III · Architecture · Patterns · Systems · AI

The Engineer's Codex

The final layer. Everything that surrounds the code: how systems are structured, how distributed failures behave, how AI applications are architected, how fintech operations actually work, and the mental models that separate engineers from senior engineers.

01
System Structure

Architecture Patterns

The vocabulary of how to arrange code, services, and responsibilities. These are the patterns interviewers ask about and that production systems live or die by.

Structural Architectures
Hexagonal Architecture
Ports & Adapters · Alistair Cockburn 2005
The application core is surrounded by ports (interfaces) and adapters (implementations). Business logic has zero dependency on frameworks, databases, or UI. A widely proven architecture for testable, maintainable services.
[ HTTP Adapter ]      [ Kafka Adapter ]
        ↓                     ↓
┌─────────────────────────────────┐
│ PORT: PaymentInputPort          │
│   ┌─────────────────────────┐   │
│   │   DOMAIN / USE CASES    │   │
│   └─────────────────────────┘   │
│ PORT: PaymentRepoPort           │
└─────────────────────────────────┘
        ↓                     ↓
[ JPA Adapter ]       [ InMemory Adapter ]
Primary Ports · Secondary Ports · Testable Core
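The diagram can be sketched in plain Java. The names below (`PaymentInputPort`, `PaymentRepoPort`, `PaymentUseCase`, `InMemoryPaymentRepo`) are hypothetical; the point is that the use case imports only its ports, so a test can plug in the in-memory adapter where production would plug in JPA.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.UUID;

// Primary (driving) port: what the HTTP or Kafka adapters call into.
interface PaymentInputPort {
    String createPayment(BigDecimal amount);
}

// Secondary (driven) port: what the core needs from persistence.
interface PaymentRepoPort {
    void save(String id, BigDecimal amount);
    Optional<BigDecimal> findAmount(String id);
}

// Domain / use-case core: no framework imports, only the two ports.
class PaymentUseCase implements PaymentInputPort {
    private final PaymentRepoPort repo;

    PaymentUseCase(PaymentRepoPort repo) { this.repo = repo; }

    @Override
    public String createPayment(BigDecimal amount) {
        if (amount.signum() <= 0) throw new IllegalArgumentException("amount must be positive");
        String id = UUID.randomUUID().toString();
        repo.save(id, amount);
        return id;
    }
}

// Secondary adapter for tests; a JPA adapter would implement the same port.
class InMemoryPaymentRepo implements PaymentRepoPort {
    private final Map<String, BigDecimal> rows = new HashMap<>();

    @Override public void save(String id, BigDecimal amount) { rows.put(id, amount); }
    @Override public Optional<BigDecimal> findAmount(String id) { return Optional.ofNullable(rows.get(id)); }
}
```

In a Spring Boot service the HTTP controller would hold a `PaymentInputPort` and a JPA-backed class would implement `PaymentRepoPort`; neither change touches `PaymentUseCase`.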
Clean Architecture
Uncle Bob · Onion Variant
Concentric dependency rings: Entities → Use Cases → Interface Adapters → Frameworks. Dependency Rule: source code dependencies must point inward only. The domain never imports Spring, JPA, or anything external.
┌────────────────────────────────┐
│ Frameworks & Drivers (outer)   │
│ ┌────────────────────────────┐ │
│ │ Interface Adapters         │ │
│ │ ┌────────────────────────┐ │ │
│ │ │ Use Cases              │ │ │
│ │ │ ┌────────────────────┐ │ │ │
│ │ │ │ Entities           │ │ │ │
│ │ │ └────────────────────┘ │ │ │
│ │ └────────────────────────┘ │ │
│ └────────────────────────────┘ │
└────────────────────────────────┘
Dependency Rule · Framework-Free Core
CQRS
Command Query Responsibility Segregation
Separate the write model (commands that change state) from the read model (queries that return data). Enables independent scaling of reads/writes, optimized read models (denormalized projections), and eliminates impedance mismatch between domain objects and query DTOs.
Command Side          Query Side
─────────────         ──────────────
CreatePayment ─┐      PaymentSummary
UpdateStatus  ─┤      PaymentDetail
RefundPayment ─┘      DashboardView
               │            ↑
          Write Store → Read Store (projection)
Commands · Queries · Read Scaling
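A minimal in-memory sketch of the split, assuming hypothetical command and event types: commands go through a handler that owns the write model, and a projection builds a denormalized read model from the resulting events. The projection is updated synchronously here only to keep the example self-contained; in production it would typically consume events asynchronously (for example from Kafka), which is where eventual consistency between the two sides comes from.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Commands mutate state; they do not return query data.
sealed interface PaymentCommand permits CreatePayment, UpdateStatus {}
record CreatePayment(String id, long amountCents) implements PaymentCommand {}
record UpdateStatus(String id, String status) implements PaymentCommand {}

// Event emitted by the write side and consumed by projections.
record PaymentChanged(String id, long amountCents, String status) {}

class PaymentCommandHandler {
    private final Map<String, PaymentChanged> writeStore = new HashMap<>();   // write model
    private final List<Consumer<PaymentChanged>> subscribers = new ArrayList<>();

    void subscribe(Consumer<PaymentChanged> projection) { subscribers.add(projection); }

    void handle(PaymentCommand cmd) {
        PaymentChanged next = switch (cmd) {
            case CreatePayment c -> new PaymentChanged(c.id(), c.amountCents(), "PENDING");
            case UpdateStatus u -> {
                PaymentChanged cur = writeStore.get(u.id());
                if (cur == null) throw new IllegalArgumentException("unknown payment " + u.id());
                yield new PaymentChanged(cur.id(), cur.amountCents(), u.status());
            }
        };
        writeStore.put(next.id(), next);
        subscribers.forEach(s -> s.accept(next));   // in production: async via outbox/Kafka
    }
}

// Query side: a denormalized projection shaped for one screen (a dashboard).
class DashboardView {
    private final Map<String, String> statusById = new HashMap<>();

    void apply(PaymentChanged e) { statusById.put(e.id(), e.status()); }

    long countByStatus(String status) {
        return statusById.values().stream().filter(status::equals).count();
    }
}
```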
Event Sourcing
Store Events, Not State · Greg Young
The database stores the sequence of events that led to current state, not the current state itself. Current state is derived by replaying events. Natural audit log, time travel debugging, event replay to rebuild projections. Pairs naturally with CQRS. Complex; don't use without genuine need.
Event Store (append-only):
┌────────────────────────────────────┐
│ 1. PaymentCreated   {id, amount}   │
│ 2. PaymentValidated {riskScore}    │
│ 3. PaymentCharged   {txRef}        │
│ 4. PaymentSettled   {timestamp}    │
└────────────────────────────────────┘
              ↓ replay ↓
Current State = fold(events, initialState)
Append-Only · Replay · Audit Trail · Complexity Cost
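Replaying is literally a left fold. This sketch assumes the four event types from the diagram, with simplified hypothetical payloads, and derives `PaymentState` without ever storing it.

```java
import java.util.List;

sealed interface PaymentEvent permits PaymentCreated, PaymentValidated, PaymentCharged, PaymentSettled {}
record PaymentCreated(String id, long amountCents) implements PaymentEvent {}
record PaymentValidated(int riskScore) implements PaymentEvent {}
record PaymentCharged(String txRef) implements PaymentEvent {}
record PaymentSettled(long timestampMillis) implements PaymentEvent {}

// Current state is a pure value derived from the event log.
record PaymentState(String id, long amountCents, String status, String txRef) {
    static final PaymentState INITIAL = new PaymentState(null, 0, "NONE", null);

    // One step of the fold: (state, event) -> next state.
    PaymentState apply(PaymentEvent e) {
        return switch (e) {
            case PaymentCreated c -> new PaymentState(c.id(), c.amountCents(), "CREATED", null);
            case PaymentValidated v -> new PaymentState(id, amountCents, "VALIDATED", txRef);
            case PaymentCharged ch -> new PaymentState(id, amountCents, "CHARGED", ch.txRef());
            case PaymentSettled st -> new PaymentState(id, amountCents, "SETTLED", txRef);
        };
    }

    // Current State = fold(events, initialState)
    static PaymentState replay(List<PaymentEvent> events) {
        PaymentState state = INITIAL;
        for (PaymentEvent e : events) state = state.apply(e);
        return state;
    }
}
```

Time-travel debugging falls out for free: `PaymentState.replay(log.subList(0, 2))` reconstructs the payment as it was right after validation.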
Saga Pattern
Distributed Transactions Without 2PC
A sequence of local transactions, each publishing events that trigger the next. If a step fails, compensating transactions undo the steps already completed. Two flavors: Choreography (services react to each other's events) and Orchestration (a saga orchestrator directs the steps). Fintech-critical: a single payment typically spans multiple services.
ORCHESTRATION SAGA (explicit controller):
SagaOrchestrator → PaymentService.reserve()   ✓
                 → RiskService.check()        ✓
                 → LedgerService.debit()      ✗ FAILS
                 ← compensate completed steps in reverse order:
                 ← PaymentService.release()   (rollback)

CHOREOGRAPHY SAGA (event-driven):
PaymentCreated →→ RiskChecked →→ LedgerDebited
                                   ↓ if it fails
                  LedgerFailedEvt →→ PaymentReleased
Compensating Txn · Eventual Consistency
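The orchestration flavor reduces to a small loop: run steps in order and, on the first failure, run the compensations of the completed steps in reverse. A sketch with hypothetical `SagaStep`, `SimpleStep`, and `SagaOrchestrator` types; a real orchestrator would also persist saga state so it can resume after a crash, and its compensations would be idempotent business actions (release, refund) rather than database rollbacks.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// One saga step: a local transaction plus the compensation that undoes it.
interface SagaStep {
    String name();
    void execute();      // throws on failure
    void compensate();   // should be idempotent and safe to retry
}

// Convenience step built from lambdas (hypothetical helper for the example).
record SimpleStep(String name, Runnable action, Runnable undo) implements SagaStep {
    public void execute() { action.run(); }
    public void compensate() { undo.run(); }
}

class SagaOrchestrator {
    // Runs steps in order; on failure, compensates completed steps newest-first.
    // Returns true only if every step succeeded.
    boolean run(List<SagaStep> steps) {
        Deque<SagaStep> completed = new ArrayDeque<>();
        for (SagaStep step : steps) {
            try {
                step.execute();
                completed.push(step);
            } catch (RuntimeException e) {
                while (!completed.isEmpty()) completed.pop().compensate();
                return false;
            }
        }
        return true;
    }
}
```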
Strangler Fig
Legacy Migration · Martin Fowler
Gradually replace a legacy monolith by wrapping it with new services. Route specific endpoints to new microservices while the monolith handles the rest. The new system "strangles" the old one over time until nothing routes to the monolith. The standard low-risk route for containerizing legacy fintech apps.
Phase 1: All           → [Monolith]
Phase 2: /payments/*   → [New PaymentSvc]
         /users/*      → [Monolith]
Phase 3: /payments/*   → [New PaymentSvc]
         /accounts/*   → [New AccountSvc]
         /reports/*    → [Monolith] (shrinking)
Phase N: Monolith = empty → decommission
Incremental · Risk-Mitigated · Job Relevant
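The routing facade at the heart of the pattern can be sketched as a longest-prefix lookup. In practice this lives in an API gateway or reverse-proxy configuration rather than application code; `StranglerRouter` is a hypothetical stand-in that shows how each migration phase is just one more route.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Routing facade in front of the monolith: migrated prefixes go to new services,
// everything else falls through to the legacy system.
class StranglerRouter {
    private final Map<String, String> routes = new LinkedHashMap<>();  // prefix -> backend
    private final String legacyBackend;

    StranglerRouter(String legacyBackend) { this.legacyBackend = legacyBackend; }

    // Each migration phase adds a route; removing it rolls that phase back.
    void migrate(String pathPrefix, String newBackend) { routes.put(pathPrefix, newBackend); }

    String backendFor(String path) {
        String best = null;
        int bestLen = -1;
        for (Map.Entry<String, String> r : routes.entrySet()) {   // longest-prefix match wins
            if (path.startsWith(r.getKey()) && r.getKey().length() > bestLen) {
                best = r.getValue();
                bestLen = r.getKey().length();
            }
        }
        return best != null ? best : legacyBackend;
    }
}
```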
Sidecar Pattern
Service Mesh · Envoy · Istio
Deploy a helper container alongside the main service container in the same pod. The sidecar handles cross-cutting concerns: mTLS, observability, rate limiting, circuit breaking. Your app code stays simple; the mesh handles reliability. Envoy proxy is the canonical sidecar. Critical for zero-trust fintech networking.
┌─── Pod ─────────────────────────────┐
│ ┌──────────────┐  ┌─────────────┐   │
│ │ Your App     │  │ Envoy       │   │
│ │ (Spring      │←→│ Sidecar     │   │
│ │  Boot)       │  │ mTLS        │   │
│ │              │  │ Metrics     │   │
│ │              │  │ Tracing     │   │
│ └──────────────┘  └─────────────┘   │
└─────────────────────────────────────┘
Cross-Cutting · mTLS · Kubernetes Native
BFF Pattern
Backend For Frontend
Create a dedicated backend API for each frontend type (mobile app, web SPA, partner API). Each BFF aggregates and transforms data specifically for its client, avoiding the "too much / too little" problem of a generic API. Common in fintech: different data needs for mobile banking vs web dashboard vs partner integrations.
[Mobile App]  → [Mobile BFF]  ─┐
[Web SPA]     → [Web BFF]     ─┤→ [Microservices]
[Partner API] → [Partner BFF] ─┘
Each BFF: tailored aggregation, auth, rate limits
Client-Specific · API Tailoring
💡 Architecture Interview Move
When asked "how would you design a payment system?", the correct answer layers these: Hexagonal architecture for each service (testable domain), CQRS to separate payment write commands from reporting queries, Saga for multi-service payment flows, Strangler Fig if modernizing legacy, and Sidecar for the service mesh. Name-dropping one pattern shows reading; composing all five shows experience.
02
Eric Evans · Vaughn Vernon

Domain-Driven Design

DDD is not a framework — it's a language for having precise conversations about complex business domains. In fintech, the domain IS the competitive advantage.

Strategic Design — The Big Picture
Strategic Pattern

Bounded Context

A semantic boundary within which a domain model is consistent and unambiguous. "Account" means different things in Banking (balance, transactions) and Identity (user credentials). Each bounded context owns its model: no cross-context JPA joins. In well-designed systems a bounded context often maps to a single microservice.

Strategic Pattern

Context Map

A diagram showing how bounded contexts relate: Upstream/Downstream, Customer/Supplier, Partnership, Shared Kernel, Conformist, Anti-Corruption Layer (ACL). The ACL is critical when integrating legacy systems — translate their model into yours at the boundary, don't let their concepts pollute your domain.

Strategic Pattern

Ubiquitous Language

A shared vocabulary between developers and domain experts, reflected exactly in code. If the business says "authorization" and "capture" for card payments, your code has authorize() and capture(), not processStep1() and processStep2(). The language and the model evolve together.

Tactical Design — The Building Blocks
Building Block | Identity? | Mutable? | Definition | Java Implementation
Entity | ✓ Yes | ✓ Yes | An object defined by its identity, not its attributes. Two Payments with the same amount but different IDs are different entities. Identity persists through state changes. | @Entity class Payment { private PaymentId id; ... }, with equals/hashCode on the ID only
Value Object | ✗ No | ✗ No | Defined entirely by its attributes. Two Money(100, USD) objects are equal and interchangeable. Immutable. No identity. Replaces primitives: use Money, not double. | record Money(BigDecimal amount, Currency currency) {}; Java records are a natural fit for VOs
Aggregate | ✓ (root) | ✓ (via root) | A cluster of entities and VOs with a single root entity. All access to the cluster goes through the Aggregate Root, which enforces invariants at the boundary. The unit of transactional consistency. | The Order aggregate root owns OrderLine entities; never access an OrderLine directly from outside Order
Repository | n/a | n/a | Abstracts the persistence of aggregates and returns aggregate roots only. The interface lives in the domain layer; the JPA implementation lives in the infrastructure layer. This is the Hexagonal port/adapter boundary. | interface PaymentRepo { Payment findById(PaymentId id); void save(Payment payment); }
Domain Service | n/a | n/a | Domain logic that doesn't naturally fit one entity or VO, such as cross-aggregate coordination. Stateless. If you find yourself putting domain logic in a utility class, it probably belongs in a Domain Service. | class TransferService { void transfer(Account from, Account to, Money amount) { ... } }
Domain Event | ✓ (implicit) | ✗ No | Something that happened in the domain that other parts of the system care about. Immutable facts that drive sagas, projections, and notifications. Named in the past tense: PaymentAuthorized, not AuthorizePayment. | record PaymentAuthorized(PaymentId id, Instant at, Money amount) implements DomainEvent {}
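// DDD Aggregate in Java 21: the full idiom
// Money, PaymentId, Merchant, RiskAssessment, the events and the exceptions are
// domain types defined elsewhere; USD is assumed to be a statically imported Currency constant.
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

public class Payment {  // Aggregate Root
    private final PaymentId   id;          // typed identity, not String
    private final Money       amount;      // Value Object, not BigDecimal
    private final Merchant    merchant;
    private       PaymentStatus status;
    private final List<DomainEvent> events = new ArrayList<>();

    // Private constructor: the create() factory is the only way to build a Payment
    private Payment(PaymentId id, Money amount, Merchant merchant) {
        this.id = id;
        this.amount = amount;
        this.merchant = merchant;
        this.status = PaymentStatus.PENDING;
    }

    // Invariant: amount must be positive, enforced in the factory
    public static Payment create(Money amount, Merchant merchant) {
        if (!amount.isPositive()) throw new IllegalArgumentException("Amount must be positive");
        var p = new Payment(PaymentId.generate(), amount, merchant);
        p.events.add(new PaymentCreated(p.id, amount, merchant.id()));
        return p;
    }

    // Business methods express domain language, not CRUD
    public void authorize(RiskAssessment risk) {
        if (status != PaymentStatus.PENDING)
            throw new InvalidPaymentStateException(id, status, "authorize");
        if (risk.isHighRisk() && amount.exceeds(Money.of(10000, USD)))
            throw new RiskLimitExceededException(id, risk);

        this.status = PaymentStatus.AUTHORIZED;
        events.add(new PaymentAuthorized(id, Instant.now(), amount));
    }

    // The application layer drains recorded events (e.g. into the outbox) after saving
    public List<DomainEvent> pullEvents() {
        var recorded = List.copyOf(events);
        events.clear();
        return recorded;
    }
}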
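The aggregate leans on a `Money` value object. One possible shape is a Java record with its invariants in the compact constructor. The `of(String, String)` factory is a hypothetical signature (the aggregate above calls a different `Money.of` overload), and rejecting amounts with more decimals than the currency allows is a design choice for this sketch, not a rule.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.util.Currency;

// Value Object: equality by value, immutable, invariants checked at construction.
record Money(BigDecimal amount, Currency currency) {
    Money {
        if (amount == null || currency == null) throw new IllegalArgumentException("amount and currency required");
        // Normalize scale to the currency's minor units so 10 and 10.00 compare equal.
        // UNNECESSARY makes setScale throw ArithmeticException instead of silently rounding.
        amount = amount.setScale(currency.getDefaultFractionDigits(), RoundingMode.UNNECESSARY);
    }

    static Money of(String amount, String isoCode) {
        return new Money(new BigDecimal(amount), Currency.getInstance(isoCode));
    }

    Money plus(Money other) {
        requireSameCurrency(other);
        return new Money(amount.add(other.amount), currency);
    }

    boolean exceeds(Money other) {
        requireSameCurrency(other);
        return amount.compareTo(other.amount) > 0;
    }

    boolean isPositive() { return amount.signum() > 0; }

    private void requireSameCurrency(Money other) {
        if (!currency.equals(other.currency))
            throw new IllegalArgumentException("currency mismatch: " + currency + " vs " + other.currency);
    }
}
```

Because the scale is normalized, `Money.of("10", "USD").equals(Money.of("10.00", "USD"))` is true, which plain `BigDecimal.equals` would not give you.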
💡 DDD in Fintech: The Killer Application
Fintech is one of the most DDD-amenable domains there is. "Authorization," "capture," "settlement," "chargeback," "reconciliation": these are precise, often legally defined terms that business stakeholders use exactly. If your code uses those same words, bugs become immediately visible to non-engineers. That's DDD's core promise.
03
Fallacies · Laws · Theorems

Distributed Systems

The 8 fallacies of distributed computing are all lies you've believed at some point. These are the theorems and patterns that let you build honest, reliable systems anyway.

The Fundamental Laws
CAP Theorem — Brewer 2000
A distributed system can guarantee at most two of:
Consistency — every read receives the most recent write
Availability — every request receives a response
Partition Tolerance — system continues despite network splits

Since network partitions are unavoidable in production, you're really choosing between CP (consistent, possibly unavailable during partition) or AP (always available, possibly stale).

Postgres = CP. Cassandra (default) = AP. DynamoDB = AP. ZooKeeper = CP.
PACELC — Abadi 2012
CAP only addresses partition scenarios. PACELC extends it:

If Partition: choose between Availability or Consistency
Else (no partition): choose between Latency or Consistency

The latency-consistency tradeoff is the everyday reality that CAP misses. Synchronous replication = consistent but slower. Async replication = faster but eventual.

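DynamoDB: PA/EL (favors availability during partitions and low latency otherwise; eventually consistent by default)
Spanner: PC/EC (stays consistent during partitions and pays extra latency for consistency otherwise)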
Consistency Models (Weakest → Strongest)
Model | Guarantee | Examples | Latency Cost
Eventual | All replicas converge to the same value eventually; no time bound | DNS, Cassandra (default), DynamoDB (default reads) | Lowest
Monotonic Read | You never see older values than you've already read (within a session) | Cosmos DB (session consistency) | Low
Read Your Writes | After a write, your subsequent reads always reflect it | Cosmos DB (session consistency), clients pinned to a single replica | Low-Medium
Causal | Causally related operations are seen by everyone in causal order | MongoDB causally consistent sessions | Medium
Sequential | All operations appear in one total order consistent with each client's program order | Kafka (ordering within a partition) | Medium-High
Linearizable | Every operation appears to take effect atomically at one instant between its call and its return; real-time order is respected | etcd, ZooKeeper writes, Google Spanner | High
Strict Serializable | Transactions appear to execute serially in an order consistent with real time; the gold standard | Google Spanner | Highest
Key Distributed Patterns
Outbox Pattern
Transactional Outbox · At-Least-Once Delivery
Write to your database AND to an outbox table in the same transaction. A separate process reads the outbox and publishes to Kafka/SQS. An event is published (at least once) if and only if the DB write commits, which solves the dual-write problem; consumers must therefore tolerate duplicates. Essential in fintech event-driven systems.
BEGIN TRANSACTION
  INSERT payments (...)                    ← domain write
  INSERT outbox (event, status=PENDING)    ← same txn
COMMIT
Poller reads outbox → publishes to Kafka → marks outbox row PUBLISHED
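The relay half of the pattern can be sketched with in-memory stand-ins. `PaymentDb`, `OutboxRow`, and `OutboxRelay` are hypothetical, and a synchronized method only plays the role of a database transaction. The ordering inside `pollOnce` is the important part: publish first, then mark the row, so a crash in between produces a duplicate rather than a lost event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// One outbox row, written in the same "transaction" as the domain row.
final class OutboxRow {
    final long id;
    final String payload;
    boolean published;              // false = PENDING, true = PUBLISHED
    OutboxRow(long id, String payload) { this.id = id; this.payload = payload; }
}

// Stand-in for the database: the synchronized method plays the role of
// BEGIN ... COMMIT, so both inserts land together.
class PaymentDb {
    final List<String> payments = new ArrayList<>();
    final List<OutboxRow> outbox = new ArrayList<>();
    private long nextId = 1;

    synchronized void insertPaymentWithEvent(String payment, String event) {
        payments.add(payment);
        outbox.add(new OutboxRow(nextId++, event));
    }

    synchronized List<OutboxRow> pending() {
        return outbox.stream().filter(r -> !r.published).toList();
    }

    synchronized void markPublished(OutboxRow row) { row.published = true; }
}

// Poller: publish first, then mark. A crash between the two re-publishes the
// row on the next poll, hence at-least-once delivery.
class OutboxRelay {
    private final PaymentDb db;
    private final Consumer<String> broker;   // e.g. wraps a Kafka producer send

    OutboxRelay(PaymentDb db, Consumer<String> broker) { this.db = db; this.broker = broker; }

    int pollOnce() {
        int sent = 0;
        for (OutboxRow row : db.pending()) {
            try {
                broker.accept(row.payload);
                db.markPublished(row);
                sent++;
            } catch (RuntimeException e) {
                break;   // leave this and later rows PENDING, preserving order for the retry
            }
        }
        return sent;
    }
}
```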
Idempotency Keys
Safe Retries in Payment Systems
Every mutating API request carries a client-generated idempotency key. The server stores (key → result). If the same key arrives again (a retry after a timeout), the server returns the stored result without re-executing. Stripe's API supports this on POST requests. In fintech, idempotency is not optional: double charges are lawsuit-level bugs.
POST /payments
Idempotency-Key: a4f7-3b2c-9d01     ← client generates UUID
Server: key exists? return cached result
Server: key new?    execute → store result → return
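A single-process sketch of the server side, assuming a hypothetical `IdempotencyStore`. `ConcurrentHashMap.computeIfAbsent` runs the operation at most once per key, makes a concurrent retry with the same key wait for the first result, and records nothing if the operation throws, so a failed attempt can be retried. A production version would use a shared store (a database unique constraint or Redis), expire keys after a retention window, avoid holding a map lock during a slow remote call, and reject a reused key that arrives with a different request body.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Stored outcome of the first execution for a given key.
record StoredResponse(int status, String body) {}

class IdempotencyStore {
    private final Map<String, StoredResponse> results = new ConcurrentHashMap<>();

    // Executes the operation at most once per key (within this process) and
    // replays the stored result for every retry that reuses the key.
    StoredResponse execute(String idempotencyKey, Supplier<StoredResponse> operation) {
        return results.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```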
Circuit Breaker
Resilience4j · Cascading Failure Prevention
Three states: CLOSED (normal), OPEN (failing fast), HALF-OPEN (testing recovery). When failures exceed a threshold, the circuit opens: calls to the failing service stop and a fallback is returned immediately. This prevents one slow service from taking down all of its callers. Resilience4j is the de facto Java library for this.
CLOSED    → (failure rate > 50%) → OPEN
OPEN      → (wait 60s)           → HALF-OPEN
HALF-OPEN → (test call ok)       → CLOSED
HALF-OPEN → (test call fails)    → OPEN
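To make the state machine concrete, here is a minimal count-based breaker. It is a hand-rolled sketch, not the Resilience4j API: the class name and constructor parameters are hypothetical, the clock is injected so tests need not sleep, and it is not thread-safe. In a real service you would configure Resilience4j's circuit breaker with equivalent settings instead.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

enum BreakerState { CLOSED, OPEN, HALF_OPEN }

// Opens when the failure rate over the last `windowSize` calls exceeds the threshold.
// Not thread-safe: a sketch of the state machine only.
class SimpleCircuitBreaker {
    private final int windowSize;
    private final double failureRateThreshold;   // e.g. 0.5 for 50%
    private final long openMillis;               // e.g. 60_000 for "wait 60s"
    private final LongSupplier clock;            // injected for testability
    private final Deque<Boolean> window = new ArrayDeque<>();   // true = failed call
    private BreakerState state = BreakerState.CLOSED;
    private long openedAt;

    SimpleCircuitBreaker(int windowSize, double failureRateThreshold, long openMillis, LongSupplier clock) {
        this.windowSize = windowSize;
        this.failureRateThreshold = failureRateThreshold;
        this.openMillis = openMillis;
        this.clock = clock;
    }

    BreakerState state() { return state; }

    <T> T call(Supplier<T> remote, Supplier<T> fallback) {
        if (state == BreakerState.OPEN) {
            if (clock.getAsLong() - openedAt < openMillis) return fallback.get();   // fail fast
            state = BreakerState.HALF_OPEN;                                        // allow one trial call
        }
        try {
            T result = remote.get();
            onResult(false);
            return result;
        } catch (RuntimeException e) {
            onResult(true);
            return fallback.get();
        }
    }

    private void onResult(boolean failed) {
        if (state == BreakerState.HALF_OPEN) {
            if (failed) trip(); else { state = BreakerState.CLOSED; window.clear(); }
            return;
        }
        window.addLast(failed);
        if (window.size() > windowSize) window.removeFirst();
        long failures = window.stream().filter(f -> f).count();
        if (window.size() == windowSize && (double) failures / windowSize > failureRateThreshold) trip();
    }

    private void trip() {
        state = BreakerState.OPEN;
        openedAt = clock.getAsLong();
        window.clear();
    }
}
```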
Two-Phase Commit
2PC · Why We Use Sagas Instead
Distributed atomic transaction across multiple services: Phase 1 (Prepare — all participants vote commit/abort), Phase 2 (Commit — coordinator sends final decision). Works but: coordinator SPOF, blocking during failure, poor scalability. In modern microservices, Sagas with compensations are preferred. Know 2PC to explain why you're not using it.
Coordinator → all: PREPARE?
All → Coordinator: READY ✓ / ABORT ✗
Coordinator → all: COMMIT (if all READY)
Problem: coordinator crashes after PREPARE → participants that voted READY stay blocked, holding their locks, until the coordinator recovers
Fallacies of Distributed Computing — Peter Deutsch 1994
1. The network is reliable  |  2. Latency is zero  |  3. Bandwidth is infinite  |  4. The network is secure
5. Topology doesn't change  |  6. There is one administrator  |  7. Transport cost is zero  |  8. The network is homogeneous

Every junior engineer assumes at least three of these. Every production incident traces back to at least one. Design systems assuming the opposite of all eight.
04
The New Discipline

AI Engineering Patterns

Building systems with LLMs is a new engineering discipline with its own failure modes, patterns, and architecture primitives. This is the core of your new role.

Retrieval Augmented Generation (RAG)
NAIVE RAG PIPELINE:
User Query
   ↓ embed(query)
Query Vector →→→ Vector DB (pgvector, Pinecone, Qdrant)
   ↓ similarity search (cosine/dot)
Top-K Chunks (relevant document fragments)
   ↓ inject into prompt
LLM Prompt: "Context: {chunks} Question: {query} Answer:"
   ↓ generate
Grounded Response ← cites retrieved context, which reduces (but does not eliminate) hallucination

ADVANCED RAG:
Query Rewriting → HyDE → Hybrid Search (BM25 + vector) → Re-ranking → Filtered retrieval
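The retrieval step of the naive pipeline, sketched without a vector database: brute-force cosine similarity over chunks whose embeddings are assumed to have been computed already by an embedding model. `Chunk` and `TopKRetriever` are hypothetical names; pgvector, Pinecone, or Qdrant replace the linear scan with an approximate nearest-neighbor index.

```java
import java.util.Comparator;
import java.util.List;

// A fragment of a source document plus its (precomputed) embedding.
record Chunk(String text, double[] embedding) {}

class TopKRetriever {
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Brute-force similarity search: score every chunk, keep the k best.
    static List<Chunk> topK(double[] query, List<Chunk> chunks, int k) {
        return chunks.stream()
                .sorted(Comparator.comparingDouble((Chunk c) -> cosine(query, c.embedding())).reversed())
                .limit(k)
                .toList();
    }

    // Inject the retrieved context into the prompt template from the diagram.
    static String buildPrompt(String question, List<Chunk> context) {
        StringBuilder sb = new StringBuilder("Context:\n");
        for (Chunk c : context) sb.append("- ").append(c.text()).append('\n');
        return sb.append("Question: ").append(question).append("\nAnswer:").toString();
    }
}
```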
Agent Architectures
ReAct Pattern
Reason + Act · Yao et al. 2022
The LLM alternates between Thought (reasoning about what to do), Action (calling a tool), and Observation (processing the tool result). Continues until it can answer the original question. The foundation of most production agents.
Thought:     I need to check the payment status
Action:      get_payment_status("PAY-123")
Observation: {status: "pending", created: "2h ago"}
Thought:     It's been pending 2h, I should escalate
Action:      escalate_payment("PAY-123", "timeout")
Observation: {escalated: true, ticket: "ESC-456"}
Answer:      Payment escalated, ticket ESC-456 created.
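The control loop behind that transcript is small. In this sketch the LLM is abstracted as a hypothetical `ReasoningModel` that returns either an action or a final answer (parsing the model's text into those records is assumed to happen elsewhere), tools are plain functions keyed by name, and a step cap stops runaway loops.

```java
import java.util.Map;
import java.util.function.Function;

// What the model decides each turn (hypothetical shape, parsed from its output).
sealed interface Step permits Act, Finish {}
record Act(String thought, String tool, String input) implements Step {}
record Finish(String answer) implements Step {}

// The LLM call, abstracted: given the transcript so far, choose the next step.
interface ReasoningModel {
    Step next(String transcript);
}

class ReActAgent {
    private final ReasoningModel model;
    private final Map<String, Function<String, String>> tools;
    private final int maxSteps;   // hard cap so a confused model cannot loop forever

    ReActAgent(ReasoningModel model, Map<String, Function<String, String>> tools, int maxSteps) {
        this.model = model;
        this.tools = tools;
        this.maxSteps = maxSteps;
    }

    String run(String question) {
        StringBuilder transcript = new StringBuilder("Question: " + question + "\n");
        for (int i = 0; i < maxSteps; i++) {
            Step step = model.next(transcript.toString());
            if (step instanceof Finish f) return f.answer();
            Act a = (Act) step;
            Function<String, String> tool = tools.get(a.tool());
            String observation = tool == null ? "error: unknown tool " + a.tool() : tool.apply(a.input());
            // Thought, Action and Observation are appended so the next model call can reason over them.
            transcript.append("Thought: ").append(a.thought()).append('\n')
                      .append("Action: ").append(a.tool()).append('(').append(a.input()).append(")\n")
                      .append("Observation: ").append(observation).append('\n');
        }
        return "Stopped after " + maxSteps + " steps without an answer";
    }
}
```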
Tool Use / Function Calling
LLM → Structured API Calls
Provide the LLM with a schema of available tools (function names, parameter types, descriptions). The model decides when to call which tool and with what arguments; your code executes the call and returns the result. Your Spring Boot service becomes an LLM tool: the AI calls your APIs. AI-enhanced CI/CD pipelines can use the same mechanism to invoke build, test, and deploy tooling.
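// Tool definition (LangChain4j): the model sees the @Tool and @P descriptions
import dev.langchain4j.agent.tool.P;
import dev.langchain4j.agent.tool.Tool;

class PaymentTools {
  private final PaymentService paymentService;   // existing Spring service
  PaymentTools(PaymentService paymentService) { this.paymentService = paymentService; }
  @Tool("Get payment status by ID")
  PaymentStatus getPaymentStatus(@P("The payment ID") String paymentId) {
    return paymentService.findById(paymentId).status();
  }
}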
Chain-of-Thought Prompting
CoT · Intermediate Reasoning Steps
Instruct the model to show its reasoning before answering. "Think step by step" can dramatically improve accuracy on multi-step problems. Zero-shot CoT: just add "Let's think step by step." Few-shot CoT: provide worked examples of reasoning chains. Critical for complex code generation and security analysis.
Prompt template:
"You are a Java security expert. Analyze this code for OWASP Top 10 vulnerabilities. Think through each category step by step, then provide your findings."
Guardrails Pattern
Input/Output Validation for LLMs
In regulated fintech environments, LLM inputs and outputs must be validated. Input guardrails: check for PII, inject system context, rate limit. Output guardrails: validate JSON schema, check for harmful content, strip sensitive data before logging. Never log raw prompts containing customer financial data.
User Input → [PII Scrubber] → [Input Guard] → LLM
                                               ↓
Response ← [PII Re-inject] ← [Output Guard] ← Raw Output

Audit log: sanitized_prompt, model, latency
NEVER log: raw_prompt with customer PII
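The scrubber and re-inject stages can be sketched as placeholder substitution: PII is swapped for tokens before the prompt leaves your boundary, and the token-to-value map (the "vault") stays in memory on your side. `PiiScrubber` is hypothetical and its two regexes (card-number-like digit runs and email addresses) are deliberately simple; production systems use vetted PII detectors and must never log the vault.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PiiScrubber {
    // Illustrative patterns only: 13-19 digits with optional space/dash separators, and emails.
    private static final Pattern PAN = Pattern.compile("\\b\\d(?:[ -]?\\d){12,18}\\b");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    record Scrubbed(String text, Map<String, String> vault) {}

    // Input side: replace each match with a numbered placeholder, remembering the original.
    static Scrubbed scrub(String input) {
        Map<String, String> vault = new HashMap<>();
        String out = replace(PAN, "CARD", input, vault);
        out = replace(EMAIL, "EMAIL", out, vault);
        return new Scrubbed(out, vault);
    }

    private static String replace(Pattern p, String label, String text, Map<String, String> vault) {
        Matcher m = p.matcher(text);
        StringBuilder sb = new StringBuilder();
        int n = 0;
        while (m.find()) {
            String token = "[" + label + "_" + (++n) + "]";
            vault.put(token, m.group());
            m.appendReplacement(sb, Matcher.quoteReplacement(token));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Output side: put the original values back, for the end user only.
    static String reinject(String llmOutput, Map<String, String> vault) {
        String out = llmOutput;
        for (Map.Entry<String, String> e : vault.entrySet()) out = out.replace(e.getKey(), e.getValue());
        return out;
    }
}
```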
Structured Output
JSON Mode · Constrained Generation
Force the LLM to output valid JSON conforming to a schema. Pydantic (Python) or Jackson annotations (Java) define the expected structure; the framework enforces it. Essential when LLM output drives code execution — freeform text cannot drive state machine transitions.
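// LangChain4j structured output: the AI Service method's return type is the schema
import dev.langchain4j.service.AiServices;
import dev.langchain4j.service.UserMessage;

record RiskAssessment(
    RiskLevel level,
    String reason,
    boolean requiresManualReview) {}

interface RiskAnalyst {
    @UserMessage("Assess the fraud risk of this transaction: {{it}}")
    RiskAssessment assessRisk(String transaction);
}

RiskAnalyst aiService = AiServices.create(RiskAnalyst.class, chatModel);  // chatModel: your configured model
RiskAssessment assessment = aiService.assessRisk(transaction);  // returns typed record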
Prompt Caching
Cost & Latency Optimization
Long system prompts (compliance rules, code context, documentation) can be cached by the model provider. Subsequent requests reuse the cached prefix — dramatically reducing tokens processed and therefore latency and cost. Anthropic, OpenAI, Google Gemini all support prompt caching. In fintech, your compliance ruleset can be a cached system prompt.
Illustrative arithmetic:
Request 1: [Long System Prompt 8k tokens] + [Query] → full processing: $0.024 + $0.001 = $0.025
Request 2: [cache_control: ephemeral] + [New Query] → cache hit: $0.003 + $0.001 = $0.004 (prefix 87.5% cheaper, whole request 84% cheaper)
AI Evaluation — The Missing Discipline
Eval Method | What It Measures | When to Use
Unit Evals | Does a single prompt produce the expected output on known examples? Pass/fail per case. | Regression testing after prompt changes
LLM-as-Judge | A stronger model grades outputs for faithfulness, relevance, coherence, groundedness | RAG pipeline quality, continuous monitoring
Human Eval | Domain experts rate a sample of outputs. Gold standard but expensive. | Pre-production sign-off, periodic audits
A/B Testing | Route a % of traffic to the new prompt/model and compare downstream metrics | Prompt optimization, model upgrades
Hallucination Detection | Checks whether claims in the output are supported by the retrieved context (for RAG) | Always, for RAG over compliance documents
Latency P99 | 99th percentile response time under load; LLM APIs have high variance | Production readiness, SLA definition
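The P99 row is easy to misread, so here is what the number means: with the nearest-rank method it is the smallest sample at or below which 99% of samples fall. A sketch over raw samples with a hypothetical `LatencyStats` helper; monitoring systems estimate the same quantity from histogram buckets instead of sorting every request.

```java
import java.util.Arrays;

class LatencyStats {
    // Nearest-rank percentile: the smallest sample such that at least p% of
    // all samples are <= it. p is given in percent, e.g. 99.
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) throw new IllegalArgumentException("no samples");
        long[] sorted = samplesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);   // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

With 200 requests of which 3 took 3000 ms and the rest 800 ms, the mean is 833 ms but the P99 is 3000 ms, which is why latency SLOs are written against percentiles.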
💡 The One AI Pattern Nobody Teaches
Treat your AI pipelines exactly like distributed services: circuit break on API timeout (LLMs drop connections), retry with exponential backoff (provider rate limits), cache aggressively (same query twice = waste), stream responses (don't block on 3-second generations), and have a deterministic fallback when the LLM is degraded. Every production AI system eventually needs all five.
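Retry with exponential backoff, the second item above, sketched with "full jitter" (a random wait between zero and an exponentially growing, capped ceiling) so that many clients hitting the same rate limit do not retry in lockstep. `Backoff.retry` is hypothetical; sleeping is injected so it can be tested without waiting, and in production the sleeper would wrap `Thread.sleep` and handle `InterruptedException`. Only retry errors that are actually transient, such as timeouts and HTTP 429/503.

```java
import java.util.Random;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

class Backoff {
    // Attempt n (0-based) that fails waits a random time in [0, min(cap, base * 2^n)).
    static <T> T retry(Supplier<T> call, int maxAttempts, long baseMillis, long capMillis,
                       Random random, LongConsumer sleeper) {
        if (maxAttempts < 1) throw new IllegalArgumentException("maxAttempts must be >= 1");
        RuntimeException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException e) {
                last = e;
                if (attempt == maxAttempts - 1) break;   // no sleep after the final failure
                long ceiling = Math.min(capMillis, baseMillis << Math.min(attempt, 30));
                sleeper.accept((long) (random.nextDouble() * ceiling));
            }
        }
        throw last;   // give up: surface the last error to the caller (or a fallback)
    }
}
```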
05
The Three Pillars + SRE

Observability & SRE

If you can't measure it, you can't improve it. In fintech, if you can't prove it in logs, it didn't happen legally. Observability is compliance infrastructure.

Pillar 1

Metrics

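What: Numeric time-series data aggregated over windows. Tools: Micrometer (Spring Boot) → Prometheus → Grafana. Types: Counter (requests_total), Gauge (queue_depth), Histogram (request duration buckets, from which P99 is derived), Summary (client-side quantiles). Java: @Timed, @Counted annotations on service methods (for arbitrary methods, register Micrometer's TimedAspect/CountedAspect beans).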

Pillar 2

Logs

What: Structured, timestamped event records. Tools: SLF4J + Logback → JSON format → ELK or CloudWatch Logs. Critical: structured JSON (not plain text), correlation IDs on every log line, PII masking before persistence. Logs are audit evidence in fintech — retention policy is a compliance requirement.

Pillar 3

Traces

What: The path of a single request through distributed services. Tools: OpenTelemetry (OTEL) → Jaeger or Zipkin or Tempo. Java: Spring Boot 3 + Micrometer Tracing auto-instruments. Each trace = spans with parent-child relationships. Mandatory for debugging microservice latency.

SRE Reliability Mathematics
SLI → SLO → SLA
SLI (Service Level Indicator): the actual measurement
e.g., % of payment API requests completing in <200ms

SLO (Service Level Objective): the internal target
e.g., 99.9% of requests complete in <200ms, measured monthly

SLA (Service Level Agreement): external contractual commitment
e.g., 99.5% availability or service credits apply

Always set the SLO tighter than the SLA, so that you breach your internal target (and react) before you breach the contract. The error budget is the unreliability the SLO permits: 1 - SLO.
Error Budget
Error Budget = 1 - SLO
At 99.9% SLO: budget = 0.1% = 43.8 minutes/month downtime allowed
At 99.99% SLO: budget = 0.01% = 4.38 minutes/month

Error budget creates alignment: if you're within budget, deploy freely. If you've burned it, freeze deployments and focus on reliability. Engineering velocity is a function of reliability.
Availability | Downtime/Month (avg. 30.44-day month) | Downtime/Year
99% | 7.31 hours | 3.65 days
99.5% | 3.65 hours | 1.83 days
99.9% | 43.8 min | 8.77 hours
99.95% | 21.9 min | 4.38 hours
99.99% | 4.38 min | 52.6 min
99.999% | 26.3 sec | 5.26 min
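The table is just (1 - SLO) multiplied by the window. A small sketch that reproduces it and tracks how much budget is left; `ErrorBudget` is a hypothetical helper, and the month is the average Gregorian month (365.25 / 12 days), the convention the table uses.

```java
import java.time.Duration;

class ErrorBudget {
    // 365.25 days x 24 h x 60 min / 12 = 43,830 minutes in an average month.
    static final Duration AVG_MONTH = Duration.ofMinutes(43_830);

    // Error budget = (1 - SLO) x window; slo is a fraction such as 0.999.
    static Duration allowedDowntime(double slo, Duration window) {
        return Duration.ofMillis(Math.round((1.0 - slo) * window.toMillis()));
    }

    // Fraction of the budget still unspent; <= 0 means "freeze deployments".
    static double budgetRemainingFraction(double slo, Duration window, Duration observedDowntime) {
        long budget = allowedDowntime(slo, window).toMillis();
        return 1.0 - (double) observedDowntime.toMillis() / budget;
    }
}
```

For example, `allowedDowntime(0.999, AVG_MONTH)` is 2,629,800 ms (43.8 minutes), and 21.9 minutes of downtime in that month leaves half the budget.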
Fintech Baseline
Payment processing APIs often target 99.99% (about 4.4 min/month). Settlement and reporting can usually tolerate 99.9%. Real-time fraud scoring should target 99.999% because it sits in the transaction path: when it is down, transactions are blocked.
Key Metrics Every Backend Engineer Must Know
Metric | Measures | Tool (Spring Boot) | Alert Threshold Example
http_server_requests_seconds (P99 from histogram buckets) | 99th percentile API latency | Micrometer auto (percentile histograms enabled) | Alert if P99 > 500ms for 5min
http_server_requests_seconds_count{outcome="SERVER_ERROR"} | Server error (5xx) rate | Micrometer auto | Alert if error rate > 0.1%
jvm_memory_used_bytes | Heap/non-heap usage | Micrometer JVM metrics | Alert if heap > 85% for 10min
jvm_gc_pause_seconds | GC pause duration | Micrometer JVM metrics | Alert if GC pause > 200ms
hikaricp_connections_pending | DB connection pool saturation | Micrometer HikariCP | Alert if pending > 5 for 2min
kafka_consumergroup_lag | Message processing backlog | Kafka exporter | Alert if lag growing for 15min
resilience4j_circuitbreaker_state | Circuit breaker state | Resilience4j metrics | Alert on OPEN state immediately
06
REST · gRPC · GraphQL · AsyncAPI

API Design Patterns

The contract between services is the most expensive thing to change. Get it right the first time. Or at least version it.

REST Richardson Maturity Model
Level | Name | What It Means | Example
Level 0 | The Swamp of POX | Single URI, HTTP as transport only, no semantics | POST /api {"action":"getPayment","id":"123"}
Level 1 | Resources | Multiple URIs representing resources | POST /payments/123 (but all POST)
Level 2 | HTTP Verbs | Use GET/POST/PUT/PATCH/DELETE semantically | GET /payments/123 (this is "REST" in most orgs)
Level 3 | Hypermedia (HATEOAS) | Responses include links to valid next actions | Response includes _links: {capture: {href:...}, refund: {href:...}}
gRPC vs REST vs GraphQL — Decision Matrix
Dimension | REST/JSON | gRPC/Protobuf | GraphQL
Transport | HTTP/1.1 or 2 | HTTP/2 only; supports bidirectional streaming | HTTP/1.1 or 2
Schema | OpenAPI (optional) | Protobuf IDL (required, strict) | SDL type system (required)
Performance | Baseline | Typically faster: compact binary encoding and cheaper (de)serialization; gains depend on payload and workload | Similar to REST; query complexity risk
Browser Support | ✓ Native | ✗ Needs a gRPC-Web proxy | ✓ Native
Best For | External APIs, CRUD services, public APIs | Internal microservice communication, streaming, mobile | Complex, flexible client data requirements (dashboards)
Versioning | URL /v1/, /v2/ or headers | Protobuf field numbers (backward compatible) | Field deprecation, additive-only
Fintech | External partner APIs, webhooks | Internal payment processing pipelines | Finance dashboards, reporting queries
API Versioning Strategies
// URL versioning — simple, cache-friendly
GET /v1/payments/123
GET /v2/payments/123

// Header versioning — clean URLs
GET /payments/123
Accept: application/vnd.myapi.v2+json

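// Consumer-Driven Contract Testing (Pact)
// The consumer defines the contract;
// the provider verifies it on every build
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;

@Pact(consumer = "payment-service")
public RequestResponsePact createPact(PactDslWithProvider builder) {
    return builder
        .given("payment 123 exists")
        .uponReceiving("a request for payment 123")
        .path("/v1/payments/123")
        .method("GET")
        .willRespondWith()
        .status(200)
        // amount as a decimal string with an ISO 4217 code, never a float
        .body("{\"id\":\"123\",\"amount\":\"100.00\",\"currency\":\"USD\"}")
        .toPact();
}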
🏦 Fintech API Non-Negotiables
1. Idempotency-Key header on all POST mutations
2. Correlation-ID header on all requests (for audit trails)
3. ISO 4217 currency codes (never "dollars")
4. ISO 8601 timestamps in UTC (never local time)
5. Decimal amounts as strings (never float — IEEE 754 is not your friend with money)
6. Error responses with machine-readable error codes (not just HTTP status)
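Items 3 to 5 are easy to demonstrate with the JDK alone. The sketch shows why `double` is unsafe for money and builds a response body with an ISO 4217 code, a decimal-string amount, and an ISO 8601 UTC timestamp. `WireFormat` is a hypothetical helper, and the JSON is assembled by hand only to stay dependency-free; a real service would serialize a DTO with Jackson.

```java
import java.math.BigDecimal;
import java.time.Instant;
import java.util.Currency;

class WireFormat {
    // Item 5: binary floating point cannot represent most decimal fractions exactly.
    static boolean doubleAddsUp() { return 0.1 + 0.2 == 0.3; }   // false: 0.30000000000000004

    static boolean decimalAddsUp() {
        return new BigDecimal("0.1").add(new BigDecimal("0.2")).compareTo(new BigDecimal("0.3")) == 0;
    }

    // Items 3-5 together: ISO 4217 code, amount as a decimal string, ISO 8601 UTC timestamp.
    static String paymentJson(String amount, String currencyCode, Instant createdAt) {
        BigDecimal value = new BigDecimal(amount);                // rejects malformed amounts
        Currency currency = Currency.getInstance(currencyCode);   // rejects non-ISO-4217 codes like "dollars"
        String ts = createdAt.toString();                         // Instant always prints UTC with a 'Z' suffix
        return "{\"amount\":\"" + value.toPlainString() + "\",\"currency\":\"" + currency.getCurrencyCode()
                + "\",\"created_at\":\"" + ts + "\"}";
    }
}
```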
REST API Design Rule
Nouns, not verbs. POST /payments not POST /createPayment. State transitions via sub-resources: POST /payments/123/captures to capture, POST /payments/123/refunds to refund. The HTTP verb encodes the action; the resource noun encodes the target.
07
Zero Trust · OAuth2 · STRIDE · PCI

Security Patterns

In fintech, security is not a feature. It is the product. Every engineering decision is also a security decision.

OAuth2 / OIDC — The Full Flow
AUTHORIZATION CODE FLOW WITH PKCE (the secure one for web/mobile):
1. User clicks Login
2. Client generates code_verifier (high-entropy random string) and code_challenge = BASE64URL(SHA256(code_verifier))
3. Client redirects: GET /authorize?response_type=code&client_id=X&code_challenge=Y&code_challenge_method=S256
4. Auth Server: user authenticates → issues authorization_code
5. Auth Server redirects back: ?code=ABC
6. Client: POST /token {code=ABC, code_verifier=original}
7. Auth Server: recomputes the challenge from code_verifier and checks it matches → returns access_token, refresh_token, id_token
8. Client: GET /api/payments with header Authorization: Bearer {access_token}
9. Resource Server: validates JWT signature, expiry, scopes → serves request
PKCE prevents authorization code interception attacks; it is mandatory for SPAs and mobile apps.
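Steps 2 and 7 hinge on the S256 transform defined in RFC 7636. A JDK-only sketch (the `Pkce` class name is hypothetical): the client keeps the verifier secret and sends only the challenge in step 3, and at step 7 the authorization server recomputes the challenge from the verifier and compares.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.util.Base64;

class Pkce {
    private static final SecureRandom RANDOM = new SecureRandom();
    private static final Base64.Encoder B64URL = Base64.getUrlEncoder().withoutPadding();

    // 32 random bytes -> 43-character base64url verifier (RFC 7636 allows 43-128 characters).
    static String newCodeVerifier() {
        byte[] bytes = new byte[32];
        RANDOM.nextBytes(bytes);
        return B64URL.encodeToString(bytes);
    }

    // code_challenge = BASE64URL(SHA-256(ASCII(code_verifier))), method "S256".
    static String codeChallenge(String codeVerifier) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(codeVerifier.getBytes(StandardCharsets.US_ASCII));
            return B64URL.encodeToString(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```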
JWT Deep Dive
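// JWT = Header.Payload.Signature (base64url encoded, dot-separated)
// Spring Security 6 JWT Resource Server
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpMethod;
import org.springframework.security.config.Customizer;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.http.SessionCreationPolicy;
import org.springframework.security.web.SecurityFilterChain;

@Configuration
@EnableWebSecurity
class SecurityConfig {
    @Bean
    SecurityFilterChain chain(HttpSecurity http) throws Exception {
        return http
            .authorizeHttpRequests(auth -> auth
                .requestMatchers("/actuator/health").permitAll()
                .requestMatchers(HttpMethod.POST, "/payments")
                    .hasAuthority("SCOPE_payments:write")   // scope-based
                .requestMatchers(HttpMethod.GET, "/payments/**")
                    .hasAuthority("SCOPE_payments:read")
                .anyRequest().authenticated())
            // The default converter maps the "scope"/"scp" claim to SCOPE_* authorities;
            // signing keys come from spring.security.oauth2.resourceserver.jwt.issuer-uri
            .oauth2ResourceServer(o -> o.jwt(Customizer.withDefaults()))
            .sessionManagement(s -> s
                .sessionCreationPolicy(SessionCreationPolicy.STATELESS))   // no session cookies
            .build();
    }
}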
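What the resource server does on every request is, at its core, a signature check before any claim is trusted. This sketch swaps in HS256 (HMAC with a shared secret) so that it runs on the JDK alone; the configuration above would normally validate asymmetric signatures such as RS256 against the issuer's published keys, using a vetted library rather than hand-rolled code. `Hs256Jwt` is hypothetical and deliberately skips the `exp`, `iss`, `aud`, and scope checks a real validator also performs.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

class Hs256Jwt {
    private static final Base64.Encoder ENC = Base64.getUrlEncoder().withoutPadding();
    private static final Base64.Decoder DEC = Base64.getUrlDecoder();

    // Builds header.payload.signature, each part base64url-encoded without padding.
    static String sign(String payloadJson, byte[] secret) {
        String header = ENC.encodeToString("{\"alg\":\"HS256\",\"typ\":\"JWT\"}".getBytes(StandardCharsets.UTF_8));
        String payload = ENC.encodeToString(payloadJson.getBytes(StandardCharsets.UTF_8));
        String signingInput = header + "." + payload;
        return signingInput + "." + ENC.encodeToString(hmac(signingInput, secret));
    }

    // Returns the decoded payload only if the signature matches; claims are never read before this.
    static String verifyAndDecode(String token, byte[] secret) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) throw new IllegalArgumentException("not a compact JWS");
        byte[] expected = hmac(parts[0] + "." + parts[1], secret);
        byte[] actual = DEC.decode(parts[2]);
        if (!MessageDigest.isEqual(expected, actual))   // constant-time comparison
            throw new SecurityException("bad signature");
        return new String(DEC.decode(parts[1]), StandardCharsets.UTF_8);
    }

    private static byte[] hmac(String data, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            return mac.doFinal(data.getBytes(StandardCharsets.US_ASCII));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```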
STRIDE Threat Model
Threat | What It Is | Java/Spring Mitigation | OWASP Top 10 (2021) Category
S: Spoofing | Claiming to be someone else (identity) | JWT validation, mTLS between services, Spring Security | A07 Identification and Authentication Failures (formerly Broken Authentication)
T: Tampering | Modifying data in transit or at rest | TLS 1.3, HMAC request signing, field-level encryption for PII | A02 Cryptographic Failures
R: Repudiation | Denying an action occurred | Immutable audit logs, digital signatures, event sourcing | A09 Security Logging and Monitoring Failures
I: Information Disclosure | Exposing private data | PII masking in logs, response filtering, no stack traces in prod responses | A02 Cryptographic Failures (formerly Sensitive Data Exposure)
D: Denial of Service | Making the system unavailable | Rate limiting (Resilience4j), circuit breakers, CDN, WAF | A05 Security Misconfiguration
E: Elevation of Privilege | Gaining unauthorized permissions | Least-privilege IAM roles, RBAC with Spring Security, scoped JWTs | A01 Broken Access Control
⚠ PCI-DSS Scope: The Fintech Engineer's Burden
PCI-DSS applies to any system that stores, processes, or transmits cardholder data (CHD). Scope reduction is the key strategy: tokenize card numbers at the edge (Stripe, Braintree, Spreedly), process only tokens internally, and your services never see raw PANs. This shrinks your PCI scope from "entire infrastructure" to "the tokenization boundary." Know this. It will come up in your first sprint.
Zero Trust Architecture
Zero Trust Model
Never Trust, Always Verify · BeyondCorp · NIST 800-207
Traditional perimeter security: "inside the network = trusted." Zero Trust: every request must be authenticated and authorized regardless of source. No implicit trust from network location. Applies to: service-to-service calls (mTLS), user access (MFA + device posture), data access (attribute-based authorization).
OLD MODEL:
[Firewall] → [Internal Network = TRUSTED]
  ↓ Any internal service can call any other service

ZERO TRUST:
Every call authenticated (mTLS certificates)
Every call authorized (JWT scopes checked)
Every call logged (audit trail)
Minimal blast radius: service A can only call service B if B explicitly permits A's service account
08
Kafka · CDC · Polyglot · Data Mesh

Data Architecture

Data is the liability that never goes away. How you store, stream, and govern it determines what you can and can't build later.

Kafka — The Fintech Backbone
KAFKA ARCHITECTURE:
Producer (Payment Service)
   ↓ publish to topic "payment.events"
┌────────────────────────────────────────────────┐
│ Topic: payment.events (partitioned, replicated)│
│ Partition 0: [P1][P2][P5][P8]...               │  ← ordered within partition
│ Partition 1: [P3][P4][P6]...                   │
│ Partition 2: [P7][P9]...                       │
└────────────────────────────────────────────────┘
   ↓ consume (each consumer group reads the full stream independently)
[Ledger Service (group=ledger)]  [Risk Service (group=risk)]  [Audit Service (group=audit)]

Key invariant: messages with the same key always go to the same partition → ordering guarantee per entity
e.g., key=paymentId → all events for payment 123 land on the same partition, in order
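The key invariant can be sketched in a few lines. `KeyedPartitioner` is hypothetical and hashes keys with CRC32 only to stay dependency-free; Kafka's default partitioner uses murmur2 on the serialized key bytes, but either way the partition is a deterministic function of the key and the partition count.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.CRC32;

class KeyedPartitioner {
    private final int numPartitions;

    KeyedPartitioner(int numPartitions) { this.numPartitions = numPartitions; }

    // Same key -> same hash -> same partition (CRC32 here, murmur2 in Kafka).
    int partitionFor(String key) {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        return (int) (crc.getValue() % numPartitions);
    }

    // Simulates producing (key, event) pairs into per-partition append-only logs.
    Map<Integer, List<String>> produce(List<String[]> keyedEvents) {
        Map<Integer, List<String>> partitions = new TreeMap<>();
        for (String[] kv : keyedEvents) {
            partitions.computeIfAbsent(partitionFor(kv[0]), p -> new ArrayList<>()).add(kv[0] + ":" + kv[1]);
        }
        return partitions;
    }
}
```

The partition count is part of that function, so adding partitions to an existing topic remaps keys and breaks per-key ordering across the change.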
Change Data Capture (CDC)
CDC with Debezium
Database → Kafka → Everything Else
Debezium reads the database's change log (the Postgres WAL via logical decoding, the MySQL binlog) and streams every INSERT/UPDATE/DELETE as a Kafka event. Your application writes only to Postgres; all downstream consumers (search index, analytics, cache invalidation, audit) receive the changes via Kafka. No dual-write. No polling. The outbox pattern's industrial-strength brother.
Postgres (payments table)
    ↓ WAL (Write-Ahead Log) reader
Debezium Connector
    ↓ streams to Kafka topic: "dbserver.public.payments"
[Elasticsearch Consumer]  → search index updated
[Analytics Consumer]      → data warehouse updated
[Cache Consumer]          → Redis invalidated
[Notification Consumer]   → customer email sent
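The "[Cache Consumer] → Redis invalidated" step can be sketched as a handler over Debezium's change-event envelope. The envelope has an `op` code (`c` create, `u` update, `d` delete, `r` snapshot read) plus `before` and `after` row images. Several parts of the sketch are assumptions: the `ChangeEvent` record, the `id` primary-key column, and the in-memory map standing in for Redis. JSON deserialization and the Kafka consumer loop are omitted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, already-deserialized view of a Debezium change event.
record ChangeEvent(String op, Map<String, Object> before, Map<String, Object> after) {}

final class PaymentCacheInvalidator {
    private final Map<String, Object> cache = new HashMap<>(); // stand-in for Redis

    Map<String, Object> cache() { return cache; }

    void handle(ChangeEvent event) {
        switch (event.op()) {
            // Row created, read in a snapshot, or updated: evict so the next read reloads.
            case "c", "r", "u" -> cache.remove(key(event.after()));
            // Row deleted: only the "before" image is populated.
            case "d" -> cache.remove(key(event.before()));
            // Other ops (e.g. "t" for truncate) are ignored in this sketch.
            default -> { }
        }
    }

    private static String key(Map<String, Object> row) {
        return "payment:" + row.get("id");
    }
}
```

Evicting rather than writing the new value into the cache keeps the consumer simple. It also avoids out-of-order overwrites if events are ever reprocessed.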
Polyglot Persistence — The Right Database for Each Job
Data Store | Model | Best For in Fintech | Java Integration
PostgreSQL | Relational + ACID | Accounts, transactions, ledger entries: anything requiring ACID guarantees and complex queries. The default choice. | Spring Data JPA, JDBC, jOOQ
Redis | Key-Value / Cache | Session tokens, rate-limiting counters, idempotency-key cache, frequently read balances. Sub-millisecond reads. | Spring Data Redis, Lettuce, Redisson
Kafka | Log / Event Stream | Event streaming, audit log, CDC, async communication between services. Not a primary database: retention is usually bounded by time or size. | Spring Kafka, Kafka Streams, Schema Registry
Elasticsearch | Document / Search | Transaction search, compliance reporting, log aggregation. NOT for primary data: use it as a derived read model fed by CDC. | Spring Data Elasticsearch, Elasticsearch Java API Client (the High-Level REST Client is deprecated)
DynamoDB / Cassandra | Key-Value / Wide-Column NoSQL | High-write-throughput time series: fraud signals, click events, pricing feeds. Use when you know the access patterns ahead of time. | AWS SDK, Spring Data Cassandra
S3 / GCS | Object Storage | Audit log archives, compliance exports, ML training data, batch reports. Not directly queryable (use Athena/BigQuery on top). | AWS SDK v2, Spring Cloud AWS
💡 The Golden Rule of Fintech Data Double-entry bookkeeping is the 700-year-old constraint that still governs all financial systems. Every debit has a corresponding credit. The sum of all ledger entries must always equal zero. Any system that touches money must maintain this invariant — no exceptions, no "we'll reconcile later." If your service moves money, it must implement a ledger, not just a balance field.
09
Domain Knowledge · Payment Flows · Compliance

Fintech Specifics

The knowledge that separates engineers who happen to work at a fintech from engineers who understand the domain. This is your moat.

Card Payment Lifecycle — State Machine
INITIATED → validate → AUTHORIZED → capture → CAPTURED → settle → SETTLED
AUTHORIZED → void → VOIDED  // pre-capture cancellation
SETTLED → refund → REFUNDING → complete → REFUNDED
SETTLED → customer disputes → CHARGEBACK → evidence submitted → CHARGEBACK_REVIEW
INITIATED → any failure → DECLINED  // risk, insufficient funds, expired card, etc.
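The lifecycle above can be enforced in code with an explicit transition table. The following is a minimal sketch. `PaymentState` and `PaymentLifecycle` are hypothetical names. The table encodes only the transitions listed above, so anything else (for example, voiding a settled payment) is rejected.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum PaymentState {
    INITIATED, AUTHORIZED, CAPTURED, SETTLED, VOIDED, DECLINED,
    REFUNDING, REFUNDED, CHARGEBACK, CHARGEBACK_REVIEW
}

final class PaymentLifecycle {
    // Allowed transitions; any (from, to) pair not listed here is illegal.
    private static final Map<PaymentState, Set<PaymentState>> ALLOWED = new EnumMap<>(PaymentState.class);

    static {
        for (PaymentState s : PaymentState.values()) {
            ALLOWED.put(s, EnumSet.noneOf(PaymentState.class));
        }
        allow(PaymentState.INITIATED, PaymentState.AUTHORIZED, PaymentState.DECLINED);
        allow(PaymentState.AUTHORIZED, PaymentState.CAPTURED, PaymentState.VOIDED);
        allow(PaymentState.CAPTURED, PaymentState.SETTLED);
        allow(PaymentState.SETTLED, PaymentState.REFUNDING, PaymentState.CHARGEBACK);
        allow(PaymentState.REFUNDING, PaymentState.REFUNDED);
        allow(PaymentState.CHARGEBACK, PaymentState.CHARGEBACK_REVIEW);
    }

    private static void allow(PaymentState from, PaymentState... targets) {
        for (PaymentState t : targets) ALLOWED.get(from).add(t);
    }

    // Returns the new state, or throws if the transition is not in the table.
    static PaymentState transition(PaymentState from, PaymentState to) {
        if (!ALLOWED.get(from).contains(to)) {
            throw new IllegalStateException("Illegal transition " + from + " -> " + to);
        }
        return to;
    }
}
```

Keeping the rules in one table makes illegal transitions impossible to reach by accident. It also makes the table easy to compare against the processor's documented lifecycle.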
🏦 Key Distinction: Authorization vs Capture. Authorization = placing a hold on funds in the cardholder's account (near-instant, at checkout). Capture = instructing that the held funds actually be collected. The merchant decides when to capture, and settlement follows, typically one to two business days later. Hotels authorize on check-in and capture on checkout. E-commerce authorizes on order and captures on shipment. Your code must handle the gap between the two: the card can be canceled in the meantime, and an uncaptured authorization expires after a network-defined window.
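Handling the gap between authorization and capture means checking, at capture time, that the hold is still valid. The following is a hypothetical sketch. The validity window is passed in because it varies by network and merchant type. The class allows a single capture of up to the authorized amount. Real networks have category-specific rules (such as incremental authorizations) that this ignores.

```java
import java.math.BigDecimal;
import java.time.Duration;
import java.time.Instant;

final class Authorization {
    private final BigDecimal authorizedAmount;
    private final Instant expiresAt;
    private boolean captured = false;

    Authorization(BigDecimal authorizedAmount, Instant authorizedAt, Duration validity) {
        this.authorizedAmount = authorizedAmount;
        this.expiresAt = authorizedAt.plus(validity); // assumed, network-dependent window
    }

    // Captures up to the held amount, once, before the hold expires.
    BigDecimal capture(BigDecimal amount, Instant now) {
        if (captured) throw new IllegalStateException("already captured");
        if (now.isAfter(expiresAt)) throw new IllegalStateException("authorization expired; re-authorize");
        if (amount.signum() <= 0 || amount.compareTo(authorizedAmount) > 0) {
            throw new IllegalArgumentException("capture must be > 0 and <= authorized amount");
        }
        captured = true;
        return amount;
    }
}
```

A partial capture (for example, a hotel capturing less than the check-in hold) is allowed. A capture after expiry forces the caller to re-authorize explicitly instead of failing at the network.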
Double-Entry Bookkeeping in Code
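// Every financial movement = two (or more) ledger entries that sum to zero.
// Money and AccountId are domain types (e.g., Money wraps BigDecimal + Currency).
record LedgerEntry(
    UUID      id,
    UUID      transactionId,
    AccountId accountId,
    Money     amount,         // always a positive magnitude; the sign comes from type
    EntryType type,           // DEBIT or CREDIT
    Instant   postedAt
) {}

// Transfer $100 from Customer A to Customer B (bank's view: customer accounts are
// liabilities, so DEBIT lowers A's balance and CREDIT raises B's):
UUID    txId = UUID.randomUUID();
Instant now  = Instant.now();
var entries = List.of(
    new LedgerEntry(UUID.randomUUID(), txId, customerA, Money.of(100), DEBIT,  now),
    new LedgerEntry(UUID.randomUUID(), txId, customerB, Money.of(100), CREDIT, now)
);

// INVARIANT: signed sum of all entries for a transaction = 0
// DEBIT +100 + CREDIT -100 = 0  ✓
Money sum = entries.stream()
    .map(e -> e.type() == DEBIT ? e.amount() : e.amount().negate())
    .reduce(Money.ZERO, Money::add);
// Explicit check instead of `assert`, which the JVM skips unless run with -ea
if (!sum.equals(Money.ZERO)) {
    throw new IllegalStateException("Unbalanced transaction " + txId + ": " + sum);
}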
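The golden rule says to implement a ledger rather than a balance field. Concretely, an account's position is derived from its entries, never stored and mutated in place. The runnable sketch below uses `BigDecimal` in place of the `Money` type above and `String` in place of `AccountId`. `Side`, `Posting`, and `Ledger` are hypothetical names. Whether a positive net-debit figure means the customer owes money or is owed money depends on the account type.

```java
import java.math.BigDecimal;
import java.util.List;

enum Side { DEBIT, CREDIT }

record Posting(String txId, String accountId, BigDecimal amount, Side side) {
    // Signed value: debit positive, credit negative (same convention as the invariant above).
    BigDecimal signed() { return side == Side.DEBIT ? amount : amount.negate(); }
}

final class Ledger {
    // True when a transaction's postings sum to zero. compareTo is used instead of
    // equals because BigDecimal.equals also compares scale (0.0 is not equal to 0.00).
    static boolean isBalanced(List<Posting> postings) {
        BigDecimal sum = postings.stream().map(Posting::signed).reduce(BigDecimal.ZERO, BigDecimal::add);
        return sum.compareTo(BigDecimal.ZERO) == 0;
    }

    // Net debits minus credits for one account, recomputed from the full posting history.
    static BigDecimal netDebits(List<Posting> all, String accountId) {
        return all.stream()
            .filter(p -> p.accountId().equals(accountId))
            .map(Posting::signed)
            .reduce(BigDecimal.ZERO, BigDecimal::add);
    }
}
```

In production, the derived figure is usually materialized (for example, as a running total updated in the same transaction as the postings). The postings stay the source of truth that the materialized figure is reconciled against.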
Compliance Landscape Cheat Sheet
Standard/Regulation | Applies To | Key Requirement for Developers
PCI-DSS v4 | Any system that stores, processes, or transmits payment card data | Tokenize PANs, encrypt cardholder data (CHD) at rest and in transit, vulnerability scanning, penetration testing, strict access control, detailed audit logs
GDPR / CCPA | Any system holding personal data of EU residents (GDPR) or California residents (CCPA) | Right to erasure (where records must be kept for audit, pseudonymize rather than delete), data minimization, purpose limitation, GDPR breach notification to the supervisory authority within 72 hours, data subject request APIs
SOC 2 Type II | SaaS fintech platforms (customer and partner trust) | Security controls evidenced over time: access reviews, change-management logs, incident-response records, availability monitoring. Your observability IS the evidence.
AML / BSA | Banks, money transmitters | Transaction monitoring, suspicious activity reports (SARs), customer due diligence data retention, OFAC sanctions screening API
FFIEC | US banks and their fintech partners | IT risk management, authentication standards, vendor risk management. Affects architecture decisions: multi-cloud, disaster recovery, business continuity planning (BCP).
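"Tokenize PANs" means downstream systems store an opaque token, and only one isolated component can map the token back to the card number. This is a toy, in-memory illustration of that idea, not a compliant vault. Real deployments use an isolated, encrypted, audited service or a payment processor's hosted vault. `TokenVault` and the `tok_` prefix are hypothetical. `HexFormat` requires Java 17+.

```java
import java.security.SecureRandom;
import java.util.HexFormat;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class TokenVault {
    private final SecureRandom random = new SecureRandom();
    private final Map<String, String> tokenToPan = new ConcurrentHashMap<>();

    // Random token: no mathematical relationship to the PAN, so it cannot be reversed
    // without access to the vault's mapping.
    String tokenize(String pan) {
        byte[] bytes = new byte[16];
        random.nextBytes(bytes);
        String token = "tok_" + HexFormat.of().formatHex(bytes);
        tokenToPan.put(token, pan);
        return token;
    }

    // Only the vault can resolve a token back to the PAN.
    String detokenize(String token) {
        String pan = tokenToPan.get(token);
        if (pan == null) throw new IllegalArgumentException("unknown token");
        return pan;
    }

    // Masked form for display and logs: last four digits only.
    static String mask(String pan) {
        return "*".repeat(pan.length() - 4) + pan.substring(pan.length() - 4);
    }
}
```

Services that only ever see `tok_...` values and masked numbers can stay out of the most demanding parts of PCI scope. The vault (or the processor hosting it) carries that burden instead.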
10
Interview Preparation · Back of Envelope · Common Patterns

System Design Reference

The interview format where your 30 years shows most clearly. Know the numbers, know the patterns, know when to ask questions.

Back-of-Envelope Numbers Every Engineer Must Know
Operation | Latency
L1 cache reference | 0.5 ns
L2 cache reference | 7 ns
Main memory (RAM) reference | 100 ns
SSD random read (NVMe, 4 KB) | 10–100 µs
HDD seek | 10 ms
Same-datacenter round trip | 0.5 ms
US cross-country round trip | 40–70 ms
US to Europe round trip | 80–150 ms
Postgres query (indexed, including network) | 1–5 ms
Redis GET (including network) | 0.1–0.5 ms
LLM API call (Claude Sonnet) | 500 ms–3 s
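Scale Number | Value
Requests/sec, single server | 1,000–10,000 RPS
Postgres transactions/sec | 1,000–10,000 TPS
Kafka throughput per partition | ~10–100 MB/s (hardware- and config-dependent)
S3 request rate | 3,500 PUT and 5,500 GET req/s per prefix
Redis ops/sec | 100,000+
Twitter: peak tweets/sec | ~143,000 (2013 record peak)
Visa: transactions/sec | ~24,000 (often-quoted capacity; average load is far lower)
Bytes per day: 1B users × 10 actions × ~100 bytes | ~1 TB/day
1 minute of 720p video (~8 Mbps) | ~60 MB
Characters per token (LLM, English text) | ~4 chars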
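The "~1 TB/day" row depends on a per-action size that the original table left implicit. The arithmetic works out only at about 100 bytes per action, which is the assumption this sketch makes explicit. `StorageEstimate` is a hypothetical helper, and decimal units are used (1 TB = 10^12 bytes).

```java
final class StorageEstimate {
    // users x actions/day x bytes/action; all inputs are assumptions you state aloud.
    static long bytesPerDay(long users, long actionsPerUserPerDay, long bytesPerAction) {
        return users * actionsPerUserPerDay * bytesPerAction;
    }

    public static void main(String[] args) {
        long perDay = bytesPerDay(1_000_000_000L, 10, 100);   // 10^12 bytes
        long tb = 1_000_000_000_000L;                         // decimal terabyte
        System.out.println(perDay / tb + " TB/day");          // prints "1 TB/day"
        System.out.println(perDay * 365 / tb + " TB/year");   // prints "365 TB/year"
    }
}
```

Stating the bytes-per-action figure out loud is the point. If the interviewer says events are 1 KB, the answer scales to 10 TB/day immediately.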
System Design Framework — The Interview Structure
Step | Duration | What to Do | Key Questions
1. Clarify Requirements | 5 min | Ask before designing. Scope aggressively. Don't assume. | Read- or write-heavy? DAU? Latency SLA? Consistency required? Global or single region?
2. Capacity Estimation | 3 min | Back-of-envelope: storage, bandwidth, QPS. Show your math. | 100M users × 10 payments/month = 1B payments/month ≈ 400 TPS average, ~4,000 TPS peak (assuming a 10× peak-to-average ratio)
3. High-Level Design | 10 min | Draw the boxes: API gateway, services, databases, caches, queues. | Show data flow end-to-end. Name the components. Don't skip the client.
4. Deep Dive | 15 min | Pick the hardest parts (consistency, scale, fault tolerance). Show trade-offs. | How do you handle duplicate payments? What happens when the DB goes down? How do you scale reads?
5. Identify Bottlenecks | 5 min | Proactively surface weak points. Show you know the limits of your design. | Single DB is a bottleneck → read replicas → sharding → when each applies
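The step 2 numbers can be checked in a few lines. This sketch assumes a 30-day month and a 10× peak-to-average ratio. The peak factor is a common rule of thumb, not a measured value. `TpsEstimate` is a hypothetical helper.

```java
final class TpsEstimate {
    static final double SECONDS_PER_MONTH = 30 * 24 * 3600; // 2,592,000 s in a 30-day month

    static double averageTps(long users, long eventsPerUserPerMonth) {
        return users * eventsPerUserPerMonth / SECONDS_PER_MONTH;
    }

    public static void main(String[] args) {
        double avg = averageTps(100_000_000L, 10);  // 1B payments/month
        double peak = avg * 10;                     // assumed 10x peak-to-average ratio
        System.out.printf("avg ~ %.0f TPS, peak ~ %.0f TPS%n", avg, peak);
        // prints: avg ~ 386 TPS, peak ~ 3858 TPS
    }
}
```

Rounding 386 to "~400" and 3,858 to "~4,000" is exactly the level of precision an interview needs.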
💡 Design a Payment System: The Complete Answer. API Gateway (auth, rate limiting, idempotency check) → Payment Service (Spring Boot, Hexagonal) → synchronous Risk Service check before authorization → state change and outbox row written in the same Postgres transaction → outbox relay publishes the event to Kafka → async consumers: Ledger Service, Notification Service. A state machine enforces valid transitions. Circuit breakers guard all external calls (card networks). Idempotency keys prevent double charges. Stripe-style tokenization keeps raw card data (CHD) out of your systems and out of PCI scope. That's a complete answer.
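"Idempotency keys prevent double charges" can be illustrated in a few lines. The first request with a given key runs the charge, and any retry with the same key gets the stored result back. This is an in-memory, hypothetical sketch. `IdempotentExecutor` is not a library class. A real service persists keys (for example, behind a unique constraint in Postgres or via Redis `SET ... NX` with an expiry) so they survive restarts and work across instances.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class IdempotentExecutor<R> {
    private final Map<String, R> results = new ConcurrentHashMap<>();

    // computeIfAbsent applies the function at most once per key, atomically for this map.
    // If the operation throws, nothing is stored and a retry may run it again.
    R execute(String idempotencyKey, Supplier<R> operation) {
        return results.computeIfAbsent(idempotencyKey, k -> operation.get());
    }
}
```

Two refinements usually follow. First, store a hash of the request body with the key and reject a reuse of the key with a different payload. Second, avoid holding the map's per-key lock during slow network calls, for example by recording an "in progress" marker first.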
11
The Frameworks Behind the Frameworks

Mental Models

The patterns that don't belong to any language or framework but that senior engineers apply to every problem. These are the invisible curriculum.

Leaky Abstractions
Joel Spolsky's Law · "All abstractions leak"
Every abstraction eventually exposes the details it was meant to hide. TCP hides network unreliability — until packet loss forces retransmit and you see the latency spike. JPA hides SQL — until N+1 queries destroy your performance. The better you understand what's underneath, the better you use what's on top. A greybeard's advantage: you've seen more leaks.
Mechanical Sympathy
Martin Thompson · Hardware-Aware Software
Code that understands the hardware it runs on outperforms code that ignores it. Consider CPU cache lines (typically 64 bytes on x86), memory access patterns (spatial/temporal locality), NUMA topology, and false sharing on atomic variables. Your JVM profiling instincts should include "is this cache-miss heavy?" before "is the algorithm wrong?"
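Spatial locality is easy to see with a 2D array. Java stores `int[][]` as an array of row arrays. Walking row by row therefore reads consecutive memory, using every int in each cache line fetched. Walking column by column jumps to a different row array on every access. Both loops below compute the same sum; on large arrays the column-major one is typically noticeably slower. `Locality` is a hypothetical name. The timing in `main` is only a rough demonstration, and JMH is the right tool for real measurements.

```java
final class Locality {
    // Row-major: sequential access within each row array (cache friendly).
    static long sumRowMajor(int[][] m) {
        long sum = 0;
        for (int[] row : m) for (int v : row) sum += v;
        return sum;
    }

    // Column-major: strided access across row arrays (cache hostile). Assumes a rectangular array.
    static long sumColumnMajor(int[][] m) {
        long sum = 0;
        for (int c = 0; c < m[0].length; c++)
            for (int r = 0; r < m.length; r++) sum += m[r][c];
        return sum;
    }

    public static void main(String[] args) {
        int n = 4096;
        int[][] m = new int[n][n];
        for (int r = 0; r < n; r++) for (int c = 0; c < n; c++) m[r][c] = r ^ c;
        long t0 = System.nanoTime(); long a = sumRowMajor(m);
        long t1 = System.nanoTime(); long b = sumColumnMajor(m);
        long t2 = System.nanoTime();
        System.out.printf("row-major %d ms, column-major %d ms, equal=%b%n",
            (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```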
The Two Hard Problems
Cache Invalidation · Naming · Off-By-One
Caching makes reads fast but creates consistency problems. The hard question is never "should we cache?" but "when does this cache entry become invalid and how do we propagate that?" Cache-aside, write-through, write-behind, TTL, event-driven invalidation — each a different consistency tradeoff. Know them all before you cache anything in fintech.
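Of the strategies named above, cache-aside with a TTL plus explicit eviction on write is the most common starting point. The following is a minimal sketch with hypothetical names. Reads check the cache and fall back to the source of truth on a miss or expiry. Writers commit to the source of truth first and then evict, so the next read reloads. The TTL bounds staleness if an eviction is lost. The sketch does not close every race: a read that loaded the old value just before the write can still repopulate it until the TTL expires.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

final class CacheAside<K, V> {
    private record Cached<T>(T value, Instant expiresAt) {}

    private final Map<K, Cached<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;   // reads from the source of truth
    private final Duration ttl;

    CacheAside(Function<K, V> loader, Duration ttl) {
        this.loader = loader;
        this.ttl = ttl;
    }

    V get(K key) {
        Cached<V> c = cache.get(key);
        if (c != null && Instant.now().isBefore(c.expiresAt())) return c.value(); // fresh hit
        V fresh = loader.apply(key);                                             // miss or expired
        cache.put(key, new Cached<>(fresh, Instant.now().plus(ttl)));
        return fresh;
    }

    // Call after the write to the source of truth has committed.
    void invalidate(K key) { cache.remove(key); }
}
```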
Conway's Law
Melvin Conway 1967 · Org Structure = System Structure
Organizations design systems that mirror their communication structures. If three teams build a compiler, you get a three-stage compiler. The Inverse Conway Maneuver: design the team structure you want, and the architecture will follow. Microservices work when team ownership matches service boundaries. When it doesn't, you get distributed monoliths.
Hyrum's Law
"All observable behaviors become contracts"
With enough users of an API, it doesn't matter what you specify in the contract — all observable behaviors will be depended on by someone. The order of items in a JSON array. The exact text of an error message. The response time. Someone depends on it. Fintech implication: API changes that seem harmless break clients in ways you never anticipated. Version everything.
Simple vs Easy
Rich Hickey · "Simple Made Easy" 2011
"Easy" means close to you — familiar, low ceremony. "Simple" means not complex — not entangled with other things. Spring Boot is easy (familiar, quick to start) but not necessarily simple (the magic intertwines many concerns). Plain JDBC is harder (more ceremony) but simpler (explicit). Choose simple over easy for things that matter long-term. Easy for throwaway scripts.
You Ain't Gonna Need It
YAGNI · XP Principle · Resist Premature Generalization
Don't build features or abstractions before they're needed. The cost of the wrong abstraction is much higher than the cost of refactoring simple code. In AI-assisted development this is critical: AI assistants tend to generate generic, over-engineered code by default. Your job is to prune it back to what's actually required. Less code = fewer bugs = less to containerize = less to secure.
The Pit of Success
Rico Mariani · API Design Philosophy
Design your APIs, frameworks, and systems so that the easiest path leads to the correct, safe, well-performing outcome. This is the opposite of a "pit of failure" (where easy = wrong). Spring Security's lambda DSL (introduced in 5.2) is a pit of success: the most natural way to write it is the secure way. Your custom APIs should do the same: make the safe path the natural path.
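In a fintech codebase, a small pit of success is a money type whose only entry point is hard to misuse. The hypothetical `Amount` class below accepts a decimal string and a `java.util.Currency`. There is deliberately no `double` overload, so binary floating-point errors cannot enter the domain. It also rejects values with more decimal places than the currency allows, instead of silently rounding them. The design is an illustration, not a recommendation of a specific library.

```java
import java.math.BigDecimal;
import java.util.Currency;

final class Amount {
    private final BigDecimal value;
    private final Currency currency;

    // Private: callers cannot bypass the validating factory.
    private Amount(BigDecimal value, Currency currency) {
        this.value = value;
        this.currency = currency;
    }

    // The only way in: a decimal string plus a currency. No of(double) exists on purpose.
    static Amount of(String decimal, Currency currency) {
        int digits = currency.getDefaultFractionDigits();
        if (digits < 0) throw new IllegalArgumentException("currency has no minor unit: " + currency);
        // setScale without a RoundingMode throws ArithmeticException if digits would be lost.
        return new Amount(new BigDecimal(decimal).setScale(digits), currency);
    }

    // Mixing currencies is an error, not a silent conversion.
    Amount plus(Amount other) {
        if (!currency.equals(other.currency)) {
            throw new IllegalArgumentException("currency mismatch: " + currency + " vs " + other.currency);
        }
        return new Amount(value.add(other.value), currency);
    }

    BigDecimal value() { return value; }
    Currency currency() { return currency; }
}
```

The easy call, `Amount.of("10.25", usd)`, is also the correct one. The mistakes (passing `10.255` for USD, adding USD to EUR) fail loudly at the point of error.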
The Senior Engineer's Decision Framework
When You Hear... | First Question to Ask | The Trap to Avoid
"We need microservices" | What coordination problem are you solving? What's the team ownership model? | Distributed monolith: services that must deploy together and share a DB
"We need event sourcing" | Do you genuinely need audit history, time travel, or multiple projections? Or do you just want an audit log? | Event sourcing for simple CRUD: the complexity cost is enormous
"We need to use AI for this" | What's the deterministic solution? Is it inadequate? Why? | Using LLMs where a regex, a lookup table, or a trained classifier is 10x better
"We should cache this" | What's the invalidation strategy? What's the consistency SLA? | Stale cache in fintech = customer sees wrong balance = support tickets = regulatory risk
"This needs to be async" | Who is the consumer? What's the failure model? Who owns the retry? | Fire-and-forget messaging where the consumer can fail silently
"Let's rewrite this legacy service" | What specific problem does a rewrite solve that incremental improvement doesn't? | Second-system syndrome: the rewrite reaccumulates all the complexity the original had, plus new complexity
"The AI can handle that" | What are the failure modes? What's the fallback? How do we evaluate quality? | Deploying LLM-driven logic without an evaluation framework = non-deterministic production bugs
💡 The Final Insight — What Greybeards Actually Have Junior engineers learn patterns. Mid-level engineers apply them. Senior engineers know when NOT to apply them. After 30 years, you've seen every pattern implemented wrong, every abstraction leak, every "we'll fix it later" that never got fixed. That's not nostalgia — that's a production incident database in your head. The art is translating that database into precise, actionable insight at the right moment in a design discussion. That's what "senior" means.