KumoMTA troubleshooting & rescue

When mail stops landing and you do not know why, every hour costs. We read your delivery logs and your automation history, separate a temporary brake from a real block, and fix the cause — a Gmail 421, a Microsoft block, a queue that will not drain, a node that stopped accepting mail — without the panic fixes that make it worse.

KumoMTA troubleshooting is the work of diagnosing why a running KumoMTA stops delivering (queues that will not drain, deferrals, provider blocks, blacklist listings, a deployment that suddenly broke) from the engine’s own logs, metrics and automation history rather than guesswork. The method is to read the provider response first, because the difference between a 4xx temporary deferral and a 5xx hard rejection changes everything; then trace the symptom to a cause with kcli and the logs; then apply the smallest fix that holds, because the reflexes that feel right under pressure usually deepen the hole.

In short

→ The first digit of the SMTP response decides the path: 4xx is a temporary deferral the engine will retry; 5xx is a hard rejection that needs a fix, not patience.
→ A queue that will not drain is a symptom, not the problem — raising the send rate to clear it pushes harder on the upstream brake that caused it.
→ A Gmail 421 is a brake, not a wall: it means slow down (usually concurrency), not stop; suspending the queue for hours makes it worse.
→ Most incidents are diagnosed from the engine itself — kcli queue-summary, the structured logs and the TSA automation history — before anything is changed.
→ The TSA automation history is the first thing to read on a Kumo incident: an active suspension explains a stalled queue more often than any provider change.

When mail stops landing, the clock runs and the temptation is to do something — anything — now. That is exactly the problem: in a delivery incident, panic reactions reliably make things worse. Retrying harder, raising rates, switching off the automation because it "keeps suspending things", changing ten settings in one sitting... each of those impulses can turn a passing brake into a serious block. KumoMTA troubleshooting replaces panic with method: reading what the engine and the receivers are already telling you, identifying the real cause, and applying the right correction — which is often the opposite of what instinct demands. We run KumoMTA in production every day, which means we have seen most of these fires before and know which extinguisher each one takes. This page explains how we read a problem and how we work an incident, because knowing what is actually happening is, almost always, half the fix.

Where do you start when KumoMTA stops delivering?

The good news about KumoMTA is that nothing fails silently: every rejection arrives with a code and a message from the receiver, every delivery attempt lands in a structured log record, and the engine's own automation keeps a history of what it observed and what it did about it. The first step of any diagnosis is reading that material instead of reacting to the scare. A deployment that delivers badly is not an inscrutable mystery; it is a set of SMTP responses telling a story to anyone who reads them in volume. The difference between a sender that recovers in days and one that sinks for weeks is largely whether those signals get heard or trampled. So we start with the logs and the responses, never with conjecture: they record which provider is braking you, since when, and very often why. Troubleshooting is, before anything else, an exercise in careful reading.

What is the difference between a 4xx and a 5xx SMTP response?

The most important distinction in any rejection is its first digit. A 4xx code is a temporary failure: the receiver will not take the message now but might later, so the right response is patient retrying under sane backoff. A 5xx code is permanent: insisting achieves nothing, and the fix lives in configuration, content or reputation, never in another attempt. KumoMTA classifies these responses and acts accordingly — but only as well as its policy and shaping tell it to, and the classic, expensive mistake survives every generation of MTA: treating a 4xx as final, or a 5xx as something retries will wear down. There is a wrinkle worth knowing, too: some receivers return permanent codes for conditions that are actually temporary, which is why we read the text and the pattern, not the digit alone. Getting this one distinction right is the difference between an orderly recovery and an aggravation, and it is the first thing we check in your logs.
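As a minimal sketch of that first check, the snippet below splits one SMTP reply line into its three-digit code, its optional enhanced status code and its text, and labels it by first digit. The `Reply` type and `parse_reply` function are illustrative names of ours, not part of KumoMTA, and the label is only a starting point: the text still has to be read for the permanent-code-for-a-temporary-condition cases mentioned above.

```python
import re
from typing import NamedTuple, Optional


class Reply(NamedTuple):
    code: int                # three-digit SMTP reply code, e.g. 421
    enhanced: Optional[str]  # enhanced status code such as "4.7.0", if present
    text: str                # remaining human-readable text
    kind: str                # "deferral" (4xx), "rejection" (5xx) or "other"


# code, separator, optional enhanced status code, then free text
_REPLY = re.compile(r"^\s*(\d{3})[ -](?:(\d\.\d{1,3}\.\d{1,3})\s+)?(.*)$")


def parse_reply(line: str) -> Optional[Reply]:
    """Parse one reply line; return None if it does not look like an SMTP reply."""
    match = _REPLY.match(line)
    if match is None:
        return None
    code = int(match.group(1))
    # The first digit decides the path: 4 = retry later, 5 = needs a fix.
    kind = {4: "deferral", 5: "rejection"}.get(code // 100, "other")
    return Reply(code, match.group(2), match.group(3).strip(), kind)
```

Fed the Yahoo line from a real log, `parse_reply("421 4.7.0 [TSS04] Messages temporarily deferred due to volume")` yields a `deferral` with enhanced code `4.7.0`, and the text is kept intact so a human or a later rule can still read the wording.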

From symptom to cause

Most incidents arrive wearing one of a few familiar faces, and each face points to a short list of likely causes and a sensible first move. The table summarizes what we get called about most, before the detail below.

Reading a KumoMTA incident, response first

Every KumoMTA incident starts with the provider response. A 4xx is a temporary deferral the engine retries: check the TSA history for a suspension and ease concurrency; a Gmail 421, for example, means slow down. A 5xx is a hard rejection that needs a real fix: a Microsoft S3150 or a sudden mass of 5xx responses points to reputation or a block, and a blacklist listing needs delisting at the source. At every branch, raising the send rate is the reflex that deepens the problem.

Symptom	Likely cause	First step
421 from Gmail	Temporary brake on reputation, pace or engagement	Slow that path and let automation hold it — never retry at full speed
550 5.7.1 (S3150) from Microsoft	IP-level block returned as a permanent error	Start delisting and fix the cause, without poisoning the queue
Queue that will not drain	A provider braking you, an active TSA suspension, or retry policy out of tune	Diagnose which — from queue summary and the automation history — before touching rates
Bounce storm	Dirty list, broken authentication or damaged reputation	Read the codes; split temporary from permanent before reacting
Placement drop	Reputation, engagement or content, more than the engine	Measure real placement and look beyond the configuration
Listeners refusing mail	Memory headroom exhausted, or spool and disk pressure upstream	Check memory and spool metrics first — it is protection, not malfunction

What did Traffic Shaping Automation do during the incident?

In a KumoMTA deployment there is a diagnostic question that does not exist on older engines, and it comes first: what has Traffic Shaping Automation already done about this? A queue that stopped draining is often not a mystery at all: a rule matched a provider response and suspended that path for two hours, or quietly dropped its message rate, exactly as designed. That is the system working; the incident, if there is one, is upstream in whatever triggered the rule, or in a rule that over-reacts by suspending for hours where a short rate reduction would have kept mail moving. So we read the automation history alongside the queues: which rules fired, how often, on what response text, with what action and duration. Sometimes the finding is that nothing is broken and the right move is to let the suspension expire while fixing the cause; sometimes it is a stock rule that needs tuning to your traffic. Either way, diagnosing a KumoMTA deployment without reading TSA's diary means guessing at decisions the engine already wrote down.

The logs: where everything is written

The truth of an incident lives in the delivery logs, not in impressions. KumoMTA records every attempt as structured data (the response code and text, the receiver, the queue, the source, the timing) in compressed JSONL segments built to be processed at volume; rejection records even capture the command that triggered them. Two practical rules keep the reading honest. Accounting comes from these logs, not from queue snapshots: the live counters reset as ready queues are reaped, so only the log stream gives you trustworthy totals. And the diagnostic log, which is what the daemon itself reports through the system journal, has adjustable verbosity that can be raised dynamically while you investigate and must be lowered after, because the chattiest levels will fill a disk with impressive speed. A deployment with well-kept logs gets diagnosed in hours; one that disabled or starved its logging forces you to wait for the problem to recur before anyone can see it. The first thing we ask for, always, is read access to this material.
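To make "accounting comes from the logs" concrete, here is a small sketch that tallies record types and response codes from already-decompressed JSONL lines. The field names used (`type`, `response.code`) and the sample record types are assumptions for illustration; check them against the records your own installation writes before relying on the totals.

```python
import json
from collections import Counter


def tally_responses(jsonl_lines):
    """Count (record type, response code) pairs across decompressed JSONL lines."""
    counts = Counter()
    for raw in jsonl_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between records
        record = json.loads(raw)
        response = record.get("response") or {}
        counts[(record.get("type"), response.get("code"))] += 1
    return counts


# Hypothetical sample records, shaped like the assumptions above.
SAMPLE = [
    '{"type": "Delivery", "queue": "example.com", "response": {"code": 250, "content": "OK"}}',
    '{"type": "TransientFailure", "queue": "yahoo.com", "response": {"code": 421, "content": "[TSS04] deferred"}}',
    '{"type": "TransientFailure", "queue": "yahoo.com", "response": {"code": 421, "content": "[TSS04] deferred"}}',
    '{"type": "Bounce", "queue": "example.org", "response": {"code": 550, "content": "user unknown"}}',
]

if __name__ == "__main__":
    for (kind, code), n in tally_responses(SAMPLE).most_common():
        print(kind, code, n)
```

The same function accepts any iterable of lines, so it can read from a file object or from standard input once the segments are decompressed.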

The toolbox: live answers from a running engine

KumoMTA ships a command-line client that turns "what is happening right now" into answerable questions, and an incident is when it earns its keep. The queue summary shows, per destination site and per source, what is delivering, what is in transit and what is waiting — the fastest way to see whether a problem is one provider, one IP or everything at once. A provider summary aggregates the same picture per receiver; a live top view watches it move. The log filter can be raised on the spot, without restarts, when the journal needs to say more. And two operational verbs matter in a rescue: rebinding, which moves a stream of queued messages onto a different queue with the rules re-evaluated — invaluable when a path is poisoned and its mail needs a healthier route — and administrative bouncing or transferring of queued mail when a node has to be drained. None of this requires touching the policy; it is observation and surgery on a live engine, which is exactly what the first hours of an incident call for.

The first five minutes of a stalled queue

ops@mta-01 — incident

# Which queue is stuck, and what is the last response? (Q=queued)
$ kcli queue-summary --by-site
SITE              D       T    C       Q   last-response
yahoo.com       402      0    0   28140   421 4.7.0 [TSS04] deferred

# Before touching shaping: did the automation already suspend this route?
$ kcli tsa-status --domain yahoo.com
SUSPEND active 47m remaining · trigger: 421 4.7.0 burst · auto-applied

# Read the actual events, not a guess — last 5 deferrals on the route
$ kcli tail-log --domain yahoo.com --type deferral --limit 5
421 4.7.0 [TSS04] Messages temporarily deferred due to volume

# Cause found: automation is doing its job. Let it expire, do not rebind.

A real first response: queue-summary shows 28,140 messages queued for Yahoo with a 421 deferral; tsa-status reveals the automation already suspended the route for 47 more minutes after a 421 burst; tail-log confirms the cause is volume, not a block. The fix here is to do nothing — the engine is recovering correctly, and rebinding or raising the rate would fight its own safety system.

Watching the wire: tracing real conversations

When logs say what happened and you need to see it happen, the engine can show you the conversation itself. Client-side tracing streams the outbound SMTP dialogue in real time — the banner, the EHLO, the TLS step, the exact moment and wording of the receiver's rejection — filtered down to the ready queue or source you care about, because on a busy server unfiltered tracing drowns. Server-side tracing does the same for inbound connections, which makes short work of injection problems: the application that says it is sending while the engine sees nothing, the authentication that fails before the first byte of payload, the malformed message a generator produces under load. This is the difference between deducing a handshake problem from its wreckage and watching it occur; some classes of fault — a receiver greeting-rejecting one specific IP, TLS failing against one specific MX — are minutes with a trace and days without one. We use it surgically, with filters, and we leave you knowing how.

Deferrals or hard bounces: which one are you seeing?

Two things constantly get conflated in a crisis, and treating them alike is a mistake with compound interest. A deferral is "not now, come back later": the receiver is holding your mail, and with the right pace it will land. A hard bounce is "this does not exist" or "never": insisting only spends reputation. When an incident mixes both — and bounce storms usually do — the first task is separation, because the responses they demand are opposite: patience and backoff for deferrals, immediate suppression for the dead. An engine retrying hard bounces as if they were deferrals sinks itself; one that writes off deferrals as dead loses mail that was minutes from landing. KumoMTA's bounce classification does this sorting well when it is wired to do so; part of every diagnosis is verifying the sorting actually happens, and that the suppression list is fed by it rather than existing in parallel folklore.

What does a Gmail 421 response mean?

The most common call is also the most mishandled. When Gmail returns its 421 rate language, it is applying a temporary brake — reputation, pace or engagement — and asking you to slow down; it is not building a wall. Gmail adjusts dynamically how much it accepts from each sender, and crossing your current allowance produces exactly this code. The trap is treating it as a hard failure and retrying at full speed, which amplifies the very signal that triggered the brake. The right move is the opposite: lower the pace, let the automation hold the path — this is precisely the response the community shaping rules are written for — and rebuild trust, at which point the 421s fade on their own. If the origin was a rushed warm-up or a list problem, we identify and correct it. This one almost always resolves with patience and method, almost never with force, and the fastest fix is frequently stopping the "fix" already in progress.

When Microsoft closes: the S3150

Few errors frustrate like Microsoft's S3150, so we explain it without varnish. It arrives as a 550 5.7.1 — formally a permanent rejection — saying part of your network is on their block list, pointing to a portal that sometimes does not load. Operator experience over the years associates it with an IP-level block from their filtering, returned as permanent even though the underlying cause is usually temporary: reputation or pace. That contradiction is the operational problem: a permanent code for a non-permanent condition generates bounces, suppressions and addresses marked dead that are not — so the first defensive move in KumoMTA is making sure your bounce handling does not let an S3150 storm poison the suppression list. Resolution runs through their delisting channels and, above all, through fixing the reputation cause, because without that the block returns. We say from day one which part of the timeline is ours and which is Redmond's, because they do not publish the thresholds and honesty beats theater.
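The defensive move described above can be sketched as a filter in front of the suppression list: permanent failures feed it, except those whose text marks an IP-level block. The `BLOCK_MARKERS` tuple and the record keys (`recipient`, `code`, `text`) are hypothetical names chosen for this example, and the marker list would need to reflect what your own logs actually contain.

```python
import re

# Assumption: text markers for IP-level blocks that arrive with a 5xx code.
# S3150 is the case discussed in the text; extend from your own rejection records.
BLOCK_MARKERS = (re.compile(r"\bS3150\b"),)


def suppression_candidates(bounces):
    """Split 5xx bounces into recipients to suppress and recipients to hold.

    Each bounce is a dict with 'recipient', 'code' and 'text' keys.
    Returns (suppress, held): a set of addresses that look genuinely dead,
    and a list of addresses rejected by a block rather than by their mailbox.
    """
    suppress = set()
    held = []
    for bounce in bounces:
        if not 500 <= bounce["code"] <= 599:
            continue  # deferrals never feed the suppression list
        if any(marker.search(bounce["text"]) for marker in BLOCK_MARKERS):
            held.append(bounce["recipient"])  # the IP was refused, not the address
        else:
            suppress.add(bounce["recipient"])
    return suppress, held
```

The design choice is deliberately conservative: an address held back by mistake costs one more delivery attempt later, while a live address suppressed during a block storm is lost until someone notices.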

Sudden, massive 5xx: a block in motion

When permanent rejections spike at once against one receiver, it is almost always an active block, and the reaction decides the outcome. This is not the normal drizzle of dead addresses; it is a wall a provider just raised against your IP or domain, for reputation or policy. The instinct to keep pushing is the worst available: every attempt confirms the block's reasoning. First, stop pressure on that path — suspend it deliberately if automation has not already — then identify exactly which IP or domain is blocked and read the reason the rejection gives, which in KumoMTA is sitting verbatim in the rejection records. From there the route is fixing the cause and using the delisting channel where one exists. A sudden mass 5xx is one of the few situations where speed of reaction — in the sense of stopping, never of pushing — genuinely changes how the story ends.
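One way to tell a block in motion from the normal drizzle of dead addresses is to count 5xx responses per receiver in fixed time windows and compare each window with the ones before it. The sketch below does that; the window size, the factor and the minimum count are arbitrary illustrative defaults, not recommended thresholds, and the input tuples are an assumed shape extracted from the delivery logs.

```python
from collections import Counter


def find_5xx_spikes(events, window_seconds=300, factor=5, min_count=50):
    """Flag windows where 5xx responses for one receiver jump above their baseline.

    events: iterable of (epoch_seconds, receiver, smtp_code).
    Returns a list of (receiver, window_start, count) for flagged windows.
    The baseline is the mean of earlier windows that had at least one 5xx;
    windows with no 5xx at all are not represented, which makes the baseline
    slightly pessimistic.
    """
    per_window = Counter()
    for ts, receiver, code in events:
        if 500 <= code <= 599:
            per_window[(receiver, ts - ts % window_seconds)] += 1

    by_receiver = {}
    for (receiver, start), count in per_window.items():
        by_receiver.setdefault(receiver, []).append((start, count))

    spikes = []
    for receiver, windows in by_receiver.items():
        windows.sort()
        for i, (start, count) in enumerate(windows):
            earlier = [c for _, c in windows[:i]]
            baseline = sum(earlier) / len(earlier) if earlier else 0
            if count >= min_count and count > factor * max(baseline, 1):
                spikes.append((receiver, start, count))
    return spikes
```

A flagged window is a prompt to stop pressure on that path and read the rejection text, not a verdict; the reason the receiver gives still comes from the records themselves.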

How does blacklist delisting actually work?

When the problem is a blocklist, honesty is the only useful policy. Landing on a list like Spamhaus's, or on a receiver's internal list, brakes delivery across the board — and getting off is possible, but not by magic: most lists require the cause fixed first — a compromised source, a dirty list, a complaint spike — and some re-list automatically if they detect the problem persists. Serious delisting is therefore always two steps in strict order: repair the origin, then request removal. We cover the full discipline on the blacklist delisting page, but inside an incident the principle is identical: delisting without repairing buys time, not a solution, and the time it buys is usually short.

Why will a KumoMTA queue not drain?

A growing queue frightens people into exactly the wrong move: raising rates to "empty it". A queue that does not drain is a signal, never the disease — something upstream is braking delivery, and in KumoMTA the suspect list is short and checkable. An automation suspension still active on that path. A provider ceiling crossed, with the deferrals to prove it. A retry policy too patient or too eager for that receiver. A throttle shared across nodes doing its job. Or — the one people forget — the scheduled queue refilling faster than the ready queue may legally drain, which is an injection-side problem wearing a delivery-side costume. The queue summary separates these in minutes: per-site, per-source counts show whether the clot is one provider, one IP or systemic, and the automation history says whether the engine itself is holding the door. Pressing the accelerator while the handbrake is on burns fuel and reputation; we find the handbrake first.
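The injection-side case is plain arithmetic: a backlog shrinks only at the rate deliveries exceed new arrivals. A minimal sketch, with hypothetical per-minute rates that you would take from your own metrics:

```python
import math


def minutes_to_drain(backlog, drain_per_min, inject_per_min):
    """Minutes until the backlog reaches zero, or math.inf if it never will."""
    if backlog <= 0:
        return 0.0
    net = drain_per_min - inject_per_min  # what actually leaves the queue
    if net <= 0:
        return math.inf  # refilling at least as fast as it drains
    return backlog / net
```

With the 28,140 queued messages from the Yahoo example and assumed rates of 600 delivered and 131 injected per minute, the queue clears in 60 minutes; with injection matching delivery at 600 per minute it never clears, and no shaping change on the delivery side will alter that.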

Why would KumoMTA stop accepting mail?

A specifically Kumo incident, alarming the first time and instructive after: the listeners stop accepting new messages, injectors see refusals, and the team concludes the MTA crashed. Usually it did not — it protected itself. Under real memory pressure the engine shrinks message bodies back to spool, purges its caches, and at zero headroom deliberately stops accepting new work until it recovers; the health check says unhealthy on purpose. The same family includes spool I/O saturation — a storage layer that cannot keep up turning a fast engine into a stalling one — and the self-inflicted variant, a diagnostic log left on its chattiest setting until the disk filled. The diagnosis runs through the metrics the engine publishes about its own memory and spool, and the fix is upstream sizing: ready-queue limits matched to the traffic distribution, spool on storage that can take the writes, log verbosity returned to sane. The protection behaving is not the incident; the sizing that made it necessary is.

The fixes that make everything worse

A good share of incident work is preventing panic from compounding the damage, so we stop the bleeding before anything else. Retrying a 421 at full speed shouts at the receiver precisely what annoyed it. Hammering a permanent 5xx wastes resources and broadcasts carelessness. Raising rates to drain a queue adds pressure on a receiver already holding you. Disabling the automation because it "keeps suspending things" removes the safety system mid-incident — the Kumo-specific version of cutting the brake lines because the car keeps slowing down. And changing many things at once makes cause and effect permanently unknowable, which converts a one-week incident into a one-month mystery. The first rule of an incident is not to deepen it; some of the most valuable minutes we bill are the ones spent stopping the fixes already in motion.

Complaint spikes: the silent cause

Behind many sudden delivery drops sits a complaint spike nobody saw arrive. When too many recipients mark your mail as spam — after a campaign to a stale segment, a sender-name change, content people did not expect — receivers react fast and without ceremony: deferrals rise, placement slides, blocks follow. Unlike a bounce, a complaint does not always leave a loud trace in your own logs; it has to be looked for, in the feedback loops and in correlation with what was sent and to whom. So every incident includes the question: does a complaint spike explain this? Because if it does, no engine adjustment fixes anything — the problem is the audience or the message, and the 0.30% threshold where receivers act does not negotiate with configuration. Finding this early spares days of hunting the fault in an engine that was never at fault.
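Checking for a complaint spike usually means computing the rate per segment rather than for the whole send, because a healthy average can hide one sour list. The sketch below assumes you already have complaint and delivery counts per segment from feedback loops and delivery logs; the segment names and the input shape are hypothetical, and the 0.30% default is the figure quoted above.

```python
COMPLAINT_THRESHOLD = 0.003  # the 0.30% figure from the text


def segments_over_threshold(stats, threshold=COMPLAINT_THRESHOLD):
    """Return {segment: complaint_rate} for segments at or above the threshold.

    stats maps a segment name to a (complaints, delivered) pair.
    Segments with nothing delivered are skipped rather than divided by zero.
    """
    flagged = {}
    for segment, (complaints, delivered) in stats.items():
        if delivered <= 0:
            continue
        rate = complaints / delivered
        if rate >= threshold:
            flagged[segment] = rate
    return flagged
```

For example, 45 complaints on 12,000 delivered is 0.375% and gets flagged, while 12 complaints on 40,000 delivered is 0.03% and does not.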

Authentication as the hidden cause

Sometimes the deferrals are not about pace or reputation at all, but about authentication that broke without announcing itself. An SPF record that crept past its limit of 10 DNS lookups after someone added a vendor; a DKIM key rotated on one side and not the other; alignment that quietly stopped holding after a DNS migration: any of these reads, from outside, as mail losing placement or collecting odd deferrals with no message that says "your authentication failed". In any placement incident without an obvious cause, we check the authentication layer as a standing suspect: signatures active and aligned for every stream, SPF within its limits, DMARC coherent with what is actually sent. It hides well because a deployment that "always worked" can break here through a change made nowhere near the mail team. Ruling it out early prevents ghost-hunting in shaping files while the real fault sits in a DNS record.
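The SPF lookup limit is easy to check roughly by hand. RFC 7208 caps evaluation at 10 DNS-querying terms (`include`, `a`, `mx`, `ptr`, `exists` and the `redirect` modifier). The sketch below counts those terms in a single record; it is a first-pass estimate only, because it does not resolve nested `include` or `redirect` targets, which also count toward the limit.

```python
SPF_LOOKUP_LIMIT = 10  # RFC 7208, section 4.6.4
_LOOKUP_MECHANISMS = {"include", "a", "mx", "ptr", "exists"}


def count_spf_lookup_terms(record: str) -> int:
    """Count DNS-querying terms in one SPF record (top level only)."""
    count = 0
    for term in record.split()[1:]:  # skip the leading "v=spf1"
        term = term.lstrip("+-~?").lower()  # drop the qualifier, if any
        # Mechanism name is whatever precedes ':' (domain) or '/' (CIDR length).
        name = term.split(":", 1)[0].split("/", 1)[0]
        if name in _LOOKUP_MECHANISMS or term.startswith("redirect="):
            count += 1
    return count
```

A record that already shows eight or nine at the top level is a strong suspect, since each included vendor record typically adds lookups of its own.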

When the problem is the message, not the engine

There are incidents where the configuration is impeccable, the reputation clean, the bounces unremarkable — and placement fell anyway, because the mail itself changed. A redesigned template heavy with images and links, a subject line that diverged from the body, a URL shortener that picked up a bad reputation, a new tracking domain the filters have never seen: any of these can walk a stream that always landed straight into the spam folder. We check whether the drop coincides with a content change, comparing what delivered with what stopped delivering — the logs date the cliff precisely, which usually dates the cause. We are not a copy agency, but the technical signals filters punish are legible, and part of honest diagnosis is saying when the fix is editorial rather than infrastructural. Tuning shaping against a content problem is adjusting screws that were already tight while the fault rides along in every message.

New deployment that never started, veteran that suddenly broke

Two origin stories need different forensics. A fresh deployment that has delivered badly since day one usually carries an original sin: a quickstart configuration promoted to production (the project itself is explicit that the tutorial install is not production-ready), a skipped warm-up, authentication never finished, outbound port 25 never actually opened at the cloud provider, or IPs with a history nobody checked. There we audit the foundations one by one, because a crooked start is straightened by returning to the base, not by sending more. A veteran deployment that worked for years and broke on a Tuesday is the opposite investigation: something changed, even if nobody admits to it. KumoMTA gives this hunt a gift no legacy engine offers, because the configuration is code in version control. The diff between "worked" and "broken" is frequently sitting in the repository history with a timestamp and an author. We reconstruct the timeline, cross it with what the responses say, and the small recent change with the large effect usually surrenders quickly.

What does KumoMTA troubleshooting not promise?

Part of honest troubleshooting is naming what is out of reach. We have no button that erases a block controlled entirely by a receiver that publishes no criteria; we have the method to fix the cause, the channels to request removal, and the candor to say how much of the timeline is not ours. We do not rebuild a ruined reputation in an afternoon, because reputations rebuild through behavior over weeks, not through a config change. And we do not make dirty sending deliver clean: if the cause is a bought list or consent that never existed, no engine repair helps, and we will say so rather than sell a comforting fiction. We would rather promise less and deliver it than promise a miracle and add disappointment to your incident. Knowing what cannot be done is part of doing well what can.

What we need from you to start

The faster we can read, the faster the cause appears, and the list is short. Read access to the delivery logs, the policy and the metrics — that is where the evidence lives. A timeline as you experienced it: when it started, what you noticed first, and above all what changed in the days around it — a DNS edit, a new IP, an unusual campaign, an upgrade, a new injector. Those "what changed?" answers are usually the thread that unwinds the whole knot, and with policy in version control, "what changed" often has a literal answer we can read. A handful of real rejection messages beats any paraphrase of them. And the engine version plus the policy file, which is the same checklist the project itself asks for when an issue might be a genuine bug — a request we can assemble in our sleep. You describe what you see; we translate the codes into a cause and a plan.

How we work an incident

Deliberately methodical, because hurried hands are what break things. First we stabilize: stop the fixes that worsen, suspend deliberately where pressure is doing damage, contain. Then we diagnose on evidence — logs, automation history, queue state, reputation — until we have a cause, not a suspicion. Then the right correction, which is often slow down, wait and adjust rather than push, applied one change at a time and validated against the numbers as delivery recovers. And finally the write-up: what happened, why, what was changed, and what will keep it from returning — committed next to the policy, where the next responder will actually find it. Coverage spans European, North American and Latin American time zones, because delivery incidents do not respect office hours. The difference between a managed incident and a suffered one is somebody with the context and the calm to read the situation before acting on it.

The clock, and the pattern behind the fire

In a delivery incident time is literal money — every hour mail does not land is revenue not converting, and every day a reputation problem runs makes it costlier to reverse, because the damage compounds. Speed of diagnosis is therefore the lever that matters, and it does not come from acting fast blind; it comes from knowing where to look and recognizing the pattern on sight, which is what running the same engine in production buys. But the question that saves money beyond this week is why the fire started. Many incidents are symptoms: a warm-up that will produce more 421s, a pool design mixing reputations, an automation rule that will overflow again, a list quietly souring. When we close an incident we tell you whether it was an isolated event or the visible tip of a pattern — and what it would take to not meet again. That reading is what separates a patch from a fix, and it is frequently the argument for moving from calling us when it burns to having us watch so it does not, with the underlying causes handled as optimization rather than emergencies.

If you are in the middle of an incident, the first step is understanding what is happening: the free 25-point audit gives a fast reading of your sending and your KumoMTA and tells us where to look first. If the origin turns out to live outside the engine entirely, the deliverability audit chases it where it lives. Either way: evidence first, then action — never the other order.

Frequently asked questions

How is this different from the KumoMTA community and docs?

The project’s documentation and community are excellent at answering questions about KumoMTA the software. We diagnose your operation: your policy, your shaping and automation history, your logs, your reputation with each receiver — and we carry the incident to resolution. The two are complementary; when an issue turns out to be a genuine engine bug, we produce exactly the reproduction the maintainers ask for, because we speak their checklist natively.

Do you need access to my server?

We need to see the delivery logs, the policy and the metrics, which is where the evidence lives — normally read access is enough to diagnose. How that access works is agreed your way. Diagnosis means understanding before touching; changes come after, with your sign-off, and the diagnosis itself usually says precisely what to change.

How fast can you resolve an incident?

Diagnosis is fast — knowing what is happening, and stopping the actions that worsen it, usually happens in the first hours, and that is where most of the damage is prevented. Resolution depends on the cause: a Gmail brake stabilizes in days once the right adjustment lands; a Microsoft block can take longer because part of the timeline is theirs. We tell you which kind you have, and what depends on whom, on day one.

Can you remove a Microsoft block (S3150)?

We run the delisting through their channels and, above all, we fix the reputation cause that triggered it — without that, the block returns. But we are honest about this one: S3150 is a thorny case where part of the resolution sits with Microsoft, against thresholds they do not publish. We tell you what is in our hands and what is not, instead of promising a magic button that does not exist.

Is this a one-off service or ongoing?

Either. Call us for a single incident and walk away with it understood, fixed and documented — or add the continuous watch that catches the next one before it ignites. Putting out today’s fire is a project; preventing tomorrow’s is what a managed operation buys.

What if the problem turns out not to be KumoMTA?

We diagnose the same way and say so plainly. Plenty of delivery incidents do not live in the engine: a list gone stale, content that trips filters, a reputation already wounded before the first deferral appeared. If the cause sits outside KumoMTA, we point you at where it actually is rather than billing for adjusting an engine that was already right.

Mail stopped landing? Let's read it.

Clear diagnosis and measured action, without the panic fixes that make it worse. Start with the free 25-point audit, no strings attached.