Zentra
AME by Zentra / Fargate queue strategy

Stable web tier.
Independent queue scale.
Rollback-ready workers.

A professional Fargate queue strategy for AME by Zentra, built from the supplied overview, architecture, scaling, runbook, and troubleshooting notes, and styled to match the release strategy example.

isolated load queue-specific scale burst/drain ready conservative rollout rollback available
Primary heavy queues
sync: feed imports and sync finalisation
image: image fetch, create, backfill, and media bursts
bulk: cache warmers, rebuilds, and non-urgent fan-out

Goal: move heavy Laravel queue work off the main app hosts and onto dedicated ECS Fargate workers. Keep default separate so user-facing background work never gets mixed into heavy draining.

Laravel → Redis → ECS scale one queue at a time keep host rollback
Executive summary

Why change the queue model?

The supplied strategy separates heavy queue load from the web tier so queue pressure can grow or shrink without destabilising production app hosts.

01

Protect the web tier

Image and feed jobs stop competing with PHP-FPM, Horizon, and request/response traffic on the main fleet.

02

Scale only the bottleneck

Raise image without widening sync or bulk. Pay for extra capacity only where backlog proves it.

03

Keep rollback simple

ECS services can be scaled down and host consumers kept available during rollout until full ownership is proven.

Heavy work becomes
isolated observable service-owned queue-specific reversible
The problem we are solving

Shared hosts create avoidable queue risk.

When heavy queue work runs next to request/response traffic, tuning decisions and failures spill across unrelated parts of production.

What happens today
  • image processing competes with PHP-FPM and Horizon
  • long-running feed jobs make worker tuning harder
  • scaling queue throughput means touching production app servers
Operational downside
  • temporary worker hosts create drift and repeated manual setup
  • failures in one queue can destabilise unrelated work
  • mixed workloads blur what is actually causing pressure
Why it matters
  • web stability becomes hostage to background bursts
  • incident response depends too much on host-by-host behaviour
  • capacity changes become slower and riskier than they need to be

The fix is not “more shared workers”. The fix is a separate runtime layer for heavy queues.

The operating model

Dispatch normally. Drain heavily. Scale per queue.

The platform keeps normal Laravel queue semantics, but moves heavy execution onto ECS services that can scale by workload class.

Five-layer layout
  1. Laravel app dispatches jobs to the correct queue
  2. Redis stores the backlog and reserved / delayed state
  3. ECS services run the Laravel workers
  4. Aurora and S3 provide durable data dependencies
  5. Lambda and app-side wake-up logic control burst/drain scaling
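The burst/drain control in layer 5 reduces to a simple rule: run workers only while outstanding work exists, and drain to zero when it does not. The function below is an illustrative sketch, not the actual Lambda or wake-up code from this strategy; the ceiling values mirror the burst/drain ranges given later in the scaling section.

```shell
#!/bin/sh
# Burst/drain decision sketch (illustrative only, not the real
# Lambda/app wake-up logic). outstanding = ready + reserved + delayed.
desired_count() {
  outstanding="$1"   # total outstanding jobs for one queue
  max="$2"           # per-queue ceiling, e.g. 2 for image
  if [ "$outstanding" -gt 0 ]; then
    echo "$max"      # burst: run workers while work exists
  else
    echo 0           # drain complete: safe to scale to zero
  fi
}
```

With a backlog of 5 and a ceiling of 2, `desired_count 5 2` bursts to 2; with an empty queue, `desired_count 0 2` drains to 0.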
Runtime shape
  • each heavy queue gets separate desired counts, logs, and restart behaviour
  • task definitions differ by memory, timeout, queue name, retries, and max jobs
  • SSM provides DB, Redis, app key, and S3 runtime values instead of baking secrets into images
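Pulling runtime values from SSM at startup keeps secrets out of the image. A minimal sketch, assuming a `/propertyfeedtemplate/production/<KEY>` parameter layout — the real path scheme is not specified in this document:

```shell
#!/bin/sh
# Build the SSM parameter path for one runtime value.
# The path layout here is an assumption for illustration.
ssm_path() {
  printf '/propertyfeedtemplate/production/%s' "$1"
}

# Hypothetical fetch (requires AWS credentials and the real path layout):
# aws ssm get-parameter --with-decryption \
#   --name "$(ssm_path DB_HOST)" \
#   --query 'Parameter.Value' --output text
```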
sync

Owns heavy feed imports, sync fan-out, and manual property save finalisation. Long-running sync work stays clear of image draining.

image

Owns image fetch, create, backfill, and media processing. It is usually the main backlog driver and the first scaling candidate.

bulk

Owns cache warmers, cache rebuilds, and non-urgent heavy fan-out. It stays low priority and is cheap to run in burst/drain mode.

Important rule
default stays out of the heavy-drain path Redis remains the single queue truth
Service layout and dependencies

Each heavy queue gets its own service and clear runtime dependencies.

Splitting ownership by queue means each workload can be observed, sized, and rolled back separately instead of treating the whole worker fleet as one pool.

Production service map
cluster: propertyfeedtemplate-production
  • propertyfeedtemplate-sync
  • propertyfeedtemplate-image
  • propertyfeedtemplate-bulk
Network path must exist to
  • Redis on 6379
  • Aurora writer on 3306
  • S3, SSM, ECR, and CloudWatch Logs
  • the correct subnet and ENI path for workers and Lambda where used
Queue truth stays in Redis
  • queues:sync
  • queues:image
  • queues:bulk
  • plus reserved and delayed variants for each queue
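Because Redis is the single queue truth, backlog can be read directly: with Laravel's Redis driver the ready queue is a list and the reserved/delayed variants are sorted sets. A small helper sums the three counts into one outstanding-work figure (key names assume the default `queues:` prefix shown above):

```shell
#!/bin/sh
# Outstanding work for one queue = ready + reserved + delayed.
# The three counts would come from redis-cli, e.g. for image:
#   redis-cli llen  queues:image            # ready (list)
#   redis-cli zcard queues:image:reserved   # reserved (sorted set)
#   redis-cli zcard queues:image:delayed    # delayed (sorted set)
outstanding() {
  echo $(( $1 + $2 + $3 ))
}
```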
What separation buys you
separate desired counts separate logs separate restart behaviour queue-specific scaling
Rollout and cutover

Start low. Prove health. Keep host rollback available.

The strategy is intentionally conservative: prove correctness first, then widen only the queue that genuinely needs more throughput.

1

Baseline

sync=1
image=1
bulk=1

2

Prove basics

Confirm the container image builds, and that DB, Redis, S3, and queue routing all work.

3

Keep rollback

Do not disable host consumers until ECS is healthy.

4

Change one queue

Increase only one service at a time.

5

Prefer image first

Raise image from 1 to 2 first.

6

Watch soak

Observe logs and Sentry after steady state.

7

Expand carefully

Move the next queue only after the first is proven.
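Steps 4 to 6 above pair naturally: change one service's desired count, then block until it reaches steady state before judging the soak. The helper below is a dry-run sketch that prints the commands rather than executing them:

```shell
#!/bin/sh
# One-queue-at-a-time scale-up (dry-run: commands are printed, not run).
scale_one() {
  service="$1"; count="$2"
  echo "aws ecs update-service --cluster propertyfeedtemplate-production --service $service --desired-count $count"
  echo "aws ecs wait services-stable --cluster propertyfeedtemplate-production --services $service"
}

scale_one propertyfeedtemplate-image 2
```

`aws ecs wait services-stable` returns once running count matches desired count and deployments settle, which marks the start of the soak window.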

Cutover rules
  • keep host rollback available during initial rollout
  • move one queue class at a time if risk is unclear
  • after ownership is proven, host consumers for sync, image, and bulk should not compete with ECS
Rollback rules
  • scale the affected ECS service down
  • re-enable the host consumer if needed
  • keep queue routing explicit and reversible
  • do not delete ECS resources until diagnosis is complete
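Rollback therefore reduces to one ECS call plus re-enabling the host consumer. The helper below only prints the scale-down command (a dry-run sketch), and the supervisor program name in the comment is a hypothetical example:

```shell
#!/bin/sh
# Dry-run: print the rollback scale-down command instead of executing it.
scale_down_cmd() {
  printf 'aws ecs update-service --cluster propertyfeedtemplate-production --service %s --desired-count 0' "$1"
}

# After scaling down, re-enable the host consumer if needed, for example
# (supervisor program name is hypothetical):
#   sudo supervisorctl start laravel-worker-image:*
```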
Scale is the last step, not the first.
Scaling and cost model

Scale only the bottleneck queue.

More workers are not automatically better. Extra capacity helps only after routing, connectivity, and job behaviour are already correct.

image
  • usual backlog driver
  • preferred first scale-up target
  • burst/drain range: 0..2
  • move above 2 only after stability is proven
sync
  • hold at 1 in the safe baseline
  • widen only after image is stable
  • burst/drain range: 0..1
  • protect sync from image contention
bulk
  • deliberately low priority
  • burst/drain range: 0..1
  • good candidate for cheap selective scaling
  • do not widen before more urgent queues are healthy
Cost control principles
  • keep minimum counts low
  • scale only the queue that needs it
  • use burst/drain for non-constant work
  • avoid idle EC2 worker hosts and large always-on fleets without proof
Main cost levers
  • number of Fargate tasks
  • CPU and memory size per task
  • CloudWatch log volume
  • transfer and S3 side effects
  • Lambda controller cost is negligible compared with task runtime
Operations runbook

Operate with health checks, not queue depth alone.

Healthy operations require watching service state, connectivity, and error behaviour together rather than treating backlog as the only signal.

Normal daily checks
  • ECS desired, running, and pending counts
  • CloudWatch worker logs
  • queue backlog trend
  • Sentry worker errors
  • Aurora and Redis connectivity health
Healthy state
  • service reaches steady state
  • no restart storms
  • jobs do not fail immediately in loops
  • backlog trends down under normal ingest
  • no fresh Sentry bursts for connectivity
Safe change rules
  • increase only one queue
  • wait for steady state
  • watch a short soak window
  • keep scale-down reversible
  • drop to 0 only when ready, reserved, and delayed work is empty
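The last rule above can be encoded as a guard. A sketch, assuming the ready, reserved, and delayed counts have already been read from Redis:

```shell
#!/bin/sh
# Returns success (0) only when a queue has no ready, reserved,
# or delayed work left -- the only state where desired-count 0 is safe.
safe_to_stop() {
  ready="$1"; reserved="$2"; delayed="$3"
  [ $(( ready + reserved + delayed )) -eq 0 ]
}
```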
Production commands
aws ecs describe-services \
  --cluster propertyfeedtemplate-production \
  --services propertyfeedtemplate-sync propertyfeedtemplate-image propertyfeedtemplate-bulk

aws ecs update-service \
  --cluster propertyfeedtemplate-production \
  --service propertyfeedtemplate-image \
  --desired-count 2
Troubleshooting guide

Fix connectivity and runtime correctness before adding workers.

Most queue incidents are not solved by wider scale. First prove that each worker can start, reach dependencies, and consume the correct backlog.

Jobs fail immediately

Most likely: bad DB host, missing Aurora or Redis ingress, or stale SSM runtime values.

No steady state

Most likely: bad container image, missing Laravel runtime directories, or bad bootstrap at startup.

Queue not draining

Workers may not be consuming, jobs may be failing and requeueing, or the queue may point at the wrong Redis.

Timeout symptoms

A DB timeout can mean DNS resolves correctly but the network path is still blocked. A Redis timeout means the broker path is not healthy.

First checks
  • SSM runtime values
  • ECS worker security group
  • Aurora ingress on 3306
  • Redis ingress on 6379
  • CloudWatch startup logs and task stop reason
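The CloudWatch and task-stop checks above map to a short triage sequence. A dry-run sketch that prints the commands (the `/ecs/<service>` log group name and `<task-arn>` placeholder are assumptions, not values from this document):

```shell
#!/bin/sh
# First-check triage sketch: print the diagnosis commands, do not run them.
CLUSTER=propertyfeedtemplate-production
SERVICE=propertyfeedtemplate-image

triage() {
  printf '%s\n' \
    "aws ecs list-tasks --cluster $CLUSTER --service-name $SERVICE --desired-status STOPPED" \
    "aws ecs describe-tasks --cluster $CLUSTER --tasks <task-arn> --query 'tasks[].stoppedReason'" \
    "aws logs tail /ecs/$SERVICE --since 15m"
}

triage
```

The `stoppedReason` field usually distinguishes bad bootstrap (container exits) from networking (pull or dependency timeouts) faster than log trawling.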
Incident priorities
  • Redis reachability
  • Aurora reachability
  • S3 access
  • worker startup logs
  • queue routing correctness
  • deployment drift between the production .env and SSM

Operator rule of thumb: do not scale a failing queue wider first.

Risk controls and monitoring

Keep the platform conservative and observable.

The safest posture is explicit routing, clear networking, and monitoring that combines backlog with restart and dependency signals.

Guardrails
  • start low
  • keep default separate
  • scale one queue at a time
  • keep rollback simple
  • do not widen broken workers first
Monitor these together
  • queue depth
  • ECS restarts
  • Sentry worker errors
  • Redis timeout symptoms
  • Aurora timeout symptoms
Production lessons already learned
  • grant Redis and Aurora access from the current ECS worker SG, not only old worker IPs
  • stale DB hostnames can survive in both the production .env and SSM
  • historical Sentry issues can remain open after the active failure mode changes
Practical safety summary
explicit queue routing SSM-managed runtime values network access verified queue depth is not enough
Benefits and end state

Why this model is better over the long term.

The end state is a background execution layer that is safer for the web tier, easier to scale, and cheaper to keep mostly idle.

Operational benefits
  • better isolation between web and queue load
  • safer production app hosts
  • cleaner AWS visibility for queue consumption
  • simpler rollback
Financial benefits
  • small baseline footprint
  • burst/drain support during cheap idle periods
  • no need for extra always-on EC2 worker hosts
  • pay for more worker capacity only where backlog justifies it
Engineering benefits
  • queue-specific tuning and scaling
  • clearer bottleneck detection
  • less ad hoc worker server setup
  • normal Laravel worker semantics for long-running jobs
Corporate-style end state
stable web tier queue-specific scaling burst/drain ready clearer diagnosis rollback-ready
Plain-English explanation

Think of Fargate as the separate warehouse crew for heavy jobs.

The website keeps serving customers at the front counter while a dedicated crew in the back handles the pallets, images, and bulk work. When there is more freight, you add warehouse workers — not more people at the tills.

Recommended operating rules
  • do not mix default into heavy ECS workers
  • do not scale multiple queues at once without observation
  • do not remove host rollback before ECS health is proven
  • do not rely on queue depth alone when deciding health
  • treat networking and runtime correctness as first-class release criteria
Outcome

This strategy gives AME by Zentra a background execution model that is safer, cheaper when idle, and much easier to scale and roll back intentionally.

stable web selective scale clean rollback