Zentra
AME by Zentra / Fargate queue strategy

Stable web tier.
Independent queue scale.
Rollback-ready workers.

A professional Fargate queue strategy for AME by Zentra, built from the supplied overview, architecture, scaling, runbook, and troubleshooting notes, and styled to match the release strategy example.

isolated load queue-specific scale burst/drain ready conservative rollout rollback available
Primary heavy queues
sync: feed imports and sync finalisation
image: image fetch, create, backfill, and media bursts
bulk: cache warmers, rebuilds, and non-urgent fan-out

Goal: move heavy Laravel queue work off the main app hosts and onto dedicated ECS Fargate workers. Keep default separate so user-facing background work never gets mixed into heavy draining.

Laravel → Redis → ECS scale one queue at a time keep host rollback
Executive summary

Why change the queue model?

The supplied strategy separates heavy queue load from the web tier so queue pressure can grow or shrink without destabilising production app hosts.

01

Protect the web tier

Image and feed jobs stop competing with PHP-FPM, Horizon, and request/response traffic on the main fleet.

02

Scale only the bottleneck

Raise image without widening sync or bulk. Pay for extra capacity only where backlog proves it.

03

Keep rollback simple

ECS services can be scaled down and host consumers kept available during rollout until full ownership is proven.

Heavy work becomes
isolated observable service-owned queue-specific reversible
The problem we are solving

Shared hosts create avoidable queue risk.

When heavy queue work runs next to request/response traffic, tuning decisions and failures spill across unrelated parts of production.

What happens today
  • image processing competes with PHP-FPM and Horizon
  • long-running feed jobs make worker tuning harder
  • scaling queue throughput means touching production app servers
Operational downside
  • temporary worker hosts create drift and repeated manual setup
  • failures in one queue can destabilise unrelated work
  • mixed workloads blur what is actually causing pressure
Why it matters
  • web stability becomes hostage to background bursts
  • incident response depends too much on host-by-host behaviour
  • capacity changes become slower and riskier than they need to be

The fix is not “more shared workers”. The fix is a separate runtime layer for heavy queues.

The operating model

Dispatch normally. Drain heavily. Scale per queue.

The platform keeps normal Laravel queue semantics, but moves heavy execution onto ECS services that can scale by workload class.

Five-layer layout
  1. Laravel app dispatches jobs to the correct queue
  2. Redis stores the backlog and reserved / delayed state
  3. ECS services run the Laravel workers
  4. Aurora and S3 provide durable data dependencies
  5. Lambda and app-side wake-up logic control burst/drain scaling
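The burst/drain control in layer 5 reduces to a simple rule: run workers only while outstanding work exists, and drain to zero when it does not. The function below is an illustrative sketch, not the actual Lambda or wake-up code from this strategy; the ceiling values mirror the burst/drain ranges given later in the scaling section.

```shell
#!/bin/sh
# Burst/drain decision sketch (illustrative only, not the real
# Lambda/app wake-up logic). outstanding = ready + reserved + delayed.
desired_count() {
  outstanding="$1"   # total outstanding jobs for one queue
  max="$2"           # per-queue ceiling, e.g. 2 for image
  if [ "$outstanding" -gt 0 ]; then
    echo "$max"      # burst: run workers while work exists
  else
    echo 0           # drain complete: safe to scale to zero
  fi
}
```

With a backlog of 5 and a ceiling of 2, `desired_count 5 2` bursts to 2; with an empty queue, `desired_count 0 2` drains to 0.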
Runtime shape
  • each heavy queue gets separate desired counts, logs, and restart behaviour
  • task definitions differ by memory, timeout, queue name, retries, and max jobs
  • SSM provides DB, Redis, app key, and S3 runtime values instead of baking secrets into images
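Pulling runtime values from SSM at startup keeps secrets out of the image. A minimal sketch, assuming a `/propertyfeedtemplate/production/<KEY>` parameter layout — the real path scheme is not specified in this document:

```shell
#!/bin/sh
# Build the SSM parameter path for one runtime value.
# The path layout here is an assumption for illustration.
ssm_path() {
  printf '/propertyfeedtemplate/production/%s' "$1"
}

# Hypothetical fetch (requires AWS credentials and the real path layout):
# aws ssm get-parameter --with-decryption \
#   --name "$(ssm_path DB_HOST)" \
#   --query 'Parameter.Value' --output text
```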
sync

Owns heavy feed imports, sync fan-out, and manual property save finalisation. Long-running sync work stays clear of image draining.

image

Owns image fetch, create, backfill, and media processing. It is usually the main backlog driver and the first scaling candidate.

bulk

Owns cache warmers, cache rebuilds, and non-urgent heavy fan-out. It stays low priority and is cheap to run in burst/drain mode.

Important rule
default stays out of the heavy-drain path Redis remains the single queue truth
Service layout and dependencies

Each heavy queue gets its own service and clear runtime dependencies.

Splitting ownership by queue means each workload can be observed, sized, and rolled back separately instead of treating the whole worker fleet as one pool.

Production service map
cluster: propertyfeedtemplate-production
  • propertyfeedtemplate-sync
  • propertyfeedtemplate-image
  • propertyfeedtemplate-bulk
Network path must exist to
  • Redis on 6379
  • Aurora writer on 3306
  • S3, SSM, ECR, and CloudWatch Logs
  • the correct subnet and ENI path for workers and Lambda where used
Queue truth stays in Redis
  • queues:sync
  • queues:image
  • queues:bulk
  • plus reserved and delayed variants for each queue
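Because Redis is the single queue truth, backlog can be read directly: with Laravel's Redis driver the ready queue is a list and the reserved/delayed variants are sorted sets. A small helper sums the three counts into one outstanding-work figure (key names assume the default `queues:` prefix shown above):

```shell
#!/bin/sh
# Outstanding work for one queue = ready + reserved + delayed.
# The three counts would come from redis-cli, e.g. for image:
#   redis-cli llen  queues:image            # ready (list)
#   redis-cli zcard queues:image:reserved   # reserved (sorted set)
#   redis-cli zcard queues:image:delayed    # delayed (sorted set)
outstanding() {
  echo $(( $1 + $2 + $3 ))
}
```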
What separation buys you
separate desired counts separate logs separate restart behaviour queue-specific scaling
Rollout and cutover

Start low. Prove health. Keep host rollback available.

The strategy is intentionally conservative: prove correctness first, then widen only the queue that genuinely needs more throughput.

1

Baseline

sync=1
image=1
bulk=1

2

Prove basics

Confirm the container image builds, and that DB, Redis, S3, and queue routing all work.

3

Keep rollback

Do not disable host consumers until ECS is healthy.

4

Change one queue

Increase only one service at a time.

5

Prefer image first

Raise image from 1 to 2 first.

6

Watch soak

Observe logs and Sentry after steady state.

7

Expand carefully

Move the next queue only after the first is proven.
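Steps 4 to 6 above pair naturally: change one service's desired count, then block until it reaches steady state before judging the soak. The helper below is a dry-run sketch that prints the commands rather than executing them:

```shell
#!/bin/sh
# One-queue-at-a-time scale-up (dry-run: commands are printed, not run).
scale_one() {
  service="$1"; count="$2"
  echo "aws ecs update-service --cluster propertyfeedtemplate-production --service $service --desired-count $count"
  echo "aws ecs wait services-stable --cluster propertyfeedtemplate-production --services $service"
}

scale_one propertyfeedtemplate-image 2
```

`aws ecs wait services-stable` returns once running count matches desired count and deployments settle, which marks the start of the soak window.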

Cutover rules
  • keep host rollback available during initial rollout
  • move one queue class at a time if risk is unclear
  • after ownership is proven, host consumers for sync, image, and bulk should not compete with ECS
Rollback rules
  • scale the affected ECS service down
  • re-enable the host consumer if needed
  • keep queue routing explicit and reversible
  • do not delete ECS resources until diagnosis is complete
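Rollback therefore reduces to one ECS call plus re-enabling the host consumer. The helper below only prints the scale-down command (a dry-run sketch), and the supervisor program name in the comment is a hypothetical example:

```shell
#!/bin/sh
# Dry-run: print the rollback scale-down command instead of executing it.
scale_down_cmd() {
  printf 'aws ecs update-service --cluster propertyfeedtemplate-production --service %s --desired-count 0' "$1"
}

# After scaling down, re-enable the host consumer if needed, for example
# (supervisor program name is hypothetical):
#   sudo supervisorctl start laravel-worker-image:*
```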
Scale is the last step, not the first.
Scaling and cost model

Scale only the bottleneck queue.

More workers are not automatically better. Extra capacity helps only after routing, connectivity, and job behaviour are already correct.

image
  • usual backlog driver
  • preferred first scale-up target
  • burst/drain range: 0..2
  • move above 2 only after stability is proven
sync
  • hold at 1 in the safe baseline
  • widen only after image is stable
  • burst/drain range: 0..1
  • protect sync from image contention
bulk
  • deliberately low priority
  • burst/drain range: 0..1
  • good candidate for cheap selective scaling
  • do not widen before more urgent queues are healthy
Cost control principles
  • keep minimum counts low
  • scale only the queue that needs it
  • use burst/drain for non-constant work
  • avoid idle EC2 worker hosts and large always-on fleets without proof
Main cost levers
  • number of Fargate tasks
  • CPU and memory size per task
  • CloudWatch log volume
  • transfer and S3 side effects
  • Lambda controller cost is negligible compared with task runtime
Operations runbook

Operate with health checks, not queue depth alone.

Healthy operations require watching service state, connectivity, and error behaviour together rather than treating backlog as the only signal.

Normal daily checks
  • ECS desired, running, and pending counts
  • CloudWatch worker logs
  • queue backlog trend
  • Sentry worker errors
  • Aurora and Redis connectivity health
Healthy state
  • service reaches steady state
  • no restart storms
  • jobs do not fail immediately in loops
  • backlog trends down under normal ingest
  • no fresh Sentry bursts for connectivity
Safe change rules
  • increase only one queue
  • wait for steady state
  • watch a short soak window
  • keep scale-down reversible
  • drop to 0 only when ready, reserved, and delayed work is empty
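The last rule above can be encoded as a guard. A sketch, assuming the ready, reserved, and delayed counts have already been read from Redis:

```shell
#!/bin/sh
# Returns success (0) only when a queue has no ready, reserved,
# or delayed work left -- the only state where desired-count 0 is safe.
safe_to_stop() {
  ready="$1"; reserved="$2"; delayed="$3"
  [ $(( ready + reserved + delayed )) -eq 0 ]
}
```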
Production commands
aws ecs describe-services \
  --cluster propertyfeedtemplate-production \
  --services propertyfeedtemplate-sync propertyfeedtemplate-image propertyfeedtemplate-bulk

aws ecs update-service \
  --cluster propertyfeedtemplate-production \
  --service propertyfeedtemplate-image \
  --desired-count 2
Troubleshooting guide

Fix connectivity and runtime correctness before adding workers.

Most queue incidents are not solved by wider scale. First prove that each worker can start, reach dependencies, and consume the correct backlog.

Jobs fail immediately

Most likely: bad DB host, missing Aurora or Redis ingress, or stale SSM runtime values.

No steady state

Most likely: bad container image, missing Laravel runtime directories, or bad bootstrap at startup.

Queue not draining

Workers may not be consuming, jobs may be failing and requeueing, or the queue may point at the wrong Redis.

Timeout symptoms

A DB timeout can mean DNS resolves correctly but the network path is still blocked. A Redis timeout means the broker path is not healthy.

First checks
  • SSM runtime values
  • ECS worker security group
  • Aurora ingress on 3306
  • Redis ingress on 6379
  • CloudWatch startup logs and task stop reason
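The CloudWatch and task-stop checks above map to a short triage sequence. A dry-run sketch that prints the commands (the `/ecs/<service>` log group name and `<task-arn>` placeholder are assumptions, not values from this document):

```shell
#!/bin/sh
# First-check triage sketch: print the diagnosis commands, do not run them.
CLUSTER=propertyfeedtemplate-production
SERVICE=propertyfeedtemplate-image

triage() {
  printf '%s\n' \
    "aws ecs list-tasks --cluster $CLUSTER --service-name $SERVICE --desired-status STOPPED" \
    "aws ecs describe-tasks --cluster $CLUSTER --tasks <task-arn> --query 'tasks[].stoppedReason'" \
    "aws logs tail /ecs/$SERVICE --since 15m"
}

triage
```

The `stoppedReason` field usually distinguishes bad bootstrap (container exits) from networking (pull or dependency timeouts) faster than log trawling.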
Incident priorities
  • Redis reachability
  • Aurora reachability
  • S3 access
  • worker startup logs
  • queue routing correctness
  • deployment drift between the production .env and SSM

Operator rule of thumb: do not scale a failing queue wider first.

Risk controls and monitoring

Keep the platform conservative and observable.

The safest posture is explicit routing, clear networking, and monitoring that combines backlog with restart and dependency signals.

Guardrails
  • start low
  • keep default separate
  • scale one queue at a time
  • keep rollback simple
  • do not widen broken workers first
Monitor these together
  • queue depth
  • ECS restarts
  • Sentry worker errors
  • Redis timeout symptoms
  • Aurora timeout symptoms
Production lessons already learned
  • grant Redis and Aurora access from the current ECS worker SG, not only old worker IPs
  • stale DB hostnames can survive in both the production .env and SSM
  • historical Sentry issues can remain open after the active failure mode changes
Practical safety summary
explicit queue routing SSM-managed runtime values network access verified queue depth is not enough
Benefits and end state

Why this model is better over the long term.

The end state is a background execution layer that is safer for the web tier, easier to scale, and cheaper to keep mostly idle.

Operational benefits
  • better isolation between web and queue load
  • safer production app hosts
  • cleaner AWS visibility for queue consumption
  • simpler rollback
Financial benefits
  • small baseline footprint
  • burst/drain support during cheap idle periods
  • no need for extra always-on EC2 worker hosts
  • pay for more worker capacity only where backlog justifies it
Engineering benefits
  • queue-specific tuning and scaling
  • clearer bottleneck detection
  • less ad hoc worker server setup
  • normal Laravel worker semantics for long-running jobs
Corporate-style end state
stable web tier queue-specific scaling burst/drain ready clearer diagnosis rollback-ready
Plain-English explanation

Think of Fargate as the separate warehouse crew for heavy jobs.

The website keeps serving customers at the front counter while a dedicated crew in the back handles the pallets, images, and bulk work. When there is more freight, you add warehouse workers — not more people at the tills.

Recommended operating rules
  • do not mix default into heavy ECS workers
  • do not scale multiple queues at once without observation
  • do not remove host rollback before ECS health is proven
  • do not rely on queue depth alone when deciding health
  • treat networking and runtime correctness as first-class release criteria
Outcome

This strategy gives AME by Zentra a background execution model that is safer, cheaper when idle, and much easier to scale and roll back intentionally.

stable web selective scale clean rollback