Data engineering
Scalable lakehouse and warehouse patterns, Spark and distributed batch, plus operational rigor for production datasets.
Data & ML infrastructure
Data engineer building reliable data platforms, ML pipelines, and real-time analytics—where scale meets clarity.
Open to conversations on data platforms, ML in production, and pragmatic system design.
I design and build data infrastructure that stays understandable under load: clear contracts, observable pipelines, and systems that teams can evolve.
I care about the gap between a prototype notebook and something that runs every day in production—scheduling, failure modes, cost, and the human side of operating platforms.
If you’re working on similar problems, the notes below are where I document what actually worked (and what didn’t).
Hands-on work across the stack—from streaming and batch to model serving and observability.
Training-to-serving paths, batch and online inference, and APIs that stay maintainable as models and traffic evolve.
Kubernetes operators, data workloads on K8s, and automation so teams can ship data products without fragile one-off scripts.
Streaming ingestion, low-latency paths, and tooling so stakeholders see fresh signal—not yesterday’s snapshot.
Small browser experiments—good for a short break between deep work.
Selected projects and long-form notes from the blog.
If my writing or tools helped you, TRC20 donations are welcome.