At a Glance
- Tasks: Design and maintain Django services for ML inference workflows with high reliability.
- Company: Graswald AI, a leader in AI-driven content creation for fashion brands.
- Benefits: Remote work, competitive salary, and opportunities for professional growth.
- Other info: Collaborative culture focused on quality and innovation in a fast-scaling environment.
- Why this job: Join a dynamic team tackling real-world reliability challenges in AI infrastructure.
- Qualifications: Strong Python backend experience and familiarity with Django in production.
The predicted salary is between €36,000 and €60,000 per year.
Location: Remote (CET)
Type: Full-time
About Us
Graswald AI is transforming how the world’s most iconic brands create content using AI. Backed by leading investors and powered by a world-class team, we’re redefining the fashion content process: no physical studios, samples, or logistics, just cutting-edge AI and automation. We build AI systems that power large-scale content generation for global brands. Our core production application is a Django-based backend that coordinates high-throughput ML inference across many internal systems and external providers. As usage grows, reliability, orchestration, and operational correctness are critical to the business. This role exists to ensure those systems remain dependable, observable, and scalable.
The Role
This is a backend software engineering role with end-to-end reliability ownership. You will design, build, and operate a Django production backend that orchestrates ML inference workflows across internal services and third-party APIs. The core challenge is high-throughput orchestration: asynchronous execution, retries, idempotency, backpressure, failure handling, and system-level observability. Infrastructure and Terraform are supporting tools. The primary output of this role is reliable production software. You will work closely with ML engineers and backend teams to turn research systems into robust, production-grade services.
What You’ll Do
- Design, build, and maintain Django services that coordinate and serve ML inference workflows.
- Own high-throughput asynchronous execution using queues, workers, and schedulers.
- Design safe orchestration patterns: idempotency, deduplication, retries, rate limiting, and backpressure.
- Build and operate systems with clear SLOs, error budgets, and on-call ownership.
- Lead incident response, write postmortems, and drive long-term reliability improvements.
- Implement end-to-end observability: metrics, logs, traces, dashboards, alerts, and runbooks.
- Improve reliability of service integrations using timeouts, circuit breakers, fallbacks, and dependency health modeling.
- Collaborate with ML engineers to productionize training and inference pipelines.
- Own CI/CD and deployment workflows for backend and ML-facing services.
- Use Infrastructure as Code (Terraform) to support reliability, scalability, and repeatability.
- Optimize performance and cost across compute, storage, databases, and external dependencies.
What We’re Looking For
Required
- Strong background as a Python backend engineer with ownership of production systems.
- Hands-on experience running Django in production (ORM usage, migrations, performance tuning, request lifecycle).
- Experience integrating with multiple internal and external services in reliability-critical paths.
- Proven experience building and operating asynchronous job systems (e.g., Celery, RQ, Arq, or equivalents).
- Hands-on experience with workflow or orchestration systems (Temporal, Prefect, Airflow, Step Functions).
- Solid understanding of distributed systems reliability: timeouts, retries, idempotency, rate limiting, backpressure, and failure isolation.
- Experience defining and operating SLOs/SLAs, including alerting and on-call participation.
- Strong Linux, networking, and debugging fundamentals.
- Working knowledge of cloud platforms (AWS and/or GCP).
- Practical experience using Infrastructure as Code (Terraform) as part of a broader system.
Nice to Have
- Experience operating ML inference or training infrastructure at scale.
- Familiarity with MLOps tooling (SageMaker, Vertex AI, Kubeflow, MLflow, Argo Workflows).
- Experience with distributed tracing and observability stacks (OpenTelemetry, Prometheus, Grafana, ELK/Loki).
- Experience operating Postgres and caches (e.g., Redis) in high-throughput systems.
- Startup or greenfield system ownership experience.
Role Boundaries
This role is not primarily focused on:
- Writing Terraform modules or managing clusters full-time.
- Ticket-driven infrastructure support.
- Platform enablement without production ownership.
Success is measured by reliable backend orchestration, production stability, and system-level outcomes.
Why This Role
- High ownership over core production systems that power ML inference.
- Real reliability and scale problems, not maintenance work.
- Close collaboration with backend and ML engineers.
- Opportunity to define reliability standards as the platform scales.
If you’ve owned Django services in production, built high-throughput async systems, and care deeply about reliability, this role should feel familiar.
Why Join Us
- Impact: Build and own the core infrastructure that powers AI experiences for global brands.
- Scale & Performance: Tackle challenging reliability and performance problems across training and inference.
- Autonomy: High ownership to define standards, tooling, and best practices for reliability.
- Growth: Work with a high-caliber team in a fast-scaling environment with significant career upside.
- Culture: Pragmatic, collaborative, and quality-focused engineering culture.
About us
At Graswald AI, we are building the AI operating system for fashion brands and retailers, to drive efficiency, flexibility and profitability. Today we specialise in generating eCommerce and campaign imagery and video. In just the past year, we’ve brought on 50 enterprise fashion brands, helping them reduce costs, accelerate timelines, and maintain the highest standards of visual quality. Backed by leading VCs and strategic investors - including Lakestar, Orendt Studios, and prominent angels - we are building the full software stack and Operating System for enterprise fashion brands, enabling brands to create, scale, and connect with their customers like never before.
Senior Backend Engineer, ML Infrastructure & Reliability employer: Graswald GmbH
Graswald AI is an exceptional employer, offering a dynamic remote work environment that fosters innovation and collaboration among a world-class team. With a strong focus on employee growth, you will have the opportunity to tackle real reliability challenges while defining best practices in a fast-scaling company. Our culture prioritises quality and pragmatism, ensuring that you can make a meaningful impact in the AI-driven fashion industry.
StudySmarter Expert Advice🤫
We think this is how you could land the Senior Backend Engineer, ML Infrastructure & Reliability role
✨Tip Number 1
Network like a pro! Reach out to folks in your industry on LinkedIn or at meetups. A personal connection can often get you a foot in the door faster than any application.
✨Tip Number 2
Show off your skills! If you’ve got a portfolio or GitHub with projects that highlight your backend engineering prowess, make sure to share it during interviews. It’s a great way to demonstrate what you can bring to the table.
✨Tip Number 3
Prepare for technical interviews by brushing up on your Django and Python skills. Practice coding challenges and system design questions that are relevant to the role. We want to see how you think and solve problems!
✨Tip Number 4
Don’t forget to apply through our website! It’s the best way to ensure your application gets seen by the right people. Plus, we love seeing candidates who take the initiative to connect directly with us.
Some tips for your application 🫡
Show Off Your Django Skills: Make sure to highlight your experience with Django in your application. We want to see how you've used it in production, so share specific examples of projects where you’ve built or maintained Django services.
Talk About Reliability: Since reliability is key for us, don’t shy away from discussing your experience with high-throughput systems and orchestration patterns. Let us know how you've tackled challenges like retries, idempotency, and error handling in your past roles.
Be Clear and Concise: When writing your application, keep it clear and to the point. We appreciate straightforward communication, so avoid fluff and focus on what makes you a great fit for the role.
Apply Through Our Website: We encourage you to apply directly through our website. It’s the best way for us to receive your application and ensures you’re considered for this exciting opportunity!
How to prepare for a job interview at Graswald GmbH
✨Know Your Django Inside Out
Make sure you’re well-versed in Django, especially its ORM, migrations, and performance tuning. Brush up on how to handle request lifecycles and be ready to discuss your hands-on experience running Django in production.
✨Showcase Your Asynchronous Skills
Be prepared to talk about your experience with asynchronous job systems like Celery or RQ. Highlight specific projects where you’ve implemented high-throughput orchestration and how you tackled challenges like retries and idempotency.
✨Demonstrate Reliability Knowledge
Familiarise yourself with concepts of distributed systems reliability, such as timeouts, rate limiting, and failure isolation. Be ready to share examples of how you've defined and operated SLOs/SLAs in previous roles.
✨Get Comfortable with Infrastructure as Code
Terraform is a supporting tool in this role rather than its main focus, so be ready to discuss your practical experience using Infrastructure as Code as part of a broader system. Prepare to explain how you’ve used it to support reliability, scalability, and repeatability in past projects.