Responsibilities
Key responsibilities on this engagement
• Run the Sprint 1 architecture review of the existing UAT codebase (S3 + Glue + S3 Tables + OpenSearch + Athena) and deliver written gap findings.
• Design the metadata schema, taxonomy, and field catalogue (Light, Brain, Power).
• Tune data orchestration: Glue jobs, Athena queries, S3 Tables config, scheduling.
• Lead the deep-dive technical sessions with analysts on visualization requirements.
• Build and validate the simulation data onboarding pipeline against real data, including the 30 GB-per-run acoustic spectra dataset.
• Configure and validate the OpenSearch k-NN vector store and the Bedrock embedding pipeline.
• Author the AI/ML data export format specification and the AI onboarding pattern document.
• Co-design the API middleware blueprint with the Cloud Infrastructure Architect.
Qualifications
Must-have | Nice-to-have / differentiators |
Principal-level hands-on data engineering on AWS, 7+ years | Prior simulation / CAE / HPC data lake experience (Ansys, Siemens NX, BETA CAE, OpenFOAM, etc.) |
Deep production experience with S3, S3 Tables, Glue, Athena, and OpenSearch (including k-NN / vector search) | Familiarity with surrogate model training data pipelines |
Built and shipped vector embedding workloads | Experience with SageMaker Unified Studio or comparable governed data-mesh tooling (should integration be required) |
Strong metadata modelling and data taxonomy design experience for scientific or engineering domains | Multi-cloud data engineering (AWS and GCP) experience |
Comfort working with Parquet, JSON-LD, and large binary scientific data formats (mesh, time-series, spectra) | Published or contributed to AWS data architecture patterns or blueprints |
Python proficiency; PySpark / Glue job tuning experience | |