Sr MLOps Engineer

People Machine
Toronto, ON
Posted yesterday
Job Details:
Full-time
Experienced

Project Goals and Objectives:

The primary goal of this engagement is to accelerate the adoption of MLOps best practices and establish automated ML pipelines on GCP, enabling efficient and reliable model development, deployment, and monitoring.

Key objectives include:

  • Achieve MLOps Level 1: Implement automated ML pipelines for continuous integration, continuous delivery, and continuous training of ML models on GCP, with a forward-looking approach that enables MLOps Level 2. Implementing Level 2 must be a smooth evolution from Level 1.
  • Establish a Reusable MLOps Framework: Develop a modular and reusable framework for building and deploying ML models, adhering to industry best practices.
  • Implement Performance Monitoring: Establish a system for monitoring model performance in production. Specific metrics will be determined based on the ML project.
  • Evaluate Feature Store Needs: Assess the need for a feature store based on the requirements of the selected ML project. Implementation will only proceed if deemed necessary and beneficial.
  • Documentation: Create comprehensive documentation for all implemented MLOps components and processes.

Scope of Work:

  • Assess the current state of ML infrastructure and identify areas for improvement.
  • Develop a detailed plan for achieving MLOps Level 1, including specific tasks, timelines, and resource requirements. This plan should be flexible enough to accommodate the evolving needs of the ML project and be in line with the later evolution from Level 1 to Level 2.
  • Contribute to our MLOps tooling by implementing MLOps best practices and automating the ML pipeline. This may include:
      • Developing CI/CD pipelines for model training and deployment.
      • Implementing model versioning and management using Vertex AI Model Registry.
      • Setting up performance monitoring and alerting systems.
      • Implementing a feature store solution (if applicable and deemed necessary).
      • Automating data validation and preprocessing steps.
      • Building an A/B testing framework: managing multiple versions of a model and tracking user calls to action tied to model versions.
  • Ensure all implemented components are modular, reusable, and adhere to industry best practices.

Deliverables:

  • MLOps Implementation Plan: A detailed plan for achieving MLOps Level 1.
  • Automated ML Pipelines: Fully functional and automated ML pipelines for the selected ML project.
  • Performance Monitoring System: A system for monitoring model performance in production.
  • Feature Store Evaluation Report (if applicable): A report outlining the evaluation of feature store needs and recommendations. If implemented, a fully functional feature store solution.
  • Comprehensive Documentation: Detailed documentation for all implemented MLOps components and processes.

Contractor Requirements:

  • Minimum of 4 years of experience in MLOps, with a proven track record of implementing automated ML pipelines on GCP.
  • Minimum of a Bachelor's degree in Computer Science or a related field.
  • Strong understanding of MLOps principles and best practices.
  • Experience with continuous / incremental learning.
  • Experience with CI/CD tools and techniques.
  • Experience with performance monitoring and alerting systems.
  • Experience with GPU-based inference acceleration (e.g. CUDA, TensorRT).
  • Strong programming skills in Python and other relevant languages.
  • Proficiency in GCP services, including Vertex AI, Cloud Storage, BigQuery, and other relevant services.
  • Experience with ML frameworks (e.g. TensorFlow, PyTorch, HuggingFace, ONNX).
  • Exposure to ML architectures (e.g. Recommendation Systems, Similarity Learning, LLMs).

Nice To Have:

  • Experience with infrastructure automation (e.g. Terraform, Docker, Kubernetes).
  • Experience with distributed processing (e.g. Ray, Dask, asyncio).
  • Experience with feature store solutions (e.g. Feast, Tecton).
  • Experience with streaming systems (e.g. RabbitMQ, Pub/Sub, Kafka).
  • Knowledge of orchestration tools (e.g. Airflow, Prefect, Dagster).
