42 Site Reliability Engineer jobs in the United Arab Emirates

Site Reliability Engineer

Dubai, Dubai Flex Dental

Posted today

Job Description

At Flex Dental, we go beyond checking boxes; our integration and automation are unparalleled. Every feature serves a purpose, creating seamless collaboration with Open Dental’s practice management system. Our commitment to meaningful functionalities and innovative automation transforms workflows, ensuring efficiency and pushing the boundaries of Open Dental practice management.

Flex Dental is focused on simplifying the lives of dentists and their staff. We're a growing company specializing in a specific area of the dental industry and work exclusively with Open Dental to create a comprehensive solution. By integrating with Open Dental, we aim to deliver innovative tools and services that streamline dental practice management. In short, we're developing cutting-edge solutions for dentists and fostering a great workplace culture for our team.

Responsibilities
  • Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
  • Proactively monitor application health and performance across cloud infrastructure (AWS).
  • Troubleshoot and prevent service interruptions in real time, working closely with development teams to resolve incidents efficiently.
  • Lead and participate in disaster recovery drills and security incident simulations.
  • Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.
  • Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).
  • Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
  • Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.
  • Champion best practices in security, availability, performance, and incident response.
Required Technologies & Tools
  • Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS) with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.
  • Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.
  • Containerization: Experience with Docker for container-based deployment pipelines.
  • Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.
  • Backend Stack: Understanding of NestJS and scalable Node-based services.
  • Databases: Proficient in MySQL and performance monitoring of relational databases.
  • Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.
Core Competencies
  • Incident Response: Calm and focused under pressure with a structured approach to resolving outages and degradation.
  • System Design: Ability to contribute to and review architectural designs for scalability and resiliency.
  • Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.
  • Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.
  • Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.
Qualifications
  • 3+ years of experience in a Site Reliability, DevOps, or related engineering role.
  • Proven track record managing and scaling applications in a production AWS environment.
  • Familiarity with full-stack environments, particularly those using Node.js.
  • Experience maintaining and deploying databases such as MySQL with performance tuning.
  • Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
  • Commitment to uptime, performance, and security in fast-moving SaaS environments.

Site Reliability Engineer

Abu Dhabi, Abu Dhabi Etihad Airways

Posted today

Job Description

The Site Reliability Engineer (SRE) will lead an SRE squad focused on enhancing service reliability, performance, and scalability. They will drive automation to reduce toil, optimize system uptime, and manage incident resolution efforts. Responsible for building monitoring systems, optimizing infrastructure, and implementing safe deployment practices, the SRE will also ensure alignment with SLAs/SLOs and contribute to system development and code reviews. The role requires expertise in large-scale distributed systems, cloud infrastructure, and IT governance, with a focus on continuous service improvement and operational excellence.

Accountabilities
  • Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent team in senior management briefings; produce dashboards and progress reports.
  • Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
  • Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
  • Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
  • Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
  • Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
  • Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
  • Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
  • Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
  • Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.
Education & Experience
  • 7+ years of experience with data structures/algorithms and software development in two or more programming languages, and in operating and maintaining platforms, with 3+ years of experience in a DevOps or SRE role.
  • Experience working in computing, distributed systems, storage, or networking.
  • Expertise in designing, analysing, and troubleshooting large-scale distributed systems.
  • Ability to debug and optimize code and to automate routine tasks.
  • Systematic problem-solving approach, coupled with effective verbal and written communication skills.
  • Strong communication capability, able to articulate technical issues in terms of business risk and opportunity.
  • Knowledge of the technical aspects of cloud computing, data centres, networks and virtual infrastructure.
  • Strong analytical and problem-solving skills, along with knowledge of TSM processes and tools.

Etihad Airways, the national airline of the UAE, was formed in 2003 and quickly went on to become one of the world’s leading airlines. From its home in Abu Dhabi, Etihad flies to passenger and cargo destinations in the Middle East, Africa, Europe, Asia, Australia and North America. Together with Etihad’s codeshare partners, Etihad’s network offers access to hundreds of international destinations. In recent years, Etihad has received numerous awards for its superior service and products, cargo offering, loyalty programme and more. All this ties into Etihad’s ambitious Journey 2030 strategy. The airline plans to double its fleet size and triple the number of customers over the next six years as it sets out to be the airline everyone wants to fly!

Beware of fraudulent job offers from individuals or organizations claiming to represent the Etihad group. We will never ask for personal information, bank details, or payment during the recruitment process. Interviews are conducted face-to-face or via video/telephone before any formal offer. If you are asked for money, please treat it as fraudulent.

Site Reliability Engineer

Dubai, Dubai Flex Dental

Posted 6 days ago

Job Description

At Flex Dental, we go beyond checking boxes; our integration and automation are unparalleled. Every feature serves a purpose, creating seamless collaboration with Open Dental’s practice management system. Our commitment to meaningful functionalities and innovative automation transforms workflows, ensuring efficiency and pushing the boundaries of Open Dental practice management.

Flex Dental is focused on simplifying the lives of dentists and their staff. We're a growing company specializing in a specific area of the dental industry and work exclusively with Open Dental to create a comprehensive solution. By integrating with Open Dental, we aim to deliver innovative tools and services that streamline dental practice management. In short, we're developing cutting-edge solutions for dentists and fostering a great workplace culture for our team.

Responsibilities
  • Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
  • Proactively monitor application health and performance across cloud infrastructure (AWS).
  • Troubleshoot and prevent service interruptions in real time, working closely with development teams to resolve incidents efficiently.
  • Lead and participate in disaster recovery drills and security incident simulations.
  • Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.
  • Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).
  • Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
  • Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.
  • Champion best practices in security, availability, performance, and incident response.
Required Technologies & Tools
  • Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS) with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.
  • Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.
  • Containerization: Experience with Docker for container-based deployment pipelines.
  • Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.
  • Backend Stack: Understanding of NestJS and scalable Node-based services.
  • Databases: Proficient in MySQL and performance monitoring of relational databases.
  • Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.
Core Competencies
  • Incident Response: Calm and focused under pressure with a structured approach to resolving outages and degradation.
  • System Design: Ability to contribute to and review architectural designs for scalability and resiliency.
  • Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.
  • Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.
  • Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.
Qualifications
  • 3+ years of experience in a Site Reliability, DevOps, or related engineering role.
  • Proven track record managing and scaling applications in a production AWS environment.
  • Familiarity with full-stack environments, particularly those using Node.js.
  • Experience maintaining and deploying databases such as MySQL with performance tuning.
  • Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
  • Commitment to uptime, performance, and security in fast-moving SaaS environments.


Site Reliability Engineer

Abu Dhabi, Abu Dhabi Etihad Airways

Posted 6 days ago

Job Description

The Site Reliability Engineer (SRE) will lead an SRE squad focused on enhancing service reliability, performance, and scalability. They will drive automation to reduce toil, optimize system uptime, and manage incident resolution efforts. Responsible for building monitoring systems, optimizing infrastructure, and implementing safe deployment practices, the SRE will also ensure alignment with SLAs/SLOs and contribute to system development and code reviews. The role requires expertise in large-scale distributed systems, cloud infrastructure, and IT governance, with a focus on continuous service improvement and operational excellence.

Accountabilities
  • Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent team in senior management briefings; produce dashboards and progress reports.
  • Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
  • Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
  • Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
  • Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
  • Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
  • Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
  • Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
  • Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
  • Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.
Education & Experience
  • 7+ years of experience with data structures/algorithms and software development in two or more programming languages, and in operating and maintaining platforms, with 3+ years of experience in a DevOps or SRE role.
  • Experience working in computing, distributed systems, storage, or networking.
  • Expertise in designing, analysing, and troubleshooting large-scale distributed systems.
  • Ability to debug and optimize code and to automate routine tasks.
  • Systematic problem-solving approach, coupled with effective verbal and written communication skills.
  • Strong communication capability, able to articulate technical issues in terms of business risk and opportunity.
  • Knowledge of the technical aspects of cloud computing, data centres, networks and virtual infrastructure.
  • Strong analytical and problem-solving skills, along with knowledge of TSM processes and tools.

Etihad Airways, the national airline of the UAE, was formed in 2003 and quickly went on to become one of the world’s leading airlines. From its home in Abu Dhabi, Etihad flies to passenger and cargo destinations in the Middle East, Africa, Europe, Asia, Australia and North America. Together with Etihad’s codeshare partners, Etihad’s network offers access to hundreds of international destinations. In recent years, Etihad has received numerous awards for its superior service and products, cargo offering, loyalty programme and more. All this ties into Etihad’s ambitious Journey 2030 strategy. The airline plans to double its fleet size and triple the number of customers over the next six years as it sets out to be the airline everyone wants to fly!

Beware of fraudulent job offers from individuals or organizations claiming to represent the Etihad group. We will never ask for personal information, bank details, or payment during the recruitment process. Interviews are conducted face-to-face or via video/telephone before any formal offer. If you are asked for money, please treat it as fraudulent.

Senior Site Reliability Engineer

Abu Dhabi, Abu Dhabi Cerebras

Posted today

Job Description

About the Role:

Orbitworks is revolutionizing access to space by building reliable, shareable satellites that drastically reduce the time and complexity traditionally required to get to orbit. We operate satellites, fly customer payloads, and handle entire missions end to end. Orbitworks is a joint venture between Marlan Space (UAE-based) and Loft Orbital.

As a Senior Site Reliability Engineer on our Infrastructure Team, you’ll play a pivotal role in maintaining and scaling our ground segment infrastructure. You’ll collaborate across development, operations, and IT to ensure the integration, delivery, and reliability of services that support our test infrastructure and our space operations on Earth and in orbit.

This is an exciting opportunity to work on cutting-edge technology and help build modern automated space infrastructure. This is not your typical SRE role: we apply DevOps principles even to spacecraft control.

Responsibilities
  • Collaborate with developers, test engineers and satellite operators to foster a strong SatDevOps culture.
  • Design and roll out cloud solutions for our testing and operations infrastructure. Find the best trade-offs between existing and additional cloud resources to scale and help Orbitworks achieve its mission.
  • Design, implement, and maintain scalable, reliable, and secure infrastructure in a hybrid cloud environment.
  • Improve our developers' and test engineers' experience by building better tools, workflows, and environment to streamline
  • Lead efforts to automate and optimize systems, including CI/CD pipelines, infrastructure provisioning (IaC), and deployment workflows for test on the ground and operations in space.
  • Own and evolve our observability stack (metrics, tracing, logs) to improve usability and performance. Grafana-centric ecosystems are a plus.
  • Implement and advocate for best practices in software reliability, fault tolerance, and performance tuning.
  • Proactively identify, investigate, and resolve system reliability issues, performing root cause analyses and implementing long-term fixes.
  • Partner with teams to design and operate Software Defined Network (SDN) solutions.
  • Contribute to a collaborative and inclusive team culture where respectful debate and continuous learning are celebrated.
  • Initially, handle and manage the link between cloud and network/software/hardware infrastructure. Assume I&T (Information technology) responsibilities as much as necessary to start with.
Must Haves:
  • Strong experience with public cloud infrastructure, ideally GCP.
  • Deep expertise in Kubernetes: architecture, deployment, ops, and resource optimization.
  • Demonstrated ability to design and build scalable, highly available systems.
  • Familiarity with Software Defined Networking (SDN) concepts and tools.
  • Experience implementing and maintaining observability stacks (Grafana, Prometheus, Loki, etc.).
  • Proficiency in at least one backend language: Go, Python, Rust, C/C++, or Java.
  • Deep understanding and hands-on experience with DevOps practices: CI/CD, infrastructure as code (IaC), and automation.
  • Proven track record of working in fast-paced, high-growth technical environments.
  • Strong networking knowledge (TCP/IP, DNS, routing, switching, firewalls, VPNs, secure networks).
  • Deep experience in Systems Administration.
  • Excellent problem-solving skills and ability to operate independently with a proactive, results-driven mindset.
  • Strong communication skills; thrives in a multicultural, cross-functional team.
Nice to Have:
  • Hands-on experience with GitOps frameworks (ArgoCD, FluxCD).
  • Interest or experience in FinOps and cost-optimized architectures.
  • Understanding of orchestration in resource-constrained environments, like space systems.
  • Knowledge of infrastructure as code frameworks (Terraform, Ansible or similar).
  • Knowledge of systems engineering tools and SDLC governance.
  • Cybersecurity awareness.
  • Familiarity with security practices: vulnerability scanning, threat detection, risk mitigation.

Orbitworks' mission is to make space simple for organizations that want to deploy physical and virtual missions to space. Building on Loft Orbital's heritage, Orbitworks will be the first commercial firm in the United Arab Emirates to mass-manufacture satellites. Orbitworks aims to manufacture tens of satellites annually and operates out of a 50,000-square-foot facility in Abu Dhabi.

Senior Site Reliability Engineer

Vng Solutions

Posted today

Job Description

VSOL is a digital enabler with a mission to help public and private organizations evolve their businesses through data and technology. We provide an end-to-end service from consulting to execution that drives the growth and innovation of our clients. As VSOL is in a phase of rapid expansion, we offer a dynamic, creative environment that accelerates your personal and professional development. We are looking for talented individuals eager to develop in international markets while contributing to the company’s future in a constructive and supportive manner.

Responsibilities:

  • Lead deployment and management of web applications, ensuring stability, scalability and reliability.
  • Design and manage hybrid environment reliability solutions (cloud and on-premises), optimizing for availability and performance.
  • Orchestrate and administer containerized applications using Kubernetes, focusing on efficient deployment and runtime management.
  • Administer systems, including Geographic Information System (GIS) platforms and databases (SQL Server), maintaining data integrity and high performance.
  • Analyze and mitigate service disruptions, developing strategic preventative measures to minimize downtime.
  • Understanding of network engineering principles.
  • Participate in evaluation and integration of new technologies, enhancing service reliability and operational capabilities.
  • Develop and automate critical system health metrics, using tools like ELK stack.
  • Manage major incident response efforts, ensuring effective resolution to maintain system stability.
  • Coordinate with cross-functional teams to align SRE practices with business objectives and IT standards.
  • Create and review technical documentation for system architecture and operational procedures.
  • Ensure regulatory compliance and support security assessments, implementing best practices to protect system integrity.
  • Participate in pager-duty rotations, resolving critical incidents.
Note: The position may require international travel for continuous periods of 6 months. Candidates will be required to accept this requirement as part of the position.

Requirements

  • Over 4 years of experience with cloud environments and containerization technologies, including designing and implementing scalable, resilient infrastructure solutions using platforms such as GCP (and other cloud platforms) and Kubernetes.
  • Experience with monitoring and logging tools such as ELK Stack.
  • Demonstrated excellence in network management, advanced troubleshooting, and system optimization, with a focus on enhancing efficiency and reducing downtime.
  • Experience in IT, with advanced expertise in network engineering and system administration.
  • Experience in site reliability practices; any experience with GIS platforms is a plus.
  • Strong skills in scripting and automation; Python and Bash in particular are a big plus.
  • Good knowledge of GitOps tools (e.g., Argo CD, FluxCD).
  • Knowledge of security frameworks and compliance standards.

Qualifications:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • Cisco Certified Network Associate (CCNA) is a plus.
  • Certified Kubernetes Administrator (CKA) is a plus.
  • Written and spoken English communication skills at CEFR B1 level or above.

Why you’ll love working here:

  • Working in an English-speaking start-up environment, with the opportunity to be part of an innovation team and global projects
  • Onsite opportunities in UAE (United Arab Emirates) and KSA (Kingdom of Saudi Arabia)
  • 13th-month salary bonus
  • Premium health insurance for employees and family members (depending on level), annual health check, government insurance during probation
  • 14+ days of Annual leave and 5 days of Outing leave
  • Lunch allowance and free parking
  • Taxi & phone allowance (depending on level)

Site Reliability Engineer II

Dubai, Dubai Esri

Posted today

Job Description

Join us to work collaboratively with our talented team of dynamic and passionate engineers to deliver capabilities that enable our customers to make a difference. You'll deploy and operate ArcGIS Velocity and ArcGIS Workflow Manager SaaS solutions. You will also have the opportunity to design, deploy, and operate next-generation real-time and big data GIS software-as-a-service (SaaS) capabilities for thousands of cloud users worldwide.

Our teams have a broad mix of experience levels and tenures that support an environment that promotes professional development. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future.

Our team also puts a high value on work-life balance, and we understand that striking a healthy balance between your personal and professional life is crucial to your happiness and success here. We offer a flexible hybrid schedule so you can have a more productive and well-balanced life both in and outside of work.

Responsibilities

  • Collaborate with a team of SRE engineers to operate SaaS capabilities across multiple regions on the cloud platform
  • Design, implement, configure, and utilize monitoring systems to monitor the health of SaaS products
  • Manage infrastructure used for ArcGIS Velocity and ArcGIS Workflow Manager, respond to alerts, and troubleshoot problems to resolution
  • Develop, implement, and maintain automation solutions for repetitive operational tasks, such as deployment pipelines, incident resolution, and scaling processes
  • Design and implement the deployment and upgrade of containerized micro-service components that, when combined, power Esri's SaaS offerings
  • Create and automate Git workflows to simplify code integration, testing, and infrastructure deployments
  • Participate in technical spike efforts, bringing new innovative ideas to future versions of our software
  • Troubleshoot system incidents and provide root cause analysis reports
  • Provide rotational on-call technical support

Requirements

  • 5+ years of experience managing Kubernetes (EKS), logging and monitoring (ELK, Prometheus), and container technologies (Docker)
  • Proficient in using Terraform for automating infrastructure provisioning and management
  • Ability to design and automate Git workflows for streamlined code integration, testing, and infrastructure deployment
  • Ability to write scripts to deploy infrastructure and/or applications (Bash, Python, Terraform)
  • Expert level understanding and experience with cloud computing platforms (AWS or Microsoft Azure)
  • Strong knowledge of Linux operating system administration, including troubleshooting, performance tuning, and shell scripting
  • Proficient in cloud networking, including VPCs, subnets, security groups, and VPNs in platforms like AWS or Azure
  • Skilled in identifying and resolving system and application issues through effective troubleshooting and root cause analysis
  • Working knowledge of a source control and issue management system
  • Bachelor's in computer science, computer engineering, GIS, or information systems

Recommended Qualifications

  • Experience designing, administering, and/or maintaining cloud environments, such as AWS or Azure, supporting 24×7 high-availability production environments
  • Interest in working with GitOps principles to automate the deployment of applications on Kubernetes clusters
  • Certifications: AWS Certified Solutions Architect Associate, CKA/CKAD or similar
  • Experience managing OpenSearch (datastore or logstore), and Kafka for managing distributed data streams and ensuring high availability in large-scale systems
  • Ability to work with continuous integration and delivery best practices
  • Knowledge of operating resilient, highly available, scalable, and performant SaaS capabilities
  • Knowledge of Esri ArcGIS or other web mapping technologies
  • Working knowledge of GitHub