1 244 DevOps Engineers jobs in the United Arab Emirates
Site Reliability Engineer
Posted today
Job Description
Overview
We’re Hiring: Site Reliability Engineer
Join us as a Site Reliability Engineer and help build scalable, secure, and high-performance infrastructure for cutting-edge fintech platforms in wealth management, digital wallets, trading, and blockchain.
Responsibilities
- Contribute to designing, deploying, and maintaining reliable cloud infrastructure (AWS/Azure).
- Manage databases, integrations, and DevOps automation to streamline operations.
- Support cybersecurity and compliance frameworks to ensure secure, compliant services.
- Collaborate with cross-functional teams to deliver resilient services for fintech platforms.
- Proven experience in cloud infrastructure (AWS/Azure).
- Strong in DB management, integrations & DevOps automation.
- Familiar with cybersecurity & compliance frameworks.
- Bonus: Knowledge of fintech trends & emerging tech.
We craft, deploy, and manage bespoke services in CRM, data and AI, cybersecurity and consulting.
Site Reliability Engineer
Posted today
Job Description
About TechHive AI
TechHive AI is a global tech solutions provider delivering expertise in cloud computing, DevOps, AI/ML, software engineering, and scalable enterprise infrastructure. With clients across the US, UK, EU, and GCC, we empower digital transformation for startups and enterprises alike.
We are now hiring a UAE-based Site Reliability Engineer (SRE) to join our DevOps & Cloud Engineering team. This role focuses on reliability, performance, automation, and incident response across high-availability production systems.
About the Role
As a Site Reliability Engineer (SRE), you'll blend software engineering, system administration, and DevOps principles to ensure system uptime, scalability, and performance. You will help build fault-tolerant infrastructure, automate deployment pipelines, and respond to incidents swiftly and effectively.
This is a remote role, but you must be located in the UAE due to regional time zone coordination and client compliance.
Key Responsibilities
- Maintain and improve system reliability, availability, and performance across cloud-based infrastructure (AWS, GCP, or Azure)
- Automate and manage deployment pipelines using CI/CD tools like GitLab CI, GitHub Actions, or Jenkins
- Develop and maintain observability and monitoring systems (e.g., Prometheus, Grafana, ELK, CloudWatch)
- Build tools/scripts to automate operational tasks and incident recovery
- Participate in on-call rotations and ensure SLAs are met
- Lead root cause analysis (RCA) after critical incidents and implement postmortem action items
- Ensure security best practices across infrastructure and deployments
- Implement infrastructure as code (IaC) using Terraform or CloudFormation
- Collaborate closely with DevOps, Software Engineers, QA, and Product teams
Required Qualifications
- 3+ years of experience in DevOps, SRE, or Cloud Infrastructure Engineering
- Strong knowledge of AWS, Azure, or Google Cloud Platform
- Experience with Docker, Kubernetes, and container orchestration
- Proficiency in Linux system administration, shell scripting, and automation
- Strong experience with monitoring, alerting, and logging tools (Grafana, Prometheus, ELK, Datadog, etc.)
- Familiarity with incident response, on-call practices, and SLAs/SLOs
- Proficient in at least one scripting language (Python, Bash, or Go)
- Familiar with Git, CI/CD pipelines, and version control workflows
- Excellent problem-solving and communication skills
- Must be residing in the UAE
Preferred Qualifications
- Experience with multi-region, highly available architectures
- Knowledge of SRE best practices and Google's SRE principles
- Certifications: AWS DevOps Engineer, Google SRE, Kubernetes Admin (CKA), etc.
- Experience working with AI/ML pipelines or data-intensive systems
Why Join TechHive AI?
- Work remotely while collaborating with international teams
- Flexible hours with a performance-driven culture
- Competitive compensation package
- Opportunity to work on high-impact, cloud-native projects
- Continuous learning and certification support
How to Apply
Send your CV to
Subject Line: "SRE – Remote (UAE Based)"
Or apply via LinkedIn.
Site Reliability Engineer
Posted today
Job Description
Location: Abu Dhabi
Duration: Yearly Renewable Contract
Role Summary:
We are looking for a Site Reliability Engineer (SRE) to maintain the availability, scalability, and performance of critical services deployed across cloud and on-premise environments. This role combines software engineering and systems engineering to automate operations and improve reliability in CI/CD and production environments.
Key Responsibilities:
- Maintain uptime and performance of applications deployed across hybrid infrastructure
- Implement observability (logging, metrics, tracing) using Prometheus, Grafana, ELK, Azure Monitor
- Troubleshoot production issues, participate in incident response, and root cause analysis
- Automate infrastructure, monitoring, and runbooks using IaC tools and scripting
- Implement and track SLOs, SLIs, and error budgets
- Build self-healing systems and resilient deployments
- Collaborate with developers, security teams, and cloud engineers to enforce reliability practices
Required Skills:
- Experience with Azure/AWS/GCP monitoring tools and on-prem observability stacks
- Strong in Linux/Unix administration, scripting (Python, Bash)
- Hands-on with CI/CD pipelines, Kubernetes, and Helm
- Good understanding of load balancing, failover, HA architecture
- Familiar with incident management, postmortem writing, and runbook creation
Preferred Qualifications:
- Experience with Terraform, Ansible, or Pulumi
- Knowledge of service mesh (Istio, Linkerd) and API gateway configurations
- Certifications: SRE Foundation, Azure/AWS Cloud Practitioner, or Kubernetes Administrator (CKA)
- Awareness of compliance standards (CIS, NIST, ISO 27001)
Site Reliability Engineer
Posted today
Job Description
Company Description
Since 2000, Avanza Solutions has been empowering organizations around the world to embrace digital transformation through innovative digital platforms and services. Renowned for its excellence and continual evolution, Avanza specializes in developing, deploying, and integrating advanced technologies like Digital Banking, AI, Blockchain, and Smart City applications. Avanza has delivered a wide array of digital solutions across various sectors, including Banking, Finance, Telecommunications, Insurance, Pharmaceuticals, and Government. With a vision of people-centric innovation and automation, Avanza has transformed the operational and functional landscapes of numerous enterprises globally.
Role Description
This is a full-time on-site role for a Site Reliability Engineer focused on Bank Applications, located in Abu Dhabi. The Site Reliability Engineer will be responsible for ensuring the reliability and performance of banking applications by managing software development, system administration, troubleshooting, and infrastructure tasks. Day-to-day tasks include monitoring system performance, automating operations, enhancing system stability, and responding to incidents and issues promptly to maintain seamless operations.
Qualifications
- Expertise in Site Reliability Engineering and Troubleshooting
- Proficiency in Software Development and System Administration
- Experience with Infrastructure management
- Strong problem-solving and analytical skills
- Excellent written and verbal communication skills
- Bachelor's degree in Computer Science, IT, or related field
- Relevant certifications are a plus
Site Reliability Engineer
Posted today
Job Description
Company Description
Since 2000, Avanza Solutions has been empowering organizations globally to embrace digital transformation through innovative digital platforms and services. Renowned for its excellence, Avanza specializes in digital banking, customer relationship & experience management, artificial intelligence, blockchain, smart city applications, business automation, and cognitive platforms. The company has successfully delivered digital solutions to sectors including Banking, Finance, Telecommunications, Insurance, Pharmaceuticals, and Government. Avanza Solutions is committed to driving innovation and automation in numerous enterprises worldwide.
Role Description
This is a full-time on-site role, located in Abu Dhabi, for a Site Reliability Engineer - Only Bank Application. The Site Reliability Engineer will be responsible for ensuring the reliability, efficiency, and performance of banking applications. Day-to-day tasks include troubleshooting issues, developing software solutions, and managing system administration tasks. The engineer will also maintain infrastructure and ensure the stability of banking applications.
Qualifications
- Proficiency in Site Reliability Engineering and Troubleshooting skills
- Experience in Software Development
- Skills in System Administration
- Knowledge of Infrastructure management
- Excellent problem-solving and analytical skills
- Strong communication and teamwork abilities
- Experience in the banking or financial sector is a plus
- Bachelor's degree in Computer Science, Information Technology, or related field
Site Reliability Engineer
Posted today
Job Description
Location: Abu Dhabi or Remote/Hybrid
Experience: 6–10 Years
Role Summary:
We are looking for a Site Reliability Engineer (SRE) to maintain the availability, scalability, and performance of critical services deployed across cloud and on-premise environments. This role combines software engineering and systems engineering to automate operations and improve reliability in CI/CD and production environments.
Key Responsibilities:
- Maintain uptime and performance of applications deployed across hybrid infrastructure
- Implement observability (logging, metrics, tracing) using Prometheus, Grafana, ELK, Azure Monitor
- Troubleshoot production issues, participate in incident response, and root cause analysis
- Automate infrastructure, monitoring, and runbooks using IaC tools and scripting
- Implement and track SLOs, SLIs, and error budgets
- Build self-healing systems and resilient deployments
- Collaborate with developers, security teams, and cloud engineers to enforce reliability practices
Required Skills:
- Experience with Azure/AWS/GCP monitoring tools and on-prem observability stacks
- Strong in Linux/Unix administration, scripting (Python, Bash)
- Hands-on with CI/CD pipelines, Kubernetes, and Helm
- Good understanding of load balancing, failover, HA architecture
- Familiar with incident management, postmortem writing, and runbook creation
Preferred Qualifications:
- Experience with Terraform, Ansible, or Pulumi
- Knowledge of service mesh (Istio, Linkerd) and API gateway configurations
- Certifications: SRE Foundation, Azure/AWS Cloud Practitioner, or Kubernetes Administrator (CKA)
- Awareness of compliance standards (CIS, NIST, ISO 27001)
Site Reliability Engineer
Posted today
Job Description
Senior Trading Operations Engineer | Dubai | Crypto / HFT
(SRE & DevOps)
Our client is a high-performance trading firm operating at the forefront of low-latency crypto markets. They're lean, profitable, and punch far above their weight, executing over 1 million trades per day and touching 6% of daily volume on major exchanges, all with a tight-knit global team.
They're now hiring a Senior Trading Operations Engineer in Dubai to ensure seamless real-time performance across their trading infrastructure.
The Role
You'll sit at the core of live operations, working hand-in-hand with traders and developers to optimise performance, automate systems, and rapidly troubleshoot issues. This is a hands-on engineering role with future leadership potential.
What You'll Do
- Monitor and manage real-time high-frequency trading systems
- Deep-dive into latency and performance metrics (Grafana, Prometheus, etc.)
- Improve system reliability and uptime through automation & tooling
- Own CI/CD pipelines, config management (Ansible), and on-call rotation
- Collaborate on deployments, A/B testing, and incident response
What They're Looking For
- 8+ years in software or trading operations
- Strong Python scripting and Linux systems knowledge
- Experience in low-latency, distributed, or HFT-style environments
- Familiar with AWS, Grafana, Prometheus, ClickHouse, ELK
- Bonus: Kubernetes, containerization, incident frameworks
Why Join?
- High-impact role in a firm with zero external funding and real PnL
- Access to cutting-edge infrastructure + strategies
- Dubai-based with full relocation support (visa, flight, ID, etc.)
- Regular global "workations" and a multicultural, remote-friendly team
Ready to own production in one of the world's fastest trading stacks?
Drop us a message or apply directly — all conversations held in confidence.
Site Reliability Engineer
Posted today
Job Description
Lead Site Reliability Engineer – Dubai, Full Time
Our client, a technology organization and a leader in its field, is seeking an experienced Lead Site Reliability Engineer to join their team.
As Lead Site Reliability Engineer, you will take ownership of their production environment, which processes hundreds of billions of data points every year. This is a senior opportunity for a Kubernetes expert with strong experience in distributed data systems, automation, and infrastructure security.
Key Responsibilities:
- Design and manage scalable Kubernetes-based infrastructure.
- Automate provisioning, configuration management, and deployments.
- Optimise and administer distributed databases including Cassandra, ScyllaDB, PostgreSQL, MongoDB, and ElasticSearch.
- Build and maintain advanced monitoring, logging, and alerting systems.
- Manage Linux-based systems with a focus on security, patching, and hardening.
- Develop and test disaster recovery and business continuity strategies.
- Lead performance tuning, capacity planning, and cost optimisation.
- Act as the highest escalation point for complex infrastructure incidents.
- Collaborate with development teams to refine CI/CD pipelines.
- Produce clear documentation of systems, processes, and architecture.
- Provide mentorship and technical guidance to the wider technology team.
Qualifications & Experience:
- Degree in Computer Science, Engineering, or related field.
- 8+ years in SRE, DevOps, or Systems Engineering across large-scale production environments.
- Expert in Kubernetes cluster design, scaling, and security.
- Proven experience with both NoSQL and relational databases.
- Strong Linux administration skills (Ubuntu/SUSE) with system hardening expertise.
- Proficiency in scripting (Bash, Python, Go) and Infrastructure as Code tools (Terraform, Ansible, Pulumi).
- Knowledge of load balancing, networking, and storage solutions.
Nice to Have:
- Strong understanding of infrastructure and data security.
- Hands-on disaster recovery and resilience planning.
- Advanced database tuning and recovery skills.
- Familiarity with Go-based application support.
- Excellent troubleshooting and analytical skills.
- Proactive, accountable, and effective in high-pressure environments.
- Strong collaboration with cross-functional engineering teams.
Salary: Competitive
Benefits: Medical Insurance and Visa (family), flight ticket allowance (employee only)
Site Reliability Engineer
Posted today
Job Description
Requirements:
Experience in Site Reliability Engineering, with solid hands-on experience in the tech stack listed in the responsibilities section. Additionally, if you are certified in any of these technologies, we would love to see you demonstrate it through detail-oriented problem-solving skills and knowledge of the products.
Job description:
• Possess a strong coding background with expertise in any language like Python, Go, or Java.
• Ability to troubleshoot code errors efficiently, identifying root causes for effective communication with development teams.
• Advocate for and implement best practices in code development, prioritizing maintainability, scalability, and reliability.
• Keep current with industry trends and technologies, introducing innovative solutions to enhance system reliability and performance.
• Leverage proven design patterns to connect databases, middleware, and other components, ensuring robust and fault-tolerant system interactions.
• Actively contribute to the evolution of the organization's design patterns by researching and proposing enhancements based on industry trends.
• Gather and analyze metrics from operating systems as well as critical application services to assist in quick identification of issues and faults.
• Partner with development to improve the reliability of application services and release procedures.
• Participate in system design consulting, platform management and capacity planning.
• Should have a deep understanding of observability enablement for the different application stacks.
Site Reliability Engineer
Posted today
Job Description
Khazna was founded in 2012 and has grown rapidly to become the leading and trusted wholesale Data Center provider in the Middle East and North Africa region. Through our Data Centers, we provide industry benchmark levels of power supply and cooling services to better serve the growing need for data center operations in the UAE and wider region.
We are seeking a Site Reliability Engineer to support the reliability engineering program across multiple data centers in our fleet. Reporting to the Reliability Manager, you will be responsible for monitoring system performance, driving preventative and predictive maintenance initiatives, leading root cause analysis efforts, and collaborating with cross-functional teams to minimize downtime and enhance infrastructure resilience.
Key Accountabilities:
- Monitor real-time and historical performance metrics for critical power, cooling, and IT systems.
- Analyse system data to identify trends, failure modes, and reliability risks.
- Execute Root Cause Analyses (RCA) and Failure Mode & Effects Analyses (FMEA), then drive corrective and preventive actions.
- Develop and maintain condition-based and predictive maintenance routines, leveraging IoT, data analytics, and machine learning tools.
- Support preventive maintenance programs: schedule, document, and validate maintenance activities.
- Assist in asset lifecycle planning, including upgrades, decommissioning, and end-of-life strategies.
- Contribute to capacity runway assessments to forecast infrastructure needs.
- Implement and enforce availability management plans, risk assessments, and mitigation strategies.
- Ensure data collection and reporting processes for reliability KPIs (e.g., MTBF, MTTR, availability) are standardized and accurate.
- Prepare reliability reports and dashboards; present findings and recommendations to site leadership.
- Respond to and lead failure-response efforts during site incidents, ensuring rapid recovery and root-cause follow-through.
- Maintain compliance with industry standards and regulations (Uptime Institute, ISO, ASHRAE).
- Collaborate with Operations, Engineering, Facilities, and Vendors to integrate reliability best practices into day-to-day workflows.
- Propose continuous-improvement initiatives and pilot emerging reliability technologies.
- The job holder may be required to undertake additional duties that may reasonably be expected and that form part of the function of the job.
Minimum Qualifications:
- Bachelor's degree in Mechanical, Electrical, Reliability, or related Engineering discipline.
Minimum Experience:
- 3+ years of experience in reliability engineering, maintenance engineering, or a data center operations environment.
- Hands-on experience with RCA, FMEA, and predictive maintenance methodologies.
- Proficiency with monitoring platforms, data-analytics tools, and scripting (e.g., Python, R).
- Familiarity with IoT sensors, machine-learning frameworks, and condition-based monitoring systems.
- Knowledge of industry reliability standards and regulations (ISO, ASHRAE, Uptime Institute).
Job-Specific Skills (Generic / Technical):
- Strong analytical and problem-solving skills, with acute attention to detail.
- Effective communicator, able to present technical findings to diverse audiences.
- Project coordination skills and the ability to manage multiple reliability initiatives.
- Collaborative mindset, comfortable working in cross-functional teams.
- Self-starter with a continuous-improvement attitude and commitment to resilience.