166 Reliability Engineer jobs in the United Arab Emirates
Site Reliability Engineer
Posted today
Job Description
The Site Reliability Engineer (SRE) will lead an SRE squad focused on enhancing service reliability, performance, and scalability. They will drive automation to reduce toil, optimize system uptime, and manage incident resolution efforts. Responsible for building monitoring systems, optimizing infrastructure, and implementing safe deployment practices, the SRE will also ensure alignment with SLAs/SLOs and contribute to system development and code reviews. The role requires expertise in large-scale distributed systems, cloud infrastructure, and IT governance, with a focus on continuous service improvement and operational excellence.
Accountabilities
- Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent team in senior management briefings; produce dashboards and progress reports.
- Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
- Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
- Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
- Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
- Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
- Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
- Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
- Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
- Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.
- 7+ years of experience with data structures/algorithms and software development in two or more programming languages, and with operating and maintaining platforms, including 3+ years in a DevOps or SRE role.
- Experience working in computing, distributed systems, storage, or networking.
- Expertise in designing, analysing, and troubleshooting large-scale distributed systems.
- Ability to debug, optimize code, and to automate routine tasks.
- Systematic problem-solving approach, coupled with effective verbal and written communication skills.
- Strong communication capability, able to articulate technical issues in terms of business risk and opportunity.
- Knowledge of the technical aspects of cloud computing, data centres, networks and virtual infrastructure.
- Strong analytical and problem-solving skills are necessary, as is familiarity with TSM processes & tools.
Etihad Airways, the national airline of the UAE, was formed in 2003 and quickly went on to become one of the world’s leading airlines. From its home in Abu Dhabi, Etihad flies to passenger and cargo destinations in the Middle East, Africa, Europe, Asia, Australia and North America. Together with Etihad’s codeshare partners, Etihad’s network offers access to hundreds of international destinations. In recent years, Etihad has received numerous awards for its superior service and products, cargo offering, loyalty programme and more. All this ties into Etihad’s ambitious Journey 2030 strategy. The airline plans to double its fleet size and triple the number of customers over the next six years as it sets out to be the airline everyone wants to fly!
Beware of fraudulent job offers from individuals or organizations claiming to represent the Etihad group. We will never ask for personal information, bank details, or payment during the recruitment process. Interviews are conducted face-to-face or via video/telephone before any formal offer. If you are asked for money, please treat it as fraudulent.
Site Reliability Engineer
Posted today
Job Description
Direct message the job poster from Gibraltar Technologies LLC
Overview
The Site Reliability Engineer (SRE) is responsible for maintaining the performance, availability, and security of critical systems supporting enterprise customers. This role bridges system administration with network security, ensuring a seamless and resilient user experience.
Key Responsibilities
System Administration (30%–70%)
- Windows & Active Directory (AD): Maintain and secure Windows servers and Active Directory infrastructure.
- Database Management: Administer and optimize SQL databases; manage backups, recovery, and disaster recovery processes using tools like Rubrik.
- Linux Systems: Ensure uptime and performance of Linux-based systems through proactive management and troubleshooting.
Network & Security (30%–70%)
- Network Security: Configure, monitor, and maintain Fortinet firewalls to safeguard infrastructure against threats.
- Load Balancing: Manage F5 load balancers to ensure application reliability, scalability, and optimized traffic distribution.
Additional Responsibilities
- Incident Management: Quickly respond to and resolve outages and system incidents, reducing downtime and impact.
- Cloud Technologies: Support cloud infrastructure, especially in Azure or Oracle Cloud environments.
- Virtualization: Administer VMware environments to support efficient and scalable virtual infrastructure.
Qualifications & Skills
Education & Experience
- Degree: Bachelor's degree in Computer Science, Information Technology, or related field.
- Experience: 3–5 years in systems administration, network security, or a similar technical domain.
Certifications (Preferred)
- Fortinet NSE Certifications
- VMware VCP or equivalent
- Strong experience in Windows and Linux administration
- Proficient with SQL databases and Rubrik backup / recovery tools
- Hands-on experience with Fortinet firewalls and F5 load balancers
- Working knowledge of cloud platforms (Azure, Oracle Cloud)
- Experience with VMware administration
Seniority level
Mid-Senior level
Employment type
Full-time
Job function
Engineering and Information Technology
IT Services and IT Consulting
Senior Site Reliability Engineer
Posted today
Job Description
About the Role:
Orbitworks is revolutionizing access to space by building reliable, shareable satellites that drastically reduce the time and complexity traditionally required to get to orbit. We operate satellites, fly customer payloads, and handle entire missions from end-to-end. Orbitworks is a joint venture between Marlan Space (UAE-based) and Loft Orbital.
As a Senior Site Reliability Engineer on our Infrastructure Team, you’ll play a pivotal role in maintaining and scaling our ground segment infrastructure. You’ll collaborate across development, operations, and IT to ensure the integration, delivery, and reliability of services that support our test infrastructure and our space operations on Earth and in orbit.
This is an exciting opportunity to work on cutting-edge technology and help build modern automated space infrastructure. This is not your typical SRE role: we apply DevOps principles even to spacecraft control.
Responsibilities
- Collaborate with developers, test engineers and satellite operators to foster a strong SatDevOps culture.
- Design and roll out cloud solutions for our testing and operations infrastructure. Find the best trade-offs between existing and additional cloud resources to scale and help Orbitworks achieve its mission.
- Design, implement, and maintain scalable, reliable, and secure infrastructure in a hybrid cloud environment.
- Improve our developer and test engineers experience by building better tools, workflows, and environment to streamline
- Lead efforts to automate and optimize systems, including CI/CD pipelines, infrastructure provisioning (IaC), and deployment workflows for test on the ground and operations in space.
- Own and evolve our observability stack (metrics, tracing, logs) to improve usability and performance. Grafana-centric ecosystems are a plus.
- Implement and advocate for best practices in software reliability, fault tolerance, and performance tuning.
- Proactively identify, investigate, and resolve system reliability issues, performing root cause analyses and implementing long-term fixes.
- Partner with teams to design and operate Software Defined Network (SDN) solutions.
- Contribute to a collaborative and inclusive team culture where respectful debate and continuous learning are celebrated.
- Initially, handle and manage the link between cloud and network/software/hardware infrastructure. Assume I&T (Information technology) responsibilities as much as necessary to start with.
- Strong experience with public cloud infrastructure, ideally GCP.
- Deep expertise in Kubernetes: architecture, deployment, ops, and resource optimization.
- Demonstrated ability to design and build scalable, highly available systems.
- Familiarity with Software Defined Networking (SDN) concepts and tools.
- Experience implementing and maintaining observability stacks (Grafana, Prometheus, Loki, etc.).
- Proficiency in at least one backend language: Go, Python, Rust, C/C++, or Java.
- Deep understanding and hands-on experience with DevOps practices: CI/CD, infrastructure as code (IaC), and automation.
- Proven track record of working in fast-paced, high-growth technical environments.
- Strong networking knowledge (TCP/IP, DNS, routing, switching, firewalls, VPNs, secure networks).
- Deep experience in Systems Administration.
- Excellent problem-solving skills and ability to operate independently with a proactive, results-driven mindset.
- Strong communication skills; thrives in a multicultural, cross-functional team.
- Hands-on experience with GitOps frameworks (ArgoCD, FluxCD).
- Interest or experience in FinOps and cost-optimized architectures.
- Understanding of orchestration in resource-constrained environments, like space systems.
- Knowledge of infrastructure as code frameworks (Terraform, Ansible or similar).
- Knowledge of systems engineering tools and SDLC governance.
- Cybersecurity Awareness.
- Familiarity with security practices: vulnerability scanning, threat detection, risk mitigation.
Orbitworks' mission is to make space simple for organizations that want to deploy physical and virtual missions to space. Building on Loft Orbital's heritage, Orbitworks will be the first commercial firm in the United Arab Emirates to mass-manufacture satellites. Orbitworks aims to manufacture tens of satellites annually and operates out of a 50,000-square-foot facility in Abu Dhabi.
Data Reliability Engineer - Intern
Posted today
Job Description
Bayut & dubizzle have the unique distinction of being iconic, homegrown brands with a strong presence across the seven emirates in the UAE. Connecting millions of users across the country, we are committed to delivering the best online search experience.
As part of Dubizzle Group, we are alongside some of the strongest classified brands in the market. With a collective strength of 5 brands, we have more than 123 million monthly users that trust in our dedication to providing them with the best platform for their needs.
As the Data Reliability Engineer - Intern, you will participate in projects involving the management of our hybrid cloud based on AWS and GCP. You will have exposure to the latest technologies used in the world of data, analytics and artificial intelligence. You will have the opportunity to learn how to administer and monitor both batch and real-time data processing pipelines. You will work in a modern cloud-based data environment alongside a team of diverse, intense and interesting co-workers. You will liaise with other teams – such as product & tech, the core business verticals, trust & safety, finance and others – to enable them to be successful.
In this role, you will:
- Support the administration of AWS and GCP cloud for data, analytics and AI.
- Ensure systems are available and working properly.
- Report any system that doesn't comply with company-wide security policies; administer security groups.
- Set up security and stability alerts using tools such as GuardDuty, CloudWatch, and Slack integrations.
- Continuously control costs and optimise them to keep them within budgets.
- Ensure all systems have a recent backup that can be restored in case of disaster.
- Administration of GitHub, password manager and other security tokens, keys and secrets.
- Grant and revoke access on all BI systems to and from all users.
- Kubernetes cluster management.
- Administer Redshift Database, Data Lake, RDS, Google BigQuery.
- Administer Data visualisation (Sisense, Tableau).
- Administer ETL platform (Matillion, Airflow).
- Automate repetitive tasks using Python and shell scripting.
- Install, upgrade, and patch production EC2 instances.
- Working knowledge of DNS, proxies and general routing.
Minimum Requirements:
- Fresh Graduates or Final Year students from top-of-class technical degrees such as computer science, engineering, maths, and physics.
- Strong knowledge of Linux and shell scripting.
- Knowledge of SQL and network security protocols.
- Familiar with data pipeline and job scheduling.
- Ability to work under pressure, driven and self-motivated.
- Entrepreneurial spirit and ability to think creatively, with strong curiosity and a drive for continuous learning.
- Thrive in a fast-paced, innovative environment.
- Living the team values: Simpler. Better. Faster.
What We Offer:
- Ability to contribute to a platform used by more than 5M users in UAE and other platforms in the region.
- Strengthen your resume and build your network.
- Opportunity to find a full-time career with the region's leading organization.
- Working in a multicultural environment with over 50 different nationalities.
- Access to the Learning & Development tools and courses provided by the company.
Bayut & dubizzle is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.
Senior Site Reliability Engineer
Posted today
Job Description
VSOL is a digital enabler with a mission to help public and private organizations evolve their businesses through data and technology. We provide an end-to-end service from consulting to execution that drives the growth and innovation of our clients. As VSOL is in a phase of rapid expansion, we offer a dynamic, creative environment that accelerates your personal and professional development. We are looking for talented individuals eager to develop in international markets while contributing to the company’s future in a constructive and supportive manner.
Responsibilities:
- Lead deployment and management of web applications, ensuring stability, scalability and reliability.
- Design and manage hybrid environment reliability solutions (cloud and on-premises), optimizing for availability and performance.
- Orchestrate and administer containerized applications using Kubernetes, focusing on efficient deployment and runtime management.
- Administer systems, including Geographic Information System (GIS) platforms and databases (SQL Server), maintaining data integrity and high performance.
- Analyze and mitigate service disruptions, developing strategic preventative measures to minimize downtime.
- Understanding of network engineering principles.
- Participate in evaluation and integration of new technologies, enhancing service reliability and operational capabilities.
- Develop and automate critical system health metrics, using tools like ELK stack.
- Manage major incident response efforts, ensuring effective resolution to maintain system stability.
- Coordinate with cross-functional teams to align SRE practices with business objectives and IT standards.
- Create and review technical documentation for system architecture and operational procedures.
- Assure regulatory compliance and security assessments, implementing best practices to protect system integrity.
- Participate in pager-duty rotations, resolving critical incidents.
Requirements
- Over 4 years of experience with cloud environments and containerization technologies, including designing and implementing scalable, resilient infrastructure solutions using GCP (and other cloud platforms) and Kubernetes.
- Experience with monitoring and logging tools such as ELK Stack.
- Demonstrated excellence in network management, advanced troubleshooting, and system optimization, with a focus on enhancing efficiency and reducing downtime.
- Experience in IT, with advanced expertise in network engineering and system administration.
- Awareness of, or experience in, site reliability practices; any experience with GIS platforms is a plus.
- Strong skills in scripting and automation; Python and Bash in particular are a big plus.
- Good knowledge of GitOps tools (e.g., Argo CD, FluxCD).
- Knowledge of security frameworks and compliance standards.
Qualifications:
- Bachelor’s degree in Computer Science, Information Technology, or a related field.
- Cisco Certified Network Associate (CCNA) is a plus.
- Certified Kubernetes Administrator (CKA) is a plus.
- Written and spoken English communication skills at CEFR B1 level or above.
Why you’ll love working here:
- Working in an English-speaking start-up environment, with the opportunity to be part of an innovation team and global projects
- Onsite opportunities in UAE (United Arab Emirates) and KSA (Kingdom of Saudi Arabia)
- 13th-month salary bonus
- Premium Health insurance for employees and family members (depending on level), Annual Health Check, Government Insurance in probation
- 14+ days of Annual leave and 5 days of Outing leave
- Lunch allowance and free parking
- Taxi & phone allowance (depending on level)
Reliability Engineer - Coiled Tubing Drilling
Posted today
Job Description
Job Purpose
To ensure the reliability, performance, and continuous improvement of Coiled Tubing Drilling (CTD) tools and systems. The Reliability Engineer plays a key role in reducing non-productive time (NPT), improving tool life, and enhancing service delivery through data-driven analysis and root cause investigations.
Key Responsibilities:
Tool Reliability & Performance Monitoring
- Track and analyze CTD tool performance across jobs and regions.
- Identify failure trends and initiate corrective actions to improve tool reliability.
- Maintain a database of tool runs, failure modes, and repair history.
Root Cause Analysis (RCA)
- Lead investigations into tool failures, service quality incidents, and NPT events.
- Use structured RCA methodologies (e.g., 5 Whys, Fishbone, FMEA) to identify root causes.
- Collaborate with engineering, manufacturing, and field teams to implement corrective and preventive actions (CAPA).
Data Analytics & Reporting
- Develop dashboards and reports to monitor KPIs such as MTBF (Mean Time Between Failures), tool utilization, and service quality.
- Provide insights to operations and engineering teams to support decision-making.
- Support digital initiatives for predictive maintenance and reliability modeling.
Tool Qualification & Field Trials
- Support the qualification of new CTD tools and technologies.
- Plan and monitor field trials, ensuring proper data capture and post-run analysis.
- Provide feedback to R&D and product engineering teams.
Documentation & Compliance
- Maintain accurate records of tool configurations, modifications, and performance logs.
- Ensure compliance with internal quality standards and client-specific requirements.
- Participate in audits and service quality reviews.
Qualifications & Experience:
- Bachelor’s degree in Mechanical, Petroleum, or Reliability Engineering.
- 4–6 years of experience in Coiled Tubing or Well Intervention operations, with a focus on tool reliability or maintenance.
- Strong understanding of CTD tools, downhole dynamics, and failure mechanisms.
- Familiarity with reliability tools and software (e.g., Weibull analysis, SAP PM, Power BI).
Key Competencies:
- Analytical mindset with strong problem-solving skills.
- Excellent communication and cross-functional collaboration.
- Proficiency in data analysis and visualization tools.
- Commitment to safety, quality, and continuous improvement.
*Please remember that joining the Talent Community is not an application for any specific job at Baker Hughes; it gives you priority consideration for opportunities that suit your profile.
Site Reliability Engineer (SRE) AWS
Posted 4 days ago
Job Description
Join to apply for the Site Reliability Engineer (SRE) role at Pragmatike.
Location: Full remote, EU timezone (CET +/- 2 hours)
Start Date: As soon as possible
We are looking for a skilled Site Reliability Engineer (SRE) with deep expertise in AWS to help us scale and secure our infrastructure. As an SRE, you will be instrumental in ensuring the reliability, performance, and scalability of our production systems. You will work closely with engineering teams to automate operations, improve monitoring, and design resilient systems.
Responsibilities
- Design, implement, and maintain scalable, resilient AWS infrastructure
- Develop and manage CI/CD pipelines and infrastructure-as-code (Terraform or similar)
- Set up and optimize monitoring, alerting, and incident response processes
- Proactively identify and resolve performance, reliability, and security issues
- Collaborate with development teams to integrate SRE best practices into their workflows
- Conduct post-mortems and root cause analyses on incidents
- Participate in on-call rotations to support 24/7 system reliability
- 5+ years of experience as an SRE or similar role
- Deep knowledge of AWS services (EC2, ECS, RDS, Lambda, S3, etc.)
- Proficient in infrastructure-as-code tools (Terraform, CloudFormation, etc.)
- Solid experience with Linux systems administration and networking concepts
- Experience with CI/CD tools (GitLab CI, Jenkins, etc.)
- Familiarity with observability tools (Prometheus, Grafana, Datadog, etc.)
- Experience with container orchestration (ECS, EKS, or Kubernetes)
- Understanding of security best practices in cloud environments
- Exposure to incident management frameworks (SRE handbook, etc.)
- 100% remote work with flexible hours
- High-impact role with autonomy and ownership
- Collaborative and international engineering team
- Cutting-edge tech stack with a strong focus on reliability and automation
Seniority level: Not Applicable
Employment type: Full-time
Job function: Engineering and Information Technology / IT Services and IT Consulting
Location: Abu Dhabi, United Arab Emirates
Site Reliability Engineer II - Real-Time and Big Data
Posted today
Job Description
Join us to work collaboratively with our talented team of dynamic and passionate engineers to deliver capabilities that enable our customers to make a difference. You'll deploy and operate ArcGIS Velocity and ArcGIS Workflow Manager SaaS solutions. You will also have the opportunity to design, deploy, and operate next-generation real-time and big data GIS software-as-a-service (SaaS) capabilities for thousands of cloud users worldwide.
Our teams have a broad mix of experience levels and tenures that support an environment that promotes professional development. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future.
Our team also puts a high value on work-life balance, and we understand that striking a healthy balance between your personal and professional life is crucial to your happiness and success here. We offer a flexible hybrid schedule so you can have a more productive and well-balanced life both in and outside of work.
Responsibilities
- Collaborate with a team of SRE engineers to operate SaaS capabilities across multiple regions on the cloud platform
- Design, implement, configure, and utilize monitoring systems to monitor the health of SaaS products
- Manage infrastructure used for ArcGIS Velocity and ArcGIS Workflow Manager, respond to alerts, and troubleshoot problems to resolution
- Develop, implement, and maintain automation solutions for repetitive operational tasks, such as deployment pipelines, incident resolution, and scaling processes
- Design and implement the deployment and upgrade of containerized microservice components that, when combined, power Esri’s SaaS offerings
- Create and automate Git workflows to simplify code integration, testing, and infrastructure deployments
- Participate in technical spike efforts, bringing new innovative ideas to future versions of our software
- Troubleshoot system incidents and provide root cause analysis reports
- Provide rotational on-call technical support
- 5+ years of experience managing Kubernetes (EKS), logging and monitoring (ELK, Prometheus), and container technologies (Docker)
- Proficient in using Terraform for automating infrastructure provisioning and management
- Ability to design and automate Git workflows for streamlined code integration, testing, and infrastructure deployment
- Ability to write scripts to deploy infrastructure and/or applications (Bash, Python, Terraform)
- Expert level understanding and experience with cloud computing platforms (AWS or Microsoft Azure)
- Strong knowledge of Linux Operating system administration, including troubleshooting, performance tuning, and shell scripting
- Proficient in cloud networking, including VPCs, subnets, security groups, and VPNs in platforms like AWS or Azure
- Skilled in identifying and resolving system and application issues through effective troubleshooting and root cause analysis
- Working knowledge of a source control and issue management system
- Bachelor’s in computer science, computer engineering, GIS, or information systems
- Experience designing, administering, and/or maintaining cloud environments, such as AWS or Azure, supporting 24×7 high-availability production environments
- Interest in working with GitOps principles to automate the deployment of applications on Kubernetes clusters
- Certifications: AWS Certified Solution Architect Associate, CKA/CKAD or similar
- Experience managing OpenSearch (datastore or logstore), and Kafka for managing distributed data streams and ensuring high availability in large-scale systems
- Ability to work with continuous integration and delivery best practices
- Knowledge of operating resilient, highly available, scalable, and performant SaaS capabilities
- Knowledge of Esri ArcGIS or other web mapping technologies
- Working knowledge of GitHub
About Esri
At Esri, diversity is more than just a word on a map. When employees of different experiences, perspectives, backgrounds, and cultures come together, we are more innovative and ultimately a better place to work. We believe in having a diverse workforce that is unified under our mission of creating positive global change. We understand that diversity, equity, and inclusion is not a destination but an ongoing process. We are committed to the continuation of learning, growing, and changing our workplace so every employee can contribute to their life’s best work. Our commitment to these principles extends to the global communities we serve by creating positive change with GIS technology. For more information on Esri’s Racial Equity and Social Justice initiatives, please visit our website.
If you don’t meet all of the preferred qualifications for this position, we encourage you to still apply!
Esri is an equal opportunity employer (EOE) and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability status, protected veteran status, or any other characteristic protected by law. If you need reasonable accommodation for any part of the employment process, please email and let us know the nature of your request and your contact information. Please note that only those inquiries concerning a request for reasonable accommodation will be responded to from this e-mail address.
Esri Privacy
Esri takes our responsibility to protect your privacy seriously. We are committed to respecting your privacy by providing transparency in how we acquire and use your information, giving you control of your information and preferences, and holding ourselves to the highest national and international standards, including CCPA and GDPR compliance.
Site Reliability Engineer II - Real-Time and Big Data
Posted 6 days ago
Job Description
Overview
Join us to work collaboratively with our talented team of dynamic and passionate engineers to deliver capabilities that enable our customers to make a difference. You'll deploy and operate ArcGIS Velocity and ArcGIS Workflow Manager SaaS solutions. You will also have the opportunity to design, deploy, and operate next-generation real-time and big data GIS software-as-a-service (SaaS) capabilities for thousands of cloud users worldwide.
Our teams have a broad mix of experience levels and tenures that support an environment that promotes professional development. We care about your career growth and strive to assign projects based on what will help each team member develop into a better-rounded engineer and enable them to take on more complex tasks in the future.
Our team also puts a high value on work-life balance, and we understand that striking a healthy balance between your personal and professional life is crucial to your happiness and success here. We offer a flexible hybrid schedule so you can have a more productive and well-balanced life both in and outside of work.
Responsibilities
- Collaborate with a team of SREs to operate SaaS capabilities across multiple regions on the cloud platform
- Design, implement, configure, and utilize monitoring systems to monitor the health of SaaS products
- Manage infrastructure used for ArcGIS Velocity and ArcGIS Workflow Manager, respond to alerts, and troubleshoot problems to resolution
- Develop, implement, and maintain automation solutions for repetitive operational tasks, such as deployment pipelines, incident resolution, and scaling processes
- Design and implement the deployment and upgrade of containerized microservice components that, when combined, power Esri’s SaaS offerings
- Create and automate Git workflows to simplify code integration, testing, and infrastructure deployments
- Participate in technical spike efforts, bringing new innovative ideas to future versions of our software
- Troubleshoot system incidents and provide root cause analysis reports
- Provide rotational on-call technical support
Requirements
- 5+ years of experience managing Kubernetes (EKS), logging and monitoring (ELK, Prometheus), and container technologies (Docker)
- Proficient in using Terraform for automating infrastructure provisioning and management
- Ability to design and automate Git workflows for streamlined code integration, testing, and infrastructure deployment
- Ability to write scripts to deploy infrastructure and/or applications (Bash, Python, Terraform)
- Expert-level understanding of and experience with cloud computing platforms (AWS or Microsoft Azure)
- Strong knowledge of Linux operating system administration, including troubleshooting, performance tuning, and shell scripting
- Proficient in cloud networking, including VPCs, subnets, security groups, and VPNs in platforms like AWS or Azure
- Skilled in identifying and resolving system and application issues through effective troubleshooting and root cause analysis
- Working knowledge of a source control and issue management system
- Bachelor’s in computer science, computer engineering, GIS, or information systems
- Experience designing, administering, and/or maintaining cloud environments, such as AWS or Azure, supporting 24×7 high-availability production environments
- Interest in working with GitOps principles to automate the deployment of applications on Kubernetes clusters
- Certifications: AWS Certified Solutions Architect - Associate, CKA/CKAD, or similar
- Experience managing OpenSearch (datastore or logstore) and Kafka to handle distributed data streams and ensure high availability in large-scale systems
- Ability to work with continuous integration and delivery best practices
- Knowledge of operating resilient, highly available, scalable, and performant SaaS capabilities
- Knowledge of Esri ArcGIS or other web mapping technologies
- Working knowledge of GitHub
At Esri, diversity is more than just a word on a map. When employees of different experiences, perspectives, backgrounds, and cultures come together, we are more innovative and ultimately a better place to work. We believe in having a diverse workforce that is unified under our mission of creating positive global change. We understand that diversity, equity, and inclusion is not a destination but an ongoing process. We are committed to the continuation of learning, growing, and changing our workplace so every employee can contribute to their life’s best work. Our commitment to these principles extends to the global communities we serve by creating positive change with GIS technology. For more information on Esri’s Racial Equity and Social Justice initiatives, please visit our website here.
If you don’t meet all of the preferred qualifications for this position, we encourage you to still apply!
Esri is an equal opportunity employer (EOE) and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability status, protected veteran status, or any other characteristic protected by law. If you need reasonable accommodation for any part of the employment process, please email and let us know the nature of your request and your contact information. Please note that only those inquiries concerning a request for reasonable accommodation will be responded to from this e-mail address.
Esri Privacy
Esri takes our responsibility to protect your privacy seriously. We are committed to respecting your privacy by providing transparency in how we acquire and use your information, giving you control of your information and preferences, and holding ourselves to the highest national and international standards, including CCPA and GDPR compliance.
Requisition ID: 2025-2366
- Seniority level: Not Applicable
- Employment type: Full-time
- Job function: Engineering and Information Technology
- Industries: Software Development, IT Services and IT Consulting, and Technology, Information and Internet