221 Condition Monitoring jobs in the United Arab Emirates
Condition Monitoring Analyst
Job Description
Career Area: Product Support
Your Work Shapes the World at Caterpillar Inc.
When you join Caterpillar, you're joining a global team who cares not just about the work we do but also about each other. We are the makers, problem solvers, and future world builders who are creating stronger, more sustainable communities. We don't just talk about progress and innovation here; we make it happen, with our customers, where we work and live. Together, we are building a better world, so we can all enjoy living in it.
Job Purpose
As a Condition Monitoring Analyst, you will support internal Caterpillar teams and dealer partners by diagnosing machine issues, researching technical data, and providing clear, actionable recommendations to help technicians and fleet managers maintain equipment performance and reliability.
What You Will Do
- Information gathering: Use Caterpillar's internal databases and systems to research issues and consolidate relevant findings.
- Technical problem-solving: Investigate machine issues using internal systems and provide recommendations to help fleet managers or technicians resolve problems.
- Documentation & communication: Prepare clear, concise, and structured reports or recommendations.
- Addressing minor coverage issues and resolving minor complaints. Ensuring all customer communication is clearly documented.
- Answering inbound customer service inquiries. Providing health analysis or troubleshooting, and redirecting inquiries when appropriate.
- Identifying issues and determining the appropriate course of action for effective resolution.
- Processing results from analysis of technical data.
- Understanding prime product or component health or status, whether action is needed, and the required next steps.
Skills You Will Have
Customer Focus: Knowledge of the values and practices that align customer needs and satisfaction as primary considerations in all business decisions, and ability to leverage that information in creating customized customer solutions.
Data Gathering & Analysis: Knowledge of data gathering and analysis tools, techniques, and processes; ability to collect and synthesize data from a variety of stakeholders and sources in an objective manner to reach a conclusion, goal, or judgment.
Service Excellence: Knowledge of customer service concepts and techniques; ability to meet or exceed customer needs and expectations and provide excellent service in a direct or indirect manner.
Consulting: Knowledge of techniques, roles, and responsibilities in providing technical or business guidance to clients, both internal and external; ability to apply consulting knowledge appropriately.
Decision Making and Critical Thinking: Knowledge of the decision-making process and associated tools and techniques; ability to accurately analyze situations and reach productive decisions based on informed judgment.
Effective Communications: Understanding of effective communication concepts, tools, and techniques; ability to effectively transmit, receive, and accurately interpret ideas, information, and needs through the application of appropriate communication behaviors.
Problem Solving: Knowledge of approaches, tools, and techniques for recognizing, anticipating, and resolving organizational, operational, or process problems; ability to apply knowledge of problem solving appropriately to diverse situations.
Relationship Management: Knowledge of relationship management techniques; ability to establish and maintain healthy working relationships with clients, vendors, and peers.
What Will Put You Ahead
- Proven experience (… years maximum) in a similar role in industries such as mining or heavy construction.
- Strong mechanical aptitude and research skills are essential.
- Good written communication skills are important.
- Familiarity with dealer operations is a plus.
What We Offer:
From day one, you're set up to thrive at Caterpillar: helpful training, relatable mentors, global experience, a competitive salary package, work-life balance, and the growth opportunities you expect with a Fortune 100 company.
We value authenticity and encourage candidates to submit original personally crafted responses throughout our hiring process. Use of AI-generated content may disadvantage your application.
Posting Dates:
June June
Caterpillar is an Equal Opportunity Employer.
Not ready to apply? Join our Talent Community.
Required Experience:
IC
Reliability Engineer
Job Description
Job Purpose
To ensure the reliability, performance, and continuous improvement of Coiled Tubing Drilling (CTD) tools and systems. The Reliability Engineer plays a key role in reducing non-productive time (NPT), improving tool life, and enhancing service delivery through data-driven analysis and root cause investigations.
Key Responsibilities:
Tool Reliability & Performance Monitoring
- Track and analyze CTD tool performance across jobs and regions.
- Identify failure trends and initiate corrective actions to improve tool reliability.
- Maintain a database of tool runs, failure modes, and repair history.
Root Cause Analysis (RCA)
- Lead investigations into tool failures, service quality incidents, and NPT events.
- Use structured RCA methodologies (e.g., 5 Whys, Fishbone, FMEA) to identify root causes.
- Collaborate with engineering, manufacturing, and field teams to implement corrective and preventive actions (CAPA).
Data Analytics & Reporting
- Develop dashboards and reports to monitor KPIs such as MTBF (Mean Time Between Failures), tool utilization, and service quality.
- Provide insights to operations and engineering teams to support decision-making.
- Support digital initiatives for predictive maintenance and reliability modeling.
Tool Qualification & Field Trials
- Support the qualification of new CTD tools and technologies.
- Plan and monitor field trials, ensuring proper data capture and post-run analysis.
- Provide feedback to R&D and product engineering teams.
Documentation & Compliance
- Maintain accurate records of tool configurations, modifications, and performance logs.
- Ensure compliance with internal quality standards and client-specific requirements.
- Participate in audits and service quality reviews.
Qualifications & Experience:
- Bachelor's degree in Mechanical, Petroleum, or Reliability Engineering.
- 4–6 years of experience in Coiled Tubing or Well Intervention operations, with a focus on tool reliability or maintenance.
- Strong understanding of CTD tools, downhole dynamics, and failure mechanisms.
- Familiarity with reliability tools and software (e.g., Weibull analysis, SAP PM, Power BI).
Key Competencies:
- Analytical mindset with strong problem-solving skills.
- Excellent communication and cross-functional collaboration.
- Proficiency in data analysis and visualization tools.
- Commitment to safety, quality, and continuous improvement.
*Please remember that joining the Talent Community is not an application for any specific job at Baker Hughes but to have the privilege of being considered for an opportunity that suits your profile on priority.
Site Reliability Engineer
Job Description
At Flex Dental, we go beyond checking boxes; our integration and automation are unparalleled. Every feature serves a purpose, creating seamless collaboration with Open Dental’s practice management system. Our commitment to meaningful functionalities and innovative automation transforms workflows, ensuring efficiency and pushing the boundaries of Open Dental practice management.
Flex Dental is focused on simplifying the lives of dentists and their staff. We're a growing company specializing in a specific area of the dental industry and work exclusively with Open Dental to create a comprehensive solution. By integrating with Open Dental, we aim to deliver innovative tools and services that streamline dental practice management. In short, we're developing cutting-edge solutions for dentists and fostering a great workplace culture for our team.
Responsibilities
- Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
- Proactively monitor application health and performance across cloud infrastructure (AWS).
- Troubleshoot and prevent service interruptions in real-time, working closely with development teams to resolve incidents efficiently.
- Lead and participate in disaster recovery drills and security incident simulations.
- Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.
- Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).
- Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
- Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.
- Champion best practices in security, availability, performance, and incident response.
- Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS) with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.
- Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.
- Containerization: Experience with Docker for container-based deployment pipelines.
- Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.
- Backend Stack: Understanding of NestJS and scalable Node-based services.
- Databases: Proficient in MySQL and performance monitoring of relational databases.
- Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.
- Incident Response: Calm and focused under pressure with a structured approach to resolving outages and degradation.
- System Design: Ability to contribute to and review architectural designs for scalability and resiliency.
- Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.
- Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.
- Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.
- 3+ years of experience in a Site Reliability, DevOps, or related engineering role.
- Proven track record managing and scaling applications in a production AWS environment.
- Familiarity with full-stack environments, particularly those using Node.js.
- Experience maintaining and deploying databases such as MySQL with performance tuning.
- Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
- Commitment to uptime, performance, and security in fast-moving SaaS environments.
Site Reliability Engineer
Job Description
The Site Reliability Engineer (SRE) will lead an SRE squad focused on enhancing service reliability, performance, and scalability. They will drive automation to reduce toil, optimize system uptime, and manage incident resolution efforts. Responsible for building monitoring systems, optimizing infrastructure, and implementing safe deployment practices, the SRE will also ensure alignment with SLAs/SLOs and contribute to system development and code reviews. The role requires expertise in large-scale distributed systems, cloud infrastructure, and IT governance, with a focus on continuous service improvement and operational excellence.
Accountabilities
- Team Leadership & Reporting: Lead an SRE squad handling operations and automation; represent team in senior management briefings; produce dashboards and progress reports.
- Toil Reduction & Automation: Identify and eliminate toil through automation of repetitive tasks, enhancing team efficiency and service reliability.
- Service Reliability & Uptime: Maintain and improve service availability by aligning with SLAs/SLOs, designing failover strategies, and hardening systems.
- Performance & Latency Optimization: Enhance service performance and reduce latency using profiling tools, distributed tracing, load testing, and bottleneck analysis.
- Change & Deployment Management: Implement safe deployment practices (e.g., canary releases, blue-green deployments), ensuring minimal risk and rapid rollback options.
- Monitoring & Observability: Build and manage real-time monitoring and alerting systems to ensure service health and proactively detect anomalies.
- Incident Management & RCA: Lead incident resolution efforts, conduct root cause analyses (RCA), and develop response playbooks to reduce MTTR.
- Capacity & Cost Optimization: Perform infrastructure capacity planning and cost-efficient scaling to meet service demands.
- Development & Code Review: Contribute to system development, participate in design/code reviews, and ensure alignment with engineering best practices.
- Governance, Compliance & Documentation: Enforce IT governance standards, maintain documentation, perform quality assessments, and contribute to architecture and risk committees.
Education & Experience
- 7+ years of experience with data structures/algorithms and software development in two or more programming languages, and in operating and maintaining platforms, including 3+ years of experience in a DevOps or SRE role.
- Experience working in computing, distributed systems, storage, or networking.
- Expertise in designing, analysing, and troubleshooting large-scale distributed systems.
- Ability to debug, optimize code, and to automate routine tasks.
- Systematic problem-solving approach, coupled with effective verbal and written communication skills.
- Strong communication capability, able to articulate technical issues in terms of business risk and opportunity.
- Knowledge of the technical aspects of cloud computing, data centres, networks and virtual infrastructure.
- Strong analytical and problem-solving skills are necessary, along with knowledge of TSM processes and tools.
Etihad Airways, the national airline of the UAE, was formed in 2003 and quickly went on to become one of the world’s leading airlines. From its home in Abu Dhabi, Etihad flies to passenger and cargo destinations in the Middle East, Africa, Europe, Asia, Australia and North America. Together with Etihad’s codeshare partners, Etihad’s network offers access to hundreds of international destinations. In recent years, Etihad has received numerous awards for its superior service and products, cargo offering, loyalty programme and more. All this ties into Etihad’s ambitious Journey 2030 strategy. The airline plans to double its fleet size and triple the number of customers over the next six years as it sets out to be the airline everyone wants to fly!
Beware of fraudulent job offers from individuals or organizations claiming to represent the Etihad group. We will never ask for personal information, bank details, or payment during the recruitment process. Interviews are conducted face-to-face or via video/telephone before any formal offer. If you are asked for money, please treat it as fraudulent.
Fleet Reliability Engineer
Job Description
Responsible for ensuring that the performance, reliability and safety of the rolling stock fleet are maintained at the required standard. With the support of the Rolling Stock Inspectors, evaluate maintenance processes, procedures and regimes, monitor and interrogate fault and failure data, and carry out or facilitate analysis to determine the root cause of failures. Ensure appropriate control measures are in place to limit recurrence of critical failures. Gather technical evidence and data to support claims for warranty and contractual non-compliances by third parties. Carry out independent technical investigations for any safety-related defects and compile professional technical reports.
Take a lead role in continually developing and improving the ER Freight quality management system and its processes.
Oversight of any contracted maintenance services, scrutiny of their maintenance processes, procedures and methods of working. Ensuring any engineering changes or modifications are reviewed and approved in compliance with the ER Freight Engineering Change Control policy and industry best practice.
Roles and Responsibilities:
- Ensure all fleet maintenance policies, processes and procedures and document control systems are appropriate as per industry best practice or as required under the Entity in Charge of Maintenance (ECM) regulations.
- Ensuring adequate risk management and control measures are in place by the maintenance function.
- Management, review and investigation of any non-conformances.
- Lead the Quality and Warranty Management processes, supporting other functions such as the procurement department by carrying out technical evaluations and audits of safety critical suppliers or service providers.
- Lead the reliability and performance improvement process: collate and analyze failure data and all maintenance-related statistical data required for RAMS monitoring, handle key risks, and produce performance reports against agreed KPIs as required. For any contracted services, review and verify all data and reports provided by the third party (liaising with the contract manager as required).
- Develop and champion performance improvement initiatives, optimizing maintenance processes and activities and thereby promoting a continuous improvement culture.
- Lead and carry out technical investigations and compile professional technical reports for all safety and / or mission critical failures.
- Line management of the Rolling Stock inspectors, providing support and direction as required.
- Work in close cooperation and consultation with other functions and departments as required.
- Lead the asset management processes including the support of SAP.
- Be responsible for your own and others' safety.
- Produce detailed reports at the frequency and with the content directed by the line manager.
Academic Qualifications:
- Degree or equivalent qualification in relevant engineering specialism, or significant operational experience in a senior engineering role.
Professional Qualifications:
- Apprentice trained and / or relevant engineering qualifications.
- Full driving license.
Experience:
- Extensive experience in rolling stock maintenance and trainborne systems.
- Ability to analyze, interpret and provide recommendations on engineering issues.
- Experience in quality management, RAMS and asset management systems.
- Extensive experience in writing engineering documentation (maintenance instructions, standards, procedures, etc.).
- Extensive knowledge of engineering change and configuration control processes.
- Working in harsh environments.
- Proven ability to work under pressure.
Other Skills:
- Very good planning skills.
- Ability to work in a team and on an individual basis.
- Good leadership skills.
- Very good communication skills.
- English as first language or advanced level of written and verbal skill.
- Analytical skills for problem solving.
- Multi-skilled across a number of disciplines.
- Ability to train technicians and pass on experience and knowledge.
- Experience of working in fleet technical department or standards department.
Site Engineer
• Abu Dhabi, Abu Dhabi Emirate, United Arab Emirates
System Reliability Engineer
Job Description
The primary objective of this role is to oversee, coordinate and manage all faults reported to the centralised fault reporting centre, ensuring effective engineering and maintenance services for the metro system.
KEY RESPONSIBILITIES
- Collaborate in data collection and conduct thorough analysis of system failures covering Rolling Stock, ATC systems, Electrical, Mechanical, Civil & Track.
- Oversee, coordinate and handle all faults reported to the centre, categorising cases into safety and service critical faults and non-critical faults.
- Generate work orders for faults with critical impact on safety and KPI, to be addressed by the FLRT team.
- Deliver work requests for non-KPI impact faults to the Maintenance Centre for scheduled corrective maintenance.
- Arrange and coordinate repair works by dispatching and directing the FLRT team; maintain open communication throughout the fault rectification process.
- Maintain close liaison with the Engineering Controller during maintenance activities involving special trains, track possessions or temporary equipment isolation.
- Prepare fault reports and failure statistics.
- Manage fault alarms and system diagnostics monitoring within the centre.
- Monitor system performance and disseminate information to relevant stakeholders.
- Attend safety-related training and refresher courses as required.
- Perform 24-hour on-call duties, shift work and emergency response when necessary.
- Coordinate communication with Control Centre Controllers and personnel responsible for track possessions or rail vehicle movements.
- Maintain E&M FLRT competency and perform FLRT roles as required.
- Perform additional duties as instructed by the Senior Fault Controller or Live Fault Response Manager.
ESSENTIAL QUALIFICATIONS
KEY SKILLS AND COMPETENCIES
- Strong leadership, communication, interpersonal, analytical and decision-making skills are essential for this role.
- Excellent literacy and analytical skills.
- Ability to gather detailed information, diagnose faults and coordinate unscheduled maintenance with the line team.
- Understanding of complex systems and proficient IT skills.
- Excellent communication skills, with the ability to manage multiple tasks efficiently in a fast-paced environment.
- Strong organisational skills and attention to detail.
- Comprehensive understanding of safety issues in railway operations.
EXPERIENCE AND QUALIFICATIONS
- Minimum 1 year experience as a competent E&M FLRT.
- At least 2 years of experience in Electrical & Mechanical systems, Civil & Track systems, ATC systems or Rolling Stock and Depot systems.
- Experience in a power utilities control room, train control room, fault reporting centre or station control room is desirable.
EDUCATIONAL BACKGROUND
- A degree or higher diploma in Electrical, Mechanical, Electronics, Civil, Computers or Information Technology Engineering.
Site Reliability Engineer
Posted 5 days ago
Job Description
At Flex Dental, we go beyond checking boxes; our integration and automation are unparalleled. Every feature serves a purpose, creating seamless collaboration with Open Dental’s practice management system. Our commitment to meaningful functionalities and innovative automation transforms workflows, ensuring efficiency and pushing the boundaries of Open Dental practice management.
Flex Dental is focused on simplifying the lives of dentists and their staff. We're a growing company specializing in a specific area of the dental industry and work exclusively with Open Dental to create a comprehensive solution. By integrating with Open Dental, we aim to deliver innovative tools and services that streamline dental practice management. In short, we're developing cutting-edge solutions for dentists and fostering a great workplace culture for our team.
Responsibilities
- Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.
- Proactively monitor application health and performance across cloud infrastructure (AWS).
- Troubleshoot and prevent service interruptions in real time, working closely with development teams to resolve incidents efficiently.
- Lead and participate in disaster recovery drills and security incident simulations.
- Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.
- Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).
- Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.
- Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.
- Champion best practices in security, availability, performance, and incident response.
- Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS) with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.
- Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.
- Containerization: Experience with Docker for container-based deployment pipelines.
- Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.
- Backend Stack: Understanding of NestJS and scalable Node-based services.
- Databases: Proficient in MySQL and performance monitoring of relational databases.
- Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.
- Incident Response: Calm and focused under pressure with a structured approach to resolving outages and degradation.
- System Design: Ability to contribute to and review architectural designs for scalability and resiliency.
- Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.
- Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.
- Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.
- 3+ years of experience in a Site Reliability, DevOps, or related engineering role.
- Proven track record managing and scaling applications in a production AWS environment.
- Familiarity with full-stack environments, particularly those using Node.js.
- Experience maintaining and deploying databases such as MySQL with performance tuning.
- Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.
- Commitment to uptime, performance, and security in fast-moving SaaS environments.