149 DevOps Engineer jobs in Hyderabad
Site Reliability Engineer
Posted today
Job Description
**What you get to do in this role:**
+ Drive immediate relief and provide a sustainable resolution to issues within the ServiceNow platform.
+ Use knowledge and experience in software development, application support, systems engineering and networking to proactively prevent issues from reoccurring.
+ Drive internal stakeholders and partner teams to improve the reliability, scalability and performance of the infrastructure through improved system design.
+ Drive and contribute to a culture of intolerance to manual activity, which results in an automation environment delivering repeatable and scalable response to system issues.
**To be successful in this role you have:**
+ Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
+ 3-5 years of experience in Linux systems.
+ Coding in any development/scripting language such as JavaScript, Python, C++, or Java.
+ Networking skills and IP addressing.
+ MySQL database administration.
+ Monitoring of performance/availability in systems, applications and networks.
+ Uncompromising attention to detail.
+ Ability to work in shifts that cover one weekend day.
**Work Personas**
We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service.
**Equal Opportunity Employer**
ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements.
**Accommodations**
We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact ServiceNow for assistance.
**Export Control Regulations**
For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities.
Site Reliability Engineer
Posted 1 day ago
Job Description
At Amgen, if you feel like you're part of something bigger, it's because you are. Our shared mission, to serve patients living with serious illnesses, drives all that we do.
Since 1980, we've helped pioneer the world of biotech in our fight against the world's toughest diseases. With our focus on four therapeutic areas (Oncology, Inflammation, General Medicine, and Rare Disease), we reach millions of patients each year. As a member of the Amgen team, you'll help make a lasting impact on the lives of patients as we research, manufacture, and deliver innovative medicines to help people live longer, fuller, happier lives.
Our award-winning culture is collaborative, innovative, and science based. If you have a passion for challenges and the opportunities that lie within them, you'll thrive as part of the Amgen team. Join us and transform the lives of patients while transforming your career.
Site Reliability Engineer
**What you will do**
Let's do this. Let's change the world. In this vital role you will be responsible for the reliability, stability, performance, scalability, and security of platforms that support Amgen's digital products and engineering teams. This hands-on role focuses on supporting cloud-based infrastructure, automating operations, maintaining observability, and improving platform reliability through code.
You'll work closely with senior engineers and cross-functional teams to support CI/CD workflows, container platforms, incident response, and enterprise tooling, all while adopting modern SRE principles and practices.
This role is ideal for engineers who have foundational site reliability experience and are looking to expand their skills in a cloud-native, enterprise-scale environment.
**Roles & Responsibilities:**
**Infrastructure & Platform Support**
+ Provision and manage cloud infrastructure using Infrastructure as Code (IaC)
+ Support container orchestration platforms, ensuring availability, access control, and resource management
+ Assist in configuring and maintaining CI/CD pipelines and environments
**Monitoring & Incident Response**
+ Set up and maintain observability tools to track system health and performance
+ Participate in alert tuning, incident resolution, and root cause analysis
+ Support integration of observability platforms with incident response workflows
**Automation & Platform Operations**
+ Automate routine platform tasks such as provisioning, patching, and configuration
+ Write scripts to improve platform reliability, reduce manual work, and enforce compliance
+ Participate in platform upgrades, maintenance windows, and service validation efforts
**AI Enablement & Intelligence**
+ Support the adoption of AI-assisted operational tools for log analysis, anomaly detection, and predictive alerts
+ Collaborate with senior engineers to evaluate AI/ML-based observability and automation platforms
+ Assist in integrating AI-driven insights into dashboards, alerts, or incident workflows
+ Stay current with emerging AI trends in infrastructure and site reliability, and contribute to tool evaluations and pilots
**Collaboration & Enablement**
+ Work with development, QA, and security teams to ensure reliable and secure deployments
+ Document operational procedures, playbooks, and system runbooks
+ Learn and support enterprise collaboration platforms and internal tooling
+ Participate in Agile and SAFe delivery processes (including sprint planning, stand-ups, retrospectives, and PI planning) to ensure security and platform reliability are embedded across development cycles.
**What we expect of you**
We are all different, yet we all use our unique contributions to serve patients. The professional we seek has these qualifications.
**Basic Qualifications:**
+ Master's or Bachelor's degree and 5 to 9 years of experience in Computer Science, IT, or a related field
+ 4 years of hands-on related experience in site reliability, DevOps, or platform engineering roles
+ Hands-on experience with cloud platforms, preferably AWS
+ Familiarity with Kubernetes or container orchestration technologies
+ Exposure to CI/CD practices and pipeline automation
+ Experience troubleshooting Linux systems, processes, and services
**Preferred Qualifications:**
**Must-Have Skills:**
+ Practical experience with **cloud platforms** (e.g., AWS, Azure, or GCP), including compute, networking, IAM, and storage services
+ Familiarity with **container orchestration platforms** (e.g., Kubernetes, Docker), including basic workload deployment and troubleshooting
+ Experience using **Infrastructure as Code (IaC)** tools such as **Terraform** or **CloudFormation**
+ Working knowledge of **Linux administration**, including system services, package management, and file system structures
+ Hands-on exposure to **CI/CD platforms** (e.g., GitLab CI, Jenkins, GitHub Actions) and pipeline troubleshooting
+ Proficiency in **scripting or automation languages** like **Python**, **Bash**, or **Go**
+ Exposure to **observability tooling** (e.g., **Dynatrace**, **Prometheus**, or **Grafana**) for monitoring and alerting
+ Familiarity with **incident management practices** and tools (e.g., runbooks, escalation workflows, basic alert tuning)
+ Version control skills using **Git** and understanding of branching strategies
+ Experience supporting or integrating **enterprise collaboration platforms** (e.g., Jira, Confluence, ServiceNow)
+ Interest and basic understanding of **AI/ML tools** used in infrastructure and operations (e.g., anomaly detection, intelligent alerting, log analysis)
**Good-to-Have Skills:**
+ Experience using Infrastructure as Code (IaC) tools like Terraform or CloudFormation
+ Familiarity with IT incident response workflows and ticketing platforms
+ Knowledge of secrets management, configuration management tools (e.g., Ansible), or logging frameworks
+ Exposure to **AI-assisted tooling** (e.g., AIOps platforms, AI-enhanced alerting, anomaly detection)
**Professional Certifications (Preferred)**
+ Cloud DevOps Certification (AWS/Azure/GCP)
+ Certified Kubernetes Administrator (CKA) or Security Specialist (CKS)
+ CI/CD Platform Certification
+ ITIL Foundation or equivalent service management certification
**Soft Skills:**
+ Strong analytical and troubleshooting skills
+ Collaborative and proactive mindset
+ Effective communication and documentation practices
+ Curiosity and willingness to adopt new tools and methods, including AI integrations
+ Ability to manage time and prioritize tasks in dynamic environments
**Shift Information:** This position is an onsite role and may require working during later hours to align with business hours. Candidates must be willing and able to work outside of standard hours as required to meet business needs.
**What you can expect of us**
As we work to develop treatments that take care of others, we also work to care for your professional and personal growth and well-being. From our competitive benefits to our collaborative culture, we'll support your journey every step of the way.
In addition to the base salary, Amgen offers competitive and comprehensive Total Rewards Plans that are aligned with local industry standards.
**Apply now and make a lasting impact with the Amgen team.**
**careers.amgen.com**
As an organization dedicated to improving the quality of life for people around the world, Amgen fosters an inclusive environment of diverse, ethical, committed and highly accomplished people who respect each other and live the Amgen values to continue advancing science to serve patients. Together, we compete in the fight against serious disease.
Amgen is an Equal Opportunity employer and will consider all qualified applicants for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, protected veteran status, disability status, or any other basis protected by applicable law.
We will ensure that individuals with disabilities are provided reasonable accommodation to participate in the job application or interview process, to perform essential job functions, and to receive other benefits and privileges of employment. Please contact us to request accommodation.
Site Reliability Engineer
Posted 3 days ago
Job Description
Hello Connections,
Greetings of the day!
We have immediate openings for SRE.
Role - Site Reliability Engineer
Experience - 7 to 12 years
Work Location - Hyderabad
Notice Period - Immediate
Interested candidates can share their CVs to -
Site Reliability Engineer
Posted 3 days ago
Job Description
We are currently seeking an SRE Engineer for a position in Hyderabad.
**Job ID: **
**Apply Here:** (TCS iBegin)
**Job Description:**
- Proven experience as a DevOps/SRE Engineer
- Expertise in managing and optimizing GCP or Azure cloud-native services and AI/ML integration.
- Experience or knowledge of Container technology such as Docker, Buildah and Kubernetes (GKE, AKS)
- Must have 2+ scripting and programming experience (Python, Bash, .)
- Proficiency in infrastructure-as-code tools, particularly Terraform and ArgoCD
- Familiarity with observability tools such as Prometheus, Grafana, OpenTelemetry
- Solid understanding of CI/CD concepts
Site Reliability Engineer
Posted 3 days ago
Job Description
Job Title: Site Reliability Engineer (SRE) | Fintech | Kubernetes | Datadog | 24/7 Support
Department: Site Reliability Engineering
Location: Hyderabad, India
Employment Type: Full-Time
Notice period: 0-15 Days
We’re hiring a Site Reliability Engineer to join our SRE team focused on maintaining the performance, reliability, and availability of our fintech platforms.
Key Responsibilities:
- Triage and resolve production incidents; respond to alerts from Datadog
- Monitor and troubleshoot Kubernetes workloads and cloud environments
- Develop automation tools using C#, Java, Python, PowerShell, or Bash
- Support and improve CI/CD pipelines and service uptime
- Participate in 24/7 on-call rotations, including weekends and holidays
- Ensure compliance with PCI DSS, ISO 27001, and other fintech standards
Requirements:
- 3+ years in SRE, DevOps, or related roles
- Hands-on with monitoring tools, Kubernetes, and scripting/programming
- Strong incident management and root cause analysis skills
- Experience in high-transaction or regulated environments
- Excellent communication and cross-team collaboration skills
Nice to Have: Cloud experience (AWS, Azure, GCP), Helm, CI/CD tools, ITIL/agile exposure
Site Reliability Engineer
Posted 3 days ago
Job Description
YOUR IMPACT: Reliability, Automation, and Observability
As a hybrid Site Reliability Engineer/DevOps Engineer, you'll be a key driver in ensuring the stability, performance, and scalability of our mission-critical SaaS platform. You'll apply engineering principles to operational challenges, constantly striving to eliminate toil through automation.
Operational Excellence & Reliability
● Provide day-to-day management of system alerts, check system health, and escalate issues as necessary to maintain high availability.
● Actively participate in a 24x7 on-call rotation for critical SaaS platform incidents, and be available in case of emergencies.
● Lead the incident response process, ensuring fast and effective mitigation and resolution of production issues.
● Perform thorough Root Cause Analysis (RCA) and lead blameless post-mortems to identify systemic weaknesses and create a corrective action plan to prevent recurrence.
● Collaborate with engineering teams to set and enforce error budgets (derived from SLOs, or Service Level Objectives), ensuring a healthy balance between development speed and system stability.
Platform Automation & Infrastructure Development
● Automate routine operational tasks to reduce manual effort and "toil" and increase overall team efficiency.
● Design, deploy, and maintain cloud infrastructure using Infrastructure as Code (IaC), specifically leveraging Terraform and Helm for deployment to EKS/K8s clusters.
● Improve existing infrastructure health by developing and implementing checks and scripts to proactively correct known issues and self-heal the platform.
● Maintain, develop, and evolve our Continuous Integration/Continuous Delivery (CI/CD) deployment code and pipelines.
● Learn and maintain existing infrastructure running under Docker and Docker Swarm while driving migration strategies toward EKS/K8s.
● Implement and integrate new technologies and services into our Cloud Infrastructure to enhance platform capabilities and resilience.
Monitoring & Observability
● Design and implement comprehensive Observability strategies across all three pillars: Metrics, Logs, and Traces.
● Proactively create and refine robust monitoring and alerting configurations within the EKS/K8s ecosystem.
● Utilize and maintain our Observability platform, Datadog, to gather performance data, create complex synthetic tests, and visualize system health via dashboards.
● Leverage existing monitoring solutions such as Grafana and Prometheus while planning and executing the migration or integration of data into a unified platform.
● Document all issues, remediation steps, system architecture, and runbooks to facilitate knowledge transfer and rapid incident response.
● Collaborate closely with Support, Customer Success, Migration, and Professional Services teams to provide the highest level of SaaS service and minimize customer impact during changes.
● Apply a real customer focus when planning deployments/updates, always considering the impact on the end-user before making changes.
YOUR EXPERIENCE: Essential Skills and Qualifications
● Hands-on AWS Cloud Engineer experience, with expert working knowledge of the AWS Cloud ecosystem, including a good understanding of AWS IAM roles and policies.
● Proficiency with container orchestration technologies: EKS/Kubernetes (K8s).
● Demonstrable experience with Infrastructure as Code (IaC) tools, specifically Terraform and Helm.
● Working experience with Docker and maintaining systems using Docker Swarm.
● Expertise in setting up and managing logging and monitoring solutions. Direct experience with Datadog is highly preferred, with experience in setting up APM, infrastructure monitoring, and custom dashboards.
● Experience with existing monitoring solutions such as Grafana and Prometheus is required.
● Proficient in a Linux environment and strong skills in Bash and/or Python scripting for automation and troubleshooting.
● A strong understanding of web technologies, including REST APIs, Systems Architecture, Design, and Databases.
● Experience in Product/Application Support for high-availability SaaS-based products.
● Experience in designing, implementing, and operating in a DevSecOps environment.
● Excellent oral and written communication skills, with the ability to clearly explain complex technical issues and RCAs to both technical and customer-facing audiences.
Site Reliability Engineer
Posted 3 days ago
Job Description
Role: Site Reliability Engineer (SRE) III – Data Engineering
Location: Hyderabad
Employment Type: Full Time
Experience: 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U)
Primary Skills (Must-Have): AWS, CI/CD, Jenkins, IAAC, Terraform, Kubernetes
Secondary Skills (Good-to-Have): AWS systems; Dataiku data, Platform updates and patching
Tools & Platforms:
- Data Warehousing & Processing: Snowflake, Redshift, Apache Airflow, dbt
- CI/CD & Deployment: Jenkins, GitHub Actions, AWS CodePipeline, Terraform
- Cloud & Event Processing: AWS Lambda, API Gateway, SNS/SQS, Kafka, Step Functions
- Monitoring & Logging: DataDog, AWS CloudWatch, Prometheus, Splunk
- Incident Management: PagerDuty, Opsgenie, AWS Health Dashboard
- Collaboration & Code Review: GitHub, Jira, Confluence
Key Responsibilities
Data Pipeline Reliability & Observability:
- Maintain and optimize highly available, fault-tolerant infrastructure for data pipelines, ETL jobs, and real-time data processing
- Implement end-to-end monitoring of Airflow DAGs, Snowflake queries, and AWS-based data workflows
- Automate data pipeline health checks, error handling, and auto-remediation strategies
Infrastructure & Cloud Automation:
- Deploy and manage AWS-based data infrastructure using Terraform and CloudFormation
- Optimize Kubernetes (EKS) clusters for processing large-scale datasets and real-time analytics
- Ensure high availability and cost-efficient scaling for Redshift, Snowflake, and data storage solutions
Performance, Monitoring & Incident Response:
- Implement real-time monitoring, logging, and alerting using DataDog, AWS CloudWatch, and Prometheus
- Define and track SLOs, SLIs, and error budgets to improve data reliability and uptime
- Conduct Root Cause Analysis (RCA), security audits, and post-mortems for incidents
Security & Compliance:
- Ensure GDPR, CCPA, and SOC 2 compliance for data storage, access controls, and retention policies
- Implement AWS security best practices (IAM, KMS, Shield, WAF) to secure data access and encryption
- Secure API gateways, authentication mechanisms, and data lake permissions to prevent unauthorized access
Collaboration & Leadership:
- Work closely with data engineers, analytics teams, and DevOps engineers to enhance data platform reliability
- Participate in incident response drills, disaster recovery planning, and security compliance reviews
- Advocate for best practices in automation, cost optimization, and cloud-native data solutions
Site Reliability Engineer
Posted 3 days ago
Job Description
Role: Site Reliability Engineer
Location: Hyderabad
Notice Period: Immediate to 20 Days
Employment Type: Full Time
Experience: 7–12 years in site reliability, cloud-based data infrastructure, data pipeline observability, automation, and high-availability engineering within EdTech platforms (2U)
Primary Skills (Must-Have): AWS, CI/CD, Jenkins, IAAC, Terraform, Kubernetes
Secondary Skills (Good-to-Have): AWS systems; Dataiku data, Platform updates and patching
Tools & Platforms:
- Data Warehousing & Processing: Snowflake, Redshift, Apache Airflow, dbt
- CI/CD & Deployment: Jenkins, GitHub Actions, AWS CodePipeline, Terraform
- Cloud & Event Processing: AWS Lambda, API Gateway, SNS/SQS, Kafka, Step Functions
- Monitoring & Logging: DataDog, AWS CloudWatch, Prometheus, Splunk
- Incident Management: PagerDuty, Opsgenie, AWS Health Dashboard
- Collaboration & Code Review: GitHub, Jira, Confluence
Key Responsibilities
Data Pipeline Reliability & Observability:
- Maintain and optimize highly available, fault-tolerant infrastructure for data pipelines, ETL jobs, and real-time data processing
- Implement end-to-end monitoring of Airflow DAGs, Snowflake queries, and AWS-based data workflows
- Automate data pipeline health checks, error handling, and auto-remediation strategies
Infrastructure & Cloud Automation:
- Deploy and manage AWS-based data infrastructure using Terraform and CloudFormation
- Optimize Kubernetes (EKS) clusters for processing large-scale datasets and real-time analytics
- Ensure high availability and cost-efficient scaling for Redshift, Snowflake, and data storage solutions
Performance, Monitoring & Incident Response:
- Implement real-time monitoring, logging, and alerting using DataDog, AWS CloudWatch, and Prometheus
- Define and track SLOs, SLIs, and error budgets to improve data reliability and uptime
- Conduct Root Cause Analysis (RCA), security audits, and post-mortems for incidents
Security & Compliance:
- Ensure GDPR, CCPA, and SOC 2 compliance for data storage, access controls, and retention policies
- Implement AWS security best practices (IAM, KMS, Shield, WAF) to secure data access and encryption
- Secure API gateways, authentication mechanisms, and data lake permissions to prevent unauthorized access
Collaboration & Leadership:
- Work closely with data engineers, analytics teams, and DevOps engineers to enhance data platform reliability
- Participate in incident response drills, disaster recovery planning, and security compliance reviews
- Advocate for best practices in automation, cost optimization, and cloud-native data solutions
Site Reliability Engineer
Posted 3 days ago
Job Description
Job Role: Site Reliability Engineer (SRE) – GCP
Experience: 3+ years
Location: Hyderabad
About SIDGS:
SIDGS is a premium global systems integrator and global implementation partner of Google, providing digital solutions and services to Fortune 500 companies. Our digital solutions span the following domains: User Experience, CMS, API Management, Microservices, DevOps, Cloud, Service Mesh, Artificial Intelligence, and RPA.
We create innovative solutions in the Digital, API Management, Cloud, and DevOps space in partnership with Google. We understand that every business has a unique set of challenges and opportunities, and we leverage our unique industry insights, honed through decades of combined experience in the technology sector, to deliver the products, solutions, and services necessary to achieve the best customer satisfaction and deliver a positive impact on communities.
Location: Hyderabad (Work from Office only)
Job Type: Full Time
- The Site Reliability Engineer (SRE) Level 1 will be responsible for maintaining and improving the reliability, availability, and performance of the systems.
- We need someone who can join within 0-30 days only.
- We are looking for someone who is passionate about learning and developing their skills in system reliability, automation, and incident response. You will work closely with senior SREs, DevOps teams, and other stakeholders to ensure the services meet the highest standards of reliability and performance.
Key Responsibilities:
- Monitor system performance and availability across GCP and Anthos environments.
- Respond to incidents, perform root cause analysis, and implement fixes.
- Escalate issues to senior team members as needed.
- Assist in developing and maintaining automation scripts and tools to improve efficiency.
- Collaborate with senior SREs to identify and implement improvements in system performance and reliability.
- Document processes, incident reports, and troubleshooting guides.
- Continuously learn and apply new skills and technologies relevant to the SRE role.
- Participate in training sessions and workshops to enhance knowledge.
- Basic knowledge of monitoring tools
Skills:
- 1-2 years of relevant experience in 24x7 support of enterprise-level applications.
- Strong problem-solving skills and attention to detail.
- Excellent communication and teamwork abilities.
- Willingness to learn and adapt in a fast-paced environment
- Knowledge of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI).
- Basic understanding of Linux/Unix systems and networking concepts.
- Familiarity with Kubernetes and container orchestration.
- Nice to have: experience in an Apigee environment.
- Preferred Qualifications: Graduate in Computers, Engineering or similar educational qualification
Site Reliability Engineer
Posted 3 days ago
Job Description
About the Role
We are seeking an experienced Site Reliability / Azure DevOps Engineer with Dynatrace experience to join our engineering team and contribute to scalable CI/CD practices, infrastructure automation, and cloud operations. The ideal candidate will have deep expertise in Azure DevOps, Infrastructure as Code (IaC), Azure services, and modern DevOps practices.
Key Responsibilities
- Design and maintain CI/CD pipelines using Azure DevOps.
- Implement Infrastructure as Code using tools like ARM, Bicep, or Terraform.
- Automate deployment and configuration of cloud resources on Azure.
- Must have experience in Dynatrace.
- Collaborate with development, QA, and cloud teams to streamline DevOps workflows.
- Monitor, secure, and optimize infrastructure and pipelines for performance and cost-efficiency.
- Troubleshoot build/deployment failures and perform root cause analysis.
- Set up dashboards, alerts, and reporting for infrastructure health and deployment metrics.
Required Skills
- Strong hands-on experience with Azure DevOps (Repos, Pipelines, Artifacts, Boards) and Dynatrace.
- Proficiency in PowerShell and/or Bash scripting.
- Experience with Infrastructure as Code using ARM templates, Bicep, or Terraform.
- Familiarity with containerization (Docker, Kubernetes) and Azure Kubernetes Service (AKS).
- Experience integrating with GitHub Actions, SonarQube, Nexus/Artifactory is a plus.
- Experience working with Azure services like App Services, Key Vault, Storage, Monitor, etc.
- Exposure to Agile/Scrum methodology and DevSecOps principles.
Preferred Qualifications
- Microsoft Azure certifications (e.g., AZ-400, AZ-104).
- Experience with hybrid and multi-cloud environments.
- Familiarity with security best practices in CI/CD pipelines.