739 SRE Manager jobs in India
Manager, Site Reliability Engineering
Posted 2 days ago
Job Description
**What you get to do in this role:**
As a Manager of the SRE team your responsibilities will be:
+ Team management, career development, project prioritization and performance review.
+ Drive a culture of intolerance to manual activities that promotes automation efforts.
+ Drive initiatives with partner teams to improve the reliability of the infrastructure.
+ Act as crisis manager to orchestrate actions toward sustainable solutions.
+ Analysis and evaluation of existing processes to drive continuous improvement and efficiencies.
+ Provide training and support to partner teams that interface with SRE.
+ Onboarding of new hires to enable their success in their roles.
+ Onboarding of new technologies, systems and automations into the team.
**To be successful in this role you have:**
+ Experience in leveraging or critically thinking about how to integrate AI into work processes, decision-making, or problem-solving. This may include using AI-powered tools, automating workflows, analyzing AI-driven insights, or exploring AI's potential impact on the function or industry.
+ 10+ years of overall experience.
+ Hands-on technical skills in Linux, databases, systems and coding.
+ Experience in team management.
+ Design and implementation of monitoring solutions for large and scalable environments.
+ Experience with cloud operations, follow-the-sun and geographically distributed teams.
+ Experience working in software, platform and infrastructure delivered as a service.
+ Knowledge of principles and methods involving ITIL v3.
+ Outstanding interpersonal skills and strong communication skills, both written and verbal.
+ Uncompromising attention to detail.
We also have pluses. These are not a 'must', but please highlight them on your resume if you have:
+ RHCE, CCNA, ITIL or other industry certifications
**Work Personas**
We approach our distributed world of work with flexibility and trust. Work personas (flexible, remote, or required in office) are categories that are assigned to ServiceNow employees depending on the nature of their work and their assigned work location. To determine eligibility for a work persona, ServiceNow may confirm the distance between your primary residence and the closest ServiceNow office using a third-party service.
**Equal Opportunity Employer**
ServiceNow is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, creed, religion, sex, sexual orientation, national origin or nationality, ancestry, age, disability, gender identity or expression, marital status, veteran status, or any other category protected by law. In addition, all qualified applicants with arrest or conviction records will be considered for employment in accordance with legal requirements.
**Accommodations**
We strive to create an accessible and inclusive experience for all candidates. If you require a reasonable accommodation to complete any part of the application process, or are unable to use this online application and need an alternative method to apply, please contact for assistance.
**Export Control Regulations**
For positions requiring access to controlled technology subject to export control regulations, including the U.S. Export Administration Regulations (EAR), ServiceNow may be required to obtain export control approval from government authorities for certain individuals. All employment is contingent upon ServiceNow obtaining any export license or other approval that may be required by relevant export control authorities.
Site Reliability Engineering Lead
Posted 2 days ago
Job Description
Are you a visible champion with a 'can do' attitude and enthusiasm that inspires others?
About the Business
LexisNexis Risk Solutions is the essential partner in the assessment of risk. Within our Business Services vertical, we offer a multitude of solutions focused on helping businesses of all sizes drive higher revenue growth, maximize operational efficiencies, and improve customer experience. Our solutions help our customers solve difficult problems in the areas of Anti-Money Laundering/Counter Terrorist Financing, Identity Authentication & Verification, Fraud and Credit Risk mitigation and Customer Data Management.
About the Team
This Team performs complex research, design, and software development assignments within a software functional area or product line, and provides direct input to project plans, schedules, and methodology in the development of cross-functional software products.
About the Role
We are seeking a highly skilled and experienced Lead Site Reliability Engineer (SRE) to drive reliability, scalability, and performance across our cloud infrastructure, with a strong emphasis on cloud security and compliance. This role blends reliability engineering with security best practices to ensure our cloud infrastructure is not only scalable and resilient but also secure and compliant.
Responsibilities:
+ Build and maintain secure, scalable, and resilient systems using principles from software development, IT operations, and cybersecurity.
+ Implement and manage continuous security monitoring and threat detection to protect systems and data.
+ Collaborate with development teams to integrate security and observability tools into CI/CD pipelines, automating security checks.
+ Develop and maintain tools, scripts, and pipelines for infrastructure management, code deployment, and system monitoring.
+ Address vulnerabilities in code libraries and infrastructure (e.g., OS packages) through patching and remediation.
+ Review and remediate findings from cloud security tools (e.g., Wiz, Azure/AWS Security Recommendations).
+ Provide centralized support for common cloud configuration patterns while addressing application-specific issues.
+ Lead efforts to remediate vulnerabilities across platforms and cloud providers, working with centralized teams for efficiency.
+ Partner with application teams to resolve specific security findings and improve overall system resilience.
Requirements:
+ Experience (typically 10+ years) in DevOps, Site Reliability Engineering (SRE), or Cloud Engineering.
+ Strong understanding of DevOps practices and infrastructure-as-code tools (e.g., Terraform).
+ Proficiency in scripting or programming languages (e.g., Python, Bash, Go).
+ Hands-on experience with cloud platforms such as Azure, AWS, or Google Cloud Platform (GCP).
+ Familiarity with cloud security, vulnerability management, and patching practices.
+ Experience with threat modeling and vulnerability assessments for code and infrastructure.
+ Ability to interpret and act on cloud security recommendations and configuration findings.
Learn more about the LexisNexis Risk team and how we work.
We are committed to providing a fair and accessible hiring process. If you have a disability or other need that requires accommodation or adjustment, please let us know by completing our Applicant Request Support Form.
Criminals may pose as recruiters asking for money or personal information. We never request money or banking details from job applicants.
Please read our Candidate Privacy Policy.
We are an equal opportunity employer: qualified applicants are considered for and treated during employment without regard to race, color, creed, religion, sex, national origin, citizenship status, disability status, protected veteran status, age, marital status, sexual orientation, gender identity, genetic information, or any other characteristic protected by law.
USA Job Seekers:
EEO Know Your Rights.
RELX is a global provider of information-based analytics and decision tools for professional and business customers, enabling them to make better decisions, get better results and be more productive.
Our purpose is to benefit society by developing products that help researchers advance scientific knowledge; doctors and nurses improve the lives of patients; lawyers promote the rule of law and achieve justice and fair results for their clients; businesses and governments prevent fraud; consumers access financial services and get fair prices on insurance; and customers learn about markets and complete transactions.
Our purpose guides our actions beyond the products that we develop. It defines us as a company. Every day across RELX our employees are inspired to undertake initiatives that make unique contributions to society and the communities in which we operate.
Site Reliability Engineering Manager
Posted 6 days ago
Job Description
**Role**: Manager, Site Reliability Engineering
Required Technical Skill Set: Manager, Site Reliability Engineering
Desired Experience Range: 12 - 18 yrs
Notice Period: Immediate to 90 days only
Location of Requirement: Bangalore
We are currently planning to do a Virtual Interview
Job Description:
As the Manager of Site Reliability Engineering on the Infrastructure Reliability team, you will be responsible for building and leading a high-performing team dedicated to ensuring our infrastructure is reliable, scalable, and efficient. Your primary focus will be on people management, strategic planning, and technical leadership. You will mentor and guide your team members, fostering their professional growth and creating a culture of ownership and operational excellence. You will define the team's vision and roadmap, aligning it with the company's broader goals, and work with cross-functional partners to prioritize and execute projects. You will oversee the development of SRE solutions across our globally distributed environments and empower your team to improve service resiliency, automate processes, and conduct effective incident response and capacity planning to guarantee the highest level of uptime and Quality of Service (QoS) for our internal customers.
Responsibilities and Duties of the Role:
- Lead, mentor, and grow a team of software and infrastructure automation engineers.
- Develop and execute the roadmap for the Infrastructure Reliability Engineering team.
- Collaborate with engineering and operations teams to identify and prioritize reliability improvements.
- Drive the design and implementation of tools and automation for infrastructure testing and self-healing.
- Establish and monitor key performance indicators (KPIs) for infrastructure reliability.
Basic Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- 12+ years of experience in a software engineering or infrastructure role.
- 5+ years of experience in a leadership or management role.
- Lead a team of Infrastructure Reliability Engineers on projects for users and be directly responsible for uptime.
- Own end-to-end availability and performance of key services and build automation to prevent problem recurrence. Automate response to all non-exceptional service conditions.
- Set the standard for excellence by mentoring team members and establishing trust through superior technical delivery.
- Proficiency in Kubernetes administration and modern CI/CD techniques and Infrastructure as Code (IaC).
- Deep understanding of Linux operating systems and TCP/IP fundamentals.
- Experience with monitoring, metrics gathering, APM, container management, and log collection tools.
- Creative problem solver with excellent debugging skills and great documentation abilities.
- Strong understanding of networking, storage, security, and compute technologies.
Preferred Qualifications
- Experience building and leading a Site Reliability Engineering (SRE) or Infrastructure Reliability team.
- Expertise with complex system architectures and infrastructures.
- Proficiency in one or more programming languages (e.g., Python, Go, Java).
- Passion for automation, scalability, and building reliable systems from the ground up.
Site Reliability Engineering Manager
Posted 6 days ago
Job Description
Good day,
We have an immediate opportunity for a Senior Site Reliability Engineer.
Job Role: Senior Site Reliability Engineer
Job Location: Synechron (Bengaluru/Pune)
Experience: 10 to 15 years
Notice: Immediate joiner
About Company:
At Synechron, we believe in the power of digital to transform businesses for the better. Our global consulting firm combines creativity and innovative technology to deliver industry-leading digital solutions. Synechron’s progressive technologies and optimization strategies span end-to-end Artificial Intelligence, Consulting, Digital, Cloud & DevOps, Data, and Software Engineering, servicing an array of noteworthy financial services and technology firms. Through research and development initiatives in our FinLabs we develop solutions for modernization, from Artificial Intelligence and Blockchain to Data Science models, Digital Underwriting, mobile-first applications and more. Over the last 20+ years, our company has been honoured with multiple employer awards, recognizing our commitment to our talented teams. With top clients to boast about, Synechron has a global workforce of 13,950+, and has 52 offices in 20 countries within key global markets. For more information on the company, please visit our website or LinkedIn community.
Diversity, Equity, and Inclusion
Diversity & Inclusion are fundamental to our culture, and Synechron is proud to be an equal opportunity workplace and an affirmative-action employer. Our Diversity, Equity, and Inclusion (DEI) initiative ‘Same Difference’ is committed to fostering an inclusive culture – promoting equality, diversity and an environment that is respectful to all. We strongly believe that a diverse workforce helps build stronger, successful businesses as a global company. We encourage applicants from across diverse backgrounds, race, ethnicities, religion, age, marital status, gender, sexual orientations, or disabilities to apply. We empower our global workforce by offering flexible workplace arrangements, mentoring, internal mobility, learning and development programs, and more.
All employment decisions at Synechron are based on business needs, job requirements and individual qualifications, without regard to the applicant’s gender, gender identity, sexual orientation, race, ethnicity, disabled or veteran status, or any other characteristic protected by law.
JOB DESCRIPTION
Experience & qualifications:
10 to 15 years of IT experience with hands-on experience in CI/CD pipelines, DevOps tools, and site reliability practices.
Excellent database knowledge with indexing, performance tuning, and query optimization skills.
Excellent skills in Linux, networking, and firewall configurations.
Experience in production monitoring systems
Experience in DevOps tools like Docker, Git, Kubernetes etc.
Experience in DevOps - infrastructure as code, containerization, and automation of infrastructure provisioning
Experience in studying and suggesting improvements on application workload efficiency.
Strong admin level experience in application monitoring tools like Splunk.
Strong configuration experience in job scheduler tools like Control-M or AutoSys.
Experience in collaborating with multiple teams to understand and improve application performance
Preferably from payments and cards domain.
Very good understanding of version control systems like GitHub.
Excellent knowledge of ITIL version 4 with thorough understanding of problem management life cycle, root cause analysis and permanent resolution, change management & incident management process.
Job roles & responsibilities
- Automate software build and deployment processes.
- Build CI/CD pipelines tailored to the needs and requirements of the environment.
- Take accountability for permanent resolution of recurring issues within defined SLAs.
- Provide technical solutions to complex issues, working with solution architects.
- Identify system bottlenecks in the operating system, database, and network, and work with the infrastructure teams on permanent resolution.
- Work closely with infrastructure managers and system architects on improving platform stability and robustness.
- Understand the existing change implementation process for upstream systems and propose improvement ideas.
- Work on improving system reliability with key parameters like latency, traffic, errors, and saturation.
- Understand end-user issues and liaise with partners and stakeholders (internal and external) to coordinate permanent resolution of business user pain points.
- Study existing manual support processes and propose automated solutions to enhance productivity and reduce manual effort.
- Review DR exercise processes and provide recommendations to reduce outage events during the exercises.
- Build monitoring systems that alert on symptoms rather than on outages.
- Review knowledge-base content such as standard operating procedures and wiki articles for quality, and guide the team on keeping it updated with live cases.
- Study and implement measures to enhance system availability.
To expedite the application process, I would appreciate it if you could provide the following information at your earliest convenience:
- Tentative Date to Join (if selected):
- Current Location:
- Preferred Location:
- Current Salary:
- Expected Salary:
- Reason for Change:
- Total Experience:
- Relevant Experience:
- Share official email confirmation of Notice Period or Last Working Day:
- Primary Skills (Hands-on):
- Secondary Skills:
Please send your updated resume by email or reach out via WhatsApp.
Junior Site Reliability Engineering Apprentice
Posted 4 days ago
Job Description
Systems Engineer III, Site Reliability Engineering
Posted 2 days ago
Job Description
Google, Bengaluru, Karnataka, India
**Mid**
Experience driving progress, solving problems, and mentoring more junior team members; deeper expertise and applied knowledge within relevant area.
**Minimum qualifications:**
+ Bachelor's degree in Computer Science, a related field, or equivalent practical experience.
+ 2 years of experience working with administration (e.g. filesystems, inodes, system calls) or networking (e.g. TCP/IP, routing, network topologies and hardware, SDN).
+ 2 years of experience with data structures/algorithms and software development in one or more programming languages (e.g., Python, C++, Java).
**Preferred qualifications:**
+ Master's degree in Computer Science or Engineering.
+ Experience in Linux system administration, networking fundamentals or system design.
**About the job**
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google's services (both our internally critical and our externally visible systems) have reliability, uptime appropriate to users' needs, and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance.
Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google, while using your expertise in coding, algorithms, complexity analysis and large-scale system design.
SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
**To learn more:** check out our books on Site Reliability Engineering or read a career profile about why a Software Engineer chose to join SRE.
Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google's product portfolio possible. We're proud to be our engineers' engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.
**Responsibilities**
+ Collaborate with our partner teams to identify risks inside our corporate systems and potential solutions to simplify or reduce risk to the productivity or business processes.
+ Ensure the availability of Core Enterprise Network Services, including DNS, DHCP, and RADIUS, supporting Googlers in offices worldwide.
+ Manage low-level infrastructure issues, encompassing networking, system administration, and system design.
+ Ensure our services deliver reliable networking connectivity to offices worldwide. Defend our service Service Level Objectives (SLOs), participating in a sustainable tier-one on-call rotation and supporting the culture.
+ Apply new technologies such as AI to solve traditional system engineering issues in exciting new ways.
Information collected and processed as part of your Google Careers profile, and any job applications you choose to submit, is subject to Google's Applicant and Candidate Privacy Policy.
Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.
If you have a need that requires accommodation, please let us know by completing our Accommodations for Applicants form.
Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.
To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.
Software Engineering Manager II, Site Reliability Engineering
Posted 2 days ago
Job Description
Google, Bengaluru, Karnataka, India
**Advanced**
Experience owning outcomes and decision making, solving ambiguous problems and influencing stakeholders; deep expertise in domain.
**Minimum qualifications:**
+ Bachelor's degree in Computer Science, a related field, or equivalent practical experience.
+ 8 years of experience with software development in one or more programming languages.
+ 3 years of experience managing people or teams.
+ 3 years of experience leading projects.
+ 3 years of experience designing, analyzing, and troubleshooting distributed systems.
**Preferred qualifications:**
+ Master's degree in Computer Science or Engineering.
+ Experience designing, analyzing, and troubleshooting large-scale distributed systems or microservices architectures.
+ Experience in driving organizational change.
+ Excellent communication skills along with systematic problem-solving approach.
**About the job**
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google's services (both our internally critical and our externally visible systems) have reliability, uptime appropriate to users' needs, and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance.
Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google, while using your expertise in coding, algorithms, complexity analysis and large-scale system design.
SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.
**To learn more:** check out our books on Site Reliability Engineering or read a career profile about why a Software Engineer chose to join SRE.
Behind everything our users see online is the architecture built by the Technical Infrastructure team to keep it running. From developing and maintaining our data centers to building the next generation of Google platforms, we make Google's product portfolio possible. We're proud to be our engineers' engineers and love voiding warranties by taking things apart so we can rebuild them. We keep our networks up and running, ensuring our users have the best and fastest experience possible.
**Responsibilities**
+ Lead a team of Software/Systems Engineers in Bengaluru on projects for users, and be responsible for uptime.
+ Defend the service's Service Level Objective (SLO) by participating in a sustainable tier-1 on-call rotation and supporting a postmortem culture. Ensure our services continue to operate in a reliable, low-toil fashion.
+ Identify and lead opportunities to use Artificial Intelligence and Machine Learning to solve traditional system engineering issues, driving improvements in customer satisfaction.
+ Collaborate with partner teams to deliver business impact and identify risks inside our corporate systems, and help define the expanded functional scope for the team as technologies and business needs evolve.
Information collected and processed as part of your Google Careers profile, and any job applications you choose to submit, is subject to Google's Applicant and Candidate Privacy Policy.
Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.
If you have a need that requires accommodation, please let us know by completing our Accommodations for Applicants form.
Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.
To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.
Site Reliability Engineer (SRE) - Cloud Infrastructure
Posted today
Job Description
Responsibilities:
- Design, build, and maintain scalable and reliable cloud infrastructure, primarily on AWS/GCP/Azure.
- Develop and implement automation tools and scripts for infrastructure provisioning, configuration management, and deployment (e.g., Terraform, Ansible, Kubernetes).
- Monitor system performance and availability, proactively identifying and resolving issues before they impact users.
- Implement robust monitoring, logging, and alerting systems using tools such as Prometheus, Grafana, and the ELK stack.
- Participate in on-call rotations to respond to production incidents and perform root cause analysis.
- Collaborate with development teams to ensure application reliability and performance throughout the software development lifecycle.
- Define and track key reliability metrics (SLOs, SLIs) and drive improvements to meet targets.
- Implement disaster recovery and business continuity strategies.
- Conduct performance testing and capacity planning to ensure systems can handle future growth.
- Automate operational tasks and reduce toil through scripting and tool development.
- Contribute to architectural decisions, focusing on scalability, reliability, and security.
- Participate in security reviews and implement security best practices across the infrastructure.
- Mentor junior engineers and share knowledge within the SRE team.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Engineering roles.
- Strong proficiency in at least one major cloud platform (AWS, GCP, Azure).
- Extensive experience with container orchestration technologies such as Kubernetes.
- Solid understanding of CI/CD pipelines and tools (e.g., Jenkins, GitLab CI, CircleCI).
- Proficiency in scripting languages (e.g., Python, Go, Bash).
- Experience with infrastructure as code (IaC) tools like Terraform or Ansible.
- Deep knowledge of monitoring and logging tools (e.g., Prometheus, Grafana, Datadog, Splunk).
- Understanding of networking concepts (TCP/IP, DNS, HTTP/S, load balancing).
- Experience with database administration and performance tuning (SQL and NoSQL).
- Excellent problem-solving and debugging skills.
- Strong communication and collaboration abilities, especially in a remote team environment.
- Familiarity with distributed systems and microservices architecture.
- A proactive mindset towards reliability and system resilience.
Senior Site Reliability Engineer - Cloud Infrastructure
Posted 13 days ago
Job Description
Responsibilities:
- Design, build, and maintain scalable and reliable cloud infrastructure (AWS/GCP/Azure).
- Implement and manage infrastructure as code (IaC) using tools like Terraform or Ansible.
- Develop and maintain robust monitoring, alerting, and logging systems.
- Automate deployment, scaling, and management of applications using containerization (Docker, Kubernetes).
- Respond to production incidents, perform root cause analysis, and implement preventive measures.
- Collaborate with development teams to improve system design for reliability and operability.
- Optimize system performance and resource utilization.
- Develop and enforce operational best practices and security standards.
- Participate in on-call rotations and provide timely incident resolution.
- Contribute to capacity planning and performance tuning efforts.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in Site Reliability Engineering, DevOps, or Systems Administration.
- Strong proficiency in scripting languages (e.g., Python, Bash).
- Extensive experience with cloud platforms (AWS preferred).
- Deep understanding of container orchestration (Kubernetes).
- Experience with CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI).
- Proficiency with monitoring tools (e.g., Prometheus, Grafana, ELK stack).
- Solid understanding of networking concepts and protocols.
- Experience with database administration and performance tuning.
- Excellent troubleshooting and problem-solving skills.
- Strong communication and collaboration abilities.