34 Site Reliability Engineer jobs in Coimbatore
Site Reliability Engineer
Posted today
Job Description
Roles & Responsibilities
Design, test, implement, deploy, and support continuous integration pipelines that build and deploy to cloud-based environments (development, stage/testing, production). In this role, you will help us build the foundation for our future technology platforms by setting standards around cloud-native patterns, which will inform and drive our system-wide strategic cloud transformation journey.
- Write infrastructure code and test-case code.
- Deploy and configure monitoring and logging tools used for generating alerts about the health of our systems and applications.
- Help with the design, building, and automation of cloud-based infrastructure, and provide guidance to development teams on how they can continually improve their applications' cost, performance, and reliability through investigation, analysis, and best-practice recommendations.
- Work in a cloud-native infrastructure environment built to host and support true microservices architecture applications.
Mandatory Skills
- 6-12 years of experience in a professional cloud computing role with Kubernetes, Docker and Infra-as-Code.
- A BA/BS in Computer Science or equivalent work experience.
- Experience in design, implementation, and deployment of large-scale, highly available, cloud-based infrastructure utilizing AWS, GCP, Azure or other public cloud providers.
- Strong collaboration skills with multiple IT functions, business leaders and vendors, exhibiting excellent teamwork and strong verbal and written communication skills along with expert troubleshooting and analytical skills.
- In-depth knowledge of cloud-native application development paradigms and tools.
- Strong knowledge of current trends in large-scale infrastructure environments, with proven success in the deployment of resilient cloud-native solutions.
- Experience with source code control systems, branching and merging strategies, automated unit testing frameworks, automated build tools, and automated deploy frameworks.
- Deep working knowledge of serverless and container-based technologies such as Lambda, Docker, Kubernetes, and container platforms such as Rancher Labs or RedHat OpenShift.
Site Reliability Engineer
Posted today
Job Description
About the Role:
We are looking for a skilled Site Reliability Engineer (SRE) to join our team and help us ensure the reliability, scalability, and performance of our critical systems. As an SRE, you will work closely with development and operations teams to build and maintain highly available services, automate operational tasks, and monitor system health.
Key Responsibilities:
- Design, implement, and maintain scalable and reliable infrastructure for production systems.
- Automate repetitive operational tasks, deployments, and monitoring.
- Collaborate with software engineering teams to build reliable and efficient services.
- Develop and maintain tools for system monitoring, alerting, and incident response.
- Participate in on-call rotations and manage incident response to minimize downtime.
- Analyze system performance and identify bottlenecks or failure points.
- Develop and implement disaster recovery and backup strategies.
- Advocate for reliability, availability, and performance best practices throughout the engineering teams.
- Document processes, architecture, and troubleshooting guides.
Required Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience.
- 5+ years of proven experience in Site Reliability Engineering with tools like Grafana and Prometheus.
- Experience with Docker, Kubernetes, and Terraform.
- Strong knowledge of Linux/Unix systems and networking fundamentals.
- Proficiency with scripting and programming languages such as Python, Go, Bash, or Ruby.
- Experience with cloud platforms like AWS, GCP, or Azure.
Site Reliability Engineer
Posted today
Job Description
Role: Site Reliability Engineer
Experience: 8-14 years
Location: Sector 16, Noida
Notice Period: Immediate / Serving only
About Times Internet
At Times Internet, we create premium digital products that simplify and enhance the lives of millions. As India’s largest digital products company, we have a significant presence across a wide range of categories, including News, Sports, Fintech, and Enterprise solutions.
Our portfolio features market-leading and iconic brands such as TOI, ET, NBT, Cricbuzz, Times Prime, Times Card, Indiatimes, Whatshot, Abound, Willow TV, Techgig, and Times Mobile among many more. Each of these products is crafted to enrich your experiences and bring you closer to your interests and aspirations.
As an equal opportunity employer, Times Internet strongly promotes inclusivity and diversity. We are proud to have achieved overall gender pay parity in 2018, verified by an independent audit conducted by Aon Hewitt.
We are driven by the excitement of new possibilities and are committed to bringing innovative products, ideas, and technologies to help people make the most of every day. Join us and take us to the next level!
Job Description
We are looking for a Site Reliability Engineer (SRE) to join our News Team. The SRE will be responsible for maintaining the reliability, scalability, and performance of our critical infrastructure, ensuring high availability for our services.
Job Role:
As a Site Reliability Engineer (SRE) in the News Team, you will be responsible for ensuring the stability, performance, and scalability of our systems. You will play a key role in various migration activities, including Kubernetes cluster upgrades and application re-platforming. A significant part of your role will involve migrating applications into Kubernetes, ensuring seamless deployment, high availability, and minimal downtime.
Additionally, you will be responsible for configuring and maintaining Elasticsearch and Kafka clusters, ensuring optimal performance, availability, and reliability. You will work on tuning Elasticsearch for efficient search and indexing, managing Kafka for real-time data streaming, and troubleshooting any issues related to these services.
You will work on automating operational tasks, optimizing infrastructure, and proactively resolving issues to maintain system reliability. Additionally, you will collaborate with development, DevOps, and infrastructure teams to implement best practices for security, observability, and scalability. Your expertise will be crucial in improving deployment pipelines, incident response, and overall system performance.
Job Responsibilities:
● Ensure IT services and infrastructure uptime.
● Implement monitoring, alerting, and incident response processes.
● Automate repetitive ops tasks (deployments, scaling, failover).
● Respond to outages and production incidents (on-call duties).
● Perform root cause analysis (RCA) and drive postmortems.
● Measure and optimize system performance (latency, throughput, resource usage).
● Support reliable and safe code releases.
● Ensure systems are patched, hardened, and compliant with standards.
● Collaborate with technology teams on new requirements and deliver them.
Technical Skills Required:
● 8+ years of experience in Site Reliability Engineering, or a related role.
● Proficiency in Kubernetes, Docker, and container orchestration.
● Experience with CI/CD tools.
● Strong knowledge of Linux systems and scripting (Bash, Python).
● Familiarity with configuration management tools like Ansible and Helm.
● Experience with monitoring and logging tools (ELK Stack or New Relic).
● Strong troubleshooting skills and incident management experience.
● Experience with Elasticsearch and Kafka.
● Knowledge of networking concepts, load balancers, and DNS.
● Experience in performance tuning and optimization.
Soft Skills Required:
● Systems & OS Knowledge
● Linux/Unix administration (process management, system tuning, networking)
● Understanding of filesystems, memory, CPU, and kernel basics (CentOS / Ubuntu)
● Scripting for automation: Bash, Python
● Knowledge of cloud platforms: AWS, GCP, Azure
● Networking and Protocols
● TCP/IP, DNS, HTTP/HTTPS, CDN concepts
● Debugging latency, connectivity, and routing issues
● CI/CD and DevOps Practices
● Jenkins, GitHub Actions, GitLab CI, BitBucket, Git
● Working knowledge of Apache, Tomcat, Nginx
● Knowledge of DNS, load balancers, WAF, and firewalls
● Working knowledge of monitoring tools and ELK
● Knowledge of hypervisors like VMware
● Strong in virtualization technologies, Docker, Kubernetes
● Knowledge of database concepts
Qualifications - Education & Experience:
● Bachelor’s degree in Electronics and Telecom, Computer Science, Information Technology, or a related field.
● 8+ years of experience in Site Reliability Engineering.
Site Reliability Engineer
Posted today
Job Description
Role - Site Reliability Engineer (SRE) / Platform Engineering / DevOps Engineering
Location – Bangalore/ Remote
Type - Contract
Work Ex - 4-6 yrs
We’re working with an AI product company that’s building the next generation of GenAI-powered developer platforms.
We’re looking for an experienced Site Reliability Engineer to join their Platform Engineering team. This role is perfect for someone who thrives at the intersection of software engineering and systems operations, and wants to build infrastructure that powers millions of AI-driven code reviews at scale.
What We’re Looking For
- 4–6 years in SRE, Platform Engineering, or DevOps roles.
- Strong hands-on experience with GCP (or AWS), Kubernetes, Docker, and Terraform.
- Proficiency in Node.js / TypeScript for automation and tooling.
- Strong background in Linux/Unix systems, networking, and CI/CD pipelines.
- Familiarity with observability platforms (Datadog, Prometheus, Grafana, ELK).
Nice to Have
- AI/ML infrastructure exposure
- Experience running high-traffic, distributed systems
- Open-source contributions
- Knowledge of compliance (SOC 2, ISO 27001) and cost optimization
Why This Role?
- Work on cutting-edge AI systems with massive real-world developer impact
- Join a collaborative, high-growth product team
- Competitive salary + equity + benefits
- Shape infrastructure that supports millions of real-time code reviews
Site Reliability Engineer
Posted today
Job Description
About Us
MyRemoteTeam, Inc is a fast-growing distributed workforce enabler, helping companies scale with top global talent. We empower businesses by providing world-class software engineers, operations support, and infrastructure to help them grow faster and better.
Job Title: AWS SRE Engineer
Mandatory skills: Java, Cloud (AWS or Docker/Kubernetes), production support knowledge, ServiceNow (SNOW)
Exp: 8+ Yrs
Candidate needs to work from client office.
Detailed JD:
We are seeking a skilled and proactive engineer with expertise in Kubernetes, Java-based applications, and cloud platforms (AWS/Azure/GCP), along with experience in ServiceNow for support ticket management. The ideal candidate will be responsible for maintaining cloud-native applications, troubleshooting production issues, and ensuring smooth operations through effective ticket handling and resolution.
Key Responsibilities:
Kubernetes & Cloud Operations:
- Deploy, manage, and monitor containerized applications using Kubernetes.
- Maintain and optimize cloud infrastructure (AWS, Azure, or GCP).
- Automate deployments and infrastructure using CI/CD pipelines and Infrastructure as Code (IaC) tools like Terraform or Helm.
- Monitor system performance, availability, and security.
Java Application Support:
- Troubleshoot and debug Java-based microservices and APIs.
- Collaborate with development teams to resolve application issues.
- Participate in code reviews and suggest performance improvements.
ServiceNow (SNOW) Support:
- Handle incident, problem, and change management via ServiceNow.
- Raise, track, and resolve support tickets in coordination with internal and external teams.
- Document root cause analysis (RCA) and resolution steps for recurring issues.
Collaboration & Documentation:
- Work closely with DevOps, QA, and development teams.
- Maintain technical documentation, runbooks, and knowledge base articles.
- Participate in on-call rotations and provide timely support for critical issues.
Required Skills:
- Strong hands-on experience with Kubernetes and container orchestration.
- Proficiency in Java and related frameworks (Spring Boot, REST APIs).
- Experience with cloud platforms (AWS, Azure, or GCP).
- Familiarity with ServiceNow or similar ITSM tools.
- Good understanding of CI/CD tools (Jenkins, GitLab CI, etc.).
- Knowledge of monitoring tools (Prometheus, Grafana, ELK, etc.)
Site Reliability Engineer
Posted today
Job Description
We are looking for a highly skilled AWS Engineer with strong Python development and Chaos Engineering expertise to design, build, and validate resilient, scalable, and automated cloud-native environments. The ideal candidate will combine cloud engineering, DevOps, and chaos experimentation to improve reliability, fault tolerance, and operational efficiency of critical systems.
Key Responsibilities
Cloud Engineering (AWS):
- Architect, implement, and manage secure, scalable, and cost-efficient AWS infrastructure (EC2, Lambda, EKS, S3, RDS, IAM, CloudFront, etc.).
- Automate infrastructure provisioning and configuration using Terraform / CloudFormation and AWS SDKs.
- Manage containerized workloads (Docker, Kubernetes, EKS).
Python Development:
- Build automation scripts, deployment utilities, and infrastructure tooling using Python (Boto3, Flask, FastAPI, etc.).
- Develop custom monitoring/alerting integrations with APIs, SDKs, and third-party observability platforms.
- Implement self-healing and resilience-focused automation scripts.
Chaos Engineering & Resiliency:
- Design and execute chaos experiments (fault injection, latency, outages, resource failures) to validate system resilience.
- Use tools like Gremlin, Litmus, Chaos Mesh, or AWS Fault Injection Simulator.
- Partner with SRE and development teams to define SLIs, SLOs, and error budgets.
- Document learnings from chaos tests and improve incident response & recovery playbooks.
DevOps & Observability:
- Build and maintain CI/CD pipelines for automated deployments (Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline).
- Integrate observability frameworks (Prometheus, Grafana, ELK/EFK, CloudWatch, Datadog) for monitoring and tracing.
- Ensure proactive alerting and real-time visibility into system health.
Security & Compliance:
- Apply AWS security best practices for IAM, networking, and data protection.
- Ensure compliance with internal and external regulatory frameworks (SOC2, ISO, GDPR, etc.).
Required Skills & Qualifications
- 6–10 years of experience in Cloud, DevOps, or SRE roles.
- Strong hands-on expertise in AWS Cloud (certifications preferred: AWS DevOps Engineer / Solutions Architect).
- Advanced Python development skills for automation and tooling (Boto3 a must).
- Experience designing and running chaos experiments (Gremlin, AWS FIS, Litmus, Chaos Mesh, or custom Python-based fault injection).
- Solid knowledge of IaC (Terraform / CloudFormation).
- Proficiency in containers & orchestration (Docker, Kubernetes, EKS).
- Strong background in monitoring, observability, and incident management.
- Familiarity with DevOps toolchain (CI/CD, Git, Jenkins, GitLab, CodePipeline).
- Good understanding of resilient architectures, reliability principles, and disaster recovery.
Preferred Skills
- Knowledge of Go / Shell scripting in addition to Python.
- Experience with chaos testing in production-like environments.
- Exposure to multi-cloud or hybrid-cloud environments.
- Strong problem-solving and debugging skills.
What We Offer
- Opportunity to lead cloud reliability & chaos engineering initiatives.
- Culture focused on automation, resilience, and continuous improvement.
- Growth opportunities through certifications, R&D projects, and leadership roles.
Site Reliability Engineer
Posted today
Job Description
Uplers is hiring for one of the clients. It is a remote opportunity.
Role Details:
- Position: SRE (Oracle Cloud Infrastructure)
- Type: 10-month contract (possible extension)
- Mode: Remote | Mon–Fri | 10:30 AM – 7:30 PM IST
- Policy: Use of personal device required
- Experience: 7–10 yrs (min. 7–8 yrs in OCI)
- Skills: OCI, Terraform, GitLab
- Rounds: 2
About the Role:
As an SRE, you will build and manage our OCI cloud infrastructure using Terraform and GitLab CI/CD, ensuring its stability, security, and scalability.
You will automate builds and deployments of various apps, implement monitoring and logging, and collaborate with development teams while staying current with OCI best practices.
Experience with OCI, Terraform, GitLab CI/CD, and cloud-native principles is essential.
FS banking experience is preferred.
Site Reliability Engineer
Posted today
Job Description
We are seeking a hands-on SRE with expertise in infrastructure automation, cloud scalability, and performance optimization. You’ll design, manage, and monitor large-scale AWS environments, ensuring high availability, security, and reliability for our SaaS platforms.
Key Responsibilities
- Develop and execute UI automation using Cypress with TypeScript.
- Conduct performance testing using K6.
- Perform API testing with Postman.
- Run accessibility testing using Wave, AudioEye, and similar tools.
- Manage and optimize AWS infrastructure at scale (EC2, S3, ELB, Lambda, Route 53, ECS, SQS, CloudWatch).
- Package, deploy, and manage containerized workloads (Docker, Kubernetes).
- Automate workflows using Terraform, CDK, Chef.
- Implement CI/CD pipelines (TeamCity, Octopus Deploy, GitHub, Jenkins, Codefresh).
- Monitor and troubleshoot using ELK stack, Dynatrace, New Relic, Nagios.
- Manage and optimize IIS and web farms in high-traffic SaaS environments.
Key Skills & Experience
- 3+ years with IaC & DSC tools (Terraform, CDK, Chef).
- 3+ years managing containerized workloads on PaaS (Docker, Kubernetes).
- Strong scripting/automation skills (PowerShell, Ruby, Go, Python, Bash).
- Experience with large-scale monitoring & reporting.
- Solid understanding of .NET application architecture.
- Proven problem-solving & troubleshooting skills in DevOps/SRE environments.