64 SRE jobs in Coimbatore
Associate Platform Reliability Engineer (SRE)
Posted today
Job Description
About Us
Jefferies Financial Group Inc. (“Jefferies,” “we,” “us” or “our”) is a U.S.-headquartered global full-service, integrated investment banking and securities firm. Our largest subsidiary, Jefferies LLC, a U.S. broker-dealer, was founded in the U.S. in 1962 and our first international operating subsidiary, Jefferies International Limited, a U.K. broker-dealer, was established in the U.K. in 1986. Our strategy focuses on continuing to build out our investment banking effort, enhancing our capital markets businesses and further developing our Leucadia Asset Management alternative asset management platform. We offer deep sector expertise across a full range of products and services in investment banking, equities, fixed income, asset and wealth management in the Americas, Europe and the Middle East, and Asia.
Overview
We are seeking a hands-on, technically skilled professional to join our global team as an Associate Platform Reliability Engineer. This role is critical to ensuring the stability, reliability, and resilience of Jefferies’ front-to-back technology infrastructure, with a focus on post-trade processing, operations, and regulatory support.
Key Responsibilities
- Collaborate with a high-performing global team to maintain plant stability across middle-office and operations applications.
- Lead incident triage, root cause analysis, and communication, with a strong focus on problem management.
- Partner with regional teams to drive technical and functional initiatives.
- Identify and eliminate manual support tasks through automation; develop tools for deployment, management, and service visibility.
- Design and implement robust monitoring and alerting systems using platforms like AppD and OpenTelemetry.
- Work closely with engineering teams to support system architecture, schema design, and performance tuning.
- Troubleshoot issues across the full technology stack.
Qualifications
- Bachelor’s degree in Computer Science, Engineering, or a related field.
- Minimum 3 years of experience in programming with Python, Go, C/C++, or C#.
- Strong foundation in computing fundamentals—data structures, algorithms, and software design.
- Experience in application design, maintenance, and support.
- Solid understanding of SRE principles and practices.
- Proficiency in Linux/Unix and Windows Server environments.
- Hands-on experience with scripting, databases, and troubleshooting application/data access issues.
- Self-driven with a strong sense of ownership and commitment to quality.
- Excellent communication skills, able to engage both technical and business stakeholders.
- Familiarity with open-source platforms such as Redis, MongoDB, Kafka, and Elasticsearch.
- Experience configuring observability stacks (Grafana, Prometheus, Jaeger, Loki).
- Exposure to DevOps tools and technologies including Git, Jenkins, Ansible.
Jefferies is an equal employment opportunity employer, and takes affirmative action to ensure that all qualified applicants will receive consideration for employment without regard to race, creed, color, national origin, ancestry, religion, gender, pregnancy, age, physical or mental disability, marital status, sexual orientation, gender identity or expression, veteran or military status, genetic information, reproductive health decisions, or any other factor protected by applicable law. We are committed to hiring the most qualified applicants and complying with all federal, state, and local equal employment opportunity laws. As part of this commitment, Jefferies will extend reasonable accommodations to individuals with disabilities, as required by applicable law.
We have been made aware of bad actors falsely claiming to be associated with Jefferies Group soliciting individuals to attend virtual job interviews, complete online tests or courses and sending fictitious employment offer letters.
Please note that any email contact with Jefferies personnel will come from an “@jefferies.com” email address. Further, Jefferies will not notify shortlisted candidates through social media platforms (e.g. WhatsApp or Telegram) or ask candidates to make payment to participate in the hiring process.
Developer - Cloud SRE & DevOps
Posted 11 days ago
Job Description
Site Reliability Engineer (SRE) - Airline Sciences (DMA)
Designation: Developer - Cloud SRE & DevOps
Key Responsibilities:
● Design, implement, and maintain robust and scalable infrastructure on AWS to support our microservices-based applications and REST APIs.
● Work collaboratively with DevOps engineers on CI/CD pipelines for automated deployment, testing, and rollback of services.
● Monitor system performance, availability, and reliability using APM tools, with a preference for Kibana-based solutions, and establish effective alerting mechanisms.
● Proactively identify potential issues and bottlenecks through log analysis, performance metrics, and synthetic monitoring; implement preventative measures.
● Troubleshoot and resolve complex production incidents, performing root cause analysis (RCA) and implementing long-term solutions.
● Manage and optimize database performance, reliability, and scalability.
● Configure and maintain network infrastructure, including load balancers, firewalls, and proxies, ensuring secure and efficient traffic flow.
● Champion and implement infrastructure-as-code (IaC) practices.
● Work with Docker for containerization of applications, managing container orchestration and registries.
● Collaborate closely with development teams to define service level objectives (SLOs), service level indicators (SLIs), and error budgets.
● Develop and maintain comprehensive documentation for system architecture, configurations, and operational procedures.
● Drive automation initiatives to reduce manual effort and improve system resilience.
● Contribute to capacity planning and performance tuning efforts.
Required Qualifications:
● Bachelor's degree in Computer Science, Engineering, or a related technical field.
● 4-6 years of experience in Site Reliability Engineering, DevOps, or a similar role.
● Proven hands-on experience with Amazon Web Services (AWS), including services like EC2, S3, RDS, VPC, IAM (Identity and Access Management), Lambda, and EKS/ECS.
● Strong understanding and practical experience with microservices architecture and REST API principles.
● Proficiency in managing and troubleshooting relational and NoSQL databases (e.g., PostgreSQL, MySQL, MongoDB, Cassandra).
● Solid knowledge of networking concepts (TCP/IP, DNS, HTTP/S, VPNs) and experience with proxies (e.g., Nginx, HAProxy).
● Demonstrable experience with monitoring, logging, and alerting systems, with a strong preference for experience with the ELK Stack (Elasticsearch, Logstash, Kibana) for APM and observability.
● Hands-on experience with Docker containerization and orchestration (e.g., Kubernetes, Docker Swarm).
● Capable in at least one scripting language (e.g., Python, Bash, Go).
● Experience with CI/CD tools (e.g., Jenkins, GitLab CI, AWS CodePipeline).
● Strong analytical and problem-solving skills with a proactive approach to identifying and resolving issues.
● Excellent communication and collaboration skills.
Site Reliability Engineer
Posted today
Job Description
Roles & Responsibilities
Design, test, implement, deploy, and support continuous integration pipelines that build and deploy to cloud-based environments (development, stage/testing, production). In this role, you will help us build the foundation for our future technology platforms by setting standards around cloud-native patterns, which will inform and drive our system-wide strategic cloud transformation journey.
- Write infrastructure code and test-case code.
- Deploy and configure monitoring and logging tools used for generating alerts about the health of our systems and applications.
- Help design, build, and automate cloud-based infrastructure, and guide development teams on how to continually improve their applications' cost, performance, and reliability through investigation, analysis, and best-practice recommendations.
- Work in a cloud-native infrastructure environment built to host and support true microservices architecture applications.
Mandatory Skills
- 6-12 years of experience in a professional cloud computing role with Kubernetes, Docker and Infra-as-Code.
- A BA/BS in Computer Science or equivalent work experience.
- Experience in design, implementation, and deployment of large-scale, highly available, cloud-based infrastructure utilizing AWS, GCP, Azure or other public cloud providers.
- Strong collaboration skills with multiple IT functions, business leaders and vendors, exhibiting excellent teamwork and strong verbal and written communication skills along with expert troubleshooting and analytical skills.
- In depth knowledge of cloud-native application development paradigms and tools.
- Strong know-how with the current trends of large-scale infrastructure environments with proven success in the deployment of resilient cloud-native solutions.
- Experience with source code control systems, branching and merging strategies, automated unit testing frameworks, automated build tools, and automated deploy frameworks.
- Deep working knowledge of serverless and container-based technologies such as Lambda, Docker, Kubernetes, and container platforms such as Rancher Labs or RedHat OpenShift.
Site Reliability Engineer
Posted today
Job Description
About the Role:
We are looking for a skilled Site Reliability Engineer (SRE) to join our team and help us ensure the reliability, scalability, and performance of our critical systems. As an SRE, you will work closely with development and operations teams to build and maintain highly available services, automate operational tasks, and monitor system health.
Key Responsibilities:
- Design, implement, and maintain scalable and reliable infrastructure for production systems.
- Automate repetitive operational tasks, deployments, and monitoring.
- Collaborate with software engineering teams to build reliable and efficient services.
- Develop and maintain tools for system monitoring, alerting, and incident response.
- Participate in on-call rotations and manage incident response to minimize downtime.
- Analyze system performance and identify bottlenecks or failure points.
- Develop and implement disaster recovery and backup strategies.
- Advocate for reliability, availability, and performance best practices throughout the engineering teams.
- Document processes, architecture, and troubleshooting guides.
Required Qualifications:
- Bachelor’s degree in Computer Science, Engineering, or related field, or equivalent experience.
- 5+ years of proven experience in Site Reliability Engineering with tools like Grafana and Prometheus.
- Experience with Docker, Kubernetes, and Terraform.
- Strong knowledge of Linux/Unix systems and networking fundamentals.
- Proficiency with scripting and programming languages such as Python, Go, Bash, or Ruby.
- Experience with cloud platforms like AWS, GCP, or Azure.
Site Reliability Engineer
Posted today
Job Description
Role: Site Reliability Engineer
Experience: 8-14 years
Location: Sector 16, Noida
Notice Period: Immediate / Serving only
About Times Internet
At Times Internet, we create premium digital products that simplify and enhance the lives of millions. As India’s largest digital products company, we have a significant presence across a wide range of categories, including News, Sports, Fintech, and Enterprise solutions.
Our portfolio features market-leading and iconic brands such as TOI, ET, NBT, Cricbuzz, Times Prime, Times Card, Indiatimes, Whatshot, Abound, Willow TV, Techgig, and Times Mobile among many more. Each of these products is crafted to enrich your experiences and bring you closer to your interests and aspirations.
As an equal opportunity employer, Times Internet strongly promotes inclusivity and diversity. We are proud to have achieved overall gender pay parity in 2018, verified by an independent audit conducted by Aon Hewitt.
We are driven by the excitement of new possibilities and are committed to bringing innovative products, ideas, and technologies to help people make the most of every day. Join us and take us to the next level!
Job Description
We are looking for a Site Reliability Engineer (SRE) to join our News Team. The SRE will be responsible for maintaining the reliability, scalability, and performance of our critical infrastructure, ensuring high availability for our services.
Job Role:
As a Site Reliability Engineer (SRE) in the News Team, you will be responsible for ensuring the stability, performance, and scalability of our systems. You will play a key role in various migration activities, including Kubernetes cluster upgrades and application re-platforming. A significant part of your role will involve migrating applications into Kubernetes, ensuring seamless deployment, high availability, and minimal downtime.
Additionally, you will be responsible for configuring and maintaining Elasticsearch and Kafka clusters, ensuring optimal performance, availability, and reliability. You will work on tuning Elasticsearch for efficient search and indexing, managing Kafka for real-time data streaming, and troubleshooting any issues related to these services.
You will work on automating operational tasks, optimizing infrastructure, and proactively resolving issues to maintain system reliability. Additionally, you will collaborate with development, DevOps, and infrastructure teams to implement best practices for security, observability, and scalability. Your expertise will be crucial in improving deployment pipelines, incident response, and overall system performance.
Job Responsibilities:
● Ensure IT services and infrastructure uptime.
● Implement monitoring, alerting, and incident response processes
● Automate repetitive ops tasks (deployments, scaling, failover).
● Respond to outages and production incidents (on-call duties).
● Perform root cause analysis (RCA) and drive postmortems.
● Measure and optimize system performance (latency, throughput, resource usage).
● Support reliable and safe code releases
● Ensure systems are patched, hardened, and compliant with standards.
● Collaborate with technology teams on new requirements and deliver them.
Technical Skills Required:
● 8+ years of experience in Site Reliability Engineering, or a related role.
● Proficiency in Kubernetes, Docker, and container orchestration.
● Experience with CI/CD tools.
● Strong knowledge of Linux systems and scripting (Bash, Python).
● Familiarity with configuration management tools like Ansible and Helm.
● Experience with monitoring and logging tools (ELK Stack or New Relic).
● Strong troubleshooting skills and incident management experience.
● Experience with Elasticsearch and Kafka
● Knowledge of networking concepts, load balancers, and DNS.
● Experience in performance tuning and optimization.
Soft Skills Required:
● Systems & OS Knowledge
● Linux/Unix administration (process management, system tuning, networking)
● Understanding of filesystems, memory, CPU, kernel basics (CentOS/Ubuntu)
● Scripting for automation: Bash, Python
● Knowledge of cloud platforms: AWS, GCP, Azure
● Networking and Protocols
● TCP/IP, DNS, HTTP/HTTPS, CDN concepts
● Debugging latency, connectivity, and routing issues
● CI/CD and DevOps Practices
● Jenkins, GitHub Actions, GitLab CI, BitBucket, Git
● Working knowledge of Apache, Tomcat, Nginx
● Knowledge of DNS, Load Balancer, WAF, Firewall.
● Working knowledge of Monitoring tools and ELK
● Knowledge of hypervisors such as VMware.
● Strong in virtualization technologies, Docker, and Kubernetes
● Knowledge of Database concepts
Qualifications - Education & Experience:
● Bachelor’s degree in Electronic and Telecom, Computer Science, Information Technology, or a related field.
● 8+ years of experience in Site Reliability Engineering
Site Reliability Engineer
Posted today
Job Description
Role - Site Reliability Engineer (SRE), Platform Engineering, or DevOps Engineering
Location – Bangalore / Remote
Type - Contract
Work Ex - 4-6 yrs
We’re working with an AI product company that’s building the next generation of GenAI-powered developer platforms.
We’re looking for an experienced Site Reliability Engineer to join their Platform Engineering team. This role is perfect for someone who thrives at the intersection of software engineering and systems operations and wants to build infrastructure that powers millions of AI-driven code reviews at scale.
What We’re Looking For
- 4–6 years in SRE, Platform Engineering, or DevOps roles.
- Strong hands-on experience with GCP (or AWS), Kubernetes, Docker, and Terraform.
- Proficiency in Node.js / TypeScript for automation and tooling.
- Strong background in Linux/Unix systems, networking, and CI/CD pipelines.
- Familiarity with observability platforms (Datadog, Prometheus, Grafana, ELK).
Nice to Have
- AI/ML infrastructure exposure
- Experience running high-traffic, distributed systems
- Open-source contributions
- Knowledge of compliance (SOC 2, ISO 27001) and cost optimization
Why This Role?
- Work on cutting-edge AI systems with massive real-world developer impact
- Join a collaborative, high-growth product team
- Competitive salary + equity + benefits
- Shape infrastructure that supports millions of real-time code reviews
Site Reliability Engineer
Posted today
Job Description
About Us
MyRemoteTeam, Inc is a fast-growing distributed workforce enabler, helping companies scale with top global talent. We empower businesses by providing world-class software engineers, operations support, and infrastructure to help them grow faster and better.
Job Title: AWS SRE Engineer
Mandatory skills: Java, Cloud (AWS or Docker/Kubernetes), production support knowledge, ServiceNow (SNOW)
Exp: 8+ Yrs
Candidate needs to work from client office.
Detailed JD:
We are seeking a skilled and proactive engineer with expertise in Kubernetes, Java-based applications, and cloud platforms (AWS/Azure/GCP), along with experience in ServiceNow for support ticket management. The ideal candidate will be responsible for maintaining cloud-native applications, troubleshooting production issues, and ensuring smooth operations through effective ticket handling and resolution.
Key Responsibilities:
Kubernetes & Cloud Operations:
- Deploy, manage, and monitor containerized applications using Kubernetes.
- Maintain and optimize cloud infrastructure (AWS, Azure, or GCP).
- Automate deployments and infrastructure using CI/CD pipelines and Infrastructure as Code (IaC) tools like Terraform or Helm.
- Monitor system performance, availability, and security.
Java Application Support:
- Troubleshoot and debug Java-based microservices and APIs.
- Collaborate with development teams to resolve application issues.
- Participate in code reviews and suggest performance improvements.
ServiceNow (SNOW) Support:
- Handle incident, problem, and change management via ServiceNow.
- Raise, track, and resolve support tickets in coordination with internal and external teams.
- Document root cause analysis (RCA) and resolution steps for recurring issues.
Collaboration & Documentation:
- Work closely with DevOps, QA, and development teams.
- Maintain technical documentation, runbooks, and knowledge base articles.
- Participate in on-call rotations and provide timely support for critical issues.
Required Skills:
- Strong hands-on experience with Kubernetes and container orchestration.
- Proficiency in Java and related frameworks (Spring Boot, REST APIs).
- Experience with cloud platforms (AWS, Azure, or GCP).
- Familiarity with ServiceNow or similar ITSM tools.
- Good understanding of CI/CD tools (Jenkins, GitLab CI, etc.).
- Knowledge of monitoring tools (Prometheus, Grafana, ELK, etc.)
Site Reliability Engineer
Posted today
Job Description
We are looking for a highly skilled AWS Engineer with strong Python development and Chaos Engineering expertise to design, build, and validate resilient, scalable, and automated cloud-native environments. The ideal candidate will combine cloud engineering, DevOps, and chaos experimentation to improve reliability, fault tolerance, and operational efficiency of critical systems.
Key Responsibilities
Cloud Engineering (AWS):
- Architect, implement, and manage secure, scalable, and cost-efficient AWS infrastructure (EC2, Lambda, EKS, S3, RDS, IAM, CloudFront, etc.).
- Automate infrastructure provisioning and configuration using Terraform / CloudFormation and AWS SDKs.
- Manage containerized workloads (Docker, Kubernetes, EKS).
Python Development:
- Build automation scripts, deployment utilities, and infrastructure tooling using Python (Boto3, Flask, FastAPI, etc.).
- Develop custom monitoring/alerting integrations with APIs, SDKs, and third-party observability platforms.
- Implement self-healing and resilience-focused automation scripts.
Chaos Engineering & Resiliency:
- Design and execute chaos experiments (fault injection, latency, outages, resource failures) to validate system resilience.
- Use tools like Gremlin, Litmus, Chaos Mesh, or AWS Fault Injection Simulator.
- Partner with SRE and development teams to define SLIs, SLOs, and error budgets.
- Document learnings from chaos tests and improve incident response & recovery playbooks.
DevOps & Observability:
- Build and maintain CI/CD pipelines for automated deployments (Jenkins, GitHub Actions, GitLab CI, AWS CodePipeline).
- Integrate observability frameworks (Prometheus, Grafana, ELK/EFK, CloudWatch, Datadog) for monitoring and tracing.
- Ensure proactive alerting and real-time visibility into system health.
Security & Compliance:
- Apply AWS security best practices for IAM, networking, and data protection.
- Ensure compliance with internal and external regulatory frameworks (SOC2, ISO, GDPR, etc.).
Required Skills & Qualifications
- 6–10 years of experience in Cloud, DevOps, or SRE roles.
- Strong hands-on expertise in AWS Cloud (certifications preferred: AWS DevOps Engineer / Solutions Architect).
- Advanced Python development skills for automation and tooling (Boto3 a must).
- Experience designing and running chaos experiments (Gremlin, AWS FIS, Litmus, Chaos Mesh, or custom Python-based fault injection).
- Solid knowledge of IaC (Terraform / CloudFormation).
- Proficiency in containers & orchestration (Docker, Kubernetes, EKS).
- Strong background in monitoring, observability, and incident management.
- Familiarity with DevOps toolchain (CI/CD, Git, Jenkins, GitLab, CodePipeline).
- Good understanding of resilient architectures, reliability principles, and disaster recovery.
Preferred Skills
- Knowledge of Go / Shell scripting in addition to Python.
- Experience with chaos testing in production-like environments.
- Exposure to multi-cloud or hybrid-cloud environments.
- Strong problem-solving and debugging skills.
What We Offer
- Opportunity to lead cloud reliability & chaos engineering initiatives.
- Culture focused on automation, resilience, and continuous improvement.
- Growth opportunities through certifications, R&D projects, and leadership roles.