2,018 System Reliability jobs in India

System Reliability Engineer

Synechron

Posted today

Job Description

We have an immediate opportunity for an SRE (Senior Site Reliability Engineer) with 5 to 9 years of experience.

Synechron – Bangalore

Job Role: SRE (Senior Site Reliability Engineer)

Job Location: Bangalore

Notice Period: Within 30 days

About Synechron

We began life in 2001 as a small, self-funded team of technology specialists. Since then, we’ve grown our organization to 14,500+ people, across 58 offices, in 21 countries, in key global markets.

Innovative tech solutions for business

We're now a leading global digital consulting firm, providing innovative technology solutions for business. As a trusted partner, we're always at the forefront of change as we lead digital optimization and modernization journeys for our clients.

Customized end-to-end solutions

Our expertise in AI, Consulting, Data, Digital, Cloud & DevOps, and Software Engineering delivers customized, end-to-end solutions that drive business value and growth.

For more information on the company, please visit our website or LinkedIn community.

Job Description

Base Skills:

Performance testing and engineering, scalability, and availability. Experience with load testing tools (JMeter/LoadRunner) and with any APM tool (Dynatrace / New Relic / AppDynamics). Has worked on performance optimization of applications and infrastructure.

Experience in client-side performance engineering, with a focus on optimizing and tuning mobile (Android/iOS) and web applications (optional).

Strong understanding of distributed systems, cloud platforms (AWS, Azure, or GCP), and microservices architecture.

Monitoring, observability, and OpenTelemetry: using tools like Splunk, AppDynamics, Prometheus, Fluentd, ELK (Elasticsearch, Logstash, Kibana), TIG (Telegraf, InfluxDB, Grafana), Dynatrace, and New Relic.

Common Soft Skills

Experience independently executing a customer-facing role: understanding the SRE requirements, assisting in building the team, and driving the implementation

Experience in establishing and documenting performance baselines, thresholds, and SLAs for critical applications

Optional

Experience in developing capacity planning models and working with stakeholders to forecast future scalability requirements

Experience in designing high availability architectures with failover and recovery mechanisms

Hands-on experience working on RFPs/proposals

Excellent communication and business presentation skills

Must Have skills:

System reliability; chaos engineering to proactively identify and mitigate potential system vulnerabilities.

Concepts of SLI, SLO, and SLA; defining SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets.

Application debugging and support; understanding of REST APIs

Optional Skills

Software engineering and development skills: .NET, Go, Python, C++, Ruby, or Java, or software delivery platforms such as Puppet, Chef, Ansible, and/or Spinnaker. Ability to instrument services and write exporters, collectors, etc.

Experience with Kubernetes. Good experience with automation across applications/services and infrastructure management

QUALIFICATION:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.

If you find this opportunity interesting, kindly share your updated profile on

With below details (Mandatory)

Total Experience

Experience in Site Reliability

Current CTC

Expected CTC

Notice period

Current Location

Available for Face-to-Face interview?

Ready to relocate to Bangalore?

Have you gone through any interviews at Synechron before? If yes, when?

Regards,

Pravin Chauhan

Hp & WhatsApp #

System Reliability Engineer (Big Data)

411057 Maval, Maharashtra Fulcrum Digital

Posted 148 days ago

Job Description

Permanent

Who are we

Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, health care, and manufacturing.

The Role

Plan, manage, and oversee all aspects of a production environment for Big Data platforms.
Define strategies for application performance monitoring and optimization in the production environment.
Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.
Ensure that batch production scheduling and processes are accurate and timely.
Ability to create and execute queries against the big data platform and relational data tables to identify process issues or to perform mass updates (preferred).
Perform ad hoc requests from users, such as data research, file manipulation/transfer, research of process issues, etc.
Take a holistic approach to problem solving by connecting the dots during a production event across the various technology stacks that make up the platform, to optimize mean time to recover.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews.
Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Work with a global team spread across tech hubs in multiple geographies and time zones.
Ability to share knowledge and explain processes and procedures to others.

Requirements

Experience in Linux and knowledge of ITSM/ITIL.
Experience in Big Data technologies (Hadoop, Spark, NiFi, Impala).
2+ years of experience in running Big Data production systems.
Good to have: experience with industry-standard CI/CD tools like Git/Bitbucket, Jenkins, and Maven.
Solid grasp of SQL or Oracle fundamentals.
Experience with scripting, pipeline management, and software design.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to help debug and optimize code and automate routine tasks.
Ability to support many different stakeholders. Experience in dealing with difficult situations and making decisions with a sense of urgency is needed.
Appetite for change and pushing the boundaries of what can be done with automation.
Experience in working across development, operations, and product teams to prioritize needs and build relationships is a must.
Experience designing and implementing an effective and efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is desired.
Good handle on the change management and release management aspects of software.

Lead DevOps and System Reliability Engineer

Glowingbud

Posted today

Job Description

Glowingbud is a rapidly growing eSIM services platform that simplifies connectivity with powerful APIs, robust B2B and B2C interfaces, and seamless integrations with Telna. Our platform enables global eSIM lifecycle management, user onboarding, secure payment systems, and scalable deployments. Recently acquired by Telna, we are expanding our product offerings and team to meet increasing demand and innovation goals.

Job Summary

We are seeking a highly experienced Senior DevOps Engineer with 10+ years of expertise in cloud infrastructure, automation, and system reliability. The ideal candidate will be responsible for maintaining scalable AWS-based environments, implementing robust CI/CD pipelines, optimizing system performance, and ensuring high availability of critical applications. This role requires deep expertise in Docker, Kubernetes, Infrastructure as Code (IaC), and system monitoring. The candidate will also be responsible for documenting system architecture, setting SLAs, and leading DevOps best practices across teams. If you thrive in a fast-paced, collaborative environment and are passionate about DevOps, we'd love to hear from you!

Key Responsibilities:

Infrastructure Management: Design, implement, and maintain scalable cloud infrastructure using AWS services.
System Documentation & Diagrams: Maintain up-to-date system diagrams, architecture documentation, and operational procedures.
Containerization & Orchestration: Deploy and manage containerized applications using Docker and Kubernetes.
System Maintenance & Optimization: Ensure high availability, performance tuning, and cost optimization of cloud and on-premise infrastructure.
Monitoring & Observability: Implement detailed system monitoring, logging, and alerting using tools like Datadog, Prometheus, Grafana, ELK stack, or AWS CloudWatch.
Security & Compliance: Enforce security best practices, conduct regular audits, and ensure adherence to compliance standards.
CI/CD Pipeline Management: Build and maintain automated deployment pipelines for seamless application releases.
Incident Response & SLA Management: Define SLAs, monitor system performance, and establish an efficient incident response strategy.
Collaboration & Leadership: Work closely with development, QA, and operations teams to improve reliability, scalability, and efficiency.

Qualifications:

7+ years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Infrastructure roles.
Expert knowledge of AWS Services (EC2, ECS, S3, RDS, Mongo Atlas, Lambda, VPC, ALB, Gateway, Cognito, WAF, IAM, Amplify, CloudFormation, etc.).
Strong experience with Docker & Kubernetes for container orchestration and management.
Hands-on experience with infrastructure as code (IaC) tools like Terraform, CloudFormation, or Pulumi.
Expertise in system monitoring and logging tools (Prometheus, Grafana, ELK Stack, Datadog, AWS CloudWatch).
Proficiency in scripting languages (Bash, Python, or Go) for automation and infrastructure management.
Experience with CI/CD pipelines using Jenkins, AWS CodePipeline, GitHub Actions.
Knowledge of networking, security best practices, and system performance tuning.
Experience with setting and enforcing SLAs for DevOps teams.
Strong problem-solving skills and ability to work in a fast-paced environment.

Preferred Skills:

Thorough Experience with AWS Infrastructure.
Knowledge of serverless architectures and event-driven computing.
Experience with configuration management tools (Ansible, Chef, Puppet).
Background in database administration (PostgreSQL, MySQL, or NoSQL databases).

Sr System Reliability Engineer (Application Support + Automation)

Pune, Maharashtra Fulcrum Digital

Posted today

Job Description

Who are we
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.

The Role

·Plan, manage, and oversee all aspects of a Production Environment

·Define strategies for application performance monitoring and optimization in the production environment.

·Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.

·Support deployment of code into multiple lower environments. Supporting current processes with an emphasis on automating everything as soon as possible.

·Design, develop and standardize Monitoring and Alerting mechanism for the supported applications.

·Take a holistic approach to problem solving by connecting the dots during a production event across the various technology stacks that make up the platform, to optimize mean time to recover.

·Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.

·Analyse ITSM activities of the platform and provide feedback loop to development teams on operational gaps or resiliency concerns.

·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.

·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.

·Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.

·Work with a global team spread across tech hubs in multiple geographies and time zones.

·Ability to share knowledge and explain processes and procedures to others.

·Able to perform on-call duties on a rotational basis.

·Occasional off hours work required.

Requirements

Skills –

Must Have

Linux

Shell Scripting

ITIL / ITSM

SQL

Application Troubleshooting

Any Monitoring tool (Preferred Splunk/Dynatrace)

Jenkins - CI/CD

Groovy Scripting/YAML

Git basics/Bitbucket

Kubernetes

Even Framework architecture

Good To Have

Even Framework architecture

Ansible/Chef (Basic)

Lead System Reliability Engineer (Application Support + Automation)

Pune, Maharashtra Fulcrum Digital

Posted today

Job Description

Who are we:
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.

The Role: