2,330 System Reliability jobs in India

Senior Analyst, Site Reliability Engineering

Bengaluru, Karnataka Hudson's Bay Company

Posted today

Job Description

Saks Cloud Services is looking for a Senior Analyst to join the Site Reliability Engineering (SRE) team. The ideal candidate for this role is outgoing, obsessed with customer service, and has strong analytical and communication skills. This candidate should also strive for continuous improvement, be enthusiastic about new ideas, and enjoy opportunities to “think outside the box”.

Position: Sr. Analyst SRE

What this position is all about

The successful candidate will primarily identify and analyze technical problems in systems and applications across all supported divisions. Work closely with cross-functional IT Teams to troubleshoot and resolve application-related issues. Play a key role in implementing new solutions that improve the efficiency and effectiveness of the team and organization. The ideal candidate for this role should have a strong technical background and communicate effectively with technical and non-technical stakeholders.

Role description:

·5+ years of experience working within DevOps or SRE teams.

·3+ years of experience with any Cloud platforms (preferably AWS, Azure).

·Ability to program (structured and OO) with one or more high-level languages, such as JavaScript, Java, Python, and Bash.

·Participate in on-call rotations (PagerDuty/Opsgenie) and respond to incidents outside of regular hours.

·Run the production environment by monitoring availability and taking a holistic view of system health.

·Take part in building and implementing services that make IT and support teams better at their jobs.

·Improve reliability, quality, and time-to-market of our suite of software solutions.

·Measure and optimize system performance to push our capabilities forward, get ahead of customer needs, and innovate to improve continually.

·Validate the NFR/SLx with production logs or business analytics.

·Conduct proof-of-concepts to showcase the benefit of the recommendation.

·Instrument the target environment to capture relevant monitoring metrics for analysis.

·Contribute to coaching SRE team members in core concepts and build a knowledge repository by adding point-of-view documents and blogs.

·Document the engineering strategy and analysis reports.

·Document every action so your findings turn into repeatable actions, and then into automation.

·Hands-on experience with Distributed Version Control Systems such as Git, AWS CodeCommit, or equivalent.

·Must have experience with Docker, Kubernetes, Terraform, and Ansible.

·Know your way around Linux and the Unix Shell.

·Experience or familiarity with ELK stack.

·Balance feature development speed and reliability with well-defined service level objectives.

·Monitor systems and telemetry of Salesforce Commerce Cloud and Salesforce Service Cloud for operational health in terms of site stability, reliability, and performance.

·Prioritize and develop automated administrative and operational tasks to continuously improve site stability, capacity, reliability, and performance.

·Provide active incident response support, investigate major problems, and ensure the timely and effective return to normal operations of the Digital Commerce and CRM platforms during major incidents.

·Provide periodic on-call support based on established 24/7/365 support schedules.

·Collaborate with Digital Development and QA teams to ensure that Production environments are deployment-ready in accordance with Change Management processes and the Digital release schedules.

·Support Development teams in the provision and configuration of lower environments, including CI/CD pipeline support.

·Support incident management and problem management efforts with root cause analysis to effectively identify and resolve issues related to platform reliability, stability, and performance through the careful analysis of telemetry data and system logs.

·Collaborate with Engineering and Project teams to perform production readiness assessments and ensure that proper controls and processes are in place.

·Support / execute production change management requests on behalf of the Digital Engineering teams.

·Evaluate and propose tools and techniques to improve operational activities.

·Support Development teams in the provision and configuration of lower environments.

Job Qualifications

Key Qualifications:

·5+ years of related work experience, preferably in SRE or DevOps-related fields.

·Understand customer business processes & transactions.

·Understand application architecture/design, analyze non-functional requirements, SLI/SLO.

·Independently troubleshoot performance, scalability, capacity, resilience & reliability issues & correlate to application code & configurations.

·Participate in code, design, and architecture reviews and ensure application reliability goals are met.

·Strong troubleshooting, analytical, and problem-solving skills.

·Strong verbal and written communication skills.

·Experience in the administration and support of Digital Retail Platforms, e.g. Salesforce CC, Shopify, Magento, IBM WebSphere Commerce, etc. is an asset.

·Experience with monitoring, logging & telemetry tools like New Relic, mPulse, Splunk, Nagios, SolarWinds, Prometheus, AWS CloudWatch, Datadog, etc.

·Experience with cloud infrastructure administration (e.g., AWS, GCP, Azure).

·Basic understanding of Networking, Content Delivery Networks (CDN, e.g. Akamai, Cloudflare), and SaaS solutions.

·Hands-on experience with scripting languages and in maintaining Automation frameworks (PowerShell, Python, Ruby, AWK, SED, Shell, etc.) to run health checks and self-healing capabilities for the platforms.

·Experience with automation and tools such as (but not limited to) GitHub Actions, Chef, Terraform, Ansible, etc.

·Experience with Web/development technologies (e.g., JavaScript, Node.js, React, HTML, XML, CSS, REST).

·Experience with ticketing and collaboration tools (e.g., JSM, Jira Work Management, ServiceNow).

·3+ years of SRE experience working on telemetry, observability, self-healing solutions, and platform automation.

Your Life and Career at Saks Cloud Services

·Be part of a world-class team; work adventurously; think and act like an owner-operator!

·Exposure to rewarding career advancement opportunities, from retail to supply chain, to digital or corporate.

·A culture that promotes a healthy, fulfilling work/life balance.

·Benefits package for all eligible full-time employees (including medical, vision, and dental).

·Amazing employee discount.

System Reliability Engineer

Prayagraj, Uttar Pradesh ShiftCare

Posted today

Job Description

We're on the lookout for a passionate and exceptional reliability engineer to join our dynamic team and help us transform the homecare industry. Rally with us in creating meaningful experiences for our hyper-growth healthcare startup.

Why ShiftCare?

We're not just optimising resources; we're enhancing experiences. Our purpose-built solution is changing the game for support providers in Australia and North America, making care accessible and affordable for all.

About this role:

Enjoy ownership and responsibility, with a bias towards identifying problems and proposing and implementing solutions.
Strong experience with Ruby on Rails, especially in production SaaS systems.
Deep knowledge of background job processing (Sidekiq or similar), caching, and distributed systems.
Proven experience improving CI/CD pipelines; we currently use CircleCI but have not ruled out a migration.
Comfortable designing and improving observability stacks (New Relic, Datadog, Honeycomb, etc.).
Experience building resilient systems — retries, back-offs, queueing, circuit breakers, graceful degradation, kill switches, isolation of workloads, etc.
Strong focus on developer ergonomics and reliability culture.
Bias toward action and delivering tools that improve system behaviour and developer happiness.

What you’ll do

Own and improve our CI/CD pipelines (CircleCI), reducing deploy times and failure rates.
Build reliable retry/back-off mechanisms for critical user workflows.
Design and implement observability tooling, including synthetic checks, smoke tests, etc.
Help architect and implement failover and fallback mechanisms for critical vendors and workflows.
Work with Support to build debug tooling and dashboards that empower non-engineers.
Collaborate with engineering to define and template runbooks, kill switches, and disaster mitigation patterns.
Champion performance tuning.

System Reliability Engineer (Big Data)

411057 Maval, Maharashtra Fulcrum Digital

Posted 100 days ago

Job Description

Permanent

Who are we

Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, health care, and manufacturing.

The Role

·Plan, manage, and oversee all aspects of a Production Environment for Big Data Platforms.

·Define strategies for Application Performance Monitoring, Optimization in Prod environment.

·Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.

·Ensure that batch production scheduling and processes are accurate and timely.

·Able to create and execute queries to big data platform and relational data tables to identify process issues or to perform mass updates, preferred.

·Perform ad hoc requests from users such as data research, file manipulation/transfer, research of process issues, etc.

·Take a holistic approach to problem solving, by connecting the dots during a production event through the various technology stack that makes up the platform, to optimize mean time to recover.

·Engage in and improve the whole lifecycle of services, from inception and design, through deployment, operation and refinement.

·Analyze ITSM activities of the platform and provide feedback loop to development teams on operational gaps or resiliency concerns.

·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.

·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.

·Maintain services once they are live by measuring and monitoring availability, latency and overall system health.

·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.

·Work with a global team spread across tech hubs in multiple geographies and time zones.

·Ability to share knowledge and explain processes and procedures to others.

Requirements

·Experience in Linux and knowledge of ITSM/ITIL.

·Experience in Big Data technologies (Hadoop, Spark, NiFi, Impala).

·2+ years of experience in running Big Data production systems.

·Good to have: experience in industry-standard CI/CD tools like Git/BitBucket, Jenkins, Maven.

·Solid grasp of SQL or Oracle fundamentals.

·Experience with scripting, pipeline management, and software design.

·Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.

·Ability to help debug and optimize code and automate routine tasks.

·Ability to support many different stakeholders. Experience in dealing with difficult situations and making decisions with a sense of urgency is needed.

·Appetite for change and pushing the boundaries of what can be done with automation.

·Experience in working across development, operations, and product teams to prioritize needs and to build relationships is a must.

·Experience designing and implementing an effective and efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is desired.

·Good handle on Change Management and Release Management aspects of Software.

Lead System Reliability Engineer (Application Support + Automation)

New

Pune, Maharashtra Fulcrum Digital

Posted today

Job Description

Who are we:
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.

The Role:

·Plan, manage, and oversee all aspects of a Production Environment

·Define strategies for Application Performance Monitoring, Optimization in Prod environment

·Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.

·Support deployment of code into multiple lower environments. Supporting current processes with an emphasis on automating everything as soon as possible.

·Design, develop and standardize Monitoring and Alerting mechanism for the supported applications.

·Take a holistic approach to problem solving, by connecting the dots during a production event through the various technology stack that makes up the platform, to optimize mean time to recover.

·Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.

·Analyse ITSM activities of the platform and provide feedback loop to development teams on operational gaps or resiliency concerns.

·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.

·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.

·Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.

·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.

·Work with a global team spread across tech hubs in multiple geographies and time zones.

·Ability to share knowledge and explain processes and procedures to others.

·Able to perform on-call duties on a rotational basis.

·Occasional off hours work required.

Requirements

Skills –

Must Have:

·Linux

·Shell Scripting (basic)

·ITIL / ITSM

·SQL

·Application Troubleshooting

·Any Monitoring tool (Preferred Splunk/Dynatrace)

·Jenkins - CI/CD (basic)

·Groovy Scripting/Yaml (basic)

·Git/Bitbucket (basic)

·Networks – F5, Load Balancers, HSM, Security Keys, SSL/TLS Certificates - (knowledge/experience)

·ISO8583 / ISO20022 (knowledge/experience)

Good To Have:

·Payments Flows, Switching, Settlements, Authorisation flows.

·Even Framework architecture

·Ansible/Chef (Basic)

Sr System Reliability Engineer (Application Support + Automation)

New

Pune, Maharashtra Fulcrum Digital

Posted today

Job Description

Who are we:
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.