2,330 System Reliability jobs in India
System Reliability Engineer
Posted today
Job Description
We're on the lookout for a passionate and exceptional reliability engineer to join our dynamic team and help us transform the homecare industry. Rally with us in creating meaningful experiences for our hyper-growth healthcare startup.
Why ShiftCare?
We're not just optimising resources; we're enhancing experiences. Our purpose-built solution is changing the game for support providers in Australia and North America, making care accessible and affordable for all.
About this role:
- Enjoy ownership and responsibility, with a bias towards identifying problems and proposing and implementing solutions.
- Strong experience with Ruby on Rails, especially in production SaaS systems.
- Deep knowledge of background job processing (Sidekiq or similar), caching, and distributed systems.
- Proven experience improving CI/CD pipelines; we currently use CircleCI but do not rule out a migration.
- Comfortable designing and improving observability stacks (New Relic, Datadog, Honeycomb, etc.).
- Experience building resilient systems — retries, back-offs, queueing, circuit breakers, graceful degradation, kill switches, isolation of workloads, etc.
- Strong focus on developer ergonomics and reliability culture.
- Bias toward action and delivering tools that improve system behaviour and developer happiness.
What you’ll do
- Own and improve our CI/CD pipelines (CircleCI), reducing deploy times and failure rates.
- Build reliable retry/back-off mechanisms for critical user workflows.
- Design and implement observability tooling, including synthetic checks, smoke tests, etc.
- Help architect and implement failover and fallback mechanisms for critical vendors and workflows.
- Work with Support to build debug tooling and dashboards that empower non-engineers.
- Collaborate with engineering to define and template runbooks, kill switches, and disaster mitigation patterns.
- Champion performance tuning.
System Reliability Engineer (Big Data)
Posted 100 days ago
Job Description
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, health care, and manufacturing.
The Role
·Plan, manage, and oversee all aspects of a Production Environment for Big Data Platforms.
·Define strategies for Application Performance Monitoring, Optimization in Prod environment.
·Respond to Incidents and improve the platform based on feedback and measure the reduction of incidents over time.
·Ensure that batch production scheduling and processes are accurate and timely.
·Able to create and execute queries to big data platform and relational data tables to identify process issues or to perform mass updates, preferred.
·Perform ad hoc requests from users such as data research, file manipulation/transfer, research of process issues, etc.
·Take a holistic approach to problem solving, by connecting the dots during a production event through the various technology stack that makes up the platform, to optimize mean time to recover.
·Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
·Analyze ITSM activities of the platform and provide feedback loop to development teams on operational gaps or resiliency concerns.
·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
·Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.
·Work with a global team spread across tech hubs in multiple geographies and time zones.
·Ability to share knowledge and explain processes and procedures to others.
Requirements
·Experience in Linux and knowledge of ITSM/ITIL.
·Experience in Big Data technologies (Hadoop, Spark, NiFi, Impala).
·2+ years of experience in running Big Data production systems.
·Good to have: experience in industry-standard CI/CD tools like Git/BitBucket, Jenkins, Maven.
·Solid grasp of SQL or Oracle fundamentals.
·Experience with scripting, pipeline management, and software design.
·Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
·Ability to help debug and optimize code and automate routine tasks.
·Ability to support many different stakeholders. Experience in dealing with difficult situations and making decisions with a sense of urgency is needed.
·Appetite for change and pushing the boundaries of what can be done with automation.
·Experience in working across development, operations, and product teams to prioritize needs and to build relationships is a must.
·Experience designing and implementing an effective and efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is desired.
·Good handle on Change Management and Release Management aspects of software.
Lead System Reliability Engineer (Application Support + Automation)
Posted today
Job Description
Who are we:
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.
The Role:
·Plan, manage, and oversee all aspects of a Production Environment
·Define strategies for Application Performance Monitoring, Optimization in Prod environment
·Respond to Incidents and improve the platform based on feedback and measure the reduction of incidents over time.
·Support deployment of code into multiple lower environments. Supporting current processes with an emphasis on automating everything as soon as possible.
·Design, develop and standardize Monitoring and Alerting mechanism for the supported applications.
·Take a holistic approach to problem solving, by connecting the dots during a production event through the various technology stack that makes up the platform, to optimize mean time to recover.
·Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
·Analyse ITSM activities of the platform and provide feedback loop to development teams on operational gaps or resiliency concerns.
·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
·Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.
·Work with a global team spread across tech hubs in multiple geographies and time zones.
·Ability to share knowledge and explain processes and procedures to others.
·Able to perform on-call duties on a rotational basis.
·Occasional off hours work required.
Requirements
Skills – Must Have:
·Linux
·Shell Scripting (basic)
·ITIL / ITSM
·SQL
·Application Troubleshooting
·Any Monitoring tool (Preferred Splunk/Dynatrace)
·Jenkins - CI/CD (basic)
·Groovy Scripting/Yaml (basic)
·Git/Bitbucket (basic)
·Networks – F5, Load Balancers, HSM, Security Keys, SSL/TLS Certificates - (knowledge/experience)
·ISO8583 / ISO20022 (knowledge/experience)
Good To Have:
·Payments Flows, Switching, Settlements, Authorisation flows.
·Even Framework architecture
·Ansible/Chef (Basic)
Sr System Reliability Engineer (Application Support + Automation)
Posted today
Job Description
Who are we:
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.
The Role
·Plan, manage, and oversee all aspects of a Production Environment
·Define strategies for Application Performance Monitoring, Optimization in Prod environment
·Respond to Incidents and improve the platform based on feedback and measure the reduction of incidents over time.
·Support deployment of code into multiple lower environments. Supporting current processes with an emphasis on automating everything as soon as possible.
·Design, develop and standardize Monitoring and Alerting mechanism for the supported applications.
·Take a holistic approach to problem solving, by connecting the dots during a production event through the various technology stack that makes up the platform, to optimize mean time to recover.
·Engage in and improve the whole lifecycle of services—from inception and design, through deployment, operation and refinement.
·Analyse ITSM activities of the platform and provide feedback loop to development teams on operational gaps or resiliency concerns.
·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
·Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.
·Work with a global team spread across tech hubs in multiple geographies and time zones.
·Ability to share knowledge and explain processes and procedures to others.
·Able to perform on-call duties on a rotational basis.
·Occasional off hours work required.
Requirements
Skills – Must Have:
·Linux
·Shell Scripting
·ITIL / ITSM
·SQL
·Application Troubleshooting
·Any Monitoring tool (Preferred Splunk/Dynatrace)
·Jenkins - CI/CD
·Groovy Scripting/Yaml
·Git (basic)/Bitbucket
·Kubernetes
·Even Framework architecture
Good To Have
·Even Framework architecture
·Ansible/Chef (Basic)
Reliability system engineer
Posted today
Job Description
such as EC2, S3, Route 53, and RDS, and more niche services, such as Organizations, SageMaker, and GuardDuty.
• Experience with Infrastructure as Code and automation / configuration management using either CloudFormation or Terraform to define infrastructure standards for cloud services.
• Ability to use various technologies to host container services and registries, continuous deployment and continuous integration services, code repositories, and security vulnerability identification to support cloud infrastructure. Example technologies include AWS ECS, Kubernetes, Docker, Jenkins, GoCD, AWS ECR, Artifactory, Twistlock, and Netsparker.
• Good understanding of programming languages such as PHP, Python, Perl, and Ruby.
• Experience analyzing solutions components, understanding systems integration challenges, and identifying technology gaps in current details that must be
Hardware System Test - Reliability, Availability
Posted today
Job Description
Software Developers at IBM are the backbone of our strategic initiatives to design, code, test, and provide industry-leading solutions that make the world run today - planes and trains take off on time, bank transactions complete in the blink of an eye and the world remains safe because of the work our software developers do. Whether you are working on projects internally or for a client, software development is critical to the success of IBM and our clients worldwide. At IBM, you will use the latest software development tools, techniques and approaches and work with leading minds in the industry to build solutions you can be proud of.
Your Role and Responsibilities
- Bug analysis, debug, and triage
- Knowledge of RAS
- Engaged in frontline test and debug related to system errors.
- Hands-on experience with high-speed interconnect and bus technology (PCIe, CXL)
- Linux & Python scripting knowledge is essential.
What you will be doing:
- Work closely with system architects, hardware and software design engineers, and project managers to maintain good understanding of the design and ongoing changes.
- Build a detailed and rigorous test plan based on the system design, and look for improvements, tools, and power server products.
- Drive system integration and verification tests along with other validation engineers to enable rapid iteration of the server design.
- Work closely with Validation leads to conduct PHY layer debug and measurements on prototype hardware.
- Drive debug and resolution of issues found. Influence future system designs to avoid them.
- Investigate field issues.
Required Technical and Professional Expertise
- 3 - 5 years of system/firmware validation experience
- Must have experience in Scripting languages (i.e., Python, Perl) for bench automation.
- Experience with complicated systems including silicon bring-up, board-level design/validation, or system design/validation.
- Experience with system level verification, IO interconnect, PCI-Express high-speed data switches.
- Strong circuit analysis and debugging skills, and experience with BMC & server management processors
- Knowledge of Linux operating system and networking.
Preferred Technical and Professional Expertise
- Experience in CPU interface testing with IBM POWER architecture knowledge
- Understanding of data center management tools, system mechanics
- A solid understanding of the electrical characteristics of IO interfaces such as I2C, JTAG, and SPI
About Business Unit
IBM Systems helps IT leaders think differently about their infrastructure. IBM servers and storage are no longer inanimate - they can understand, reason, and learn so our clients can innovate while avoiding IT issues. Our systems power the world’s most important industries and our clients are the architects of the future. Join us to help build our leading-edge technology portfolio designed for cognitive business and optimized for cloud computing.
Being an IBMer means you’ll be able to learn and develop yourself and your career, you’ll be encouraged to be courageous and experiment every day, all whilst having continuous trust and support in an environment where everyone can thrive whatever their personal or professional background.
Our IBMers are growth minded, always staying curious, open to feedback and learning new information and skills to constantly transform themselves and our company. They are trusted to provide on-going feedback to help other IBMers grow, as well as collaborate with colleagues keeping in mind a team focused approach to include different perspectives to drive exceptional outcomes for our customers. The courage our IBMers have to make critical decisions every day is essential to IBM becoming the catalyst for progress, always embracing challenges with resources they have to hand, a can-do attitude and always striving for an outcome focused approach within everything that they do.
Are you ready to be an IBMer?
DevOps Site Reliability system eng...
Posted today
Job Description
such as EC2, S3, Route 53, and RDS, and more niche services, such as Organizations, SageMaker, and GuardDuty.
• Experience with Infrastructure as Code and automation / configuration management using either CloudFormation or Terraform to define infrastructure standards for cloud services.
• Ability to use various technologies to host container services and registries, continuous deployment and continuous integration services, code repositories, and security vulnerability identification to support cloud infrastructure. Example technologies include AWS ECS, Kubernetes, Docker, Jenkins, GoCD, AWS ECR, Artifactory, Twistlock, and Netsparker.
• Good understanding of programming languages such as PHP, Python, Perl, and Ruby.
• Experience analyzing solutions components, understanding systems integration challenges, and identifying technology gaps in current details that must be
Power System Studies | Lead Electrical Maintenance & Reliability
Posted 1 day ago
Job Description
DIRECT Applicants - Please Submit CV to
Email : ;
Experience : 4 -12 Years
- Lead Electrical Engineer/ Electrical Engineer - Power System Studies
- Lead Electrical Engineer/ Electrical Engineer - Maintenance & Reliability
Must Have Skills
- Power System Modelling
- Power system Analysis (Must Have)
- Power system Analysis specific to Arc flash studies
- Function Testing of all the Electrical Equipment
- Transformers
- Motors
- Oil & Gas
- Petrochemical industry
- Electrical Maintenance
- Power system analysis (tools – ETAP, SKM)
- Erection
- Commissioning
- Reliability
- EHV
- MV
- IEEE(knowledge)
- Schematics Diagrams
- Electrical Equipment Layouts
- Cable schedule
- Selection and sizing of Electrical Equipment
- Transformers
- Motors
- Switch gears
- Generators
- Maintenance of all Electrical Equipment
- Transformers
- HT & LT Switchgears
- Circuit breakers
- MCC panels
- Batteries
- Generators
- Motors