2,018 System Reliability jobs in India
System Reliability Engineer
Posted today
Job Description
We have an immediate opportunity for an SRE (Senior Site Reliability Engineer) with 5 to 9 years of experience.
Synechron – Bangalore
Job Role: SRE (Senior Site Reliability Engineer)
Job Location: Bangalore
Notice Period: Within 30 days
About Synechron
We began life in 2001 as a small, self-funded team of technology specialists. Since then, we’ve grown our organization to 14,500+ people, across 58 offices, in 21 countries, in key global markets.
Innovative tech solutions for business
We're now a leading global digital consulting firm, providing innovative technology solutions for business. As a trusted partner, we're always at the forefront of change as we lead digital optimization and modernization journeys for our clients.
Customized end-to-end solutions
Our expertise in AI, Consulting, Data, Digital, Cloud & DevOps and Software Engineering, delivers customized, end-to-end solutions that drive business value and growth.
For more information on the company, please visit our website or LinkedIn community.
Job Description
Base Skills:
Performance Testing & Engineering, Scalability, Availability. Experience with load testing tools (JMeter/LoadRunner) and with any APM tool (Dynatrace/New Relic/AppDynamics). Worked on performance optimization of applications and infrastructure.
Experience in client-side performance engineering with a focus on mobile (Android/iOS) and web application optimization and tuning (optional).
Strong understanding of distributed systems, cloud platforms (AWS, Azure, or GCP), and microservices architecture.
Monitoring, observability, OpenTelemetry: using tools like Splunk, AppDynamics, Prometheus, Fluentd, ELK (Elasticsearch, Logstash, Kibana), TIG (Telegraf, InfluxDB, Grafana), Dynatrace, and New Relic.
Common Soft Skills
Experience independently executing a customer-facing role: understanding the SRE requirements, assisting in building the team, and driving the implementation.
Experience in establishing and documenting performance baselines, thresholds, and SLAs for critical applications.
Optional
Experience in developing capacity planning models and working with stakeholders to forecast future scalability requirements
Experience in designing high availability architectures with failover and recovery mechanisms
Hands-on experience of working on RFP/proposals
Excellent communication and business presentation skills
Must Have skills:
System reliability and chaos engineering to proactively identify and mitigate potential system vulnerabilities.
Concepts of SLI, SLO, and SLA; ability to define SLIs (Service Level Indicators), SLOs (Service Level Objectives), and error budgets.
Application debugging and support; understanding of REST APIs.
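For candidates brushing up on the SLI, SLO, and error-budget concepts listed above, the arithmetic can be sketched as follows. This is a minimal illustration and not part of the employer's tooling: the function name `error_budget`, the request-based availability SLI, and the sample figures are all assumptions made for the example.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarise error-budget use for a request-based availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    All names and numbers here are hypothetical, for illustration only.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0 or not 0 <= failed_requests <= total_requests:
        raise ValueError("request counts are inconsistent")

    # SLI: the measured fraction of successful requests.
    sli = (total_requests - failed_requests) / total_requests
    # Error budget: the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests
    # Fraction of the budget already spent (values above 1.0 mean the SLO is breached).
    budget_consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "slo_met": sli >= slo_target,
    }


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
report = error_budget(0.999, 1_000_000, 400)
print(report)
```

In this sample window the SLO tolerates roughly 1,000 failed requests, so 400 failures spend about 40% of the budget. An SLA is typically a contractual commitment set looser than the internal SLO.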
Optional Skills
Software engineering and development skills: .NET, Go, Python, C++, Ruby, or Java, or software delivery platforms such as Puppet, Chef, Ansible, and/or Spinnaker. Being able to instrument services and write exporters, collectors, etc.
Experience with Kubernetes. Good experience with automation across applications/services and infrastructure management.
QUALIFICATION:
Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
If you find this opportunity interesting, kindly share your updated profile on
with the details below (mandatory):
Total Experience
Experience in Site Reliability
Current CTC
Expected CTC
Notice period
Current Location
Available for Face-to-Face interview?
Ready to relocate to Bangalore?
Have you interviewed with Synechron before? If yes, when?
Regards,
Pravin Chauhan
Hp & WhatsApp #
System Reliability Engineer (Big Data)
Posted 148 days ago
Job Description
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, health care, and manufacturing.
The Role
·Plan, manage, and oversee all aspects of a Production Environment for Big Data Platforms.
·Define strategies for Application Performance Monitoring and Optimization in the Prod environment.
·Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.
·Ensure that batch production scheduling and processes are accurate and timely.
·Able to create and execute queries against the big data platform and relational data tables to identify process issues or to perform mass updates (preferred).
·Perform ad hoc requests from users, such as data research, file manipulation/transfer, and research of process issues.
·Take a holistic approach to problem solving by connecting the dots during a production event across the various technology stacks that make up the platform, to optimize mean time to recover.
·Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
·Analyze ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
·Support services before they go live through activities such as system design consulting, capacity planning, and launch reviews.
·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
·Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
·Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
·Work with a global team spread across tech hubs in multiple geographies and time zones.
·Ability to share knowledge and explain processes and procedures to others.
Requirements
·Experience in Linux and knowledge of ITSM/ITIL.
·Experience in Big Data technologies (Hadoop, Spark, NiFi, Impala).
·2+ years of experience in running Big Data production systems.
·Good to have: experience with industry-standard CI/CD tools like Git/Bitbucket, Jenkins, and Maven.
·Solid grasp of SQL or Oracle fundamentals.
·Experience with scripting, pipeline management, and software design.
·Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
·Ability to help debug and optimize code and automate routine tasks.
·Ability to support many different stakeholders. Experience in dealing with difficult situations and making decisions with a sense of urgency is needed.
·Appetite for change and pushing the boundaries of what can be done with automation.
·Experience in working across development, operations, and product teams to prioritize needs and to build relationships is a must.
·Experience designing and implementing an effective and efficient CI/CD flow that gets code from dev to prod with high quality and minimal manual effort is desired.
·Good handle on Change Management and Release Management aspects of software.
Lead DevOps and System Reliability Engineer
Posted today
Job Description
Glowingbud is a rapidly growing eSIM services platform that simplifies connectivity with powerful APIs, robust B2B and B2C interfaces, and seamless integrations with Telna. Our platform enables global eSIM lifecycle management, user onboarding, secure payment systems, and scalable deployments. Recently acquired by Telna, we are expanding our product offerings and team to meet increasing demand and innovation goals.
Job Summary
We are seeking a highly experienced Senior DevOps Engineer with 10+ years of expertise in cloud infrastructure, automation, and system reliability. The ideal candidate will be responsible for maintaining scalable AWS-based environments, implementing robust CI/CD pipelines, optimizing system performance, and ensuring high availability of critical applications. This role requires deep expertise in Docker, Kubernetes, Infrastructure as Code (IaC), and system monitoring. The candidate will also be responsible for documenting system architecture, setting SLAs, and leading DevOps best practices across teams. If you thrive in a fast-paced, collaborative environment and are passionate about DevOps, we'd love to hear from you!
Key Responsibilities:
- Infrastructure Management: Design, implement, and maintain scalable cloud infrastructure using AWS services.
- System Documentation & Diagrams: Maintain up-to-date system diagrams, architecture documentation, and operational procedures.
- Containerization & Orchestration: Deploy and manage containerized applications using Docker and Kubernetes.
- System Maintenance & Optimization: Ensure high availability, performance tuning, and cost optimization of cloud and on-premise infrastructure.
- Monitoring & Observability: Implement detailed system monitoring, logging, and alerting using tools like Datadog, Prometheus, Grafana, ELK stack, or AWS CloudWatch.
- Security & Compliance: Enforce security best practices, conduct regular audits, and ensure adherence to compliance standards.
- CI/CD Pipeline Management: Build and maintain automated deployment pipelines for seamless application releases.
- Incident Response & SLA Management: Define SLAs, monitor system performance, and establish an efficient incident response strategy.
- Collaboration & Leadership: Work closely with development, QA, and operations teams to improve reliability, scalability, and efficiency.
Qualifications:
- 7+ years of experience in DevOps, Site Reliability Engineering (SRE), or Cloud Infrastructure roles.
- Expert knowledge of AWS Services (EC2, ECS, S3, RDS, Mongo Atlas, Lambda, VPC, ALB, Gateway, Cognito, WAF, IAM, Amplify, CloudFormation, etc.).
- Strong experience with Docker & Kubernetes for container orchestration and management.
- Hands-on experience with infrastructure as code (IaC) tools like Terraform, CloudFormation, or Pulumi.
- Expertise in system monitoring and logging tools (Prometheus, Grafana, ELK Stack, Datadog, AWS CloudWatch).
- Proficiency in scripting languages (Bash, Python, or Go) for automation and infrastructure management.
- Experience with CI/CD pipelines using Jenkins, AWS CodePipeline, GitHub Actions.
- Knowledge of networking, security best practices, and system performance tuning.
- Experience with setting and enforcing SLAs for DevOps teams.
- Strong problem-solving skills and ability to work in a fast-paced environment.
Preferred Skills:
- Thorough experience with AWS infrastructure.
- Knowledge of serverless architectures and event-driven computing.
- Experience with configuration management tools (Ansible, Chef, Puppet).
- Background in database administration (PostgreSQL, MySQL, or NoSQL databases).
Sr System Reliability Engineer (Application Support + Automation)
Posted today
Job Description
Who we are
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.
The Role
·Plan, manage, and oversee all aspects of a Production Environment
·Define strategies for Application Performance Monitoring and Optimization in the Prod environment.
·Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.
·Support deployment of code into multiple lower environments. Support current processes with an emphasis on automating everything as soon as possible.
·Design, develop, and standardize monitoring and alerting mechanisms for the supported applications.
·Take a holistic approach to problem solving by connecting the dots during a production event across the various technology stacks that make up the platform, to optimize mean time to recover.
·Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
·Analyse ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
·Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.
·Work with a global team spread across tech hubs in multiple geographies and time zones.
·Ability to share knowledge and explain processes and procedures to others.
·Able to perform on-call duties on a rotational basis.
·Occasional off hours work required.
Requirements
Skills – Must Have
·Linux
·Shell Scripting
·ITIL / ITSM
·SQL
·Application Troubleshooting
·Any Monitoring tool (Preferred Splunk/Dynatrace)
·Jenkins - CI/CD
·Groovy Scripting/YAML
·Git (basic)/Bitbucket
·Kubernetes
·Even Framework architecture
Good To Have
·Even Framework architecture
·Ansible/Chef (Basic)
Lead System Reliability Engineer (Application Support + Automation)
Posted today
Job Description
Who we are:
Fulcrum Digital is an agile and next-generation digital accelerating company providing digital transformation and technology services right from ideation to implementation. These services have applicability across a variety of industries, including banking & financial services, insurance, retail, higher education, food, healthcare, and manufacturing.
The Role:
·Plan, manage, and oversee all aspects of a Production Environment
·Define strategies for Application Performance Monitoring and Optimization in the Prod environment.
·Respond to incidents, improve the platform based on feedback, and measure the reduction of incidents over time.
·Support deployment of code into multiple lower environments. Support current processes with an emphasis on automating everything as soon as possible.
·Design, develop, and standardize monitoring and alerting mechanisms for the supported applications.
·Take a holistic approach to problem solving by connecting the dots during a production event across the various technology stacks that make up the platform, to optimize mean time to recover.
·Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
·Analyse ITSM activities of the platform and provide a feedback loop to development teams on operational gaps or resiliency concerns.
·Support services before they go live through activities such as system design consulting, capacity planning and launch reviews.
·Support the application CI/CD pipeline for promoting software into higher environments through validation and operational gating, and lead in DevOps automation and best practices.
·Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
·Scale systems sustainably through mechanisms like automation and evolving systems by pushing for changes that improve reliability and velocity.
·Work with a global team spread across tech hubs in multiple geographies and time zones.
·Ability to share knowledge and explain processes and procedures to others.
·Able to perform on-call duties on a rotational basis.
·Occasional off hours work required.
Requirements
Skills – Must Have:
·Linux
·Shell Scripting (basic)
·ITIL / ITSM
·SQL
·Application Troubleshooting
·Any Monitoring tool (Preferred Splunk/Dynatrace)
·Jenkins - CI/CD (basic)
·Groovy Scripting/YAML (basic)
·Git/Bitbucket (basic)
·Networks – F5, Load Balancers, HSM, Security Keys, SSL/TLS Certificates - (knowledge/experience)
·ISO8583 / ISO20022 (knowledge/experience)
Good To Have:
·Payments Flows, Switching, Settlements, Authorisation flows.
·Even Framework architecture
·Ansible/Chef (Basic)
Reliability system engineer
Posted today
Job Description
such as EC2, S3, Route 53, and RDS, and more niche services, such as Organizations, SageMaker, and GuardDuty.
• Experience with Infrastructure as Code and automation/configuration management using either CloudFormation or Terraform to define infrastructure standards for cloud services.
• Ability to use various technologies to host container services and registries, continuous deployment and continuous integration services, code repositories, and security vulnerability identification to support cloud infrastructure. Example technologies include AWS ECS, Kubernetes, Docker, Jenkins, GoCD, AWS ECR, Artifactory, Twistlock, and Netsparker.
• Good understanding of programming languages such as PHP, Python, Perl, and Ruby.
• Experience analyzing solutions components, understanding systems integration challenges, and identifying technology gaps in current details that must be
DevOps Site Reliability system eng...
Posted today
Job Description
such as EC2, S3, Route 53, and RDS, and more niche services, such as Organizations, SageMaker, and GuardDuty.
• Experience with Infrastructure as Code and automation/configuration management using either CloudFormation or Terraform to define infrastructure standards for cloud services.
• Ability to use various technologies to host container services and registries, continuous deployment and continuous integration services, code repositories, and security vulnerability identification to support cloud infrastructure. Example technologies include AWS ECS, Kubernetes, Docker, Jenkins, GoCD, AWS ECR, Artifactory, Twistlock, and Netsparker.
• Good understanding of programming languages such as PHP, Python, Perl, and Ruby.
• Experience analyzing solutions components, understanding systems integration challenges, and identifying technology gaps in current details that must be
RMS (Reliability Monitoring System) Technical Expert – OSAT
Posted 3 days ago
Job Description
Tata Electronics (a wholly owned subsidiary of Tata Sons Pvt. Ltd.) is building India’s first AI-enabled state-of-the-art Semiconductor Foundry. This facility will produce chips for applications such as power management IC, display drivers, microcontrollers (MCU) and high-performance computing logic, addressing the growing demand in markets such as automotive, computing and data storage, wireless communications and artificial intelligence.
Tata Electronics is a subsidiary of the Tata Group. The Tata Group operates in more than 100 countries across six continents, with the mission 'To improve the quality of life of the communities we serve globally, through long term stakeholder value creation based on leadership with Trust.'
Role Summary
The RMS Technical Expert will be responsible for the design, deployment, and optimization of Reliability Monitoring Systems in an OSAT (Outsourced Semiconductor Assembly & Test) manufacturing environment. This role requires deep technical expertise in equipment data acquisition, reliability analytics, system integration, and continuous improvement of RMS to support yield, quality, and equipment performance targets.
Key Responsibilities
1. System Development & Maintenance
- Configure, customize, and maintain the RMS platform to capture, process, and analyze equipment health and performance data.
- Develop and optimize RMS data models, rules, and alerts for proactive maintenance and reliability improvement.
- Work with automation engineers to ensure accurate equipment connectivity via SECS/GEM, OPC-UA, MQTT, or other protocols.
2. Integration & Data Flow Management
- Integrate RMS with MES, SPC, FDC, and other manufacturing systems for end-to-end visibility.
- Ensure data quality, accuracy, and real-time availability for engineering and operations teams.
- Collaborate with IT teams for database management, server optimization, and cybersecurity compliance.
3. Reliability & Performance Analytics
- Develop dashboards, reports, and analytical tools to monitor equipment reliability KPIs (MTBF, MTTR, OEE).
- Use RMS insights to recommend and implement preventive/corrective actions to improve tool uptime and process stability.
- Partner with process and maintenance engineers to translate RMS data into actionable reliability programs.
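The reliability KPIs named above (MTBF, MTTR, OEE) follow from a few simple ratios, sketched below for reference. This is a generic illustration and not the RMS platform's implementation: the `ToolPeriod` record, its field names, and the sample numbers are hypothetical. The sketch also assumes all downtime is repair time, so availability is taken as MTBF / (MTBF + MTTR); a production OEE calculation usually measures availability against planned production time.

```python
from dataclasses import dataclass


@dataclass
class ToolPeriod:
    """Hypothetical summary of one tool over a reporting period."""
    uptime_hours: float   # time the tool was running
    repair_hours: float   # time spent restoring the tool after failures
    failures: int         # number of failures in the period
    performance: float    # actual throughput / ideal throughput (0..1)
    quality: float        # good units / total units (0..1)


def reliability_kpis(period: ToolPeriod) -> dict:
    """Compute MTBF, MTTR, availability, and OEE from period totals."""
    if period.failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF/MTTR")

    mtbf = period.uptime_hours / period.failures   # mean time between failures
    mttr = period.repair_hours / period.failures   # mean time to repair
    availability = mtbf / (mtbf + mttr)            # assumes downtime == repair time
    oee = availability * period.performance * period.quality

    return {"mtbf": mtbf, "mttr": mttr, "availability": availability, "oee": oee}


# Hypothetical week: 160 h running, 8 h of repairs across 4 failures.
kpis = reliability_kpis(
    ToolPeriod(uptime_hours=160.0, repair_hours=8.0, failures=4,
               performance=0.90, quality=0.98)
)
print(kpis)
```

With these sample figures MTBF is 40 hours, MTTR is 2 hours, and OEE comes out at about 0.84.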
4. Continuous Improvement & Innovation
- Drive enhancements in RMS through advanced analytics, AI/ML models, and predictive maintenance algorithms.
- Identify opportunities to automate RMS workflows and reporting.
- Benchmark RMS performance against industry best practices in semiconductor manufacturing.
Qualifications & Skills
Education:
- Bachelor’s in Electronics, Instrumentation, Computer Science, or related engineering discipline.
- Master’s degree or certifications in manufacturing systems or reliability engineering preferred.
Experience:
- 5–8 years of hands-on experience with RMS, FDC, or equipment monitoring systems in a semiconductor or electronics manufacturing environment.
- Strong understanding of OSAT operations, tool reliability, and equipment maintenance processes.
Technical Skills:
- Expertise in RMS platforms (e.g., Camline, BISTel, PEER Group, Applied E3, or custom solutions).
- Proficiency in industrial communication protocols (SECS/GEM, OPC-UA, MQTT).
- Strong skills in SQL, data visualization tools (Power BI, Tableau), and scripting (Python, JavaScript).
- Experience with AI/ML-based predictive maintenance is an advantage.
Soft Skills:
- Strong analytical and problem-solving mindset.
- Effective communication and cross-functional collaboration skills.
- Ability to work in a fast-paced manufacturing environment with tight deadlines.