Advanced Incident Management Services and Research and Development of Agentic AI for Autonomous Operations: NEC Technical Journal

Many companies face the challenge of expending significant effort on the operation and maintenance of existing IT systems, which limits their ability to leverage IT for management strategies and business growth. To address this issue, NEC offers the NEC BluStellar Scenario, "Improving business processes through operational DX in hybrid IT environments," drawing on expertise gained from in-house practice and accumulated knowledge from customer system deployments. This paper introduces part of these efforts, specifically NEC's advanced incident management services and the research and development of autonomous AI agents aimed at streamlining and automating IT system operations.

1. Introduction

In many companies, IT system departments are largely devoted to the operation and maintenance of incrementally expanded legacy system assets, resulting in insufficient investment in new areas. Recruitment challenges stemming from Japan’s declining birth rate further exacerbate the situation, making the promotion of standardized and automated operations increasingly vital to alleviating the operational burden.

To address these issues, NEC offers the NEC BluStellar Scenario, “Improving business processes through operational DX in hybrid IT environments,” a solution grounded in insights from our own practical experience and industry-specific expertise accumulated from building systems for our customers. Through this approach, NEC supports organizations in achieving operational digital transformation (DX) and becoming more resilient to change.

2. Introduction to Operational DX

2.1 Overview of operational DX with the NEC BluStellar Scenario

The NEC BluStellar Scenario for operational DX comprises four sub-scenarios tailored to different objectives and needs:

Operation standardization and automation: Eliminates dependence on individual skills and reduces labor by establishing guidelines and processes that comply with ITIL (Information Technology Infrastructure Library).
Asset and vulnerability management automation: Automates vulnerability detection across large numbers of systems and servers, enabling more rigorous management.
Advanced incident management: Automates monitoring, filtering, notification, problem management, and analysis, and further enhances these processes through the use of AI to create a cycle of autonomous operational improvement and efficiency.
Visualization: Uses dashboards so that on-site status can be confirmed by everyone at the same time and information can be shared swiftly with users.

2.2 Advanced incident management

This section provides a detailed overview of the advanced incident management sub-scenario within operational DX.

As outlined above, advanced incident management automates monitoring, filtering, notification, problem management, and analysis, and applies AI to these processes so that operations improve autonomously over time. An integrated platform for system monitoring centralizes and streamlines incident response. Advanced filtering definitions reduce unnecessary alerts among the large volume of monitoring messages. Centralized incident information and automatic ticket creation enable rapid collaboration among stakeholders and allow quick, incremental improvements with minimal impact on existing systems.

Two representative services supporting advanced incident management are introduced here. The first is WebSAM Automatic Message Call (AMC), a cloud-based notification service that automatically delivers only the essential information from a large volume of system alerts directly to the appropriate personnel. The second is WebSAM IT Process Management (ITPM), a SaaS-based IT service management tool that enables efficient response to and management of inquiries and incidents during system operation. (Note: WebSAM is known as MasterScope outside of Japan.)

Through the integration of AMC and ITPM, alert classification, notification, and ticket creation are all automated, enabling more efficient system operations (Fig. 1).

Fig. 1 WebSAM AMC-ITPM integration.
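
The flow of alert classification, notification, and ticket creation can be pictured with a minimal sketch. This is not the AMC or ITPM API; the `Alert` and `Ticket` types, the filter rules, and the routing table are all hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Alert:
    source: str
    severity: str  # "info", "warning", or "critical"
    message: str


@dataclass
class Ticket:
    ticket_id: int
    title: str
    assignee: str
    history: list[str] = field(default_factory=list)


# Hypothetical filter rules: an alert is forwarded only if every rule accepts it.
FilterRule = Callable[[Alert], bool]
RULES: list[FilterRule] = [
    lambda a: a.severity in {"warning", "critical"},  # drop informational noise
    lambda a: "heartbeat" not in a.message.lower(),   # drop routine heartbeat messages
]

# Hypothetical routing table from alert source to the contact to notify.
ROUTING = {"db": "dba-oncall", "web": "web-oncall"}


def process_alerts(alerts: list[Alert]) -> tuple[list[str], list[Ticket]]:
    """Filter alerts, then emit one notification and one ticket per surviving alert."""
    notifications: list[str] = []
    tickets: list[Ticket] = []
    for alert in alerts:
        if not all(rule(alert) for rule in RULES):
            continue  # filtered out as unnecessary
        assignee = ROUTING.get(alert.source, "ops-oncall")
        notifications.append(f"notify {assignee}: [{alert.severity}] {alert.message}")
        tickets.append(
            Ticket(len(tickets) + 1, alert.message, assignee,
                   [f"created from {alert.source} alert"])
        )
    return notifications, tickets
```

Keeping the filter rules as data separate from the processing loop reflects the idea that filtering definitions can be refined incrementally without changing the rest of the pipeline.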

2.3 AI utilization in ITPM

Generative AI is utilized in ITPM to provide the following functions:

Automatic generation of incident responses: AI automatically generates response messages based on information recorded in the incident ticket’s history.
Incident response review: When review criteria are set, AI verifies whether the responses meet these criteria.
Incident response history summarization: AI summarizes the actions taken, enabling users to easily understand how incidents were handled without having to track each individual action.

Looking ahead, further automation using AI is planned, including the following enhancements:

Suggestion of similar information: For user inquiries, AI automatically searches for and displays similar past cases and relevant knowledge. This helps standardize response quality, regardless of individual search skills or experience.
Assistance with investigation planning: Drawing on information suggested by AI, subtasks for incident resolution—including task description, deadlines, and responsible personnel—are automatically created and assigned. This lowers barriers to task delegation and enables more efficient workload distribution.
Enhanced integration with WebSAM Automatic Message Call (AMC): AI automates the summarization, analysis, visualization, and notification of the contents of tickets created by AMC. By promoting the semi-automation of situation assessment and countermeasure planning, this leads to greater efficiency and improved quality in responses.

3. Research on AI Agents for Autonomous Operations

The streamlining of operational tasks through generative AI is already underway, and it is anticipated that, by advancing collaboration between humans and AI agents, fully autonomous operations will be realized in the future. To further reduce the need for human intervention in operations, NEC is conducting research and development on Agentic AI capable of autonomously handling inquiries and incident investigation.

3.1 Automated Inquiry Response Agent

One type of AI that can be utilized for automating inquiry responses is chatbot technology based on large language models (LLMs), such as ChatGPT. While chatbots powered by LLMs are capable of generating responses that account for context and are natural in tone, they may be unable to correctly answer questions that require knowledge of specific products or services not included in their training data. To overcome this limitation, retrieval-augmented generation (RAG) has emerged as a promising solution. For automation of inquiry response, an effective use of RAG is to search for past inquiry cases that are similar to the user's current question and supply these cases as supplemental information. If the similar inquiry examples contain answers directly related to the user's question, the LLM is expected to generate an appropriate response.
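
As a rough illustration of this use of RAG, the sketch below ranks past inquiry cases by similarity to the current question and places the best matches in the prompt. The bag-of-words cosine similarity is a simple stand-in chosen for readability; the source does not state which retrieval method is used, and the function names and case format here are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve_similar(question: str, past_cases: list[dict], top_k: int = 2) -> list[dict]:
    """Return the top_k past cases whose inquiry text is most similar to the question."""
    q = tokenize(question)
    ranked = sorted(past_cases, key=lambda c: cosine(q, tokenize(c["inquiry"])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, cases: list[dict]) -> str:
    """Assemble the supplemental cases and the user's question into one LLM prompt."""
    context = "\n".join(f"Q: {c['inquiry']}\nA: {c['answer']}" for c in cases)
    return f"Past cases:\n{context}\n\nUser question: {question}\nAnswer:"
```

If a retrieved case already contains an answer that applies to the question, the LLM only has to adapt it, which is the situation the paragraph above describes.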

However, in actual inquiry operations, it is often necessary not only to reference similar past inquiry records, but also to consult documents such as manuals related to products and services. For this purpose, NEC has developed an AI agent for information integration and automation of inquiry response, which can leverage all relevant information sources as needed.

Fig. 2 shows the architecture of the developed Automated Inquiry Response Agent. To help prevent hallucinations, the agent uses Agentic AI-based RAG (Agentic RAG), which autonomously limits unnecessary searches of external information.

Fig. 2 Automated Inquiry Response Agent.

First, the user’s inquiry is input to the Similar Ticket Search Agent (Agent 1). Agent 1 retrieves related incident tickets based on the inquiry and appends that information to the prompt, which is then passed on to the Response Feasibility Decision-Making Agent (Agent 2). Agent 2 determines, based on the related ticket information included in the prompt, whether it has sufficient information to answer the user’s inquiry. If the information is insufficient (that is, when Agent 2 judges that additional information is required), the Related Document Search Agent (Agent 3) retrieves FAQs, help pages, manuals, and other relevant documents related to the inquiry from the document database and appends them to the LLM prompt. The LLM generates a response based on the final prompt and the user’s inquiry text, and returns the response to the user.
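
The three-agent flow described above can be sketched as follows. This is a simplified illustration, not NEC's implementation: the keyword matching, the feasibility check in `agent2_can_answer`, and the injected `llm` callable are all assumptions standing in for the real search and LLM-based decision components.

```python
from typing import Callable


def agent1_similar_tickets(inquiry: str, tickets: list[dict]) -> list[dict]:
    """Agent 1 (toy version): tickets sharing at least one word with the inquiry."""
    words = set(inquiry.lower().split())
    return [t for t in tickets if words & set(t["inquiry"].lower().split())]


def agent2_can_answer(inquiry: str, related: list[dict]) -> bool:
    """Agent 2 (stand-in): treat the information as sufficient if any related ticket has an answer."""
    return any(t.get("answer") for t in related)


def agent3_related_documents(inquiry: str, documents: list[str]) -> list[str]:
    """Agent 3 (toy version): documents sharing at least one word with the inquiry."""
    words = set(inquiry.lower().split())
    return [d for d in documents if words & set(d.lower().split())]


def answer_inquiry(inquiry: str, tickets: list[dict], documents: list[str],
                   llm: Callable[[str], str]) -> str:
    """Run Agent 1, let Agent 2 decide whether Agent 3 is needed, then call the LLM."""
    related = agent1_similar_tickets(inquiry, tickets)
    parts = [f"Ticket: {t['inquiry']} -> {t.get('answer', '')}" for t in related]
    if not agent2_can_answer(inquiry, related):
        # The document search runs only when ticket information is judged insufficient.
        parts += [f"Document: {d}" for d in agent3_related_documents(inquiry, documents)]
    prompt = "\n".join(parts + [f"Inquiry: {inquiry}"])
    return llm(prompt)
```

The point of the structure is the conditional step: the document search is skipped whenever the ticket information alone is judged sufficient, which keeps unneeded external text out of the prompt.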

NEC has evaluated this Automated Inquiry Response Agent using actual inquiries and responses exchanged in service support operations. Specifically, the initial inquiry submitted by a user and recorded in an incident ticket was used as the input, and the output from the Automated Inquiry Response Agent was compared with the actual response written in the ticket. The results as of September 2025 show that the system generates responses of comparable quality to those written manually when two conditions hold: the inquiry is complete and requires no further investigation or manual work, and similar examples and the required information exist within past incident tickets and external documents.

3.2 Incident Investigation Agent

To restore an IT system that has experienced an incident, it is essential to identify the root cause. By shortening the time required for this investigation, IT systems can be recovered more quickly. The cause identified at this stage does not necessarily have to be the ultimate root cause, but it must at least be sufficient to enable provisional measures that minimize the impact on business operations or services. For example, if frequent container restarts are causing the incident, the restarts themselves are a cause, but to achieve recovery, it is necessary to determine what is triggering these restarts.

NEC has developed an Incident Investigation Agent that autonomously identifies the causes of such incidents. This Agentic AI combines generative AI with log analysis technology. Starting from monitoring alarms or user-reported incidents, it investigates logs and resource consumption data from the components that make up the IT system to pinpoint the cause. The investigation process consists of two stages: narrowing down the scope of investigation and generating and verifying hypotheses about the cause of the incident.

First, in the narrowing-down stage, the agent uses the content of monitoring alarms or incident reports, configuration information of the target IT system, and the results of impact analysis from log analysis technology to limit the scope of the investigation. Next, in the hypothesis generation and verification stage, the agent estimates possible causes based on the investigation targets and the information used to narrow them down, then determines the investigation method. The agent then executes this method, and based on the additional information obtained, verifies whether the estimated cause is correct. This cycle of investigation is repeated, with the generative AI reflecting on the results and refining its hypotheses, until the final cause is identified.
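
The hypothesis generation and verification cycle can be pictured as a simple loop. In the sketch below, `propose` stands in for the generative AI and `run_check` for the execution of an investigation method; both, along with the `Hypothesis` type and the toy candidates, are hypothetical and only illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Hypothesis:
    cause: str  # suspected cause
    check: str  # name of the investigation method that would confirm it


# propose(alarm, evidence) stands in for the generative AI; None means no further ideas.
Propose = Callable[[str, list[str]], Optional[Hypothesis]]
# run_check(check) executes an investigation method and returns (confirmed, finding).
RunCheck = Callable[[str], tuple[bool, str]]


def investigate(alarm: str, propose: Propose, run_check: RunCheck,
                max_rounds: int = 5) -> tuple[Optional[str], list[str]]:
    """Repeat hypothesize -> check -> reflect until a cause is confirmed or rounds run out."""
    evidence: list[str] = []
    for _ in range(max_rounds):
        hypothesis = propose(alarm, evidence)
        if hypothesis is None:
            break
        confirmed, finding = run_check(hypothesis.check)
        evidence.append(finding)  # each finding informs the next proposal
        if confirmed:
            return hypothesis.cause, evidence
    return None, evidence


# Toy stand-ins for the container-restart example; a real agent would query an LLM and live data.
CANDIDATES = [
    Hypothesis("node failure", "node_status"),
    Hypothesis("memory limit exceeded (OOM kill)", "container_memory"),
]


def toy_propose(alarm: str, evidence: list[str]) -> Optional[Hypothesis]:
    return CANDIDATES[len(evidence)] if len(evidence) < len(CANDIDATES) else None


def toy_run_check(check: str) -> tuple[bool, str]:
    findings = {
        "node_status": (False, "all nodes Ready"),
        "container_memory": (True, "memory usage reaches the limit before each restart"),
    }
    return findings[check]


cause, evidence = investigate("container restarts frequently", toy_propose, toy_run_check)
```

Passing the accumulated evidence back into `propose` is what lets the next hypothesis build on earlier findings rather than start from scratch.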

This technology has three main features.

1) Parallel investigation: The agent can shorten investigation time by conducting parallel investigations. If multiple possible causes are considered from a single alarm or incident report, the agent generates several hypotheses at once, and independent agents investigate each hypothesis.
2) Autonomously deepening cause analysis: The agent autonomously reviews the results of multiple investigations. If the initial hypothesis is found to be incorrect, it is revised, and if a more fundamental cause is suspected, a new hypothesis is set and the investigation is resumed (Fig. 3).

Fig. 3 Deepening cause analysis through stepwise refinement.

3) Efficient investigation using proprietary log analysis¹⁾²⁾: Since IT systems generate massive amounts of log data, it is not feasible to feed all of it to the generative AI at once. Therefore, proprietary log analysis technology is used to summarize the log data, and the agent uses these summaries to conduct efficient investigations.
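
Features 1) and 3) can be illustrated together with a small sketch. NEC's log analysis technology is proprietary and is not reproduced here; masking digits to form log templates is a generic substitute used only to show why a summary is far smaller than the raw logs. The hypothesis format and function names are likewise assumptions.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def summarize_logs(lines: list[str], top_n: int = 3) -> list[tuple[str, int]]:
    """Collapse log lines into templates by masking numbers, then count each template."""
    templates = Counter(re.sub(r"\d+", "<N>", line) for line in lines)
    return templates.most_common(top_n)


def check_hypothesis(hypothesis: dict, summary: list[tuple[str, int]]) -> tuple[str, int]:
    """Score one hypothesis by how many summarized log lines mention its keyword."""
    hits = sum(count for template, count in summary if hypothesis["keyword"] in template)
    return hypothesis["cause"], hits


def investigate_in_parallel(hypotheses: list[dict],
                            summary: list[tuple[str, int]]) -> tuple[str, int]:
    """Check every hypothesis concurrently and return the best-supported one."""
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    with ThreadPoolExecutor(max_workers=len(hypotheses)) as pool:
        results = list(pool.map(lambda h: check_hypothesis(h, summary), hypotheses))
    return max(results, key=lambda r: r[1])
```

In this toy version each check is a cheap keyword count, so the threads add little; the structure matters when each check is an independent agent making slow calls to an LLM or a monitoring backend.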

To verify the effectiveness of this technology, it was applied to simulated environments and actual incidents in NEC’s IT systems. As of September 2025, the agent has successfully extracted and correlated information from logs to identify causes such as noisy neighbor issues, misconfigurations, and broadcast storms.

4. Conclusion

With the growing adoption of microservices and hybrid cloud, IT operations are becoming more complex. Meanwhile, the increasing difficulty of securing personnel to manage IT systems has made the streamlining of IT operations an important management issue for many companies.

In this paper, we have described the utilization of generative AI in services for advanced incident management scenarios in operational DX, as well as our research and development efforts related to AI agents for autonomous operations. Going forward, we will continue working to further reduce operational workloads by strengthening operator support functions and realizing autonomous response and recovery through Agentic AI.

Trademarks

* ITIL (Information Technology Infrastructure Library) is a registered trademark of AXELOS Limited.
* All other company names and product names that appear in this paper are trademarks or registered trademarks of their respective companies.

References

Authors’ Profiles

NATSUMEDA Masanao
Lead Research Engineer
Secure System Platform Research Laboratories

KANAMEDA Keiji
Director
Technology Service and Software Department

MIZOGUCHI Takehiko
Researcher
Secure System Platform Research Laboratories

TAKAHASHI Atsushi
Senior Professional
Technology Service and Software Department

YONEDA Masashi
Senior Professional
Technology Service and Software Department

AJIRO Yasuhiro
Director
Secure System Platform Research Laboratories
