AWS for Industries

Near Real-time News Clustering and Summarization for FSI

Introduction

In today’s fast-paced financial markets, the sheer volume and velocity of news pose significant challenges for investors, traders, and financial institutions. The ability to quickly process, analyze, and act on relevant information can mean the difference between seizing opportunities and falling behind. This blog post introduces an innovative solution that leverages HAQM Bedrock to cluster news in near real-time and provides concise summaries, addressing a critical need in the financial services industry.

Problem Statement

Financial professionals face a daily deluge of news from multiple sources, often reporting on the same events with varying perspectives. This information overload makes it difficult to:

  • Quickly identify and focus on the relevant news items
  • Synthesize information from multiple sources to form a comprehensive view
  • Detect emerging trends or breaking news that impact investment decisions
  • Efficiently manage risk in a rapidly changing market environment

Our solution tackles these challenges by introducing near real-time clustering of news articles, coupled with AI-powered summarization. This approach introduces unique technical hurdles, including handling large cluster pools, balancing synchronous vs. asynchronous clustering, and evaluating the effectiveness of created clusters.

Business Use Case in Financial Services

This clustering and summarization solution offers significant advantages across various roles in the financial services ecosystem:

  • Portfolio Managers: Quickly assess market-moving news to make informed investment decisions and rebalance portfolios in response to emerging trends.
  • Risk Managers: Rapidly identify potential threats to existing positions and overall market stability.
  • Traders: Spot breaking news that might impact short-term price movements, enabling faster execution of trades.
  • Research Analysts: Efficiently aggregate and synthesize information from multiple sources to create in-depth analysis and reports.
  • Compliance Officers: Monitor news for regulatory changes or potential compliance issues affecting the firm or its investments.
  • Wealth Advisors: Stay informed about market developments to provide clients timely advice and adjust investment strategies.

By streamlining the consumption of critical information, this solution not only saves time but also allows financial professionals to make informed and timely decisions in an ever-changing market environment. The ability to distill key insights from a flood of news sources represents a major advantage, potentially leading to improved performance and reduced risk exposure across the financial services industry.

In the following sections, we’ll explore the technical architecture of this solution. We will dive into the embedding and clustering approaches, summarization techniques, and the event-driven architecture that powers near real-time news analysis for financial services professionals.

Clustering News Articles for Event Detection

Approach To Event Detection

Live event detection requires continuous monitoring of news streams and the grouping of related articles into events. Assigning articles to distinct events is complicated by the frequent and significant topical overlap between news articles: for example, finance articles may all discuss the technology sector yet cover distinct events within it, and sports articles may all cover soccer yet describe individual matches. Our solution provides a strategy for creating semantically rich text embeddings of news articles with state-of-the-art embedding models, and it segments topically related but distinct articles into individual events using a density-based spatial clustering algorithm that tolerates noise (DBSCAN, described below).

Embedding News Articles for Performant News Event Clustering

Text embeddings are vector representations of text that capture semantic meaning in an embedding space, where similar texts lie close together. For event detection, we experimented with embedding strategies to find the one that best groups news articles pertaining to the same event while separating all others. News articles often arrive with rich metadata that provides powerful context for representing an event; such metadata might include the industry, organizations, people, headline, and article body. We found that embedding a metadata-enhanced text, in which these metadata fields are added to the article body, yields effective embeddings for achieving homogeneity and completeness in event-based clustering. We generated text embeddings with the following models: HAQM Titan-embed-text-v1, intfloat/e5-mistral-7b-instruct, BGE-large, and BERT-uncased. The maximum context window was used for each model, except for e5-mistral-7b-instruct, which was evaluated with context windows of 512 and 1024 tokens. The solution truncates text exceeding the context window length.
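
As a rough illustration of this metadata-enhanced embedding step, the sketch below prepends metadata fields to the article body and calls HAQM Titan Embeddings through HAQM Bedrock. The field names, truncation length, and amazon.titan-embed-text-v1 model ID are assumptions for illustration, not the exact implementation in the repository.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # assumes AWS credentials and region are configured

def build_enhanced_text(article: dict) -> str:
    """Prepend selected metadata fields to the article body (hypothetical field names)."""
    meta = [
        f"industry: {article.get('industry', '')}",
        f"organizations: {', '.join(article.get('organizations', []))}",
        f"people: {', '.join(article.get('people', []))}",
        f"headline: {article.get('headline', '')}",
    ]
    return "\n".join(meta) + "\n" + article.get("body", "")

def embed_text(text: str) -> list[float]:
    """Call Titan text embeddings on Bedrock and return the embedding vector."""
    # Rough character-level truncation; the real limit is token-based and model-specific.
    text = text[:8000]
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",  # assumed model ID
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

embedding = embed_text(build_enhanced_text({
    "industry": "Technology",
    "organizations": ["ExampleCorp"],
    "people": ["Jane Doe"],
    "headline": "ExampleCorp announces quarterly results",
    "body": "ExampleCorp reported earnings ...",
}))
```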

Figure 1: Workflow diagram showing text processing flow: headline + metadata + body transform into metadata-enhanced text, which then becomes a news article embedding.

Extending DBSCAN: A Fast, Incremental Clustering Algorithm for Near Real-Time Clustering of News Articles

There are many cutting-edge clustering algorithms to evaluate when choosing the best option for a use case. In our case, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) provides an approach capable of grouping related articles while accommodating singleton articles that don’t belong to any cluster.

While effective for small datasets, DBSCAN’s implementation in scikit-learn has limitations for real-time clustering of large volumes: it cannot perform incremental updates and slows as the dataset grows. To address these issues, we developed a custom algorithm that performs incremental updates and near real-time cluster assignments. This improved version keeps cluster pool assignments in memory, updates them incrementally with each batch of incoming articles, and represents clusters by their centroids, which reduces the number of comparisons needed at each update step and improves overall latency. With this method, we can quickly update clusters as new articles arrive, with the speed depending on the number of new articles processed at once and how well they fit into existing groups. This works especially well for articles with overlapping topics.
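
The following is a minimal sketch of this centroid-based, incremental assignment step, assuming cosine distance and a fixed eps threshold; the production algorithm in the repository additionally handles singleton promotion, checkpointing, and other details.

```python
import numpy as np

class IncrementalCentroidClusterer:
    """Keeps cluster centroids in memory and assigns new articles incrementally.
    A simplified sketch of the approach, not the production implementation."""

    def __init__(self, eps: float = 0.15):
        self.eps = eps        # assumed distance threshold
        self.centroids = []   # one centroid vector per cluster (or singleton)
        self.counts = []      # number of articles represented by each centroid

    def _cosine_dist(self, vec: np.ndarray, matrix: np.ndarray) -> np.ndarray:
        """Cosine distance between one embedding and every stored centroid."""
        vec = vec / np.linalg.norm(vec)
        matrix = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
        return 1.0 - matrix @ vec

    def add_batch(self, embeddings: np.ndarray) -> list[int]:
        """Assign each embedding to the nearest centroid within eps,
        otherwise start a new cluster (initially a singleton)."""
        assignments = []
        for emb in embeddings:
            if self.centroids:
                dists = self._cosine_dist(emb, np.vstack(self.centroids))
                best = int(np.argmin(dists))
                if dists[best] <= self.eps:
                    # Update the running-mean centroid of the matched cluster.
                    n = self.counts[best]
                    self.centroids[best] = (self.centroids[best] * n + emb) / (n + 1)
                    self.counts[best] += 1
                    assignments.append(best)
                    continue
            self.centroids.append(emb.astype(float))
            self.counts.append(1)
            assignments.append(len(self.centroids) - 1)
        return assignments
```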

The following figure shows the performance of scikit-learn’s implementation of DBSCAN. At each iteration, the system recalculates the entire cluster pool. As the cluster pool grows, the computation time surpasses three minutes on a C7g.4xl instance for a cluster pool size larger than 110,000. A three-minute update time is not suitable for near real-time news event detection.

Figure 2: Scatter plot comparing processing times between the C5.24xl instance type (triangles) and C7a.48xl (circles) across increasing numbers of articles.

We tested our clustering algorithm on 110,000 news articles, processing them in batches of 20 to 1,000. As the number of articles and the batch size increased, so did the processing time. For near real-time performance, we aim for the batch processing time to be shorter than the time it takes to collect the batch. At the peak rate of 20 articles per second, our algorithm efficiently clusters 80 articles every 4 seconds, handling up to 82,000 total clusters and standalone articles. At double that rate, it processes 1,000 articles every 25 seconds, demonstrating its capability for near real-time clustering of high-volume news streams.

The next figure shows:

Figure a: Iteration time for incremental clustering of article embeddings:

  • The graph displays clustering time (seconds) in color for various batch sizes and cluster + singleton pool configurations

Figures b & c: Zones of near-real-time performance where the system fully flushes an article queue at:

  • 20 articles/second (Figure b)
  • 40 articles/second (Figure c)

Figure 3: Three graphs showing how batch size and pool size affect processing time and queue clearing performance. The left panel shows a processing-time heat map; the middle and right panels show queue-clearing thresholds.

Clustering and Evaluation

A key element of designing this system was evaluating the different text embeddings used by DBSCAN. We strongly encourage you to run feedback loops and tests on your production dataset to find the best models for your use case. We cover the results for our use case next.

We measured the clustering quality of the enhanced text representation embedded with each model using the homogeneity score, completeness score, and V-measure.

The homogeneity score measures cluster purity: it is highest when each cluster contains members of only a single class. The completeness score measures the extent to which all data points belonging to the same class fall within a single cluster. It disregards singletons because, by definition, they are not members of a cluster; we include singletons in our analysis, as their absence from clusters directly measures clustering quality. We clustered each set of embeddings with an eps range of 0.01–0.3 at intervals of 0.005. The table in the image below shows the metrics for each embedding model at its maximum V-measure. The e5-mistral-7b-instruct and titan-embed-text-v1 models yielded the best overall performance: both resulted in high homogeneity and completeness while also delivering cluster and singleton counts that approach the ground-truth values of 429 and 21, respectively.
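
For reference, the eps sweep and quality metrics can be reproduced with scikit-learn along the lines of the sketch below, which assumes precomputed embeddings and ground-truth event labels and treats DBSCAN noise points as singleton clusters (the min_samples value is an assumption).

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import homogeneity_score, completeness_score, v_measure_score

def relabel_noise_as_singletons(labels: np.ndarray) -> np.ndarray:
    """Give every DBSCAN noise point (-1) its own cluster id so singletons count in the metrics."""
    labels = labels.copy()
    next_id = labels.max() + 1
    for i in np.where(labels == -1)[0]:
        labels[i] = next_id
        next_id += 1
    return labels

def sweep_eps(embeddings: np.ndarray, true_events: np.ndarray) -> dict:
    """Cluster with cosine distance over eps = 0.01-0.3 and return the best result by V-measure."""
    results = []
    for eps in np.arange(0.01, 0.3, 0.005):
        labels = DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(embeddings)
        labels = relabel_noise_as_singletons(labels)
        results.append({
            "eps": round(float(eps), 3),
            "homogeneity": homogeneity_score(true_events, labels),
            "completeness": completeness_score(true_events, labels),
            "v_measure": v_measure_score(true_events, labels),
            "n_clusters": int(len(set(labels))),
        })
    return max(results, key=lambda r: r["v_measure"])
```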

Figure 4: Comparison of embedding model performance metrics showing V-measure, Homogeneity, Completeness scores, and cluster counts for 5 different models including Mistral and BERT variants.

LLM for Summarization

In this experiment, we use HAQM Bedrock to test different high-performing LLMs for generating the title and summary of each news cluster. The input data includes the previously generated headline and summary of the cluster, along with the original title and summary of each article; if an article had no summary, we used the first 500 characters of its content. The prompt used for generation follows.

“You will be provided with multiple sets of headlines and summaries from different articles in <context> tag, and the current title and summary for a story in <story> tag. Compile, summarize and update the current title and summary for the story. The summary should be less than 100 words. Put the generated context inside <title> and <summary> tag. Do not hallucinate or make up content.

<story>{<INSERT PREVIOUS SUMMARY>} </story>\n

<context> headline: {<INSERT ARTICLE HEADLINE>}, summary: {<INSERT ARTICLE SUMMARY OR FIRST 500 WORDS>} \n headline: … </context>”
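
A minimal sketch of invoking Claude Haiku on HAQM Bedrock with this prompt might look like the following; the model ID, request fields, and prompt assembly are assumptions based on the Anthropic Messages API on Bedrock rather than the exact code in the repository.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

PROMPT_TEMPLATE = (
    "You will be provided with multiple sets of headlines and summaries from different "
    "articles in <context> tag, and the current title and summary for a story in <story> tag. "
    "Compile, summarize and update the current title and summary for the story. "
    "The summary should be less than 100 words. Put the generated context inside <title> and "
    "<summary> tag. Do not hallucinate or make up content.\n"
    "<story>{story}</story>\n<context>{context}</context>"
)

def summarize_cluster(previous_summary: str, articles: list[dict]) -> str:
    """Build the multi-document prompt for a cluster and call Claude Haiku via Bedrock."""
    context = "\n".join(
        # Fall back to the first 500 characters of the body when no summary exists.
        f"headline: {a['headline']}, summary: {a.get('summary') or a['body'][:500]}"
        for a in articles
    )
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 512,  # assumed output budget
        "messages": [{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(story=previous_summary, context=context),
        }],
    }
    response = bedrock.invoke_model(
        modelId="anthropic.claude-3-haiku-20240307-v1:0",  # assumed model ID
        body=json.dumps(body),
    )
    return json.loads(response["body"].read())["content"][0]["text"]
```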

We evaluated and compared different foundation models available in HAQM Bedrock. The image below shows the performance comparison of the models, where the Time(s) column is the total time in seconds to generate summaries and headlines for 200 clusters. Current pricing for each model can be found on the HAQM Bedrock pricing and HAQM SageMaker pricing pages. We selected Anthropic’s Claude Haiku for this use case given its speed, lower cost, and relatively high accuracy. A customer subject matter expert manually validated accuracy on a small dataset of approximately 200 clusters; the criteria included summary correctness, hallucinations, and the when, where, what, and who of each news cluster. The table below shows the LLM performance comparison for summarization.

Figure 5: Comparison table of four LLMs showing Claude-V2.1, Claude-instant, Llama 2 7b chat, and Claude-Haiku (highlighted) with their respective engine, processing times (latency), and token cost.

Architecture

Figure 6: Reference Architecture Diagram: Article processing workflow using AWS services, with numbered steps showing data flow from raw articles through embedding, clustering, summarization and storage to UI presentation.

Developing a near real-time solution is simpler when you leverage a combination of AWS managed services and serverless options. We rely on these services, shown in the architecture above, for a scalable, event-driven, microservice architecture that handles one million articles per day and up to 40 articles per second.

Solution Workflow:

The workflow of the architecture is:

  1. HAQM Kinesis ingests articles (body and title) in JSON format. The system bridges them to Step Functions using EventBridge Pipes.
  2. Our first Step Functions state machine contains two steps. First, we pre-process the documents by removing extraneous keys from the JSON document and stripping unnecessary characters, such as HTML tags, from the article’s text body.
  3. Step Functions has a 256 KB limit on the data that can pass between steps. Because of this limitation, we store the article in an S3 bucket after each step. Next, we pull the article from this temporary storage and embed it by calling HAQM Titan Embeddings from HAQM Bedrock; this API simplifies testing against multiple embedding models. After the state machine completes, we send the data to a queue in HAQM Simple Queue Service (SQS) to hold it before clustering in micro batches.
  4. The clustering process runs synchronously on an EC2 instance because of algorithm constraints. The instance pulls articles from the queue in batches of 500 by default; this can be changed in the source code (see the sketch after this list). After pulling a batch, the DBSCAN-based clustering algorithm runs against the clustering pool, updating it with the new articles as needed, and then updates the DynamoDB table based on these changes. Finally, every hour we checkpoint the clustering pool to S3 to minimize data loss if the clustering process is interrupted.
  5. Leveraging HAQM DynamoDB Streams, we have a scalable, event-driven micro-batching solution for summarizing clusters. As articles are ingested, the Trigger Summary Pipeline Lambda function evaluates whether a cluster has reached the article threshold that triggers summarization. The default is five articles, but you can adjust it easily.
  6. Once we’ve determined there are enough articles to summarize in a cluster, we trigger our second Step Functions state machine. Its first step calls Claude Haiku, leveraging the prompt and multi-document summarization technique shown earlier.
  7. Finally, we write the generated summary back to DynamoDB for the UI to display to the customer.
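
For illustration, the micro-batching loop on the EC2 instance (step 4) could be sketched as follows; the queue URL, table name, item schema, and clusterer are placeholders, and the open-source implementation additionally handles retries, hourly S3 checkpoints, and richer DynamoDB updates.

```python
import json

import boto3
import numpy as np

sqs = boto3.client("sqs")
dynamodb = boto3.resource("dynamodb")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/embedded-articles"  # placeholder
table = dynamodb.Table("news-clusters")  # hypothetical table name
BATCH_SIZE = 500  # default micro-batch size; adjustable in the source code

def pull_batch() -> list[dict]:
    """Drain up to BATCH_SIZE messages from SQS (the API returns at most 10 per call)."""
    messages = []
    while len(messages) < BATCH_SIZE:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=1)
        chunk = resp.get("Messages", [])
        if not chunk:
            break
        messages.extend(chunk)
    return messages

def process_one_batch(clusterer) -> None:
    """Assign one micro batch of embedded articles to clusters and persist the assignments."""
    messages = pull_batch()
    if not messages:
        return
    articles = [json.loads(m["Body"]) for m in messages]
    embeddings = np.vstack([a["embedding"] for a in articles])
    cluster_ids = clusterer.add_batch(embeddings)  # e.g., the incremental clusterer sketched earlier
    for article, cluster_id in zip(articles, cluster_ids):
        # The item schema is illustrative; the repository defines the real table layout.
        table.put_item(Item={"cluster_id": str(cluster_id), "article_id": article["id"]})
    for m in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=m["ReceiptHandle"])
```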

Open-source Solution

The solution we’ve developed offers a powerful and user-friendly way to handle large volumes of news articles in near real-time. To demonstrate its capabilities, we’ve created a demo that showcases the entire process, from ingestion to visualization.

Near real-time demo

In the animated demo, you can see the solution in action. It begins by sending articles to the Kinesis stream, which then triggers the clustering and summarization processes. We configured the system to start clustering once it receives 500 articles, though this threshold is adjustable. Every five seconds, the web UI updates by reading the latest data from the DynamoDB table, which contains the clusters, articles, and summaries.
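
Feeding the demo can be as simple as writing JSON articles to the Kinesis stream with boto3, as in the sketch below; the stream name and record fields are placeholders, and the repository’s Makefile provides an equivalent data-sending command.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM_NAME = "news-articles"  # placeholder; use the stream created by the Terraform stack

def send_articles(articles: list[dict]) -> None:
    """Write articles (title and body in JSON) to the Kinesis stream, 500 records per call."""
    records = [
        {
            "Data": json.dumps(a).encode("utf-8"),
            "PartitionKey": str(a.get("id", a["title"])),  # partition key choice is illustrative
        }
        for a in articles
    ]
    for i in range(0, len(records), 500):  # PutRecords accepts at most 500 records per request
        kinesis.put_records(StreamName=STREAM_NAME, Records=records[i:i + 500])
```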

The UI itself is simple yet effective: it is hosted on HAQM Elastic Container Service (ECS) and fronted by an Application Load Balancer for performance and scalability. Users access the interface by logging in with their HAQM Cognito credentials, ensuring secure access to the system. Once logged in, users can view the clusters and their related news articles, which refresh every five seconds to provide a near real-time view of the news landscape.

aws-samples GitHub Repository

We’re excited to announce that we have made this solution available as an open-source project on GitHub. This lets developers and organizations leverage and customize the solution to their specific needs.

To deploy the solution, follow the detailed steps provided in the GitHub repository. The process begins with the Prerequisites section, which outlines the necessary tools and configurations. Following that, the Build and Deploy section guides users through the actual deployment process.

One of the key features of our solution is its use of Terraform to automate the entire infrastructure deployment on AWS. This approach ensures consistency and reduces the potential for human error during setup. We’ve also included a Makefile that contains the commands for deploying, sending data, testing, and destroying the infrastructure/application, making it easy for users to manage the entire lifecycle of the solution.

It’s worth noting that while our demo focuses on financial news, the solution is versatile enough to cluster and summarize any type of news. Users can specialize the system by focusing on news from a specific industry, which can potentially increase performance and optimize costs. This flexibility makes our solution adaptable to a wide range of use cases across various sectors.

Additionally, this repository is a proof of concept of the discussed architecture; before leveraging it for production, consider what your use case will need. For instance, this proof of concept does not clear the cluster pool, which may lead to some slowdown beyond 100,000+ clusters. Addressing considerations like this will enable you to run complex clustering workloads in production confidently.

By releasing this solution as open source, we provide a valuable tool for organizations dealing with high volumes of news data, enabling them to extract insights quickly and make informed decisions in today’s fast-paced information environment.

Clean Up

If you deployed the solution using the instructions in the GitHub Repository, you can delete all resources in your AWS account by following the instructions provided in the Destroy section in the same Repository.

Conclusion

The news clustering and summarization solution we’ve presented offers a powerful tool for managing the flow of information in today’s digital age. While our focus has been on financial news, the potential applications of this technology extend far beyond this domain.

Other Potential Use Cases:

  • Social Media Event Tracking. By adapting the solution to ingest and analyze posts from platforms like X (formerly Twitter) or Instagram, businesses can track emerging trends, viral content, or breaking news in real-time.
  • Brand Sentiment Analysis. The system clusters and summarizes mentions of brands across specific sources (social media, news sites, forums, etc.) providing a measurable view of public perception and facilitating quick response times to identified PR issues.
  • Multimedia Content Analysis. The solution extends to clustering and summarizing video content from platforms like YouTube, Twitch, and specific social media sites, helping content creators and marketers understand popular topics.
  • Political Campaign Monitoring. Political strategists deploy this tool to track and analyze news coverage and public sentiment across multiple sources during election seasons.
  • Academic Research Aggregation. Researchers use this system to cluster and summarize academic papers on specific topics, aiding literature reviews and identifying research trends.
  • Emergency Response Coordination. During natural disasters or crises, the solution clusters and summarizes reports from various sources, aiding in rapid information dissemination and response coordination.

As the information landscape continues to evolve, tools like this will become increasingly vital for making sense of our data-rich world. We encourage you to explore the GitHub repository, deploy the solution, and adapt it to your unique use cases. Your feedback and contributions will help drive this technology forward, opening new possibilities for information management and decision-making in the digital age.

Samuel Baruffi

Samuel Baruffi is a seasoned technology professional with over 17 years of experience in the information technology industry. Currently, he works at AWS as a Principal Solutions Architect, providing valuable support to global financial services organizations. His vast expertise in cloud-based solutions is validated by numerous industry certifications. Away from cloud architecture, Samuel enjoys soccer, tennis, and travel.

Ally Meringer

Ally Meringer is a Solutions Architect for Global Financial Services. As part of the PACE team (Prototyping and Cloud Engineering), Ally focuses exclusively on machine learning and generative AI prototypes, helping customers rapidly explore and validate innovative use cases. Her work spans a wide range of domains including intelligent document processing (IDP), retrieval-augmented generation (RAG), agents, clustering, complex data analysis, and more.

Hector Lopez

Hector Lopez is an Applied Scientist in AWS's Generative AI Innovation Center, where he specializes in delivering production-ready generative AI solutions and proof-of-concepts across diverse industry applications. His expertise spans traditional machine learning and data science in life and physical sciences, having successfully delivered over 20 generative AI use cases and providing strategic advisory to help customers navigate their path to production. Hector implements a first-principles approach to customer solutions, working backwards from core business needs to help organizations understand and leverage generative AI tools for meaningful business transformation.

Kareem Abdol-Hamid

Kareem Abdol-Hamid is a Senior Accelerated Compute Specialist for Startups. As an Accelerated Compute specialist, Kareem experiences novel challenges every day involving generative AI, High Performance Compute, and massively scaled workloads. In his free time, he plays piano and competes in the video game Street Fighter.

Yanxiang Yu

Yanxiang Yu is an Applied Scientist at the HAQM Generative AI Innovation Center. With over 9 years of experience building AI and machine learning solutions for industrial applications, he specializes in generative AI, computer vision, and time series modeling.