Shorthills AI teams with AWS and DataStax to transform Enterprise Data Search

By Paramdeep Singh, Co-Founder – Shorthills AI
By Ganesh Sawhney, Sr Partner Solutions Architect – AWS
By Henry Issac Mudumala, APJ Lead Strategic Partnerships – DataStax

Introduction

Extracting value from unstructured data remains a challenge for organizations. Emails, reports, legal documents, and other digital assets are hard to mine for insight, which delays decision-making. According to IDC’s 2023 study, 90% of the data that enterprises generate is unstructured and only 10% is structured.

Organizations require advanced solutions such as vector search and graph indexing to translate petabytes of unstructured data into actionable insights. Customers need capabilities such as automated summarization and contextual responses grounded in the organization’s own data.

This post explains how Shorthills AI, in collaboration with AWS and DataStax, uses Astra DB together with advanced search technologies and natural language processing for enterprise search. The collaboration helps customers make data-driven business decisions by combining AWS’s enterprise-grade security features with DataStax’s high-performance vector search capabilities.

Business Needs and Opportunities

Legal, e-commerce, healthcare, and financial services industries rely on data to drive strategic decisions and optimize customer engagement. This data includes legal judgments, reviews, and invoices stored as PDFs and other documents. Advanced techniques like vector search and graph indexing are necessary for processing this data.

Shorthills AI has developed a domain-specific optimized chatbot with a Retrieval-Augmented Generation (RAG) framework and knowledge graph to deliver AI-powered insights. This helps lawyers, legal consultants, healthcare professionals, e-commerce product managers, and others gain a competitive edge through data-driven decision-making.

As organizations accelerate their digital transformation efforts, the need for flexible, secure, and scalable AI solutions has become essential. By partnering with AWS and DataStax, Shorthills AI provides customers a robust solution that reduces search time by 70% compared to traditional methods while maintaining data privacy. This estimate is based on tests with around two million documents.

Solution Overview

Shorthills AI has transitioned from open-source solutions to enterprise-grade Astra DB on AWS. This transition enables AI-driven search capabilities that deliver real-time insights to customers. The solution standardizes unstructured data using custom parsing and chunking methods, then applies advanced NLP, vector search, and in-house graph algorithms such as Degree Centrality and Article Rank. This makes it possible to extract metadata, uncover relationships, and deliver deep insights from data within your existing data lake. Based on this analysis, the solution provides detailed, relevant insights into legal judgments, customer sentiment, and market trends.
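
To make the graph step more concrete, here is a minimal sketch of degree centrality on a small entity graph. It is an illustration only: the post does not describe Shorthills AI’s in-house implementation, and the function name, the normalization by (n - 1), and the sample entities are assumptions made for this example.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return normalized degree centrality for an undirected graph.

    edges: iterable of (node_a, node_b) pairs. A node's score is its
    number of distinct neighbors divided by (n - 1), where n is the
    number of nodes that appear in at least one edge.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(adj) / (n - 1) for node, adj in neighbors.items()}


# Hypothetical entity graph, as might be extracted from legal documents.
edges = [
    ("ALJ Decision", "Appeals Council"),
    ("ALJ Decision", "Claimant"),
    ("ALJ Decision", "RFC Finding"),
    ("Appeals Council", "Claimant"),
]
scores = degree_centrality(edges)
most_connected = max(scores, key=scores.get)
```

In this toy graph the most connected entity is the one linked to every other node, which is the kind of signal a retrieval layer could use to prioritize central entities.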

Salient features of Shorthills AI’s solution:

Industry-Specific Customization: Tailored for industry-specific use cases such as legal, healthcare, and e-commerce.
Optimized Data Processing: Employs techniques like optimized chunking to efficiently process and analyze large volumes of industry-specific data.
Enhanced Understanding: Graph-based indexing captures intricate relationships among data points, improving the retrieval of contextually relevant information.
Real-Time Adaptability: The OptimizeRAG framework’s incremental update algorithm allows it to adapt to new data without a full index rebuild, ensuring real-time accuracy and relevance.
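
The idea of updating an index incrementally, rather than rebuilding it, can be sketched as follows. This is not the OptimizeRAG algorithm, which the post does not detail. It is a simplified stand-in that assumes documents are identified by an ID and that a content hash is enough to detect changes. The class and method names are hypothetical.

```python
import hashlib


class IncrementalIndex:
    """Toy index that only re-processes new or changed documents."""

    def __init__(self):
        self._hashes = {}   # doc_id -> content hash of the indexed version
        self._entries = {}  # doc_id -> indexed representation

    @staticmethod
    def _digest(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def upsert(self, docs):
        """Index docs (doc_id -> text) and return the IDs that were (re)indexed."""
        changed = []
        for doc_id, text in docs.items():
            digest = self._digest(text)
            if self._hashes.get(doc_id) == digest:
                continue  # unchanged document: leave the existing entry alone
            self._hashes[doc_id] = digest
            # Stand-in for the real work (embedding, graph extraction).
            self._entries[doc_id] = text.lower().split()
            changed.append(doc_id)
        return changed


index = IncrementalIndex()
first = index.upsert({"doc-1": "Appeals Council decision", "doc-2": "ALJ hearing"})
second = index.upsert({"doc-1": "Appeals Council decision", "doc-3": "RFC finding"})
```

On the second call only the new document is processed, so the cost of an update scales with the amount of new data rather than with the size of the whole corpus.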

What makes the solution different?

Currently, the solution is optimized for legal use cases. NaiveRAG is a basic RAG approach that retrieves relevant information and generates responses without advanced optimizations. Shorthills AI uses the OptimizeRAG framework, which has an edge over NaiveRAG techniques in the legal domain. Table 1 shows the performance comparison:

Metric	NaiveRAG	OptimizeRAG
Comprehensiveness	19.05%	80.95%
Diversity	10.98%	89.02%
Empowerment	17.59%	82.41%
Overall	17.46%	82.54%

Table 1: Comparison of answer quality between NaiveRAG and OptimizeRAG

Shorthills AI used JUSTIA Inc’s public dataset, which contains around 100,000 legal documents (cases and judgments), and computed the metrics shown in Table 1 from the answers generated by both NaiveRAG and OptimizeRAG. Below is an example illustrating the difference in their responses.

Query: What procedural history and administrative proceedings led to the ALJ’s final decision in this case?

You can view the response for this query from NaiveRAG in Figure 1.

Figure 1: NaiveRAG Response to a sample query

You can view the response for this query from OptimizeRAG in Figure 2.

Figure 2: OptimizeRAG Response to a sample query

As shown in Figure 2, OptimizeRAG delivers a more contextual response than the NaiveRAG response in Figure 1. It incorporates critical elements such as the Second ALJ Hearing and Appeals Council Decision, key ALJ findings including RFC and the disability timeframe, and a comprehensive ‘layered history’. This adds contextual depth to the analysis.

Core Benefits of Shorthills AI Solution

The solution caters to a variety of use cases, such as discovery, search, report generation, and summarization. Below are the business benefits:

Enhanced Search Accuracy: Shorthills AI algorithms combine ensembles of models, prompt strategies like few-shot prompting, flexible graph-building techniques, and Astra DB’s vector search. They improve search accuracy by interpreting intent in complex, contextual queries.
Reduced Management Costs: Leveraging DataStax’s managed database solution on AWS reduces DevOps efforts, resulting in a 50% upfront reduction in total cost of ownership (TCO) and allowing Shorthills AI to focus on innovation rather than infrastructure management.
Data Security and Compliance: AWS’s enterprise-grade security, including AWS Key Management Service (AWS KMS) encryption and Amazon Virtual Private Cloud (Amazon VPC) endpoints, together with DataStax’s production-ready vector search, provides robust protection and supports compliance for sensitive data.

This solution ensures that Shorthills AI’s clients can access relevant, timely insights from their unstructured content, ultimately improving user experience and driving competitive advantage.

Architecture

The architecture behind Shorthills AI’s innovative search platform combines the scalability and reliability of AWS cloud infrastructure with the advanced vector search capabilities of Astra DB.

Data first arrives in Amazon S3. This economical, high-throughput storage layer handles incoming files, including PDFs, scanned images, and other formats. Amazon Textract automatically extracts text and structured data, eliminating the need for manual data entry. An AWS Lambda function chunks the extracted text into smaller, manageable units for deeper analysis. Large language models accessed through Amazon Bedrock process these text chunks to uncover entities and relationships. The solution stores the resulting embeddings in DataStax’s Astra DB on AWS, which efficiently handles both vector and document data. AWS Step Functions orchestrates the entire process, managing errors and ensuring a smooth, scalable end-to-end workflow.
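
The flow above can be pictured as a chain of stages with retry handling around each one. The following sketch uses plain Python functions as stand-ins for the managed services. It does not call AWS or Astra DB APIs, and the stage functions, the retry count, and the placeholder "embedding" are assumptions made for illustration.

```python
def run_pipeline(stages, payload, max_attempts=3):
    """Run stages in order, retrying a failing stage up to max_attempts times."""
    for name, stage in stages:
        for attempt in range(1, max_attempts + 1):
            try:
                payload = stage(payload)
                break  # stage succeeded, move on to the next one
            except Exception as exc:
                if attempt == max_attempts:
                    raise RuntimeError(f"stage {name!r} failed") from exc
    return payload


# Hypothetical stand-ins for the extract, chunk, and embed stages.
def extract_text(doc):
    return doc["raw"].strip()


def chunk_text(text):
    return [part for part in text.split("\n\n") if part]


def embed_chunks(chunks):
    # Placeholder vector: a real system would call an embedding model here.
    return [(chunk, [float(len(chunk))]) for chunk in chunks]


stages = [("extract", extract_text), ("chunk", chunk_text), ("embed", embed_chunks)]
result = run_pipeline(stages, {"raw": "First paragraph.\n\nSecond paragraph.\n"})
```

Keeping each stage a small function with a single input and output mirrors how an orchestrator passes state between steps and makes individual stages easy to retry.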

Step 1: Data Storage

As shown in Figure 3, this step begins by monitoring an Amazon Simple Storage Service (Amazon S3) bucket for newly uploaded PDF files using Amazon EventBridge. The system checks a CSV file stored in Amazon S3 to identify unprocessed documents, ensuring that no document is processed twice.

New PDFs are read, split into smaller text chunks (e.g., paragraphs or sections), and the IDs of processed files are saved back to the CSV. This creates structured, chunked text ready for analysis and updates the tracking record to prevent reprocessing.
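
A minimal sketch of the tracking and chunking logic described above, using in-memory strings instead of Amazon S3 objects. The CSV layout (a single `file_id` column), the function names, and the `max_chars` limit are assumptions for this example. The post does not specify them.

```python
import csv
import io


def load_processed_ids(csv_text):
    """Read processed file IDs from CSV text with a 'file_id' header column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["file_id"] for row in reader}


def select_unprocessed(uploaded_ids, processed_ids):
    """Keep only uploads that are not yet in the tracking record."""
    return [fid for fid in uploaded_ids if fid not in processed_ids]


def chunk_paragraphs(text, max_chars=500):
    """Split on blank lines, packing paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def append_processed(csv_text, new_ids):
    """Return the tracking CSV with newly processed IDs appended."""
    out = io.StringIO()
    out.write(csv_text if csv_text.endswith("\n") else csv_text + "\n")
    writer = csv.writer(out, lineterminator="\n")
    for fid in new_ids:
        writer.writerow([fid])
    return out.getvalue()


tracking = "file_id\njudgment-001.pdf\n"
uploaded = ["judgment-001.pdf", "judgment-002.pdf"]
todo = select_unprocessed(uploaded, load_processed_ids(tracking))
chunks = chunk_paragraphs("Para one.\n\nPara two.\n\nPara three.", max_chars=20)
updated = append_processed(tracking, todo)
```

Writing the IDs back only after chunking succeeds means a failed run leaves the file marked as unprocessed, so it is picked up again on the next trigger.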

Figure 3: Data processing and storage

Step 2: Chunking and Entity-Relationship Extraction

As shown in Figure 4, chunks are sent to Amazon Bedrock to detect entities (such as people and organizations) and their relationships (e.g., “works at,” “located in”).

This data is then parsed into clean, structured formats, with entities as labeled items and relationships as connections between them. Embeddings (numeric representations) for the chunks are generated and stored in a vector database for future retrieval.
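
The parsing step can be illustrated with a short sketch. It assumes the model is prompted to return JSON with `entities` and `relationships` lists. That schema, the field names, and the normalization rules are hypothetical, since the post does not show the actual prompt or output format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str
    label: str


@dataclass(frozen=True)
class Relationship:
    source: str
    relation: str
    target: str


def parse_extraction(raw_json):
    """Turn a model's JSON output into de-duplicated entities and relationships."""
    data = json.loads(raw_json)
    entities = {}
    for item in data.get("entities", []):
        name = item["name"].strip()
        # Keep the first occurrence of each entity name.
        entities.setdefault(name, Entity(name, item["label"].strip().upper()))
    relationships = []
    for item in data.get("relationships", []):
        rel = Relationship(
            item["source"].strip(),
            item["relation"].strip().lower(),
            item["target"].strip(),
        )
        # Drop edges that point at entities the model never declared.
        if rel.source in entities and rel.target in entities and rel not in relationships:
            relationships.append(rel)
    return list(entities.values()), relationships


# Hypothetical model output for one chunk.
sample = json.dumps({
    "entities": [
        {"name": "Jane Doe", "label": "person"},
        {"name": "Acme Corp", "label": "organization"},
        {"name": " Jane Doe ", "label": "person"},
    ],
    "relationships": [
        {"source": "Jane Doe", "relation": "Works At", "target": "Acme Corp"},
        {"source": "Jane Doe", "relation": "located in", "target": "Paris"},
    ],
})
entities, relationships = parse_extraction(sample)
```

Validating edges against the declared entities keeps dangling references out of the graph, which matters because model output is not guaranteed to be internally consistent.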

Figure 4: Chunking and Entity-Relationship Extraction

Step 3: Embedding Generation and Storage

Amazon Neptune (a graph database) stores processed entities and relationships, as shown in Figure 5, allowing complex queries such as “Find all subsidiaries of Company X”. DataStax’s Astra DB stores chunk and entity embeddings, which facilitates semantic search (e.g., finding similar documents). Together, these databases allow both keyword-based and context-aware searches across the analyzed data.
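
The two query styles can be sketched in plain Python: a cosine-similarity search over stored embeddings and a traversal over relationship edges. This is a conceptual illustration with in-memory data, not the Astra DB or Amazon Neptune query APIs. The edge label `subsidiary_of`, the function names, and the sample vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=2):
    """Return the IDs of the k stored vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


def find_subsidiaries(edges, company):
    """Follow 'subsidiary_of' edges transitively, starting from company."""
    children = {}
    for child, relation, parent in edges:
        if relation == "subsidiary_of":
            children.setdefault(parent, []).append(child)
    found, stack, seen = [], [company], {company}
    while stack:
        for child in children.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                stack.append(child)
    return sorted(found)


# Hypothetical embeddings and relationship edges.
store = {"chunk-a": [1.0, 0.0], "chunk-b": [0.8, 0.6], "chunk-c": [0.0, 1.0]}
nearest = top_k([1.0, 0.1], store, k=2)

edges = [
    ("Sub A", "subsidiary_of", "Company X"),
    ("Sub B", "subsidiary_of", "Sub A"),
    ("Sub C", "located_in", "Company X"),
]
subsidiaries = find_subsidiaries(edges, "Company X")
```

The vector search ranks content by meaning, while the traversal answers structural questions that similarity alone cannot, which is why the architecture keeps both stores.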

Figure 5: Embedding Generation and Storage

Conclusion

Through the high-performance vector search of DataStax Astra DB and AWS’s enterprise-grade security features, Shorthills AI’s platform offers enterprise clients a scalable, compliant, high-performance search solution for their data.

Call to Action

Contact the Shorthills AI team to learn more about their innovative platform. Visit the AWS startup showcase page to learn more about Shorthills AI services. Also, explore the DataStax listing on the AWS Marketplace to discover more about their services and offerings.

DataStax – AWS Partner Spotlight

DataStax is an AWS Technology Partner that offers a one-stop generative AI stack with everything needed for a faster, easier path to production for relevant and responsive GenAI apps. DataStax delivers a RAG-first developer experience, with first-class integrations into leading AI ecosystem partners. With DataStax, anyone can quickly build smart, high-growth AI applications at unlimited scale.

AWS Partner Network (APN) Blog