AWS Big Data Blog
Category: HAQM SageMaker
Simplify data integration with AWS Glue and zero-ETL to HAQM SageMaker Lakehouse
AWS has introduced zero-ETL integration support for external applications in AWS Glue, simplifying data integration for organizations. This new feature allows seamless replication of data from popular platforms like Salesforce, ServiceNow, and Zendesk into HAQM SageMaker Lakehouse and HAQM Redshift. This blog post demonstrates a ServiceNow data integration use case, outlining the process of setting up a connector, creating a zero-ETL integration, and verifying both the initial data load and change data capture (CDC). It also highlights the advantages of using Apache Iceberg for data versioning and time travel capabilities within zero-ETL integrations.
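For a sense of what the setup looks like, here is a minimal boto3 sketch assuming the Glue CreateIntegration API for zero-ETL; the connection and target ARNs, names, and account IDs are illustrative placeholders rather than values from the post:

```python
import boto3

glue = boto3.client("glue")

# Create a zero-ETL integration from a ServiceNow connection (source)
# to a SageMaker Lakehouse target. All ARNs and names below are
# hypothetical placeholders.
response = glue.create_integration(
    IntegrationName="servicenow-to-lakehouse",
    SourceArn="arn:aws:glue:us-east-1:111122223333:connection/servicenow-conn",
    TargetArn="arn:aws:glue:us-east-1:111122223333:database/lakehouse_db",
    Description="Replicate ServiceNow incident data with CDC",
)
# The response describes the new integration, including its status.
print(response)
```

Once the integration is active, the initial load and subsequent CDC changes land in the target without a separate ETL pipeline to maintain.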
Catalog and govern HAQM Athena federated queries with HAQM SageMaker Lakehouse
In this post, we show how to connect to, govern, and run federated queries on data stored in HAQM Redshift, HAQM DynamoDB (Preview), and Snowflake (Preview). To query our data, we use Athena, which is seamlessly integrated with SageMaker Unified Studio. We use SageMaker Lakehouse to present data to end users as federated catalogs, a new type of catalog object. Finally, we demonstrate how to use column-level security permissions in AWS Lake Formation to give analysts access to the data they need while restricting access to sensitive information.
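As an illustration of the column-level security step, a Lake Formation grant restricted to specific columns might look like the following boto3 sketch; the role ARN, database, table, and column names are hypothetical:

```python
import boto3

lf = boto3.client("lakeformation")

# Grant an analyst role SELECT on only the non-sensitive columns of a
# table, so queries through Athena never see the excluded columns.
lf.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    Resource={
        "TableWithColumns": {
            "DatabaseName": "sales_db",
            "Name": "customers",
            # Sensitive columns (e.g., ssn) are simply omitted here.
            "ColumnNames": ["customer_id", "region", "order_total"],
        }
    },
    Permissions=["SELECT"],
)
```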
The next generation of HAQM SageMaker: The center for all your data, analytics, and AI
This week on the keynote stages at AWS re:Invent 2024, you heard Matt Garman, CEO of AWS, and Swami Sivasubramanian, VP of AI and Data at AWS, speak about the next generation of HAQM SageMaker, the center for all of your data, analytics, and AI. This update addresses the evolving relationship between analytics and AI workloads, aiming to streamline how customers work with their data. It helps organizations collaborate more effectively, reduce data silos, and accelerate the development of AI-powered applications while maintaining robust governance and security measures.
Integrate sparse and dense vectors to enhance knowledge retrieval in RAG using HAQM OpenSearch Service
In this post, instead of using the BM25 algorithm, we introduce sparse vector retrieval. This approach offers improved term expansion while maintaining interpretability. We walk through the steps of integrating sparse and dense vectors for knowledge retrieval using HAQM OpenSearch Service and run experiments on public datasets to demonstrate its advantages.
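To make the idea concrete, the following sketch combines a learned sparse (neural_sparse) query with a dense (neural) query in a single OpenSearch hybrid request; the domain endpoint, index, field names, and model IDs are placeholders, and a normalization search pipeline is assumed to already exist:

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# One hybrid request: a learned sparse query (interpretable term
# weights with expansion) alongside a dense k-NN query.
query = {
    "query": {
        "hybrid": {
            "queries": [
                {"neural_sparse": {"sparse_embedding": {
                    "query_text": "what is retrieval augmented generation",
                    "model_id": "<sparse-model-id>",
                }}},
                {"neural": {"dense_embedding": {
                    "query_text": "what is retrieval augmented generation",
                    "model_id": "<dense-model-id>",
                    "k": 10,
                }}},
            ]
        }
    }
}
# The search pipeline normalizes and combines the two score scales.
response = client.search(index="docs", body=query,
                         params={"search_pipeline": "norm-pipeline"})
```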
Protein similarity search using ProtT5-XL-UniRef50 and HAQM OpenSearch Service
A protein is a sequence of amino acids that, when chained together, creates a 3D structure. This 3D structure allows the protein to bind to other structures within the body and initiate changes. This binding is core to the working of many drugs. A common workflow within drug discovery is searching for similar proteins, because […]
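As a minimal sketch of the embedding step, the ProtT5-XL-UniRef50 encoder can turn an amino acid sequence into a fixed-length vector suitable for OpenSearch k-NN search; the index and field names below are hypothetical:

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

# ProtT5 expects space-separated amino acids, with rare residues
# (U, Z, O, B) mapped to X.
tokenizer = T5Tokenizer.from_pretrained(
    "Rostlab/prot_t5_xl_uniref50", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/prot_t5_xl_uniref50")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(prepared, return_tensors="pt")

with torch.no_grad():
    # Per-residue embeddings, shape (1, seq_len, 1024).
    residue_embeddings = model(**inputs).last_hidden_state

# Mean-pool residues into a single 1024-dimensional protein vector.
vector = residue_embeddings.mean(dim=1).squeeze().tolist()

# k-NN query body for an OpenSearch index with a knn_vector field.
knn_query = {"size": 5,
             "query": {"knn": {"protein_vector": {"vector": vector, "k": 5}}}}
```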
Build a decentralized semantic search engine on heterogeneous data stores using autonomous agents
In this post, we show how to build a Q&A bot with RAG (Retrieval Augmented Generation). RAG uses data sources like HAQM Redshift and HAQM OpenSearch Service to retrieve documents that augment the LLM prompt. To get data from HAQM Redshift, we use Anthropic Claude 2.0 on HAQM Bedrock, summarizing the final response based on pre-defined prompt template libraries from LangChain. To get data from HAQM OpenSearch Service, we chunk the source data and convert the chunks to vectors using the HAQM Titan Text Embeddings model.
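For example, the embedding step might look like the following boto3 sketch using the Titan Text Embeddings model on HAQM Bedrock (the sample chunk text is illustrative):

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

def embed(text: str) -> list[float]:
    """Convert one text chunk to a vector with HAQM Titan Text Embeddings."""
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v1",
        body=json.dumps({"inputText": text}),
    )
    return json.loads(response["body"].read())["embedding"]

# Each chunk of source data becomes one vector to index in OpenSearch.
vector = embed("Quarterly revenue grew 12% driven by cloud services.")
```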
Hybrid Search with HAQM OpenSearch Service
This post explains the internals of hybrid search and how to build a hybrid search solution using OpenSearch Service. We experiment with sample queries to explore and compare lexical, semantic, and hybrid search. All the code used in this post is publicly available in the GitHub repository.
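As a rough sketch of the mechanics, a hybrid query runs a lexical and a semantic sub-query together while a search pipeline normalizes and weights the two score distributions; the index, field, model ID, and weights below are hypothetical:

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# Create a search pipeline that min-max normalizes BM25 and neural
# scores, then combines them with a 30/70 weighting.
client.transport.perform_request(
    "PUT",
    "/_search/pipeline/hybrid-pipeline",
    body={
        "phase_results_processors": [{
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]},
                },
            }
        }]
    },
)

# Hybrid query: lexical match plus semantic neural sub-query.
body = {
    "query": {"hybrid": {"queries": [
        {"match": {"text": "wireless headphones"}},
        {"neural": {"embedding": {"query_text": "wireless headphones",
                                  "model_id": "<model-id>", "k": 10}}},
    ]}}
}
results = client.search(index="products", body=body,
                        params={"search_pipeline": "hybrid-pipeline"})
```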
Preprocess and fine-tune LLMs quickly and cost-effectively using HAQM EMR Serverless and HAQM SageMaker
Large language models (LLMs) are becoming increasingly popular, with new use cases constantly being explored. In general, you can build applications powered by LLMs by incorporating prompt engineering into your code. However, there are cases where prompting an existing LLM falls short. This is where model fine-tuning can help. Prompt engineering is about guiding the […]
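As an illustration of the preprocessing step, a Spark job can be submitted to an EMR Serverless application like this (the application ID, role ARN, script, and S3 paths are placeholders):

```python
import boto3

emr = boto3.client("emr-serverless")

# Submit a Spark job to tokenize and deduplicate raw text ahead of
# fine-tuning. The script and its arguments are hypothetical.
response = emr.start_job_run(
    applicationId="<application-id>",
    executionRoleArn="arn:aws:iam::111122223333:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/preprocess.py",
            "entryPointArguments": [
                "--input", "s3://my-bucket/raw/",
                "--output", "s3://my-bucket/processed/",
            ],
            "sparkSubmitParameters": "--conf spark.executor.memory=8g",
        }
    },
)
print(response["jobRunId"])
```

The processed output in HAQM S3 can then feed a SageMaker fine-tuning job.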
Power neural search with AI/ML connectors in HAQM OpenSearch Service
With the launch of the neural search feature for HAQM OpenSearch Service in OpenSearch 2.9, it’s now effortless to integrate with AI/ML models to power semantic search and other use cases. OpenSearch Service has supported both lexical and vector search since the introduction of its k-nearest neighbor (k-NN) feature in 2020; however, configuring semantic search […]
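As a hedged sketch of what a connector looks like, the following registers an ML Commons connector pointing at a hypothetical SageMaker embedding endpoint; the exact request_body template depends on your model's input schema, and all names, regions, and ARNs are placeholders:

```python
from opensearchpy import OpenSearch

client = OpenSearch(
    hosts=[{"host": "my-domain.us-east-1.es.amazonaws.com", "port": 443}],
    use_ssl=True,
)

# Register a connector so OpenSearch can call the SageMaker endpoint
# on your behalf when serving neural queries.
connector = client.transport.perform_request(
    "POST",
    "/_plugins/_ml/connectors/_create",
    body={
        "name": "sagemaker-embedding-connector",
        "version": 1,
        "protocol": "aws_sigv4",
        "parameters": {"region": "us-east-1", "service_name": "sagemaker"},
        "credential": {
            "roleArn": "arn:aws:iam::111122223333:role/OpenSearchSageMakerRole"
        },
        "actions": [{
            "action_type": "predict",
            "method": "POST",
            "url": ("https://runtime.sagemaker.us-east-1.amazonaws.com"
                    "/endpoints/my-embedding-endpoint/invocations"),
            "headers": {"content-type": "application/json"},
            # Template assumed; adjust to the model's input schema.
            "request_body": '{"inputs": ${parameters.input}}',
        }],
    },
)
print(connector["connector_id"])
```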
Implement fine-grained access control in HAQM SageMaker Studio and HAQM EMR using Apache Ranger and Microsoft Active Directory
In this post, we show how you can authenticate into SageMaker Studio using an existing Active Directory (AD), with authorized access to both HAQM S3 and Hive-cataloged data using AD entitlements via Apache Ranger integration and AWS IAM Identity Center (successor to AWS Single Sign-On). With this solution, you can manage access to multiple SageMaker environments and SageMaker Studio notebooks using a single set of credentials. Subsequently, Apache Spark jobs created from SageMaker Studio notebooks will access only the data and resources permitted by Apache Ranger policies attached to the AD credentials, including table- and column-level access.
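To illustrate the Ranger side, a column-level Hive policy can be created through Ranger's public REST API; this is a minimal sketch with hypothetical host, service, group, table, and column names:

```python
import requests

# A Hive policy limiting an AD group (synced into Ranger) to two
# columns of one table. Service and resource names are placeholders.
policy = {
    "service": "hive_service",
    "name": "analysts-customers-columns",
    "resources": {
        "database": {"values": ["sales_db"]},
        "table": {"values": ["customers"]},
        "column": {"values": ["customer_id", "region"]},
    },
    "policyItems": [{
        "groups": ["analysts"],
        "accesses": [{"type": "select", "isAllowed": True}],
    }],
}

requests.post(
    "https://ranger.example.com:6182/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "<password>"),  # use real credential management in practice
)
```

Spark jobs launched from SageMaker Studio notebooks then inherit these restrictions through the Ranger Hive plugin on HAQM EMR.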