AWS Big Data Blog
Improve performance of workloads containing repetitive scan filters with multidimensional data layout sort keys in Amazon Redshift
Amazon Redshift, a widely used cloud data warehouse, has evolved significantly to meet the performance requirements of the most demanding workloads. This post covers one such new feature: the multidimensional data layout sort key. Amazon Redshift now improves your query performance by supporting multidimensional data layout sort keys, which is a new type of sort key […]
Amazon MSK now provides up to 29% more throughput and up to 24% lower costs with AWS Graviton3 support
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. Today, we’re excited to bring the benefits of Graviton3 to Kafka workloads, with Amazon MSK now offering M7g instances for new MSK provisioned clusters. AWS Graviton […]
Use Amazon EMR with S3 Access Grants to scale Spark access to Amazon S3
Amazon EMR is pleased to announce integration with Amazon Simple Storage Service (Amazon S3) Access Grants that simplifies Amazon S3 permission management and allows you to enforce granular access at scale. With this integration, you can scale job-based Amazon S3 access for Apache Spark jobs across all Amazon EMR deployment options and enforce granular Amazon […]
Large Language Models for sentiment analysis with Amazon Redshift ML (Preview)
Amazon Redshift ML empowers data analysts and database developers to integrate the capabilities of machine learning and artificial intelligence into their data warehouse. Amazon Redshift ML helps to simplify the creation, training, and application of machine learning models through familiar SQL commands. You can further enhance Amazon Redshift’s inferencing capabilities by Bringing Your Own Models […]
Enhance query performance using AWS Glue Data Catalog column-level statistics
Today, we’re making available a new capability of AWS Glue Data Catalog that allows generating column-level statistics for AWS Glue tables. These statistics are now integrated with the cost-based optimizers (CBO) of Amazon Athena and Amazon Redshift Spectrum, resulting in improved query performance and potential cost savings. Data lakes are designed for storing vast amounts […]
Introducing Apache Hudi support with AWS Glue crawlers
Apache Hudi is an open table format that brings database and data warehouse capabilities to data lakes. Apache Hudi helps data engineers manage complex challenges, such as managing continuously evolving datasets with transactions while maintaining query performance. Data engineers use Apache Hudi for streaming workloads as well as to create efficient incremental data pipelines. Hudi provides tables, transactions, efficient […]
Introducing persistent buffering for Amazon OpenSearch Ingestion
Amazon OpenSearch Ingestion is a fully managed, serverless pipeline that delivers real-time log, metric, and trace data to Amazon OpenSearch Service domains and OpenSearch Serverless collections. Customers use Amazon OpenSearch Ingestion pipelines to ingest data from a variety of data sources, both pull-based and push-based. When ingesting data from pull-based sources, such as Amazon Simple […]
Build scalable and serverless RAG workflows with a vector engine for Amazon OpenSearch Serverless and Amazon Bedrock Claude models
In pursuit of a more efficient and customer-centric support system, organizations are deploying cutting-edge generative AI applications. These applications are designed to excel in four critical areas: multi-lingual support, sentiment analysis, personally identifiable information (PII) detection, and conversational search capabilities. Customers worldwide can now engage with the applications in their preferred language, and the applications […]
Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics
For any modern data-driven company, having smooth data integration pipelines is crucial. These pipelines pull data from various sources, transform it, and load it into destination systems for analytics and reporting. When running properly, they provide timely and trustworthy information. However, without vigilance, the varying data volumes, characteristics, and application behavior can cause data pipelines […]
Introducing AWS Glue serverless Spark UI for better monitoring and troubleshooting
Today, we are pleased to announce serverless Spark UI built into the AWS Glue console. Because Spark UI is now a built-in component of the AWS Glue console, you can open it with a single click when examining the details of any given job run. There’s no infrastructure setup or teardown required. AWS Glue serverless Spark UI is a fully managed serverless offering and generally starts up in a matter of seconds. Serverless Spark UI makes it significantly faster and easier to get jobs working in production because you have ready access to low-level details for your job runs.