
Access Amazon Redshift Managed Storage tables via Apache Spark on AWS Glue and Amazon EMR using Amazon SageMaker Lakehouse

Data environments in data-driven organizations are evolving to meet the growing demand for analytics, including business intelligence (BI) dashboarding, one-time querying, data science, machine learning (ML), and generative AI. These organizations have an enormous demand for lakehouse solutions that combine the best of data warehouses and data lakes to simplify data management with easy access to all data from their preferred engines.

Amazon SageMaker Lakehouse unifies all your data across Amazon Simple Storage Service (Amazon S3) data lakes and Amazon Redshift data warehouses, helping you build powerful analytics and artificial intelligence and machine learning (AI/ML) applications on a single copy of data. SageMaker Lakehouse gives you the flexibility to access and query your data in place with all Apache Iceberg compatible tools and engines. It secures your data in the lakehouse by defining fine-grained permissions, which are consistently applied across all analytics and ML tools and engines. You can bring data from operational databases and applications into your lakehouse in near real time through zero-ETL integrations. It also accesses and queries data in place with federated query capabilities across third-party data sources through Amazon Athena.

With SageMaker Lakehouse, you can access tables stored in Amazon Redshift managed storage (RMS) through Iceberg APIs, using the Iceberg REST catalog backed by the AWS Glue Data Catalog. This expands your data integration workloads across data lakes and data warehouses, enabling seamless access to diverse data sources.

Amazon SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0 natively support SageMaker Lakehouse. This post describes how to integrate data in RMS tables through Apache Spark using SageMaker Unified Studio, Amazon EMR 7.5.0 and higher, and AWS Glue 5.0.

How to access RMS tables through Apache Spark on AWS Glue and Amazon EMR

With SageMaker Lakehouse, RMS tables are accessible through the Apache Iceberg REST catalog. Open source engines such as Apache Spark are compatible with Apache Iceberg, and they can interact with RMS tables by configuring this Iceberg REST catalog. You can learn more in Connecting to the Data Catalog using AWS Glue Iceberg REST extension endpoint.

Note that the Iceberg REST extensions endpoint is used when you access RMS tables. This endpoint is accessible through the Apache Iceberg AWS Glue Data Catalog extensions, which come preinstalled on AWS Glue 5.0 and Amazon EMR 7.5.0 or higher. The extension library enables access to RMS tables using the Amazon Redshift connector for Apache Spark.

To access RMS-backed catalog databases from Spark, each RMS database requires its own Spark session catalog configuration. Here are the required Spark configurations; a minimal configuration sketch follows the parameter list below:

Spark config key                                Value
spark.sql.catalog.{catalog_name}                org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.{catalog_name}.type           glue
spark.sql.catalog.{catalog_name}.glue.id        {account_id}:{rms_catalog_name}/{database_name}
spark.sql.catalog.{catalog_name}.client.region  {aws_region}
spark.sql.extensions                            org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

Configuration parameters:

  • {catalog_name}: Your chosen name for referencing the RMS catalog database in your application code
  • {rms_catalog_name}: The RMS catalog name as shown in the AWS Lake Formation catalogs section
  • {database_name}: The RMS database name
  • {aws_region}: The AWS Region where the RMS catalog is located
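
To make the placeholders concrete, the following is a minimal sketch of how these settings could be applied to a Spark session. It uses an example account ID together with the rms-catalog-demo catalog and dev database that appear later in this post; replace the values with your own:

from pyspark.sql import SparkSession

# Illustrative values only; replace with your own account ID, catalog, database, and Region
catalog_name = "rmscatalog"                           # name used to reference the catalog in code
rms_catalog_id = "111122223333:rms-catalog-demo/dev"  # {account_id}:{rms_catalog_name}/{database_name}
aws_region = "us-east-2"

spark = (
    SparkSession.builder.appName("rms_config_example")
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog_name}.type", "glue")
    .config(f"spark.sql.catalog.{catalog_name}.glue.id", rms_catalog_id)
    .config(f"spark.sql.catalog.{catalog_name}.client.region", aws_region)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)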

For a deeper understanding of how the Amazon Redshift hierarchy (databases, schemas, and tables) is mapped to the AWS Glue multilevel catalogs, you can refer to the Bringing Amazon Redshift data into the AWS Glue Data Catalog documentation.

In the following sections, we demonstrate how to access RMS tables through Apache Spark using SageMaker Unified Studio JupyterLab notebooks with the AWS Glue 5.0 runtime and Amazon EMR Serverless.

Although we could bring existing Amazon Redshift tables into the AWS Glue Data Catalog by creating a Lakehouse Redshift catalog from an existing Redshift namespace and providing access to a SageMaker Unified Studio project, in the following example you will create a managed Amazon Redshift Lakehouse catalog directly from SageMaker Unified Studio and work with that.

Prerequisites

To follow these instructions, you must have the following prerequisites:

Create a SageMaker Unified Studio project

Complete the following steps to create a SageMaker Unified Studio project:

  1. Sign in to SageMaker Unified Studio.
  2. Choose Select a project on the top menu and choose Create project.
  3. For Project name, enter demo.
  4. For Project profile, select All capabilities.
  5. Choose Continue.
  6. Leave the default values and choose Continue.
  7. Review the configurations and choose Create project.

You need to wait for the project to be created. Project creation can take about 5 minutes. When the project status changes to Active, select the project name to access the project's home page.

  8. Make note of the project role ARN, because you will need it in subsequent steps.

You have successfully created the project and noted the project role ARN. The next step is to configure a Lakehouse catalog for your RMS.

Configure a Lakehouse catalog for your RMS

Complete the following steps to configure a Lakehouse catalog for your RMS:

  1. In the navigation pane, choose Data.
  2. Choose the + (plus) sign.
  3. Select Create Lakehouse catalog to create a new catalog and choose Next.
  4. For Lakehouse catalog name, enter rms-catalog-demo.
  5. Choose Add catalog.
  6. Wait for the catalog to be created.
  7. In SageMaker Unified Studio, choose Data in the left navigation pane, then select the three vertical dots next to Redshift (Lakehouse) and choose Refresh to make sure the Amazon Redshift compute is active.

Create a new table in the RMS Lakehouse catalog:

  1. In SageMaker Unified Studio, on the top menu, under Build, select Query Editor.
  2. On the top right, choose Select data source.
  3. For CONNECTIONS, select Redshift (Lakehouse).
  4. For DATABASES, select dev@rms-catalog-demo.
  5. For SCHEMAS, select public.
  6. Choose Choose.

  7. In the query cell, enter and execute the following query to create a new schema:
create schema "dev@rms-catalog-demo".salesdb

  8. In a new cell, enter and execute the following query to create a new table:
create table salesdb.store_sales (ss_sold_timestamp timestamp, ss_item text, ss_sales_price float);

  9. In a new cell, enter and execute the following query to populate the table with sample data:
insert into salesdb.store_sales values ('2024-12-01T09:00:00Z', 'Product 1', 100.0),
('2024-12-01T11:00:00Z', 'Product 2', 500.0),
('2024-12-01T15:00:00Z', 'Product 3', 20.0),
('2024-12-01T17:00:00Z', 'Product 4', 1000.0),
('2024-12-01T18:00:00Z', 'Product 5', 30.0),
('2024-12-02T10:00:00Z', 'Product 6', 5000.0),
('2024-12-02T16:00:00Z', 'Product 7', 5.0);

  10. In a new cell, enter and run the following query to verify the table contents:
select * from salesdb.store_sales;

(Optional) Create an Amazon EMR Serverless application

IMPORTANT: This section is only required if you also plan to test using Amazon EMR Serverless. If you intend to use AWS Glue only, you can skip this section entirely.

  1. Navigate to the project page. In the left navigation pane, select Compute, then select the Data processing tab and choose Add compute.
  2. Choose Create new compute resources, then choose Next.
  3. Select EMR Serverless.
  4. Specify emr_serverless_application as Compute name, choose Compatibility as Permission mode, and choose Add compute.
  5. Monitor the deployment progress. Wait for the Amazon EMR Serverless application to complete its deployment. This process can take a minute.

Access Amazon Redshift Managed Storage tables through Apache Spark

In this section, we demonstrate how to query tables stored in RMS using a SageMaker Unified Studio notebook.

  1. In the navigation pane, choose Data.
  2. Under Lakehouse, choose the down arrow next to rms-catalog-demo.
  3. Under dev, choose the down arrow next to salesdb, select store_sales, and choose the three dots.

SageMaker Lakehouse provides several analysis options: Query with Athena, Query with Redshift, and Open in Jupyter Lab notebook.

  4. Choose Open in Jupyter Lab notebook.
  5. On the Launcher tab, choose Python 3 (ipykernel).

In SageMaker Unified Studio JupyterLab, you can specify different compute types for each notebook cell. Although this example demonstrates using AWS Glue compute (project.spark.compatibility), the same code can be executed using Amazon EMR Serverless by selecting the appropriate compute in the cell settings. The following shows the connection type and compute values to specify when running PySpark code or Spark SQL code with different engines:

  • AWS Glue: set Connection type to PySpark for PySpark code or SQL for Spark SQL, and set Compute to project.spark.compatibility.
  • Amazon EMR Serverless: set Connection type to PySpark for PySpark code or SQL for Spark SQL, and set Compute to emr-s.emr_serverless_application.

  6. In the notebook cell's top left corner, set Connection Type to PySpark and select spark.compatibility (AWS Glue 5.0) as Compute.
  7. Execute the following code to initialize the SparkSession and configure rmscatalog as the session catalog for accessing the dev database under the rms-catalog-demo RMS catalog:
from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
# Replace <account_id> with your AWS account ID
rms_catalog_id = "<account_id>:rms-catalog-demo/dev"

# Replace with your AWS Region
aws_region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_id) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', aws_region) \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()
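
As a quick check that the catalog configuration works, you could also read the table from the same PySpark cell. The following is a minimal sketch that assumes the salesdb.store_sales table created earlier in this post:

# Read the RMS-backed Iceberg table through the configured session catalog
store_sales_df = spark.table(f"{catalog_name}.salesdb.store_sales")

# Inspect the schema and a few sample rows
store_sales_df.printSchema()
store_sales_df.show(5, truncate=False)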

  8. Create a new cell and switch the connection type from PySpark to SQL to execute Spark SQL commands directly.
  9. Enter the following SQL statement to view all tables under salesdb (RMS schema) within rmscatalog:
SHOW TABLES IN rmscatalog.salesdb

  10. In a new SQL cell, enter the following DESCRIBE EXTENDED statement to view detailed information about the store_sales table in the salesdb schema:
DESCRIBE EXTENDED rmscatalog.salesdb.store_sales

In the output, you will observe that Provider is set to iceberg. This indicates that the table is recognized as an Iceberg table, despite being stored in Amazon Redshift managed storage.

  11. In a new SQL cell, enter the following SELECT statement to view the contents of the table:
SELECT * FROM rmscatalog.salesdb.store_sales

Throughout this example, we demonstrated how to create a table in Amazon Redshift Serverless and seamlessly query it as an Iceberg table using Apache Spark within a SageMaker Unified Studio notebook.

Clean up

To avoid incurring future charges, clean up all created resources:

  1. Delete the created SageMaker Unified Studio project. This step automatically deletes the Amazon EMR compute (for example, the Amazon EMR Serverless application) that was provisioned from the project:
    1. In SageMaker Unified Studio, navigate to the demo project's Project overview section.
    2. Choose Actions, then select Delete project.
    3. Type confirm and choose Delete project.
  2. Delete the created Lakehouse catalog:
    1. Navigate to the AWS Lake Formation page, in the Catalogs section.
    2. Select the rms-catalog-demo catalog, choose Actions, then select Delete.
    3. In the confirmation window, type rms-catalog-demo and then choose Drop.

Conclusion

In this post, we demonstrated how to use Apache Spark to interact with Amazon Redshift Managed Storage tables through Amazon SageMaker Lakehouse using the Iceberg REST catalog. This integration provides a unified view of your data across Amazon S3 data lakes and Amazon Redshift data warehouses, so you can build powerful analytics and AI/ML applications while maintaining a single copy of your data.

For additional workloads and implementations, visit Simplify data access for your enterprise using Amazon SageMaker Lakehouse.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect with Amazon Web Services (AWS) Analytics services. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling on his road bike.

Stefano Sandonà is a Senior Big Data Specialist Solutions Architect at Amazon Web Services (AWS). Passionate about data, distributed systems, and security, he helps customers worldwide architect high-performance, efficient, and secure data solutions.

Derek Liu is a Senior Solutions Architect based out of Vancouver, BC. He enjoys helping customers solve big data challenges through Amazon Web Services (AWS) analytics services.

Raj Ramasubbu is a Senior Analytics Specialist Solutions Architect focused on big data and analytics and AI/ML with Amazon Web Services (AWS). He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS. Raj provided technical expertise and leadership in building data engineering, big data analytics, business intelligence, and data science solutions for over 18 years prior to joining AWS. He has helped customers in various industry verticals, including healthcare, medical devices, life sciences, retail, asset management, car insurance, residential REITs, agriculture, title insurance, supply chain, document management, and real estate.

Angel is a Sr. EMEA Data & AI PSA based in Madrid. He has previously worked on research related to data analytics and AI in diverse European research projects. In his current role, Angel helps partners develop businesses centered on data and AI.


Appendix: Sample script for a Lake Formation FGAC-enabled Spark cluster

If you want to access RMS tables from a Lake Formation FGAC-enabled Spark cluster on AWS Glue or Amazon EMR, refer to the following code example:

from pyspark.sql import SparkSession

catalog_name = "rmscatalog"
rms_catalog_name = "123456789012:rms-catalog-demo/dev"
account_id = "123456789012"
region = "us-east-2"

spark = SparkSession.builder.appName('rms_demo') \
    .config('spark.sql.defaultCatalog', catalog_name) \
    .config(f'spark.sql.catalog.{catalog_name}', 'org.apache.iceberg.spark.SparkCatalog') \
    .config(f'spark.sql.catalog.{catalog_name}.type', 'glue') \
    .config(f'spark.sql.catalog.{catalog_name}.glue.id', rms_catalog_name) \
    .config(f'spark.sql.catalog.{catalog_name}.client.region', region) \
    .config(f'spark.sql.catalog.{catalog_name}.glue.account-id', account_id) \
    .config(f'spark.sql.catalog.{catalog_name}.glue.catalog-arn', f'arn:aws:glue:{region}:{rms_catalog_name}') \
    .config('spark.sql.extensions', 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .getOrCreate()
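
Because spark.sql.defaultCatalog points at the RMS catalog, you can reference schemas and tables without the catalog prefix. The following is a minimal usage sketch that assumes the salesdb.store_sales table created earlier in this post:

# With the default catalog set, the rmscatalog prefix can be omitted
spark.sql("SHOW TABLES IN salesdb").show()
spark.sql("SELECT * FROM salesdb.store_sales").show(truncate=False)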
