This task aims at experimentation and at acquiring an understanding of warehouse data staging via Glue.
Items:
- Please follow the Glue local development setup - https://github.com/umccr/orcahouse/tree/main/infra/glue
- Understand the basics of the PySpark API and how it processes data: Spark DataFrames, the data parallelism concept, etc.
- Exercise the skel template to see whether you can reuse it, say, if you were to process and stage another data source for the warehouse
- Observe the Terraform in the deploy directory and how its Glue ETL scripts are deployed in the prod account (try invoking a job run, etc.; no worries, the ETL scripts are idempotent)
- Observe the existing Glue script for spreadsheet processing
- In the current Glue ETL script, before loading into the warehouse staging database, investigate whether we can also "datalake" the output data (processed, or unprocessed/source as-is, etc.) for historical, archival, and similar purposes
- Perhaps, better yet, an S3 Table bucket with Iceberg instead of a plain datalake?
- Discuss the pros/cons and use whichever approach fits the use case
- Choose Python or Scala for the activity