
Experiment Glue data staging #60

@victorskl

Description

This task is an experiment aimed at building an understanding of warehouse data staging via Glue.

Items:

  • Please follow the Glue local development setup - https://github.com/umccr/orcahouse/tree/main/infra/glue
    • To understand the basics of the PySpark API, how it processes data, the Spark DataFrame and its data-parallelism concepts, etc.
  • Exercise reusing the skel template, e.g. to process and stage another data source for the warehouse
  • Observe the Terraform in the deploy directory and its Glue ETL script deployment in the prod account (try invoking a job run, etc - no worries, the ETL scripts are idempotent)
  • Observe the existing Glue script for spreadsheet processing
    • In the current Glue ETL script, before loading into the warehouse staging database, investigate whether we can also "datalake" the output data (processed, or unprocessed/source as-is, etc.) for historical, archival purposes, etc.
    • Perhaps, better yet, an S3 Table bucket with Iceberg over the datalake?
    • Discuss the pros/cons, and use it where it fits the use case, etc.
  • Choose Python or Scala for the activity
