Skip to content

feat: DH-18399: Add ParquetColumnResolver#6558

Merged
devinrsmith merged 12 commits intodeephaven:mainfrom
devinrsmith:DH-18399-parquet-column-resolver
Jan 22, 2025
Merged

feat: DH-18399: Add ParquetColumnResolver#6558
devinrsmith merged 12 commits intodeephaven:mainfrom
devinrsmith:DH-18399-parquet-column-resolver

Conversation

@devinrsmith
Copy link
Copy Markdown
Member

No description provided.

@devinrsmith devinrsmith added parquet Related to the Parquet integration NoDocumentationNeeded ReleaseNotesNeeded Release notes are needed labels Jan 13, 2025
@devinrsmith devinrsmith added this to the 0.38.0 milestone Jan 13, 2025
@devinrsmith devinrsmith self-assigned this Jan 13, 2025
} else {
final ColumnDescriptor columnDescriptor = resolver.mapping().get(columnName);
if (columnDescriptor == null) {
nameList = List.of(); // empty, will not resolve
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make sure this actually results in supply ColumnRegion.Null implementations when a user tries to access the column.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, verified.

// There are different resolution strategies that could all be reasonable. We could consider using only the
// field id closest to the leaf. This version, however, takes the most general approach and considers field
// ids wherever they appear; ultimately, only being resolvable if the field id mapping is unambiguous.
for (Type type : path) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we do encounter a nested type here? That is, what's the current outcome?

Copy link
Copy Markdown
Member Author

@devinrsmith devinrsmith Jan 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand the question. There may definitely be "nested types" here; but path represents the full path to a single leaf (primitive) field.

* @param tableLocationKey the Parquet TLK
* @return the Parquet column resolver
*/
ParquetColumnResolver init(TableKey tableKey, ParquetTableLocationKey tableLocationKey);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

init -> resolver?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used generic naming of

* The map from Deephaven column name to {@link ColumnDescriptor}. The {@link #schema()} must contain each column
* descriptor.
*/
public abstract Map<String, ColumnDescriptor> mapping();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this to keep ColumnDescriptor as part of the interface of the implementation? Do we want the Iceberg implementation using this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm very much in favor of keeping this implementation strongly-typed with the safety checks. Iceberg should be able to use this implementation (and even use the same Factory to create it) in some situations.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed my mind; the map implementation is now very generic, no Parquet specific types.

Copy link
Copy Markdown
Member

@rcaudy rcaudy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

@devinrsmith devinrsmith marked this pull request as ready for review January 16, 2025 01:57
Copy link
Copy Markdown
Contributor

@malhotrashivam malhotrashivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

Also update relevant tests to check against full MessageType schema instead of pulling out the ColumnDescriptors
Comment on lines +58 to +59
public static List<String[]> getPaths(MessageType schema) {
final List<String[]> out = new ArrayList<>();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to add a comment here what these method actually do because there's no comments in the MessageType methods and they are a bit tricky.

return null;
}
primitiveType = field.asPrimitiveType();
if (isRepeated(primitiveType)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be able to merge the two if branches because isRequired and isRepeated only take Type, and not Primitive or Group type

}

/**
* A more efficient implementation of {@link MessageType#getColumnDescription(String[])}.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I understand the -1 in the calculation of max repetition level and max definition level in MessageType class.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -1 in the upstream code is because the are counting the MessageType as part of the calculation (and it always appears to be REPEATED).

@devinrsmith
Copy link
Copy Markdown
Member Author

nightlies all pass

Copy link
Copy Markdown
Contributor

@malhotrashivam malhotrashivam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor comments

@devinrsmith devinrsmith merged commit 4b3ea4b into deephaven:main Jan 22, 2025
@devinrsmith devinrsmith deleted the DH-18399-parquet-column-resolver branch January 22, 2025 18:07
@github-actions github-actions Bot locked and limited conversation to collaborators Jan 22, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

NoDocumentationNeeded parquet Related to the Parquet integration ReleaseNotesNeeded Release notes are needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants