feat: DH-18399: Add ParquetColumnResolver#6558
Conversation
| } else { | ||
| final ColumnDescriptor columnDescriptor = resolver.mapping().get(columnName); | ||
| if (columnDescriptor == null) { | ||
| nameList = List.of(); // empty, will not resolve |
There was a problem hiding this comment.
Let's make sure this actually results in supply ColumnRegion.Null implementations when a user tries to access the column.
| // There are different resolution strategies that could all be reasonable. We could consider using only the | ||
| // field id closest to the leaf. This version, however, takes the most general approach and considers field | ||
| // ids wherever they appear; ultimately, only being resolvable if the field id mapping is unambiguous. | ||
| for (Type type : path) { |
There was a problem hiding this comment.
What happens if we do encounter a nested type here? That is, what's the current outcome?
There was a problem hiding this comment.
I'm not sure I understand the question. There may definitely be "nested types" here; but path represents the full path to a single leaf (primitive) field.
| * @param tableLocationKey the Parquet TLK | ||
| * @return the Parquet column resolver | ||
| */ | ||
| ParquetColumnResolver init(TableKey tableKey, ParquetTableLocationKey tableLocationKey); |
There was a problem hiding this comment.
used generic naming of
| * The map from Deephaven column name to {@link ColumnDescriptor}. The {@link #schema()} must contain each column | ||
| * descriptor. | ||
| */ | ||
| public abstract Map<String, ColumnDescriptor> mapping(); |
There was a problem hiding this comment.
Do we need this to keep ColumnDescriptor as part of the interface of the implementation? Do we want the Iceberg implementation using this?
There was a problem hiding this comment.
I'm very much in favor of keeping this implementation strongly-typed with the safety checks. Iceberg should be able to use this implementation (and even use the same Factory to create it) in some situations.
There was a problem hiding this comment.
Changed my mind; the map implementation is now very generic, no Parquet specific types.
Also update relevant tests to check against full MessageType schema instead of pulling out the ColumnDescriptors
| public static List<String[]> getPaths(MessageType schema) { | ||
| final List<String[]> out = new ArrayList<>(); |
There was a problem hiding this comment.
It would be better to add a comment here what these method actually do because there's no comments in the MessageType methods and they are a bit tricky.
| return null; | ||
| } | ||
| primitiveType = field.asPrimitiveType(); | ||
| if (isRepeated(primitiveType)) { |
There was a problem hiding this comment.
You might be able to merge the two if branches because isRequired and isRepeated only take Type, and not Primitive or Group type
| } | ||
|
|
||
| /** | ||
| * A more efficient implementation of {@link MessageType#getColumnDescription(String[])}. |
There was a problem hiding this comment.
I am not sure I understand the -1 in the calculation of max repetition level and max definition level in MessageType class.
There was a problem hiding this comment.
The -1 in the upstream code is because the are counting the MessageType as part of the calculation (and it always appears to be REPEATED).
|
nightlies all pass |
No description provided.