feat: expose Payload.WritePayload to allow serializing into IPC format #421
Conversation
Naming is hard, so I'm open to suggestions on the method name.
Is there any way we can convince the BigQuery side to instead allow the schema message? The Arrow IPC spec states that there should be a schema message as the first message in the stream. By rejecting the schema message, BigQuery is requiring an invalid IPC stream. Also, instead of adding a hardcoded option like this, could you just utilize the existing `Payload` APIs?
I'm also discussing that with the team, but those things take a bit more time. That was my first thought when I found this issue.
Good to know that, will use that as more evidence that we might need to fix this on the service side.
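To make the failure mode concrete, here is a minimal sketch (not from the PR itself; import paths assume the arrow-go v18 module layout) showing that `ipc.NewWriter` always emits the Schema message as the first encapsulated message in the stream, which is exactly what the service rejects:

```go
package main

import (
	"bytes"
	"fmt"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/array"
	"github.com/apache/arrow-go/v18/arrow/ipc"
	"github.com/apache/arrow-go/v18/arrow/memory"
)

func main() {
	schema := arrow.NewSchema([]arrow.Field{{Name: "x", Type: arrow.PrimitiveTypes.Int64}}, nil)

	bldr := array.NewRecordBuilder(memory.DefaultAllocator, schema)
	defer bldr.Release()
	bldr.Field(0).(*array.Int64Builder).AppendValues([]int64{1, 2, 3}, nil)
	rec := bldr.NewRecord()
	defer rec.Release()

	var buf bytes.Buffer
	w := ipc.NewWriter(&buf, ipc.WithSchema(schema))
	if err := w.Write(rec); err != nil {
		panic(err)
	}
	if err := w.Close(); err != nil {
		panic(err)
	}

	// buf now holds [Schema message][RecordBatch message]...: the stream
	// always begins with the Schema message when using the writer directly.
	fmt.Println("stream size:", buf.Len())
}
```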
I tried to go this route, but how do I reuse all the internal functions to do that, like `writeIPCPayload`?
I tried some silly implementation like this, but could not get it to work; it seems to be missing some alignment of the data, which is handled by the internal `writeIPCPayload` function.
@zeroshade I hit a wall here and could not make any progress.
I pushed another commit hiding the `skipEmittingSchema` option. On the BigQuery side, the argument is that indeed we are only supposed to send RecordBatches, as receiving a schema triggers some logic to check whether there is any change to the table schema, a mechanism that also exists when using the protobuf format. This mechanism is expensive and should be avoided when possible.
I'll take a look in a bit and get back with my thoughts.
Why is the schema verification expensive? Also, wouldn't you still need to run the same logic to check for changes to the table schema upon receiving the first RecordBatch? How does avoiding the Schema message allow you to avoid validating the schema?
It basically adds an extra database call/metadata check on the backend to compare the BigQuery table schema with the Arrow schema and see if there are any changes.
Yes, but only on the first RecordBatch and/or when opening the stream to start writing data. Also, when the schema does change, there is a separate field where the schema can be provided. The schema and the serialized record batches go to the backend in separate fields.
We only check for schema changes when the specific schema field is filled out; it works the same for Arrow and Protobuf.
Can't this logic just shift to only do the check on the Schema message then? A Schema message is semantically the same as a RecordBatch message with no rows. The same semantics that guarantee the record batches all have the same schema is still true for the schema message (along with dictionary messages). It seems like you have to validate the schema regardless (whether on the first record batch or on the schema message which is always the first message), so wouldn't it be preferable to accept a valid IPC stream (which contains the schema message) instead of an invalid IPC stream that is missing the schema message?
It's still pretty easy to populate the protobuf by calling `ipc.NewWriter` with a `bytes.Buffer` for each request.
But the schema message would go on every request; not sure if I follow the logic here. The idea is to avoid sending the schema on every request. To add another point, PyArrow allows for reading the schema and record batches separately in IPC format.
I tried this approach, but the messages got rejected. I'll get more information on the error here, but it seems like the serialized message isn't what the backend expects.
My suggestion was that instead of avoiding sending the schema on every request, the backend would just use whatever logic it does to validate the schema of the first record batch to validate the schema and thus it would be able to skip validating the record batch schema (since it was already validated by validating the Schema message). Just shifting the logic from the first record batch message to the schema message (nothing else about the logic would change). As I said, according to the spec leaving off the schema message is technically an invalid IPC stream.
You can already do the equivalent of that Python code in Go, though I guess the issue you run into is the lack of the padding handling. If we simply add a new method to the `Payload` struct, we can achieve the exact same logic. This PR could instead just be the following:

```go
// a drawback to this is having to use bytes.Buffer to get the raw bytes
// if you aren't already using an io.Writer.
func (p *Payload) WritePayload(w io.Writer) (int, error) {
	return writeIPCPayload(w, *p)
}

// alternatively, if we just want to get the raw bytes, we can do
func (p *Payload) SerializedBytes() ([]byte, error) {
	var b bytes.Buffer
	if _, err := writeIPCPayload(&b, *p); err != nil {
		return nil, err
	}
	return b.Bytes(), nil
}
```

Then you can create the Go equivalent of the pyarrow example you provided, without needing an entire new writer:

```go
func appendRows(tbl arrow.Table, projectID, datasetID, tableID string) error {
	// create request etc....

	schemaPayload := ipc.GetSchemaPayload(tbl.Schema(), memory.DefaultAllocator)
	serializedSchemaBytes, err := schemaPayload.SerializedBytes()
	if err != nil {
		return err
	}
	// do whatever you want with the byte slice for the schema

	rdr := array.NewTableReader(tbl, tbl.NumRows())
	defer rdr.Release()

	// the pyarrow example only uses the first record batch; you would probably
	// instead use for rdr.Next() to loop over all the batches, but I'll mirror
	// the pyarrow example for now
	rdr.Next()
	payload, err := ipc.GetRecordBatchPayload(rdr.Record())
	if err != nil {
		return err
	}
	serializedRecordBytes, err := payload.SerializedBytes()
	if err != nil {
		return err
	}

	// do whatever you like with the serializedRecordBytes
	// ....
	return nil
}
```
This is what I tried with `SerializeBody`:

```go
func writeToBQClient(schema *arrow.Schema, records []arrow.Record, writeStreamName string, appendStream storagepb.BigQueryWrite_AppendRowsClient) error {
	// Serialize schema using IPC format (just the schema, no data)
	var schemaBuf bytes.Buffer
	schemaPayload := ipc.GetSchemaPayload(schema, memory.DefaultAllocator)
	schemaMd := schemaPayload.Meta()
	defer schemaMd.Release()
	schemaBuf.Write(schemaMd.Bytes())
	schemaPayload.SerializeBody(&schemaBuf)
	schemaData := schemaBuf.Bytes()

	// Serialize record batches using IPC format
	var recordBuf bytes.Buffer
	for _, record := range records {
		p, err := ipc.GetRecordBatchPayload(record, ipc.WithSchema(schema))
		if err != nil {
			slog.Error("failed to write Arrow record", "error", err)
			return err
		}
		m := p.Meta()
		defer m.Release()
		recordBuf.Write(m.Bytes())
		p.SerializeBody(&recordBuf)
	}
	recordData := recordBuf.Bytes()

	slog.Info("Schema IPC data size", "bytes", len(schemaData))
	slog.Info("Record IPC data size", "bytes", len(recordData))

	request := &storagepb.AppendRowsRequest{
		WriteStream: writeStreamName,
		Rows: &storagepb.AppendRowsRequest_ArrowRows{
			ArrowRows: &storagepb.AppendRowsRequest_ArrowData{
				WriterSchema: &storagepb.ArrowSchema{
					SerializedSchema: schemaData,
				},
				Rows: &storagepb.ArrowRecordBatch{
					SerializedRecordBatch: recordData,
				},
			},
		},
	}

	// Send the request
	if err := appendStream.Send(request); err != nil {
		slog.Error("failed to send AppendRows request", "error", err)
		return err
	}
	if err := appendStream.CloseSend(); err != nil {
		slog.Error("failed to close AppendRows request", "error", err)
		return err
	}
	return nil
}
```
Yea, that's because writing `Meta()` and `SerializeBody` directly skips the padding/alignment handling that `writeIPCPayload` does.
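For anyone hitting the same wall, here is a rough illustration of the framing involved (this is a sketch of the IPC encapsulated-message layout per the Arrow spec, not the library's actual internals):

```go
// Each encapsulated IPC message is laid out as:
//
//   [0xFFFFFFFF continuation marker : 4 bytes]
//   [metadata size (little-endian)  : 4 bytes]
//   [flatbuffer metadata + zero padding to an 8-byte boundary]
//   [message body]
//
// Concatenating Meta().Bytes() and SerializeBody skips the two prefixes and
// the padding, so the result is not a parseable IPC message.
func encapsulatedMetaLen(flatbufLen int) int {
	const prefix = 8 // continuation marker + metadata size
	return (prefix + flatbufLen + 7) &^ 7 // round up to a multiple of 8
}
```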
This approach works for me too; I figured out that exposing `writeIPCPayload` was the missing piece.
Yea, the preferred approach would be to simply expose `writeIPCPayload`. Well, the preferred approach would be that BigQuery accept a valid IPC stream with the Schema message 😄 but this works too.
Agree, doing this right now, exposing the `WritePayload` method.
Still trying to win this fight here, but I brought up the reasons why it currently works this way.
Absolutely, I appreciate the context and feel free to have anyone internal with more questions reach out to me if they want more info. I'm gonna hope you can win that fight 😄
Pushed updates to the PR and updated the description.
```go
	ensureNativeEndian bool
	noAutoSchema       bool
	emitDictDeltas     bool
	skipEmittingSchema bool
```

```go
	lastWrittenDicts   map[int64]arrow.Array
	emitDictDeltas     bool
	skipEmittingSchema bool
```
Same as above, can we revert the changes for `skipEmittingSchema` since we are just adding the `WritePayload` method?
Ah sorry, I'm stupid, I forgot those changes. Let me remove them here.
zeroshade left a comment
Thanks for this and the discussion. Looking forward to more discussion with BigQuery in the future!
Rationale for this change
The BigQuery Storage Write API now accepts Arrow data, but the schema and the record batches need to be sent separately. Right now `ipc.NewWriter` writes the schema to the output buffer, which makes the request fail on the BigQuery side, as the first message is the schema. So we need a way to write just RecordBatches.

See the related discussion on googleapis/google-cloud-go#12478
What changes are included in this PR?
Add a `WritePayload` method to `Payload`, which exposes the underlying `writeIPCPayload` function, allowing users to write IPC messages more easily.

Are these changes tested?
Yes, tested with the BigQuery Storage Write API, and local tests were added as well.
Are there any user-facing changes?
Yes, `WritePayload` will be a public method available to users.
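For illustration, usage could look roughly like the following sketch (not part of the PR; `serializeForBigQuery` is a hypothetical helper, and the import paths assume the arrow-go v18 module layout):

```go
import (
	"bytes"

	"github.com/apache/arrow-go/v18/arrow"
	"github.com/apache/arrow-go/v18/arrow/ipc"
	"github.com/apache/arrow-go/v18/arrow/memory"
)

// serializeForBigQuery produces the two byte slices BigQuery wants in separate
// fields: the Schema message and a RecordBatch message with no leading Schema.
func serializeForBigQuery(rec arrow.Record) (schemaBytes, recordBytes []byte, err error) {
	var schemaBuf, recordBuf bytes.Buffer

	// Schema message bytes, for the ArrowSchema.SerializedSchema field.
	schemaPayload := ipc.GetSchemaPayload(rec.Schema(), memory.DefaultAllocator)
	if _, err = schemaPayload.WritePayload(&schemaBuf); err != nil {
		return nil, nil, err
	}

	// RecordBatch message bytes, for the ArrowRecordBatch.SerializedRecordBatch
	// field; WritePayload handles the prefix and padding discussed above.
	recPayload, err := ipc.GetRecordBatchPayload(rec)
	if err != nil {
		return nil, nil, err
	}
	defer recPayload.Release()
	if _, err = recPayload.WritePayload(&recordBuf); err != nil {
		return nil, nil, err
	}

	return schemaBuf.Bytes(), recordBuf.Bytes(), nil
}
```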