From c45b8c4e61fade2424497a1b0190b56f27c1e88f Mon Sep 17 00:00:00 2001 From: Kashif Khan Date: Tue, 10 May 2022 12:28:24 -0500 Subject: [PATCH 1/7] changes for eh troubleshooting guide --- sdk/eventhub/azure-eventhub/README.md | 30 +- .../azure-eventhub/TROUBLESHOOTING.md | 265 ++++++++++++++++++ 2 files changed, 266 insertions(+), 29 deletions(-) create mode 100644 sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md diff --git a/sdk/eventhub/azure-eventhub/README.md b/sdk/eventhub/azure-eventhub/README.md index 1d8e6ddebe94..0def1c06c922 100644 --- a/sdk/eventhub/azure-eventhub/README.md +++ b/sdk/eventhub/azure-eventhub/README.md @@ -395,35 +395,7 @@ Refer to [IoT Hub Connection String Sample](https://github.com/Azure/azure-sdk-f ## Troubleshooting -### General - -The Event Hubs APIs generate the following exceptions in azure.eventhub.exceptions - -- **AuthenticationError:** Failed to authenticate because of wrong address, SAS policy/key pair, SAS token or azure identity. -- **ConnectError:** Failed to connect to the EventHubs. The AuthenticationError is a type of ConnectError. -- **ConnectionLostError:** Lose connection after a connection has been built. -- **EventDataError:** The EventData to be sent fails data validation. For instance, this error is raised if you try to send an EventData that is already sent. -- **EventDataSendError:** The Eventhubs service responds with an error when an EventData is sent. -- **OperationTimeoutError:** EventHubConsumer.send() times out. -- **EventHubError:** All other Eventhubs related errors. It is also the root error class of all the errors described above. - -### Logging - -- Enable `azure.eventhub` logger to collect traces from the library. -- Enable `uamqp` logger to collect traces from the underlying uAMQP library. -- Enable AMQP frame level trace by setting `logging_enable=True` when creating the client. -- There may be cases where you consider the `uamqp` logging to be too verbose. 
To suppress unnecessary logging, add the following snippet to the top of your code: -```python -import logging - -# The logging levels below may need to be adjusted based on the logging that you want to suppress. -uamqp_logger = logging.getLogger('uamqp') -uamqp_logger.setLevel(logging.ERROR) - -# or even further fine-grained control, suppressing the warnings in uamqp.connection module -uamqp_connection_logger = logging.getLogger('uamqp.connection') -uamqp_connection_logger.setLevel(logging.ERROR) -``` +See the `azure-eventhub` [troubleshooting guide](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md) for details on how to diagnose various failure scenarios. ## Next steps diff --git a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md new file mode 100644 index 000000000000..a67081437c36 --- /dev/null +++ b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md @@ -0,0 +1,265 @@ +# Troubleshoot Event Hubs issues + +This troubleshooting guide covers failure investigation techniques, common errors for the credential types in the Azure Identity Java client library, and mitigation steps to resolve these errors. 
+ +- [Handle Event Hubs exceptions](#handle-event-hubs-exceptions) + - [Find relevant information in exception messages](#find-relevant-information-in-exception-messages) + - [Commonly encountered exceptions](#commonly-encountered-exceptions) +- [Permission issues](#permission-issues) +- [Connectivity issues](#connectivity-issues) + - [Timeout when connecting to service](#timeout-when-connecting-to-service) + - [SSL handshake failures](#ssl-handshake-failures) + - [Socket exhaustion errors](#socket-exhaustion-errors) + - [Connect using an IoT connection string](#connect-using-an-iot-connection-string) + - [Cannot add components to the connection string](#cannot-add-components-to-the-connection-string) +- [Enable and configure logging](#enable-and-configure-logging) + - [Configuring Log4J 2](#configuring-log4j-2) + - [Configuring logback](#configuring-logback) + - [Enable AMQP transport logging](#enable-amqp-transport-logging) + - [Reduce logging](#reduce-logging) +- [Troubleshoot EventProducerAsyncClient/EventProducerClient issues](#troubleshoot-eventproducerasyncclienteventproducerclient-issues) + - [Cannot set multiple partition keys for events in EventDataBatch](#cannot-set-multiple-partition-keys-for-events-in-eventdatabatch) + - [Setting partition key on EventData is not set in Kafka consumer](#setting-partition-key-on-eventdata-is-not-set-in-kafka-consumer) +- [Troubleshoot EventProcessorClient issues](#troubleshoot-eventprocessorclient-issues) + - [412 precondition failures when checkpointing](#412-precondition-failures-when-checkpointing) + - [Partition ownership changes a lot](#partition-ownership-changes-a-lot) + - ["...current receiver 'nil' with epoch '0' is getting disconnected"](#current-receiver-nil-with-epoch-0-is-getting-disconnected) + - [High CPU usage](#high-cpu-usage) + - [Processor client stops receiving](#processor-client-stops-receiving) + - [Migrate from legacy to new client library](#migrate-from-legacy-to-new-client-library) +- [Get 
additional help](#get-additional-help) + +## Handle Event Hubs exceptions + +The Event Hubs APIs generate the following exceptions in azure.eventhub.exceptions + +- **AuthenticationError:** Failed to authenticate because of wrong address, SAS policy/key pair, SAS token or azure identity. +- **ConnectError:** Failed to connect to the EventHubs. The AuthenticationError is a type of ConnectError. +- **ConnectionLostError:** Lose connection after a connection has been built. +- **EventDataError:** The EventData to be sent fails data validation. For instance, this error is raised if you try to send an EventData that is already sent. +- **EventDataSendError:** The Eventhubs service responds with an error when an EventData is sent. +- **OperationTimeoutError:** EventHubConsumer.send() times out. +- **EventHubError:** All other Eventhubs related errors. It is also the root error class of all the errors described above. + +### Find relevant information in exception messages + +An [AmqpException][AmqpException] contains three fields which describe the error. + +* **getErrorCondition**: The underlying AMQP error. A description of the errors can be found in the [AmqpErrorCondition][AmqpErrorCondition] javadocs or the OASIS AMQP 1.0 spec. +* **isTransient**: Whether or not trying to perform the same operation is possible. SDK clients apply the retry policy when the error is transient. +* **getErrorContext**: Information about where the AMQP error originated. + * [LinkErrorContext][LinkErrorContext]: Errors that occur in either the send/receive link. + * [SessionErrorContext][SessionErrorContext]: Errors that occur in the session. + * [AmqpErrorContext][AmqpErrorContext]: Errors that occur in the connection or a general AMQP error. + +### Commonly encountered exceptions + +#### amqp\:connection\:forced and amqp\:link\:detach-forced + +When the connection to Event Hubs is idle, the service will disconnect the client after some time. 
This is not a problem as the clients will re-establish a connection with the service. More information for users is in the [AMQP troubleshooting documentation][AmqpTroubleshooting]. + +## Permission issues + +An `EventHubError` with an [`AmqpErrorCondition`][AmqpErrorCondition] of "amqp:unauthorized-access" means that the customer's credentials do not allow for them to perform the action (receiving or sending) with Event Hubs. + +* [Double check you have the correct connection string][GetConnectionString] +* [Ensure your SAS token is generated correctly][AuthorizeSAS] + +[Troubleshoot authentication and authorization issues with Event Hubs][troubleshoot_authentication_authorization] lists other possible solutions. + +## Connectivity issues + +### Timeout when connecting to service + +* Verify that the connection string or fully qualified domain name specified when creating the client is correct. [Get an Event Hubs connection string][GetConnectionString] demonstrates how to acquire a connection string. +* Check the firewall and port permissions in your hosting environment and that the AMQP ports 5671 and 5672 are open. + * Make sure that the endpoint is allowed through the firewall. +* Try using WebSockets, which connect on port 443. See the [configure web sockets][PublishEventsWithWebSocketsAndProxy] sample. +* See if your network is blocking specific IP addresses. + * [What IP addresses do I need to allow?][EventHubsIPAddresses] +* If applicable, check the proxy configuration. See the [configure proxy][PublishEventsWithWebSocketsAndProxy] sample. +* More information about troubleshooting network connectivity is at [Event Hubs troubleshooting][EventHubsTroubleshooting]. + +### SSL handshake failures + +This error can occur when an intercepting proxy is used and the proxy is not configured correctly. We recommend testing in your hosting environment with the proxy disabled to verify. 
+ +### Socket exhaustion errors + +Applications should prefer treating the Event Hubs clients as a singleton, creating and using a single instance through the lifetime of their application. This is important as each client type manages its connection; creating a new Event Hub client results in a new AMQP connection, which uses a socket. Additionally, it is essential to be aware that clients inherit from `java.io.Closeable`, so your application is responsible for calling `close()` when it is finished using a client. + +To use the same AMQP connection when creating multiple clients, you can use the `EventHubClientBuilder.shareConnection()` flag, hold a reference to that `EventHubClientBuilder`, and create new clients from that same builder instance. + +### Connect using an IoT connection string + +Because translating a connection string requires querying the IoT Hub service, the Event Hubs client library cannot use it directly. The [IoTConnectionString.java][IoTConnectionString] sample describes how to query IoT Hub to translate an IoT connection string into one that can be used with Event Hubs. + +Further reading: +* [Control access to IoT Hub using Shared Access Signatures][IoTHubSAS] +* [Read device-to-cloud messages from the built-in endpoint][IoTEventHubEndpoint] + +### Cannot add components to the connection string + +The legacy Event Hub clients allowed customers to add components to the connection string retrieved from the portal. The legacy clients are in packages [com.microsoft.azure:azure-eventhubs][MavenAzureEventHubs] and [com.microsoft.azure:azure-eventhubs-eph][MavenAzureEventHubsEPH]. + +#### Adding "TransportType=AmqpWebSockets" + +The previous generation of the Event Hubs client library supported extending connection strings using special tokens for certain scenarios. The current generation supports connection strings only in the form published by the Azure portal. 
To request using the `AmqpWebSockets` transport, it would be specified when building the client. See [PublishEventsWithSocketsAndProxy.java][PublishEventsWithWebSocketsAndProxy] for more details. + +#### Adding "Authentication=Managed Identity" + +The legacy clients allowed customers to modify the connection string to enable capabilities. In this case, connect to Event Hubs using managed identity rather than a connection string. To achieve the same scenario, see [PublishEventsWithAzureIdentity.java][PublishEventsWithAzureIdentity]. + +For more information about our identity library, check out our [Authentication and the Azure SDK][AuthenticationAndTheAzureSDK] blog post. + +## Enable and configure logging + +Azure SDK for Java offers a consistent logging story to help troubleshoot application errors and expedite their resolution. The logs produced will capture the flow of an application before reaching the terminal state to help locate the root issue. View the [logging][Logging] wiki for guidance about enabling logging. + +In addition to enabling logging, setting the log level to `VERBOSE` or `DEBUG` provides insights into the library's state. Below are sample log4j2 and logback configurations to reduce the excessive +messages when verbose logging is enabled. + +### Configuring Log4J 2 + +1. Add the dependencies in your pom.xml using ones from the [logging sample pom.xml][LoggingPom] under the "Dependencies required for Log4j2" section. +2.Add [log4j2.xml][log4j2] to your `src/main/resources`. + +### Configuring logback + +1. Add the dependencies in your pom.xml using ones from the [logging sample pom.xml][LoggingPom] under the "Dependencies required for logback" section. +2. Add [logback.xml][logback] to your `src/main/resources`. + +### Enable AMQP transport logging + +If enabling client logging is not enough to diagnose your issues. You can enable logging to a file in the underlying +AMQP library, [Qpid Proton-J][qpid_proton_j_apache]. 
Qpid Proton-J uses `java.util.logging`. You can enable logging by +creating a configuration file with the contents below. Or set `proton.trace.level=ALL` and whichever configuration options +you want for the `java.util.logging.Handler` implementation. The implementation classes and their options can be found in +[Java 8 SDK javadoc][java_8_sdk_javadocs]. + +To trace the AMQP transport frames, set the environment variable: `PN_TRACE_FRM=1`. + +#### Sample "logging.properties" file + +The configuration file below logs TRACE level output from proton-j to the file "proton-trace.log". + +``` +handlers=java.util.logging.FileHandler +.level=OFF +proton.trace.level=ALL +java.util.logging.FileHandler.level=ALL +java.util.logging.FileHandler.pattern=proton-trace.log +java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter +java.util.logging.SimpleFormatter.format=[%1$tF %1$tr] %3$s %4$s: %5$s %n +``` + +### Reduce logging + +One way to decrease logging is to change the verbosity. Another is to add filters that exclude logs from logger names packages like `com.azure.messaging.eventhubs` or `com.azure.core.amqp`. Examples of this can be found in the XML files in [Configuring Log4J 2](#configuring-log4j-2) and [Configure logback](#configuring-logback). + +When submitting a bug, log messages from classes in the following packages are interesting: + +* `com.azure.core.amqp.implementation` +* `com.azure.core.amqp.implementation.handler` + * The exception is that the onDelivery message in ReceiveLinkHandler can be ignored. +* `com.azure.messaging.eventhubs.implementation` + +## Troubleshoot EventProducerAsyncClient/EventProducerClient issues + +### Cannot set multiple partition keys for events in EventDataBatch + +When publishing messages, the Event Hubs service supports a single partition key for each EventDataBatch. Customers can consider using the buffered producer client `EventHubBufferedProducerClient` if they want that capability. 
Otherwise, they'll have to manage their batches. + +### Setting partition key on EventData is not set in Kafka consumer + +The partition key of the EventHubs event is available in the Kafka record headers, the protocol specific key being "x-opt-partition-key" in the header. + +By design, we don't promote the Kafka message key to be the Event Hubs partition key nor the reverse because with the same value, the Kafka client and the Event Hub client likely send the message to two different partitions. It might cause some confusion if we set the value in the cross-protocol communication case. Exposing the properties with a protocol specific key to the other protocol client should be good enough. + +## Troubleshoot EventProcessorClient issues + +### 412 precondition failures when checkpointing + +412 precondition errors occur when the client tries to take or renew ownership of a partition, but the local version of the checkpoint is outdated. This occurs when another processor instance steals partition ownership. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for more information. + +### Partition ownership changes a lot + +When the number of EventProcessorClient instances changes (i.e. added or removed), the running instances try to load-balance partitions between themselves. The default balancing is greedy, so an EventProcessorClient will take as many partitions at once to reach a balanced state. As additional nodes are added, they may steal these partitions to balance themselves out. If this is not the case, a GitHub issue with logs and a repro should be filed. + +### "...current receiver 'nil' with epoch '0' is getting disconnected" + +The entire error message looks something like this: + +> New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' +> is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. 
+> TrackingId:, SystemTracker::eventhub:|, +> Timestamp:2022-01-01T12:00:00}"} + +This error is expected when load balancing occurs after EventProcessorClient instances are added or removed. Load balancing is an ongoing process. When using the BlobCheckpointStore with your consumer, every ~30 seconds (by default), the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer. The service side mechanism used to 'steal' partitions is [Epoch][Epoch]. + +However, if no instances are being added or removed, there is an underlying issue that should be addressed. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for additional information. + +### High CPU usage + +High CPU usage is usually because an instance owns too many partitions. We recommend no more than three partitions for every 1 CPU core; better to start with 1.5 partitions for each CPU core and test increasing the number of partitions owned. + +### Processor client stops receiving + +Customers often run the processor client for days on end. Sometimes, they notice that EventProcessorClient is not processing one or more partitions. Usually, this is not enough information to determine why the exception occurred. The EventProcessorClient stopping is the symptom of an underlying cause (i.e. race condition) that occurred while trying to recover from a transient error. For the team to determine the reason, we need to ask for the following details: + +* Event Hub environment + * How many partitions? +* EventProcessorClient environment + * What is the machine(s) specs processing your Event Hub? + * How many instances are running? + * What is the max heap set (i.e. -Xmx)? +* What is the average size of each EventData? +* What is the traffic pattern like in your Event Hub? (i.e. # messages/minute and if the EventProcessorClient is always busy or there are slow traffic periods.) 
+* Repro code and steps + * This is important as we often cannot reproduce the issue in our environment. +* Logs. We need DEBUG logs, but if that is not possible, INFO at least. Error and warning level logs do not provide enough information. The period of at least +/- 10 minutes from when the issue occurred. + +### Migrate from legacy to new client library + +The [migration guide][MigrationGuide] includes steps on migrating from the legacy client and migrating legacy checkpoints. + +## Get additional help + +Additional information on ways to reach out for support can be found in the [SUPPORT.md][SUPPORT] at the repo's root. + + +[IoTConnectionString]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/IoTHubConnectionSample.java +[log4j2]: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/eventhubs/azure-messaging-eventhubs/docs/log4j2.xml +[logback]: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/eventhubs/azure-messaging-eventhubs/docs/logback.xml +[LoggingPom]: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/eventhubs/azure-messaging-eventhubs/docs/pom.xml +[MigrationGuide]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/migration-guide.md +[PublishEventsToSpecificPartition]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/PublishEventsToSpecificPartition.java +[PublishEventsWithAzureIdentity]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/PublishEventsWithAzureIdentity.java +[PublishEventsWithWebSocketsAndProxy]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/PublishEventsWithWebSocketsAndProxy.java +[SUPPORT]: 
https://github.com/Azure/azure-sdk-for-java/blob/main/SUPPORT.md + + +[AmqpErrorCondition]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.amqperrorcondition +[AmqpErrorContext]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.amqperrorcontext +[AmqpException]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.amqpexception +[SessionErrorContext]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.sessionerrorcontext +[LinkErrorContext]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.linkerrorcontext + +[AmqpTroubleshooting]: https://docs.microsoft.com/azure/service-bus-messaging/service-bus-amqp-troubleshoot +[AuthorizeSAS]: https://docs.microsoft.com/azure/event-hubs/authorize-access-shared-access-signature +[Epoch]: https://docs.microsoft.com/azure/event-hubs/event-hubs-event-processor-host#epoch +[EventHubsIPAddresses]: https://docs.microsoft.com/azure/event-hubs/troubleshooting-guide#what-ip-addresses-do-i-need-to-allow +[EventHubsMessagingExceptions]: https://docs.microsoft.com/azure/event-hubs/event-hubs-messaging-exceptions +[EventHubsTroubleshooting]: https://docs.microsoft.com/azure/event-hubs/troubleshooting-guide +[GetConnectionString]: https://docs.microsoft.com/azure/event-hubs/event-hubs-get-connection-string +[IoTEventHubEndpoint]: https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-messages-read-builtin +[IoTHubSAS]: https://docs.microsoft.com/azure/iot-hub/iot-hub-dev-guide-sas#security-tokens +[Logging]: https://docs.microsoft.com/azure/developer/java/sdk/logging-overview +[troubleshoot_authentication_authorization]: https://docs.microsoft.com/azure/event-hubs/troubleshoot-authentication-authorization + + +[AuthenticationAndTheAzureSDK]: https://devblogs.microsoft.com/azure-sdk/authentication-and-the-azure-sdk +[MavenAzureEventHubs]: https://search.maven.org/artifact/com.microsoft.azure/azure-eventhubs/ +[MavenAzureEventHubsEPH]: 
https://search.maven.org/artifact/com.microsoft.azure/azure-eventhubs-eph +[java_8_sdk_javadocs]: https://docs.oracle.com/javase/8/docs/api/java/util/logging/package-summary.html +[qpid_proton_j_apache]: https://qpid.apache.org/proton/ \ No newline at end of file From 120e388e2af06b39200eac94564353448d56a54e Mon Sep 17 00:00:00 2001 From: Kashif Khan Date: Fri, 3 Jun 2022 09:21:47 -0500 Subject: [PATCH 2/7] trouble shooting guide for python --- .../azure-eventhub/TROUBLESHOOTING.md | 185 +++++++----------- 1 file changed, 74 insertions(+), 111 deletions(-) diff --git a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md index a67081437c36..6a9d7dd38472 100644 --- a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md +++ b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md @@ -1,6 +1,6 @@ # Troubleshoot Event Hubs issues -This troubleshooting guide covers failure investigation techniques, common errors for the credential types in the Azure Identity Java client library, and mitigation steps to resolve these errors. +This troubleshooting guide covers failure investigation techniques, common errors for the credential types in the Azure Event Hubs Python client library, and mitigation steps to resolve these errors. 
- [Handle Event Hubs exceptions](#handle-event-hubs-exceptions) - [Find relevant information in exception messages](#find-relevant-information-in-exception-messages) @@ -21,46 +21,39 @@ This troubleshooting guide covers failure investigation techniques, common error - [Cannot set multiple partition keys for events in EventDataBatch](#cannot-set-multiple-partition-keys-for-events-in-eventdatabatch) - [Setting partition key on EventData is not set in Kafka consumer](#setting-partition-key-on-eventdata-is-not-set-in-kafka-consumer) - [Troubleshoot EventProcessorClient issues](#troubleshoot-eventprocessorclient-issues) - - [412 precondition failures when checkpointing](#412-precondition-failures-when-checkpointing) - - [Partition ownership changes a lot](#partition-ownership-changes-a-lot) - - ["...current receiver 'nil' with epoch '0' is getting disconnected"](#current-receiver-nil-with-epoch-0-is-getting-disconnected) + - [412 precondition failures when using an event processor](#412-precondition-failures-when-using-an-event-processor) + - [Partition ownership changes frequently](#partition-ownership-changes-frequently) + - ["...current receiver '<RECEIVER_NAME>' with epoch '0' is getting disconnected"](#current-receiver-receiver_name-with-epoch-0-is-getting-disconnected) - [High CPU usage](#high-cpu-usage) - [Processor client stops receiving](#processor-client-stops-receiving) - [Migrate from legacy to new client library](#migrate-from-legacy-to-new-client-library) - [Get additional help](#get-additional-help) + - [Filing GitHub issues](#filing-github-issues) ## Handle Event Hubs exceptions -The Event Hubs APIs generate the following exceptions in azure.eventhub.exceptions +All Event Hubs exceptions are wrapped in an [EventHubError][EventHubError]. They often have an underlying AMQP error code which specifies whether an error should be retried. For retryable errors (e.g., 
`amqp:connection:forced` or `amqp:link:detach-forced`), the client libraries will attempt to recover from these errors based on the [retry options][AmqpRetryOptions] specified when instantiating the client. To configure retry options, follow the [Client Creation][ClientCreation] sample. If the error is non-retryable, there is a configuration issue that needs to be resolved. -- **AuthenticationError:** Failed to authenticate because of wrong address, SAS policy/key pair, SAS token or azure identity. -- **ConnectError:** Failed to connect to the EventHubs. The AuthenticationError is a type of ConnectError. -- **ConnectionLostError:** Lose connection after a connection has been built. -- **EventDataError:** The EventData to be sent fails data validation. For instance, this error is raised if you try to send an EventData that is already sent. -- **EventDataSendError:** The Eventhubs service responds with an error when an EventData is sent. -- **OperationTimeoutError:** EventHubConsumer.send() times out. -- **EventHubError:** All other Eventhubs related errors. It is also the root error class of all the errors described above. +The recommended way to resolve the specific error that an AMQP exception represents is to follow the +[Event Hubs Messaging Exceptions][EventHubsMessagingExceptions] guidance. ### Find relevant information in exception messages -An [AmqpException][AmqpException] contains three fields which describe the error. - -* **getErrorCondition**: The underlying AMQP error. A description of the errors can be found in the [AmqpErrorCondition][AmqpErrorCondition] javadocs or the OASIS AMQP 1.0 spec. -* **isTransient**: Whether or not trying to perform the same operation is possible. SDK clients apply the retry policy when the error is transient. -* **getErrorContext**: Information about where the AMQP error originated. - * [LinkErrorContext][LinkErrorContext]: Errors that occur in either the send/receive link. 
- * [SessionErrorContext][SessionErrorContext]: Errors that occur in the session. - * [AmqpErrorContext][AmqpErrorContext]: Errors that occur in the connection or a general AMQP error. +An [EventHubError][EventHubError] contains three fields which describe the error. +* **message**: The underlying AMQP error message. A description of the errors can be found in the [Exceptions module][ExceptionModule] or the [OASIS AMQP 1.0 spec][AmqpSpec]. +* **error**: The error condition, if available. +* **details**: The error details, if included in the service response. + ### Commonly encountered exceptions -#### amqp\:connection\:forced and amqp\:link\:detach-forced +#### `amqp:connection:forced` and `amqp:link:detach-forced` -When the connection to Event Hubs is idle, the service will disconnect the client after some time. This is not a problem as the clients will re-establish a connection with the service. More information for users is in the [AMQP troubleshooting documentation][AmqpTroubleshooting]. +When the connection to Event Hubs is idle, the service will disconnect the client after some time. This is not a problem as the clients will re-establish a connection when a service operation is requested. More information can be found in the [AMQP troubleshooting documentation][AmqpTroubleshooting]. ## Permission issues -An `EventHubError` with an [`AmqpErrorCondition`][AmqpErrorCondition] of "amqp:unauthorized-access" means that the customer's credentials do not allow for them to perform the action (receiving or sending) with Event Hubs. +An `AuthenticationError` means that the provided credentials do not permit the requested action (receiving or sending) with Event Hubs. 
* [Double check you have the correct connection string][GetConnectionString] * [Ensure your SAS token is generated correctly][AuthorizeSAS] @@ -82,17 +75,16 @@ An `EventHubError` with an [`AmqpErrorCondition`][AmqpErrorCondition] of "amqp:u ### SSL handshake failures -This error can occur when an intercepting proxy is used and the proxy is not configured correctly. We recommend testing in your hosting environment with the proxy disabled to verify. +This error can occur when an intercepting proxy is used. We recommend testing in your hosting environment with the proxy disabled to verify. ### Socket exhaustion errors -Applications should prefer treating the Event Hubs clients as a singleton, creating and using a single instance through the lifetime of their application. This is important as each client type manages its connection; creating a new Event Hub client results in a new AMQP connection, which uses a socket. Additionally, it is essential to be aware that clients inherit from `java.io.Closeable`, so your application is responsible for calling `close()` when it is finished using a client. +Applications should prefer treating the Event Hubs clients as a singleton, creating and using a single instance through the lifetime of their application. This is important as each client type manages its own connection; creating a new Event Hub client results in a new AMQP connection, which uses a socket. Additionally, your application is responsible for calling `close()` when it is finished using a client, or for using the `with` statement so that the client is closed automatically when execution leaves the block. -To use the same AMQP connection when creating multiple clients, you can use the `EventHubClientBuilder.shareConnection()` flag, hold a reference to that `EventHubClientBuilder`, and create new clients from that same builder instance. 
### Connect using an IoT connection string -Because translating a connection string requires querying the IoT Hub service, the Event Hubs client library cannot use it directly. The [IoTConnectionString.java][IoTConnectionString] sample describes how to query IoT Hub to translate an IoT connection string into one that can be used with Event Hubs. +Because translating an IoT connection string requires querying the IoT Hub service, the Event Hubs client library cannot use it directly. The [IoT Hub Connection String Sample][IoTConnectionString] describes how to query IoT Hub to translate an IoT connection string into one that can be used with Event Hubs. Further reading: * [Control access to IoT Hub using Shared Access Signatures][IoTHubSAS] @@ -100,93 +92,70 @@ Further reading: ### Cannot add components to the connection string -The legacy Event Hub clients allowed customers to add components to the connection string retrieved from the portal. The legacy clients are in packages [com.microsoft.azure:azure-eventhubs][MavenAzureEventHubs] and [com.microsoft.azure:azure-eventhubs-eph][MavenAzureEventHubsEPH]. +The legacy Event Hub clients allowed customers to add components to the connection string retrieved from the portal. The legacy clients are in packages [com.microsoft.azure:azure-eventhubs][MavenAzureEventHubs] and [com.microsoft.azure:azure-eventhubs-eph][MavenAzureEventHubsEPH]. The current generation supports connection strings only in the form published by the Azure portal. #### Adding "TransportType=AmqpWebSockets" -The previous generation of the Event Hubs client library supported extending connection strings using special tokens for certain scenarios. The current generation supports connection strings only in the form published by the Azure portal. To request using the `AmqpWebSockets` transport, it would be specified when building the client. See [PublishEventsWithSocketsAndProxy.java][PublishEventsWithWebSocketsAndProxy] for more details. 
+To use web sockets, pass in a kwarg `transport_type = TransportType.AmqpOverWebsocket` during client creation . #### Adding "Authentication=Managed Identity" -The legacy clients allowed customers to modify the connection string to enable capabilities. In this case, connect to Event Hubs using managed identity rather than a connection string. To achieve the same scenario, see [PublishEventsWithAzureIdentity.java][PublishEventsWithAzureIdentity]. +To authenticate with Managed Identity, see the sample [client_identity_authentication.py][PublishEventsWithAzureIdentity]. -For more information about our identity library, check out our [Authentication and the Azure SDK][AuthenticationAndTheAzureSDK] blog post. +For more information about the `Azure.Identity` library, check out our [Authentication and the Azure SDK][AuthenticationAndTheAzureSDK] blog post. ## Enable and configure logging -Azure SDK for Java offers a consistent logging story to help troubleshoot application errors and expedite their resolution. The logs produced will capture the flow of an application before reaching the terminal state to help locate the root issue. View the [logging][Logging] wiki for guidance about enabling logging. - -In addition to enabling logging, setting the log level to `VERBOSE` or `DEBUG` provides insights into the library's state. Below are sample log4j2 and logback configurations to reduce the excessive -messages when verbose logging is enabled. - -### Configuring Log4J 2 - -1. Add the dependencies in your pom.xml using ones from the [logging sample pom.xml][LoggingPom] under the "Dependencies required for Log4j2" section. -2.Add [log4j2.xml][log4j2] to your `src/main/resources`. +The Azure SDK for Python offers a consistent logging story to help troubleshoot application errors and expedite their resolution. The logs produced will capture the flow of an application before reaching the terminal state to help locate the root issue. 
-### Configuring logback +This library uses the standard [Logging] library for logging. -1. Add the dependencies in your pom.xml using ones from the [logging sample pom.xml][LoggingPom] under the "Dependencies required for logback" section. -2. Add [logback.xml][logback] to your `src/main/resources`. +- Enable `azure.eventhub` logger to collect traces from the library. +- Enable `uamqp` logger to collect traces from the underlying uAMQP library. ### Enable AMQP transport logging -If enabling client logging is not enough to diagnose your issues. You can enable logging to a file in the underlying -AMQP library, [Qpid Proton-J][qpid_proton_j_apache]. Qpid Proton-J uses `java.util.logging`. You can enable logging by -creating a configuration file with the contents below. Or set `proton.trace.level=ALL` and whichever configuration options -you want for the `java.util.logging.Handler` implementation. The implementation classes and their options can be found in -[Java 8 SDK javadoc][java_8_sdk_javadocs]. - -To trace the AMQP transport frames, set the environment variable: `PN_TRACE_FRM=1`. - -#### Sample "logging.properties" file - -The configuration file below logs TRACE level output from proton-j to the file "proton-trace.log". - -``` -handlers=java.util.logging.FileHandler -.level=OFF -proton.trace.level=ALL -java.util.logging.FileHandler.level=ALL -java.util.logging.FileHandler.pattern=proton-trace.log -java.util.logging.FileHandler.formatter=java.util.logging.SimpleFormatter -java.util.logging.SimpleFormatter.format=[%1$tF %1$tr] %3$s %4$s: %5$s %n -``` +If enabling client logging is not enough to diagnose your issues, you can enable AMQP frame-level tracing by setting `logging_enable=True` when creating the client. ### Reduce logging -One way to decrease logging is to change the verbosity. Another is to add filters that exclude logs from logger names packages like `com.azure.messaging.eventhubs` or `com.azure.core.amqp`. 
Examples of this can be found in the XML files in [Configuring Log4J 2](#configuring-log4j-2) and [Configure logback](#configuring-logback). +There may be cases where you consider the `uamqp` logging to be too verbose. To suppress unnecessary logging, add the following snippet to the top of your code: -When submitting a bug, log messages from classes in the following packages are interesting: +```python +import logging -* `com.azure.core.amqp.implementation` -* `com.azure.core.amqp.implementation.handler` - * The exception is that the onDelivery message in ReceiveLinkHandler can be ignored. -* `com.azure.messaging.eventhubs.implementation` +# The logging levels below may need to be adjusted based on the logging that you want to suppress. +uamqp_logger = logging.getLogger('uamqp') +uamqp_logger.setLevel(logging.ERROR) + +# or even further fine-grained control, suppressing the warnings in uamqp.connection module +uamqp_connection_logger = logging.getLogger('uamqp.connection') +uamqp_connection_logger.setLevel(logging.ERROR) +``` ## Troubleshoot EventProducerAsyncClient/EventProducerClient issues ### Cannot set multiple partition keys for events in EventDataBatch -When publishing messages, the Event Hubs service supports a single partition key for each EventDataBatch. Customers can consider using the buffered producer client `EventHubBufferedProducerClient` if they want that capability. Otherwise, they'll have to manage their batches. +When publishing messages, the Event Hubs service supports a single partition key for each EventDataBatch. Customers can consider using the producer client in `buffered mode` if they want that capability. Otherwise, they'll have to manage their batches. ### Setting partition key on EventData is not set in Kafka consumer The partition key of the EventHubs event is available in the Kafka record headers, the protocol specific key being "x-opt-partition-key" in the header. 
-By design, we don't promote the Kafka message key to be the Event Hubs partition key nor the reverse because with the same value, the Kafka client and the Event Hub client likely send the message to two different partitions. It might cause some confusion if we set the value in the cross-protocol communication case. Exposing the properties with a protocol specific key to the other protocol client should be good enough. +By design, Event Hubs does not promote the Kafka message key to be the Event Hubs partition key nor the reverse because with the same value, the Kafka client and the Event Hub client likely send the message to two different partitions. It might cause some confusion if we set the value in the cross-protocol communication case. Exposing the properties with a protocol specific key to the other protocol client should be good enough. ## Troubleshoot EventProcessorClient issues -### 412 precondition failures when checkpointing +### 412 precondition failures when using an event processor -412 precondition errors occur when the client tries to take or renew ownership of a partition, but the local version of the checkpoint is outdated. This occurs when another processor instance steals partition ownership. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for more information. +412 precondition errors occur when the client tries to take or renew ownership of a partition, but the local version of the ownership record is outdated. This occurs when another processor instance steals partition ownership. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for more information. -### Partition ownership changes a lot +### Partition ownership changes frequently -When the number of EventProcessorClient instances changes (i.e. added or removed), the running instances try to load-balance partitions between themselves. 
The default balancing is greedy, so an EventProcessorClient will take as many partitions at once to reach a balanced state. As additional nodes are added, they may steal these partitions to balance themselves out. If this is not the case, a GitHub issue with logs and a repro should be filed. +When the number of EventProcessorClient instances changes (i.e. added or removed), the running instances try to load-balance partitions between themselves. For a few minutes after the number of processors changes, partitions are expected to change owners. Once balanced, partition ownership should be stable and change infrequently. If partition ownership is changing frequently when the number of processors is constant, this likely indicates a problem. It is recommended that a GitHub issue with logs and a repro be filed in this case. -### "...current receiver 'nil' with epoch '0' is getting disconnected" +### "...current receiver '' with epoch '0' is getting disconnected" The entire error message looks something like this: @@ -195,9 +164,9 @@ The entire error message looks something like this: > TrackingId:, SystemTracker::eventhub:|, > Timestamp:2022-01-01T12:00:00}"} -This error is expected when load balancing occurs after EventProcessorClient instances are added or removed. Load balancing is an ongoing process. When using the BlobCheckpointStore with your consumer, every ~30 seconds (by default), the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer. The service side mechanism used to 'steal' partitions is [Epoch][Epoch]. +This error is expected when load balancing occurs after EventProcessorClient instances are added or removed. Load balancing is an ongoing process. 
When using the BlobCheckpointStore with your consumer, every ~30 seconds (by default), the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer. The service mechanism used to assert exclusive ownership over a partition is known as the [Epoch][Epoch]. -However, if no instances are being added or removed, there is an underlying issue that should be addressed. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for additional information. +However, if no instances are being added or removed, there is an underlying issue that should be addressed. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for additional information and [Filing GitHub issues](#filing-github-issues). ### High CPU usage @@ -205,46 +174,43 @@ High CPU usage is usually because an instance owns too many partitions. We reco ### Processor client stops receiving -Customers often run the processor client for days on end. Sometimes, they notice that EventProcessorClient is not processing one or more partitions. Usually, this is not enough information to determine why the exception occurred. The EventProcessorClient stopping is the symptom of an underlying cause (i.e. race condition) that occurred while trying to recover from a transient error. For the team to determine the reason, we need to ask for the following details: +The processor client often is continually running in a host application for days on end. Sometimes, they notice that EventProcessorClient is not processing one or more partitions. Usually, this is not enough information to determine why the exception occurred. The EventProcessorClient stopping is the symptom of an underlying cause (i.e. race condition) that occurred while trying to recover from a transient error. Please see [Filing Github issues](#filing-github-issues) for the information we require. 
+ +### Migrate from legacy to new client library + +The [migration guide][MigrationGuide] includes steps on migrating from the legacy client and migrating legacy checkpoints. + +## Get additional help + +Additional information on ways to reach out for support can be found in the [SUPPORT.md][SUPPORT] at the repo's root. + +### Filing GitHub issues + +When filing GitHub issues, the following details are requested: * Event Hub environment * How many partitions? * EventProcessorClient environment * What is the machine(s) specs processing your Event Hub? * How many instances are running? - * What is the max heap set (i.e. -Xmx)? + * What is the max heap set? * What is the average size of each EventData? -* What is the traffic pattern like in your Event Hub? (i.e. # messages/minute and if the EventProcessorClient is always busy or there are slow traffic periods.) +* What is the traffic pattern like in your Event Hub? (i.e. # messages/minute and if the EventProcessorClient is always busy or has slow traffic periods.) * Repro code and steps * This is important as we often cannot reproduce the issue in our environment. * Logs. We need DEBUG logs, but if that is not possible, INFO at least. Error and warning level logs do not provide enough information. The period of at least +/- 10 minutes from when the issue occurred. -### Migrate from legacy to new client library - -The [migration guide][MigrationGuide] includes steps on migrating from the legacy client and migrating legacy checkpoints. - -## Get additional help - -Additional information on ways to reach out for support can be found in the [SUPPORT.md][SUPPORT] at the repo's root. 
- -[IoTConnectionString]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/IoTHubConnectionSample.java -[log4j2]: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/eventhubs/azure-messaging-eventhubs/docs/log4j2.xml -[logback]: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/eventhubs/azure-messaging-eventhubs/docs/logback.xml -[LoggingPom]: https://github.com/Azure/azure-sdk-for-java/tree/main/sdk/eventhubs/azure-messaging-eventhubs/docs/pom.xml -[MigrationGuide]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/migration-guide.md -[PublishEventsToSpecificPartition]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/PublishEventsToSpecificPartition.java -[PublishEventsWithAzureIdentity]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/PublishEventsWithAzureIdentity.java -[PublishEventsWithWebSocketsAndProxy]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/PublishEventsWithWebSocketsAndProxy.java -[SUPPORT]: https://github.com/Azure/azure-sdk-for-java/blob/main/SUPPORT.md +[IoTConnectionString]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/samples/async_samples/iot_hub_connection_string_receive_async.py +[MigrationGuide]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/migration_guide.md +[ClientCreation]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/samples/sync_samples/client_creation.py +[PublishEventsWithAzureIdentity]: 
https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/samples/sync_samples/client_identity_authentication.py +[PublishEventsWithWebSocketsAndProxy]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/ +[SUPPORT]: https://github.com/Azure/azure-sdk-for-python/blob/main/SUPPORT.md -[AmqpErrorCondition]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.amqperrorcondition -[AmqpErrorContext]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.amqperrorcontext -[AmqpException]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.amqpexception -[SessionErrorContext]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.sessionerrorcontext -[LinkErrorContext]: https://docs.microsoft.com/java/api/com.azure.core.amqp.exception.linkerrorcontext - +[ExceptionModule]: https://docs.microsoft.com/en-us/python/api/azure-eventhub/azure.eventhub.exceptions +[EventHubError]: https://docs.microsoft.com/en-us/python/api/azure-eventhub/azure.eventhub.exceptions.eventhuberror [AmqpTroubleshooting]: https://docs.microsoft.com/azure/service-bus-messaging/service-bus-amqp-troubleshoot [AuthorizeSAS]: https://docs.microsoft.com/azure/event-hubs/authorize-access-shared-access-signature [Epoch]: https://docs.microsoft.com/azure/event-hubs/event-hubs-event-processor-host#epoch @@ -252,14 +218,11 @@ Additional information on ways to reach out for support can be found in the [SUP [EventHubsMessagingExceptions]: https://docs.microsoft.com/azure/event-hubs/event-hubs-messaging-exceptions [EventHubsTroubleshooting]: https://docs.microsoft.com/azure/event-hubs/troubleshooting-guide [GetConnectionString]: https://docs.microsoft.com/azure/event-hubs/event-hubs-get-connection-string -[IoTEventHubEndpoint]: https://docs.microsoft.com/en-us/azure/iot-hub/iot-hub-devguide-messages-read-builtin +[IoTEventHubEndpoint]: 
https://docs.microsoft.com/azure/iot-hub/iot-hub-devguide-messages-read-builtin [IoTHubSAS]: https://docs.microsoft.com/azure/iot-hub/iot-hub-dev-guide-sas#security-tokens -[Logging]: https://docs.microsoft.com/azure/developer/java/sdk/logging-overview [troubleshoot_authentication_authorization]: https://docs.microsoft.com/azure/event-hubs/troubleshoot-authentication-authorization [AuthenticationAndTheAzureSDK]: https://devblogs.microsoft.com/azure-sdk/authentication-and-the-azure-sdk -[MavenAzureEventHubs]: https://search.maven.org/artifact/com.microsoft.azure/azure-eventhubs/ -[MavenAzureEventHubsEPH]: https://search.maven.org/artifact/com.microsoft.azure/azure-eventhubs-eph -[java_8_sdk_javadocs]: https://docs.oracle.com/javase/8/docs/api/java/util/logging/package-summary.html -[qpid_proton_j_apache]: https://qpid.apache.org/proton/ \ No newline at end of file +[AmqpSpec]: https://docs.oasis-open.org/amqp/core/v1.0/os/amqp-core-types-v1.0-os.html +[Logging]: https://docs.python.org/3/library/logging.html From b1544590cf7d23ade9a3ff9e47db1f6404104fcf Mon Sep 17 00:00:00 2001 From: Kashif Khan Date: Mon, 6 Jun 2022 08:39:54 -0500 Subject: [PATCH 3/7] review comments --- sdk/eventhub/azure-eventhub/README.md | 2 +- .../azure-eventhub/TROUBLESHOOTING.md | 21 +++++++------------ 2 files changed, 8 insertions(+), 15 deletions(-) diff --git a/sdk/eventhub/azure-eventhub/README.md b/sdk/eventhub/azure-eventhub/README.md index 0def1c06c922..539bfca5279f 100644 --- a/sdk/eventhub/azure-eventhub/README.md +++ b/sdk/eventhub/azure-eventhub/README.md @@ -395,7 +395,7 @@ Refer to [IoT Hub Connection String Sample](https://github.com/Azure/azure-sdk-f ## Troubleshooting -See the `azure-eventhubs` [troubleshooting guide](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/keyvault/TROUBLESHOOTING.md) for details on how to diagnose various failure scenarios +See the `azure-eventhubs` [troubleshooting 
guide](https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md) for details on how to diagnose various failure scenarios. ## Next steps diff --git a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md index 6a9d7dd38472..755a508adf0f 100644 --- a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md +++ b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md @@ -39,11 +39,11 @@ The recommended way to solve the specific exception the AMQP exception represent ### Find relevant information in exception messages -An [EventHubError][EventHubError] contains three fields which describe the error. +An [EventHubError][EventHubError] contains three fields which describe the error: * **message**: The underlying AMQP error message. A description of the errors can be found in the [Exceptions module][ExceptionModule] or the [OASIS AMQP 1.0 spec][AmqpSpec]. * **error**: The error condition if available. -* **details**: The error details, if included in the service response +* **details**: The error details, if included in the service response. ### Commonly encountered exceptions @@ -71,7 +71,7 @@ An `AuthenticationError` means that the provided credentials do not allow for th * See if your network is blocking specific IP addresses. * [What IP addresses do I need to allow?][EventHubsIPAddresses] * If applicable, check the proxy configuration. See [configure proxy][PublishEventsWithWebSocketsAndProxy] sample. -* For more information about troubleshooting network connectivity is at [Event Hubs troubleshooting][EventHubsTroubleshooting] +* For more information about troubleshooting network connectivity, refer to [Event Hubs troubleshooting][EventHubsTroubleshooting] ### SSL handshake failures @@ -81,7 +81,6 @@ This error can occur when an intercepting proxy is used. 
We recommend testing i Applications should prefer treating the Event Hubs clients as a singleton, creating and using a single instance through the lifetime of their application. This is important as each client type manages its connection; creating a new Event Hub client results in a new AMQP connection, which uses a socket. Additionally, it is essential to be aware that your client is responsible for calling `close()` when it is finished using a client or to use the `with statement` for clients so that they are automatically closed after the flow execution leaves that block. - ### Connect using an IoT connection string Because translating a connection string requires querying the IoT Hub service, the Event Hubs client library cannot use it directly. The [IoT Hub Connection String Sample][IoTConnectionString] sample describes how to query IoT Hub to translate an IoT connection string into one that can be used with Event Hubs. @@ -145,7 +144,7 @@ The partition key of the EventHubs event is available in the Kafka record header By design, Event Hubs does not promote the Kafka message key to be the Event Hubs partition key nor the reverse because with the same value, the Kafka client and the Event Hub client likely send the message to two different partitions. It might cause some confusion if we set the value in the cross-protocol communication case. Exposing the properties with a protocol specific key to the other protocol client should be good enough. -## Troubleshoot EventProcessorClient issues +## Troubleshoot EventHubConsumerClient issues ### 412 precondition failures when using an event processor @@ -174,7 +173,7 @@ High CPU usage is usually because an instance owns too many partitions. We reco ### Processor client stops receiving -The processor client often is continually running in a host application for days on end. Sometimes, they notice that EventProcessorClient is not processing one or more partitions. 
Usually, this is not enough information to determine why the exception occurred. The EventProcessorClient stopping is the symptom of an underlying cause (i.e. race condition) that occurred while trying to recover from a transient error. Please see [Filing Github issues](#filing-github-issues) for the information we require. +The processor client often is continually running in a host application for days on end. Sometimes, they notice that EventHubConsumerClient is not processing one or more partitions. Usually, this is not enough information to determine why the exception occurred. The EventHubConsumerClient stopping is the symptom of an underlying cause (i.e. race condition) that occurred while trying to recover from a transient error. Please see [Filing Github issues](#filing-github-issues) for the information we require. ### Migrate from legacy to new client library @@ -193,7 +192,6 @@ When filing GitHub issues, the following details are requested: * EventProcessorClient environment * What is the machine(s) specs processing your Event Hub? * How many instances are running? - * What is the max heap set? * What is the average size of each EventData? * What is the traffic pattern like in your Event Hub? (i.e. # messages/minute and if the EventProcessorClient is always busy or has slow traffic periods.) 
* Repro code and steps @@ -202,15 +200,10 @@ When filing GitHub issues, the following details are requested: [IoTConnectionString]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/samples/async_samples/iot_hub_connection_string_receive_async.py -[MigrationGuide]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/migration_guide.md -[ClientCreation]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/samples/sync_samples/client_creation.py -[PublishEventsWithAzureIdentity]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/samples/sync_samples/client_identity_authentication.py -[PublishEventsWithWebSocketsAndProxy]: https://github.com/Azure/azure-sdk-for-java/blob/main/sdk/eventhubs/azure-messaging-eventhubs/src/samples/java/com/azure/messaging/eventhubs/ -[SUPPORT]: https://github.com/Azure/azure-sdk-for-python/blob/main/SUPPORT.md -[ExceptionModule]: https://docs.microsoft.com/en-us/python/api/azure-eventhub/azure.eventhub.exceptions -[EventHubError]: https://docs.microsoft.com/en-us/python/api/azure-eventhub/azure.eventhub.exceptions.eventhuberror +[ExceptionModule]: https://docs.microsoft.com/python/api/azure-eventhub/azure.eventhub.exceptions +[EventHubError]: https://docs.microsoft.com/python/api/azure-eventhub/azure.eventhub.exceptions.eventhuberror [AmqpTroubleshooting]: https://docs.microsoft.com/azure/service-bus-messaging/service-bus-amqp-troubleshoot [AuthorizeSAS]: https://docs.microsoft.com/azure/event-hubs/authorize-access-shared-access-signature [Epoch]: https://docs.microsoft.com/azure/event-hubs/event-hubs-event-processor-host#epoch From afe41a5d0e33fc98274e64cc038f2095b559a8cc Mon Sep 17 00:00:00 2001 From: Kashif Khan Date: Mon, 6 Jun 2022 08:53:04 -0500 Subject: [PATCH 4/7] more fixes --- sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff 
--git a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md index 755a508adf0f..a50ac211f598 100644 --- a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md +++ b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md @@ -20,7 +20,7 @@ This troubleshooting guide covers failure investigation techniques, common error - [Troubleshoot EventProducerAsyncClient/EventProducerClient issues](#troubleshoot-eventproducerasyncclienteventproducerclient-issues) - [Cannot set multiple partition keys for events in EventDataBatch](#cannot-set-multiple-partition-keys-for-events-in-eventdatabatch) - [Setting partition key on EventData is not set in Kafka consumer](#setting-partition-key-on-eventdata-is-not-set-in-kafka-consumer) -- [Troubleshoot EventProcessorClient issues](#troubleshoot-eventprocessorclient-issues) +- [Troubleshoot EventHubConsumerClient issues](#troubleshoot-eventhubconsumerclient-issues) - [412 precondition failures when using an event processor](#412-precondition-failures-when-using-an-event-processor) - [Partition ownership changes frequently](#partition-ownership-changes-frequently) - ["...current receiver '' with epoch '0' is getting disconnected"](#current-receiver-receiver_name-with-epoch-0-is-getting-disconnected) @@ -95,7 +95,7 @@ The legacy Event Hub clients allowed customers to add components to the connecti #### Adding "TransportType=AmqpWebSockets" -To use web sockets, pass in a kwarg `transport_type = TransportType.AmqpOverWebsocket` during client creation . +To use web sockets, pass in a kwarg `transport_type = TransportType.AmqpOverWebsocket` during client creation. 
#### Adding "Authentication=Managed Identity" @@ -132,7 +132,7 @@ uamqp_connection_logger = logging.getLogger('uamqp.connection') uamqp_connection_logger.setLevel(logging.ERROR) ``` -## Troubleshoot EventProducerAsyncClient/EventProducerClient issues +## Troubleshoot EventHubProducerClient (Sync/Async) issues ### Cannot set multiple partition keys for events in EventDataBatch @@ -152,7 +152,7 @@ By design, Event Hubs does not promote the Kafka message key to be the Event Hub ### Partition ownership changes frequently -When the number of EventProcessorClient instances changes (i.e. added or removed), the running instances try to load-balance partitions between themselves. For a few minutes after the number of processors changes, partitions are expected to change owners. Once balanced, partition ownership should be stable and change infrequently. If partition ownership is changing frequently when the number of processors is constant, this likely indicates a problem. It is recommended that a GitHub issue with logs and a repro be filed in this case. +When the number of EventHubConsumerClient instances changes (i.e. added or removed), the running instances try to load-balance partitions between themselves. For a few minutes after the number of processors changes, partitions are expected to change owners. Once balanced, partition ownership should be stable and change infrequently. If partition ownership is changing frequently when the number of processors is constant, this likely indicates a problem. It is recommended that a GitHub issue with logs and a repro be filed in this case. ### "...current receiver '' with epoch '0' is getting disconnected" @@ -163,7 +163,7 @@ The entire error message looks something like this: > TrackingId:, SystemTracker::eventhub:|, > Timestamp:2022-01-01T12:00:00}"} -This error is expected when load balancing occurs after EventProcessorClient instances are added or removed. Load balancing is an ongoing process. 
When using the BlobCheckpointStore with your consumer, every ~30 seconds (by default), the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer. The service mechanism used to assert exclusive ownership over a partition is known as the [Epoch][Epoch]. +This error is expected when load balancing occurs after EventHubConsumerClient instances are added or removed. Load balancing is an ongoing process. When using the BlobCheckpointStore with your consumer, every ~30 seconds (by default), the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer. The service mechanism used to assert exclusive ownership over a partition is known as the [Epoch][Epoch]. However, if no instances are being added or removed, there is an underlying issue that should be addressed. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for additional information and [Filing GitHub issues](#filing-github-issues). @@ -189,11 +189,11 @@ When filing GitHub issues, the following details are requested: * Event Hub environment * How many partitions? -* EventProcessorClient environment +* EventHubConsumerClient environment * What is the machine(s) specs processing your Event Hub? * How many instances are running? * What is the average size of each EventData? -* What is the traffic pattern like in your Event Hub? (i.e. # messages/minute and if the EventProcessorClient is always busy or has slow traffic periods.) +* What is the traffic pattern like in your Event Hub? (i.e. # messages/minute and if the EventHubConsumerClient is always busy or has slow traffic periods.) * Repro code and steps * This is important as we often cannot reproduce the issue in our environment. * Logs. We need DEBUG logs, but if that is not possible, INFO at least. 
Error and warning level logs do not provide enough information. The period of at least +/- 10 minutes from when the issue occurred. From e90534c8e06ba0727047a2be6eee0b32da893675 Mon Sep 17 00:00:00 2001 From: Kashif Khan Date: Mon, 6 Jun 2022 09:55:24 -0500 Subject: [PATCH 5/7] changes for error handling etc --- sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md index a50ac211f598..a2847e4cf21a 100644 --- a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md +++ b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md @@ -47,9 +47,9 @@ An [EventHubError][EventHubError] contains three fields which describe the error ### Commonly encountered exceptions -#### `amqp:connection:forced` and `amqp:link:detach-forced` +#### ConnectionLostError Exception -When the connection to Event Hubs is idle, the service will disconnect the client after some time. This is not a problem as the clients will re-establish a connection when a service operation is requested. More information can be found in the [AMQP troubleshooting documentation][AmqpTroubleshooting]. +When the connection to Event Hubs is idle, the service will disconnect the client after some time, and the client library raises a `ConnectionLostError` exception. The underlying AMQP error conditions are `amqp:connection:forced` and `amqp:link:detach-forced`. This is not a problem as the clients will re-establish a connection when a service operation is requested. More information can be found in the [AMQP troubleshooting documentation][AmqpTroubleshooting]. 
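As a hedged illustration of reading the `message`, `error`, and `details` fields mentioned above, the sketch below collects them from a caught exception. The helper name `describe_error` is hypothetical and not part of the library. It uses `getattr` with fallbacks so it also tolerates exceptions that lack these attributes.

```python
def describe_error(exc):
    """Return the descriptive fields of an EventHubError-like exception as a dict."""
    return {
        "type": type(exc).__name__,
        # Fall back to str(exc) for exceptions that carry no `message` attribute.
        "message": getattr(exc, "message", str(exc)),
        "error": getattr(exc, "error", None),
        "details": getattr(exc, "details", None),
    }


# Possible usage, assuming the azure-eventhub package is installed and
# `producer` and `batch` already exist in the calling code:
#
#     from azure.eventhub.exceptions import EventHubError
#     try:
#         producer.send_batch(batch)
#     except EventHubError as exc:
#         print(describe_error(exc))
```

Because `ConnectionLostError` is described as a kind of `EventHubError`, catching the base class as in the commented usage would cover it too.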
## Permission issues @@ -57,6 +57,7 @@ An `AuthenticationError` means that the provided credentials do not allow for th * [Double check you have the correct connection string][GetConnectionString] * [Ensure your SAS token is generated correctly][AuthorizeSAS] +* [Verify the correct RBAC roles were granted][RBACRoles] [Troubleshoot authentication and authorization issues with Event Hubs][troubleshoot_authentication_authorization] lists other possible solutions. @@ -148,7 +149,7 @@ By design, Event Hubs does not promote the Kafka message key to be the Event Hub ### 412 precondition failures when using an event processor -412 precondition errors occur when the client tries to take or renew ownership of a partition, but the local version of the ownership record is outdated. This occurs when another processor instance steals partition ownership. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for more information. +Logs reflect intermittent HTTP 412 and HTTP 409 responses from storage when the client tries to take or renew ownership of a partition, but the local version of the ownership record is outdated. This occurs when another processor instance steals partition ownership. See [Partition ownership changes a lot](#partition-ownership-changes-a-lot) for more information. 
### Partition ownership changes frequently @@ -206,6 +207,7 @@ When filing GitHub issues, the following details are requested: [EventHubError]: https://docs.microsoft.com/python/api/azure-eventhub/azure.eventhub.exceptions.eventhuberror [AmqpTroubleshooting]: https://docs.microsoft.com/azure/service-bus-messaging/service-bus-amqp-troubleshoot [AuthorizeSAS]: https://docs.microsoft.com/azure/event-hubs/authorize-access-shared-access-signature +[RBACRoles]: https://docs.microsoft.com/azure/event-hubs/troubleshoot-authentication-authorization [Epoch]: https://docs.microsoft.com/azure/event-hubs/event-hubs-event-processor-host#epoch [EventHubsIPAddresses]: https://docs.microsoft.com/azure/event-hubs/troubleshooting-guide#what-ip-addresses-do-i-need-to-allow [EventHubsMessagingExceptions]: https://docs.microsoft.com/azure/event-hubs/event-hubs-messaging-exceptions From 9602f66ca702c4d93ca414c5542993fae2b2ca96 Mon Sep 17 00:00:00 2001 From: Kashif Khan Date: Wed, 8 Jun 2022 09:32:59 -0500 Subject: [PATCH 6/7] update for retry policy --- sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md index a2847e4cf21a..340cd96b07a5 100644 --- a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md +++ b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md @@ -44,6 +44,13 @@ An [EventHubError][EventHubError] contains three fields which describe the error * **message**: The underlying AMQP error message. A description of the errors can be found in the [Exceptions module][ExceptionModule] or the [OASIS AMQP 1.0 spec][AmqpSpec]. * **error**: The error condition if available. * **details**: The error details, if included in the service response. + +By default the producer and consumer clients will retry for error conditions. 
We recommend that users of the clients use the following keyword arguments during creation of the client to change the retry behavior rather than retrying on their own: +* **retry_total**: The total number of attempts to redo a failed operation when an error occurs. Default + value is 3 +* **retry_backoff_factor**: A backoff factor to apply between attempts after the second try +* **retry_backoff_max**: The maximum back off time. Default value is 120 seconds +* **retry_mode**: The delay behavior between retry attempts. Supported values are 'fixed' or 'exponential', where the default is 'exponential' ### Commonly encountered exceptions @@ -59,7 +66,7 @@ An `AuthenticationError` means that the provided credentials do not allow for th * [Ensure your SAS token is generated correctly][AuthorizeSAS] * [Verify the correct RBAC roles were granted][RBACRoles] -[Troubleshoot authentication and authorization issues with Event Hubs][troubleshoot_authentication_authorization] lists other possible solutions. +[Troubleshoot authentication and authorization issues with Event Hubs][TroubleshootAuthenticationAuthorization] lists other possible solutions.
## Connectivity issues @@ -215,7 +222,7 @@ When filing GitHub issues, the following details are requested: [GetConnectionString]: https://docs.microsoft.com/azure/event-hubs/event-hubs-get-connection-string [IoTEventHubEndpoint]: https://docs.microsoft.com/azure/iot-hub/iot-hub-devguide-messages-read-builtin [IoTHubSAS]: https://docs.microsoft.com/azure/iot-hub/iot-hub-dev-guide-sas#security-tokens -[troubleshoot_authentication_authorization]: https://docs.microsoft.com/azure/event-hubs/troubleshoot-authentication-authorization +[TroubleshootAuthenticationAuthorization]: https://docs.microsoft.com/azure/event-hubs/troubleshoot-authentication-authorization [AuthenticationAndTheAzureSDK]: https://devblogs.microsoft.com/azure-sdk/authentication-and-the-azure-sdk From 6d8c7d1b24f5533901d5f1888cd65df7101a3eb0 Mon Sep 17 00:00:00 2001 From: Kashif Khan Date: Wed, 8 Jun 2022 09:44:43 -0500 Subject: [PATCH 7/7] migration guide changes --- sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md index 340cd96b07a5..20d913ff097e 100644 --- a/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md +++ b/sdk/eventhub/azure-eventhub/TROUBLESHOOTING.md @@ -97,10 +97,6 @@ Further reading: * [Control access to IoT Hub using Shared Access Signatures][IoTHubSAS] * [Read device-to-cloud messages from the built-in endpoint][IoTEventHubEndpoint] -### Cannot add components to the connection string - -The legacy Event Hub clients allowed customers to add components to the connection string retrieved from the portal. The legacy clients are in packages [com.microsoft.azure:azure-eventhubs][MavenAzureEventHubs] and [com.microsoft.azure:azure-eventhubs-eph][MavenAzureEventHubsEPH]. The current generation supports connection strings only in the form published by the Azure portal. 
- #### Adding "TransportType=AmqpWebSockets" To use web sockets, pass in a kwarg `transport_type = TransportType.AmqpOverWebsocket` during client creation. @@ -208,6 +204,7 @@ When filing GitHub issues, the following details are requested: [IoTConnectionString]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/samples/async_samples/iot_hub_connection_string_receive_async.py +[MigrationGuide]: https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/eventhub/azure-eventhub/migration_guide.md [ExceptionModule]: https://docs.microsoft.com/python/api/azure-eventhub/azure.eventhub.exceptions
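Reviewer note on the retry keyword arguments added in PATCH 6/7: the sketch below is one way a reader might preview how `retry_total`, `retry_backoff_factor`, `retry_backoff_max`, and `retry_mode` interact before passing them to a client. The helper `preview_backoff` is hypothetical and is not part of `azure-eventhub`. The delay model (a constant wait in 'fixed' mode, a doubling wait in 'exponential' mode, both capped at `retry_backoff_max`) is an assumption made for illustration, not the library's confirmed implementation, and the example values are arbitrary.

```python
from typing import List


def preview_backoff(retry_backoff_factor: float,
                    retry_total: int = 3,
                    retry_backoff_max: float = 120.0,
                    retry_mode: str = "exponential") -> List[float]:
    """Return a modeled delay, in seconds, for each retry attempt.

    Hypothetical helper. The model is an assumption for illustration:
    'fixed' waits retry_backoff_factor each time, 'exponential' doubles
    the wait on every attempt, and both are capped at retry_backoff_max.
    """
    if retry_mode not in ("fixed", "exponential"):
        raise ValueError("retry_mode must be 'fixed' or 'exponential'")
    delays = []
    for attempt in range(retry_total):
        if retry_mode == "fixed":
            delay = retry_backoff_factor
        else:
            delay = retry_backoff_factor * (2 ** attempt)
        # Never wait longer than the configured maximum.
        delays.append(min(retry_backoff_max, delay))
    return delays


# Keyword argument names come from the guide; the values are examples only.
retry_kwargs = {
    "retry_total": 5,
    "retry_backoff_factor": 1.0,
    "retry_backoff_max": 10.0,
    "retry_mode": "exponential",
}

print(preview_backoff(**retry_kwargs))
```

The same dictionary could then be unpacked at client creation, for example `EventHubProducerClient.from_connection_string(conn_str, eventhub_name=name, **retry_kwargs)`, where `conn_str` and `name` are placeholders. Keeping the settings in one dictionary makes it easy to apply identical retry behavior to both producer and consumer clients instead of wrapping calls in a hand-written retry loop.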