Add Prometheus scrape/pull source to prometheus plugin [Adding in the same Prometheus Remote write source] #6743

Open
srikanthpadakanti wants to merge 2 commits into opensearch-project:main from srikanthpadakanti:prometheus-scrape-source

Conversation

@srikanthpadakanti
Contributor

Description

Add Prometheus scrape/pull source to the existing prometheus plugin. The source periodically HTTP GETs target /metrics endpoints, parses the Prometheus text exposition format, and converts metrics into Data Prepper events (Gauge, Sum, Histogram, Summary). Supports static target URLs, configurable scrape interval/timeout, HTTP Basic Auth, Bearer Token, and TLS certificates. Both scrape and remote write modes can run simultaneously under the same plugin.

Three usage modes:

Mode 1: Remote write only (existing behavior, unchanged)
  source:
    prometheus:
      port: 9090

Mode 2: Scrape only
  source:
    prometheus:
      scrape:
        targets:
          - url: "http://host:9090/metrics"

Mode 3: Both simultaneously
  source:
    prometheus:
      port: 9090
      scrape:
        targets:
          - url: "http://host:9090/metrics"

Issues Resolved

Resolves #1997

Check List

  • [x] New functionality includes testing.
  • [ ] New functionality has a documentation issue. Please link to it in this PR.
    • [x] New functionality has javadoc added.
  • [x] Commits are signed with a real name per the DCO.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Member

@dlvenable dlvenable left a comment

Thanks @srikanthpadakanti .

At a high-level would a scraper be expected to get data from the same Prometheus nodes pushing? I'm wondering if it makes sense to have a separate source or not.

@JsonProperty("http_basic")
private HttpBasicConfig httpBasic;

@JsonProperty("bearer_token")
Member

Maybe this should be an oauth2 group with only bearer_token for consistency.

scrape:
  oauth2:
    bearer_token:

Contributor Author

Bearer token here is a static token passed as an HTTP Authorization header, not an OAuth2 flow. Nesting it under oauth2 would imply token endpoint and refresh support that we do not have. Kept it flat under authentication to match the actual behavior.

Happy to add an oauth2 group later when we support the full flow with token exchange.
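For illustration, a minimal sketch of what the static bearer token amounts to (the class and method names here are hypothetical, not from the PR):

```java
import java.util.Map;

// Hypothetical sketch: a static bearer token is only an Authorization
// header value; there is no token endpoint or refresh logic involved,
// which is why it is not modeled as an oauth2 group.
public class BearerTokenHeaderSketch {
    static Map<String, String> authHeader(final String token) {
        return Map.of("Authorization", "Bearer " + token);
    }

    public static void main(String[] args) {
        System.out.println(authHeader("my-static-token").get("Authorization"));
    }
}
```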

public class ScrapeTargetScraper {

private static final Logger LOG = LoggerFactory.getLogger(ScrapeTargetScraper.class);
private static final String ACCEPT_HEADER_VALUE = "text/plain;version=0.0.4";
Member

Does the version direct Prometheus to provide the scrape data in a certain format? Will it return Not Acceptable if it doesn't support the requested version?

Contributor Author

Yes. The version=0.0.4 in the Accept header identifies the Prometheus text exposition format per the spec. If the version parameter is missing, the server falls back to the most recent text format version, which is still 0.0.4.

Prometheus servers do not return 406 for this; they serve the default format instead. Content negotiation between text, OpenMetrics, and protobuf is handled via distinct content types (text/plain, application/openmetrics-text, application/vnd.google.protobuf). We can add OpenMetrics support as a follow-up.

*/
public String scrape(final String url) {
final URI uri = URI.create(url);
final String baseUri = uri.getScheme() + "://" + uri.getAuthority();
Member

Let's use Spring's UriComponentsBuilder instead of string manipulation. If Armeria has something like this it would be ideal to use that instead.

Contributor Author

Agreed on avoiding string manipulation. Will switch to java.net.URI constructor to extract the base URI cleanly rather than adding a Spring dependency. Armeria's WebClient already handles URI parsing internally so we just need the base extraction to be safe.
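A rough sketch of that base-URI extraction via the java.net.URI multi-argument constructor (class and method names are illustrative, not from the PR):

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch of base-URI extraction via java.net.URI rather
// than string concatenation of scheme and authority.
public class BaseUriSketch {
    static URI baseUri(final String url) {
        final URI uri = URI.create(url);
        try {
            // Keep only scheme and authority; drop path, query, fragment.
            return new URI(uri.getScheme(), uri.getAuthority(), null, null, null);
        } catch (final URISyntaxException e) {
            throw new IllegalArgumentException("Invalid scrape target URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(baseUri("http://host:9090/metrics?debug=1")); // http://host:9090
    }
}
```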

final HttpHeadersBuilder builder = HttpHeaders.builder()
.add(HttpHeaderNames.ACCEPT, ACCEPT_HEADER_VALUE);

if (config.getAuthentication() != null) {
Member

It would probably be ideal to have small classes and a common interface for updating the request based on the authentication. This could allow more complicated auth schemes like SigV4 in the future.

I had some similar comments on an http sink PR that is in progress currently. #6747 (comment). Though my comments went the other way. Have the code more amenable to basic auth instead of SigV4. 😄

Contributor Author

Extracted into a ScrapeRequestAuthenticator interface with BasicAuthenticator and BearerTokenAuthenticator as separate classes. Factory method handles selection. Adding SigV4 or other schemes in the future is just a new implementation without touching the scraper.
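Roughly, the abstraction described above could look like this. The class names mirror the comment; the apply(...) signature and the plain Map (instead of Armeria's HttpHeadersBuilder) are assumptions to keep the sketch self-contained:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Common interface: each auth scheme mutates the outgoing request headers.
interface ScrapeRequestAuthenticator {
    void apply(Map<String, String> headers);
}

class BasicAuthenticator implements ScrapeRequestAuthenticator {
    private final String username;
    private final String password;

    BasicAuthenticator(final String username, final String password) {
        this.username = username;
        this.password = password;
    }

    @Override
    public void apply(final Map<String, String> headers) {
        final String credentials = username + ":" + password;
        final String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        headers.put("Authorization", "Basic " + encoded);
    }
}

class BearerTokenAuthenticator implements ScrapeRequestAuthenticator {
    private final String token;

    BearerTokenAuthenticator(final String token) {
        this.token = token;
    }

    @Override
    public void apply(final Map<String, String> headers) {
        headers.put("Authorization", "Bearer " + token);
    }
}

public class AuthenticatorSketch {
    public static void main(String[] args) {
        final Map<String, String> headers = new HashMap<>();
        new BasicAuthenticator("user", "pass").apply(headers);
        System.out.println(headers.get("Authorization")); // Basic dXNlcjpwYXNz
    }
}
```

A future SigV4 implementation would be one more class implementing the interface, leaving the scraper untouched.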

import java.util.Map;
import java.util.TreeMap;

public class TextExpositionParser {
Member

What is the relation between this and RemoteWriteProtobufParser. Is there opportunity for shared code for parsing?

Contributor Author

They handle different input formats with no parsing overlap: the text parser handles the line-oriented exposition format, while the protobuf parser handles the Snappy-compressed binary payload.

Both already share the same output builders (JacksonGauge, JacksonSum, etc.) from data-prepper-api. No practical code to extract.
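To make the line-oriented shape concrete, here is a minimal, hypothetical sketch of parsing one sample line of the text exposition format; the real TextExpositionParser also handles # HELP / # TYPE lines, label escaping, timestamps, and histogram buckets:

```java
// Hypothetical sketch, not the PR's parser: extracts the metric name and
// value from a single sample line such as: name{labels} value
public class ExpositionLineSketch {
    static String parseSample(final String line) {
        final int brace = line.indexOf('{');
        final String name = brace >= 0 ? line.substring(0, brace)
                                       : line.substring(0, line.indexOf(' '));
        final double value = Double.parseDouble(line.substring(line.lastIndexOf(' ') + 1));
        return name + " = " + value;
    }

    public static void main(String[] args) {
        System.out.println(parseSample("http_requests_total{method=\"GET\"} 1027"));
        // http_requests_total = 1027.0
    }
}
```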

@JsonProperty("flatten_labels")
private boolean flattenLabels = false;

@JsonProperty("insecure")
Member

If any of the URLs in targets has a non-HTTPS scheme we should throw a validation error unless insecure is true.

You can add a JSR @AssertTrue to handle this in this class.

Contributor Author

Done.
Added @AssertTrue validation.
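The validation predicate might look roughly like this; in the real config class it would be a getter annotated with JSR-380 @AssertTrue, and the class and method names here are assumptions:

```java
import java.net.URI;
import java.util.List;

// Hypothetical sketch of the scheme check: every target must use https
// unless the user has explicitly set insecure: true.
public class ScrapeSchemeValidationSketch {
    static boolean isSchemeValid(final List<String> targetUrls, final boolean insecure) {
        if (insecure) {
            return true; // plain-HTTP targets explicitly allowed
        }
        return targetUrls.stream()
                .allMatch(url -> "https".equalsIgnoreCase(URI.create(url).getScheme()));
    }

    public static void main(String[] args) {
        System.out.println(isSchemeValid(List.of("http://host:9090/metrics"), false));  // false
        System.out.println(isSchemeValid(List.of("https://host:9090/metrics"), false)); // true
    }
}
```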

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>
@srikanthpadakanti srikanthpadakanti force-pushed the prometheus-scrape-source branch from d118aa1 to 713b2d2 Compare April 10, 2026 17:29
@srikanthpadakanti
Contributor Author

Thanks @srikanthpadakanti .

At a high-level would a scraper be expected to get data from the same Prometheus nodes pushing? I'm wondering if it makes sense to have a separate source or not.

A scraper would not typically target the same nodes pushing via remote write. They serve different use cases: remote write is Prometheus servers configured to push, while scrape pulls from any endpoint exposing the Prometheus text exposition format, including application instances that are not Prometheus servers themselves.

The reason they share a plugin is the precedent from kkondaka's feedback on PR #6627 where the recommendation was to keep it as a single prometheus source with config options for different modes, similar to how s3 source handles s3-sqs and s3-scan under one plugin. Both modes can also run simultaneously which simplifies pipeline configuration when users need both ingestion paths.

That said, if you prefer a separate source I am open to splitting it out.

Signed-off-by: Srikanth Padakanti <srikanth_padakanti@apple.com>

Development

Successfully merging this pull request may close these issues.

Ingest data from Prometheus as Source

2 participants