articles/ai/advanced-retrieval-augmented-generation.md (+2 -2)
@@ -1,7 +1,7 @@
 ---
 title: Build Advanced Retrieval-Augmented Generation Systems
 description: As a developer, learn about real-world considerations and patterns for retrieval-augmented generation (RAG)-based chat systems.
-ms.date: 07/31/2025
+ms.date: 01/30/2026
 ms.topic: how-to
 ms.custom: build-2024-intelligent-apps
 ms.subservice: intelligent-apps
@@ -296,7 +296,7 @@ For more information, see these articles:
 
 ### Testing and verifying the safeguards
 
-_Red-teaming_ is key—it means to act like an attacker to find weak spots in the system. This step is especially important to stop jailbreaking. For tips on planning and managing red teaming for responsible AI, see [Planning red teaming for large language models (LLMs) and their applications](/azure/ai-foundry/openai/concepts/red-teaming).
+_Red-teaming_ is key—it means to act like an attacker to find weak spots in the system. This step is especially important to stop jailbreaking. For tips on planning and managing red teaming for responsible AI, see [Planning red teaming for large language models (LLMs) and their applications](/azure/ai-foundry/openai/concepts/red-teaming?view=foundry&preserve-view=true).
 
 Developers should test RAG system safeguards in different scenarios to make sure they work. This step makes the system stronger and also helps fine-tune responses to follow ethical standards and rules.
articles/ai/augment-llm-rag-fine-tuning.md (+4 -4)
@@ -1,7 +1,7 @@
 ---
 title: Augment LLMs with RAGs or Fine-Tuning
 description: Get a conceptual introduction to creating retrieval-augmented generation (RAG)-based chat systems, with an emphasis on integration, optimization, and ethical considerations for delivering contextually relevant responses.
-ms.date: 07/31/2025
+ms.date: 01/30/2026
 ms.topic: concept-article
 ms.custom: build-2024-intelligent-apps
 ms.collection: ce-skilling-ai-copilot
@@ -23,7 +23,7 @@ The next sections break down both methods.
 
 RAG enables the key "chat over my data" scenario. In this scenario, an organization has a potentially large corpus of textual content, like documents, documentation, and other proprietary data. It uses this corpus as the basis for answers to user prompts.
 
-RAG lets you build chatbots that answer questions using your own documents. Here's how it works:
+RAG lets you build chatbots or agents that answer questions using your own documents. Here's how it works:
 
 1. Store your documents (or parts of them, called *chunks*) in a database
 2. Create an *embedding* for each chunk; a list of numbers that describe it
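The store-and-embed steps in the hunk above can be sketched end to end. This is a minimal illustration with hand-made three-dimensional embeddings and a toy in-memory similarity search; the chunk texts and vectors are hypothetical stand-ins for a real embeddings API and vector database:

```python
import math

# Hypothetical, hand-made embeddings standing in for a real embeddings API.
CHUNKS = {
    "Returns are accepted within 30 days.": [0.9, 0.1, 0.0],
    "Shipping takes 3-5 business days.":    [0.1, 0.9, 0.0],
    "Support is available 24/7 by chat.":   [0.0, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Angle-based closeness of two vectors in embedding space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, top_k=1):
    """Rank stored chunks by similarity to the query embedding."""
    ranked = sorted(
        CHUNKS.items(),
        key=lambda item: cosine_similarity(query_embedding, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]

# A query about refunds should land near the returns chunk.
query = [0.8, 0.2, 0.1]  # pretend embedding of "How do I get a refund?"
print(retrieve(query))
```

The retrieved chunk(s) would then be pasted into the prompt so the model answers from your data rather than from memory.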
@@ -51,7 +51,7 @@ One way to create an embedding is to send your content to the Azure OpenAI Embed
 
 All these numbers together show where the content sits in a multi-dimensional space. Imagine a 3D graph, but with hundreds or thousands of dimensions. Computers can work with this kind of space, even if we can’t draw it.
 
-The [Tutorial: Explore Azure OpenAI in Azure AI Foundry Models embeddings and document search](/azure/ai-foundry/openai/tutorials/embeddings?tabs=python-new%2Ccommand-line&pivots=programming-language-python) provides a guide on how to use the Azure OpenAI Embeddings API to create embeddings for your documents.
+The [Tutorial: Explore Azure OpenAI in Microsoft Foundry Models embeddings and document search](/azure/ai-foundry/openai/tutorials/embeddings?view=foundry&tabs=command-line&pivots=programming-language-python&preserve-view=true) provides a guide on how to use the Azure OpenAI Embeddings API to create embeddings for your documents.
 
 #### Storing the vector and content
 
@@ -118,7 +118,7 @@ Fine-tuning also has some challenges:
 
 
 
-[Customize a model through fine-tuning](/azure/ai-services/openai/how-to/fine-tuning?tabs=turbo%2Cpython-new&pivots=programming-language-studio) explains how to fine-tune a model.
+[Customize a model through fine-tuning](/azure/ai-foundry/openai/how-to/fine-tuning?view=foundry&tabs=oai-sdk%2Cazure-openai&pivots=programming-language-studio&preserve-view=true) explains how to fine-tune a model.
articles/ai/gen-ai-concepts-considerations-developers.md (+7 -7)
@@ -1,7 +1,7 @@
 ---
 title: Key concepts and considerations in generative AI
 description: Learn about the limitations of large language models (LLMs) and how to get the best results by modifying prompts, building an inference pipeline, and adjusting API call parameters.
-ms.date: 07/31/2025
+ms.date: 01/30/2026
 content_well_notification:
 - AI-contribution
 ai-usage: ai-assisted
@@ -64,9 +64,9 @@ For developers, the following libraries help estimate token counts for prompts a
 
 ### Token usage affects billing
 
-Each Azure OpenAI API has a different billing methodology. For processing and generating text with the Chat Completions API, you're billed based on the number of tokens you submit as a prompt and the number of tokens that are generated as a result (completion).
+Each Azure OpenAI API has a different billing methodology. For processing and generating text with the Responses or Chat Completions API, you're billed based on the number of tokens you submit as a prompt and the number of tokens that are generated as a result (completion).
 
-Each LLM model (for example, GPT-4.1, GPT-4o, or GPT-4o mini) usually has a different price, which reflects the amount of computation required to process and generate tokens. Many times, price is presented as "price per 1,000 tokens" or "price per 1 million tokens."
+Each LLM model (for example, GPT-5.2 or GPT-5.2-mini) usually has a different price, which reflects the amount of computation required to process and generate tokens. Many times, price is presented as "price per 1,000 tokens" or "price per 1 million tokens."
 
 This pricing model has a significant effect on how you design the user interactions and the amount of preprocessing and post-processing you add.
 
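The per-token pricing described in the hunk above is easy to estimate up front. The sketch below uses hypothetical prices and a crude whitespace-based token count purely for illustration; in practice you would use a real tokenizer (such as `tiktoken`) and the current price sheet for your model:

```python
# Hypothetical per-1,000-token prices; check the current price sheet for real values.
PRICE_PER_1K_PROMPT = 0.005
PRICE_PER_1K_COMPLETION = 0.015

def rough_token_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())

def estimate_cost(prompt: str, completion: str) -> float:
    """Bill prompt and completion tokens at their respective rates."""
    prompt_cost = rough_token_count(prompt) / 1000 * PRICE_PER_1K_PROMPT
    completion_cost = rough_token_count(completion) / 1000 * PRICE_PER_1K_COMPLETION
    return prompt_cost + completion_cost

cost = estimate_cost("Summarize this report " * 250, "Here is the summary " * 100)
print(f"${cost:.4f}")
```

Because completions are often priced higher than prompts, trimming verbose outputs can matter as much as shortening the prompt.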
@@ -125,7 +125,7 @@ For information about the specific steps to take to build an inference pipeline,
 
 Beyond programmatically modifying the prompt, creating an inference pipeline, and other techniques, more details are discussed in [Augmenting a large-language model with retrieval-augmented generation and fine-tuning](augment-llm-rag-fine-tuning.md). Also, you can modify parameters when you make calls to the Azure OpenAI API.
 
-To review required and optional parameters to pass that can affect various aspects of the completion, see the [Chat endpoint documentation](https://platform.openai.com/docs/api-reference/chat/create). If you're using an SDK, see the SDK documentation for the language you use. You can experiment with the parameters in the [Playground](https://platform.openai.com/playground/chat).
+Here are some of the key parameters you can adjust to influence the model's output:
 
 - **`Temperature`**: Control the randomness of the output the model generates. At zero, the model becomes deterministic, consistently selecting the most likely next token from its training data. At a temperature of 1, the model balances between choosing high-probability tokens and introducing randomness into the output.
 
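The effect of the `Temperature` parameter described in the hunk above can be illustrated with plain softmax sampling. This is a minimal sketch over made-up token scores, not the provider's actual decoding code:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw token scores to probabilities; low temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 0.1 nearly all probability mass collapses onto the top token (near-deterministic output); at 2.0 the distribution flattens and sampling becomes visibly more random.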
@@ -145,7 +145,7 @@ To review required and optional parameters to pass that can affect various aspec
 
 In addition to keeping the LLM's responses bound to specific subject matter or domains, you also likely are concerned about the kinds of questions your users are asking of the LLM. It's important to consider the kinds of answers it's generating.
 
-First, API calls to Microsoft Azure OpenAI Services automatically filter content that the API finds potentially offensive and reports this back to you in many filtering categories.
+First, API calls to Microsoft Azure OpenAI Models in Microsoft Foundry automatically filter content that the API finds potentially offensive and reports this back to you in many filtering categories.
 
 You can use the [Content Moderation API](/azure/ai-services/content-moderator/api-reference) directly to check any content for potentially harmful content.
 
@@ -155,14 +155,14 @@ Then, you can use [Azure AI Content Safety](/azure/ai-services/content-safety/ov
 
 AI agents are a new way to build generative AI apps that work on their own. They use LLMs to read and write text, and they can also connect to outside systems, APIs, and data sources.
 AI agents can manage complex tasks, make choices using real-time data, and learn from how people use them.
-For more information about AI agents, see [Quickstart: Create a new agent](/azure/ai-foundry/agents/quickstart?pivots=programming-language-python-azure).
+For more information about AI agents, see [Quickstart: Create a new agent](/azure/ai-foundry/agents/quickstart?view=foundry-classic&pivots=programming-language-python-azure&preserve-view=true).
 
 ### Tool calling
 
 AI agents can use outside tools and APIs to get information, take action, or connect with other services. This feature lets them do more than just generate text and handle more complex tasks.
 
 For example, an AI agent can get real-time weather updates from a weather API or pull details from a database based on what a user asks.
-For more information about tool calling, see [Tool calling in Azure AI Foundry](/azure/ai-foundry/agents/how-to/tools/overview).
+For more information about tool calling, see [Discover and manage tools in the Foundry tool catalog (preview)](/azure/ai-foundry/agents/concepts/tool-catalog?view=foundry&preserve-view=true).
articles/ai/includes/scaling-load-balancer-capacity.md (+2 -2)
@@ -1,13 +1,13 @@
 ---
 ms.custom: overview
 ms.topic: include
-ms.date: 06/26/2025
+ms.date: 01/30/2026
 ms.service: azure
 ---
 
 ## Configure the TPM quota
 
-By default, each of the Azure OpenAI instances in the load balancer is deployed with a capacity of 30,000 tokens per minute (TPM). You can use the chat app with the confidence that it scales across many users without running out of quota. Change this value when:
+By default, each of the Azure OpenAI Models in Microsoft Foundry instances in the load balancer is deployed with a capacity of 30,000 tokens per minute (TPM). You can use the chat app with the confidence that it scales across many users without running out of quota. Change this value when:
 
 * You get deployment capacity errors: Lower the value.
articles/ai/includes/scaling-load-balancer-introduction-azure-api-management.md (+3 -3)
@@ -1,11 +1,11 @@
 ---
 ms.custom: overview
 ms.topic: include
-ms.date: 06/26/2025
+ms.date: 01/30/2026
 ms.service: azure
 ---
 
-Learn how to add enterprise-grade load balancing to your application to extend the chat app beyond the Azure OpenAI Service token and model quota limits. This approach uses Azure API Management to intelligently direct traffic between three Azure OpenAI resources.
+Learn how to add enterprise-grade load balancing to your application to extend the chat app beyond the Azure OpenAI Models in Microsoft Foundry token and model quota limits. This approach uses Azure API Management to intelligently direct traffic between three Azure OpenAI resources.
 
 This article requires you to deploy two separate samples:
@@ -19,7 +19,7 @@ This article requires you to deploy two separate samples:
 
 ## Architecture for load balancing Azure OpenAI with Azure API Management
 
-Because the Azure OpenAI resource has specific token and model quota limits, a chat app that uses a single Azure OpenAI resource is prone to have conversation failures because of those limits.
+Because the Azure OpenAI Models in Microsoft Foundry have specific token and model quota limits, a chat app that uses a single Azure OpenAI resource is prone to have conversation failures because of those limits.
 
 :::image type="content" source="../media/get-started-scaling-load-balancer-azure-api-management/chat-app-original-architecuture.png" alt-text="Diagram that shows chat app architecture with an Azure OpenAI resource highlighted.":::
articles/ai/includes/scaling-load-balancer-introduction-azure-container-apps.md (+4 -4)
@@ -1,11 +1,11 @@
 ---
 ms.custom: overview
 ms.topic: include
-ms.date: 12/20/2024
+ms.date: 01/30/2026
 ms.service: azure
 ---
 
-Learn how to add load balancing to your application to extend the chat app beyond the Azure OpenAI Service token and model quota limits. This approach uses Azure Container Apps to create three Azure OpenAI endpoints and a primary container to direct incoming traffic to one of the three endpoints.
+Learn how to add load balancing to your application to extend the chat app beyond the Azure OpenAI Models in Microsoft Foundry token and model quota limits. This approach uses Azure Container Apps to create three Azure OpenAI endpoints and a primary container to direct incoming traffic to one of the three endpoints.
 
 This article requires you to deploy two separate samples:
@@ -25,9 +25,9 @@ This article requires you to deploy two separate samples:
 
 ## Architecture for load balancing Azure OpenAI with Azure Container Apps
 
-Because the Azure OpenAI resource has specific token and model quota limits, a chat app that uses a single Azure OpenAI resource is prone to have conversation failures because of those limits.
+Because the Azure OpenAI Models in Microsoft Foundry resource has specific token and model quota limits, a chat app that uses a single Azure OpenAI Models in Microsoft Foundry resource is prone to have conversation failures because of those limits.
 
-:::image type="content" source="../media/get-started-scaling-load-balancer-azure-container-apps/chat-app-original-architecuture.png" alt-text="Diagram that shows chat app architecture with the Azure OpenAI resource highlighted.":::
+:::image type="content" source="../media/get-started-scaling-load-balancer-azure-container-apps/chat-app-original-architecuture.png" alt-text="Diagram that shows chat app architecture with the Azure OpenAI Models in Microsoft Foundry resource highlighted.":::
 
 To use the chat app without hitting those limits, use a load-balanced solution with Container Apps. This solution seamlessly exposes a single endpoint from Container Apps to your chat app server.
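The single-endpoint balancer described above can be sketched as round-robin with failover across the three backends. The endpoint URLs and the throttling check are hypothetical; a real deployment delegates this routing to Azure Container Apps or API Management rather than application code:

```python
from itertools import cycle

# Hypothetical backend endpoints standing in for three Azure OpenAI deployments.
ENDPOINTS = [
    "https://openai-1.example.com",
    "https://openai-2.example.com",
    "https://openai-3.example.com",
]

class RoundRobinBalancer:
    """Rotate across backends; skip any that currently report quota exhaustion."""

    def __init__(self, endpoints):
        self._ring = cycle(endpoints)
        self._count = len(endpoints)

    def pick(self, is_throttled) -> str:
        # Try each backend at most once per call before giving up.
        for _ in range(self._count):
            endpoint = next(self._ring)
            if not is_throttled(endpoint):
                return endpoint
        raise RuntimeError("All backends are throttled")

balancer = RoundRobinBalancer(ENDPOINTS)
throttled = {"https://openai-1.example.com"}  # pretend this one hit its TPM quota
print(balancer.pick(lambda e: e in throttled))
```

Spreading traffic this way means a single backend hitting its TPM quota degrades capacity instead of failing the conversation outright.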