4/14/2025
Welcome to my video about the AI Rate Limiting Advanced plugin for the Kong Gateway. This plugin allows us to control traffic to the AI setup we built with the AI Proxy plugin. If you missed the AI Proxy video and are still figuring out what to do with it, or just want to know more, then a good place to start is the first video I made about Kong's AI plugins, over here: https://youtu.be/6Z8wWX-liBs . This particular video, about the AI Rate Limiting Advanced plugin, makes the most sense if we are using multiple LLM models under different routes of common services. It allows for a seamless configuration and better cost controls. I cover all the basics in the video. I hope you enjoy it, and be sure to stay tech, keep programming, be kind and have a good one everyone!

---

Chapters:

00:00:00 Start
00:00:33 Introduction
00:03:19 Configuring Mistral AI
00:04:59 The AI Advanced Rate Limiting Plugin in Detail
00:09:19 Interpreting Rate Limiting Headers
00:10:32 Checking out the configuration in Kong Konnect
00:11:07 Talking about Kong Semantic AI Cache
- https://youtu.be/b3dAMZOhr58
00:11:31 Checking out the example
00:12:10 Getting back to the AI Semantic Cache plugin
- https://youtu.be/b3dAMZOhr58
00:15:02 Launching examples and interpreting results
00:20:39 End notes and conclusion
00:21:22 See you in the next video!
00:22:11 Disclaimer

---

Soundtrack

- https://soundcloud.com/joaoesperancinha/slow-guitar-15-jesprotech

---

Related videos:

- https://youtu.be/Kw5GZnMnVhw
- https://youtu.be/rJKbAzjb5lQ
- https://youtu.be/z3Y4NQgjGLE
- https://youtu.be/KE3VTYtLvnI
- https://youtu.be/6Z8wWX-liBs
- https://youtu.be/vRH4qLZ7tz8
- https://youtu.be/Yhv19le0sBw

---

Source code:

- https://github.com/jesperancinha/kong-test-drives
- https://github.com/jesperancinha/jeorg-cloud-test-drives

---

Sources:

- https://docs.konghq.com/hub/kong-inc/ai-rate-limiting-advanced/

---


As a short disclaimer, I'd like to mention that I'm not associated or affiliated with any of the brands that may be shown, displayed, or mentioned in this video.

---

All my work and personal interests are also discoverable on various other sites:

- My Website - https://joaofilipesabinoesperancinha.nl/
- Reddit - https://www.reddit.com/user/jesperancinha
- Credly - https://www.credly.com/users/joao-esperancinha/badges
- Pinterest - https://nl.pinterest.com/jesperancinha/
- Facebook - https://www.facebook.com/joaofisaes/
- Spotify - https://open.spotify.com/user/jlnozkcomrxgsaip7yvffpqqm
- Daily Motion - https://www.dailymotion.com/jofisaes
- Bluesky - https://bsky.app/profile/jesperancinha.bsky.social

---

If you have any questions about this video, please leave a comment in the comment section below and I will be more than happy to help you or discuss any related topic you'd like.
Transcript
00:30The AI Rate Limiting Advanced plugin allows us to, well, control the rate of our requests
00:39to our API, but this one is specific for AI.
00:44So if you are used to dealing with the Rate Limiting plugin or with the Rate Limiting
00:48Advanced plugin, this is an improvement of that.
00:52And before I continue talking about this plugin, let me just go with you through the documentation
00:59for it.
01:00If we go to the introduction, we will find that it explains that the AI Rate Limiting Advanced
01:05plugin provides rate limiting for the providers used by any AI plugins.
01:10And my question when I started reading this documentation was, wait a minute, we already
01:15have rate limiting plugins in the Kong API gateway.
01:19There are at least two of them that are very well known.
01:24One is the Rate Limiting and the other one is the Rate Limiting Advanced plugins.
01:29But there are more.
01:31There is the Service Protection plugin, which allows us to define rate limits for services, and
01:37it is exclusive to services.
01:39Then we've got Rate Limiting, which allows us to configure rate limiting for the consumer group,
01:45for the consumer, for routes, and for services, and the same goes for Rate Limiting Advanced.
01:53Now, I have spoken a bit about the rate limiting plugins in an article, this one over here, where
01:59I discussed more details about other plugins that can potentially be used in the Kong gateway.
02:06And I explained that using the OSS version of Kong.
02:09So there, I just used the local container and simply configured the plugins to figure out
02:15how to work with the rate limiting plugins.
02:18But the AI Rate Limiting Advanced is something different.
02:22This plugin is specific for AI, specific for our requests to our large language models.
02:29And there are multiple configurations that we can do with this.
02:32We can even connect this plugin to a Redis database.
02:35And there are advantages for that.
02:38In general lines, it means that we can, for example, make sure that the rate limiting is
02:43proportional to the different large language models that we may have configured in Kong
02:48Konnect or in our Kong Enterprise Edition.
02:51The purpose here is to understand how this plugin works, regardless of using the Redis database
02:57or not, because in this case, it is not mandatory.
03:00What is mandatory for this plugin to work is the AI proxy.
03:05And the AI proxy, we can configure it, as we have seen in this video, to connect to multiple
03:12different large language models.
03:13We have seen how to configure this with Gemini.
03:16And we have seen how to configure this with Mistral.
03:20And for Mistral, we did a special configuration.
03:23We went to the website, like over here, and we created an API key.
03:29Now, I have already created another API key for this example, so that we don't have to go again
03:33through the configuration of Mistral.
03:35In this case, we can just go to the API here and just get one API key and then use that API
03:43key in our example.
03:46Another thing that is important for us to even start considering the use of the AI Rate
03:51Limiting Advanced plugin is the fact that we need to understand how many large
03:58language models we want to use in our services and routes.
04:02For example, we can configure the AI proxy plugin per route or per service.
04:07If we configure it per route, we will have a good advantage when we try to use the AI Rate
04:13Limiting Advanced plugin.
04:14And the reason for that is that we can then configure, for example, two different routes
04:20with two different AI proxy plugins.
04:22And in the service, on the service level, we can then configure the AI rate limiting advanced
04:29plugin, the one we are seeing today, and make sure that it will affect the two different
04:35routes.
04:35Because with that plugin, we can configure rate limiting both for one large language model
04:41and another and how many we want to configure because we give that in as an array.
04:46So keep that in mind.
04:48If we have AI proxy configured in multiple routes, we can configure them all at the same time if
04:54they belong to one single service.
04:56Because that is what we saw just now in the configuration of the AI rate limiting advanced
05:03plugin, where it says that we can use this on a service, route, consumer, and consumer group
05:10level.
05:10And that is just one example of how we can use the configuration of this plugin to affect
05:16multiple different routes.
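
To make that concrete, here is a minimal sketch of what attaching the plugin at the service level could look like via the Kong Admin API. This is an illustration, not the video's exact setup: the service name, the provider pair, and the numbers are placeholders I chose, the Admin API address assumes a local gateway, and the exact schema is in the plugin docs linked below.

# Hypothetical example: one service with two AI-proxy routes (say, Mistral and
# Gemini), rate limited together by a single plugin attached to the service.
curl -i -X POST http://localhost:8001/services/my-ai-service/plugins \
  --header "Content-Type: application/json" \
  --data '{
    "name": "ai-rate-limiting-advanced",
    "config": {
      "llm_providers": [
        { "name": "mistral", "limit": 100, "window_size": 60 },
        { "name": "gemini",  "limit": 100, "window_size": 60 }
      ]
    }
  }'
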
05:19But to understand this better, let's first have a look at the configuration and what it
05:24provides and what we are going to configure in our example.
05:27If we go to the configuration reference of the AI Rate Limiting Advanced plugin,
05:32we can see that we've got all of these different nodes that we configure.
05:36But important and mandatory is that we configure the LLM providers node.
05:42The LLM providers node is where we tell this plugin which of the LLM providers it is going to
05:50affect, and therefore which of the requests being forwarded by the AI proxy plugin are going to be
05:55affected by this rate limiting.
05:56Now, if we configure this by service or by route, we will affect either the service and all the routes bound to that service, or just the one route that we want to configure.
06:07And the important bit here, and it's important that we focus exactly on what this actually does, is that we've got three different properties per configuration that we do for one particular large language model.
06:22The first one is the window size.
06:25This means the size of a window in seconds.
06:29If you think about it, that means that we are going to limit the rate at which we make requests within a certain window.
06:36And you probably noticed that I am talking just about rate, not talking about the number of requests.
06:43That is because the AI rate limiting plugin does not configure per request.
06:48It configures per token size and per cost.
06:52And what this means is that a token is, for example, a word.
06:56A letter also counts toward how much it costs to, for example, ask a question to the LLM model.
07:06This is something that depends on the LLM model itself, and it depends on how we configure it.
07:12And so, it is not a number that we can control directly.
07:18It is a number that we can know if we know exactly how our large language model works.
07:25But we probably can do better if we make the requests and make some experiments in our application to figure out what requests have what size.
07:36Because we want, of course, to make sure that we limit the requests that we are sending to the LLM.
07:40We also need to give the name of our LLM provider.
07:44So, here is where we specify OpenAI, Azure, Anthropic, Cohere, Mistral, Llama 2, Bedrock, Gemini, or Hugging Face.
07:53These are different large language model types that we can use from different brands.
07:58And so, we've got window size, name, but probably the one that is more interesting here is the limit.
08:05What do we mean by limit?
08:06As I mentioned before, this is about costs.
08:08This is about the number of tokens, the token size, the number of letters, and it is about also the different large language models.
08:16The important thing here is that we know what kind of limit we want to set for a certain window.
08:22Let's say, for example, an arbitrary number, 60.
08:25If we say that we can only spend a total cost of 60 in a window size of 30 seconds,
08:32then that means that in that 30-second window, I can only spend up to 60.
08:41And that means that I can send 2, 3, 4, 5 requests, just as long as the sum of all the costs of those requests does not exceed 60.
08:50If I send 1 request that exceeds 60 in those 30 seconds, then the amount of time that I have to wait will be accounted for
08:59and will be calculated as a function of how many tokens I have exceeded the limit by.
09:06But this is something we will see in practice when we run our example.
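
As a sketch, that 60-cost/30-second example could be expressed like this with the Kong Admin API at the route level. The route name is a placeholder, the field names follow the plugin's configuration reference, and the local Admin API address is an assumption:

# A route-level version of the worked example: a total cost of 60
# allowed per 30-second window for the Mistral provider.
curl -i -X POST http://localhost:8001/routes/my-ai-route/plugins \
  --header "Content-Type: application/json" \
  --data '{
    "name": "ai-rate-limiting-advanced",
    "config": {
      "llm_providers": [
        { "name": "mistral", "limit": 60, "window_size": 30 }
      ]
    }
  }'
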
09:09Important here also to understand is that when we are using the AI Rate Limiting Advanced plugin,
09:15we are also making sure that we are using another functionality that it gives,
09:22which is these headers that will provide us information on what is the status of the request to our LLM model
09:29and what is the status of the rate.
09:32For example, we can see here that they are talking about the Rate Limit Reset and
09:37the Rate Limit Retry-After headers.
09:38For the rate limit for Azure, in this case, we are seeing here a 30.
09:45And when we are establishing the Rate Limiting, when we are establishing the Window Size,
09:50the Window Size will be reflected over here.
09:54We will see here, for example, if we say 30, that will be 30 seconds.
09:57But if we say, for example, one minute, that means 60 seconds.
10:01If we configure 60 there as the total number of seconds, we will see minute here instead of 30.
10:06And so on; there are other kinds of headers that we receive back
10:10that will inform us of the current state of the Rate Limit for one particular large language model.
10:18And this could be one LLM in one route, or two or three or four routes of the same type in one service.
10:24But of course, our example is much simpler than that.
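
For reference, the rate limiting headers we will look at later in the demo come back roughly in this shape. The values below are placeholders, and the header names are reconstructed from what is shown in the video; check the plugin documentation linked below for the exact set:

X-AI-RateLimit-Limit-120-mistral: 10
X-AI-RateLimit-Remaining-120-mistral: 0
X-AI-RateLimit-Reset: 171
X-AI-RateLimit-Retry-After-120-mistral: 171
Retry-After: 171
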
10:28What I have already prepared for us is in Kong Konnect,
10:32I already have two different plugins configured for our route and service.
10:36So let's see how this all is configured together.
10:39If we go here, for example, to the Gateway service,
10:41we see already that I have one service configured with this host and port.
10:45This is not necessary in terms of the definition of the host and port,
10:53but we do need to define something here so that this works.
10:56Because this will be proxied by the AI proxy,
10:59and therefore the URL here takes no effect.
11:01When we go here to our routes, we see that this is just a simple route.
11:05And as we have seen in this video over here, the previous video to this one,
11:09we see that this route is simply a route to Mistral.
11:15So here we are just going to use this path so that we can access Mistral.
11:23But the routes are accessible now because I have these plugins configured.
11:29And that is what we are going to see.
11:30If we go to the example project, kong-test-drives,
11:34I already have a file there called README AI Rate Limiting Advanced.
11:39In this file, I have already an automated configuration system
11:44that we can use to give it our environment variables.
11:48We can then call these scripts safely from our command line
11:52to make sure that we get these plugins configured.
11:55And so let me show you that.
11:57So the first one is the AI proxy.
12:00And for the AI proxy, I have configured here Mistral.
12:02All of these variables are things that we need to pass in on the command line.
12:06If you want to know more about how this works,
12:09I've explained that already in this video over here.
12:11Make sure to watch that one.
12:12It will be all in the description.
12:13Make sure also to follow the chapters in the description
12:16so that you know exactly which video is this one and where to find it.
12:20So this one is just a simple AI proxy plugin you have seen before.
12:25We've got here the provider Mistral.
12:27We also use the Mistral format of OpenAI.
12:29We say that the upstream URL is chat completions.
12:32And we are simply using this as a chatbot.
12:35We are using our LLM model simply as a chatbot.
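
The script in the repository does this with environment variables; as a hedged illustration, an equivalent raw Admin API call could look like the following. The route name and model name are placeholders, MISTRAL_API_KEY is expected in the environment, and the local Admin API address is an assumption (the video configures this through Kong Konnect, where the same config fields apply):

# Hypothetical sketch of the ai-proxy plugin described here: Mistral provider,
# OpenAI-compatible Mistral format, chat completions upstream, used as a chatbot.
curl -i -X POST http://localhost:8001/routes/mistral-route/plugins \
  --header "Content-Type: application/json" \
  --data '{
    "name": "ai-proxy",
    "config": {
      "route_type": "llm/v1/chat",
      "auth": {
        "header_name": "Authorization",
        "header_value": "Bearer '"$MISTRAL_API_KEY"'"
      },
      "model": {
        "provider": "mistral",
        "name": "mistral-tiny",
        "options": {
          "mistral_format": "openai",
          "upstream_url": "https://api.mistral.ai/v1/chat/completions"
        }
      }
    }
  }'
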
12:39And then we enter the configuration of the topic of this video,
12:43which is the configuration for the AI Rate Limiting Advanced plugin.
12:47And here we have the complete configuration, where we are specifying the configuration for this particular route. We could also have configured it here for services,
13:01but because we are just showing how it works, we will configure it only for one route.
13:06This route will have this plugin attached to it, which means that it will have this LLM provider configuration,
13:14where the Mistral LLM model that we are using with our AI proxy plugin
13:21will be limited to 10, which is the weight, or cost, of the questions we are making,
13:29which is the sum of the tokens and letters, the calculation that gives a request its cost,
13:35with a window size of 2 minutes: 120 seconds.
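
Again as a sketch, the configuration just described, a limit of 10 with a 120-second window on the Mistral provider, attached to the route, could look like this; the route name is a placeholder carried over from the previous snippet:

# Hypothetical equivalent of the video's rate limiting configuration:
# at most 10 cost per 120-second window for the Mistral provider.
curl -i -X POST http://localhost:8001/routes/mistral-route/plugins \
  --header "Content-Type: application/json" \
  --data '{
    "name": "ai-rate-limiting-advanced",
    "config": {
      "llm_providers": [
        { "name": "mistral", "limit": 10, "window_size": 120 }
      ]
    }
  }'
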
13:41So if we now go to our command line
13:44from IntelliJ, because this is already configured locally,
13:48I already have here everything configured
13:50and I can show you, I already have here at the top
13:54from this point onwards,
14:00I've got here the container running
14:02and I can show it to you also.
14:04If I do docker ps,
14:06you'll find that I have here a Kong gateway running locally
14:09as a data plane node for our Kong Konnect control plane.
14:13So now this is connected to Kong Konnect
14:15and we are configuring this container
14:18from that point to here.
14:23Having said that, we can now test our configuration
14:27and let's just ask one question
14:30that will take a long input,
14:31which is what was the first popular breakthrough
14:34of Dire Straits,
14:35which led them to become one of the most known bands
14:39in the last decades?
14:41So this is a very long question
14:43and it will certainly cost more than 10
14:45because I've got here one word, one word, one word.
14:49I've got more than 10 words
14:50and I've got words that are longer
14:52than two or three or four characters
14:54and I've got different words
14:55and I've got also punctuations.
14:58I've got all of this stuff.
15:00So this will have a cost.
15:01Which cost do you think that this question will have?
15:04I know it by heart
15:05because I've done this example before,
15:07but if I run this now,
15:08how much do you think this will cost?
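
The request being sent here is a plain OpenAI-style chat completion through the gateway. It would look something like the following, where the proxy URL and path are placeholders for the route configured above, and -i prints the rate limiting headers we want to inspect:

# Hypothetical test request through the gateway; the ai-proxy plugin
# forwards the OpenAI-style body to Mistral's chat completions API.
curl -i -X POST http://localhost:8000/mistral-route \
  --header "Content-Type: application/json" \
  --data '{
    "messages": [
      { "role": "user",
        "content": "What was the first popular breakthrough of Dire Straits, which led them to become one of the best-known bands of the last decades?" }
    ]
  }'
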
15:11And do you think that... okay, it was too quick.
15:16I was going to ask if you think
15:17that we were going to get an answer anyway.
15:20Well, it does answer.
15:21It says Dire Straits' first popular breakthrough
15:22was their single Sultans of Swing
15:25from their self-titled debut album released in 1978.
15:30So that's a long time ago,
15:31but this is just one answer.
15:34I'm not sure if this is correct or not.
15:36I believe it is,
15:37but the important thing
15:39that we want to have a look at here
15:41is how are the headers
15:44that we want to have a look at behaving.
15:46So we see that we've got an X-AI rate limit
15:49limit-120-mistral of 10.
15:52This is true
15:53because that is the limit we configured.
15:57The rate limit remaining says here 10.
16:0110, that is how much cost we still have available.
16:05We can see here the rate limit query cost header.
16:10Our whole query cost 290.
16:13So if you said anything less than this,
16:15you were wrong.
16:15If you said anything above this,
16:18you were wrong.
16:18If you said something around this,
16:20you were probably correct in your estimation
16:23of how big this question would be.
16:25Now, I don't know how this is calculated
16:26because this also depends on Mistral,
16:28but effectively we have here a cost for this question.
16:32So we can now get an idea of how costly these questions are.
16:36But we also see other things.
16:39We see that nothing seems to be affected by this request
16:43because these headers show the state
16:45not as we leave the rate limiting,
16:48but as we entered it.
16:50That means that if we make a follow-up question,
16:56like for example,
16:58Who are Dire Straits?
17:01Now, if we send this,
17:04it says the AI token rate limit
17:06exceeded for provider Mistral.
17:08But now, here comes the fun part.
17:11We can have a look at the different headers
17:14that Kong has given back to us.
17:17And now we can see that
17:18the X-AI rate limit remaining-120-mistral is zero.
17:21We don't have any more credits,
17:26let's call it that way,
17:27to send these requests.
17:28We have here that the reset will be at 171.
17:33This means that we need to wait 171 seconds still
17:37before we can continue.
17:39Remember, we made a question that is 290 in cost
17:43and our limit is 10.
17:45Now, we wouldn't wait 120 seconds.
17:50We would have to wait much more than that
17:52because we spent so many tokens
17:54at the one point where we were able to make a question.
17:57So now, rate limiting is in effect
17:59and we cannot make any other question
18:00until this deadline expires,
18:03until 171 seconds go by.
18:06There is a difference between these two.
18:08The retry-after-120-mistral header
18:11means that we can retry this provider after 171 seconds.
18:14The plain retry-after header
18:16also says that we can retry after 171 seconds,
18:19but that one reflects the combination
18:21of all of the different rate limits
18:25that we have configured in our gateway.
18:27So this one is the same
18:29because we only have one in effect right now
18:31and so that means that
18:32they will always match each other
18:34and it seems like a repetition of one or the other
18:37but they are not really that.
18:39And then finally, we've got here
18:41the rate limit limit-120-mistral header: 10.
18:46That is the current limit we have for cost
18:48and this one is the reset, again 171.
18:52And then now, I think those 171 seconds
18:55have gone by already
18:56and if we make the question once again,
18:59we should now be getting a response.
19:04Which is taking a while.
19:08Maybe I should have used that plugin,
19:10the AI Semantic Cache plugin.
19:12But anyway, it does answer now.
19:14It says:
19:16it seems like you might be confused
19:17between two different things,
19:18Dire Straits, which is a British rock band,
19:20and dire straits, which is a phrase
19:23meaning a serious or desperate situation.
19:26It is true.
19:27One or the other.
19:28It doesn't matter.
19:29We made the question
19:30and the AI model is now trying to figure out
19:33a way to give us a response
19:34and this is a very, very correct response
19:37because one is the band
19:38and the other one is the popular saying.
19:41And now, what is our limitation?
19:43Well, if we go here further above,
19:46we will see that our question now,
19:48the total cost of this question is 262.
19:51So this means that the question is smaller,
19:54but the cost is still 262.
19:56That means we have already crossed the limit.
19:58We are already way above 10.
20:01So that means if we make a question now,
20:03for example, who is Brian May?
20:06And if we run it,
20:07do you think we are still under the limit?
20:12And now it says the AI token rate
20:14limit exceeded for provider Mistral.
20:16Again, we have exceeded the rate limitation
20:19and that means that we get a Too Many Requests response
20:21and we should be getting now
20:23the state of our rate limiting.
20:28And that means that the state is,
20:29again, we don't have any more credits.
20:31We need to wait 197 seconds still
20:33and only then we will be able to
20:35make another question to our AI model.
20:39All right, everyone.
20:40So this is my short presentation
20:41about the AI Rate Limiting
20:45Advanced plugin.
20:46This is a plugin that has multiple functionalities,
20:49not just this.
20:49This is a bit of a crash course
20:51into how to use it.
20:53There will be more videos coming in
20:54with a lot more detail about it.
20:56Make sure to stay tuned to this channel
20:58and make sure to subscribe to this channel
21:00so that you don't miss out
21:01on following videos explaining this.
21:03Now, if you enjoyed this video,
21:04make sure to give us a like,
21:06make sure to subscribe to the channel.
21:08As I mentioned before,
21:09make sure to leave your comment.
21:11It's very important for you
21:12to give a comment to this video.
21:14I will answer questions that you put in there
21:16and don't forget to activate
21:17that notification bell
21:18so that you are up to date
21:20with the videos that I post over here.
21:22Thank you so much for watching the video
21:24and until the next one,
21:26be sure to stay tech,
21:28keep programming,
21:29be kind
21:30and have a good one.
21:31Bye.
22:03As a short disclaimer, I'd like to mention that I'm not associated or affiliated with
22:15any of the brands that may be shown, displayed or mentioned in this video.
