DeepSeek just changed the game — again.
Their latest breakthrough: a self-teaching AI system that learns without supervision, adapts in real time, and may have just outperformed OpenAI’s best models. 🤖🔥
💡 Is this the dawn of true AGI?
🇨🇳 How is China racing ahead in the AI arms race?
💥 What does this mean for OpenAI, Google, and the global tech balance?
From architecture to performance benchmarks, we break down exactly why DeepSeek's model matters — and what it tells us about the future of machine intelligence.
📊 Watch now to see how this underdog is disrupting the AI elite.
#DeepSeekAI #AGIBreakthrough #SelfTeachingAI #AIvsOpenAI #AICompetition #DeepSeekVsOpenAI #ArtificialGeneralIntelligence #FutureOfAI #TechWar #ChineseAI #AGI2025 #NextGenAI #AutonomousLearning #OpenAISurpassed #MachineLearningRevolution #EmergingTech #AIInnovation #AIModelBattle #AIResearch #GameChangerAI
Category: Tech

Transcript
00:00Something wild is happening in the AI world right now, and it's not just hype.
00:07DeepSeek just dropped a new way to make AI models teach themselves how to think better,
00:13and it's starting to outperform giants like GPT-4o.
00:17Meanwhile, OpenAI is cooking up a bunch of new models and just gave ChatGPT a serious memory upgrade.
00:24It can now remember everything you've ever told it.
00:27Things are moving fast, and it's honestly like talking to an AI that grows with you day by day.
00:32Before we start, quick heads up, we just dropped a free course on AI avatars.
00:37What they are, how to use them to speed up content creation, pick the right tools, and even make money with them.
00:43Spots are limited, so check the link in the description and grab your spot now.
00:47Alright, now let's talk about DeepSeek's new AI system.
00:50They call it DeepSeek-GRM, and they say that once a model is trained with SPCT (self-principled critique tuning),
00:56it can handle single, paired, or multiple responses with ease.
01:01So let's say you give the model several possible answers.
01:04It will churn out a little piece of text that shows its principles, basically the rules it's learned,
01:11to judge what a good answer looks like.
01:13And then it critiques each answer in detail before assigning a score from 1 to 10.
01:18If the answer meets its criteria for correctness, clarity, safety, or whatever else it has decided is important,
01:26it'll get a higher score.
01:27If the answer looks dodgy, it'll get a lower score.
01:30But here's what sets it apart.
01:32It can do something called repeated sampling at inference time, meaning you can sample the model's internal judging process multiple times
01:40and either average out or vote on the results.
01:43That might sound a bit time-intensive, but it leads to more accurate final decisions
01:47because you get a broader distribution of possible critiques.
01:51As an extra step, they introduced a sort of gatekeeper called a meta reward model, or meta RM,
01:57that filters out any subpar critiques the system might generate.
02:01So you don't need to just trust one critique.
02:04You can gather several, filter out the nonsense, and end up with a better final decision.
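To make that flow concrete, here's a minimal sketch of the inference-time recipe described above: sample the judging process several times, let a meta-RM-style filter keep the most trustworthy critiques, then vote on the scores. The function names and scoring stubs below are invented placeholders, not DeepSeek's actual code or API.

```python
# Minimal sketch of repeated sampling plus meta-RM filtering (illustrative only).
import random
from collections import defaultdict

def generate_critique(question, answers):
    """Stand-in for the generative reward model: returns (principles, per-answer scores 1-10)."""
    principles = "correctness, clarity, safety"          # model-written judging rules
    scores = {i: random.randint(1, 10) for i in range(len(answers))}
    return principles, scores

def meta_rm_score(question, answers, critique):
    """Stand-in for the meta reward model: how trustworthy is this critique? Higher is better."""
    return random.random()

def judge(question, answers, k=8, keep_top=4):
    # 1. Sample the judging process k times (repeated sampling at inference time).
    critiques = [generate_critique(question, answers) for _ in range(k)]
    # 2. Let the meta-RM filter out sub-par critiques, keeping only the best few.
    ranked = sorted(critiques, key=lambda c: meta_rm_score(question, answers, c), reverse=True)
    kept = ranked[:keep_top]
    # 3. Vote: sum the surviving scores per answer and pick the winner.
    totals = defaultdict(int)
    for _, scores in kept:
        for idx, s in scores.items():
            totals[idx] += s
    return max(totals, key=totals.get)

best = judge("Which answer is better?", ["answer A", "answer B", "answer C"])
print("picked answer index:", best)
```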
02:08Now let's take a step back and explore how they train DeepSeek-GRM.
02:13The process is called self-principled critique tuning, SPCT, and it happens in two phases.
02:19The first is rejective fine-tuning, RFT, which aims to give the model a baseline sense of what good critiques look like.
02:27This phase uses 1.07 million pieces of general instruction data, plus 186,000 items of rejectively sampled data.
02:37The idea behind rejective sampling is that if the model's predicted scores
02:41don't align with the known best responses, those generated samples are tossed out.
02:46If everything is somehow correct right away, they discard that as well,
02:50because they want to focus on challenging examples.
02:53They also throw in some single response data, labeling correct responses with a 1 and wrong ones with a 0.
02:59That's the training set for the baseline.
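As a rough illustration of that filtering rule (not DeepSeek's implementation), here's how the keep/discard logic could look: critiques whose predicted winner disagrees with the labelled best response are dropped, and prompts where every sampled critique is already correct are skipped as too easy to be useful.

```python
# Hedged sketch of the rejective-sampling filter described above.

def predicted_winner(scores):
    """Index of the answer this sampled critique ranked highest."""
    return max(range(len(scores)), key=lambda i: scores[i])

def rejective_filter(sampled_critiques, best_index):
    """
    sampled_critiques: list of score lists, one per sampled judging pass.
    best_index: index of the known-best response for this prompt.
    Returns the critiques to keep for fine-tuning, or [] if the prompt is discarded.
    """
    correct = [c for c in sampled_critiques if predicted_winner(c) == best_index]
    if not correct:                              # no sample agrees with the label: toss them all
        return []
    if len(correct) == len(sampled_critiques):   # every sample is right: too easy, discard
        return []
    return correct

# Toy usage: three sampled passes over two answers, where answer 1 is the labelled best.
samples = [[3, 8], [9, 2], [4, 7]]
print(rejective_filter(samples, best_index=1))   # keeps the two aligned critiques
```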
03:01They say that for their 27-billion-parameter model, built on Gemma-2-27B,
03:06they used a batch size of 1,024 and a learning rate of 5e-6,
03:12training for about 900 steps on 128 A100 GPUs.
03:18That took them close to 19.2 hours.
03:21Then, after RFT, they do the second phase, which is rule-based online reinforcement learning
03:26using something they call GRPO, similar to PPO.
03:30It compares the model's predicted best response to the actual best response,
03:34awarding a reward of plus 1 if it matches and minus 1 if it doesn't.
03:39They also incorporate a KL penalty of 0.08 to keep the model from drifting too far off track.
03:45This phase uses 237k data points and runs for 900 steps with a batch size of 512, also on 128 A100s,
03:56for about 15.6 hours.
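Here's a hedged sketch of that reward shaping, using the +1/-1 rule and the 0.08 KL coefficient quoted above. The function names are placeholders, and this shows only the reward term, not a full GRPO training loop.

```python
# Illustrative reward shaping for the rule-based online RL phase (not DeepSeek's code).

KL_COEFF = 0.08  # coefficient quoted in the video

def rule_based_reward(predicted_best, true_best):
    # +1 if the model's predicted best response matches the labelled best, -1 otherwise.
    return 1.0 if predicted_best == true_best else -1.0

def kl_penalty(policy_logprob, reference_logprob):
    # Per-sample KL estimate between the current policy and the frozen RFT reference model.
    return policy_logprob - reference_logprob

def total_reward(predicted_best, true_best, policy_logprob, reference_logprob):
    return rule_based_reward(predicted_best, true_best) - KL_COEFF * kl_penalty(
        policy_logprob, reference_logprob
    )

# Toy usage: correct pick, but the policy has drifted a bit from the reference.
print(total_reward(predicted_best=2, true_best=2,
                   policy_logprob=-1.1, reference_logprob=-1.4))  # 1.0 - 0.08 * 0.3
```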
03:58DeepSeek has also tested different sizes of this generative reward model,
04:03from a 16B version built on a Mixture of Experts foundation,
04:07to a 27B version built on Gemma-2-27B,
04:11to massive 236B and 671B versions,
04:15also employing Mixture of Experts architectures.
04:18They keep pointing out that the default size they showcase results for is 27B,
04:23presumably because it offers a sweet spot of performance and cost,
04:27but they do say that if you try enough repeated sampling at inference time,
04:31you can get the 27B model's performance pretty close to the single-pass performance of the 671B monster.
04:39That's a big deal, because scaling up to a 671B model is obviously more expensive and hardware-intensive,
04:46whereas sampling a 27B model 32 times might be more feasible in certain use cases.
04:53It's basically a statement that if you want to do this in a real-world pipeline,
04:57you might not always need to jump straight to the biggest model to get top-tier results.
05:02The real highlight is how this approach performs in benchmarks.
05:05They tested across multiple sets like RewardBench, which looks at chat, safety and reasoning.
05:11There's also PPE, which has preference and correctness parts,
05:16and RMB for helpfulness and harmlessness tasks.
05:19Then there's ReaLMistake for single-response error detection.
05:24For RewardBench, the 27B version of DeepSeek-GRM gets about 86.0% with a single greedy pass,
05:33which is already pretty high.
05:35Then if they do something like 8 sample voting, that jumps to about 88.5%.
05:40Combine that with the meta-RM filtering, and it can even hit 90.4% in that particular domain.
05:47Across other tests, you see smaller but still notable bumps.
05:51PPE preference might go from 64.7% up to 65.3% or 67.2%,
05:58PPE correctness from 59.8% up to 63.2%, and so on.
06:04They say that overall if you just do one pass, you get around 69.5% on average.
06:11But if you do 32 sample voting plus the meta-RM, it pushes to 72.8%.
06:17Notably, these results beat some other large public reward models,
06:22like Nemotron-4-340B-Reward at 70.5%, and even approach or exceed GPT-4o's 71.3% in certain comparisons.
06:33Now, the system isn't flawless, of course.
06:35It can stumble on tasks where only one correct answer exists, like math or coding,
06:40especially if no reference is provided.
06:42But if you give it ground truth solutions, it can nail over 90% on math,
06:47almost matching specialized models.
06:49Generating detailed critiques instead of a single score can be slower and cost more compute,
06:55especially if you sample multiple times to boost accuracy.
06:59There's also a delicate balance with the KL penalty.
07:01Set it too low, and the model can misbehave.
07:04Ablation studies confirm that principle generation and rejective sampling are crucial.
07:10Removing them undermines performance.
07:12Still, the generative approach is flexible for everything from chat help to safety checks.
07:17And if you need more accuracy, you can sample repeatedly and let a meta-RM filter out duds,
07:23all without retraining a bigger model.
07:25Now, the rumor mill has also been churning about DeepSeek's next chatbot, allegedly called R2.
07:31The initial R1 was already a big splash earlier in the year,
07:34so everyone's curious if R2 is going to leverage DeepSeek-GRM right out of the box,
07:39or if it's going to do something even more advanced.
07:43The company hasn't made any official statements on R2's release date,
07:46or whether it will incorporate all the features described in the new paper,
07:50but there's definitely plenty of speculation in the community.
07:53On top of that, DeepSeek is talking about open-sourcing these advanced AI models at some point,
07:58but again, no firm timeline has been released.
08:01Still, that's big news in a world where people are hungry for more open, large-scale AI solutions.
08:08So, here's how it all fits together.
08:10SPCT is basically the backbone of DeepSeek's self-improving model.
08:15Instead of just scaling up the model, they teach it to create its own rules,
08:20critique its answers, and give it self-feedback.
08:23If its response matches what it considers a good answer, it reinforces that behavior.
08:29It's like a student explaining their answer and learning better through feedback.
08:35Just way faster and automated.
08:37What makes this approach so flexible is that it works across different tasks
08:41without needing to rebuild the whole thing.
08:44If the goal is correctness, you feed in references.
08:47If it's about safety or politeness, you train it with those values.
08:52Whether it's choosing the best of a few responses or judging a single one,
08:57it handles it all with the same architecture.
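One way to picture that flexibility: the task lives in the judge's prompt rather than in the model architecture. The template below is invented for illustration and is not DeepSeek's actual prompt format, but it shows how references, values, and any number of responses slot into the same call.

```python
# Hypothetical prompt builder for a generative judge (illustrative only).

def build_judge_prompt(question, responses, reference=None, values=("helpfulness",)):
    parts = [f"Question: {question}", ""]
    for i, r in enumerate(responses, 1):               # works for 1, 2, or N responses
        parts.append(f"Response {i}: {r}")
    if reference is not None:                          # correctness task: supply ground truth
        parts.append(f"Reference answer: {reference}")
    parts.append(f"Judge each response for: {', '.join(values)}.")
    parts.append("First state your principles, then critique each response, "
                 "then give each a score from 1 to 10.")
    return "\n".join(parts)

# Single response with a reference (correctness check)...
print(build_judge_prompt("What is 2+2?", ["4"], reference="4"))
# ...or several responses judged on safety and politeness, using the same function.
print(build_judge_prompt("Give medical advice", ["response A", "response B"],
                         values=("safety", "politeness")))
```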
08:59Now, what's really wild is that the 27B version of DeepSeek-GRM,
09:04with enough sampling and meta-RM filtering,
09:07can match or even outperform giants like GPT-4o and Nemotron-4-340B in some benchmarks.
09:14That's huge, especially coming from a newer player working with researchers from Tsinghua University,
09:19which adds a lot of credibility to their work.
09:21It also speaks to China's growing push to build competitive AI locally.
09:26At the end of the day, this whole setup is about smarter, more adaptable models.
09:31Instead of just going bigger, DeepSeek trains models to critique and improve themselves on the fly.
09:37The results are strong, the critiques are actually useful for devs,
09:41and the flexibility is a big win.
09:43It's not always the fastest method, especially with sampling,
09:46but it gives you a deeper level of insight and control.
09:49Alright, and just as DeepSeek's making waves with its self-improving models,
09:53OpenAI's got some big moves of its own coming up, too.
09:56They're gearing up to drop a bunch of new AI models soon,
09:59and at the top of the list is GPT-4.1, a more refined version of GPT-4o,
10:05their flagship multimodal model that handles text, image, and audio together in real time.
10:12GPT-4.1 is expected to take that core and polish it further,
10:16offering better performance across the board.
10:18We're also hearing that OpenAI's planning to launch mini and nano versions of GPT-4.1,
10:24which are likely optimized for speed and lighter devices.
10:29Plus, they've got something called the O3 model lined up,
10:32and an O4 mini that might even drop sooner.
10:35An AI engineer actually spotted references to these in ChatGPT's web code,
10:39so this is probably not just a rumor, it's happening soon, maybe even next week.
10:44But now, here's the real kicker.
10:46Sam Altman just announced that ChatGPT's memory feature has been greatly improved.
10:52It can now remember all your past chats.
10:55Not just a little context. Everything.
10:57This means your conversations with ChatGPT can now build on your previous ones,
11:02making responses feel way more personal, useful, and honestly, a lot more human-like.
11:08Before this, the memory feature was more limited; it first rolled out in September last year
11:12to users on Free, Plus, Team, and Enterprise plans.
11:15But now it's leveled up.
11:17ChatGPT can reference your past chats to give you advice,
11:21write things, help you learn, all in a way that's actually tailored to you.
11:25Sam even said this is a surprisingly great feature and hinted it's a step toward AI systems that grow with you over time
11:32and become a kind of personal assistant that knows you better the more you use it.
11:38It's currently rolling out to ChatGPT Pro users and will hit plus users soon.
11:43Enterprise, Edu, and Team users will get it in the coming weeks.
11:47But if you're in the EEA, UK, Switzerland, Norway, Iceland, or Liechtenstein, you're not getting it.
11:54For now, at least.
11:56And as for free users, no update yet.
11:58You're still totally in control of it, though.
12:01If you're not cool with ChatGPT remembering anything, you can opt out anytime in your settings.
12:06There's even a temporary chat mode where it won't use or store any memory from your conversation.
12:13You can also view, manage, or clear what it remembers at any time.
12:17So it's not forcing anything on you. You decide what it remembers.
12:21So is it brilliant progress? Or are we getting a little too comfortable with AI remembering everything we say?
12:28And if AI can now judge, critique, and evolve on its own, are we still the ones in control?
12:36And don't forget, the free course on AI avatars is open. Links in the description.
12:41And if you haven't yet, hit like and subscribe.
12:44Thanks for watching, and I'll catch you in the next one.