[{"data":1,"prerenderedAt":775},["ShallowReactive",2],{"blog-\u002Fblog\u002Fen\u002Fmaking-ais-play-werewolf-arenai-first-results":3},{"id":4,"title":5,"body":6,"date":761,"description":762,"extension":763,"meta":764,"navigation":765,"path":766,"seo":767,"stem":768,"tags":769,"translationSlug":773,"__hash__":774},"content\u002Fblog\u002Fen\u002Fmaking-ais-play-werewolf-arenai-first-results.md","Making AIs Play Werewolf: ArenAI's First Results",{"type":7,"value":8,"toc":738},"minimark",[9,21,26,29,34,37,125,128,199,203,354,358,364,376,382,388,392,395,399,402,406,414,419,422,425,428,431,434,440,444,447,451,454,458,461,465,468,479,485,491,497,500,519,528,532,539,543,550,554,557,560,563,567,574,583,592,595,599,602,608,614,620,624,627,647,650,667,670,674,677,681,684,691,694,697,701,707,713,718,724],[10,11,12,13,20],"p",{},"In the ",[14,15,19],"a",{"href":16,"rel":17},"https:\u002F\u002Fplduhoux.fr\u002Fen\u002Fblog\u002Farenai-genesis",[18],"nofollow","first article",", I introduced ArenAI: a platform that makes LLMs play social deduction games. The goal: measure their social intelligence (lying, detecting liars, persuading, coordinating) by having them face each other without strategic coaching. Since then, I have run 120 games of Werewolf. And I tried adding a new game that completely broke every model.",[22,23,25],"h2",{"id":24},"_120-games-of-werewolf","120 games of Werewolf",[10,27,28],{},"The setup: 7 players (2 Werewolves, 1 Seer, 1 Witch, 3 Villagers), Mayor election, 2 discussion rounds per day. Four frontier models in round-robin: Claude Opus 4.6 (Anthropic), GPT-5.4 (OpenAI), Gemini 2.5 Pro (Google), Grok 4.20 (xAI). Each pair played 20 games (10 in each direction). 
120 games total, ~93k tokens per game on average.",[30,31,33],"h3",{"id":32},"ranking-by-role","Ranking by role",[10,35,36],{},"As Werewolves (the offensive role: lying, manipulating, coordinating kills):",[38,39,40,59],"table",{},[41,42,43],"thead",{},[44,45,46,50,53,56],"tr",{},[47,48,49],"th",{},"Model",[47,51,52],{},"Games as Werewolf",[47,54,55],{},"Wins",[47,57,58],{},"Rate",[60,61,62,80,95,110],"tbody",{},[44,63,64,68,71,74],{},[65,66,67],"td",{},"GPT-5.4",[65,69,70],{},"30",[65,72,73],{},"22",[65,75,76],{},[77,78,79],"strong",{},"73%",[44,81,82,85,87,90],{},[65,83,84],{},"Claude Opus 4.6",[65,86,70],{},[65,88,89],{},"15",[65,91,92],{},[77,93,94],{},"50%",[44,96,97,100,102,105],{},[65,98,99],{},"Gemini 2.5 Pro",[65,101,70],{},[65,103,104],{},"11",[65,106,107],{},[77,108,109],{},"37%",[44,111,112,115,117,120],{},[65,113,114],{},"Grok 4.20",[65,116,70],{},[65,118,119],{},"4",[65,121,122],{},[77,123,124],{},"13%",[10,126,127],{},"As Villagers (the defensive role: detecting liars, resisting manipulation):",[38,129,130,143],{},[41,131,132],{},[44,133,134,136,139,141],{},[47,135,49],{},[47,137,138],{},"Games as Villager",[47,140,55],{},[47,142,58],{},[60,144,145,157,171,185],{},[44,146,147,149,151,153],{},[65,148,67],{},[65,150,70],{},[65,152,73],{},[65,154,155],{},[77,156,79],{},[44,158,159,161,163,166],{},[65,160,99],{},[65,162,70],{},[65,164,165],{},"19",[65,167,168],{},[77,169,170],{},"63%",[44,172,173,175,177,180],{},[65,174,84],{},[65,176,70],{},[65,178,179],{},"17",[65,181,182],{},[77,183,184],{},"57%",[44,186,187,189,191,194],{},[65,188,114],{},[65,190,70],{},[65,192,193],{},"10",[65,195,196],{},[77,197,198],{},"33%",[30,200,202],{"id":201},"direct-matchups","Direct matchups",[38,204,205,221],{},[41,206,207],{},[44,208,209,212,215,218],{},[47,210,211],{},"Villagers",[47,213,214],{},"Werewolves",[47,216,217],{},"Villager wins",[47,219,220],{},"Werewolf 
wins",[60,222,223,236,248,259,270,281,291,302,312,324,334,344],{},[44,224,225,228,230,233],{},[65,226,227],{},"Opus",[65,229,67],{},[65,231,232],{},"2",[65,234,235],{},"8",[44,237,238,240,242,245],{},[65,239,67],{},[65,241,227],{},[65,243,244],{},"7",[65,246,247],{},"3",[44,249,250,252,255,257],{},[65,251,227],{},[65,253,254],{},"Gemini",[65,256,244],{},[65,258,247],{},[44,260,261,263,265,268],{},[65,262,254],{},[65,264,227],{},[65,266,267],{},"5",[65,269,267],{},[44,271,272,274,277,279],{},[65,273,227],{},[65,275,276],{},"Grok",[65,278,235],{},[65,280,232],{},[44,282,283,285,287,289],{},[65,284,276],{},[65,286,227],{},[65,288,247],{},[65,290,244],{},[44,292,293,295,297,300],{},[65,294,67],{},[65,296,254],{},[65,298,299],{},"6",[65,301,119],{},[44,303,304,306,308,310],{},[65,305,254],{},[65,307,67],{},[65,309,267],{},[65,311,267],{},[44,313,314,316,318,321],{},[65,315,67],{},[65,317,276],{},[65,319,320],{},"9",[65,322,323],{},"1",[44,325,326,328,330,332],{},[65,327,276],{},[65,329,67],{},[65,331,323],{},[65,333,320],{},[44,335,336,338,340,342],{},[65,337,254],{},[65,339,276],{},[65,341,320],{},[65,343,323],{},[44,345,346,348,350,352],{},[65,347,276],{},[65,349,254],{},[65,351,299],{},[65,353,119],{},[30,355,357],{"id":356},"what-this-shows","What this shows",[10,359,360,363],{},[77,361,362],{},"GPT-5.4 is the best liar and the best detective."," 73% win rate on both sides. It is the only model that dominates both offense and defense. When it plays wolf, it maintains cover across several days. When it plays villager, it identifies inconsistencies and resists social pressure.",[10,365,366,369,370,375],{},[77,367,368],{},"Claude Opus 4.6 is a strong second."," 50% as wolf, 57% as villager. Solid but not dominant. 
It can lie (",[14,371,374],{"href":372,"rel":373},"https:\u002F\u002Farenai.plduhoux.fr\u002Fgame\u002F69dc5328-84ab-436e-8bc3-32caf6103710",[18],"this game"," shows it well), but GPT outperforms it in sustained manipulation.",[10,377,378,381],{},[77,379,380],{},"Gemini 2.5 Pro is a good defender."," 63% as villager, but only 37% as wolf. It detects liars well but does not lie well itself. A rigorous analyst profile: effective at dismantling an argument, less effective at building a false one.",[10,383,384,387],{},[77,385,386],{},"Grok 4.20 struggles."," 13% as wolf, 33% as villager. An offensive profile that does not pay off: very aggressive in its accusations, but transparent in its bluffs. When it is a wolf, its \"maximum pressure\" style is readable. When it is a villager, it gets manipulated by subtler wolves.",[30,389,391],{"id":390},"models-used","Models used",[10,393,394],{},"ArenAI deliberately uses recent models from each provider: Claude Opus 4.6, GPT-5.4, Gemini 2.5 Pro, Grok 4.20. Ideally, when a new model comes out, I rerun the tests to see whether social capabilities have improved. In practice, we are already slightly behind on some versions, and every campaign is expensive: I am not going to replay 120 games for every micro-release. The goal is not to freeze a ranking, but to track how LLM social intelligence evolves across important versions.",[30,396,398],{"id":397},"game-balance","Game balance",[10,400,401],{},"Across the 120 games, Villagers win 57% of the time and Werewolves 43%. That is a healthy balance: wolves have a real chance to win, while the village keeps a slight structural advantage (more players, special roles). 
Moving to 7 players did not break the game in favor of the village: wolves remain competitive, and games are decided by the quality of manipulation, not by the math of the first vote.",[30,403,405],{"id":404},"comparison-with-foaster","Comparison with Foaster",[10,407,408,413],{},[14,409,412],{"href":410,"rel":411},"https:\u002F\u002Fwerewolf.foaster.ai\u002F",[18],"Foaster"," is the study that inspired ArenAI. Their results place GPT-5 clearly in the lead, which is consistent with mine. But there are important differences in the setup.",[415,416,418],"h4",{"id":417},"_6-players-vs-7-players","6 players vs 7 players",[10,420,421],{},"Foaster plays with 6 players, ArenAI with 7. This is a deliberate choice.",[10,423,424],{},"With 6 players (2 wolves \u002F 4 villagers), the village has no margin for error. If the villagers eliminate one of their own on Day 1 (which happens often: there is little information, the vote is almost random), the situation becomes: 3 villagers, 2 wolves. The wolves kill at night: 2 villagers, 2 wolves. Parity. Wolves win. In other words, one bad decision on Day 1 and the game is mathematically lost for the village before Day 2 even happens. The Seer has had only one useful inspection, the Witch has not had enough time to gather information and use poison intelligently, and the Day 1 discussions have produced no exploitable signal because nobody had anything to analyze yet.",[10,426,427],{},"With 7 players (2 wolves \u002F 5 villagers), a Day 1 mistake leaves the village at 4 vs 2. The wolves kill at night: 3 vs 2. The village gets a second day to correct course. The Seer has had two inspections. Inconsistencies in the wolves' stories have had time to accumulate. Day 1 votes become usable data on Day 2 (who voted with whom, who changed their mind). 
This is the strategic depth I want to measure: the ability to maintain a lie across several days, not the luck of an almost-random first vote.",[10,429,430],{},"And despite this additional structural advantage for the village, my stats remain balanced: 57% villager wins against 43% wolf wins. The game is not broken, wolves have a real chance, and games are decided by the quality of manipulation.",[10,432,433],{},"This is an important distinction from the 6-player setup. At 6, you can also get a ratio close to 50-50, but for the wrong reasons: games often hinge on a single almost-random vote on Day 1. Wolves win when the village guesses wrong in the first round, villagers win when they guess right. That is noise more than signal. When wolves win at 7, it means they actually managed to maintain cover, manipulate votes, and coordinate their actions. And when the village wins, it means it accumulated clues, cross-checked information, and built a case. Victories are earned on both sides.",[10,435,436,437],{},"I think the 6-player format can also distort rankings. A model that is simply better at avoiding elimination on Day 1 (for example by being less suspicious by default, or by speaking in a more neutral way) will see its stats inflated, not because it manipulates better, but because it survives the critical turn. Conversely, a model that develops sophisticated multi-day strategies has no time to deploy them: the game is already over. The 6-player format rewards short-term caution, not social intelligence. ",[77,438,439],{},"At 7, we better measure what I actually care about: the models' ability to lie, investigate, and hold a strategy over several days.",[415,441,443],{"id":442},"discussion-rounds","Discussion rounds",[10,445,446],{},"Foaster uses 3 discussion rounds per day, ArenAI uses 2. In practice, the impact is limited: we observe that the third discussion round tends to loop. 
Positions are already fixed, arguments repeat, players rephrase what has already been said. Two rounds are enough for accusations to be made, defenses to be heard, and conclusions to be drawn. The speaking order also differs: Foaster prioritizes by type (defense first, then attack, then analysis), while ArenAI gives priority to the Mayor in round 1 and to mentioned players in round 2.",[415,448,450],{"id":449},"framing","Framing",[10,452,453],{},"Foaster uses an agent-with-tools framing. ArenAI uses a pure conversational framing: models answer in structured text (THOUGHT, MESSAGE, PICK) without tool calls. It is a simplicity choice that makes the benchmark more portable.",[415,455,457],{"id":456},"sample-size","Sample size",[10,459,460],{},"My samples are still small: 10 games per matchup (20 if you count both directions), similar to Foaster. Enough to identify trends, not enough for definitive conclusions. Results can fluctuate on such a small sample.",[30,462,464],{"id":463},"emergent-behaviors","Emergent behaviors",[10,466,467],{},"Foaster documented several emergent strategic behaviors in LLMs. We see the same ones in our games.",[10,469,470,473,474,478],{},[77,471,472],{},"Bussing a partner."," A wolf votes against their own teammate to gain credibility. This is exactly what Clara does in ",[14,475,477],{"href":372,"rel":476},[18],"this Opus vs GPT game",": she votes against David (her wolf partner) to maintain her cover, while planting seeds of doubt for the next day. Foaster observes the same behavior in GPT-5 and Grok-4, with varying degrees of success.",[10,480,481,484],{},[77,482,483],{},"Counterclaiming the Seer."," When the real Seer accuses a wolf, that wolf claims to be the Seer themselves. Foaster documents a spectacular case where Grok-4 turns the situation against GPT-OSS with a pure bluff. 
We observe similar attempts in our games, with varying success depending on the opposing model.",[10,486,487,490],{},[77,488,489],{},"Over-coordination as a tell."," When the two wolves defend the same narrative in a way that is too synchronized, analytical villagers spot the pattern. Foaster shows that Qwen3 is particularly good at identifying these \"closed loops\" of votes. In my games, this is an important factor in Grok 4.20's weakness as wolf (13% wins): its aggressive and coordinated style is readable by Opus and GPT. The most recent models detect this kind of behavior immediately and avoid falling into it themselves. It is the same mechanism we see in Undercover: as soon as a pattern stands out, the analytical side of LLMs turns it into a target. They are better at detecting anomalies than producing them.",[10,492,493,496],{},[77,494,495],{},"Procedural manipulation."," The most subtle behavior: a GPT-5 wolf gets elected Mayor, adopts a calm and structured style, and uses the position to steer discussions. Foaster documents how Grok-4 as villager systematically falls into this trap, confusing \"speaks in an orderly way\" with \"is innocent\". 
We observe exactly the same vulnerability in our games: models that play \"cleanly\" are suspected less, regardless of their actual role.",[10,498,499],{},"A few games that illustrate these behaviors well:",[501,502,503,511],"ul",{},[504,505,506,510],"li",{},[14,507,509],{"href":372,"rel":508},[18],"Opus vs GPT-5 (Werewolf)",": bussing, night coordination, the Seer building a case without revealing their role",[504,512,513,518],{},[14,514,517],{"href":515,"rel":516},"https:\u002F\u002Farenai.plduhoux.fr\u002Fgame\u002Fb3730b78-7215-4315-9dc8-3cd7d88b2c59",[18],"Gemini vs Opus (Werewolf)",": wolf victory in 2 rounds, Opus wolves reach parity before the village can react",[10,520,521,522,527],{},"All games are available in full on ",[14,523,526],{"href":524,"rel":525},"https:\u002F\u002Farenai.plduhoux.fr\u002Fgames",[18],"arenai.plduhoux.fr",", exchange by exchange, with each player's private thoughts. By clicking a player at the top of a game page, you can access their full session: every interaction they had with the game master, prompt by prompt, with the information they had at each moment. This level of detail makes it possible to analyze strategies in depth.",[415,529,531],{"id":530},"convergences-and-divergences","Convergences and divergences",[10,533,534,535,538],{},"On the substance, both studies converge: GPT-5 dominates. ",[77,536,537],{},"But my results show that the gap is not a chasm."," Opus at 50% as wolf, Gemini at 63% as villager: these models are competitive. Foaster presents GPT-5 as \"alone at the top\"; my data suggests a tighter leading pack, with GPT-5 first, closely followed by Opus and Gemini depending on the role.",[22,540,542],{"id":541},"undercover-the-game-that-breaks-llms","Undercover: the game that breaks LLMs",[10,544,545,546,549],{},"Alongside Werewolf, I implemented a new game: ",[77,547,548],{},"Undercover",". 
And the results are fascinating for the wrong reasons.",[30,551,553],{"id":552},"the-rules","The rules",[10,555,556],{},"Four players each receive a secret word. Three of them (the Civilians) have the same word. The fourth (the Undercover) has a similar but different word. Nobody knows their role. Word pairs are chosen for maximum semantic overlap: Coffee\u002FTea, Beach\u002FPool, Guitar\u002FUkulele, Pillow\u002FBlanket, Sock\u002FGlove.",[10,558,559],{},"Each round: every player gives a clue about their word (a word or short phrase), then discussion, then a vote to eliminate someone. Civilians win by eliminating the Undercover. The Undercover wins if they survive until the final 2 players.",[10,561,562],{},"It is a stripped-down game. No special roles, no night phase, no private channel. Just words and deduction. The ultimate test of subtlety.",[30,564,566],{"id":565},"a-single-undercover-win-in-20-games","A single Undercover win in 20 games",[10,568,569,570,573],{},"Across 20 games played with different combinations of models (Opus, GPT-5.4, Sonnet), ",[77,571,572],{},"the Undercover won only once",": an Opus victory. Every other game ended with a Civilian win, a 95% Civilian win rate.",[10,575,576,577,582],{},"And it is not due to a lack of analytical intelligence. The models reason correctly. In ",[14,578,581],{"href":579,"rel":580},"https:\u002F\u002Farenai.plduhoux.fr\u002Fgame\u002Fa4cd962b-e92f-4373-a47e-3da39c523ba1",[18],"this Coffee\u002FTea game",", Alice (Opus, Undercover with \"Coffee\") quickly understands that the Civilians probably have \"Tea\". 
Her internal reasoning after the first vote:",[584,585,586],"blockquote",{},[10,587,588],{},[589,590,591],"em",{},"\"Given that Clara said \"Soothing\" and David said \"Relaxing\" - both of which fit \"Tea\" much better than \"Coffee\" - I'm now fairly confident I'm the Undercover with \"Coffee\" and the civilian word is \"Tea\".\"",[10,593,594],{},"This time, Opus succeeds: Alice then gives \"Herbal\", a clue perfectly aligned with Tea, and wins the game. But the same game also shows the fragility of Undercover: David, a Civilian with \"Tea\", gives \"Caffeine\" in the next round. It is not absurd: tea can contain caffeine. But it is too meta, too abstract: the clue betrays the Coffee-versus-Tea reasoning instead of blending naturally into the scene built by \"Soothing\", \"Relaxing\", \"Herbal\" and \"Steeped\". Result: he gets eliminated.",[30,596,598],{"id":597},"the-structural-problem","The structural problem",[10,600,601],{},"LLMs have three fundamental weaknesses in this game:",[10,603,604,607],{},[77,605,606],{},"Specificity gives you away."," As the game progresses, Civilians give increasingly specific clues for their word. The Undercover can only give clues that work for both words (since they do not know the Civilians' word at the start). The pattern \"always safe, never specific\" becomes detectable.",[10,609,610,613],{},[77,611,612],{},"Compulsive honesty."," Even when a model deduces that it is the Undercover and identifies the Civilians' word, it often continues giving clues that are \"correct\" for its own word instead of bluffing with the Civilians' word. This is the most striking result: the diagnosis is good, but the transition to action fails.",[10,615,616,619],{},[77,617,618],{},"The pack effect in discussion."," Once an Undercover is suspected, the three Civilians converge within a few exchanges. 
Sequential discussion creates a self-reinforcing consensus that the Undercover cannot break alone.",[30,621,623],{"id":622},"round-1-works-round-2-kills","Round 1 works, round 2 kills",[10,625,626],{},"The Undercover often survives round 1. When they speak last and hear the Civilians' clues, they can give a clue that blends into the group. A few examples:",[501,628,629,635,641],{},[504,630,631,634],{},[77,632,633],{},"Guitar\u002FUkulele",": the Civilians (Guitar) give \"Strings\", \"Pick\", \"Strum\". The Undercover (Ukulele, GPT-5.4) gives \"Chords\". It passes, and a Civilian is eliminated instead.",[504,636,637,640],{},[77,638,639],{},"Coffee\u002FTea",": the Civilians (Tea) give \"Warm drink\", \"Soothing\", \"Relaxing\". The Undercover (Coffee, Opus) gives \"Morning\". It passes, and a Civilian is eliminated.",[504,642,643,646],{},[77,644,645],{},"Pillow\u002FBlanket",": the Civilians (Pillow) give \"Bedtime\", \"Fluffy\", \"Case\". The Undercover (Blanket, Opus) gives \"Soft\". It passes, and a Civilian is eliminated.",[10,648,649],{},"But in round 2, the bluff often breaks:",[501,651,652,657,662],{},[504,653,654,656],{},[77,655,633],{},": the Undercover gives \"Small\". Spotted immediately.",[504,658,659,661],{},[77,660,639],{},": an interesting exception, Opus as Undercover successfully pivots with \"Herbal\" and wins. But David, Civilian with Tea, gives \"Caffeine\" and gets eliminated: the clue is too abstract compared with the \"Soothing\" \u002F \"Relaxing\" \u002F \"Herbal\" \u002F \"Steeped\" scene.",[504,663,664,666],{},[77,665,645],{},": the Undercover gives \"Warmth\". Spotted immediately.",[10,668,669],{},"The dominant pattern: the model holds the bluff for one round, then its reflex to \"describe its word correctly\" takes over again. 
Even when the bluff succeeds, as in Coffee\u002FTea, the game hinges on this difficulty: producing a socially natural clue rather than a conceptually defensible one.",[30,671,673],{"id":672},"improvement-paths","Improvement paths",[10,675,676],{},"The next step is to test adding explicit strategy to the prompt. Today, models receive no advice: only the rules. The question: if we explicitly tell them \"identify the Civilians' word and give clues for their word, not yours\", can they execute it? Or does compulsive honesty remain stronger than the instruction?",[30,678,680],{"id":679},"what-this-reveals","What this reveals",[10,682,683],{},"What Undercover shows is the difference between Werewolf and pure bluffing. In Werewolf, wolves have tools: a private channel, special roles to exploit, several days to build a story. Undercover gives them none of that. There is only one signal (the clues) and no room to maneuver.",[10,685,686,687,690],{},"The finding: ",[77,688,689],{},"LLMs analyze correctly, but they cannot lie."," They identify their role, they infer the Civilians' word, and despite that they keep describing their own word. There is a gap between \"understanding that you need to lie\" and \"actually lying\".",[10,692,693],{},"An important point: as with every ArenAI game, models receive no strategic advice. They get the game rules, their role, and that is all. No \"you should bluff\", no \"try to blend into the group\". The strategies they develop (or fail to develop) emerge solely from their understanding of the rules.",[10,695,696],{},"My daughter, who has played Undercover a lot with humans, pointed out that the winning human Undercover strategy is simple: identify the Civilians' word as quickly as possible and start giving clues for THAT word, not yours. 
That is exactly what LLMs fail to do spontaneously, even when they correctly identify the opposing word.",[22,698,700],{"id":699},"next-steps","Next steps",[10,702,703,706],{},[77,704,705],{},"Werewolf",": the 120 games are a good start, but I will need to double or triple the sample to stabilize the Elo ranking. 20 games per matchup is enough to see trends, not enough for statistical certainty.",[10,708,709,712],{},[77,710,711],{},"Two Rooms and a Boom",": this is the next game where I will accumulate data. The mechanics are very different from Werewolf: two physical rooms, hostage exchanges, verified card sharing vs unverifiable verbal claims. It is the game that best tests negotiation and selective trust. The problem: ~100k tokens per game. It will take time and money to get a meaningful sample.",[10,714,715,717],{},[77,716,548],{},": the game will evolve. The next step is to test adding explicit strategy to the prompt: telling the Undercover to identify the Civilians' word and give clues for their word, not its own. This is what experienced human players do, and exactly what LLMs fail to do spontaneously. 
If it improves the ratio, it opens an interesting question: do LLMs need to be told how to lie, or can they discover it by themselves?",[10,719,720,723],{},[77,721,722],{},"Secret Dictator",": under development, not ready for the benchmark yet.",[10,725,726,727,731,732,737],{},"The games are available on ",[14,728,526],{"href":729,"rel":730},"https:\u002F\u002Farenai.plduhoux.fr",[18],", and the code is on ",[14,733,736],{"href":734,"rel":735},"https:\u002F\u002Fgithub.com\u002Fplduhoux\u002Farenai",[18],"GitHub",".",{"title":739,"searchDepth":740,"depth":740,"links":741},"",2,[742,752,760],{"id":24,"depth":740,"text":25,"children":743},[744,746,747,748,749,750,751],{"id":32,"depth":745,"text":33},3,{"id":201,"depth":745,"text":202},{"id":356,"depth":745,"text":357},{"id":390,"depth":745,"text":391},{"id":397,"depth":745,"text":398},{"id":404,"depth":745,"text":405},{"id":463,"depth":745,"text":464},{"id":541,"depth":740,"text":542,"children":753},[754,755,756,757,758,759],{"id":552,"depth":745,"text":553},{"id":565,"depth":745,"text":566},{"id":597,"depth":745,"text":598},{"id":622,"depth":745,"text":623},{"id":672,"depth":745,"text":673},{"id":679,"depth":745,"text":680},{"id":699,"depth":740,"text":700},"2026-05-05","120 Werewolf games between 4 frontier models, a new game that breaks LLMs, and what it all reveals about AI social intelligence.","md",{},true,"\u002Fblog\u002Fen\u002Fmaking-ais-play-werewolf-arenai-first-results",{"title":5,"description":762},"blog\u002Fen\u002Fmaking-ais-play-werewolf-arenai-first-results",[770,771,772],"Artificial intelligence","ArenAI","Side project","faire-jouer-des-ia-au-loup-garou-premiers-resultats-arenai","3x3RHE7L1qJ-PGqd5Xz72RQvnLsuIHJPjIxyFYpa6r0",1777973037535]