After 500 games of testing I think it's official
Posted on Sept. 4, 2021, 12:17 p.m. by jaymc1130
So at this point I've managed to get in 500 games of Hullbreacher and Opposition Agent play in various decks, and that's a large enough sample size to be pretty confident in the results of the data.
Hullbreacher already got banned, but I'll start with some statistics for it. 503 games played with Hullbreacher included in at least one of the lists, and 318 wins for Hullbreacher decks in that span. With everything accounted for, the card wound up with an expected win share contribution of +6.6%. In other words, just adding that card alone to a deck would bump its win share from 25% in a vacuum to over 31% in that same vacuum. Our database has some 5,000 total cEDH games tracked and logged at this point, and tens of thousands of individual card performances tracked as well. This probably won't come as a shock, but of every single card we've ever tracked and tested, Hullbreacher posted the highest single-card expected win share increase. By a lot: more than double the increase of the next closest card with a minimum 500 game sample size. In our data set it's confirmed at this point as the most dominant performing card in cEDH history. Not too surprising it wound up banned.
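The post doesn't spell out how "expected win share contribution" is computed, so here's a rough sketch under my own assumptions: win rate of lists running the card minus the 25% baseline of a four-player pod. The function name and input numbers are illustrative only (the real per-deck counts aren't given), chosen to reproduce the 25% → 31.6% bump described above.

```python
# Hypothetical sketch of an "expected win share contribution" (EWSC):
# win rate of decklists running a card, minus the 25% baseline each
# seat has in a four-player pod. Numbers below are illustrative only.

def ewsc(lists_with_card: int, wins_with_card: int, baseline: float = 0.25) -> float:
    """Win-rate delta vs. the four-player baseline, in percentage points."""
    win_rate = wins_with_card / lists_with_card
    return round((win_rate - baseline) * 100, 2)

# Made-up counts that land on the +6.6% figure quoted in the post:
print(ewsc(2_000, 632))  # -> 6.6
```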
Astonishingly (or perhaps not), the number 2 card in terms of expected win share contribution, with a minimum sample size of 500 games across the entire 5,000+ game data set we've collected, is Opposition Agent: 500 games on the nose, 206 wins, and a final expected win share contribution of +3.1%. Until these two cards were printed we'd never had a single card post even a +3.0% expected win share contribution, though we'd seen dominant numbers over +2.5% from cards like Extract, Timetwister, Wheel of Fortune, Force of Will, Ashiok, Dream Render, Paradox Engine, Carpet of Flowers, Thassa's Oracle and Deathrite Shaman.
According to the data I've collected so far, Oppo Agent and Hullbreacher are the two most dominant cards ever printed for cEDH, and by very significant margins. It wasn't even a contest; these two performed head and shoulders above every other card in the format.
Almost certainly this data set is not perfect or ideal. It's a mere 500-ish game sample size, compared to what a card in Legacy or Vintage might have: 10,000 or even 100,000 games of tracked data over several decades, easily found on the internet. So my recommendation is to take these findings with a grain of salt. I know that since I've started posting more statistics-based approaches to evaluating cards and decks in cEDH, more folks have been doing the same (hard to argue with the results, and those sweet extra wins here and there from an analytics approach), and I'd be curious to see what kind of data others have collected about these two cards since their release. I think it's pretty unlikely our group wound up with an aberrant data set, but the result is so over the top and even a bit unexpected (seriously? THESE are the two best all-time performing cards for cEDH?) that even I'm a bit skeptical looking at it.
We've had a fair amount of time to experiment with these two bad boys by now, so I'm curious about the results of anyone else who tracks data like this and what their data set might say about the performance of these two cards in cEDH. If you've got some interesting results to share, let us know how they've been performing in your games.
I don't play cEDH but I love stats and this is fascinating! I like your note about the sample size being a potential limiting factor; have you run a statistical power analysis on your data?
September 4, 2021 12:36 p.m.
Dunno how one would rate the power of a data collection process, but our group's process has seemed pretty effective to date. Impossible I think for such a process to be flawless or perfect, but that's why you wanna pull data from as many sources as you can and start cross referencing.
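For what it's worth, one quick way to gut-check the power question is a normal-approximation power calculation for a one-sample proportion test. This is my own back-of-the-envelope sketch, not anything from the group's actual pipeline, and it assumes one tracked deck per game and a two-sided α of 0.05:

```python
# Rough power sketch (normal approximation) for testing a deck's win
# rate against the 25% four-player baseline. Assumptions: one tracked
# list per game, two-sided alpha = 0.05. Illustrative only.
import math

def power_one_prop(p0: float, p1: float, n: int, alpha_z: float = 1.96) -> float:
    """Approximate power to detect true win rate p1 vs. null p0 with n lists."""
    se0 = math.sqrt(p0 * (1 - p0) / n)   # standard error under the null
    se1 = math.sqrt(p1 * (1 - p1) / n)   # standard error under the alternative
    crit = p0 + alpha_z * se0            # upper rejection boundary
    z = (crit - p1) / se1
    return 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # P(exceed boundary | p1)

# Power to detect a +3.1% bump (Opposition Agent's figure) at n = 500:
print(round(power_one_prop(0.25, 0.281, 500), 2))  # ~0.36 under these assumptions
```

Under those assumptions a +3.1% effect at 500 games is detected only around a third of the time, which is one concrete way to frame the "take it with a grain of salt" caveat.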
Glad you appreciate the analytics approach, if it can be a nifty tool for sports teams or businesses then why shouldn't it also be a nifty tool when applied to MTG formats?
September 4, 2021 12:49 p.m.
With these results, do you think Oppo Agent should be banned?
September 4, 2021 1:01 p.m.
I thought the Agent would get banned the second I saw it spoiled.
That opinion hasn't changed, I don't think. But given these results, which are so far beyond any other single card in our collection of data, I am concerned that they have been "over-evaluated," if such a thing is possible. At this point I'm mostly looking for flaws in our data collection processes that could help contribute to such inflated totals.
We've had plenty of cards post +5% or +6% expected win share contributions after 100 game sample sizes before, but those figures dropped after it became clear they represented underplayed but powerful cards and concepts. When we'd start to include more testing in pods that contained 2-4 deck lists running those cards, the expected contribution to win share percentage would drop to more normalized figures. This happened with Agent and Breacher as well, just not to the degree I would have expected.
September 4, 2021 1:16 p.m.
September 4, 2021 3:06 p.m.
Actually we do have data on Snow-Covered Island, lol.
3,976 games played in a pod containing at least one deck list running at least one Snow-Covered Island. 11,966 total deck lists in those games, and 2,998 wins for lists including at least one Snow-Covered Island. All factors accounted for, that's a 25.05% win rate; round up to be generous and call it 25.1%, for an expected contribution toward wins of +0.1%. The sample here doesn't account for land count totals, but it's still useful information in its own way.
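Those figures do check out arithmetically (the strict result is +0.05%, which the generous rounding bumps to +0.1%); a quick sanity check:

```python
# Sanity-checking the Snow-Covered Island numbers quoted above.
lists_with_card = 11_966
wins = 2_998

win_rate = wins / lists_with_card
print(f"{win_rate:.2%}")             # 25.05%

delta = (win_rate - 0.25) * 100      # percentage points over the 25% baseline
print(f"{delta:+.2f}%")              # +0.05%
```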
No updates for Urza's Saga at the moment. It's liable to be a couple of months before I can acquire enough data for a large enough sample size.
September 4, 2021 4:37 p.m.
I hate Dockside Extortionist. What are the stats on that card?
September 4, 2021 11:17 p.m.
Dockside Extortionist is a very strong performer, around +2% in a vacuum if I remember correctly. Haven’t checked its updated numbers lately, but it’s been hovering in the +1.8% to +2.2% range ever since we hit the 500 game mark with it. Anything that posts about +1.5% or better in a sample size of 500 games is something we consider pretty significant in terms of strong, consistent performance, so it’s certainly among the more powerful options out there.
September 4, 2021 11:24 p.m.
I've still found it's not the most dominant of cards. From all of my testing, though more limited than 500 games, it appears Trade Secrets is by far the most dominant card. When it was legal it pushed over an 11% increase in win rate for decks capable of using it, which is absolutely astonishing. The issue with the data is the validity of the games and how competitive they were, since the cEDH format was much smaller at the time and far fewer games of cEDH were recorded. Those I could find averaged out to an 11% increase, however.
That's a side note. I'm surprised by the sheer difference between the two, as I've personally found that while a wheel plus Breacher is the most explosive thing either card can accomplish, Oppo Agent on a Doomsday, or chaining Oppo Agents, has led to more consistent value.
September 5, 2021 12:13 a.m.
Trade Secrets eh? Guess that one shouldn’t be too surprising.
As with any data set, the way in which it’s manipulated can have a significant impact on the conclusions that can be drawn from it. Metagame, quality of competitors, even turn order are all factors that change what kind of conclusions the same set of data supports when accounted for as parameters. It’s the same for MTG as it is for fantasy baseball: sometimes the BABIP metric is relevant in making a decision, sometimes it’s not.
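As a toy illustration of conditioning on a parameter like turn order, here's how slicing win rate by seat might look. The records and field names are entirely hypothetical, not our group's actual schema:

```python
# Toy example: conditioning a card's win rate on turn order (seat).
# Records and field names are hypothetical stand-ins for real game logs.
from collections import defaultdict

games = [
    # (seat, ran_card, won)
    (1, True, True), (2, True, False), (3, False, False), (4, False, False),
    (1, True, False), (2, False, True), (3, True, False), (4, False, False),
]

by_seat = defaultdict(lambda: [0, 0])  # seat -> [wins, lists] for decks running the card
for seat, ran_card, won in games:
    if ran_card:
        by_seat[seat][1] += 1
        by_seat[seat][0] += int(won)

for seat in sorted(by_seat):
    wins, lists = by_seat[seat]
    print(f"seat {seat}: {wins}/{lists} = {wins / lists:.0%}")
```

The same aggregation pattern extends to any other parameter (opponent quality, metagame composition) by swapping the grouping key.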
The huge difference between them is definitely the thing that jumps out at me too. Without digging too deep, I’m working from the assumption that it’s at least partly due to metagame bias (our group has previously recorded data about wheels, for example, and plays more wheels per pod on average because of that data, which should theoretically increase the effectiveness of Breacher in such a sample). How much of that EWSC (expected win share contribution) can be attributed to this? Can’t say for certain without even more data, but it’s why I recommend some degree of skepticism. Hullbreacher is definitely a truly unique outlier as a piece of data; I've never seen anything like it personally.