Joe Regenstein, CPA, FPAC

Harnessing Advanced Data Analysis Beta: A Deep Dive into the 2023 Baseball World Series Through ChatGPT

Photo by Daniel Lee on Unsplash

The emergence of sophisticated tools like Advanced Data Analysis in ChatGPT has revolutionized our ability to dissect complex datasets and unearth actionable insights. The 2023 Baseball World Series showcased a historic victory for the Texas Rangers, marking their first Major League Baseball championship, and it also provides a great dataset. Behind each pitch, hit, and home run lay a trove of data waiting to be decoded. Fortunately, baseball lends itself to data analysis: each play pits one pitcher against one batter with a definable result, a point highlighted in Michael Lewis's Moneyball.

This experiment aimed to harness the new feature and see how it managed a multi-dimensional dataset from this series. By training ChatGPT on batting, pitching, and game logs, I wanted to explore how deeply AI can navigate baseball analytics. In the appendix, I provide the Q&A (ChatGPT took some liberties), a link to the ChatGPT session, the ChatGPT transformed Excel file, and a link to Baseball-Reference.com where the play-by-play and game stats were downloaded. I also posted an article ChatGPT authored titled "Rangers Rise to Glory: Underdog Texas Snags World Series in Five-Game Thriller."

This championship, beyond its sporting excitement, narrates a tale of a remarkable turnaround. The Rangers rebounded from 102 losses in 2021 and 94 losses in 2022 to clinch the title in 5 games. We saw manager Bruce Bochy come out of retirement and join the likes of Joe Torre and Connie Mack to win at least four titles.

I have not watched this much baseball since the Cubs won in 2016, when Aroldis Chapman got his first ring. Now the Rangers reliever has two. The game needed to be faster for me, especially at the end of a close game. The addition of a pitch clock and limiting timeouts has made a difference. At least enough for me to watch my adopted hometown Texas Rangers come back against the Houston Astros and carry that momentum to the World Series. However, I'm no baseball stat head and needed help analyzing the series.

The Advanced Data Analysis (ADA) feature in ChatGPT was formerly known as Code Interpreter (CI). CI was an agent developed so ChatGPT could do math and work with the code it was provided. If you were interacting through the API, you could harness various agents in your interactions, but this wasn't an option for users in Chat. The major enhancement with ADA is that users can upload files, which is necessary when ChatGPT isn't trained on the data. The ability to gather data from documents, create new content the AI can analyze, and process unstructured data like the play-by-play descriptions of a baseball game is a game changer. Lastly, it shows the behind-the-scenes code being created for the analysis, which helps the user troubleshoot and verify the steps taken. The generated code runs in a sandboxed environment for security reasons. We don't want ChatGPT executing code on our computers.

There is also another beta feature to search the web using Bing and analyze what it finds. I could have simply asked ChatGPT to tell me about the 2023 World Series, but what fun would that be?

Setting the Stage

ChatGPT is now trained on data through April of 2023 and has no idea how the World Series turned out. This gives us the perfect opportunity to take recent data and take the ADA feature for a test drive.

I looked for a few places to get data and found www.Baseball-Reference.com to be the best option. It allowed for free data downloads in multiple formats. I ultimately used CSV which I saved in a text file since Excel kept transforming scores such as 2-2 into dates. Baseball-Reference data could also be exported as an HTML table complete with styling, a format specifically for Reddit, or using markup language for Wiki pages. This provided batting, pitching, and play-by-play data which was ideal for training AI on a new event. However, the data format needed a lot of work. ChatGPT's ADA came in handy for cleaning up the data. I provided the following:

Game Logs

Each game has a log that captures the play-by-play with the inning, pitch count, batter, pitcher, and some notes. This is great data, but there are rows for the top and bottom of each inning that summarize the runs, hits, runners left on base, and score. It also notes pitcher personnel changes. I asked ChatGPT to strip out these summaries and save them on a new tab in an Excel file with a few new columns. Since all games would end up on one tab, we needed the game number, a sequence number to maintain order, the inning number, and whether it was the top or bottom of the inning. Early iterations with ChatGPT showed that the t1 code for the top of the 1st inning was confusing the assistant. Below is an example of the comma-delimited data; you can see how this could be confusing. The first row holds the column headers, but the second row is a summary sitting in the 'Inn' column, which makes it difficult to split that column into the inning number and whether it was the top or bottom.

Top of the 1st Inning in Game 1
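
The splitting step described above can be sketched in Python. This is a minimal illustration, not the code ChatGPT generated: the sample rows, the `Score` and `Batter` column names, the wording of the summary row, and the helper names `parse_inning` and `split_rows` are all assumptions. Reading the export with the `csv` module keeps every value as text, so a score such as 2-2 is never turned into a date, and any `Inn` value that is not a code like `t1` or `b11` is treated as a summary row.

```python
import csv
import io
import re

# Hypothetical stand-in for a play-by-play export; the real file has more columns.
SAMPLE = """Inn,Score,Batter
"Top of the 1st, summary text",,
t1,0-0,Batter A
t1,2-2,Batter B
b1,0-0,Batter C
"""

# 't' or 'b' followed by the inning number, e.g. 't1' or 'b11'.
INNING_CODE = re.compile(r"^([tb])(\d+)$")


def parse_inning(value):
    """Return (half, inning) for codes like 't1'; None for anything else."""
    match = INNING_CODE.match(value.strip())
    if match is None:
        return None
    half = "Top" if match.group(1) == "t" else "Bottom"
    return half, int(match.group(2))


def split_rows(game, rows):
    """Separate play rows from summary rows, tagging each with game and sequence."""
    plays, summaries = [], []
    for seq, row in enumerate(rows, start=1):
        parsed = parse_inning(row["Inn"])
        if parsed is None:
            summaries.append({"Game": game, "Seq": seq, **row})
        else:
            half, inning = parsed
            plays.append(
                {"Game": game, "Seq": seq, "Half": half, "Inning": inning, **row}
            )
    return plays, summaries


plays, summaries = split_rows(1, csv.DictReader(io.StringIO(SAMPLE)))
print(len(plays), len(summaries))
print(plays[1]["Score"])  # still the text "2-2", not a date
```

The sequence number is assigned before the rows are separated, so the original order can be rebuilt later even though plays and summaries land on different tabs.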

I also had ChatGPT split out the 'Pit(cnt)' column into 'Pitch Count', Balls, and Strikes. In the first at-bat above, there were 3 pitches with a count of no balls and 2 strikes. The letters that follow the count (3(0-2) FSX) are the pitch sequence and the result. In this case, it was a foul, a strike, and the batter putting the ball into play. The result was a Game Log and Game Summary tab.

Game Log Output
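
As a rough sketch of how a 'Pit(cnt)' value could be split, the snippet below uses a regular expression. The function name `parse_pit_cnt` and the output keys are illustrative assumptions, and the pattern tolerates an optional comma after the pitch total in case the export includes one. It is not the code ChatGPT produced in the session.

```python
import re

# Pitch total, optional comma, "(balls-strikes)", then the pitch-sequence letters.
PIT_CNT = re.compile(r"^\s*(\d+),?\s*\((\d+)-(\d+)\)\s*(\S*)\s*$")


def parse_pit_cnt(value):
    """Split a 'Pit(cnt)' value into pitch count, balls, strikes, and sequence."""
    match = PIT_CNT.match(value)
    if match is None:
        return None  # blank or unexpected cell: leave it for manual review
    pitches, balls, strikes, sequence = match.groups()
    return {
        "Pitch Count": int(pitches),
        "Balls": int(balls),
        "Strikes": int(strikes),
        "Sequence": sequence,
    }


print(parse_pit_cnt("3(0-2) FSX"))
# {'Pitch Count': 3, 'Balls': 0, 'Strikes': 2, 'Sequence': 'FSX'}
```

Returning `None` for cells that do not match, rather than raising an error, lets summary or blank rows pass through without stopping the cleanup.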

Batting

Series Batting Stats provided by Baseball-Reference.com

A file was provided for each Team that showed the number of games the batter appeared in, at-bats, runs, and whether a hit was a double, triple, or home run. To clean this data, a column was created for the Team Name so everything could be compiled on one tab. I also asked that the regular season data be moved under the playoff data and a new column created to indicate what type of stat it was. For some players or data points, there were asterisks or various symbols. I had ChatGPT remove these since I was going to ask some questions that would require ChatGPT to do math and didn't want to end up with errors. ChatGPT wrote a regular expression function to clean these up.
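
The regular expression ChatGPT wrote is not reproduced in this post, so the following is only a sketch of what such a cleanup might look like. The set of marker characters (`*`, `#`, `+`, `^`) and the helper names `clean_text` and `to_number` are assumptions; the real export may use different symbols.

```python
import re

# Assumed annotation symbols; adjust to whatever actually appears in the export.
MARKERS = re.compile(r"[*#+^]")


def clean_text(value):
    """Drop annotation symbols and surrounding whitespace from a cell."""
    return MARKERS.sub("", value).strip()


def to_number(value):
    """Convert a cleaned cell to int or float; empty cells become None.

    Raises ValueError if the cell is not numeric after cleaning.
    """
    cleaned = clean_text(value)
    if cleaned == "":
        return None
    try:
        return int(cleaned)
    except ValueError:
        return float(cleaned)


print(clean_text("Player A*"))
print(to_number("12*"), to_number(".270#"))
```

Converting to numbers only after the symbols are stripped is what keeps later math, such as summing hits, from failing on a stray asterisk.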

There was one flaw here: the Diamondbacks' row total was assigned, as playoff performance, to an Arizona player who didn't appear in the series. The issue became apparent later while working with ChatGPT on the 2023 World Series analysis, but ChatGPT was able to ignore that data. If Arizona had hit 54 home runs in 5 games, they would most certainly have been the victors.

ChatGPT finding an error and excluding data

The Team and stat type columns came in handy during the analysis when trying to compare the series performance of individual players to the regular season. They also eliminated issues with having duplicate column headers.

Pitching

Rangers Pitching

I asked that the pitching files be compiled similarly to batting so we had one pitching tab.

Glossaries

Play-By-Play Glossary provided by Baseball-Reference.com

Each data source had a glossary that provided an explanation of the column headers. ChatGPT did reference these throughout the analysis.

Training ChatGPT

With the relevant data compiled in an Excel file, I loaded it into a new chat and provided a prompt:

You are a sports writer but missed the 2023 World Series which just concluded, you don't even know who won, but have been provided a data set to help you analyze the series. You weren't able to see the games but have been provided data from the series. For each game, logs and a summary have been provided from www.sports-reference.com. In addition to the game logs and summaries you are provided the batting and pitching statistics for each team for the playoff series and also the regular season for comparison purposes. There are also glossaries for the bating, pitching and pitch sequence data.

For the analysis only use the data provided but you can use available knowledge to craft how to think about baseball and provide interesting insights when answering questions which are provided on the last tab for your reference. Please create a document with the questions, answers and any tables or items that would be interesting to your readers. After answering the questions write a 1000 word sports article about the 2023 world series for anyone else who missed the series.

I wanted ChatGPT to only use the baseball data provided for the analysis but also benefit from any training that it had on analyzing baseball games. In response to the prompt, ChatGPT laid out a game plan for tackling the analysis and answered the first of two questions:

Training 1 Training 2 Training 3

From here on out, we would move one question at a time. I would verify data where possible and help out from time to time.

Asking the Hard Questions

As shown in the first question, there was no data table provided that had the game, the team that won, and a score. Throughout the analysis, ChatGPT started with one plan but changed gears when it couldn't find what it was looking for on a tab or the answer didn't make sense. The player credited with 54 World Series home runs, mentioned earlier, was one such error in the data. The tool showed adaptability and sanity-checked its results.

There were occasions when the result was found the hard way. When asked the most common way batters got on base, ChatGPT went to the game logs and parsed the events for each at-bat along with the pitch sequence. It then went to the play description. However, this data was on the batting tab and only required a little math: back out all the doubles, triples, and home runs from the hit count to come up with singles.
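
The arithmetic from the batting tab can be sketched as follows. This is an illustration rather than ChatGPT's generated code: the column names `H`, `2B`, `3B`, `HR`, `Team`, and `Stat Type` are assumed, and the rows are made-up placeholders, not series data.

```python
def singles(row):
    """Singles = hits minus extra-base hits (doubles, triples, home runs)."""
    return row["H"] - row["2B"] - row["3B"] - row["HR"]


def singles_by_team(rows, stat_type="Playoffs"):
    """Total singles per team, counting only rows of the requested stat type."""
    totals = {}
    for row in rows:
        if row["Stat Type"] != stat_type:
            continue
        totals[row["Team"]] = totals.get(row["Team"], 0) + singles(row)
    return totals


# Placeholder rows for illustration only.
rows = [
    {"Team": "Team A", "Stat Type": "Playoffs", "H": 10, "2B": 2, "3B": 1, "HR": 3},
    {"Team": "Team A", "Stat Type": "Regular Season", "H": 150, "2B": 30, "3B": 2, "HR": 25},
    {"Team": "Team B", "Stat Type": "Playoffs", "H": 8, "2B": 1, "3B": 0, "HR": 1},
]
print(singles_by_team(rows))  # {'Team A': 4, 'Team B': 6}
```

Filtering on the stat type first matters here, because mixing regular season rows into the totals would swamp the playoff numbers.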

ChatGPT Generated Python Code

I asked if it would have been easier to get the same answer using a different method.

ChatGPT Generated Python Code

There were still some issues, and I reminded the assistant that we needed playoff-only numbers, but it came up with a slightly different answer than it had by parsing the game logs.

When it came to parsing innings ChatGPT got hung up on game 1 which had 11 innings and tried to calculate the number of pitches per inning. The assistant realized this wasn't reasonable and started to look for a different way to calculate it. When it got hung up I provided some guidance.

Chat GPT Error 2

This ultimately worked, and I was able to verify the output against the data: game 1 went 11 innings and the rest went 9, for 47 in total.
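
One way to run that inning check is to take the highest inning number seen in each game and add them up. The sketch below assumes play rows that carry `Game` and `Inning` values, as in the cleaned game log, and uses a handful of stand-in rows instead of the full log; `innings_per_game` is a hypothetical helper name.

```python
def innings_per_game(plays):
    """Map each game number to the highest inning number seen in its plays."""
    longest = {}
    for play in plays:
        game, inning = play["Game"], play["Inning"]
        longest[game] = max(longest.get(game, 0), inning)
    return longest


# Stand-in rows: game 1 reaches the 11th inning, games 2 through 5 reach the 9th.
plays = [{"Game": 1, "Inning": 3}, {"Game": 1, "Inning": 11}]
plays += [{"Game": game, "Inning": 9} for game in range(2, 6)]

by_game = innings_per_game(plays)
print(by_game)
print(sum(by_game.values()))  # 11 + 4 * 9 = 47
```

Dividing the total pitch count by this inning total then gives the pitches-per-inning figure.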

One question revolved around the teams' batting averages; the Diamondbacks' was slightly better. Here I asked how the Rangers could have won the series with a lower average.

Team Batting Avg Question

This portion of the analysis required ChatGPT to aggregate the hits and at-bats to calculate the Team averages since none was provided.
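
A team average has to be built from summed hits and summed at-bats rather than by averaging each player's average. The sketch below shows that aggregation under assumed column names (`Team`, `H`, `AB`) with placeholder rows; it is not the code from the session, and `team_batting_average` is a hypothetical helper name.

```python
def team_batting_average(rows):
    """Return hits divided by at-bats per team, summed across all players."""
    hits, at_bats = {}, {}
    for row in rows:
        team = row["Team"]
        hits[team] = hits.get(team, 0) + row["H"]
        at_bats[team] = at_bats.get(team, 0) + row["AB"]
    # Skip teams with no at-bats to avoid dividing by zero.
    return {team: hits[team] / at_bats[team] for team in hits if at_bats[team] > 0}


# Placeholder rows for illustration only.
rows = [
    {"Team": "Team A", "H": 3, "AB": 10},
    {"Team": "Team A", "H": 2, "AB": 10},
    {"Team": "Team B", "H": 1, "AB": 4},
]
for team, average in team_batting_average(rows).items():
    print(team, f"{average:.3f}")
```

Summing first weights each player by how often they batted, which is why a bench player with one at-bat does not distort the team figure.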

There was a question that we were not able to answer with the data and that was how many runs were from home runs. A valiant effort to parse the game logs and add additional runs to the number of home runs ran into a dead end. I concluded this wasn't possible with the data and moved on.

With few exceptions, ChatGPT powered through the questions, developed Python code to help answer each one, and kept trying when the code failed or the answer didn't make sense.

Insights and Discoveries

The measures and KPIs developed by Bill James and baseball aficionados helped baseball teams get deep into how to improve. I'm sure they are using AI to speed analysis and go deeper than previously possible. If you asked me how to think about a team having a lower batting average but winning the World Series in 5 games, I wouldn't know where to begin. The AI, however, came up with a few hypotheses and tried to work through each one in a matter of seconds.

This was a data set for a game but I can see uses in business. The ability to think through a problem, pivot, and try again would take me a long time. We spend a lot of time coming up with hypotheses, gathering data, and moving on if the research shows something else is driving results. A shorter way of saying it is we do a lot of myth-busting.

While studying supply chain I was provided a data set for a fictional company that had several product lines for multiple business segments. The goal was to determine why some weren't making money and what strategy could be employed to consolidate products and prune unprofitable items. This is where the computing power of ChatGPT would have been handy. Being able to analyze the data, identify ways to find profitable items, and come up with conclusions would take a fraction of the time it took me. The interactivity with ChatGPT makes it a great assistant. It could be used to analyze market trends or customer behaviors by analyzing data captured from the web.

Meanwhile, in IT, ADA can assist in reading the trove of data captured daily in logs and detecting anomalies. There are currently efforts to use AI to enhance security and develop code.

As generative AI improves, new use cases will continue to emerge. Our job is to keep learning how to harness these capabilities.

Lessons Learned

There was more data I could have provided such as the top 5 plays per game. There is also more data available for each Team and Player but that would have grown the data to process tremendously.

At the end of the analysis, I asked ChatGPT how I could help provide a better analysis. The result was a laundry list of things I could do to complete the analysis myself. It was a great list, but what do I need ChatGPT for if I were to do it myself?


The list of generated suggestions:

  1. Clear Objective: Define the goals and key questions of the analysis upfront. This provides a clear roadmap and helps maintain focus on the relevant data points.
  2. Structured Data Description: Provide a brief description of the dataset structure, including any peculiarities or known issues. Knowing what to expect in the data can save time during preliminary checks.
  3. Glossary and Legend: Include a glossary or legend for abbreviations, codes, and specialized terms used in the dataset. This is especially helpful for sports data, which often has domain-specific jargon.
  4. Data Format and Organization: Explain the format and organization of the data, such as how the files are divided (e.g., separate tabs for batting, pitching, summaries), which can expedite data navigation.
  5. Specific Instructions for Data Segmentation: If only certain parts of the data are relevant (e.g., playoff stats, not regular season), pointing this out at the beginning can prevent unnecessary analysis of irrelevant data.
  6. Known Issues and Anomalies: If you're aware of any data issues or anomalies, mentioning them early on can help in avoiding or addressing them during the analysis.
  7. Priority of Questions: If some questions are more important than others, prioritizing them can ensure that the most critical information is tackled first.
  8. Examples and Context: Providing examples or context for how certain data points are used within the sport or what they might indicate can give additional insight and direction to the analysis.
  9. Incremental Validation: As questions are answered, validating the responses before moving on to the next can help catch any misunderstandings or errors early in the process.
  10. Feedback on Outputs: When outputs are provided, giving feedback on whether they meet expectations or need refinement can guide the analysis and improve accuracy.
  11. Clarification on Ambiguities: If there's ambiguity in the questions or the data interpretation, clarifying these as soon as they're identified can prevent misdirection.
  12. Availability for Follow-up: Being available for follow-up questions or to provide additional information as needed helps to fill in any gaps in the analysis.

By providing clear, structured information and being responsive to questions and feedback, you can greatly assist the AI in delivering a more accurate and efficient analysis.

As you can see from my initial prompt I either didn't provide guidance at all or didn't go very deep for some of these items. I provided glossaries but didn't get into the data structure in a meaningful way. I tried to help from time to time but didn't provide feedback unless something didn't make sense. To my credit, the analysis was able to be completed or maybe I owe a lot of that to ChatGPT? Since I'm writing this post I'll take the credit.

Conclusion

Congratulations to the Texas Rangers for bringing home their first World Series victory and to the Arizona Diamondbacks as the NL champions.

I started using ChatGPT and other generative AI platforms this year and have seen the technology improve in a short time. My hope is we can find a way to harness this technology to better ourselves and use resources more wisely. If you have used the ADA feature please share your experience wherever a link to this post was found as I do not have comments on it at this time. You can find the Q&A, chat history, cleaned data, original data source, and ChatGPT's article about the 2023 World Series in the appendix.

Appendix


Questions Provided

I provided a list of 15 questions. ChatGPT kept some the same and took license to modify others. In some cases, it outright made up a question it could answer:

1. Which team won the 2023 World Series and in how many games?
The Texas Rangers triumphed in the 2023 World Series, showcasing their strength and tenacity by sealing the deal in 5 games against the Arizona Diamondbacks. Their victory is a narrative of determination, strategic execution, and seizing the moment under the bright lights of baseball's grandest stage.

2. What was the average number of runs scored per game in the series?
The series averaged an exciting 4.6 runs per game, indicative of the aggressive offensive strategies and the adeptness of batters from both teams who consistently delivered under pressure, contributing to a high-scoring and engaging series for fans.

3. How many singles did each team hit during the series?
The Texas Rangers hit a total of 22 singles, while the Arizona Diamondbacks outpaced them with 31. These singles were instrumental in keeping innings alive and setting up scoring opportunities, highlighting the importance of consistent contact at the plate.

4. Which player hit the most home runs during the series?
Corey Seager of the Rangers made a statement with his bat, leading the charge in home runs with 3 spectacular shots. His power-hitting prowess proved to be a difference-maker, propelling the Rangers' offense throughout the series.

5. What was the average number of pitches per inning?
The pitchers, displaying a blend of skill and strategy, averaged 16.5 pitches per inning. This reflects the depth of the pitching staff and the resilience of batters to work the count, ensuring a battle of wits and wills in every inning.

6. What was the average number of runs per inning?
The series witnessed an average of 0.51 runs per inning, emphasizing that scoring was both a premium and a hard-earned result of strategic hitting, opportunistic base running, and moments of pitching vulnerability.

7. Which team had a better batting average during the series?
Despite the Rangers' series win, the Diamondbacks boasted a better batting average at .270, showcasing their ability to make contact and get on base more effectively than the Rangers, who had an average of .218.

8. How many games did the losing team lead at the end of an inning and still lose?
The Diamondbacks faced the disheartening scenario of leading at the end of an inning in 2 games, only to ultimately lose those contests. These instances underscore the unpredictable and often unforgiving nature of postseason baseball.

9. Which pitcher threw the most pitches per game on average?
Zach Gallen from the Diamondbacks threw the most pitches per game, averaging 104, a testament to his role as a workhorse for his team and the trust his management placed in his arm and strategy on the mound.

10. What was the longest game of the series in terms of time?
Game 1 was the longest, stretching to 3 hours and 48 minutes, a time that reflects the evenly matched competition, the high stakes of each play, and the strategic decisions that prolonged the duel between the two teams.

11. How many lead changes were there in the series?
The series saw a total of 5 lead changes, providing a roller coaster of emotions for fans as the lead swung back and forth, encapsulating the essence of a tightly contested World Series.

12. Which players significantly contributed to the World Series who also had a significantly different regular season performance?
Corey Seager and Tommy Pham are prime examples of players whose World Series contributions far exceeded their regular season performances. Their ability to elevate their game when it counted most was a crucial element of their teams' postseason success.

13. Summarize pivotal plays for each game.
Each game was defined by moments of brilliance: clutch hits, defensive gems, and pitching duels that shifted the momentum and were etched in the memories of fans as the turning points that ultimately led to the Rangers' victory.

14. Did any players have a standout performance that was unexpected based on their regular season performance?
Corey Seager, with his surge in home runs, and several players who delivered clutch hits, exceeded expectations based on their regular season stats, showcasing the unpredictable and magical nature of the postseason where heroes emerge in unexpected forms.

15. Were there any controversial calls or plays that had a significant impact on the series?
The series did not lack for drama, including close calls and plays that spurred debate among fans and analysts. While the impact of these moments is always subject to interpretation, they added to the narrative of a World Series filled with tension and excitement.


Chat History

The interaction with ChatGPT can be found here: 2023 World Series Advanced Data Analytics


Baseball-Reference.com Data

World Series: TEX vs. ARI (4-1)


Rangers Rise to Glory: Underdog Texas Snags World Series in Five-Game Thriller

By GPT-4

In the realm of Major League Baseball, few events stir the pot of excitement and anticipation like the World Series. The 2023 World Series was a classic tale of tenacity, skill, and the age-old adage that sometimes, it's not just the better team on paper that takes the crown, but the team that shines in those clutch moments. The Texas Rangers, an underdog by regular season standards, proved that when October comes knocking, what matters is not the past but the present grit and grind. In an exhilarating five-game series, they toppled the Arizona Diamondbacks to clinch the coveted Commissioner's Trophy.

The Series swung off with a bang, and each game unfolded with its own narrative arc, contributing to a riveting overall saga. Game 1 set the tone for the series with the Rangers mounting an early lead, only for the Diamondbacks to claw back, showcasing the resilience that had carried them through the regular season. However, it was the Rangers who held on to draw first blood in the series.

Each subsequent game was a roller coaster of emotions, with lead changes, home runs, and strategic pitching changes keeping fans on the edge of their seats. The Diamondbacks, with a superior batting average of .270 in the series compared to the Rangers' .218, often seemed like they were on the cusp of taking control. Yet, the Rangers' strategy of timely hitting and aggressive base running paid dividends, especially with the long ball. The Rangers slugged eight home runs, against just three from the Diamondbacks, proving the old adage that while singles and doubles set the table, it's the home runs that often clear the bases and change the scoreboard most dramatically.

The Series' MVP for the Rangers was Corey Seager, whose bat seemed to catch fire when his team needed it the most. Seager's performance was a testament to the unpredictable nature of October baseball, as he eclipsed his regular season stats, hitting three home runs and contributing significantly to the Rangers' offensive output. On the flip side, despite the Diamondbacks' loss, Tommy Pham emerged as a beacon of consistency, his .421 batting average and defensive plays making him a standout.

One of the most striking narratives of this World Series was the pitching. While the bats may have stolen the headlines, the arms were just as vital. Zach Gallen of the Diamondbacks threw an average of 104 pitches per game, a testament to his durability and the trust placed in him by his team. Yet, it was the bullpen strategy that often made the difference, with Game 1 seeing the most pitcher changes at 13, as managers played the matchup game to perfection.

The longest game of the series, Game 1, lasted 3 hours and 48 minutes and was emblematic of the tight competition, setting a precedent for the rest of the face-off. The lead changed hands multiple times throughout the series, with Game 5 displaying no less than five lead changes, a clear indicator of the neck-and-neck battle that unfolded.

But baseball is not just a game of stats and strategy; it's also about the moments that define the human spirit. The 2023 World Series was replete with such instances—of underdogs rising, of seasoned veterans leading by example, and rookies seizing their chance in the spotlight. There were pivotal plays, from diving catches to base-running gambles, each etching a memory in the annals of World Series history.

When the dust settled, it was the Rangers who emerged victorious, a team that had embraced its role as the underdog and used it to fuel a fiery path through the postseason. Their win is a lesson in the power of peaking at the right time and the beauty of baseball's unpredictability.

Off the field, the World Series was just as much a cultural event, with fans from both sides bringing a festival-like atmosphere to the stands. The sounds of cheers and the sights of homemade banners and signs made it more than a game; it was a celebration of community and a shared love for the game.

In a year that was anything but ordinary, the 2023 World Series delivered an extraordinary tale of triumph, resilience, and the undying spirit of baseball. It reminded us that in sports, as in life, the underdog has its day, records are meant to be broken, and history is written by those who seize their moment under the brightest of lights.

As we reflect on this series, we're reminded of why we love baseball. It's the game of inches and seconds, where strategy meets chance, where heroes are made, and legends are born. The 2023 World Series was not just another chapter in the story of baseball; it was a testament to the sport's enduring ability to captivate, inspire, and surprise.

Congratulations to the Texas Rangers, the 2023 World Series champions, and hats off to the Arizona Diamondbacks, who competed with heart and soul. And to the fans, the real MVPs, who make the game what it is—thank you for another memorable October.