Sports Analytics Fall Challenge Monthly Meetup

Agenda:

6:30-7:00 PM - Networking & Pizza

7:00-7:15 PM - Opening Remarks (Jake Mason) & Member Introductions

7:15-7:45 PM - Data Q&A (Randy Istre, Inside Edge)

7:45-8:15 PM - Major Drivers of MLB Attendance: The Pittsburgh Pirates 2010 Attendance Study (Prof. Ashley Stadler Blank, University of St. Thomas)

8:15-8:45 PM - Initial Data Exploration (Kevin Church)

8:45-9:00 PM - Team Matchmaking


About the Challenge:

Analyze This! & MinneAnalytics are excited to announce a new Sports Analytics challenge in partnership with Inside Edge, a Twin Cities-based sports scouting and data analytics organization with one of the most comprehensive databases of MLB stats in existence!

This competition will kick off at 6:30 PM on October 12th at UMN's Hanson Hall and finish on the afternoon of January 10th at the MinneAnalytics Sports Analytics Conference (held at the University of St. Thomas’ downtown Minneapolis campus). The format will be similar to our recently completed Science Museum of Minnesota Summer Data Challenge, including prize money provided by MinneAnalytics.

Please stay tuned for further details.

Let the match-making begin!


Pedro, Mitch, Jake, Daniel, & Kevin


  • Jack H.

    Do the point totals take into account stolen bases? Or just points derived from batting statistics?

    December 19

    • Jack H.

      So are there any data points available regarding base running? It just seems that if an SB is rewarded with the third-highest point total (offensive players only), and there is very little in the data set that reflects any type of stolen-base or base-running measurement, it could certainly have an impact.

      December 21

    • Randy I.

      Jack - Good question. I'm afraid this file doesn't contain anything very helpful for projecting SBs specifically. We project those in a separate file, but unfortunately not this one, and it's too late to add new fields or files at this point.

      December 21

  • Randy I.

    Challenge participants - I uploaded the file Hitter_PitcherCategories_Map.xlsx to the data file folder. As requested at the last meetup, this file contains categories and descriptions for HGLY/PGLY. Good luck!

    December 16

    • Jon J.

      Hi Randy,

      At the last meetup you mentioned a co-worker who did a lot of research into daily fantasy sports. Can you post his resources (books, links, etc)?

      December 16

    • Randy I.

      Hi Jon - our colleague is Kenny Kendrena. He'll be at SportCon. I know that two of his resources are the books: "Trading Bases" by Joe Peta and "Fantasy Baseball for Smart People" by Jonathan Bales. There are also plenty of online resources available via search, of course. Kenny has also talked to a few "sharks," but they tend to be reluctant to give away their deepest secrets...

      December 16

  • Randy I.

    Hi Sports Analytics MLB Challenge participants!

    I sent this as email, but am also posting here:

    In advance of this Wednesday evening's meetup, I wanted to let all of you know a few things:

    First, I hope that you have seen on this meetup forum site that I uploaded auxiliary data files which update or add a few fields (OU, elevation, roof, & Hitter-Pitcher B-T). These were created in response to questions or requests by participants. Each field has a "2" appended to the field name & can replace the original field.

    Second, if you have questions or comments, please post them here so that everyone can see them. If you post questions in advance of Wed.'s meeting, that will give me a chance to look into them beforehand if I need to.

    Good luck, and I'll see you soon,

    December 12

  • RyanCaldwell

    Does anyone know if we are considered participants for the conference (Free) or if we have to pay to attend?

    December 11

    • Jake M.

      Unless you're a sales rep, SportCon is free to attend. Just secure the "participant" ticket type.

      December 11

  • Randy I.

    All - in response to Chengchao Lu's post below, I just uploaded a new auxiliary data file "OU2wB-T.csv" (& .xlsx) which adds the corrected field "Hitter-Pitcher B-T 2" to the previous auxiliary data file. Good luck!

    November 30

  • Randy I.

    I've just uploaded an auxiliary data file "AddedFieldsOU_Elev_Roof" in CSV and XLSX formats, indexed by RecordNum, that add more of the missing data for Over/Under, Elevation, and Roof, as per requests here. Remaining missing data for these fields is not available. Good luck!

    November 19
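
For anyone joining an auxiliary file back onto the main file, here is a minimal sketch. It assumes both files are CSVs keyed by RecordNum and that updated fields carry a "2" suffix (as with "Hitter-Pitcher B-T 2" mentioned in this thread). The sample column names `Elevation` and `Roof`, the "NA" missing-value token, and the helper `apply_auxiliary_fields` are illustrative assumptions, not the actual file layout.

```python
import csv
import io


def apply_auxiliary_fields(main_rows, aux_rows, key="RecordNum", suffix="2"):
    """Overwrite main-file fields with their suffixed auxiliary versions."""
    aux_by_key = {row[key]: row for row in aux_rows}
    merged = []
    for row in main_rows:
        updated = dict(row)
        aux = aux_by_key.get(row[key], {})
        for name, value in aux.items():
            if name == key or not name.endswith(suffix):
                continue
            # "Elevation2" -> "Elevation"; "Hitter-Pitcher B-T 2" -> "Hitter-Pitcher B-T"
            base = name[: -len(suffix)].rstrip()
            # Only replace when the auxiliary file actually supplies a value.
            if base in updated and value not in ("", "NA"):
                updated[base] = value
        merged.append(updated)
    return merged


# Tiny made-up sample; real column names and values may differ.
main_csv = "RecordNum,Elevation,Roof\n1,NA,Open\n2,5280,NA\n"
aux_csv = "RecordNum,Elevation2,Roof2\n1,815,Open\n2,5280,Retractable\n"

main_rows = list(csv.DictReader(io.StringIO(main_csv)))
aux_rows = list(csv.DictReader(io.StringIO(aux_csv)))
merged = apply_auxiliary_fields(main_rows, aux_rows)
print(merged)
```

Keeping the original value when the auxiliary value is blank is a design choice for this sketch; you may prefer to overwrite unconditionally.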

  • Chengchao L.

    Hi, I have a question about two fields: "Pitcher Side" and "Hitter-Pitcher B-T". I found some pitchers whose "Pitcher Side" (left or right) does not match the side given in "Hitter-Pitcher B-T". Also, what do "BHP" and "BHB" mean?

    November 19

  • Luke H.

    Hi, I attended the Nov. 9 meeting and I am interested in signing the NDA and taking a look at the data. Please point me in the right direction or email me directly. Thanks!
    luke.hendricks (at) outlook.com

    November 16

    • Jake M.

      Hi, Luke. Here's the link to the NDA: https://drive.google.c... Sign that and send it off to Randy Istre of Inside Edge ([masked]). After Randy approves the NDA, you'll get access to the data, which is held in a Google Drive folder. Have fun!

      November 16

    • Luke H.

      Perfect, thanks.

      November 16

  • Jack H.

    How can a hitter's status indicate he is on the 15 day DL or 7 day DL, yet still be able to produce FanDuel points?

    Also, was there ever a conclusion on what determines the specific HGLY/PGLY categorizations? Example: a left-handed batter with a high batting average will also likely have a high contact rate (the two skill sets are highly correlated). What distinguishes that player being classified as LHiAvg rather than LHiContact?

    November 15

    • Randy I.

      Jack - On the DL status with Pts, I'll get back to you. On HGLY/PGLY, the basic algorithm we have in place now is run every 2 weeks. For every hitter (& pitcher), we first separate into R or L, then look top/down at the most important differentiators on a rolling 2-year timeframe. If they exceed our cutoff for that category (like 75th percentile), they go into that category. Power (Pop), speed, BAVG, Contact, etc., or no special category (LX or RX). We also put them into one of 3 Tiers (Top, Middle, Bottom). We're interested in re-working our algorithm this offseason, in case anyone would like to help. But to answer your question, a hitter goes into whichever category he "qualifies" for first (BAVG before Contact).

      November 16
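
The "first category he qualifies for" rule described above can be sketched roughly as follows. This is not Inside Edge's implementation: the stat names, the labels other than HiAvg/HiContact, the exact priority order (only "BAVG before Contact" is stated), and computing percentiles within each batting side are all assumptions. Tiers are left out.

```python
# Priority order is an assumption based on the order the skills are listed
# in the thread; only "BAVG before Contact" is stated explicitly.
PRIORITY = [
    ("power", "Pop"),
    ("speed", "Speed"),
    ("bavg", "HiAvg"),
    ("contact", "HiContact"),
]


def percentile_rank(value, population):
    """Percentage of the population at or below the given value."""
    return 100.0 * sum(v <= value for v in population) / len(population)


def categorize(hitter, peers, cutoff=75.0):
    """Return side prefix plus the first category whose cutoff is exceeded."""
    side = hitter["side"]  # "L" or "R"
    same_side = [p for p in peers if p["side"] == side]
    for stat, label in PRIORITY:
        population = [p[stat] for p in same_side]
        if percentile_rank(hitter[stat], population) > cutoff:
            return side + label
    return side + "X"  # no special category


# Made-up hitters: A leads in both batting average and contact rate.
hitters = [
    {"name": "A", "side": "L", "power": 10, "speed": 5, "bavg": 0.310, "contact": 0.90},
    {"name": "B", "side": "L", "power": 30, "speed": 4, "bavg": 0.250, "contact": 0.70},
    {"name": "C", "side": "L", "power": 12, "speed": 6, "bavg": 0.260, "contact": 0.80},
    {"name": "D", "side": "L", "power": 15, "speed": 3, "bavg": 0.240, "contact": 0.75},
]
labels = {h["name"]: categorize(h, hitters) for h in hitters}
print(labels)
```

In this toy data, hitter A qualifies for both batting average and contact, but is labeled by batting average because it is checked first.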

    • Randy I.

      Jack - on the "DL w/ Pts" question - this appears to be a record where a guy was still on the DL (officially) when we generated the record, but came off later that day, played & scored points. It does tip you off, however, that he hasn't played for a while.

      1 · November 16

  • kingsley

    Will there be free parking there?

    November 13

    • Kevin

      Could one of the students from the Carlson school please address the issue of free parking?

      November 14

  • Randy I.

    All - I also uploaded a PDF copy of my slides from the 11-9 presentation on DFS Lineup Optimization. Please read the description there first. Get it via More/Files above. Thanks! Randy Istre, Inside Edge

    November 14

  • Kevin

    Hi All, I've just uploaded a pdf copy of the slides from the Kickoff meeting on October 12th. Go to More>Files to find it. Also, there is an explanation of the MLB challenge in that same location.
    Kevin

    November 12

    • T

      Thanks!

      November 14

  • T

    Hello Organizers, how was the meeting? I couldn't participate yesterday for personal reasons. Can I get the presentations or the discussion in any format?
    How can I access the data to be analysed for the future meetup, along with detailed instructions?
    Thanks

    1 · November 10

  • Issaq

    Hello Organizers, Can you please share the presentations discussed during the first meet up?

    Thanks

    1 · November 10

  • Issaq

    Is there a WebEx for the upcoming event? I have class and can't make it, so I just wanted to check.

    November 5

    • Issaq

      That works! Thanks.

      November 8

    • Issaq

      Hi Pedro,

      November 10

  • Mitchell N.

    FYI, the room in Hanson Hall has changed from 1-106 to 1-111, in the southeastern corner of the building.

    November 9

  • Pedro M.

    Fantastic article on the effect of baseball analytics in helping the Cubs win. Enjoy! "The Curious Have Won"

    "Theo Epstein overcame 108 years of history to build a championship team in Chicago. In the process, he ended baseball’s long-running analytics war by proving that an objective, data-driven approach can change the game" https://theringer.com/2016-world-series-chicago-cubs-theo-epstein-analytics-war-9f1248c44eb7#.w4itbwcy3

    November 5

  • Randy I.

    As per the email to all participants:

    We have now uploaded an entirely new data file, which we believe corrects the issues. This time, though, the data is from the 2015 MLB season instead of 2016, so that we could do a virtual "Reset" on everything. Notes:
    - The file format and dictionary remain the same.
    - Both .xlsx and .csv files are uploaded.
    - FanDuel's scoring system changed between the 2015 and 2016 seasons, but we updated the Actual_PTS values to reflect 2016 points.
    - The test records (no DV values) are order-randomized.
    - The "PA Last15Days" field has NA values in it for the first 6 1/2 weeks, because in 2015, we didn't start collecting that in the file until then. You'll have to deal with that.

    Please delete the old data file (dated [masked]), download the new file instead (dated [masked]), and use it going forward. My apologies again for having to start over. I certainly hope this is the last time! Enjoy Game 7 of the World Series tonight. Play Ball!

    November 2
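
One possible way to "deal with" the NA values in "PA Last15Days" is to fill them and keep a flag that records which rows were missing. The sketch below uses median imputation purely as an illustration; the function name and the choice of median are assumptions, and since the gaps all fall in the first weeks of the season, another strategy may suit your model better.

```python
from statistics import median


def impute_with_flag(raw_values, na_token="NA"):
    """Fill missing entries with the observed median and flag them."""
    observed = [float(v) for v in raw_values if v != na_token]
    fill = median(observed) if observed else 0.0
    filled = [fill if v == na_token else float(v) for v in raw_values]
    flags = [int(v == na_token) for v in raw_values]  # 1 = was missing
    return filled, flags


# Made-up column values as they might be read from the CSV.
raw = ["NA", "NA", "12", "20", "16"]
filled, flags = impute_with_flag(raw)
print(filled, flags)
```

Keeping the flag column lets a model learn that early-season rows behave differently instead of treating the filled value as real.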

  • Kevin

    We have removed the MLB Challenge data from the shared drive while Inside Edge works to prepare a completely new dataset, this time based on the MLB 2015 regular season. We expect to have the new data available in a few days. The format of the data, and hence the dictionary, will remain unchanged.

    A more detailed explanation of the data problems appears in a pdf document on the shared drive.

    October 28

  • Jon J.

    What is the difference between PA (plate appearances) and AB (at bats)? The PA columns are always greater than or equal to the AB columns. Why would there be a difference?

    October 27

    • Sean P.

      A batter will not receive credit for an at bat if their plate appearance ends under the following circumstances:

      He receives a base on balls (BB).[1]
      He is hit by a pitch (HBP).
      He hits a sacrifice fly or a sacrifice bunt (also known as sacrifice hit).
      He is awarded first base due to interference or obstruction, usually by the catcher.
      The inning ends while he is still at bat (due to the third out being made by a runner caught stealing, for example). In this case, the batter will come to bat again in the next inning, though the count will be reset to no balls and no strikes.
      He is replaced by another hitter before his at bat is completed (unless he is replaced with two strikes and his replacement completes a strikeout).

      1 · October 27
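
As a quick arithmetic check on the list above, at bats can be recovered from plate appearances by subtracting the outcomes that do not count as at bats. This is a minimal sketch using the first four items in the list; the parameter names are hypothetical and are not column names from the challenge file.

```python
def at_bats(pa, bb=0, hbp=0, sf=0, sh=0, interference=0):
    """AB = PA - walks - hit by pitch - sac flies - sac bunts - interference."""
    ab = pa - bb - hbp - sf - sh - interference
    if ab < 0:
        raise ValueError("component counts exceed plate appearances")
    return ab


# Made-up season line: 650 PA, 60 BB, 5 HBP, 4 SF, 1 SH.
season_ab = at_bats(650, bb=60, hbp=5, sf=4, sh=1)
print(season_ab)  # 580
```

Because every subtracted count is non-negative, PA is always greater than or equal to AB, which matches what Jon saw in the columns.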

    • Randy I.

      Right on, Sean!

      October 27

  • Samuel M.

    Is there any way the data could also be posted as a .csv?

    October 24

    • Randy I.

      Yes - I just uploaded the CSV version.

      1 · October 25

    • Samuel M.

      Excellent thank you!

      October 25

  • Kevin

    Hello Everyone, we are currently investigating 3 issues with the data (Shawn x 1, Jared x 1 and Inside Edge x 1) and will get back to you as soon as possible. In the meantime, if you have something else fun to do this evening I would recommend holding off on analyzing the challenge dataset until you hear back from us. Thanks!

    October 25

  • George S.

    NDA sent - no response yet! Thanks!

    October 25

    • Randy I.

      Just responded - got the NDA. Will send link once we address the reported data issues.

      October 25

  • Shawn S.

    I hate to say it, but I have reason to think the data still isn't accurate. I know of at least one row that is almost certainly wrong in a fairly important way: it's the highest actual score of the season, by an NL player who is said to have batted 9th in the dataset. So I suspect there are others.

    October 24

    • Shawn S.

      And if PitcherSide means the handedness of the starting pitcher (the dictionary is blank), that is also not correct.

      A number of other fields in that record also look suspect based on who the player is, but I haven't tried to verify those. I just noticed the BOP, which clearly didn't look right, then started looking at what else didn't make sense, and found a lot.

      October 25

    • Mitchell N.

      Shawn, does it appear that the rows and attributes have become mismatched in the data set? i.e., the data set became jumbled during the preparation process?

      October 25

  • Jared

    I have potentially spotted a serious leak in the dataset. The train dataset is sorted by week and points scored, and it seems that the test dataset may also be sorted in this manner (there is a significant correlation with order by week of the test data and the predicted score of my model).

    October 24

    • Randy I.

      I'm not understanding your concern, Jared. Are you saying that week has an impact on points? Probably true. But about 10% of records from each week were randomly selected as the test set, so how is that a leak?

      October 25

    • A former member

      If the test dataset is also sorted by week and points scored, then the points scored for a test record could be inferred by its location within the records for that week.

      October 25
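
To make the leak concern concrete, here is a small simulation with made-up numbers. It assumes the test rows for a week are released sorted by points in ascending order (the thread does not say which direction) and guesses each row's points from its position alone, using training quantiles. The data, the 10% split, and the function names are all illustrative.

```python
import random


def guess_from_position(n_test, train_points):
    """Guess test points from row position alone via training quantiles."""
    ordered = sorted(train_points)
    guesses = []
    for pos in range(n_test):
        q = (pos + 0.5) / n_test  # relative position within the week
        idx = min(int(q * len(ordered)), len(ordered) - 1)
        guesses.append(ordered[idx])
    return guesses


def mean_abs_error(guesses, actual):
    return sum(abs(g - a) for g, a in zip(guesses, actual)) / len(actual)


rng = random.Random(0)
points = [rng.uniform(0, 40) for _ in range(1000)]  # made-up points, one week
train, test = points[100:], points[:100]            # roughly 10% held out

leaky_test = sorted(test)      # released in sorted order: position leaks rank
shuffled_test = test[:]
rng.shuffle(shuffled_test)     # order-randomized release

guesses = guess_from_position(len(test), train)
leaky_mae = mean_abs_error(guesses, leaky_test)
shuffled_mae = mean_abs_error(guesses, shuffled_test)
print(leaky_mae, shuffled_mae)
```

With the sorted release, position alone gives a much lower error than with the shuffled release, which is why randomizing the order of the test records removes this particular leak.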

  • A former member

    This might be a crazy idea, but if accuracy were dropped as a judging criterion, then there would be no need for a test set, and no need to worry about data leaks. Models could be judged holistically based on the methodology used, rather than the model outputs.

    October 25

    • Randy I.

      Not really that crazy if you ask me, David! But predictive accuracy is really the goal of the model, and what's important from a business viewpoint, so I'd like to keep it as an important criterion, unless there is compelling reason not to.

      October 25

  • George S.

    Where is the data for this? Is there a link? I did not see it under More > Files. Thanks!

    October 25

  • Randy I.

    Update: the corrected data file is now available.

    As you know, we discovered a problem with the original data file for this challenge - it inadvertently contained duplicate records, along with some data that was not intended to be there and which could give an unfair advantage.

    We have asked everyone (via email) to DELETE that data file (dated [masked]), download the new file instead (dated [masked]), and use it going forward. The data dictionary remains unchanged and is still in the folder as well. You were given a link to that folder if you signed the NDA.

    Thanks for your cooperation, and good luck!

    October 21

  • Kevin

    FYI: We're still working to resolve this "dupe" problem, but suggest the teams suspend any analytic work, as the dataset is likely to change significantly. As before, however, the Dictionary is good to go. Thanks!

    October 20

  • Kevin

    Update on the Dupe problem. IE has uncovered the source of the duplicate rows and is in the process of revising the analytic data-set. I have pulled the original "faulty" version from the shared drive. Randy will post the revised data as soon as possible. In the meantime, the Dictionary is unaffected if you want to familiarize yourself with the 110 factors available for modeling.

    October 18

  • Kevin

    Thanks, David. I've contacted Inside Edge about the duplicates and await a response.

    October 17

  • A former member

    I count 26,162 duplicate rows when the RECORDNUM column is excluded. Is this expected? There are no duplicates in weeks 1-4 or 27.

    October 16
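
A duplicate check of this kind can be reproduced with a short script. The sketch below counts every extra copy of a row after dropping the RECORDNUM column; whether the 26,162 figure above was computed the same way is not stated, and the sample columns `Week` and `Player` are made up for illustration.

```python
import csv
import io
from collections import Counter


def count_duplicate_rows(rows, ignore=("RECORDNUM",)):
    """Count rows that repeat an earlier row once ignored columns are dropped."""
    seen = Counter()
    for row in rows:
        key = tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
        seen[key] += 1
    # Each group of n identical rows contributes n - 1 duplicates.
    return sum(n - 1 for n in seen.values() if n > 1)


# Tiny made-up sample: records 1, 2, and 5 match apart from RECORDNUM.
sample_csv = (
    "RECORDNUM,Week,Player,Actual_PTS\n"
    "1,5,A,12.0\n"
    "2,5,A,12.0\n"
    "3,5,B,3.0\n"
    "4,6,A,12.0\n"
    "5,5,A,12.0\n"
)
rows = list(csv.DictReader(io.StringIO(sample_csv)))
duplicates = count_duplicate_rows(rows)
print(duplicates)  # 2
```

Grouping the same keys by week would show which weeks are affected, in the spirit of the "weeks 1-4 or 27" observation above.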

  • Kevin

    Hi Everyone! I just uploaded a 2-page document that explains the logistics and format of our Fall 2016 MLB Analytics Challenge. Click on "More>Files" to find it. Kevin

    1 · October 14
