Web scraping historical NFL fantasy football stats, load into an Oracle Database & view in Power BI.

Andrew Hicks
Dec 4, 2020
6 min read

Updated: Dec 5, 2020

Have you ever wanted to gain a competitive edge when playing fantasy football against your friends? In my experience, I have often drafted based off a bias of names I knew and was often left in the dark on other players. Every sporting website provides its analytics of how they value one player but maybe you would like to create you own model but don't feel like copying and pasting webpage after webpage into a spreadsheet. Well, the good news is that there is a better way through the use of Python, a database and Microsoft Power BI.

In this project I explored two separate methodologies for extracting fantasy football data. The first, was downloading open source CSV files from Fantasy Football Data Pro's website. The second, was web scraping Pro Football Reference's website. We will explore the methodologies below in further detail.

Fantasy Football Data Pro

This website contains a download that contains both yearly and weekly views of historical fantasy stats by player. The yearly series of CSVs has stats from 1970-2019 while the weekly files has stats from 1999-2019. Here are the variables that these files contain:

• Player Name • Passing Attempts • Receiving Yards

• Position • Completions • Receiving TDs

• Team Name • Rushing Attempts • Fumbles Lost

• Passing Yards • Rushing Yards • PPR Fantasy Points

• Passing TDs • Receiving Catches • Standard Fantasy Points

• Interceptions • Receiving Targets • 0.5 PPR Fantasy Points

The issue with the methodology is that the downloaded content comes segmented in a series of spreadsheets that has a different file layer structure between the yearly files and the weekly files. This structure is illustrated below.

So how can we aggregate all of these spreadsheets together? Is there a way that when we update these spreadsheets that we wouldn't have to rework this step manually? There sure is a way in Python to stream-line this process and here is how.

Starting the the yearly dataset, we can loop through and extract data from each of these files pretty seamlessly because they are all under a single parent folder.

path_yearly = r'C:\Fantasy Football\yearly'
SysFiles_Yearly = glob.glob(path_yearly + "/*.csv") # Loads all files in a particular folder based off the set path.
List_Yearly = []
x = 1970
for filename in SysFiles_Yearly:
    df = pd.read_csv(filename)
    df.insert(0, 'Year', x)
    i = 1
    df.drop(df.columns[i], axis=1,inplace=True)
    x = x + 1
    List_Yearly.append(df)
Summary_Yearly = pd.concat(List_Yearly, axis=0, ignore_index=True)
Summary_Yearly

Running this script output the below table. This is an aggregation of all fifty files that the folder contains.

The weekly dataset is slightly more complicated because of the folder-file structural layout as depicted earlier. Not only do you need to loop through the seventeen weekly spreadsheets but you must also loop through twenty-one parent folders for each year. This can also be done using the below script.

y = 1999
List_Weekly_Yearly = []

while y < 2020:
    path_weekly = r'C:\Fantasy Football\weekly\%s' %y 
 # The [%s' %y allows you to insert a variable into a string; in this case an interval URL]
    SysFiles_Weekly = glob.glob(path_weekly + "/*.csv") # Loads all files in a particular folder based off the set path.
    List_Weekly = []
    x = 1
 for filename in SysFiles_Weekly:
        df = pd.read_csv(filename)
        df.insert(0, 'Week', x)
        df.insert(0, 'Year', y)
        i = 1
        df.drop(df.columns[i], axis=1,inplace=True)
        x = x + 1
        List_Weekly.append(df)
    Summary_Weekly = pd.concat(List_Weekly, axis=0, ignore_index=True)
    List_Weekly_Yearly.append(Summary_Weekly)
    y = y + 1
Summary_Weekly_Yearly = pd.concat(List_Weekly_Yearly, axis=0, ignore_index=True)
Summary_Weekly_Yearly

And here is its output.

Now you have an aggregated view of each of these in a Python data frame, you can analyze it in this format, export it back into a spreadsheet or, do what I will illustrate with the next source, import it into Oracle Database so that you query directly against it in Oracle SQL Developer, Excel or Microsoft Power BI through an ODBC.

Pro Football Reference

This website is full of data that would provide great value in any fantasy football analytical model. What is the best way to extract it though? It would be counterproductive to copy and paste it because not only is it spread across multiple pages but there is the issue of having to rework it manually when new data becomes available.

Though there are many variables we could pull in across this website, I am only going to focus on the following ones: • Record ID • Home or Away Game • Player Position

• Game Year • Opponent • Season

• Game Date • Fantasy Points

• Team • Player Name

The script that I am using below is an adjusted script that Steven Morse wrote where I have it loop through multiple pages, scaping fifty years of fantasy football stats and exporting it into a single CSV file. This takes a long time to run so if you plan on running it, best to let it run over night.

from bs4 import BeautifulSoup
import pandas as pd
import requests

List_Year = []
year = 1970

while year < 2021:
# --------------------------------------------------------------------- 
# Source: https://stmorse.github.io/journal/pfr-scrape-python.html
    url = 'https://www.pro-football-reference.com'
    maxp = 300
 
 # grab fantasy players
    r = requests.get(url + '/years/' + str(year) + '/fantasy.htm')
    soup = BeautifulSoup(r.content, 'html.parser')
    parsed_table = soup.find_all('table')[0]  

    df = []

 # first 2 rows are col headers
 for i,row in enumerate(parsed_table.find_all('tr')[2:]):
     if i % 10 == 0: print(i, end=' ')
     if i >= maxp: 
         print('\n' + str(year) + ' Complete.')
         break
 
     try:
         dat = row.find('td', attrs={'data-stat': 'player'})
         name = dat.a.get_text()
         stub = dat.a.get('href')
         stub = stub[:-4] + '/fantasy/' + str(year)
         pos = row.find('td', attrs={'data-stat': 'fantasy_pos'}).get_text()

         # grab this players stats
         tdf = pd.read_html(url + stub)[0]    

         # get rid of MultiIndex, just keep last row
         tdf.columns = tdf.columns.get_level_values(-1)

         # fix the away/home column
         tdf = tdf.rename(columns={'Unnamed: 4_level_2': 'Away'})
         tdf['Away'] = [1 if r=='@' else 0 for r in tdf['Away']]

         # drop all intermediate stats
         tdf = tdf.iloc[:,[1,2,3,4,5,-3]]
 
         # drop "Total" row
         tdf = tdf.query('Date != "Total"')
 
         # add other info
         tdf['Name'] = name
         tdf['Position'] = pos
         tdf['Season'] = year

         df.append(tdf)
     except:
 pass

    df = pd.concat(df)
# ---------------------------------------------------------------------
    df.insert(0, 'Year', year)
    year = year + 1
    List_Year.append(df)
Summary_Yearly = pd.concat(List_Year, axis=0, ignore_index=True)
Summary_Yearly

Summary_Yearly.to_csv(r'C:\fantasy.csv')

When I ran this last, it produced a CSV with 106,356 records. I found issue with some of the data though. The data from 1970-1993 was incomplete. Because of this, I decided to focus my project from 1994 though today. Below is an illustration of the first eleven rows of this spreadsheet.

Database

From here we can upload this data directly into an Oracle Database so we can query against it in Oracle SQL Developer and use it in other applications through an ODBC. In my previous publication I demonstrated how this can be done automatically, but in this example, since the file is already in a CSV format, I decided to manually imported the data into Oracle.

First you need to run the following script in order to create the table. From there you can right-click on the table and import the data.

CREATE TABLE PROFREF (
    ID NUMBER,
    YEAR NUMBER,
    GAMENUM NUMBER,
    GAMEDATE VARCHAR2(20),
    TEAM VARCHAR2(4),
    AWAY NUMBER,
    OPP VARCHAR2(4),
    FANPT FLOAT,
    NAME VARCHAR2(40),
    POSITION VARCHAR2(4),
    SEASON NUMBER
)

Once imported you'll have a table that looks like this.

Dashboard

From here bringing the table into Microsoft Power BI is pretty easy assuming you have your TNS_NAMES setup properly. In Microsoft Power BI, select get data and chose the ODBC option. From here you will be able to write your applicable SQL script as illustrated below and import the data from the database.

Now that you have the data in a dashboard which enables you to analyze it further. My final report showed the total number of fantasy football points by player, position and team from 1994 through 2019. This provided me insight that I had previously been unware of and enables me to make smarter analysis in draft selection based off historical trends.

Final Thoughts

There certainly is room for improvement on this project. Some scripts can be streamlined and processes can be automated. Additional data can be brought in to enrich the model so that predictive analytics can be run against it. Perhaps, with enough effort in developed a model could be crafted that would provide draft recommendations live during the draft itself that changes based off what is currently on the board. I have many other projects on my list, but I would like to come back to this at some point and see how I can improve it further.