The working GitHub repository is stored here.
Process Documentation
This project began as a way for me to dive deeper into the tools AWS offers for data engineering. It soon turned into a goal of building a full data pipeline, from data extraction to data visualization.
Getting the Data
I started by further building my skills in Python for data analytics. I searched for APIs that would help me complete this project, and I found this Medium post [here] that gave me a jumpstart. This led me to the NBA API, which provided the player/team name, the date of each game, and every shot taken, made or missed. To search for these players, I needed a list of active rosters. After some Google searching, I found this API that let me get that information. With these two APIs, and some other helpers, I was able to get all the data I needed.
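For illustration, here is a minimal sketch of what a shot-detail request can look like, assuming the nba_api Python package; the player used here is just an example, and the roster API I actually used is the one linked above.

```python
# A minimal sketch of pulling shot detail, assuming the nba_api Python package.
from nba_api.stats.static import players
from nba_api.stats.endpoints import shotchartdetail

# Look up a player's ID by name (example player, not a specific roster entry).
player = players.find_players_by_full_name("Stephen Curry")[0]

# Request every field goal attempt for that player in one regular season.
shots = shotchartdetail.ShotChartDetail(
    team_id=0,                         # 0 = all teams
    player_id=player["id"],
    season_nullable="2021-22",
    season_type_all_star="Regular Season",
    context_measure_simple="FGA",      # include misses as well as makes
)

# Each row includes the game date, shot location, and whether the shot was made.
shot_df = shots.get_data_frames()[0]
print(shot_df[["GAME_DATE", "PLAYER_NAME", "TEAM_NAME", "SHOT_MADE_FLAG"]].head())
```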
Building the Database
After getting the data, I needed a free or cheap database solution to store it. Initially, I thought I would explore AWS RDS as the preferred solution. However, after creating a sample table and leaving it overnight, I realized that RDS was definitely not free or cheap. Looking around for other options, I found AWS DynamoDB, a NoSQL database solution. The fact that it is always free up to a certain amount of storage was a definite selling point, and the prospect of learning a NoSQL solution for the first time was enticing as well.
Learning that I could save the JSON output from the APIs as a JSON struct in DynamoDB, instead of turning everything into columns and figuring out primary/foreign keys, was inspiring. It saved me a lot of headache and time, which I always appreciate, and it let me flex my Python skills a little as well. First, I created a roster table (github) with the player name as the partition key. Then, I used one of the APIs I found to populate the table with the active 2022 rosters (github). With that done, I could now get the shot detail for each player from the NBA API. I created the ShotsDetails table (github), and then, by querying the NBA API with each player from the roster table, I was able to get the shot details for each player and save them (github). I saved the key (player name-team name) as one column and all the other data in the other column.

This is where I ran into my first major issue. DynamoDB caps each item at 400 KB, so a whole season's worth of shot information could not be saved in one row. To fix this, I tried to think like someone who would use this data. In my opinion, an NBA player goes through his fair share of ups and downs in a season, so a full season's worth of data might not be fully indicative of how a player will perform in the next game. On the other hand, by restricting that timeline to the most recent month of games, the user might get a better view into the player's performance. I decided to keep only the most recent month of data for each player, which fixed the issue.
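Here is roughly what that write can look like with boto3. The table name comes from the post, but the attribute names, the date format, and the helper below are assumptions, not the exact code in the repo.

```python
# A sketch of writing shot data to DynamoDB with boto3; "PlayerTeam" and "Shots"
# are assumed attribute names, not necessarily those used in the repo.
from datetime import datetime, timedelta
import boto3

dynamodb = boto3.resource("dynamodb")
shots_table = dynamodb.Table("ShotsDetails")

def save_recent_shots(player_name, team_name, shot_rows):
    """Keep only the most recent month of shots so the item stays well under
    DynamoDB's 400 KB item size limit."""
    cutoff = datetime.today() - timedelta(days=30)
    recent = [
        row for row in shot_rows
        # GAME_DATE as returned by the shot chart endpoint, e.g. "20211019"
        if datetime.strptime(row["GAME_DATE"], "%Y%m%d") >= cutoff
    ]
    shots_table.put_item(
        Item={
            "PlayerTeam": f"{player_name}-{team_name}",  # partition key
            "Shots": recent,                             # stored as a JSON-like struct
        }
    )
```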
Updating the Data
The next part I wanted to tackle was figuring out how to update the ShotsDetails table daily. AWS Lambda turned out to be a great solution. Lambda lets you attach a script to a trigger; when the trigger fires, the script runs. In my case, the script updated the table daily by deleting data more than a month old and adding that day's data (github). The trigger was an AWS CloudWatch event scheduled to fire once every day. I connected the trigger and the script and was able to update my data daily.
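A rough sketch of what that daily handler can look like is below; the roster attribute names and the get_todays_shots helper are placeholders rather than the repo's actual code.

```python
# A rough sketch of the daily Lambda handler; attribute names and the
# get_todays_shots helper are placeholders, not the exact code from the repo.
from datetime import datetime, timedelta
import boto3

dynamodb = boto3.resource("dynamodb")
roster_table = dynamodb.Table("Roster")        # assumed roster table name
shots_table = dynamodb.Table("ShotsDetails")

def get_todays_shots(player):
    """Placeholder: pull today's shots for this player from the NBA API."""
    return []

def lambda_handler(event, context):
    cutoff = datetime.today() - timedelta(days=30)
    for player in roster_table.scan()["Items"]:
        key = f"{player['PlayerName']}-{player['TeamName']}"   # assumed attributes
        item = shots_table.get_item(Key={"PlayerTeam": key}).get("Item", {})
        shots = item.get("Shots", [])

        # Drop anything more than a month old...
        shots = [s for s in shots
                 if datetime.strptime(s["GAME_DATE"], "%Y%m%d") >= cutoff]
        # ...and append today's new shots.
        shots.extend(get_todays_shots(player))

        shots_table.put_item(Item={"PlayerTeam": key, "Shots": shots})
    return {"updated": True}
```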
Here was my second major issue. In theory, this solution works really well. However, the NBA website blocks access from cloud providers, so this solution, while interesting to learn, was absolutely useless. To remedy this, I decided to use Windows Task Scheduler. Three tasks were required: first, turn my local desktop on, which can be done through the MSI BIOS; then run the script that updates the tables; and finally turn the desktop off, both of which can be done by setting up automated tasks through Task Scheduler. I have not completely figured out this solution yet.
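For the local workaround, the update-and-shutdown step can be a small script that Task Scheduler runs daily; this is only a sketch, with update_shots_tables standing in for the real update logic.

```python
# A minimal sketch of the local script Task Scheduler runs; update_shots_tables
# is a placeholder for the same roll-forward logic shown in the Lambda sketch.
import subprocess

def update_shots_tables():
    """Placeholder: delete shots older than a month and add today's shots."""
    pass

if __name__ == "__main__":
    update_shots_tables()
    # Shut the desktop back down once the update finishes
    # (Windows shutdown command with a 60-second delay).
    subprocess.run(["shutdown", "/s", "/t", "60"], check=True)
```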
Loading and Querying the Data
The next step was figuring out how to make the data queryable through SQL. I could have connected the DynamoDB tables directly to Power BI, but that, in my opinion, was no fun. This is where I learned about AWS Glue. Glue is a fully managed ETL (Extract, Transform, Load) service that can read data from many other AWS services. I set up a Glue Crawler to read the DynamoDB tables and save their metadata in the Glue Data Catalog. Then I set up a Glue Job that reads the data from DynamoDB and, using the schema the Data Catalog provides, saves the data in AWS S3 as Parquet files. S3 is AWS's object storage service. Lastly, I set up a trigger that runs the job every day, after the tables update.
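A sketch of what the Glue job script can look like as a PySpark Glue script is below; the catalog database, table, and bucket names are placeholders, not the ones in my setup.

```python
# A sketch of a Glue job that reads the crawled DynamoDB table and writes Parquet
# to S3; database, table, and bucket names are placeholders.
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())

# Read the DynamoDB-backed table through the schema the crawler saved in the Data Catalog.
shots = glue_context.create_dynamic_frame.from_catalog(
    database="nba_catalog",        # placeholder catalog database
    table_name="shotsdetails",     # placeholder crawled table
)

# Write it out to S3 as Parquet for Athena to query.
glue_context.write_dynamic_frame.from_options(
    frame=shots,
    connection_type="s3",
    connection_options={"path": "s3://nba-shots-bucket/shots/"},  # placeholder bucket
    format="parquet",
)
```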
Now I needed to figure out how to query the data saved in S3. AWS Athena was the tool I used to do so. Athena is a serverless service that lets you use SQL to query data in S3. Using Athena, I created the tables and views I needed to set up my visualizations.
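The same views can also be queried programmatically. Here is a hedged example using boto3's Athena client, with placeholder database, view, and output-location names.

```python
# A sketch of running an Athena query from Python with boto3; the database,
# view, and output location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT player_team, COUNT(*) AS shots "
        "FROM shots_view GROUP BY player_team"
    ),
    QueryExecutionContext={"Database": "nba_catalog"},                    # placeholder
    ResultConfiguration={"OutputLocation": "s3://nba-athena-results/"},   # placeholder
)
print(response["QueryExecutionId"])
```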
Creating the Data Model and the Visualizations
Next, I set up the data model in Power BI. To do so, I first downloaded the ODBC connector (here). Second, I configured my DSN through Windows ODBC Data Sources and added the Athena data source there. Lastly, I connected to the DSN through Power BI.
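A quick way to sanity-check the DSN outside of Power BI is to connect to it from Python; this sketch assumes the pyodbc package and a DSN named "AthenaDSN".

```python
# Sanity-check the Athena ODBC DSN from Python; "AthenaDSN" is an assumed DSN name.
import pyodbc

conn = pyodbc.connect("DSN=AthenaDSN", autocommit=True)
cursor = conn.cursor()
cursor.execute("SELECT 1")   # trivial query to confirm the connection works
print(cursor.fetchone())
conn.close()
```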
After doing that, I was able to import the tables and views I had set up into a Power BI data model. Using DAX, I created calendar functions to filter by time, and other helper functions to support my visualizations. Lastly, I used the built-in Power BI visuals to visualize the data.
Final Data Pipeline Setup:
Note: This project is not complete and this post is a work in progress. I will update it as I make progress.