We often find ourselves trying to answer a variety of questions in which we scavenge the internet for data that can be utilized to improve our analytical models. When extracting such data there is the novice way of doing it, mainly copying and pasting, and then there is the smart way of doing it, using web extraction tools such as Python's BeautifulSoup, Selenium or Request packages. Below, I demonstrate how to extract data from the Bureau of Labor Statistic's website and upload its results into Oracle's SQL Database 19c for direct query access in Oracle SQL Developer.
Why use Oracle SQL Database 19c & Oracle SQL Developer you might ask. There were many times while troubleshooting my way through the installation and configuration process that I asked that same question. In many companies, if you asked what the most utilized data storage and reporting platform was, you would get the same answer: Microsoft Excel. Often you will see backend systems, such as Essbase or BPC, feeding data into Excel, but a large amount of information is still being stored and processed for reporting. Excel is a great application, but it has its limitations and is not meant to be a database replacement. I have experience with other applicable platforms such as Microsoft Access and Microsoft SQL Server, but I have the most experience with Oracle SQL Developer since I have used it professionally.
Installing Oracle Database 19c can be a daunting task. Either everything will go smoothly or you might go through a learning experience of constant troubleshooting like I did. After going through the process multiple times, I'd recommendation that you make a complete system restore point afterwards.
After installing Oracle Database 19c, you will need to create a user and grant that user privileges in SQL plus. Next you will be able to open up Oracle SQL Developer and add a new database connection using the user credentials you assigned in SQL plus.
After finalizing the Oracle setup, it is time to connect Python to it. For this you will need to install the cx_Oracle package into you Python library. You will then need to run the below script modified to your machine and user.
It is important to be diligent in maintaining security of usernames and passwords. In this case the underlying risk is minimal, but for other uses that might not be the case. It is not secure to have such variables hardcoded in your script. Anyone who has access to it will have visibility to your credentials and if your store such scripts on online repositories, such as GitHub, it will also be visible. You could call an input command, use your system's environmental variables, like I did here, or setup a keyring, which is the most secure way of handling it.
Once completed, it is possible to run SQL scripts in Python that will transgress over into the Oracle SQL Developer application. before that we need to look at the table we are trying to extract.
In the below Python script, I created the table structure that matches the required format for this project.
Once run, the results can be seen in Oracle SQL Developer.
For web scraping, there are three main Python packages that are useful depending on what the task requires. The Request package worked well for this particular project but for others, Selenium for web crawling through JavaScript and HTML elements and BeautifulSoup for extracting data might be more appropriate. Here is the script that worked for this particular table.
This only pulls the table into Python and more steps are required to insert the imported table into a SQL table. With the cx_Oracle package you need to run SQL statements in string like execute commands in order to get it to process, followed by a commit command. Here is the script that aggregated the scraped data and inserted it into the Oracle table.
Once run, the results can be seen in Oracle SQL Developer.
And there you have it! A scraped table from the web in an Oracle database. Now, wouldn't it be great if you could maintain this table without any routine work? Well there is a way to do that too. My scripts were written in Jupyter Notebooks format with the extension of ipynb. Aggregating all the code together and converting it into a Python file, with the py extension, is required if you want to use the Windows Task Scheduler application that comes with your computer. You will need to link the location of this Python file and your Python application and set an time in which you wish it to run in order to complete this process.
For further details about this project, please see my work sample presentation.
©2020 by Andrew Hicks, MSBA
Comments