Can't Keep Me From My Data
One of the reasons the web is really the best platform is that you can inspect the HTML behind every page. Because the content has to display in a web browser, and that browser has developer tools, getting at a page's markup is fairly easy.
While my work is primarily about analyzing data, my role sits squarely within higher education administration. As an administrator, I find that many of the tools I use have a web application as their front end. Even our email, documents, and presentations can be accessed through a web interface.
I'm sure none of this is surprising or news to anyone. Everybody uses these tools and probably doesn't think about it at all. But it does mean you can always get at the data being displayed to you.
I was faced with a web application that had exactly the data I wanted, but no way to download it. So I immediately opened the web inspector and discovered that it isn't an HTML table at all. Instead, it uses CSS Grid to lay out a table. However, the cells each have a particular class, so it's easy to use Beautiful Soup to find all of those tags.
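The gist of the script was something like the sketch below. The file name, cell class, and column count here are placeholders for illustration (the real page had its own class names), and it assumes the page's HTML has been saved out of the browser first.

import csv
from bs4 import BeautifulSoup

NUM_COLUMNS = 5  # placeholder; the real grid defines how many cells make a row

# Parse a copy of the page saved from the browser's web inspector.
with open("page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Every cell in the CSS Grid "table" shares a class, so grab them all.
cells = [tag.get_text(strip=True) for tag in soup.find_all(class_="grid-cell")]

# Chunk the flat list of cells back into rows and write them out.
rows = [cells[i:i + NUM_COLUMNS] for i in range(0, len(cells), NUM_COLUMNS)]
with open("data1.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)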
The one wrinkle I learned pretty quickly was that not all of the data is loaded into the page at once. I think the entries are dynamically added to the DOM as you scroll to them. That meant I had to run my script several times, scrolling further through the list between runs.
That left me with several CSV files named dataX.csv, where X is an integer. These files could also contain duplicate entries, which I wanted to get rid of.
I found the following command that worked perfectly:
(head -n 1 data1.csv && tail -n +2 -q data*.csv | sort -u) > merged_data.csv
The head command grabs the header row from the first file. (Because they were generated by my script, the files all have the same header.) The tail command, with -n +2, starts each file at its second line, which strips out the headers, and -q keeps it from printing the file names in between. Then sort -u sorts the remaining lines and removes the duplicates.
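For anyone who would rather stay in Python, here is a rough equivalent, assuming the same dataX.csv naming. One difference: it keeps the rows in their original order instead of sorting them.

import csv
import glob

seen = set()
with open("merged_data.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for i, name in enumerate(sorted(glob.glob("data*.csv"))):
        with open(name, newline="") as f:
            reader = csv.reader(f)
            header = next(reader)
            if i == 0:
                writer.writerow(header)  # keep the header from the first file only
            for row in reader:
                key = tuple(row)
                if key not in seen:  # skip duplicate rows across all files
                    seen.add(key)
                    writer.writerow(row)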