Darling if you only knew, half as much as everybody thinks you do…
— Donald Jay Fagen / Walter Carl Becker
Yes, yet another Python project named after a song.
If you have never read this blog before:
I have been using Python to collect data from the Internet since the first of the year. I took a few more classes about the cloud and machine learning and I developed a system for collecting, processing, and storing data about what’s hot and what’s not.
Okay, so that is a little glib, but that is the crux of the issue.
The idea is to get a picture of what people in the news, on social media, and across the web are talking and thinking about. I am pursuing the zeitgeist.
I Got The News 2.0
I Got The News was the first of these data collection projects, so the code is the oldest. Which in this case also means the worst.
As I have been combining all these different data collection systems into one digital eco-sphere I have been modernizing the IGTN collection system. My code is WAY better now than it was 6 months ago 🙂
I’ve written previously about Pandas and of course this new code base still uses it extensively. I’ve developed quite a library of Pandas One-Liners.
Let's say, for example, that you want to remove the rows whose value in a dataframe column named term appears in a stop list:
df = df[~df['term'].isin(stop_list)]
What if you want to remove all the rows that have zeros in x or more columns?
df = df[np.count_nonzero(df.values, axis=1) > len(df.columns) - 3]  # x = 3
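Here is a minimal, self-contained sketch of both one-liners on a tiny made-up dataframe. The column names (news, social, web), the counts, and the stop list are all invented for illustration. One assumption differs from the line above: the zero count is taken over the numeric columns only, so the text column does not inflate the non-zero tally.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: term counts from three made-up sources.
df = pd.DataFrame({
    "term": ["the", "python", "and", "cloud", "pandas"],
    "news": [120, 4, 95, 0, 0],
    "social": [300, 9, 210, 0, 2],
    "web": [80, 0, 60, 0, 0],
})
stop_list = ["the", "and", "of"]

# One-liner 1: keep only the rows whose term is NOT in the stop list.
filtered = df[~df["term"].isin(stop_list)]

# One-liner 2: drop the rows that have zeros in x or more numeric columns.
x = 2
counts = filtered.drop(columns="term")  # assumption: ignore the text column
kept = filtered[np.count_nonzero(counts.values, axis=1) > len(counts.columns) - x]

print(filtered["term"].tolist())
print(kept["term"].tolist())
```

With x = 2, a row survives only if it has fewer than two zeros across the three count columns.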
Those 2 simple examples show why Pandas is a superior data manipulation tool compared to a spreadsheet or similar. Anything you can do with a general-purpose, high-level programming language you can do to your data via Pandas.
Why use Pandas at all? Why not just use Python? That second line of code is one good reason: we can affect the entire dataframe without looping through it. I'll admit I still resort to looping through a df if I cannot figure out the Pandas way to do it, but it's so much faster if you don't have to loop.
Especially as your data grows…
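To make the looping point concrete, here is a small sketch that computes the same column twice: once row by row with iterrows, and once with a single vectorized expression. The data and the share column are hypothetical. On three rows the speed difference is invisible, but the vectorized form is the one that generally keeps up as the row count climbs.

```python
import pandas as pd

# Hypothetical data: how often each term was seen.
df = pd.DataFrame({
    "term": ["python", "cloud", "pandas"],
    "count": [10, 30, 60],
})

# The loop way: visit every row and build the result by hand.
total = df["count"].sum()
shares_loop = []
for _, row in df.iterrows():
    shares_loop.append(row["count"] / total)

# The Pandas way: one expression applied to the whole column at once.
df["share"] = df["count"] / df["count"].sum()

print(df)
```

Both approaches give the same numbers; the second one just never touches an individual row in Python.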
The Google Cloud Platform
As my data has grown I have realized that my hard drive is no match for all this data that I have collected over the past few months. I wrote an entire blog post about storing data on the Google Cloud over here…
Suffice it to say that I am now well acquainted with storing data on the Google Cloud and moving it around. I even got PostgreSQL up and running, but that is a story for its own upcoming post.
Moving my data to the cloud has been a revelation. It’s so fast that I forget that it’s not local sometimes – and I can do almost anything with it on the cloud that I can locally.
Since I’m handing out tips in this post: if you decide to move some of your data into a Google Cloud bucket and want anyone else to be able to access it, do this in the bucket's permissions:
Create a new user called allUsers. In the Roles drop-down, select the Storage sub-menu, then click the Storage Object Viewer option.
I have forgotten that more than once and it almost drove me crazy. Of course I thought it was my code…
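Once a bucket is readable by allUsers, its objects can be fetched over plain HTTPS at storage.googleapis.com, followed by the bucket name and the object name. The sketch below only builds that URL; the bucket and object names are made up, and the commented-out read_csv call assumes the object is a CSV file in a bucket that really is public.

```python
from urllib.parse import quote


def public_gcs_url(bucket: str, blob: str) -> str:
    """Build the public HTTPS URL for an object in a publicly readable bucket."""
    # Percent-encode the object name; quote() leaves "/" alone by default.
    return f"https://storage.googleapis.com/{bucket}/{quote(blob)}"


# Hypothetical bucket and object names, for illustration only.
url = public_gcs_url("my-example-bucket", "igtn/terms 2019.csv")
print(url)

# With a real public bucket, Pandas can read straight from the URL:
# import pandas as pd
# df = pd.read_csv(url)
```

If that URL returns an access-denied error, the missing allUsers role is the first thing to check before blaming the code.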
This Project Needs a Name
As I consolidate all of my data collection scripts into one system, I’m thinking that this overall project needs a name, instead of each part being named after a random song.
I need something that sounds legit.