This post belongs to the series "My last 2 years in happiness: A Machine Learning experiment", a series of posts exploring Machine Learning, sentiment analysis, data transformation, visualization, and more. This one in particular introduces the series, explains the motivation behind it, and gets started with the project.
Part 1: Introduction and Setup
Anyone who's relatively close to my life and recent events would know that I've been traveling pretty frequently for the last two years. I've accumulated a lot of experiences during that period, gotten to know a lot of new places and people, and I've felt incredibly happy and sad as well; I've even been in love, for Christ's sake! (Although this is probably the first and only time I'll admit it publicly, like ever.)
In most ways, I'm now the happiest I've ever been, so it's great timing not only for this post, but also for reflecting on the past while experimenting with something new.
Quick aside on nomenclature
Naming things is for sure one of the hardest things for us humans to do. I had not one but two important decisions to make right before starting this series, and I want to walk you through the rationale behind them. This is a completely optional aside, and you're more than free to skip past it if you'd like.
The title of this post: I'm a huge fan of emoji, so I was close to titling this post as:
😹 || 😿 ✈ 2y: a ML 🔬 - P1
So close... but then my first editor pointed out this thing called SEO and the second one was bugging me about Unicode, so I finally settled on a more normal-looking one after the third's comments were pretty much PG-13. (For those of you concerned that I'm running a Fortune 500 just for blogging, my editorial team is usually yours truly right after I stop having stupid thoughts 😂.)
The name of this project: I'm calling this project HappyCat. I'm not a cat person, and by no means am I implying that I'm a cat myself (happy or not). It just so happens that, since the delivery mechanism of this post is the electronic cat database a.k.a. the Internet, it was an obvious choice. I hope you approve, and if you happen to own a brand ending in or sounding like cat, say NaviCat or AutoCa(T), be nice and don't sue me. (On a serious note: if you haven't watched Uncle Bob's keynote on architecture, I highly recommend you do at some point.)
OK, this is getting out of control. I hope you can get past my humor (or lack thereof) across this post, but I assure you this is not a strictly personal one, and for 80% of my audience that's surely not the reason you're reading it, so... take it as a refresher before we get all bor-y and enterpris-y.
What is it about, then?
You're most likely reading this because you're a software developer or manager wanting to know a tiny bit more about Machine Learning, and this is tagged as such. Machine Learning (ML from now on throughout this series) is a scary topic, granted, but I'm a strong believer that it should be an essential one as ecosystems and businesses evolve, and that it will become mainstream. You might never run into the most intricate problems it solves, but most businesses need intelligence, recommendation, analytics, etc., and as developers it's our duty to enable a business's success, so our skills need to stay current for us to be able to deliver.
I devised this post as a way to keep my own skills current (hopefully it will align with yours), and one of the best ways to get introduced to, or expand your knowledge of, new technologies, practices and tools is by doing something fun. This is not an introduction to the topic, nor is it research; it's a practical way of using the available technologies to analyze a particular problem.
Speaking of which, it's the part about happiness that interests me. I'm fond of the concept of the quantified self, to the point where I log almost everything into Evernote (where I've been and places I've loved via Foursquare, calendar events, tweets, people I've met, you name it), use a life-tracking platform, and have automated setups with IFTTT and Zapier and all sorts of diary-like entries lying around the digital scene (although not a real journal just yet). So I wondered how I could leverage my setup to get analysis and meaningful data out of it. Bottom line: once you have an automated/manual setup that reflects most of your actual life digitally, it opens the door for digital tools to make something out of it.
The goal of this little project is to be able to visualize how happy or sad I've been over the last two years based on different inputs, using sentiment analysis and a set of tools to train, predict, categorize, cluster and visualize results.
My plan is to be able to analyze what I've been tweeting and what kind of music I've been sharing (a.k.a. #NowPlaying, Rdio, Last.FM, etc.), visualize and graph the results, and maybe expand the timeline to real time.
The full source code will be available on GitHub once this series concludes, hopefully with even more data sources, functionality and easy ways to customize it; it will almost certainly be possible to run it for different users. Comments, suggestions and contributions, as well as general usage, are more than welcome, but you should be aware of its license: it's completely open source, but the most important part is:
"I (the author), this project (the software) or any contributions (derived work) are by no means to be held accountable or responsible for having you (the user) fire an employee, shut down a business, or break up with your S.O. based solely on the output of the software in scenarios that show that they weren't happy 99% of the time".
You've been warned 🙈. Let's dive in.
Tools of the Trade
As I mentioned earlier, we will be using Scala for most of this project. Advantages and disadvantages, as well as language comparisons, are outside the scope of this post. Everyone who has seen at least one tweet of mine knows I'm in love with Ruby, and we will be using that too, but... one of the things I keep saying is that, as a developer, you should be able to choose the best tool for the job, and do it confidently. The scientific side is not as long-established a citizen in the Ruby community as it is in the Java or Python ecosystems, among other reasons to shy away from it.
The ML engine I'm using, which I've known about for a while and which recently surfaced in a conversation I was having with @pikitgb about his crazy stock-market analysis ideas (don't ask), is indeed written in Scala. Plus, the language's data manipulation facilities are superb, and there's an incredible amount of work put into its collections, including parallelization; hence, in my mind, Scala is an awesome choice for this project.
It would be foolish to try to reinvent the wheel and implement algorithms, data streaming, storage and all the supporting mechanics of such an analysis from scratch; the project is already complicated to begin with. I've decided to leverage a tool that gives me the plumbing so I can focus on my particular problem, and fortunately it does exist in the form of Prediction.IO, which uses Spark, Hadoop/HBase, Spray, and ElasticSearch under the hood (you can mix and match HBase/ElasticSearch with PostgreSQL or MySQL, but this is my preferred stack).
Prediction.IO will handle the core analysis: basically everything for which no external tools or APIs exist, plus our custom analysis.
Instead of trying to describe this awesome tool and falling short, I'll point you to What's PredictionIO, where I think the authors have done a pretty good job; please take a look at it, it's certainly recommended reading.
Ever since I discovered Docker there's very little that I install directly into my operating system (which itself gets provisioned with Boxen; details in a future post), and this project won't be an exception, especially with such a large component. More standard installation methods for every platform are described on the project's Installation page, but here we will use an image that will run inside our beloved container-based architecture.
I could have created an image that replicated the installation steps from those pages, but luckily for me, @mingfang already did, so taking advantage of that:
    ~ $ j code
    ~/Code $ g clone mingfang/docker-predictionio
    Cloning into 'docker-predictionio'...
    remote: Counting objects: 155, done.
    remote: Total 155 (delta 0), reused 0 (delta 0), pack-reused 155
    Receiving objects: 100% (155/155), 23.10 KiB | 1024 bytes/s, done.
    Resolving deltas: 100% (61/61), done.
    Checking connectivity... done.
    ~/Code $ cd docker-predictionio
    ~/Code/docker-predictionio (master)$ ./build
    Sending build context to Docker daemon 95.74 kB
    Sending build context to Docker daemon
    Step 0 : FROM ubuntu:14.04
    ...
    Step 1 : ENV DEBIAN_FRONTEND noninteractive
    ...
    Step 2 : RUN apt-get update
    ...
    Step 3 : RUN apt-get install -y runit
    ...
    Step 4 : CMD /usr/sbin/runsvdir-start
    ...
    Step 5 : RUN apt-get install -y vim less net-tools inetutils-ping curl git telnet nmap socat dnsutils netcat tree htop unzip sudo software-properties-common jq
    ...
    ...
Avid readers might notice that my git command is actually hub; otherwise you'll need to prefix the repository name with the whole GitHub stream of chars 😆.
With runsvdir-start& running inside the container (if it wasn't already run for us), we are now able to access the dashboard via HTTP on dockerhost (where dockerhost is the IP of your Docker container's host, namely localhost on environments where Docker runs natively, such as Linux, or your VM's address in other setups).
This, for all intents and purposes, should suffice until the later stages, when we start deploying engines such as sentiment analysis.
Exist is a life-tracking platform. As they put it, they collect data from the services you already use and find trends and correlations in the results: start by connecting your fitness tracker, and add other services like your calendar for greater context on what you're up to.
It allows you to integrate several services into a general dashboard, some of which I consider paramount, such as RescueTime, and it provides meaningful insights and correlations regarding your digital life. It also allows you to track your mood.
The platform will set you back $6/month after a free trial, but it has become a recent and very welcome addition to my toolbox. Of course, the trick here is to use it programmatically, and hence I've created an API client for it and several custom integrations, some of which will be shown throughout this series.
Configuring our Ruby client is pretty simple, although its implementation details are subject to change since the developers are migrating to OAuth2. There are several supported ways of configuring it, but the simplest one is to export the environment variable EXIST_API_TOKEN with the token you get from their web interface, probably using something like direnv.
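The Ruby client takes care of reading that variable itself, but the Scala side of the project could pick up the same token from the process environment. The following is only a minimal sketch under that assumption: the object name ExistToken and its fromEnv helper are hypothetical names made up for illustration, and only the EXIST_API_TOKEN variable name comes from the setup described above.

```scala
object ExistToken {
  // Name of the variable exported in the shell (for example via direnv).
  val VariableName = "EXIST_API_TOKEN"

  // Looks the token up in the given environment (the real process
  // environment by default). Returns Right(token) when a non-blank value
  // is present and Left(message) otherwise, so callers can fail early.
  def fromEnv(env: Map[String, String] = sys.env): Either[String, String] =
    env.get(VariableName)
      .map(_.trim)
      .filter(_.nonEmpty)
      .toRight(s"$VariableName is not set; export it before running")
}
```

Passing the environment in as a plain Map keeps the lookup easy to exercise without touching the real shell, e.g. ExistToken.fromEnv(Map("EXIST_API_TOKEN" -> "abc")).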
Twitter & Twitter app
I won't describe how to create a Twitter app, since it's pretty much standard these days. This is a short section only reminding us that the policies of Twitter do require an app to access your own tweets ;).
Setting up project and dependencies
If you're following along (quick reminder: all code will be pushed to GitHub), you can create a Scala project with your tool of choice. I use IntelliJ for Scala development, but it doesn't really matter; to start things off you just need:
- Scala 2.11.6
- twitter4j ~> 4.0
- twitter4j-stream for accessing streaming tweets (not strictly necessary at the beginning, but something that will come in handy if we do this in real time).
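If your tool of choice happens to be sbt (an assumption, since any build tool will do), the list above could translate into something like the following build.sbt sketch. The project name and the exact 4.0.4 version are placeholders picked for illustration; any 4.0.x release should satisfy the "~> 4.0" constraint.

```scala
// build.sbt (a sketch; the project name and patch versions are assumptions)
name := "happycat"

scalaVersion := "2.11.6"

libraryDependencies ++= Seq(
  // Core REST client for the Twitter API
  "org.twitter4j" % "twitter4j-core"   % "4.0.4",
  // Streaming API support, only needed once we go real time
  "org.twitter4j" % "twitter4j-stream" % "4.0.4"
)
```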
Configuring Twitter4j is very simple, and it supports several methods. For the sake of simplicity, I'll paste a close representation of my twitter4j.properties, which lives at the root of my project:
    debug=false
    oauth.consumerKey=<REDACTED>
    oauth.consumerSecret=<REDACTED>
    oauth.accessToken=<REDACTED>
    oauth.accessTokenSecret=<REDACTED>
    includeRTs=false
    http.useSSL=true
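To check that the properties file is being picked up, a quick smoke test along these lines may help. It is only a sketch, not code from the project: the TimelineCheck object and its render helper are hypothetical names, and it assumes the twitter4j-core dependency is on the classpath and that the credentials in the file are valid.

```scala
import java.text.SimpleDateFormat
import java.util.Date

import scala.collection.JavaConverters._

import twitter4j.{Paging, TwitterFactory}

object TimelineCheck {
  // Formats one tweet as "yyyy-MM-dd | text" for a quick visual check.
  def render(createdAt: Date, text: String): String = {
    val day = new SimpleDateFormat("yyyy-MM-dd").format(createdAt)
    s"$day | $text"
  }

  def main(args: Array[String]): Unit = {
    // The default factory loads twitter4j.properties from the working
    // directory or the classpath, so no credentials appear in code.
    val twitter = TwitterFactory.getSingleton

    // First page of the authenticated user's own timeline, 20 tweets at most.
    val statuses = twitter.getUserTimeline(new Paging(1, 20)).asScala

    statuses.foreach(s => println(render(s.getCreatedAt, s.getText)))
  }
}
```

If the credentials are wrong, the getUserTimeline call fails with a TwitterException, which is a fast way of finding out before writing any analysis code.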
This concludes part one: we've introduced the project and its goals, and we're ready to start coding. Stay tuned for Part 2.