A while ago, I found myself trying to decide which of several restaurants to eat at. They were all highly rated on Yelp, but surely there were more insights I could pull from their reviews. So I decided to Splunk them!
Yelp has an API, but I am sorry to say that it is awful. It will only let you download 3 reviews for any venue. That’s it! What a crime.
So… I had to crawl Yelp venue pages to get reviews. I am not proud of this, but I was left with no other option.
Python has been my go-to language lately, so I decided to solve the problem of review acquisition with Python. I used the Requests module to fetch the HTML code, and the Beautiful Soup module to extract reviews and page links from the HTML.
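To give a rough idea of how Requests and Beautiful Soup fit together here, below is a minimal sketch. It is not the actual crawler: the two CSS selectors and the function names are hypothetical placeholders, since the real Yelp markup is not shown in this post and changes over time.

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical selectors, used only for illustration. Inspect the real
# page markup and adjust these before relying on them.
REVIEW_SELECTOR = "p.review-text"
NEXT_PAGE_SELECTOR = "a.next-page"


def parse_page(html):
    """Return (review texts, page link hrefs) found in one page of HTML."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = [tag.get_text(strip=True) for tag in soup.select(REVIEW_SELECTOR)]
    links = [tag["href"] for tag in soup.select(NEXT_PAGE_SELECTOR) if tag.has_attr("href")]
    return reviews, links


def fetch_page(url):
    """Download one page with Requests and hand the HTML to parse_page()."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return parse_page(response.text)
```

Keeping the parsing separate from the download means the extraction logic can be tried out on a saved HTML file before any pages are fetched.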
I recently noticed that my Mac would periodically slow down, and that something was using up lots of RAM. I had some suspects, but rather than regularly checking in Activity Monitor, I thought it would be more helpful if I had a way to monitor usage of RAM by various processes over time.
Due to previous success with my Splunk Lab app, I decided to use it as the basis for building out a RAM monitoring app. The data acquisition part, however, was trickier. The output of the UNIX ps command isn’t very structured, and I had some problems parsing that data, especially in situations where there were spaces in filenames and in the arguments to commands.
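To illustrate the parsing problem, here is a small sketch. The sample line is made up in the style of `ps -eo pid,rss,command` output; it is not real data from my machine.

```python
# A made-up line in the style of `ps -eo pid,rss,command` output.
line = "  412  20480 /Applications/My App.app/Contents/MacOS/My App --profile Default User"

# The leading numeric columns are easy to peel off...
pid, rss, command = line.split(None, 2)

# ...but splitting the command on whitespace cannot tell where the
# executable path ends and where each argument begins.
naive_argv = command.split()

print(pid, rss)
print(naive_argv)
```

The executable path comes back in three pieces, and nothing in the text says whether "Default User" was one argument or two.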
So I wrote a replacement for ps. It turns out that Python has a module called psutil, which lets you programmatically examine the process tree on your Mac. I ended up writing an app called Better PS, which writes highly structured data on each running process to disk, where it is then ingested by Splunk.
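As a rough sketch of the idea (not the actual Better PS code), something like the following collects one record per process with psutil and appends the records to a file as JSON lines. The field names, the function names, and the one-JSON-object-per-line layout are assumptions made for this example.

```python
import json
import time

import psutil


def snapshot_processes():
    """Return one dict per running process, with the command line kept as a list."""
    now = time.time()
    records = []
    for proc in psutil.process_iter(["pid", "name", "cmdline", "memory_info"]):
        info = proc.info
        mem = info["memory_info"]  # None when access to the process is denied
        records.append({
            "timestamp": now,
            "pid": info["pid"],
            "name": info["name"],
            "cmdline": info["cmdline"] or [],
            "rss_bytes": mem.rss if mem is not None else None,
        })
    return records


def append_snapshot(path):
    """Append the current snapshot to `path`, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as handle:
        for record in snapshot_processes():
            handle.write(json.dumps(record) + "\n")
```

Because psutil returns the command line as a list of arguments, spaces inside a path or an argument are preserved instead of being lost to whitespace splitting.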
In a previous post, I wrote about using Splunk to monitor network health and connectivity. While building that project, I thought it would be nice if I could build a more generic application which could be used to perform ad hoc data analysis on pre-existing data without having to go through a complicated process each time I wanted to do some analytics.
So I built Splunk Lab! It is a Dockerized version of Splunk which, when started, will automatically ingest entire directories of logs. Furthermore, if started with the proper configuration, any dashboards or field extractions which are created will persist after the container is terminated, which means they can be used again in the future.
A typical use case for me has been to run this on my webserver to go through my logs on a particularly busy day and see which hosts or pages are generating the most traffic. I’ve also used this when a spambot starts hitting my website with requests for invalid URLs.
This will download the container, start it up, and mount the appropriate directories. The containerized version of Splunk looks recursively for logs in /logs/, stores its data in /data/, and stores dashboards that are created in /app/. (Note that if you try to use “password” as your password, the container will refuse to start for safety reasons!)
First things first, let’s verify our data was loaded and do some field extractions!