Awhile ago, I found myself trying to make a decision on which of several restaurants to eat at. They were all highly rated in Yelp, but surely there might be more insights I could pull from their reviews. So I decided to Splunk them!
Yelp has an API but, I am sorry to say that it is awful. It will only let you download 3 reviews for any venue. That’s it! What a crime.
So… I had to crawl Yelp venue pages to get reviews. I am not proud of this, but I was left with no other other option.
Python has been my go-to language lately, so I decided to solve the problem of review acquisition with Python. I used the Requests module to fetch the HTML code, and the Beautiful Soup module to extract reviews and page links from the HTML.
I recently noticed that something was using up lots of RAM on my Mac, as it would periodically slow down. I had some suspects, but rather than regularly checking in Activity Monitor, I thought it would be more helpful if I had a way to monitor usage of RAM by various processes over time.
Due to previous success with my Splunk Lab app, I decided to use it as the basis for building out a RAM monitoring app. The data acquisition part, however, was trickier. The output of the UNIX ps app isn’t very structured, and I had some problems parsing that data, especially in situations where there were spaces in filenames and arguments to those commands.
So I wrote a replacement for PS. It turns out that Python has a module called psutil, which lets you programmatically examine the process tree on your Mac. I ended up writing an app called Better PS, and it writes highly structured data on each current process to disk, which is then ingested by Splunk.
One of my activities outside of the office consists of staffing furry conventions. One of those conventions is Anthrocon, a furry convention held in downtown Pittsburgh every June/July. At that particular convention, I manage the website and their social media properties.
Yesterday, we opened general hotel reservations, and that resulted in a huge rush of members booking hotel rooms. 1,000 rooms were booked in the first 15 minutes! This was completely expected, and we kept track of how things played out on social media, and also took a survey of members who booked hotel rooms to see how things went. In this post, we’re going to share what we learned based on those survey results and Twitter activity.
First, did people who booked a hotel room get the hotel that they wanted?
For nearly 70% of you, the answer is yes. This makes us happy, but we would like to see the number higher—ideally 100% of our attendees would get a room in the hotel of their choice. This is something we continue to work on each year by adding new hotels and getting bigger room blocks in existing hotels.
In my case, over my Christmas vacation, I checked into a Mom and Pop hotel, or rather a motel! It was about 24 rooms all in a row, occupying a single floor. Since they were on a budget, their Internet offering consisted of what appeared to be 5 or 6 Linksys routers set up every few rooms. You’d simply connect to the closest access point and have Internet.
But there was a problem: determining which access point was closest to me! The signal strength indicator on my computer showed several of them were 3/3 bars so that wasn’t much help. I tried connecting to the first one, but had virtually no Internet connectivity.
In the above example, I added in the TARGETS environment variable, and was sure to include 192.168.1.1, which was the IP for each router (they were all the same). Then I set Splunk “real-time mode” and periodically checked that tab as I was working. This is what I saw:
In a previous post, I wrote about using Splunk to monitor network health and connectivity. While building that project, I thought it would be nice if I could build a more generic application which could be used to perform ad hoc data analysis on pre-existing data without having to go through a complicated process each time I wanted to do some analytics.
So I built Splunk Lab! It is a Dockerized version of Splunk which, when started, will automatically ingest entire directories of logs. Furthermore, if started with the proper configuration, any dashboards or field extractions which are created will persist after the container is terminated, which means they can be used again in the future.
A typical use case for me has been to run this on my webserver to go through my logs on a particularly busy day and see what hosts or pages are generating the most traffic. I’ve also used this when a spambot starts hitting my website for invalid URLs.
This will download the container, start it up, and mount the appropriate directories. The containerized version of Splunk looks recursively for logs in /logs/, stores its data in /data/, and stores dashboards that are created in /app/. (Note that if you try to use “password” as your password, the container will refuse to start for safety reasons!)
First things first, let’s verify our data was loaded and do some field extractions!
I’ve been using Splunk professionally over the last several years, and I’ve become a big fan of using it for my data processing needs. Splunk is very very good about ingesting just about any kind of event data, optionally extracting fields at search time, and providing tools to graph that data, find trends, and see what is really happening on your platform. This is important when your platform consists of thousands of servers, as it does at my day job!
While Splunk can handle events in timestamp key=value key2=value2 format, it also has support for dozens of standardized formats such as syslog, Apache logs, etc. If your data is in a customized format, no problem! Splunk can extract that data at either index or search time! Finally, there’s the Search Processing Language, which is like SQL but for your event data. With SPL, you can run queries, generate graphs, and combine them all programatically.
So yeah, I’m a huge fan of Splunk. One thing I use it for out of the of office is to graph the health of my Internet connection. This is useful both for when I’m at home and when I am traveling–I just feed the output of ping into Splunk and then I can get graphs of packet loss and network latency.
Let’s just jump into an example screen–here’s what I saw when I was a friend’s place and I uploaded a video to YouTube:
With the release of SEPTA’s new app, I’ve suddenly been flooded with questions about their API. People wanted to know how stable it was.
Well, I don’t work for SEPTA, which means I don’t have insight into their operations, but I can perform some analytics based on what I have, which is approximately 18 months of Regional Rail train data, read every minute by SEPTA Stats.
This is all of the data that I have in Septa Stats currently:
Events Since Inception: 26,924,887 events
First Event: Mar 1, 2016 12:00:01 AM
Last Event: Nov 16, 2017 10:33:53 PM
That’s way more events than minutes in that timeframe, and the reason for that is each API query is split into a separate event for each train. So if an API call returns status for 20 trains, that gets split into 20 different events. This is done because Splunk has a much easier time working with JSON that isn’t a giant array. 🙂