Metro Image credit Washington Post

( Update 8/28/16 11pm: the plots and code have been updated after reddit user /u/Grimace06 caught an oversight. I had neglected to account for train directionality in computing headways and wait times.

Update 8/29/16 4pm: Fix concerning blue line pointed out by reddit user /u/KevinMCombes. )

For those unfamiliar, Metro is the common abbreviation for the Washington Metropolitan Area Transit Authority (WMATA). It is the transit authority for the Washington, DC region and operates Metrorail, Metrobus, and MetroAccess, the first of which is the topic of this post. In mid-July WMATA released a new API giving real-time train position information. Prior to this, it was possible, though difficult, to estimate train positions with some heuristics as the people at were doing rather successfully. I attempted something similar but was never able to achieve results that were accurate and complete. The new train positions API made it much easier to get the train positions and also gave unique identifiers to the trains, allowing API users to more accurately follow a particular train through time.

Reddit user /u/Stuck_In_the_Matrix has kindly been archiving the responses of the train positions API queries online for people to download. Given this archived data, I thought it might be interesting (perhaps even informative) to analyze it. This idea was inspired by Erik Bernhardsson's excellent blog entry NYC subway math, which used the MTA API to get some data to look at waiting times.

First things first...

Step one is usually (always) obtaining the data. The WMATA data archive I mentioned above has a ton of files, and many of them are related to the buses, which I didn't need. To facilitate downloading the data, I wrote a simple Python script using requests and BeautifulSoup. Here's an example of the data that we get.

   CarCount  CircuitId  DestinationStationCode  DirectionNum  LineCode  SecondsAtLocation  ServiceType  TrainId  retrieved_on
0  6         1140       E01                     1             YL        0                  Normal       218      1472029202
1  6         2036       C15                     2             YL        0                  Normal       226      1472029202
2  6         2486       N06                     2             SV        5                  Normal       015      1472029202
3  6         2911       G05                     1             SV        5                  Normal       215      1472029202
4  6         3157       G05                     1             SV        0                  Normal       186      1472029202

Unfortunately, the data did not have all the information I wanted: it doesn't say whether a given train is at a station, and it doesn't include a trip identifier. This meant that, as usual, I had to add data from other sources (another WMATA API in this case) to get the information I needed. To get station info, we can query the Metro API and find out which circuits belong to stations. By trip identifier I mean a unique identifier for a trip from start to end. A red line train from Shady Grove to Glenmont could be one trip, but if that train turns around and starts going the other way, it's a new trip. Trip identifiers can be generated from the combination of the unique train ID and its destination. If a train ID and destination combination has not been seen, that's a new trip. If a train ID has been seen but the destination is now different, that's also a new trip.
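The trip ID rule described above can be sketched in a few lines of plain Python. This is a minimal sketch and not the code from the repository: the function name, the `date_counter` ID format (modelled on the sample output), and the fixed UTC-4 offset for the summer 2016 data are all assumptions, and the last sample row is made up to show a turnaround.

```python
from datetime import datetime, timedelta, timezone

# Assumption: retrieved_on is Unix epoch seconds and the data is from
# summer 2016, so a fixed Eastern daylight offset (UTC-4) is good enough.
EASTERN_DAYLIGHT = timezone(timedelta(hours=-4))


def assign_trip_ids(records):
    """Return copies of the records, in time order, with a 'TripId' added.

    A new trip starts when a train ID is seen for the first time, or when
    a known train ID shows up with a different destination.
    """
    current_trip = {}   # TrainId -> (destination, TripId) of its current trip
    trips_per_day = {}  # 'YYYY-MM-DD' -> number of trips started that day
    labelled = []
    for rec in sorted(records, key=lambda r: r["retrieved_on"]):
        train = rec["TrainId"]
        dest = rec["DestinationStationCode"]
        trip = current_trip.get(train)
        if trip is None or trip[0] != dest:
            when = datetime.fromtimestamp(rec["retrieved_on"], EASTERN_DAYLIGHT)
            day = when.strftime("%Y-%m-%d")
            count = trips_per_day.get(day, 0)
            trips_per_day[day] = count + 1
            trip = (dest, f"{day}_{count}")
            current_trip[train] = trip
        labelled.append({**rec, "TripId": trip[1]})
    return labelled


# The first three rows mirror the sample data; the last one is hypothetical.
sample_rows = [
    {"TrainId": "054", "DestinationStationCode": "N06", "retrieved_on": 1469175885},
    {"TrainId": "054", "DestinationStationCode": "N06", "retrieved_on": 1469175970},
    {"TrainId": "327", "DestinationStationCode": "J03", "retrieved_on": 1469175470},
    {"TrainId": "054", "DestinationStationCode": "G05", "retrieved_on": 1469180000},
]

for row in assign_trip_ids(sample_rows):
    print(row["TrainId"], row["DestinationStationCode"], row["TripId"])
```

One known weakness of this rule, as sketched here, is that a train that goes out of service and later reappears with the same destination is treated as still being on the same trip.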

Finally, I drop all rows of the dataframe where a train is not at a station, make a few other adjustments, and get some nicer data. Now it looks something like this.

   CarCount  CircuitId  DestinationStationCode  DirectionNum  LineCode  SecondsAtLocation  ServiceType  TrainId  retrieved_on  Track  SeqNum  StationCode  StartTime            TripId        DateTime             Date        Time
0  6         2574       J03                     2             BL        5                  Normal       327      1469175470    2      418     G05          2016-07-22 04:17:50  2016-07-22_0  2016-07-22 04:17:50  2016-07-22  04:17:50
1  8         3370       N06                     2             SV        5                  Normal       54       1469175885    2      81      N02          2016-07-22 04:24:45  2016-07-22_1  2016-07-22 04:24:45  2016-07-22  04:24:45
2  8         3359       N06                     2             SV        0                  Normal       54       1469175970    2      70      N03          2016-07-22 04:24:45  2016-07-22_1  2016-07-22 04:26:10  2016-07-22  04:26:10
3  8         3352       N06                     2             SV        5                  Normal       54       1469176050    2      63      N04          2016-07-22 04:24:45  2016-07-22_1  2016-07-22 04:27:30  2016-07-22  04:27:30
4  8         1148       C15                     2             YL        30                 Normal       66       1469176365    2      12      C14          2016-07-22 04:32:45  2016-07-22_2  2016-07-22 04:32:45  2016-07-22  04:32:45
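Dropping the rows where a train is between stations boils down to a lookup from circuit ID to station code. Here is a rough sketch of that step, assuming the circuit-to-station mapping has already been pulled from the other WMATA API into a plain dict; the function name and the dict layout are my own, and circuit 9999 is a made-up between-stations circuit.

```python
def keep_station_rows(positions, circuit_to_station):
    """Keep only positions whose circuit belongs to a station.

    circuit_to_station maps CircuitId -> StationCode and contains
    station circuits only, so a missing key means "not at a station".
    """
    at_station = []
    for pos in positions:
        station = circuit_to_station.get(pos["CircuitId"])
        if station is not None:
            at_station.append({**pos, "StationCode": station})
    return at_station


# Circuit/station pairs taken from the sample rows above.
circuit_to_station = {2574: "G05", 3370: "N02", 3359: "N03", 3352: "N04", 1148: "C14"}

positions = [
    {"TrainId": "327", "CircuitId": 2574},
    {"TrainId": "054", "CircuitId": 9999},  # hypothetical circuit between stations
    {"TrainId": "054", "CircuitId": 3370},
]

for row in keep_station_rows(positions, circuit_to_station):
    print(row)
```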

Use the data

The first thing I wanted to do was display the trip information graphically, since I now had it. After looking at a few different days, I settled on August 9th, which shows some problems with the red line. (Click the [RD] button to see the red line.)

As you can see, the red line between Shady Grove and Twinbrook was kind of chaotic that day. The red line was single tracking between Shady Grove and Twinbrook from August 9 through August 21, 2016, which is likely what caused the issues you see on the red line. You'll also note the lower density of the lines in the middle of the day. This corresponds to the increased headways during the non-rush hours. The other lines that day looked pretty much ok. You can find them here. You might notice some lines appearing or disappearing. This could be a problem with my method of generating trip IDs, or it could be related to the fact that trains can go into or out of service in the middle of a line.

We can also, like Erik, plot the distribution of headways and the distribution of wait times.

Here I've looked at trip times between 7am and 7pm. Keep in mind that in some cases there may be several lines that work for you. This isn't accounted for here; I've assumed that you're waiting for a train on a particular line in a particular direction. This also ignores the fact that not all trains cover the whole line. That would be relatively easy to account for, though, by sorting the trips by destination station.
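For anyone who wants to reproduce the numbers, here is a minimal sketch of how headways and the average wait can be computed. It assumes one arrival timestamp per trip per station (for example, the first time a trip is seen there) and a rider who shows up at a uniformly random moment; the function names and the toy arrival times are made up for illustration.

```python
from collections import defaultdict


def headways_by_group(arrivals):
    """Headways in seconds between consecutive arrivals.

    arrivals: iterable of (station, line, direction, epoch_seconds).
    Arrivals are grouped by (station, line, direction) so that trains
    running in opposite directions are never compared with each other.
    """
    groups = defaultdict(list)
    for station, line, direction, ts in arrivals:
        groups[(station, line, direction)].append(ts)
    result = {}
    for key, times in groups.items():
        times.sort()
        result[key] = [later - earlier for earlier, later in zip(times, times[1:])]
    return result


def expected_wait(headways):
    """Mean wait for a rider arriving at a uniformly random moment.

    Long gaps are more likely to be hit than short ones, so the mean
    wait is sum(h^2) / (2 * sum(h)), not simply mean(h) / 2.
    """
    return sum(h * h for h in headways) / (2 * sum(headways))


# Toy data: three red line trains in direction 1, two in direction 2.
toy_arrivals = [
    ("A01", "RD", 1, 0),
    ("A01", "RD", 1, 300),
    ("A01", "RD", 1, 900),
    ("A01", "RD", 2, 100),
    ("A01", "RD", 2, 400),
]

toy_headways = headways_by_group(toy_arrivals)
print(toy_headways[("A01", "RD", 1)])
print(expected_wait(toy_headways[("A01", "RD", 1)]))
```

Grouping by direction is the piece that was missing before the first update at the top of this post: without it, trains heading opposite ways get interleaved and the headways come out too short.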

Violin plots work better as a comparison mechanism (plus they look nice). Violin plots combine elements of regular old boxplots with kernel density estimates. They allow one to see statistics like median and interquartile range as well as the distribution of the data at the same time. This makes them a good type of plot for comparing data distributions.

At this scale, most of the lines look pretty similar in terms of their wait times, so let's zoom in to get a good look.

From these plots we can gather that the blue and orange lines have the worst waiting times in general, but with medians of only around 6 or 7 minutes. The silver line has a longer tail, which means you're more likely to see some longer wait times on that line than on the others. These times are really not all that bad; perhaps WMATA is vilified more than it deserves.

The headways are shorter during rush hour on weekdays on all lines except blue, so we should expect to see some variability in waiting times throughout the day on all other lines. We should also expect not to see such a pattern during the weekend, since the trains should run with consistent headways throughout the day. Indeed, this is essentially what we find. Note that the x-axes have different ranges because the trains run during different hours depending on whether it is a weekday or weekend.

You'll notice the wait times increasing on the red line early in the morning and late at night during the weekend. I'm not super familiar with the metro, but maybe someone better acquainted with the system (and the red line in particular) can account for this.

Now we can investigate how much longer you'll have to wait given that you've already waited a certain amount of time. For example, you may wonder: if you have waited 10 minutes, how much longer should you expect to wait? I've pooled all the lines together for this because there still isn't a ton of data and this process slims it a bit.

This really doesn't look as bad as the NYC plot from the previously linked blog post. The additional waiting time is pretty consistent all the way up until around 30 minutes. If your train hasn't come for 30 minutes, you may be in for a much longer wait. However, most of the time, your train will be there within the next few minutes, so it's probably best to continue waiting. For the most part, the waiting distribution curves don't have very wide tails, so the distribution is essentially memoryless.
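The "how much longer" question can be answered straight from the observed headways. Below is a small sketch of one way to do it, again assuming a rider who arrives at a uniformly random moment; the function name is made up, and this is not necessarily how the plot above was produced.

```python
def expected_additional_wait(headways, waited):
    """Expected extra wait (seconds) given `waited` seconds already spent.

    Only gaps longer than `waited` can still be in progress. A gap of
    length h leaves (h - waited) seconds of possible arrival moments,
    and over those the remaining wait averages (h - waited) / 2.
    Weighting each gap by (h - waited) gives the formula below.
    """
    remaining = [h - waited for h in headways if h > waited]
    if not remaining:
        return None  # no observed gap was this long
    return sum(r * r for r in remaining) / (2 * sum(remaining))


# Toy headways in seconds: one 5 minute gap and one 10 minute gap.
toy_gaps = [300, 600]
for waited in (0, 300, 600):
    print(waited, expected_additional_wait(toy_gaps, waited))
```

With `waited` set to 0 this reduces to the ordinary average wait. For a truly memoryless process the result would stay flat as `waited` grows, which is roughly what the curve does up to the 30 minute mark.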

There are surely many more things to look at, but this might be it for now. Let me know if you have any ideas.

Code available on GitHub.

