Research Update

Content Warning: Boring Research Update


Exascale Monitoring and Prediction

For those of you not aware of the project, we are using data mining to learn more about HPC resource monitoring systems, with the goal of making monitoring operations more efficient and predicting failures.

Dorian, Mueen, and I met with LANL folks this week to get our project off the ground. They seem rather motivated to do this, and it feels like we’ll have some great stuff come of this (eventually).


My LANL contact is working on finding out how to handle administrivia - accounts, access, security issues, etc. My main request from him was for a sample of any of the data we’ll be working with so we can inform our decisions on data engineering early on. I won’t be able to move much on this until this stuff is resolved.

In previous weeks, I’ve acquainted myself with the current tools in use there, at least at a high level - Splunk, Zenoss, RabbitMQ - by reviewing some of LANL’s open-source code and various posters/talks they have given on their use. I also got to know Alireza’s previous work and read papers related to our project (I might suggest one to you, Dorian, as a reading group paper).


I am following up with -contact person-, as he and I planned, on Monday. I will send him a summary of our discussions as I understood them, as well as an early justification for using the suite of tools I proposed. What -contact person- tells me will determine my work for the next week, which could go in two directions:

  • He sends a sample of data and/or gives us some sort of remote access

    • I’ll examine the data and assess what the current format allows us to do
  • He says we can’t have anything (yet)

    • I’ll sigh heavily.