Time Series Data Structures

Every project I’ve ever worked on included some form of temporal data. Even with that constant, the format was different every time, and the differences seemed arbitrary, with varying degrees of success.

For the sake of discussion, we need to set out some definitions.

<<timestamp>>: a time value of any sort; for our discussion, consider it whatever date and/or time related value you're concerned with
<<start>>: a time value for the OLDEST or MOST HISTORIC time frame of the sequence
<<end>>: a time value for the NEWEST or MOST FUTURISTIC time frame of the sequence
<<dt>>: a length of time the value is effective for
<<sequence 'x'>>: a unique title, name, or identifier for a given time sequence (in this case 'x')
<<resolution>>: the expected frequency of data for one or more sequences
<<max resolution>>: an upper bound on the data frequency, for sequences of non-constant frequency
<<value>>: some string, numeric, or complex structure that you are interested in

Here are just a few example formats. Everything has its pros and cons, and the list is in no way exhaustive.

Simple t/v sequence

Pretty straightforward, but you had better remember the context of your call. There are plenty of downsides here: we know a time and a value, but have no idea what sequence we are looking at, what time boundaries we requested, or what frequency the data should be in.

[
  { "t": <<timestamp>>, "v": <<value>> },
  { "t": <<timestamp>>, "v": <<value>> },
  { "t": <<timestamp>>, "v": <<value>> },
  ... and so on ...
]
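In code, consuming this format might look something like the Python sketch below. The values, field handling, and the idea of carrying the context separately are my own illustration, not any particular API.

# A bare t/v sequence: just a list of {"t": ..., "v": ...} points.
# Note everything the structure does NOT tell us: which series this is,
# what range was requested, or what resolution to expect.
from datetime import datetime

points = [
    {"t": "2013-11-01T00:00:00Z", "v": 12.5},
    {"t": "2013-11-01T01:00:00Z", "v": 13.1},
    {"t": "2013-11-01T02:00:00Z", "v": 11.8},
]

# The caller has to carry the context around separately, e.g. "this is
# CPU load for host 'x', requested at hourly resolution".
for point in points:
    t = datetime.strptime(point["t"], "%Y-%m-%dT%H:%M:%SZ")
    print(t, point["v"])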

Sequence with context

We get some context back with our data, mainly stats about each series. We can also ask for multiple series at once, getting back as many series as we want. There is a trade-off between convenience of access and data redundancy in the structure itself.

[
  {
    id: "Series 'x'",
    start: <<start>>,
    end: <<end>>,
    stats: {
      min: <<value>>,
      max: <<value>>,
      avg: <<value>>,
      sample_count: <<value>>,
      ... and so on ...
    },
    values: [
      { t: <<timestamp>>, v: <<value>> },
      { t: <<timestamp>>, v: <<value>> },
      { t: <<timestamp>>, v: <<value>> },
      ... and so on ...
    ]
  }
  ... and some more sequences ...
]
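As a rough sketch of how a producer might assemble this shape, consider the Python below. The wrap_series helper and the sample points are mine, following the field names in the example above; this is not a real API.

def wrap_series(series_id, points):
    """Wrap raw {"t": ..., "v": ...} points with the context block above."""
    values = [p["v"] for p in points]
    return {
        "id": series_id,
        "start": points[0]["t"],
        "end": points[-1]["t"],
        "stats": {
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "sample_count": len(values),
        },
        # The redundancy trade-off: every point still repeats its own timestamp.
        "values": points,
    }

response = [
    wrap_series("Series 'x'", [
        {"t": "2013-11-01T00:00:00Z", "v": 12.5},
        {"t": "2013-11-01T01:00:00Z", "v": 13.1},
    ]),
    # ... and some more sequences ...
]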

Fixed resolution multi-series

A more complicated structure, but compact thanks to its lack of redundancy. The downside of this format is that it expects a fixed resolution for each sequence. If we have varying resolutions, this format gets a little ridiculous.

{
  sequences: {
    <<series 'x'>>: {
      stats: ... stats from previous example ...,
      values: [
        <<value>>,
        <<value>>,
        <<value>>,
        ... and so on ...
      ]
    },
    <<series 'y'>>: ... same pattern ...,
    <<series 'z'>>: ... same pattern ...,
    ... and more named series ...
  },
  start: <<start>>,
  end: <<end>>,
  resolution: <<resolution>>
}
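The compactness comes from the timestamps being implicit: the i-th value of every sequence belongs to start + i * resolution. Here is a small Python sketch of that reconstruction, assuming for illustration that the resolution is given in seconds per sample (the values are invented).

from datetime import datetime, timedelta

# Example response in the fixed-resolution shape above.
response = {
    "sequences": {
        "series 'x'": {"values": [12.5, 13.1, 11.8]},
        "series 'y'": {"values": [0.2, 0.4, 0.3]},
    },
    "start": "2013-11-01T00:00:00Z",
    "end": "2013-11-01T02:00:00Z",
    "resolution": 3600,  # assumption: seconds per sample
}

start = datetime.strptime(response["start"], "%Y-%m-%dT%H:%M:%SZ")
step = timedelta(seconds=response["resolution"])

# Timestamps are reconstructed, never stored - which is exactly what breaks
# down when sequences do not share a constant frequency.
for name, sequence in response["sequences"].items():
    for i, value in enumerate(sequence["values"]):
        print(name, start + i * step, value)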
Written by mackay on November 23, 2013. Categories: API

Tagging Infrastructure

If you don’t have a tagging meta-structure on your core resources, implement one now. Tagging has to be the simplest, easiest, most search-able-iest approach to putting structure on your otherwise slapdash bundle of data. Pretend you are 13; let’s play a game straight out of Dorky and Thirteen:

* Do you want to organize your data?
* Would you like to easily find interesting elements of your data?
* Do you think having a structure to your data is pretty kickin’ rad?

Well, buddy, if you answered ‘yes’ to all three of the above questions and you *haven’t* implemented a tagging system in your data, what the hell is wrong with you?!

See, wasn’t that fun?

Show Me the Tagging!

Now that you are clearly convinced that including some form of tagging is the best thing ever, how about a simple example? Take a simple collection of data such as this:

{
  "uri":"/a_thing/12",
  "title":"Wonderful Stuff in Here, Man!",
  "picture":"https://superstuff.com/amazing/img.jpg",
  "body": <<best stuff since sliced bread for 54k of characters>>
}

Well, that was great. Apparently this is the coolest resource ever, but we won’t know unless we read through the 54k of body. Sure, we could search that monstrosity if we really wanted to, but what would happen if we added tags?

{
  "uri":"/a_thing/12",
  "title":"Wonderful Stuff in Here, Man!",
  "picture":"https://superstuff.com/amazing/img.jpg",
  "body": <<best stuff since sliced bread for 54k of characters>>,
  "tags": [
    "NSA",
    "Aliens",
    "Osama Bin Laden",
    "Applebee's",
    "Secret Government Takeover",
    "Breaking Bad"
  ]
}

Now that’s more like it! We know it involves a person, a couple of organizations, terrible food, and one of my favorite television shows. We could render this data, we could search for it with query params, we could cluster it. A user could add their own tag without changing the actual content of the body. There’s a lot you can do with tagging; I’ll leave it to Wikipedia to cover tagging and folksonomies.
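As a rough sketch of the "search with query params" idea, here is what tag-based filtering could look like behind something like GET /a_thing?tag=Aliens&tag=Breaking+Bad. The endpoint, the find_by_tags helper, and the second resource are invented for illustration; only the resource shape follows the example above.

resources = [
    {"uri": "/a_thing/12", "title": "Wonderful Stuff in Here, Man!",
     "tags": ["NSA", "Aliens", "Osama Bin Laden", "Applebee's",
              "Secret Government Takeover", "Breaking Bad"]},
    {"uri": "/a_thing/13", "title": "Lesser Stuff",
     "tags": ["Applebee's"]},
]

def find_by_tags(resources, wanted):
    """Return resources carrying every requested tag (case-insensitive)."""
    wanted = {tag.lower() for tag in wanted}
    return [r for r in resources
            if wanted <= {tag.lower() for tag in r["tags"]}]

# Matches /a_thing/12 without ever touching the 54k body.
print(find_by_tags(resources, ["aliens", "breaking bad"]))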

Bring Out Your Tags

Generate your own? Sure. Have someone do it for you? Now that is even better.

Some DIY libraries to give you a leg up…

https://github.com/apresta/tagger
https://pypi.python.org/pypi/topia.termextract
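If you want a feel for the do-it-yourself route before handing the job to one of those libraries, a naive frequency-based extractor is only a few lines. This is a toy sketch of the general idea, not how tagger or topia.termextract actually work:

import re
from collections import Counter

# Toy do-it-yourself tag suggestion: split the body into words, drop stop
# words, and keep the most frequent remaining terms. Real term extractors
# (noun-phrase chunking, entity recognition) do far better than this.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "it",
              "for", "at"}

def suggest_tags(body, limit=6):
    words = re.findall(r"[a-z']+", body.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(limit)]

print(suggest_tags("The aliens met the NSA at Applebee's, and the NSA "
                   "denied the aliens ever existed."))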

3rd party tools to do some heavy lifting…

www.opencalais.com
http://developer.yahoo.com/contentanalysis
www.alchemyapi.com

Lastly, sources of data to build a folksonomy from…

http://dbpedia.org/About
www.freebase.com
musicbrainz.org

Written by mackay on November 14, 2013. Categories: API

Projects and Risk Granularity

You track software project risk items in an Excel spreadsheet. You know you do; don’t lie to me. Even if you have a handful of SaaS and in-house Project Management® suites, you still keep that one little private spreadsheet of risks and plans.

So how detailed do you go? How many items do you add into that sheet? And how much value do you get out of communicating those items at the end of the project?

How Much Detail is Too Much Detail in Risk Tracking

We need to start tracking risks before we are able to communicate them. So at what level do we identify them? Depending on the size of your projects, your approach may differ. For me, tracking risk at each deliverable is perfect. The project level (and the program level) is too high; the context for the risk gets muddled. The task level is too granular; risks are repeated and seem arbitrary. At the deliverable level, risk items fit with the story.

Program
  Pro: Tied to the full vision
  Con: No context outside of the “100k foot level”
Project
  Pro: Tighter vision; a more defined “need”
  Con: Still crosses over many functional contexts
Deliverable
  Pro: Concrete need; avoids “vision” abstraction
  Con: Possible to get into implementation details
Task
  Pro: Can’t get closer to a set of work than this
  Con: Lacks the tie to an end-game goal

How many should you track? Again, your approach may differ; I limit myself to three. This is an absolutely arbitrary rule of thumb: if more than three are needed, consider re-examining the deliverable to see if splitting it up makes sense. Having a small set of risks keeps the focus when it is time to communicate to the team.

So How Much Should be Communicated

Communicate as much as a team can handle, and then a little less. The team has other concerns such as the deliverable at hand, testing, builds breaking, network oddities, and so on. With only so much time in the day, every moment taken away adds frustration. Keeping with platitudes: a little less is a whole lot more.

Never forget the cost of context switching and partial attention. When a set of ranked items gets large enough, it eventually breaks down into “important” and “not important”. At that point some items are simply ignored, and the rest either spread attention too thin or create wasted time. Blowing past the schedule makes the whole exercise moot. The closer that list is to zero, and the more you mitigate yourself, the smoother the team’s deliveries will be.

And the Moral of the Story Is…

Keep your risk lists small and your communications even smaller. For the most part, keep it all to yourself. Communicate only one risk per deliverable, and only until acting on that risk stops being valuable. Then move on.

Context switches are killers, and they are everywhere. The team has enough to worry about without adding another fistful of distractions. With one item to be mindful of per deliverable, a team can be successful. Spreading focus across ten, they will be exceptionally mediocre.

Written by mackay on November 5, 2013. Categories: Projects