[rrs-users] Re: rrs variable-length months

Tom Flavel tom at printf.net
Mon Apr 18 06:44:20 GMT 2005

On 18/04/2005 01:08:02, Jim C. Nasby wrote:
> I'm cc'ing rrs-users at rrs.decibel.org, since others might be interested
> in this. You might want to subscribe.

Have done

> On Thu, Apr 14, 2005 at 06:11:53AM +0100, Tom Flavel wrote:
> > Hi,
> > 
> > I'm looking to log bandwidth and similar types of data, and RRS looks
> > like it'll do roughly what I need.
> > 
> > My data comes from SNMP: I wrote a function for pgsql which goes and
> > gets the data (internally it finds the IPs and passwords for SNMP, etc
> > from other tables), and materialises it as a table, so you can "SELECT
> > getsnmpdata()" and be presented with the various rows. This is entirley
> > irrelevant, but I thought you may be interested.
> Actually, that's an interesting approach. I'd like to include your code
> if I can. I guess it's time for a demo directory or something...

Certinaly. I'll be releasing the entire system when it's ready, but you
can have just that code, if you want.

That approach has advamtages for security for things like credit card
transactions, too - SELECT doatransaction() would return success or failure,
and do the actual transaction server-side. This means the table itself
dosent need to GRANT SELECT, so there's no way a credit card number could
ever be retrieved over the connection. (I have some SECURITY DEFINER
function which returns the numbers in the form "***********xxx")

> > Now, I'd like to take 95th percentiles of the data per-month. Month
> > lengths are different per-month. My thoughts are, that if the RRD has
> > more than enough rows, I can just ignore the latter couple for shorter
> > months (right?).
> Actually, 'last month' isn't a very accurate label. It should really be
> last 30 days. Right now, the only way could could actually do a real
> 'last month' would be to pull directly from the source data one month at
> a time, which wouldn't be terribly useful.

See below, re source_status

> If you take a look at the rrs table for the last month entry, it's
> defined to keep 168 buckets. Each bucket is 4 hours long, although the
> time_per_bucket field is only used by the code if the rrs doesn't have a
> parent. If you want to know how long each bucket *really* is, you need
> to trace back through the rrs's parents until you come to the top level.
> In this case, the 'last month' rrs (rrs_id 6 normally) feeds off rrs_id
> 4 (last day), which feeds off rrs_id 1 (last hour), which is the top
> level. Each bucket in rrs_id 1 is 1 minute long. Each bucket is rrs_id 4
> is the aggregate of 30 buckets of rrs_id 1, or 30 minutes. Each bucket
> in rrs_id 6 is the aggregate of 8 buckets of rrs_id 4. 8 * 30 minutes =
> 4 hours.

That's how I thought it worked; thanks for clarifying

> Eventually I'd like to have the ability to do an actual last month, but
> I haven't figured out any good way to do it.

See below, re source_status

> How were you thinking of doing a 95th percentile calculation? I think
> that would be an interesting example as well.

I was hoping to make an aggregate function. I think I can have it store
each number as an array, in the current state for the aggregate, and
then sort and return the 5%th element when it's finished. I'm a little
unsure about how to access arrays via postgresql's SAPI, though, or if
they're intended to be sorted.

As far as I understand, an array is arranged consecutivley in memory, so
I think qsort() will suffice.

Maybe inserting in order would be better, or maybe just using the state
for storing the OID of some tempoary table would turn out more effecient;
suggestions welcome.

> > I assume RRS keeps the number of the most recently updated row
> > somewhere, but I cant for the life of me find it... or does the buckets
> > approach (which I am not entirley sure I follow) mean something entirley
> > unrelated is happening there?
> Take a look at the source_status table. It's got the last end time run
> for each source and rrs. You can use that info to find the specific
> bucket.

I thought simply LIMITing the retrieved data by however many days are in the
given month (which would need to be passed in as some parameter, which means
a special-case function prototype :/), assuming the data has enough rows for
all possible month lengths?

Or, if we can just SELECT WHERE start...end, then great. I'm still not
entirley sure how that's structured, or where I'm supposed to be
looking, though.

> > Any pointers are much appreciated, or advice for other approaches.
> > 
> > 
> > (I'll submit patches when I'm done, if you want them: I'm guessing
> > somebody else may want this ability, too)
> BTW, something you should be aware of in version 0.4. It's got code to
> throttle updates when it's running behind. Previously it could take
> hours or more for an update to happen if it was far behind. 0.4 will
> limit how much data is processed. But the problem is that if you add a
> new RRS to an existing system, the update of that new RRS won't work
> properly and it will miss a lot of data. If you need this functionality
> you should probably stick to 0.3 for now (or help me fix the update :P).

Thanks for the warning... I didn't realise it'd miss data for
newly-added RRSes. I'll be creating and removing them frequently, as
customers join and leave (hopefully not many of the latter :)

Ideally, I'd like to fix that, if I can. My plpgpsql skills are
non-existant, though: everything I've written is either pure SQL or C.


More information about the rrs-users mailing list