Skip to content

Count and Mean aggregates#7201

Draft
blaginin wants to merge 2 commits intodevelopfrom
db/count-and-mean
Draft

Count and Mean aggregates#7201
blaginin wants to merge 2 commits intodevelopfrom
db/count-and-mean

Conversation

@blaginin
Copy link
Copy Markdown
Member

@blaginin blaginin commented Mar 29, 2026

No description provided.

Signed-off-by: blaginin <dima@spiraldb.com>
Co-authored-by: Claude <noreply@anthropic.com>
@blaginin blaginin self-assigned this Mar 29, 2026
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: blaginin <dima@spiraldb.com>
/// Return the count of non-null elements in an array.
///
/// See [`Count`] for details.
pub fn count(array: &ArrayRef, ctx: &mut ExecutionCtx) -> VortexResult<u64> {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is counting things in an array, should it return a usize?

false
}

fn accumulate(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be able to register a generic aggregate kernel to reduce count-non-null to be Array.validity().sum(), then we avoid decompressing all the data.

}

fn partial_struct_dtype() -> DType {
DType::Struct(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you static LazyLock this entire dtype inside this function?

@gatesn gatesn added the changelog/feature A new feature label Mar 30, 2026
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be decimal for decimals?

Comment on lines +53 to +56
pub struct MeanPartial {
sum: f64,
count: u64,
}
Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will compute with a pretty large error?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

$$\mu_n = \sum_{i=0}^n x_i/n \iff \mu_{n+1} = \mu_n + \frac{x_{n+1} - \mu_n}{1+n}$$

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need only hold the partial mean and count

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can incrementally push the next value into this partial, but you cannot combine two partials afaik?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For combining you do. $$\mu_{AB} = \frac{n_A \mu_A + n_B \mu_B}{n_A + n_B}$$

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought in that formula there's even more division and hence more rounding errors?

I think there are some smart algorithms, but I haven't seen a simple implementation ever being a problem in df

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

100% agree. Let's do what DataFusion odes

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF sum is T and ours is f64

Copy link
Copy Markdown
Contributor

@joseph-isaacs joseph-isaacs Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Error comes from numbers of different scales being combined, not from number of OPS.

Since we are storing these on disk we cannot "just change it later".

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DF only supports AVG for floats, decimals, and durations. And coerces all floats to f64.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/feature A new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants