Rework Scheme estimation in compressor #7230
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.011x ➖
datafusion / vortex-file-compressed (1.011x ➖, 0↑ 0↓)
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.601x ❌, 0↑ 15↓)
datafusion / vortex-compact (1.146x ➖, 0↑ 4↓)
datafusion / parquet (1.327x ❌, 0↑ 8↓)
duckdb / vortex-file-compressed (1.168x ➖, 0↑ 3↓)
duckdb / vortex-compact (1.132x ➖, 0↑ 2↓)
duckdb / parquet (1.204x ➖, 0↑ 6↓)
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.376x ❌, 0↑ 5↓)
datafusion / vortex-compact (1.051x ➖, 0↑ 2↓)
datafusion / parquet (1.153x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (1.141x ➖, 0↑ 2↓)
duckdb / vortex-compact (1.123x ➖, 0↑ 2↓)
duckdb / parquet (1.053x ➖, 0↑ 0↓)
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.995x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.007x ➖, 0↑ 0↓)
duckdb / parquet (1.009x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=1 on NVMe
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.917x ➖, 6↑ 0↓)
datafusion / vortex-compact (0.923x ➖, 6↑ 0↓)
datafusion / parquet (0.989x ➖, 0↑ 1↓)
datafusion / arrow (0.890x ✅, 11↑ 0↓)
duckdb / vortex-file-compressed (0.935x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.951x ➖, 2↑ 0↓)
duckdb / parquet (0.955x ➖, 7↑ 2↓)
duckdb / duckdb (0.999x ➖, 0↑ 2↓)
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.983x ➖, 2↑ 0↓)
datafusion / vortex-compact (1.028x ➖, 0↑ 1↓)
datafusion / parquet (1.020x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.030x ➖, 0↑ 2↓)
duckdb / vortex-compact (0.999x ➖, 0↑ 1↓)
duckdb / parquet (1.019x ➖, 0↑ 0↓)
Benchmarks: TPC-DS SF=1 on NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.919x ➖, 24↑ 0↓)
datafusion / vortex-compact (0.943x ➖, 5↑ 0↓)
datafusion / parquet (0.935x ➖, 11↑ 0↓)
duckdb / vortex-file-compressed (0.954x ➖, 10↑ 1↓)
duckdb / vortex-compact (0.958x ➖, 12↑ 3↓)
duckdb / parquet (0.958x ➖, 4↑ 0↓)
duckdb / duckdb (0.956x ➖, 13↑ 0↓)
Benchmarks: Random Access
Vortex (geomean): 0.742x ✅
unknown / unknown (0.811x ✅, 47↑ 1↓)
🚨🚨🚨❌❌❌ SQL BENCHMARK FAILED ❌❌❌🚨🚨🚨
Benchmarks: Compression
Vortex (geomean): 0.996x ➖
unknown / unknown (0.979x ➖, 1↑ 2↓)
Summary
Tracking Issue: #7216
Better information from estimates
New `CompressionEstimate` type in `vortex-compressor/src/estimate.rs`.
Note that this is not just a refactor: subtle logic has changed in a few places (I think for the better, but I'm not actually sure).
I also would like to add a variant called `Exact` that returns the fully compressed array, for the case where we can only determine whether a scheme is a candidate by compressing the whole thing without any errors. The only case where we want to do this is `SequenceArray` (and maybe there's an argument to do this for `ConstantArray` too, but the semantics around `ConstantArray` should be even more special regardless, imo).
TODO
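To make the `Exact` idea concrete, here is a minimal sketch of how such a variant could look. It is not the code in this PR: the `Skip` and `Ratio` variants, the `Sequence` struct, and the `estimate_sequence` function are hypothetical names chosen for illustration, and a plain `&[i64]` stands in for a real array type. The point is only that a sequence-style scheme cannot know it applies until it has checked every element, at which point the compressed form is already available and can be returned instead of recomputed.

```rust
/// Hypothetical stand-in for a compressed sequence: base, base + step, base + 2 * step, ...
#[derive(Debug, Clone, PartialEq)]
pub struct Sequence {
    pub base: i64,
    pub step: i64,
    pub len: usize,
}

/// Sketch of an estimate type with an `Exact` variant.
/// `Skip` and `Ratio` are assumptions, not variants confirmed by the PR.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
pub enum CompressionEstimate<A> {
    /// The scheme does not apply to this input.
    Skip,
    /// An estimated compression ratio, for example derived from a sample.
    Ratio(f64),
    /// Deciding required compressing the whole input, so carry the result along.
    Exact(A),
}

/// Checks whether `values` is an arithmetic sequence. Doing so requires a full
/// pass over the data, so on success the compressed form is returned directly.
pub fn estimate_sequence(values: &[i64]) -> CompressionEstimate<Sequence> {
    if values.len() < 2 {
        return CompressionEstimate::Skip;
    }
    // Use checked subtraction so an overflowing difference disqualifies the scheme.
    let Some(step) = values[1].checked_sub(values[0]) else {
        return CompressionEstimate::Skip;
    };
    let is_sequence = values
        .windows(2)
        .all(|w| w[1].checked_sub(w[0]) == Some(step));
    if is_sequence {
        CompressionEstimate::Exact(Sequence {
            base: values[0],
            step,
            len: values.len(),
        })
    } else {
        CompressionEstimate::Skip
    }
}
```

In this sketch a caller that receives `Exact` can use the carried value as the compressed output rather than invoking the scheme a second time.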
API Changes
TODO
Testing
TODO