buffered strategy to not use eof for the final chunk #7219
onursatici merged 2 commits into develop
Signed-off-by: Onur Satici <onur@spiraldb.com>
Merging this PR will not alter performance
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.186x ❌
datafusion / vortex-file-compressed (1.186x ❌, 0↑ 9↓)
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.017x ➖, 1↑ 1↓)
datafusion / vortex-compact (0.994x ➖, 1↑ 0↓)
datafusion / parquet (1.029x ➖, 1↑ 2↓)
datafusion / arrow (1.052x ➖, 1↑ 4↓)
duckdb / vortex-file-compressed (1.016x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.045x ➖, 0↑ 0↓)
duckdb / parquet (0.968x ➖, 6↑ 2↓)
duckdb / duckdb (1.061x ➖, 0↑ 6↓)
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.997x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.016x ➖, 0↑ 1↓)
datafusion / parquet (0.985x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.002x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.986x ➖, 0↑ 0↓)
duckdb / parquet (1.004x ➖, 0↑ 0↓)
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.936x ➖, 29↑ 0↓)
datafusion / vortex-compact (0.964x ➖, 21↑ 1↓)
datafusion / parquet (0.910x ➖, 41↑ 0↓)
duckdb / vortex-file-compressed (0.922x ➖, 41↑ 1↓)
duckdb / vortex-compact (0.993x ➖, 0↑ 1↓)
duckdb / parquet (0.974x ➖, 6↑ 0↓)
duckdb / duckdb (0.986x ➖, 4↑ 0↓)
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (0.918x ➖, 9↑ 0↓)
datafusion / vortex-compact (0.961x ➖, 3↑ 0↓)
datafusion / parquet (0.904x ➖, 12↑ 0↓)
datafusion / arrow (0.892x ✅, 10↑ 0↓)
duckdb / vortex-file-compressed (0.850x ✅, 20↑ 0↓)
duckdb / vortex-compact (0.879x ✅, 16↑ 0↓)
duckdb / parquet (0.937x ➖, 3↑ 0↓)
duckdb / duckdb (0.991x ➖, 0↑ 0↓)
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.114x ➖, 1↑ 6↓)
datafusion / vortex-compact (1.269x ➖, 0↑ 10↓)
datafusion / parquet (1.027x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.018x ➖, 0↑ 2↓)
duckdb / vortex-compact (1.037x ➖, 0↑ 0↓)
duckdb / parquet (1.065x ➖, 0↑ 0↓)
Benchmarks: FineWeb S3
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.464x ❌, 0↑ 5↓)
datafusion / vortex-compact (0.986x ➖, 0↑ 0↓)
datafusion / parquet (1.061x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.895x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.933x ➖, 0↑ 0↓)
duckdb / parquet (1.065x ➖, 0↑ 0↓)
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.939x ➖, 2↑ 0↓)
duckdb / vortex-compact (0.925x ➖, 1↑ 0↓)
duckdb / parquet (0.976x ➖, 0↑ 0↓)
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.948x ➖, 5↑ 0↓)
datafusion / parquet (0.930x ➖, 10↑ 1↓)
duckdb / vortex-file-compressed (1.044x ➖, 0↑ 7↓)
duckdb / parquet (1.015x ➖, 0↑ 1↓)
duckdb / duckdb (0.999x ➖, 1↑ 1↓)
Benchmarks: Random Access
Vortex (geomean): 0.905x ➖
unknown / unknown (0.973x ➖, 5↑ 1↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy confidence)
datafusion / vortex-file-compressed (1.126x ➖, 0↑ 4↓)
datafusion / vortex-compact (1.069x ➖, 1↑ 3↓)
datafusion / parquet (0.990x ➖, 2↑ 3↓)
duckdb / vortex-file-compressed (1.033x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.081x ➖, 0↑ 1↓)
duckdb / parquet (1.102x ➖, 0↑ 1↓)
Benchmarks: Compression
Vortex (geomean): 0.975x ➖
unknown / unknown (0.949x ➖, 23↑ 3↓)
Summary
Fixes the buffered layout writer so that it no longer writes the final chunk at the EOF pointer. EOF should only be used for data that the writer wants to place at the end of the file. The buffered writer was writing regular buffered data there, which messed up the ordering of some segments.
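To make the ordering problem concrete, here is a toy sketch of a file with two placement targets, a body and an EOF region. Everything in it (`ToyFile`, `Segment`, `write_toy_file`, `columns_are_contiguous`) is hypothetical and is not Vortex's writer API; it only illustrates one way that sending the final chunk to EOF could separate a column's last chunk from the rest of its data.

```rust
/// Hypothetical segment label: which column it belongs to and its chunk index.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Segment {
    pub column: &'static str,
    pub chunk: usize,
}

/// Toy file with two placement targets (an assumption, not Vortex's types).
#[derive(Default)]
pub struct ToyFile {
    body: Vec<Segment>,
    eof: Vec<Segment>,
}

impl ToyFile {
    pub fn write_body(&mut self, seg: Segment) {
        self.body.push(seg);
    }

    pub fn write_eof(&mut self, seg: Segment) {
        self.eof.push(seg);
    }

    /// Final order: body segments in write order, then everything sent to EOF.
    pub fn finish(self) -> Vec<Segment> {
        let mut out = self.body;
        out.extend(self.eof);
        out
    }
}

/// Writes `chunks` chunks per column, one column after another.
/// With `final_chunk_at_eof` set, the last chunk of each column goes to EOF,
/// mimicking the behaviour this PR describes as the bug.
pub fn write_toy_file(
    columns: &[&'static str],
    chunks: usize,
    final_chunk_at_eof: bool,
) -> Vec<Segment> {
    let mut file = ToyFile::default();
    for &column in columns {
        for chunk in 0..chunks {
            let seg = Segment { column, chunk };
            let is_final = chunk + 1 == chunks;
            if final_chunk_at_eof && is_final {
                file.write_eof(seg);
            } else {
                file.write_body(seg);
            }
        }
    }
    file.finish()
}

/// True if every column's segments sit next to each other in the file.
pub fn columns_are_contiguous(segments: &[Segment]) -> bool {
    let mut seen: Vec<&str> = Vec::new();
    for seg in segments {
        match seen.last() {
            Some(last) if *last == seg.column => {}
            _ => {
                if seen.contains(&seg.column) {
                    return false;
                }
                seen.push(seg.column);
            }
        }
    }
    true
}
```

In this toy model, two columns with two chunks each come out as a0, b0, a1, b1 when the final chunk goes to EOF, and as a0, a1, b0, b1 when it goes to the body.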
Previously, the struct writer used a transposed stream without spawning a task per column; in that world, buffering could deadlock. It was changed a while ago to spawn a task per column, so we should now be deadlock safe.
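A rough sketch of the task-per-column shape, with OS threads and bounded channels from the standard library standing in for async tasks. The function name `write_with_column_workers` and its parameters are made up for illustration, and this is not how Vortex implements it; the point is only that each column drains its own channel independently, so a column that is still buffering does not stop the producer from feeding the others.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends `chunks` chunks to each of `columns` workers and returns how many
/// chunks each worker flushed. Every worker buffers up to `buffer_chunks`
/// chunks before flushing. All names here are hypothetical.
pub fn write_with_column_workers(
    columns: usize,
    chunks: usize,
    buffer_chunks: usize,
) -> Vec<usize> {
    let mut senders = Vec::new();
    let mut handles = Vec::new();

    for _ in 0..columns {
        // Capacity 1 keeps the channel tightly bounded, so the producer
        // relies on each worker making progress on its own.
        let (tx, rx) = sync_channel::<u64>(1);
        senders.push(tx);
        handles.push(thread::spawn(move || {
            let mut buffered: Vec<u64> = Vec::new();
            let mut flushed = 0usize;
            for chunk in rx {
                buffered.push(chunk);
                if buffered.len() >= buffer_chunks {
                    flushed += buffered.len();
                    buffered.clear();
                }
            }
            // Channel closed: flush whatever is still buffered.
            flushed += buffered.len();
            flushed
        }));
    }

    // Producer: hand one chunk to every column before moving to the next row group.
    for i in 0..chunks {
        for tx in &senders {
            tx.send(i as u64).expect("column worker exited early");
        }
    }
    // Dropping the senders closes the channels and lets the workers finish.
    drop(senders);

    handles
        .into_iter()
        .map(|h| h.join().expect("column worker panicked"))
        .collect()
}
```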
I tried converting all ClickBench files repeatedly, as well as the Public BI datasets and randomly generated wide tables, but could not get this to deadlock.
fixes #7234
fixes #7236
Testing
Adds a Vortex file test that asserts the dict layout segments are in the right order, as well as zone maps across columns.
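The kind of ordering assertion described above could look roughly like the helper below. It is a standalone sketch, not the test added in this PR: `Placed`, `first_out_of_order`, and the segment labels used with it are hypothetical, and the expected order is passed in by the caller because the description does not spell out which order is the right one.

```rust
/// Hypothetical record of where a named segment landed in the file.
pub struct Placed {
    pub name: &'static str,
    pub offset: u64,
}

/// Walks `expected` in order and returns the first name that is missing
/// from `placed` or whose offset does not come after the previous one.
/// Returns `None` when the file order matches the expected order.
pub fn first_out_of_order<'a>(placed: &[Placed], expected: &[&'a str]) -> Option<&'a str> {
    let mut prev: Option<u64> = None;
    for &name in expected {
        let offset = match placed.iter().find(|p| p.name == name) {
            Some(p) => p.offset,
            None => return Some(name),
        };
        if let Some(prev_offset) = prev {
            if offset <= prev_offset {
                return Some(name);
            }
        }
        prev = Some(offset);
    }
    None
}
```

A test would collect the segment offsets from the written file, then assert that `first_out_of_order` returns `None` for the order it expects.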