Unix table column/row manipulation using column names

pick is an expressive low-memory command-line tool for manipulating text file tables. Entire scripts can be replaced by concise command line invocations.

Each line/row is treated as a list of variables identified by column names. You can

perform computations to make new variables
filter rows based on boolean clauses operating on variables
for output columns take any subset from existing and newly computed variables

Some examples:

pick -c < data.txt                  # Count rows and validate table file.

pick foo bar                        # Pick columns named foo and bar (header included)

pick -A @tim/gt/0                   # Pick all columns, subset to rows where tim > 0

pick -A qat::foo:bar,add            # Add new column qat, sum of columns foo and bar

pick -A @tim/gt/0 qat::foo:bar,addd # Pick all columns, subset rows, add new column qat

pick -h ::end:start,sub^1,add       # Compute new column with inclusive interval length, omit header

pick -k ::3:4,sub^1,add             # As above, in the absence of column headers

pick digits::foo^'(\d+)',get        # Extract digits from column foo, store in new column 'digits'

pick -A @foo/gt/1 foo:=bar,abs      # Select on foo = abs(bar) without outputting foo

pick --kdict-KEEP=dict.txt @tim~isin~KEEP  # subset rows using a dictionary

Pick is a standalone script that depends only on a standard perl installation - it should work on any standard Linux (or Unix) OS and all Mac OS releases with developer tools installed.

wget https://raw.githubusercontent.com/micans/pick/main/pick

pick is robust and intuitive by supporting column names as handles. It is lightweight as it processes data per-line without the need to load the table into memory. It is expressive in that short command lines are sufficient to get at the data.

Pick works very well as a pre-filter to datamash - a tool that can compute statistics over columns, optionally grouped over a second (column) variable, and tools such as csvtk, which can (among many more things) sort tables using column names.

Additionally pick has extensive support for SAM format such as printing alignments and outputting alignment-derived quantities like coverage and base mismatch information. A reference FASTA file can be specified, which is then used by pick to automatically slot in reference sequences where needed. Alignments can be queried and printed limited to an interval of interest (by reference coordinates).

In comparison to the tools below pick is more limited in scope in some/many ways, instead complemented by tools such as datamash and sort. It is probably a lot slower than most tools mentioned below. Distinct advantages are its powerful and concise compute and filter primitives.

Note

For your benefit, miller (unix command mlr) is an amazing widely-used command-line tool for handling tables (using column names also). It is available in most Linux distributions as a supported package. Also on my watch-list are csvtk, sq, qsv, tsv-utils - all browsed from this overview.

I'm also looking at polars. Its ability to process streams, syntax, speed, efficiency and Rust implementation all look great, as is its elevation of the dataframe as a pivotal computable unit. I need to investigate whether pick command lines can be expressed as polars code.

In simple to middling cases pick can avoid both the need for a script (R, awk, Python, Ruby et cetera) and having to load the entire data set into memory. I use it in conjunction with UNIX tools such as comm, join, sort and datamash to simplify file-based computational workflows and make them more robust and understandable by promoting the use of column names as handles (as opposed to column indexes as used with cut and awk). You can use column names or column indexes to

Select columns
Change columns (using computation and string operations)
Combine columns into new columns (using computation and string operations)
Filter or fork rows on boolean clauses computed on old and new columns
Split/demultiplex rows to different output files based on (computed) labels in columns
Select multiple columns using ranges or regular expressions
Take the same action on multiple columns using a lambda expression
Import dictionaries to test membership or to map values

There is no downside, except, as ever, it comes with its own syntax for computation. For plain column selection and row filtering this syntax is not needed though; pick command lines look pleasant enough for common use cases.

Computation syntax is minimalist and terse, employing a stack language with just three types (variables, constants and operators). Accordingly, there is almost nothing to learn. In order to work efficiently as a command line tool, the pick computation language does away with whitespace entirely and uses punctuation to distinguish the three types.

Full documentation is in the Pick User Manual

Name		Name	Last commit message	Last commit date
Latest commit History 523 Commits
align		align
doc		doc
examples		examples
test		test
utils		utils
.gitignore		.gitignore
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
THESPACEINBETWEEN.md		THESPACEINBETWEEN.md
ideas.txt		ideas.txt
pick		pick
quants		quants

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Unix table column/row manipulation using column names

About

Uh oh!

Releases 2

Packages

Uh oh!

Languages

License

micans/pick

Folders and files

Latest commit

History

Repository files navigation

Unix table column/row manipulation using column names

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 2

Packages 0

Uh oh!

Languages

Packages