

1 Overview

The datamash program (http://www.gnu.org/software/datamash) performs calculations (e.g. sum, count, min, max, skewness, standard deviation) on input files. A simple example that sums the values in the first column of the input:

$ seq 10 | datamash sum 1
55

datamash can group input data and perform operations on each group. It can also sort the file and read header lines. The following example finds the average score of college students in a statistics course, grouped by their college major.

The input file has three fields (Name, Major, Score):
$ cat scores.txt
Name        Major            Score
Bryan       Arts             68
Isaiah      Arts             80
Gabriel     Health-Medicine  100
Tysza       Business         92
Zackery     Engineering      54
...
Sorting the input file and grouping by the second column (Major),
then calculating the mean score (third column) and the sample standard deviation:
$ datamash --sort --headers --group 2 mean 3 sstdev 3 < scores.txt
GroupBy(Major)     mean(Score)   sstdev(Score)
Arts               68.9474       10.4215
Business           87.3636       5.18214
Engineering        66.5385       19.8814
Health-Medicine    90.6154       9.22441
Life-Sciences      55.3333       20.606
Social-Sciences    60.2667       17.2273

datamash is designed for interactive exploration of textual data, and for automating tasks in shell scripts.

datamash has a rich set of statistical functions to quickly assess information in textual input files. An example of calculating basic statistics (mean, 1st quartile, median, 3rd quartile, IQR, sample standard deviation, and the p-value of the Jarque-Bera test for normal distribution):

$ datamash -H mean 1 q1 1 median 1 q3 1 iqr 1 sstdev 1 jarque 1 < FILE
mean(x)   q1(x)  median(x)  q3(x)   iqr(x)  sstdev(x)  jarque(x)
45.32     23     37         61.5    38.5    30.4487    8.0113e-09
