It’s about time there was something on this blog besides silly, poorly thought-out philosophy.
siv is a program I’ve been writing since around April, when I wrote this post, as part of a larger project called sreutils. It’s the first real program I’ve ever written, which I guess is why it took so long to finish.
It does what I call multi-layer regular expression matching. siv takes up to 10 regular expressions, a number of input sources and some flags, and reads through each input source doing a recursive depth-first search. First it looks for exp0, then within what exp0 matched it looks for exp1, and so on up to exp9. Most importantly, unlike grep, sed or the other UNIX core utilities, siv doesn’t break input into an array of lines; it just reads an unstructured stream of bytes. Any structure in the output comes from the regular expressions used in the search.
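As a rough illustration of the layering idea (not siv's actual implementation), here is a minimal Python sketch. The function name `layered_matches` is made up for this example, and Python's `re` module is standing in for siv's regex engine, so the syntax and matching rules may differ from what siv really does.

```python
import re

def layered_matches(data, patterns, flags=re.MULTILINE | re.DOTALL):
    """Yield one tuple per complete chain of nested matches.

    Element i of each tuple is the text matched by patterns[i], found
    inside the text matched by patterns[i - 1] (depth-first order).
    """
    def descend(text, depth, chain):
        if depth == len(patterns):
            yield tuple(chain)
            return
        # Simplification: each layer is searched as a fresh string, so
        # anchors like ^ and $ are relative to the enclosing match.
        for m in re.finditer(patterns[depth], text, flags):
            yield from descend(m.group(0), depth + 1, chain + [m.group(0)])

    yield from descend(data, 0, [])

if __name__ == "__main__":
    # Layer 0: bracketed groups. Layer 1: numbers inside each group.
    for chain in layered_matches("x [1 22] y 3 [4]", [r"\[[^\]]*\]", r"\d+"]):
        print(chain)
```

The number 3 sits outside any bracketed group, so it never reaches the second layer; only 1, 22 and 4 are reported, each paired with the group it was found in.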
This means that siv can do many things that grep can’t.
Say, for example, you have a LaTeX bibliography file like this one
@book{gibson,
author = "Gibson, J. J.",
title = "The Ecological Approach to Visual Perception",
year = 1986,
publisher = "Psychology Press"
}
@book{collingwood,
author = "Collingwood, R. G.",
title = "The Principles of Art",
year = 1938,
publisher = "Clarendon Press"
}
@inbook{ridley,
author = "Ridley, Aaron",
title = "Expression in Art",
editor = "Levinson, Jerrold",
booktitle = "The Oxford Handbook of Aesthetics",
year = 2003,
publisher = "Oxford University Press",
chapter = 11
}
and you want to extract all the @book entries. With siv, this can be done with the command
$ siv '^@book{.*,\n.*}$' references.bib
@book{gibson,
author = "Gibson, J. J.",
title = "The Ecological Approach to Visual Perception",
year = 1986,
publisher = "Psychology Press"
}
@book{collingwood,
author = "Collingwood, R. G.",
title = "The Principles of Art",
year = 1938,
publisher = "Clarendon Press"
}
If instead we wanted to extract the publisher field of each @book entry, we could say
$ siv -t 1 -e '^@book{.*,\n.*}$' -e 'publisher = .+$' references.bib
publisher = "Psychology Press"
publisher = "Clarendon Press"
The -t flag selects which expression’s match is printed, counting from 0. In the command above, -t 1 selects the content matched by the second regular expression, which corresponds to the publisher field of each book entry.
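To make the selection step concrete, here is a small Python sketch of the same two-layer search over the bibliography above. It is only an approximation: `run`, `ENTRY` and `FIELD` are names invented for this example, the patterns are adapted for Python's `re` (a lazy `.*?` and `[^\n]+` instead of `.*` and `.+`, because Python's quantifiers are greedy), and printing the selected layer once per complete chain is my assumption about what -t does.

```python
import re

BIB = '''@book{gibson,
author = "Gibson, J. J.",
title = "The Ecological Approach to Visual Perception",
year = 1986,
publisher = "Psychology Press"
}
@book{collingwood,
author = "Collingwood, R. G.",
title = "The Principles of Art",
year = 1938,
publisher = "Clarendon Press"
}
@inbook{ridley,
author = "Ridley, Aaron",
title = "Expression in Art",
editor = "Levinson, Jerrold",
booktitle = "The Oxford Handbook of Aesthetics",
year = 2003,
publisher = "Oxford University Press",
chapter = 11
}
'''

FLAGS = re.MULTILINE | re.DOTALL
ENTRY = r'^@book\{.*?^\}$'        # layer 0: a whole @book entry
FIELD = r'publisher = [^\n]+$'    # layer 1: the publisher line inside it

def run(bib, target):
    """Collect the text of layer `target` (0 or 1) for every entry/field pair."""
    out = []
    for entry in re.finditer(ENTRY, bib, FLAGS):
        for field in re.finditer(FIELD, entry.group(0), FLAGS):
            chain = (entry.group(0), field.group(0))
            out.append(chain[target])
    return out

if __name__ == "__main__":
    print("\n".join(run(BIB, 1)))
```

With target 1 this prints the two publisher lines shown above; with target 0 it returns the two whole @book entries instead. The @inbook entry is skipped in both cases because it never matches the first layer.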
A more impressive display of siv’s capabilities is its ability to do rudimentary parsing.
The expression
^([A-Za-z_][A-Za-z_*0-9]* ?)+\** [A-Za-z_][A-Za-z_0-9]*\([^\n]*\)[ \n]{\n.+^}$
matches a C function definition whose header is all on one line and whose body may begin with an open curly brace on that same line or on the next one.
Other applications include parsing HTML/XML
$ curl https://joaodear.xyz | siv -t 1 -e '<head>.*</head>' -e '<meta.*>'
Extracting CSS rules for a given tag
$ siv '^([^\n]+, )*nav(, [^\n]+)* {\n.*^}$' style.css
And even fulfilling the role of grep itself
$ siv -e '^.*$' -e "siv's not grep"
Now that siv is finally in a working state I’ll be testing it more in everyday use to get a better understanding of its strengths and weaknesses. I encourage anyone who’s interested to clone the sreutils repo and give it a try.