How do you parse input?

Ceci n'est pas une pipe.

Over the years I've written hundreds of small scripts and apps for various tasks I need doing. Although these are often quite useful, I rarely release or share them.

I'm not some kind of horrible code miser, mind you; it's just that my scripts are always hacked together in such a way that they work for my specific input data only.

Recently I've realised that the main hurdle in writing more generic utilities is the way I'm parsing input: once I've got the input data in the correct format, the rest of my code is clean and robust. If the input feed is nice (say, JSON data for a JavaScript project, or space-separated values for a shell script) then it's easy - but the most "hacky" parts of my programs tend to be huge chains of AWKs and GREPs, or convoluted string manipulation functions, or multi-step transformations that could easily break.

As an example: in a recent bash script I used curl to fetch an M3U playlist. I wanted to extract a piece of information from each of the RTSP URLs (each of which contained an ID and a format code). I only wanted the IDs under 10000. Here's a bit that pulls out the IDs and sorts them:

  # $PLAYLIST_URL is a placeholder for the playlist address
  curl --location "$PLAYLIST_URL" |
  grep rtsp:// |
  awk 'BEGIN { FS="[=&]+" } ; $4 < 10000 { print $4 }' |
  sort -g

It works, but it's brittle: if something falls down in the middle of the chain, I won't know why. Worse still, if my requirements change and I want a different set of information, then I have to start reverse-engineering it or start from scratch. For this example, I now need to also extract the format code, and find the title field that is contained in the #EXTINF line that precedes the URL. Errm.
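To make the new requirement concrete, here is a rough sketch of the kind of thing I imagine wanting: one awk pass that turns the playlist into tab-separated records (ID, format code, title) and looks up query parameters by name rather than by field position. The function name parse_m3u and the parameter names "id" and "fmt" are assumptions made up for illustration; the real URLs may well use different names. I'm not claiming this is the right answer, only showing the shape of the problem.

```shell
#!/usr/bin/env bash
# Sketch only: read an M3U playlist on stdin and print one tab-separated
# record (id, format, title) per RTSP URL.
# ASSUMPTION: the query parameters are called "id" and "fmt". Those names
# are hypothetical; override them with ID_KEY / FMT_KEY to match real data.
parse_m3u() {
  awk -v id_key="${ID_KEY:-id}" -v fmt_key="${FMT_KEY:-fmt}" '
    # Playlists often have Windows line endings; drop a trailing CR.
    { sub(/\r$/, "") }

    # "#EXTINF:<duration>,<title>": remember the title for the next URL.
    /^#EXTINF:/ {
      title = $0
      sub(/^[^,]*,/, "", title)
      next
    }

    /^rtsp:\/\// {
      # Split the query string into name=value pairs, keyed by name.
      split("", params)               # portable way to empty an array
      query = $0
      sub(/^[^?]*\?/, "", query)
      n = split(query, pairs, "&")
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        if (eq > 0)
          params[substr(pairs[i], 1, eq - 1)] = substr(pairs[i], eq + 1)
      }

      # Say which line is malformed instead of silently dropping it.
      if (!(id_key in params)) {
        print "line " NR ": no " id_key " parameter: " $0 > "/dev/stderr"
        title = ""
        next
      }

      print params[id_key] "\t" params[fmt_key] "\t" title
      title = ""
    }
  '
}

# Usage sketch: the "under 10000" filter becomes its own readable step.
#   parse_m3u < playlist.m3u | awk -F "\t" "\$1 < 10000" | sort -g
```

With the parsing in one place, changing what I extract means editing the print line rather than recounting field positions, and a malformed URL produces a warning with its line number instead of quietly vanishing from the output.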

So, what's the answer? Is the question I'm asking.

My input parsing is letting down the team - where should I turn? Learn about language parsers and write one for each project? Shut up and keep writing sed-soup? Any suggestions?