15 June 2008
Yeah, I did
Posted by sanford under: Where It Will
Once, in a misspent other life, I bought a scanner. Visions of sharing old family photos and getting rid of a box of old documents danced in my head. Preservation of family treasures and efficiency were good enough reasons. Ultimately, I gave my sister the pictures and shredded the documents. Kept the scanner, though. I bought a new bookcase to set it on. The bookcase would be used and the scanner wasn’t taking up space. It worked.
The scanner came with software. Copying, image editing, and optical character recognition. Copying was obvious and good. Something needed to be copied at least every other day. It was fun using the image editor, flipping colors, looking at a scanned picture of my sister facing left, then right, then upside down. Blurring, sharpening, painting her a red eye, then removing it. That lasted for a few hours. I kept some of the experiments to press my sister to get her own scanner. She could have fun enhancing the years of family history in those photos.
Character recognition was new to me. I fed it a document, one destined for shredding, and little yellow squares and red lines filled the scanned page. Tweaks and corrections, more tweaks and corrections, and some “a”s still came out as “*”. In the quiet hours of early morning, desperation slipped toward a weary acceptance. Some good reason for the scanner would come up. In the meantime, a bookcase is always useful, sooner or later.
I was standing on the back patio, working the last cigarette before bed, when a great idea came, as they do in the early morning, after a number of beers, or when love strikes. I could use it to save money.
Prior failures at justification were pushed aside by a sputtering glow of enthusiasm. I dug up a grocery receipt, smoothed it, and fit it carefully on the glass. The data from different stores could be scanned, converted, and imported into a spreadsheet. A price history could be created. The most economical stores could be mapped. My sister would love it.
So what were the issues? Stores named items differently. That was okay. A skate in the park. After enough data had been captured and aligned, the list would plateau and the cost to manage would become small. The nickels and dimes would add up. The scanner would be paid for. Pathetic, perhaps. One can learn something from anything. After another hour or two, the first receipt was on disk.
The receipt from the next store killed the idea. Item names were frustratingly different. So was the layout. To align them, items and prices had to be flipped. Then the items, old and new, were sorted. The sort didn’t quite work. Naming was different enough to make the order relationships an uncomfortable approximation. The names had to be normalized. Aligning the same items took more time than anticipated. No two receipts were identical, of course. Normalizing had to happen again whenever new items were added. The number of steps for a few pieces of paper was numbing.
A view of the future had me hunched over tiny pieces of paper for hours, once a week. I could see fighting the OCR, making tiny adjustments to the data, working to fit in a new receipt type, and perhaps writing scripts that had to be tweaked for each difference. The receipts ended up in a box in the closet for some future bored ambition.
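For the curious, here is roughly what one of those scripts might look like, as a minimal Python sketch. Everything specific in it is made up for illustration: the store names, the two line layouts (item then price, price then item), and the alias table that maps each store's label to one normalized name. Real receipts would need a layout rule per store and an alias table that never quite stops growing, which is the point.

```python
import re
from collections import defaultdict

# Hypothetical alias table: each store's printed label -> one normalized name.
ALIASES = {
    "MLK 2% GAL": "milk 2% gallon",
    "2PCT MILK 1GAL": "milk 2% gallon",
    "LG EGGS DZ": "eggs large dozen",
    "EGGS LRG 12CT": "eggs large dozen",
}

# Hypothetical layouts: store_a prints "ITEM  PRICE", store_b prints "PRICE  ITEM".
LAYOUTS = {
    "store_a": re.compile(r"^(?P<item>.+?)\s+(?P<price>\d+\.\d{2})$"),
    "store_b": re.compile(r"^(?P<price>\d+\.\d{2})\s+(?P<item>.+)$"),
}


def parse_receipt(store, lines):
    """Return (records, unmatched) for one receipt's OCR'd lines.

    Lines that do not fit the store's layout, or whose item label has no
    alias yet, go to `unmatched` for manual attention.
    """
    pattern = LAYOUTS[store]
    records, unmatched = [], []
    for line in lines:
        match = pattern.match(line.strip())
        if not match:
            unmatched.append(line)
            continue
        label = match.group("item").strip().upper()
        name = ALIASES.get(label)
        if name is None:
            unmatched.append(line)
            continue
        records.append(
            {"store": store, "item": name, "price": float(match.group("price"))}
        )
    return records, unmatched


def price_table(records):
    """Group records into {normalized item: {store: price}}."""
    table = defaultdict(dict)
    for record in records:
        table[record["item"]][record["store"]] = record["price"]
    return dict(table)
```

Calling `parse_receipt("store_a", lines)` and `parse_receipt("store_b", lines)` and passing the combined records to `price_table` gives one row per item with a price per store. The `unmatched` list is where the hours go: every new label or layout lands there until someone adds a rule for it.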
This scene, reduced to a few points, and without reference to my sister, has been used to describe one of the challenges in log management and analysis. Variations in log content, completeness, and form for the same set of data or data types require detailed effort that may not survive the next need. Every business has a variation. For this simple example, a focused effort might provide a solution. There are more programs than businesses, I suspect. Software has no restrictions on creativity and complexity. Logs and the messages they contain can be as varied as the people who create them. I’ve sometimes thought working with logs is akin to a pre-computer information management system, before training on how to construct, manage, and file was commonplace.
That comparison only introduces a few of the issues with log management. Even today, a broad range of people presume a simplicity and directness in logs. Knowledgeable folk have brought solutions to the problem, then found more interesting things to work on, leaving the challenge unmet. The complexity of large-scale log management is a reflection of the complexity and growth of computing in the last 40 years.
The purpose of this flyspeck on the blogosphere is to put down some of the thoughts a solution, or set of solutions, might be built on. At the least, I hope it’s useful to those trying to make sense of it.
One Comment so far...
Dimitri McKay Says:
23 June 2008 at 3:49 pm.
I did too, Sanford. Now that scanner sits beside me, the power plug missing. It hasn’t scanned in years. The logs, on the other hand, although written in dozens of cryptic formats, binary to ASCII, flat file to syslog, those I find myself hunched over, attempting to create some magic search filters that will persuade that POC customer that the answer he was looking for can be found with a Boolean or regular expression.
Tax receipts or Sendmail logs, I hunch over them with the same frustration and cold desperation.
Perhaps I need a new scanner.