Skip to content

csgrep: --add-input-lines option for --mode=json#242

Open
Jany26 wants to merge 1 commit intocsutils:mainfrom
Jany26:input-lineno-in-json
Open

csgrep: --add-input-lines option for --mode=json#242
Jany26 wants to merge 1 commit intocsutils:mainfrom
Jany26:input-lineno-in-json

Conversation

@Jany26
Copy link

@Jany26 Jany26 commented Mar 24, 2026

This PR adds a new CLI option --add-input-lines that provides an extra input_line field in DefEvents in the JSON ouptut. The tokenizer captures the line number but only emits it during JSON writing. Change is backwards compatible (csgrep/0134-0135 tests should cover that). If needed, I can add additional tests, but the line number information is not utilized anywhere else.

Context for the proposed change (our use case).
In Log Detective, we utilize csgrep to extract compiler errors from failed RPM log files to provide additional context for LLM analysis. The extracted log snippets should have a line number pointing to the log file for easy lookup. Currently there is no easy way for us to extract input line numbers from CSGrep output.

If --add-input-lines option is used, DefEvents output as json will
have "input_line" referencing the line number of the input file.

Signed-off-by: Jan Matufka <jmatufka@redhat.com>
@kdudka kdudka self-assigned this Mar 24, 2026
@kdudka kdudka self-requested a review March 24, 2026 11:20
Copy link
Member

@kdudka kdudka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Jany26 I like the proposed feature. A few technical comments inline...

std::string fileName;
int line = 0;
int column = 0;
int inputLine = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also (optionally) record inputFile. Otherwise the number is ambiguous in case csgrep reads multiple input files.

class ErrFileLexer {
public:
ErrFileLexer(std::istream &input):
ErrFileLexer(std::istream &input, bool addInputLines = false):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be easier if the ErrFileLexer constructor took InStream &. This class is internal to the parser-cov module, so the constructor can be changed easily.

class Tokenizer: public ITokenizer {
public:
Tokenizer(std::istream &input):
Tokenizer(std::istream &input, bool addInputLines = false):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, the Tokenizer constructor can take InStream & to simplify the code.


// known per-event subnodes
// 'input_line' is whitelisted, but currently there is no use-case for it
// so it is not re-read
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cannot we preserve non-zero input_line values already recorded in JSON files?
The use-case would be:

csgrep --mode=json file1.json file2.json > all.json

if (0 < evt.vSize)
evtNode["v_size"] = evt.vSize;
if (0 < evt.inputLine)
evtNode["input_line"] = evt.inputLine;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to record inputFile as well. Otherwise the number is meaningless when multiple text files are read.

("file-glob", "expand glob patterns in the names of input files")
("ignore-case,i", "ignore case when matching regular expressions")
("ignore-parser-warnings", "if enabled, parser warnings about the input files do not affect exit code")
("add-input-lines", "if enabled, events in mode=json will also contain input_line numbers from the original input file/stream")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The --add-input-lines option name does not sound intuitive for what it does. Without the context I would understand it as if some input text was being added rather than line numbers. As mentioned above, we should record the file names, too. What about naming it --record-input-locations instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I also rename the JSON fields to input_line_location and input_file_location?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would use input_file and input_line. I see location as an abstraction over file, line, column. Something like: https://github.com/gcc-mirror/gcc/blob/5cd3889135d77bf951e4ffe169868b453c36257d/libcpp/include/line-map.h#L1293

@Jany26
Copy link
Author

Jany26 commented Mar 24, 2026

ad pending s390x builds: fedora-copr/copr#4219 (comment)

@kdudka
Copy link
Member

kdudka commented Mar 24, 2026

ad pending s390x builds: fedora-copr/copr#4219 (comment)

I think we can simply disable the s390x CI jobs. I do not remember they would ever catch a bug that the other jobs missed. I already did this for csmock: csutils/csmock@62de9a5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants