Skip to content

feat: .ksh extension + shebang-based language detection for extension-less scripts (#235, #237)#276

Merged
tirth8205 merged 1 commit intotirth8205:mainfrom
azizur100389:feat/ksh-and-shebang-detection
Apr 18, 2026
Merged

feat: .ksh extension + shebang-based language detection for extension-less scripts (#235, #237)#276
tirth8205 merged 1 commit intotirth8205:mainfrom
azizur100389:feat/ksh-and-shebang-detection

Conversation

@azizur100389
Copy link
Copy Markdown
Contributor

Summary

Two parser improvements that expand file coverage to extension-less Unix scripts and Korn shell files. +288 lines, 18 new tests.

Closes #235, closes #237.

Supersedes individual PRs #236 and #238 (which are being closed in favor of this bundled PR).


Feature 1: .ksh extension → bash parser (#235)

Register .ksh with tree-sitter-bash alongside .sh / .bash / .zsh. Korn shell is close enough to bash syntactically that tree-sitter-bash handles it correctly.

Context: @tirth8205 explicitly invited this in the close comment on PR #230:

The .ksh extension in particular looks worth adding — I didn't include it in #227.

Tests: test_detects_language extended with .ksh; test_ksh_extension_parses_as_bash — end-to-end parse that asserts identical function set and edge counts to .sh.

Feature 2: Shebang-based language detection (#237)

detect_language() was extension-only — any file with no extension was silently skipped. This misses git hooks, CI scripts, bin/ entry points, and installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps interpreter basenames to already-registered languages:

Interpreter Language
bash, sh, zsh, ksh, dash, ash bash
python, python2, python3, pypy, pypy3 python
node, nodejs javascript
ruby, perl, lua, Rscript, php (respective)

New _detect_language_from_shebang(path) reads first 256 bytes. Handles:

  • Direct form: #!/bin/bash
  • env indirection: #!/usr/bin/env bash
  • env -S flag: #!/usr/bin/env -S node --experimental-vm-modules
  • Trailing flags: #!/bin/bash -e
  • CRLF line endings, binary content, strict UTF-8 decoding

detect_language() falls back to shebang probe only for files with suffix == "". Files with a known extension are never re-read.

Tests (16 new): every mapped interpreter, env -S flag, trailing flags, missing shebang, empty file, binary content, unknown interpreter, extension-override guard, and end-to-end parse_file() producing function nodes.


Files changed (3 files, +288 / -1)

  • code_review_graph/parser.py.ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE table + _detect_language_from_shebang() + detect_language() fallback
  • tests/test_multilang.py.ksh detection + end-to-end ksh parsing test
  • tests/test_parser.py — 16 shebang detection tests

Test results

  • 17/17 targeted tests pass (1 ksh + 16 shebang)
  • ruff check on all changed files: clean (1 pre-existing F841 in an unrelated test)

…-less scripts (tirth8205#235, tirth8205#237)

Two parser improvements that expand code-review-graph's file coverage
to extension-less Unix scripts and Korn shell files.

Feature 1: .ksh extension → bash parser (tirth8205#235)
-----------------------------------------------
Register .ksh (Korn shell) with tree-sitter-bash alongside the existing
.sh / .bash / .zsh entries shipped in v2.3.0.  Korn shell is close enough
to bash syntactically that tree-sitter-bash handles the structural
features the graph captures correctly.

Context: in the close comment on PR tirth8205#230, @tirth8205 explicitly flagged
this as worth adding: "The .ksh extension in particular looks worth
adding — I didn't include it in tirth8205#227."

Tests: test_detects_language extended with .ksh assertion;
test_ksh_extension_parses_as_bash — end-to-end regression test that
copies sample.sh to a temp .ksh file, parses it, and asserts identical
function set and edge counts.

Feature 2: shebang-based language detection (tirth8205#237)
--------------------------------------------------
detect_language() was extension-only — any file with no extension returned
None and was silently skipped.  This misses a huge category of production
files: git hooks, CI scripts, bin/ entry points, installers.

New SHEBANG_INTERPRETER_TO_LANGUAGE table maps common interpreter
basenames to languages already registered:
  bash/sh/zsh/ksh/dash/ash -> bash
  python/python2/python3/pypy/pypy3 -> python
  node/nodejs -> javascript
  ruby, perl, lua, Rscript, php

New _detect_language_from_shebang(path) static method reads the first
256 bytes, handles direct form (#!/bin/bash), env indirection
(#!/usr/bin/env bash), env -S flags, trailing flags (#!/bin/bash -e),
CRLF, binary content, and strict UTF-8 decoding.

detect_language() now falls back to the shebang probe for files with
no extension (suffix == "").  Files with a known extension are never
re-read — extension-based detection stays authoritative.

Tests (16 new in test_parser.py): every interpreter mapping, env -S flag,
trailing flags, missing shebang, empty file, binary content, unknown
interpreter, extension-does-not-get-overridden, and end-to-end
parse_file producing function nodes from an extension-less bash script.

Files changed
-------------
- code_review_graph/parser.py — .ksh mapping + SHEBANG_INTERPRETER_TO_LANGUAGE
  table + _detect_language_from_shebang() + detect_language() fallback
- tests/test_multilang.py — .ksh detection + end-to-end ksh parsing test
- tests/test_parser.py — 16 shebang detection tests
@azizur100389 azizur100389 force-pushed the feat/ksh-and-shebang-detection branch from 24fdad8 to fc5fc95 Compare April 14, 2026 23:03
@tirth8205 tirth8205 merged commit e6e3144 into tirth8205:main Apr 18, 2026
9 checks passed
Minidoracat added a commit to Minidoracat/code-review-graph that referenced this pull request Apr 18, 2026
Upstream's current `main` no longer exports `_SHEBANG_PROBE_BYTES` /
`SHEBANG_INTERPRETER_TO_LANGUAGE` (tirth8205#276's shebang detection was
reverted in one of the chore merge commits).  An earlier pass of this
rebase preserved the shebang fallback in `detect_language`, which then
NameError'd at import time.

Narrow this PR's `detect_language` back to its actual scope — Blade
compound-extension check before the plain extension lookup.  Restoring
tirth8205#276 is upstream's job, not this branch's.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(parser): shebang-based language detection for extension-less scripts feat(bash): add .ksh extension to bash parser

2 participants