Small Gadgets

Posted on 2024-02-09 in Miscellany

An introduction to a few not-very-useful small gadgets (mainly for Linux).

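On Lunar New Year's Eve I wanted character statistics for my blog posts, but a quick search turned up nothing suitable, so I wrote a tool with Copilot's help. I also decided to collect similar small gadgets in one place from now on. This post introduces them; the code itself lives on GitHub.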

`stats.py`

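Counts how many times each character occurs in files.
Features: limit the number of results shown (all by default), match with a regular expression, match with a built-in character set (mutually exclusive with the regex option), restrict the search to given file extensions (all files by default), recurse into directories (off by default), ignore whitespace (on by default), ignore case (case-sensitive by default), and write the results to a file (standard output by default).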
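
The core idea can be sketched in a few lines of Python. This is a minimal illustration rather than the script itself: the function name `count_chars` and its parameters are hypothetical, chosen only to mirror the options listed above.

```python
import re
from collections import Counter


def count_chars(text, top=None, ignore_space=True, ignore_case=False):
    """Count character occurrences in text (hypothetical helper)."""
    if ignore_space:
        # Drop every whitespace character before counting.
        text = re.sub(r"\s", "", text)
    if ignore_case:
        text = text.lower()
    # most_common(None) returns every entry, most frequent first.
    return Counter(text).most_common(top)


if __name__ == "__main__":
    # Print the three most frequent characters, ignoring case.
    for char, count in count_chars("Hello, hello world", top=3, ignore_case=True):
        print(f"{char}\t{count}")
```

Everything else in the tool (file discovery, regex filtering, output formatting) is plumbing around this counting step.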

I first wrote it in Bash. The result was long and unwieldy, and after a lot of fiddling it still had problems.

stats.sh (requires ripgrep, perl, etc.) GitHub

#!/bin/bash

# DEPRECATED: Use `stats.py` instead.

# Variables that hold the option values
number=0
regex=""
library=""
reverse=false
recursive=false
ignore_space=true
ignore_case=false
debug=false
file_formats=()
output_file=""
temp_file=$(mktemp)

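# Character-set library, written as the contents of a Perl regex character class
declare -A libraries=(
  [c]="[:graph:]" # all visible (printable, non-space) characters
  [cp]="[:print:]"  # all printable characters, including the space
  [cn]="\\p{Script=Han}"  # Han (Chinese) characters
  [en]="a-zA-Z"  # English letters
  [alnum]="[:alnum:]"  # letters and digits
  [num]="[:digit:]"  # digits
  [sp]="[:space:]"  # whitespace characters
  [punc]="[:punct:]"  # punctuation characters
)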

# Print the usage message
usage() {
  echo -e "Usage: $0 [-n number] [-e regex] [-l library] [-f format] [-o output] [-d] [-r] [-R] [-S] [-i] [-h] file | dir [file | dir ...]

Options:
  -n number   Display the top number of results.
  -e regex    Use the specified regular expression.
  -l library  Use the specified character set library. This option cannot be used with the -e option.
  -f format   Process files of the specified format(s).
  -o output   Write the results to the specified output file.
  -d          Debug mode.
  -r          Reverse the order of the results.
  -R          Process directories recursively.
  -S          Show whitespace characters.
  -i          Ignore case.
  -h          Display this help message.

Arguments:
  file        The file(s) to process.
  dir         The directory to process.

Libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters."
}

# Emit the characters of one file, filtered by -e or -l when given
process_file() {
  local file=$1
  echo "$file" >> "$temp_file"  # record the file for the debug report
  if [ -n "$regex" ]; then
    # -e: print every match of the regular expression, one per line
    perl -C -ne "while (/$regex/g) {print \"\$&\n\"}" "$file"
  elif [ -n "$library" ]; then
    # -l: print only the characters that belong to the chosen library
    perl -C -ne 'while (/(['"${libraries[$library]}"'])/g) {print "$1\n"}' "$file"
  else
    # neither -e nor -l: pass every character through
    cat "$file"
  fi
}

# Parse the command-line options with getopts
while getopts "n:e:l:f:o:drRSih" opt; do
  case $opt in
    n) number=$OPTARG
       # number must be a non-negative integer (0 shows everything)
       if ! [[ "$number" =~ ^[0-9]+$ ]]; then
         echo "Error: -n option requires a non-negative integer argument." >&2
         exit 1
       fi ;;
    e) regex=$OPTARG
       # -e and -l cannot be combined
       if [ -n "$library" ]; then
         echo "Error: -e and -l options cannot be used together." >&2
         exit 1
       fi ;;
    l) library=$OPTARG
       # -e and -l cannot be combined
       if [ -n "$regex" ]; then
         echo "Error: -e and -l options cannot be used together." >&2
         exit 1
       fi
       # the requested library must exist
       if ! [[ -v libraries["$library"] ]]; then
         echo "Error: Unknown library -$library" >&2
         exit 1
       fi ;;
    d) debug=true ;;
    r) reverse=true ;;
    R) recursive=true ;;
    S) ignore_space=false ;;
    i) ignore_case=true ;;
    f) IFS=',' read -ra formats <<< "$OPTARG"
       for format in "${formats[@]}"; do
         file_formats+=("$format")
       done ;;
    o) output_file=$OPTARG ;;
    h) usage
       exit 0 ;;
    \?) echo "Invalid option -$OPTARG" >&2
        usage
        exit 1 ;;
  esac
done

# Drop the options that getopts has already consumed
shift $((OPTIND -1))

# At least one file or directory argument is required
if [ $# -eq 0 ]; then
  echo "Error: At least one file argument is required." >&2
  usage
  exit 1
fi

# Process every input file or directory
(
  for input in "$@"; do
    # the input is a directory
    if [ -d "$input" ]; then
      if $recursive; then
        # -R: walk the whole directory tree with find
        while IFS= read -r -d '' file
        do
          # skip files whose extension is not in the -f list
          if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[*]} " =~ " ${file##*.} " ]]; then
            process_file "$file"
          fi
        done < <(find "$input" -type f -print0)
      else
        # without -R: only the files directly inside the directory
        for file in "$input"/*; do
          if [ ! -f "$file" ]; then
            continue
          fi
          # skip files whose extension is not in the -f list
          if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[*]} " =~ " ${file##*.} " ]]; then
            process_file "$file"
          fi
        done
      fi
    # the input is a regular file
    elif [ -f "$input" ]; then
      # skip it if its extension is not in the -f list
      if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[*]} " =~ " ${input##*.} " ]]; then
        process_file "$input"
      fi
    else
      echo "Error: $input is not a valid file or directory." >&2
    fi
  done
) | {
  # unless -S is given, strip all whitespace characters
  if $ignore_space; then
    tr -d '[:space:]'
  else
    cat
  fi
} | {
  # -i: fold upper case to lower case
  if $ignore_case; then
    tr '[:upper:]' '[:lower:]'
  else
    cat
  fi
} | rg -o . | sort | uniq -c | {
  # -r: ascending order instead of descending
  if $reverse; then
    sort -k1n
  else
    sort -k1nr
  fi
} | {
  i=1  # rank counter
  while IFS=' ' read -ra line; do
    # read -a splits on spaces, which drops the padding that uniq -c adds
    count=${line[0]}
    char=${line[1]}  # second field: the character
    printf "%d\t%s\t%d\n" "$i" "$char" "$count"
    ((i++))  # next rank
  done
} | {
  if [ "$number" -gt 0 ]; then
    # -n: keep only the first $number results
    head -n "$number"
  else
    cat
  fi
} > "${output_file:-/dev/stdout}"

# Print debug information
if $debug; then
  echo -e "
Debug Information
=================
Number:\t$number
Regex:\t$regex
Library:\t$library
Reverse:\t$reverse
Recursive:\t$recursive
Ignore Space:\t$ignore_space
Ignore Case:\t$ignore_case
File Formats:\t${file_formats[@]}
Processed Files:" | tee --append "${output_file:-/dev/null}"
  tee --append "${output_file:-/dev/null}" < "$temp_file"
fi

rm "$temp_file"  # remove the temporary file

The script still has problems. The one I know of is that '␣' (a plain space) is not displayed correctly.
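
One plausible cause is the final `read -ra` step: with `IFS=' '`, the line that `uniq -c` emits for a space character consists of nothing but the count and spaces, so the character field comes out empty. The Python sketch below imitates that situation and shows one way to keep the character. `parse_uniq_line` is a hypothetical helper, and the sample line assumes the usual `uniq -c` layout of a right-aligned count, one separating space, then the original line.

```python
def parse_uniq_line(line):
    """Split one `uniq -c` output line into (count, text) (hypothetical helper)."""
    # Remove only the left padding, then cut at the first space after the count,
    # so that whatever follows (even a single space) is preserved.
    count, _, text = line.lstrip(" ").partition(" ")
    return int(count), text


sample = "      3  "  # assumed output for a line holding one space, seen 3 times

# Whitespace splitting, comparable to `read -ra` with IFS=' ', loses the character.
print(sample.split())           # ['3']
# Splitting once after the count keeps it.
print(parse_uniq_line(sample))  # (3, ' ')
```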

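After a long struggle I still could not fix it and finally gave up, then had Copilot write a Python version (no more Bash for me: the syntax is ugly and hard to read, and the scripts end up long and unwieldy).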

stats.py (initial version; for the latest version, follow the GitHub link on the right) GitHub

#!/usr/bin/python3

import argparse
import os
import re
import sys
from collections import Counter
from fnmatch import fnmatch

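# Character-set libraries
libraries = {
    "c": r"\S",  # all visible (non-whitespace) characters, punctuation included
    "cp": r"\S|\s",  # all visible characters plus whitespace
    "cn": r"[\u4e00-\u9fff]",  # common Chinese characters (CJK Unified Ideographs block)
    "en": r"[a-zA-Z]",  # English letters
    "alnum": r"[a-zA-Z\d]",  # letters and digits
    "num": r"[\d]",  # digits
    "sp": r"[\s]",  # whitespace characters
    "punc": r"[^\w\s]",  # punctuation characters
}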

def process_file(file, regex, library, ignore_space, ignore_case, verbose):
    if verbose:
        print(f"Processing file: {file}")
    with open(file, 'r', encoding='utf-8') as f:
        content = f.read()
        if ignore_space:
            content = re.sub(r'\s', '', content)
        if ignore_case:
            content = content.lower()
        if regex:
            matches = re.findall(regex, content)
        elif library:
            matches = re.findall(libraries[library], content)
        else:
            matches = list(content)
        return matches

def main():
    parser = argparse.ArgumentParser(description='Count the occurrences of characters in files.',
                                     epilog='''libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters.''',
  formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-n', '--number', metavar="number", type=int, default=0, help='The number of most common characters to display.')
    parser.add_argument('-e', '--expression', metavar="regex", type=str, default="", help='The regular expression to match.')
    parser.add_argument('-l', '--library', metavar="library", type=str, choices=libraries.keys(), help='The character set library to use.')
    parser.add_argument('-f', '--format', metavar="format", type=str, default="", help='The file formats to process.')
    parser.add_argument('-o', '--output', metavar="output", type=str, default="", help='The output file.')
    parser.add_argument('-r', '--reverse', action='store_true', default=False, help='Reverse the order of the output.')
    parser.add_argument('-R', '--recursive', action='store_true', default=False, help='Recursively process directories.')
    parser.add_argument('-S', '--show-space', action='store_true', default=False, help='Show whitespace characters.')
    # -i clears case_sensitive, so matching stays case-sensitive unless -i is given
    parser.add_argument('-i', '--case-sensitive', action='store_false', default=True, help='Ignore case when matching.')
    parser.add_argument('-v', '--verbose', action='store_true', default=False, help='Display verbose output.')
    parser.add_argument('paths', nargs='+', help='The files or directories to process.')
    args = parser.parse_args()

    if args.expression and args.library:
        print("Error: --expression and --library options cannot be used together.")
        sys.exit(1)

    if args.library and args.library not in libraries:
        print(f"Error: Unknown library --{args.library}")
        sys.exit(1)

    file_formats = args.format.split(',') if args.format else []

    results = []
    processed_files = []  # every file that was actually processed
    for path in args.paths:
        if os.path.isfile(path):
            if not file_formats or any(fnmatch(path, f'*.{fmt}') for fmt in file_formats):
                results.extend(process_file(path, args.expression, args.library, not args.show_space, not args.case_sensitive, args.verbose))
                processed_files.append(path)  # record the file once it has been processed
        elif os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                for name in files:
                    if not file_formats or any(fnmatch(name, f'*.{fmt}') for fmt in file_formats):
                        results.extend(process_file(os.path.join(root, name), args.expression, args.library, not args.show_space, not args.case_sensitive, args.verbose))
                        processed_files.append(os.path.join(root, name))  # record the file once it has been processed
                if not args.recursive:
                    break

    counter = Counter(results)
    most_common = counter.most_common(args.number if args.number > 0 else None)
    most_common.sort(key=lambda x: (x[1], x[0]) if args.reverse else (-x[1], x[0]))

    # Printable stand-ins for whitespace and control characters
    escape_dict = {" ": r"\s", "\n": r"\n", "\t": r"\t", "\r": r"\r", "\f": r"\f", "\v": r"\v", "\b": r"\b"}
    f = open(args.output, 'w', encoding='utf-8') if args.output else sys.stdout
    for i, (char, count) in enumerate(most_common, start=1):
        char = escape_dict.get(char, char)
        print(f"{i}\t{char}\t{count}", file=f)
    if args.output:
        f.close()

    # After all files have been handled, list the ones that were processed
    print("Processed files:")
    for file in processed_files:
        print(file)

if __name__ == "__main__":
    main()
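
To see what such character-set patterns actually select, a small standalone check can help. The sketch below is illustrative only: the sample string and the `patterns` dictionary are assumptions made for this example (a subset modelled on the table in the script), not part of stats.py.

```python
import re
from collections import Counter

# Assumed subset of patterns, modelled on the script's `libraries` table.
patterns = {
    "cn": r"[\u4e00-\u9fff]",  # CJK Unified Ideographs block
    "en": r"[a-zA-Z]",         # English letters
    "num": r"\d",              # digits
    "punc": r"[^\w\s]",        # anything that is neither a word character nor whitespace
}

sample = "除夕快乐！Happy 2024, 快乐。"

# For each pattern, count the characters it picks out of the sample.
for name, pattern in patterns.items():
    counts = Counter(re.findall(pattern, sample))
    print(name, counts.most_common())
```

Note that `punc` here also catches full-width punctuation such as `！` and `。`, because Python's `\w` and `\s` are Unicode-aware for `str` patterns.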

./stats.py -h prints the usage help.

usage: stats.py [-h] [-n number] [-e regex] [-l library] [-f format] [-o output] [-r] [-R] [-S] [-i] [-v] paths [paths ...]

Count the occurrences of characters in files.

positional arguments:
  paths                 The files or directories to process.

options:
  -h, --help            show this help message and exit
  -n number, --number number
                        The number of most common characters to display.
  -e regex, --expression regex
                        The regular expression to match.
  -l library, --library library
                        The character set library to use.
  -f format, --format format
                        The file formats to process.
  -o output, --output output
                        The output file.
  -r, --reverse         Reverse the order of the output.
  -R, --recursive       Recursively process directories.
  -S, --show-space      Show whitespace characters.
  -i, --case-sensitive  Ignore case when matching.
  -v, --verbose         Display verbose output.

libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters.
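
The script checks by hand that `-e` and `-l` are not combined. As an alternative, `argparse` can enforce this itself through a mutually exclusive group. The sketch below is a reduced, hypothetical parser (only three of the options, a shortened `choices` list, and a `build_parser` name invented here), not the one used in stats.py.

```python
import argparse


def build_parser():
    """Build a reduced parser with -e and -l declared mutually exclusive."""
    parser = argparse.ArgumentParser(
        description="Count the occurrences of characters in files.")
    # argparse rejects any command line that uses more than one option of a group.
    group = parser.add_mutually_exclusive_group()
    group.add_argument("-e", "--expression", metavar="regex",
                       help="The regular expression to match.")
    group.add_argument("-l", "--library", metavar="library",
                       choices=["cn", "en", "num"],  # shortened list for the example
                       help="The character set library to use.")
    parser.add_argument("paths", nargs="+",
                        help="The files or directories to process.")
    return parser


if __name__ == "__main__":
    # "post.md" is a placeholder file name; nothing is opened here.
    args = build_parser().parse_args(["-l", "cn", "post.md"])
    print(args.library, args.paths)
    # Passing both -e and -l makes argparse print an error and exit with status 2.
```

With this arrangement the usage line also shows the two options as alternatives, so the restriction is visible in the help text without extra wording.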

The Bash version's usage differs slightly, but since it is no longer maintained it has not been updated:

Usage: ./stats.sh [-n number] [-e regex] [-l library] [-f format] [-o output] [-d] [-r] [-R] [-S] [-i] [-h] file | dir [file | dir ...]

Options:
  -n number   Display the top number of results.
  -e regex    Use the specified regular expression.
  -l library  Use the specified character set library. This option cannot be used with the -e option.
  -f format   Process files of the specified format(s).
  -o output   Write the results to the specified output file.
  -d          Debug mode.
  -r          Reverse the order of the results.
  -R          Process directories recursively.
  -S          Show whitespace characters.
  -i          Ignore case.
  -h          Display this help message.

Arguments:
  file        The file(s) to process.
  dir         The directory to process.

Libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters.