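Little Gadgets
On Lunar New Year's Eve I wanted to look at character statistics for my blog posts, but a quick search turned up nothing suitable, so I wrote a tool with the help of Copilot. I also decided to collect similar little gadgets in one place from now on. This post serves as the introduction; the actual code lives on GitHub.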
stats.py
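Counts how many times each character appears in files.
It can limit how many results are shown (default: all), match with a regular expression, match with a built-in character set (mutually exclusive with the previous option), restrict the search to given file extensions (default: all files), search directories recursively (default: off), ignore whitespace (default: on), ignore case (default: case-sensitive), and write the results to a file (default: standard output).
I started out in Bash. The script came out long and ugly, and after a lot of fiddling it still had problems.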
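The core idea behind these options fits in a few lines. The sketch below is only an illustration, not the script itself: `count_chars` and its parameters are hypothetical names, and the sample string is made up.

```python
import re
from collections import Counter


def count_chars(text, pattern=None, ignore_space=True, ignore_case=False, top=None):
    """Count characters in `text` (hypothetical helper, not part of stats.py)."""
    if ignore_space:
        text = re.sub(r"\s", "", text)   # drop all whitespace before counting
    if ignore_case:
        text = text.lower()              # fold case so 'A' and 'a' are merged
    # With a pattern, count only the matches; otherwise count every character.
    items = re.findall(pattern, text) if pattern else list(text)
    return Counter(items).most_common(top)


sample = "Hello, 世界! Hello, world."   # made-up sample text
print(count_chars(sample, top=2))
print(count_chars(sample, pattern=r"[\u4e00-\u9fff]"))
print(count_chars(sample, pattern=r"[a-zA-Z]", ignore_case=True, top=2))
```

Passing `top=None` returns every character, which mirrors the "show everything by default" behaviour described above.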
```bash
#!/bin/bash
# DEPRECATED: Use `stats.py` instead.

# Variables holding the option values
number=0
regex=""
library=""
reverse=false
recursive=false
ignore_space=true
ignore_case=false
debug=false
file_formats=()
output_file=""
temp_file=$(mktemp)

# Character-set libraries, written as Perl regex character classes
declare -A libraries=(
    [c]="[:graph:]"         # all printable characters
    [cp]="[:print:]"        # all printable characters plus the space character
    [cn]="\\p{Script=Han}"  # all common Chinese characters
    [en]="a-zA-Z"           # all English letters
    [alnum]="[:alnum:]"     # letters and digits
    [num]="[:digit:]"       # all digits
    [sp]="[:space:]"        # whitespace characters
    [punc]="[:punct:]"      # punctuation characters
)

# Print the usage message
usage() {
    echo -e "Usage: $0 [-n number] [-e regex] [-l library] [-f format] [-o output] [-d] [-r] [-R] [-S] [-i] [-h] file | dir [file | dir ...]

Options:
  -n number   Display the top number of results.
  -e regex    Use the specified regular expression.
  -l library  Use the specified character set library.
              This option cannot be used with the -e option.
  -f format   Process files of the specified format(s).
  -o output   Write the results to the specified output file.
  -d          Debug mode.
  -r          Reverse the order of the results.
  -R          Process directories recursively.
  -S          Show whitespace characters.
  -i          Ignore case.
  -h          Display this help message.

Arguments:
  file        The file(s) to process.
  dir         The directory to process.

Libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters."
}

process_file() {
    file=$1
    echo "$file" >> "$temp_file"
    # Decide how to process the file depending on whether -e or -l was given
    if [ -n "$regex" ]; then
        # -e given: keep only the text matched by the regular expression
        perl -C -ne "while (/$regex/g) {print \"\$&\n\"}" "$file"
    elif [ -n "$library" ]; then
        # -l given: keep only characters that belong to the chosen library
        perl -C -ne 'while (/(['${libraries[$library]}'])/g) {print "$1\n"}' "$file"
    else
        # neither -e nor -l: pass every character through
        cat "$file"
    fi
}

# Parse the command-line options with getopts
while getopts "n:e:l:f:o:drRSih" opt; do
    case $opt in
        n)
            number=$OPTARG
            # Check that number consists of digits only
            if ! [[ "$number" =~ ^[0-9]+$ ]]; then
                echo "Error: -n option requires a positive integer argument." >&2
                exit 1
            fi
            ;;
        e)
            regex=$OPTARG
            # -e and -l are mutually exclusive
            if [ -n "$library" ]; then
                echo "Error: -e and -l options cannot be used together." >&2
                exit 1
            fi
            ;;
        l)
            library=$OPTARG
            # -e and -l are mutually exclusive
            if [ -n "$regex" ]; then
                echo "Error: -e and -l options cannot be used together." >&2
                exit 1
            fi
            # Check that the requested library exists
            if ! [[ -v libraries["$library"] ]]; then
                echo "Error: Unknown library -$library" >&2
                exit 1
            fi
            ;;
        d)
            debug=true
            ;;
        r)
            reverse=true
            ;;
        R)
            recursive=true
            ;;
        S)
            ignore_space=false
            ;;
        i)
            ignore_case=true
            ;;
        f)
            # Accept a comma-separated list of extensions
            IFS=',' read -ra formats <<< "$OPTARG"
            for format in "${formats[@]}"; do
                file_formats+=("$format")
            done
            ;;
        o)
            output_file=$OPTARG
            ;;
        h)
            usage
            exit 0
            ;;
        \?)
            echo "Invalid option -$OPTARG" >&2
            usage
            exit 1
            ;;
    esac
done

# Drop the options that have already been processed
shift $((OPTIND -1))

# At least one file or directory argument is required
if [ $# -eq 0 ]; then
    echo "Error: At least one file argument is required." >&2
    usage
    exit 1
fi

# Process every input file or directory
(
    for input in "$@"; do
        # The input is a directory
        if [ -d "$input" ]; then
            if $recursive; then
                # Recursive mode: use find to walk the directory tree
                while IFS= read -r -d '' file
                do
                    # Check the extension filter
                    if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[@]} " =~ " ${file##*.} " ]]; then
                        process_file "$file"
                    fi
                done < <(find "$input" -type f -print0)
            else
                # Non-recursive mode: only the files directly inside the directory
                for file in "$input"/*; do
                    if [ ! -f "$file" ]; then
                        continue
                    fi
                    # Check the extension filter
                    if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[@]} " =~ " ${file##*.} " ]]; then
                        process_file "$file"
                    fi
                done
            fi
        # The input is a regular file
        elif [ -f "$input" ]; then
            # Check the extension filter
            if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[@]} " =~ " ${input##*.} " ]]; then
                process_file "$input"
            fi
        else
            echo "Error: $input is not a valid file or directory." >&2
        fi
    done
) | {
    # Strip whitespace unless -S was given
    if $ignore_space; then
        tr -d '[:space:]'
    else
        cat
    fi
} | {
    # Fold to lower case if -i was given
    if $ignore_case; then
        tr '[:upper:]' '[:lower:]'
    else
        cat
    fi
} | rg -o . | sort | uniq -c | {
    # Sort ascending with -r, descending otherwise
    if $reverse; then
        sort -k1n
    else
        sort -k1nr
    fi
} | {
    i=1  # running rank
    while IFS=' ' read -ra line; do
        # Splitting on spaces drops the left padding that uniq -c adds.
        # NOTE: it also discards a counted character that is itself a space.
        count=${line[0]}
        char=${line[1]}  # second field: the character
        printf "%d\t%s\t%d\n" "$i" "$char" "$count"
        ((i++))  # advance the rank
    done
} | {
    if [ "$number" -gt 0 ]; then
        # -n given: show only the requested number of results
        head -n "$number"
    else
        cat
    fi
} > "${output_file:-/dev/stdout}"

# Print debug information
if $debug; then
    echo -e "
Debug Information
=================
Number:\t$number
Regex:\t$regex
Library:\t$library
Reverse:\t$reverse
Recursive:\t$recursive
Ignore Space:\t$ignore_space
Ignore Case:\t$ignore_case
File Formats:\t${file_formats[@]}
Processed Files:" | tee --append "${output_file:-/dev/null}"
    cat "$temp_file" | tee --append "${output_file:-/dev/null}"
fi

rm "$temp_file"  # remove the temporary file
```
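This script still has problems. The one I know of is that '␣' (an ordinary space) is not displayed correctly.
After struggling with it for a long time without finding a fix, I gave up and had Copilot write a Python version. (I am never writing Bash again: the syntax is ugly and hard to read, and the scripts come out long and unwieldy.)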
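One plausible explanation, offered as an assumption rather than a confirmed diagnosis: the script splits each `uniq -c` line on spaces to separate the count from the character, so a character that is itself a space disappears along with the padding. The Python sketch below only mimics that parsing step; the sample lines and both function names are made up for illustration.

```python
def parse_by_splitting(line):
    """Mimic word splitting: runs of spaces separate fields and never form one."""
    fields = [f for f in line.split(" ") if f]     # empty fields vanish
    count = int(fields[0])
    char = fields[1] if len(fields) > 1 else ""    # a lone space never becomes a field
    return count, char


def parse_by_position(line):
    """Take the count, then keep everything after the single separator verbatim."""
    stripped = line.lstrip(" ")                    # drop the padding before the count
    count, _, char = stripped.partition(" ")       # split once only
    return int(count), char


# Lines shaped like `uniq -c` output: padded count, one space, then the character.
letter_line = "     12 a"
space_line = "      5  "    # the counted character here is a space

print(parse_by_splitting(letter_line))   # count and character both survive
print(parse_by_splitting(space_line))    # the space character is lost
print(parse_by_position(space_line))     # the space character is kept
```

Splitting once, right after the count, keeps the rest of the line intact. That is the kind of change that would be needed in the `read` loop if this assumption is right.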
```python
#!/usr/bin/env python3
import argparse
import os
import re
import sys
from collections import Counter
from fnmatch import fnmatch

# Character-set libraries
libraries = {
    "c": r"\S",                # printable characters, approximated as any non-whitespace character
    "cp": r"[\S\s]",           # printable characters plus whitespace
    "cn": r"[\u4e00-\u9fff]",  # all common Chinese characters
    "en": r"[a-zA-Z]",         # all English letters
    "alnum": r"[a-zA-Z\d]",    # letters and digits
    "num": r"[\d]",            # all digits
    "sp": r"[\s]",             # whitespace characters
    "punc": r"[^\w\s]",        # punctuation: neither word characters nor whitespace
}

# Visible stand-ins for whitespace and control characters in the output
escape_dict = {" ": r"\s", "\n": r"\n", "\t": r"\t", "\r": r"\r",
               "\f": r"\f", "\v": r"\v", "\b": r"\b"}


def process_file(file, regex, library, ignore_space, ignore_case, verbose):
    """Return the characters (or regex matches) found in one file."""
    if verbose:
        print(f"Processing file: {file}")
    with open(file, 'r', encoding='utf-8') as f:
        content = f.read()
    if ignore_space:
        content = re.sub(r'\s', '', content)
    if ignore_case:
        content = content.lower()
    if regex:
        matches = re.findall(regex, content)
    elif library:
        matches = re.findall(libraries[library], content)
    else:
        matches = list(content)
    return matches


def main():
    parser = argparse.ArgumentParser(
        description='Count the occurrences of characters in files.',
        epilog='''libraries:
  c       All printable characters.
  cp      All printable and space characters.
  cn      All common Chinese characters.
  en      All English alphabetic characters.
  alnum   Alphabetic and numeric characters.
  num     Numeric characters.
  sp      Space characters.
  punc    Punctuation characters.''',
        formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-n', '--number', metavar="number", type=int, default=0,
                        help='The number of most common characters to display.')
    parser.add_argument('-e', '--expression', metavar="regex", type=str, default="",
                        help='The regular expression to match.')
    parser.add_argument('-l', '--library', metavar="library", type=str,
                        choices=list(libraries),
                        help='The character set library to use.')
    parser.add_argument('-f', '--format', metavar="format", type=str, default="",
                        help='The file formats to process.')
    parser.add_argument('-o', '--output', metavar="output", type=str, default="",
                        help='The output file.')
    parser.add_argument('-r', '--reverse', action='store_true', default=False,
                        help='Reverse the order of the output.')
    parser.add_argument('-R', '--recursive', action='store_true', default=False,
                        help='Recursively process directories.')
    parser.add_argument('-S', '--show-space', action='store_true', default=False,
                        help='Show whitespace characters.')
    # The long option name is kept as published, but the flag switches case
    # folding ON, so it is stored as ignore_case (the default stays case-sensitive).
    parser.add_argument('-i', '--case-sensitive', dest='ignore_case',
                        action='store_true', default=False,
                        help='Ignore case when matching.')
    parser.add_argument('-v', '--verbose', action='store_true', default=False,
                        help='Display verbose output.')
    parser.add_argument('paths', nargs='+',
                        help='The files or directories to process.')
    args = parser.parse_args()

    # -e and -l are mutually exclusive; unknown libraries are already
    # rejected by argparse through `choices`.
    if args.expression and args.library:
        parser.error("--expression and --library options cannot be used together.")

    file_formats = args.format.split(',') if args.format else []

    def wanted(name):
        """True if the file name passes the extension filter."""
        return not file_formats or any(fnmatch(name, f'*.{fmt}') for fmt in file_formats)

    results = []
    processed_files = []  # every file that was actually processed

    def handle(path):
        results.extend(process_file(path, args.expression, args.library,
                                    not args.show_space, args.ignore_case,
                                    args.verbose))
        processed_files.append(path)

    for path in args.paths:
        if os.path.isfile(path):
            if wanted(path):
                handle(path)
        elif os.path.isdir(path):
            for root, _dirs, files in os.walk(path):
                for name in files:
                    if wanted(name):
                        handle(os.path.join(root, name))
                if not args.recursive:
                    break  # only the top-level directory

    counter = Counter(results)
    most_common = counter.most_common(args.number if args.number > 0 else None)
    # Descending by count (ascending with -r); ties are ordered by character
    most_common.sort(key=lambda x: (x[1], x[0]) if args.reverse else (-x[1], x[0]))

    out = open(args.output, 'w', encoding='utf-8') if args.output else sys.stdout
    try:
        for i, (char, count) in enumerate(most_common, start=1):
            char = escape_dict.get(char, char)
            print(f"{i}\t{char}\t{count}", file=out)
    finally:
        if args.output:
            out.close()

    # After all files are done, list the files that were processed
    print("Processed files:")
    for file in processed_files:
        print(file)


if __name__ == "__main__":
    main()
```
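The ranking step can be tried in isolation. The sketch below orders counts from most to least frequent, breaks ties by character, and swaps whitespace for a visible escape before printing. `rank` and `VISIBLE` are hypothetical names chosen for this example, and the input string is made up.

```python
from collections import Counter

# Visible stand-ins for a few whitespace characters (illustrative subset)
VISIBLE = {" ": r"\s", "\n": r"\n", "\t": r"\t"}


def rank(text, reverse=False):
    """Return (rank, character, count) rows for every character in `text`."""
    counts = Counter(text).items()
    if reverse:
        ordered = sorted(counts, key=lambda kv: (kv[1], kv[0]))    # rarest first
    else:
        ordered = sorted(counts, key=lambda kv: (-kv[1], kv[0]))   # most common first
    return [(i, VISIBLE.get(ch, ch), n) for i, (ch, n) in enumerate(ordered, start=1)]


for row in rank("a b\tb a a"):
    print(*row, sep="\t")
```

Because the sort key includes the character itself, two characters with the same count always appear in the same order from run to run.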
./stats.py -h prints the usage help.
```
usage: stats.py [-h] [-n number] [-e regex] [-l library] [-f format] [-o output] [-r] [-R] [-S] [-i] [-v] paths [paths ...]

Count the occurrences of characters in files.

positional arguments:
  paths                 The files or directories to process.

options:
  -h, --help            show this help message and exit
  -n number, --number number
                        The number of most common characters to display.
  -e regex, --expression regex
                        The regular expression to match.
  -l library, --library library
                        The character set library to use.
  -f format, --format format
                        The file formats to process.
  -o output, --output output
                        The output file.
  -r, --reverse         Reverse the order of the output.
  -R, --recursive       Recursively process directories.
  -S, --show-space      Show whitespace characters.
  -i, --case-sensitive  Ignore case when matching.
  -v, --verbose         Display verbose output.

libraries:
  c       All printable characters.
  cp      All printable and space characters.
  cn      All common Chinese characters.
  en      All English alphabetic characters.
  alnum   Alphabetic and numeric characters.
  num     Numeric characters.
  sp      Space characters.
  punc    Punctuation characters.
```
The Bash version's usage is slightly different, but since that version is no longer maintained I have not updated it.
```
Usage: ./stats.sh [-n number] [-e regex] [-l library] [-f format] [-o output] [-d] [-r] [-R] [-S] [-i] [-h] file | dir [file | dir ...]

Options:
  -n number   Display the top number of results.
  -e regex    Use the specified regular expression.
  -l library  Use the specified character set library.
              This option cannot be used with the -e option.
  -f format   Process files of the specified format(s).
  -o output   Write the results to the specified output file.
  -d          Debug mode.
  -r          Reverse the order of the results.
  -R          Process directories recursively.
  -S          Show whitespace characters.
  -i          Ignore case.
  -h          Display this help message.

Arguments:
  file        The file(s) to process.
  dir         The directory to process.

Libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters.
```