Little Gadgets

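On Lunar New Year's Eve I wanted character statistics for my blog posts, but a quick search turned up nothing suitable, so I wrote a tool with Copilot's help. I also decided to collect similar little gadgets in one place from now on. This post serves as the introduction; the code itself lives on GitHub.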

stats.py

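Counts how many times each character appears in files.
Features: limit the number of results shown (default: show all), match with a regular expression, match with a built-in character set (mutually exclusive with the previous option), restrict by file extension (default: all files), recurse into directories (default: no recursion), ignore whitespace (default: ignored), ignore case (default: case-sensitive), and write the results to a file (default: standard output).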
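
The core of such a tool fits in a few lines. The sketch below is not the script itself: `count_chars` and its parameters are names made up for illustration, and it only covers the whitespace, case, and top-N options on a string that is already in memory.

```python
from collections import Counter


def count_chars(text, ignore_space=True, ignore_case=False, top=None):
    """Return (char, count) pairs, most frequent first.

    Hypothetical helper for illustration; not part of stats.py.
    """
    if ignore_space:
        # str.split() with no arguments splits on any run of whitespace,
        # so joining the pieces removes all whitespace characters.
        text = "".join(text.split())
    if ignore_case:
        text = text.lower()
    counts = Counter(text)
    # Sort by descending count, then by character, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked if top is None else ranked[:top]


if __name__ == "__main__":
    top3 = count_chars("Hello, hello world", ignore_case=True, top=3)
    for rank, (char, count) in enumerate(top3, start=1):
        print(f"{rank}\t{char}\t{count}")
```

The rank, character, count layout mirrors the tab-separated output the scripts produce.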

I started in Bash. The result was long and clumsy, and after a lot of fiddling it still had problems.

stats.sh (requires ripgrep, perl, etc.) GitHub
#!/bin/bash

# DEPRECATED: Use `stats.py` instead.

# Variables holding the option values
number=0
regex=""
library=""
reverse=false
recursive=false
ignore_space=true
ignore_case=false
debug=false
file_formats=()
output_file=""
temp_file=$(mktemp)

# Perl character-class library
declare -A libraries=(
    [c]="[:graph:]"           # all printable characters
    [cp]="[:print:]"          # all printable characters plus space
    [cn]="\\p{Script=Han}"    # Han (Chinese) characters
    [en]="a-zA-Z"             # English letters
    [alnum]="[:alnum:]"       # letters and digits
    [num]="[:digit:]"         # digits
    [sp]="[:space:]"          # whitespace characters
    [punc]="[:punct:]"        # punctuation characters
)

# Print the usage message
usage() {
    echo -e "Usage: $0 [-n number] [-e regex] [-l library] [-f format] [-o output] [-d] [-r] [-R] [-S] [-i] [-h] file | dir [file | dir ...]

Options:
  -n number   Display the top number of results.
  -e regex    Use the specified regular expression.
  -l library  Use the specified character set library. This option cannot be used with the -e option.
  -f format   Process files of the specified format(s).
  -o output   Write the results to the specified output file.
  -d          Debug mode.
  -r          Reverse the order of the results.
  -R          Process directories recursively.
  -S          Show whitespace characters.
  -i          Ignore case.
  -h          Display this help message.

Arguments:
  file        The file(s) to process.
  dir         The directory to process.

Libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters."
}

# Emit the characters of one file, filtered by -e or -l if given
process_file() {
    local file=$1
    echo "$file" >> "$temp_file"
    # Choose how to process the file depending on -e / -l
    if [ -n "$regex" ]; then
        # -e: keep only what the regular expression matches
        perl -C -ne "while (/$regex/g) {print \"\$&\n\"}" "$file"
    elif [ -n "$library" ]; then
        # -l: keep only characters from the chosen library
        perl -C -ne 'while (/(['"${libraries[$library]}"'])/g) {print "$1\n"}' "$file"
    else
        # Neither -e nor -l: pass every character through
        cat "$file"
    fi
}

# Parse the command-line options with a getopts loop
while getopts "n:e:l:f:o:drRSih" opt; do
    case $opt in
        n) number=$OPTARG
           # Check that number is a non-negative integer
           if ! [[ "$number" =~ ^[0-9]+$ ]]; then
               echo "Error: -n option requires a positive integer argument." >&2
               exit 1
           fi ;;
        e) regex=$OPTARG
           # -e and -l cannot be used together
           if [ -n "$library" ]; then
               echo "Error: -e and -l options cannot be used together." >&2
               exit 1
           fi ;;
        l) library=$OPTARG
           # -e and -l cannot be used together
           if [ -n "$regex" ]; then
               echo "Error: -e and -l options cannot be used together." >&2
               exit 1
           fi
           # Check that the requested library exists
           if ! [[ -v libraries["$library"] ]]; then
               echo "Error: Unknown library -$library" >&2
               exit 1
           fi ;;
        d) debug=true ;;
        r) reverse=true ;;
        R) recursive=true ;;
        S) ignore_space=false ;;
        i) ignore_case=true ;;
        f) IFS=',' read -ra formats <<< "$OPTARG"
           for format in "${formats[@]}"; do
               file_formats+=("$format")
           done ;;
        o) output_file=$OPTARG ;;
        h) usage
           exit 0 ;;
        \?) echo "Invalid option -$OPTARG" >&2
            usage
            exit 1 ;;
    esac
done

# Drop the options that have already been parsed
shift $((OPTIND - 1))

# Require at least one file or directory argument
if [ $# -eq 0 ]; then
    echo "Error: At least one file argument is required." >&2
    usage
    exit 1
fi

# Process each input file or directory
(
    for input in "$@"; do
        # The input is a directory
        if [ -d "$input" ]; then
            if $recursive; then
                # With -R, use find to walk the directory tree
                while IFS= read -r -d '' file
                do
                    # Apply the extension filter
                    if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[*]} " =~ " ${file##*.} " ]]; then
                        process_file "$file"
                    fi
                done < <(find "$input" -type f -print0)
            else
                # Without -R, only process files directly inside the directory
                for file in "$input"/*; do
                    if [ ! -f "$file" ]; then
                        continue
                    fi
                    # Apply the extension filter
                    if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[*]} " =~ " ${file##*.} " ]]; then
                        process_file "$file"
                    fi
                done
            fi
        # The input is a regular file
        elif [ -f "$input" ]; then
            # Apply the extension filter
            if [ ${#file_formats[@]} -eq 0 ] || [[ " ${file_formats[*]} " =~ " ${input##*.} " ]]; then
                process_file "$input"
            fi
        else
            echo "Error: $input is not a valid file or directory." >&2
        fi
    done
) | {
    # Strip whitespace unless -S was given
    if $ignore_space; then
        tr -d '[:space:]'
    else
        cat
    fi
} | {
    # Fold to lower case if -i was given
    if $ignore_case; then
        tr '[:upper:]' '[:lower:]'
    else
        cat
    fi
} | rg -o . | sort | uniq -c | {
    # Sort ascending with -r, descending otherwise
    if $reverse; then
        sort -k1n
    else
        sort -k1nr
    fi
} | {
    i=1 # running rank
    while IFS=' ' read -ra line; do
        # read splits on spaces, which drops the leading padding from uniq -c
        count=${line[0]}
        char=${line[1]} # second field: the character
        printf "%d\t%s\t%d\n" "$i" "$char" "$count"
        ((i++)) # next rank
    done
} | {
    if [ "$number" -gt 0 ]; then
        # With -n, show only the requested number of results
        head -n "$number"
    else
        cat
    fi
} > "${output_file:-/dev/stdout}"

# Print debug information
if $debug; then
    echo -e "
Debug Information
=================
Number:\t$number
Regex:\t$regex
Library:\t$library
Reverse:\t$reverse
Recursive:\t$recursive
Ignore Space:\t$ignore_space
Ignore Case:\t$ignore_case
File Formats:\t${file_formats[*]}
Processed Files:" | tee --append "${output_file:-/dev/null}"
    tee --append "${output_file:-/dev/null}" < "$temp_file"
fi

rm "$temp_file" # remove the temporary file

The script still has problems. The one I know of so far is that '␣' (an ordinary space) is not displayed correctly.
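
One plausible cause, assuming GNU `uniq -c` output (a right-aligned count, one separator space, then the line itself): the final `read -ra` loop splits each line on spaces, so a line whose content is itself a space comes back with an empty character field. The Python sketch below reproduces the effect on a hard-coded sample line. `parse_naive` and `parse_fixed` are hypothetical names, not part of either script, and this is a guess at the cause rather than a confirmed diagnosis.

```python
import re

# Count, exactly one separator space, then the rest of the line verbatim.
UNIQ_LINE = re.compile(r"^\s*(\d+) (.*)$", re.DOTALL)


def parse_naive(line):
    """Split on runs of spaces, as `IFS=' ' read -ra` does in the Bash loop."""
    fields = [field for field in line.split(" ") if field]
    count = int(fields[0])
    char = fields[1] if len(fields) > 1 else ""
    return count, char


def parse_fixed(line):
    """Keep everything after the single separator space."""
    match = UNIQ_LINE.match(line)
    if match is None:
        raise ValueError(f"unexpected line: {line!r}")
    return int(match.group(1)), match.group(2)


if __name__ == "__main__":
    sample = "      5  "  # count 5, separator space, then a space character
    print(parse_naive(sample))  # the character field is lost
    print(parse_fixed(sample))  # the space survives
```

Parsing by position instead of by field is the design choice here: the counted character can be any byte, including the delimiter.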

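After a lot of fiddling I still could not fix it, so I gave up and had Copilot write a Python version (no more Bash for me: the syntax is ugly and hard to read, and the scripts end up long and clumsy).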

stats.py (initial version; for the latest version, follow the GitHub link on the right) GitHub
#!/usr/bin/python3

import argparse
import os
import re
import sys
from collections import Counter
from fnmatch import fnmatch

# Character-set libraries
libraries = {
    "c": r"[^\W]",              # labelled "all printable characters"; [^\W] equals \w, so word characters only
    "cp": r"[^\W]|[\s]",        # the above plus whitespace
    "cn": r"[\u4e00-\u9fff]",   # common Chinese characters
    "en": r"[a-zA-Z]",          # English letters
    "alnum": r"[a-zA-Z\d]",     # letters and digits
    "num": r"[\d]",             # digits
    "sp": r"[\s]",              # whitespace characters
    "punc": r"[^\w\s]",         # punctuation characters
}

# How whitespace and control characters are written in the output
escape_dict = {" ": r"\s", "\n": r"\n", "\t": r"\t", "\r": r"\r", "\f": r"\f", "\v": r"\v", "\b": r"\b"}


def process_file(file, regex, library, ignore_space, ignore_case, verbose):
    """Return the list of matched characters (or regex matches) in one file."""
    if verbose:
        print(f"Processing file: {file}")
    with open(file, 'r', encoding='utf-8') as f:
        content = f.read()
    if ignore_space:
        content = re.sub(r'\s', '', content)
    if ignore_case:
        content = content.lower()
    if regex:
        matches = re.findall(regex, content)
    elif library:
        matches = re.findall(libraries[library], content)
    else:
        matches = list(content)
    return matches


def main():
    parser = argparse.ArgumentParser(description='Count the occurrences of characters in files.',
                                     epilog='''libraries:
  c       All printable characters.
  cp      All printable and space characters.
  cn      All common Chinese characters.
  en      All English alphabetic characters.
  alnum   Alphabetic and numeric characters.
  num     Numeric characters.
  sp      Space characters.
  punc    Punctuation characters.''',
                                     formatter_class=argparse.RawTextHelpFormatter)
    parser.add_argument('-n', '--number', metavar="number", type=int, default=0, help='The number of most common characters to display.')
    parser.add_argument('-e', '--expression', metavar="regex", type=str, default="", help='The regular expression to match.')
    parser.add_argument('-l', '--library', metavar="library", type=str, choices=libraries.keys(), help='The character set library to use.')
    parser.add_argument('-f', '--format', metavar="format", type=str, default="", help='The file formats to process.')
    parser.add_argument('-o', '--output', metavar="output", type=str, default="", help='The output file.')
    parser.add_argument('-r', '--reverse', action='store_true', default=False, help='Reverse the order of the output.')
    parser.add_argument('-R', '--recursive', action='store_true', default=False, help='Recursively process directories.')
    parser.add_argument('-S', '--show-space', action='store_true', default=False, help='Show whitespace characters.')
    # The long option keeps its original name; dest makes the stored value match
    # the help text, so matching is case-sensitive unless -i is given.
    parser.add_argument('-i', '--case-sensitive', dest='ignore_case', action='store_true', default=False, help='Ignore case when matching.')
    parser.add_argument('-v', '--verbose', action='store_true', default=False, help='Display verbose output.')
    parser.add_argument('paths', nargs='+', help='The files or directories to process.')
    args = parser.parse_args()

    if args.expression and args.library:
        print("Error: --expression and --library options cannot be used together.", file=sys.stderr)
        sys.exit(1)

    if args.library and args.library not in libraries:
        print(f"Error: Unknown library --{args.library}", file=sys.stderr)
        sys.exit(1)

    file_formats = args.format.split(',') if args.format else []

    results = []
    processed_files = []  # files that were actually processed
    for path in args.paths:
        if os.path.isfile(path):
            if not file_formats or any(fnmatch(path, f'*.{fmt}') for fmt in file_formats):
                results.extend(process_file(path, args.expression, args.library, not args.show_space, args.ignore_case, args.verbose))
                processed_files.append(path)  # record the file once it is done
        elif os.path.isdir(path):
            for root, dirs, files in os.walk(path):
                for name in files:
                    if not file_formats or any(fnmatch(name, f'*.{fmt}') for fmt in file_formats):
                        results.extend(process_file(os.path.join(root, name), args.expression, args.library, not args.show_space, args.ignore_case, args.verbose))
                        processed_files.append(os.path.join(root, name))  # record the file once it is done
                if not args.recursive:
                    break  # os.walk yields the top directory first; stop after it

    counter = Counter(results)
    most_common = counter.most_common(args.number if args.number > 0 else None)
    most_common.sort(key=lambda x: (x[1], x[0]) if args.reverse else (-x[1], x[0]))

    f = open(args.output, 'w', encoding='utf-8') if args.output else sys.stdout
    for i, (char, count) in enumerate(most_common, start=1):
        char = escape_dict.get(char, char)
        print(f"{i}\t{char}\t{count}", file=f)
    if args.output:
        f.close()

    # After all files are done, list the files that were processed
    print("Processed files:")
    for file in processed_files:
        print(file)


if __name__ == "__main__":
    main()
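
To see what a few of the character-set patterns above pick out, the sketch below runs them over a short mixed string with `re.findall`. The `PATTERNS` dict copies four entries from the script; the `extract` helper and the sample text are made up for illustration.

```python
import re

# Four patterns copied from the `libraries` dict in the script.
PATTERNS = {
    "c": r"[^\W]",
    "cn": r"[\u4e00-\u9fff]",
    "en": r"[a-zA-Z]",
    "punc": r"[^\w\s]",
}


def extract(library, text):
    """Return every single-character match of the named pattern."""
    return re.findall(PATTERNS[library], text)


if __name__ == "__main__":
    sample = "Hi, 你好!"
    for name in PATTERNS:
        print(name, extract(name, sample))
```

Note that `c` returns no punctuation here: `[^\W]` is equivalent to `\w`, which is narrower than the "all printable characters" label suggests.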

./stats.py -h prints the usage help.

usage: stats.py [-h] [-n number] [-e regex] [-l library] [-f format] [-o output] [-r] [-R] [-S] [-i] [-v] paths [paths ...]

Count the occurrences of characters in files.

positional arguments:
  paths                 The files or directories to process.

options:
  -h, --help            show this help message and exit
  -n number, --number number
                        The number of most common characters to display.
  -e regex, --expression regex
                        The regular expression to match.
  -l library, --library library
                        The character set library to use.
  -f format, --format format
                        The file formats to process.
  -o output, --output output
                        The output file.
  -r, --reverse         Reverse the order of the output.
  -R, --recursive       Recursively process directories.
  -S, --show-space      Show whitespace characters.
  -i, --case-sensitive  Ignore case when matching.
  -v, --verbose         Display verbose output.

libraries:
  c       All printable characters.
  cp      All printable and space characters.
  cn      All common Chinese characters.
  en      All English alphabetic characters.
  alnum   Alphabetic and numeric characters.
  num     Numeric characters.
  sp      Space characters.
  punc    Punctuation characters.
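
How `-f` and `-R` interact can be sketched as follows. This mirrors the walk-and-filter approach of the initial version but is a simplified stand-in: `iter_files` is a hypothetical name, and the demo builds a throwaway directory tree instead of touching real files.

```python
import os
import tempfile
from fnmatch import fnmatch


def iter_files(path, formats=(), recursive=False):
    """Yield files under `path` whose extension is in `formats` (all if empty)."""
    def wanted(name):
        return not formats or any(fnmatch(name, f"*.{fmt}") for fmt in formats)

    if os.path.isfile(path):
        if wanted(path):
            yield path
        return
    for root, dirs, files in os.walk(path):
        for name in sorted(files):
            if wanted(name):
                yield os.path.join(root, name)
        if not recursive:
            break  # os.walk yields the top directory first; stop after it


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as top:
        os.mkdir(os.path.join(top, "sub"))
        for rel in ("a.md", "b.txt", os.path.join("sub", "c.md")):
            with open(os.path.join(top, rel), "w", encoding="utf-8") as f:
                f.write("x")
        # Without recursion only the top-level .md file is found.
        print([os.path.relpath(p, top) for p in iter_files(top, ["md"])])
        # With recursion the nested .md file is found as well.
        print([os.path.relpath(p, top) for p in iter_files(top, ["md"], recursive=True)])
```
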
The Bash version's usage differs slightly, but since it is no longer maintained it has not been updated.
Usage: ./stats.sh [-n number] [-e regex] [-l library] [-f format] [-o output] [-d] [-r] [-R] [-S] [-i] [-h] file | dir [file | dir ...]

Options:
  -n number   Display the top number of results.
  -e regex    Use the specified regular expression.
  -l library  Use the specified character set library. This option cannot be used with the -e option.
  -f format   Process files of the specified format(s).
  -o output   Write the results to the specified output file.
  -d          Debug mode.
  -r          Reverse the order of the results.
  -R          Process directories recursively.
  -S          Show whitespace characters.
  -i          Ignore case.
  -h          Display this help message.

Arguments:
  file        The file(s) to process.
  dir         The directory to process.

Libraries:
  c           All printable characters.
  cp          All printable and space characters.
  cn          All common Chinese characters.
  en          All English alphabetic characters.
  alnum       Alphabetic and numeric characters.
  num         Numeric characters.
  sp          Space characters.
  punc        Punctuation characters.