Common Regular Expressions

1. Detecting Chinese characters

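def is_chinese(uchar):
    """Return True if a single Unicode character is a Chinese character."""
    return '\u4e00' <= uchar <= '\u9fa5'


def is_chinese_string(string):
    """Return True if every character in the string is a Chinese character."""
    return all(is_chinese(c) for c in string)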
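
The same check can also be written as a regular expression. The sketch below is an alternative, not the code above: `CHINESE_RE` and `is_chinese_string_re` are hypothetical names, and the pattern reuses the same `\u4e00-\u9fa5` range. One difference worth noting: `all(...)` returns True for an empty string, while `fullmatch` with `+` rejects it.

```python
import re

# Hypothetical name: one or more characters in the \u4e00-\u9fa5 range.
CHINESE_RE = re.compile(r"[\u4e00-\u9fa5]+")


def is_chinese_string_re(text: str) -> bool:
    """Return True if text is non-empty and every character is in the range."""
    return CHINESE_RE.fullmatch(text) is not None


print(is_chinese_string_re("正则表达式"))  # True
print(is_chinese_string_re("正则regex"))   # False
print(is_chinese_string_re(""))            # False
```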

2. Chinese, English, Korean, and Japanese characters

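Range	Description
\u4e00-\u9fa5	Unicode range for Chinese characters
\u0030-\u0039	Unicode range for digits
\u0041-\u005a	Unicode range for uppercase letters
\u0061-\u007a	Unicode range for lowercase letters
\uAC00-\uD7AF	Unicode range for Korean (Hangul syllables)
\u3040-\u31FF	Unicode range for Japanese (Hiragana and Katakana occupy \u3040-\u30FF; the rest of this span also includes Bopomofo and Hangul compatibility jamo)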
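
As a rough illustration of how these ranges can be used, the sketch below turns each row of the table into a character class and labels single characters. The names `SCRIPT_PATTERNS` and `classify_char` are assumptions made for this example.

```python
import re

# Character classes built from the ranges in the table above.
# The names and the dictionary layout are illustrative assumptions.
SCRIPT_PATTERNS = {
    "chinese": re.compile(r"[\u4e00-\u9fa5]"),
    "digit": re.compile(r"[\u0030-\u0039]"),
    "upper": re.compile(r"[\u0041-\u005a]"),
    "lower": re.compile(r"[\u0061-\u007a]"),
    "korean": re.compile(r"[\uAC00-\uD7AF]"),
    "japanese": re.compile(r"[\u3040-\u31FF]"),
}


def classify_char(ch: str) -> str:
    """Return the first matching label for a single character, or "other"."""
    for label, pattern in SCRIPT_PATTERNS.items():
        if pattern.fullmatch(ch):
            return label
    return "other"


print([classify_char(c) for c in "中A1한あ!"])
# ['chinese', 'upper', 'digit', 'korean', 'japanese', 'other']
```

Kanji used in Japanese text share code points with Chinese characters, so this sketch labels them "chinese"; telling the two languages apart needs more than a character range.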

3. Removing characters other than Chinese, English letters, and digits

import re

def clean(x: str):
    # Delete every character that is not Chinese, an ASCII digit, or an ASCII letter.
    str_text = re.sub(r"[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]", "", x)
    return str_text
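
A short usage sketch for this filter follows. It restates the function with a precompiled pattern so the block runs on its own; `NON_ZH_EN_DIGIT_RE` and `clean_text` are hypothetical names chosen for the example.

```python
import re

# Hypothetical name: matches any character that is NOT Chinese,
# an ASCII digit, or an ASCII letter.
NON_ZH_EN_DIGIT_RE = re.compile(
    r"[^\u4e00-\u9fa5\u0030-\u0039\u0041-\u005a\u0061-\u007a]"
)


def clean_text(text: str) -> str:
    """Drop everything except Chinese characters, ASCII letters, and digits."""
    return NON_ZH_EN_DIGIT_RE.sub("", text)


print(clean_text("Hello, 世界! 2024年"))  # Hello世界2024年
```

Spaces and punctuation are removed as well, so adjacent English words get joined. If word boundaries matter, one option is to add the space character to the class of characters to keep.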

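4. Extraction with lookbehind and lookahead

This technique appears in the BELLE project, where it is used to extract fields from ChatGPT-generated data.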

import re

# Each pattern lazily matches the text after one label (ASCII or full-width colon) up to the next label or the end.
intruction_pattern = re.compile(r"(?<=(?:" + '|'.join(['指令:', '指令：']) + r"))[\s\S]*?(?=" + '|'.join(['输入:', '输入：']) + r")")
input_pattern = re.compile(r"(?<=(?:" + '|'.join(['输入:', '输入：']) + r"))[\s\S]*?(?=" + '|'.join(['输出:', '输出：']) + r")")
output_pattern = re.compile(r"(?<=(?:" + '|'.join(['输出:', '输出：']) + r"))[\s\S]*?(?=$)")
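
To see what these patterns extract, here is a minimal sketch that applies the same construction to a made-up sample. The sample text, the `extract` helper, and the `*_re` names are assumptions for illustration; none of them come from BELLE.

```python
import re

# Same construction as above: a lookbehind for the starting label,
# a lazy match, and a lookahead for the next label (or end of string).
instruction_re = re.compile(r"(?<=(?:指令:|指令：))[\s\S]*?(?=输入:|输入：)")
input_re = re.compile(r"(?<=(?:输入:|输入：))[\s\S]*?(?=输出:|输出：)")
output_re = re.compile(r"(?<=(?:输出:|输出：))[\s\S]*?(?=$)")

# Made-up sample in the "指令/输入/输出" layout; not real BELLE data.
sample = "指令: 把句子翻译成英文\n输入: 你好\n输出: Hello"


def extract(pattern: re.Pattern, text: str) -> str:
    """Return the stripped first match, or an empty string if nothing matches."""
    match = pattern.search(text)
    return match.group(0).strip() if match else ""


print(extract(instruction_re, sample))  # 把句子翻译成英文
print(extract(input_re, sample))        # 你好
print(extract(output_re, sample))       # Hello
```

Python's `re` module only accepts fixed-width lookbehinds. The alternation works here because both label variants are exactly three characters long; labels of different lengths would need a capturing group instead.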

Resources

Regular expressions

BLCL的博客小馆 (BLCL's blog)
