notes

python正则表达式基础练习

# python正则表达式基础

正则文档

字符	描述
\d	匹配一个数字字符。等价于[0-9]
\D	匹配一个非数字字符。等价于[^0-9]
\n	匹配一个换行符。等价于\x0a
\s	匹配任何空白字符，包括空格、制表符、换页符等等。
\S	匹配任何非空白字符。
\w	匹配包括下划线的任何单词字符。等价于“`[A-Za-z0-9_]`”。
?	匹配前面的子表达式零次或一次。
+	匹配前面的子表达式一次或多次。
*	匹配前面的子表达式零次或多次。
^	匹配输入字符串的开始位置。
$	匹配输入字符串的结束位置。
\	转义
x\|y	匹配x或y
[xyz]	字符集合。匹配所包含的任意一个字符。
[^xyz]	负值字符集合。匹配未包含的任意字符
[a-z]	字符范围。匹配指定范围内的任意字符。
[^a-z]	负值字符范围。匹配任何不在指定范围内的任意字符。

警告

导入正则模块，注意当前文件夹/文件不要命名为re，防止冲突

import re

可以使用re.complie()编译一个正则表达式，创建一个正则对象，可以多次使用，提高效率

re.search()方法只会返回第一个匹配的结果，需要匹配多个结果使用re.findall()方法

import re

pattern = re.compile("\d{1,}")
text = '111w222w333'
match = re.search(pattern,text)
print(match.group())
match = re.findall(pattern,text)
print(match)

1
2
3
4
5
6
7
8

# 括号分组

import re

pattern = re.compile("(114514)(.*)?(1919810)")
text = "114514aaaaaaaa1919810"
match = re.search(pattern,text)
print(match.groups())

1
2
3
4
5
6

# 管道符分组

利用管道符分组，匹配其中任意一个

import re

pattern = re.compile("'|\.|class|base|subclasses|init|globals|classeval|assert|import|exec|popen|subpress|read")
text = "''.__class__.__base__.__subclasses__()[137].__init__.__globals__['popen']('ls').read()"
match = re.findall(pattern,text)
print(match)

1
2
3
4
5
6

# 问号实现可选匹配

在匹配文本中，某段字符是否存在都可以时，使用问号表示这部分可选，例如匹配url中的s

import re

pattern = re.compile("http(s)?://[\w.]+")
text = "hi,welcome to https://the0n3.top !!!Nice to meet u!!!"
match = re.search(pattern,text)
print(match.group())

1
2
3
4
5
6

# 星号实现0次或多次匹配

*表示0次或多次

import re

pattern = re.compile("([a-z]+(1145141919810)*[a-z]+)")
text = "abcabc"
text1 = "abc1145141919810abc"
match = re.findall(pattern,text)
match1 = re.findall(pattern,text1)
print(match)
print(match1)

1
2
3
4
5
6
7
8
9

# 加号实现1次或多次匹配

+表示至少一次或多次

import re

pattern = re.compile("([a-z]+(1145141919810)+[a-z]+)")
text = "abcabc"
text1 = "abc1145141919810abc"
match = re.findall(pattern,text)
match1 = re.findall(pattern,text1)
print(match)
print(match1)

1
2
3
4
5
6
7
8
9

# 花括号实现指定次数匹配

{m,n}，第一个参数表示至少多少次，第二个参数表示最多多少次，可以省略其中一个参数

import re

pattern = re.compile("abc{1,3}")
text = "abcabcabc"
match = re.findall(pattern,text)
print(match)

1
2
3
4
5
6

# 贪心匹配和非贪心匹配

贪心匹配常用.*来表示，会尽可能多的匹配，在这里，第一个14也是符合的，但是他会继续往后匹配，判断是否仍然存在符合的字符

import re

pattern = re.compile("abc.*14")
text = "abc114514"
match = re.findall(pattern,text)
print(match)

1
2
3
4
5
6

非贪心匹配（惰性匹配），可以用.*?，在.*后面加一个问号，表示尽可能少的匹配，这样第一个14就是符合的，不会继续往后匹配

import re

pattern = re.compile("abc.*?14")
text = "abc114514"
match = re.findall(pattern,text)
print(match)

1
2
3
4
5
6

# 插入字符和美元字符

^表示匹配字符串的开始位置，$表示匹配字符串的结束位置

import re

pattern = re.compile("^\w.*$")
text = "abc114514"
match = re.findall(pattern,text)
print(match)

1
2
3
4
5
6

# 通配字符

在正则里，.点是通配字符，可以匹配换行符之外的任意字符

import re

pattern = re.compile(".*")
text = "abc114514"
match = re.findall(pattern,text)
print(match)

1
2
3
4
5
6

练习几个常见正则匹配

# 1. 匹配电子邮件地址

example@example.com, example@example.org, and example@example.net.

import re 

text = 'example@example.com, example@example.org, and example@example.net.'
pattern = r"\b[\w\.-]+@[\w\.-]+\.[a-z]+\b"
matches = re.findall(pattern,text)
for match in matches:
	print(match)

1
2
3
4
5
6
7

# 2. 匹配日期

The events are scheduled for 2024-08-01, 2024-09-15, and 2024-12-25.

import re 

text = 'The events are scheduled for 2024-08-01, 2024-09-15, and 2024-12-25.'
pattern = r"\b\d{4}-\d{2}-\d{2}\b"
matches = re.findall(pattern,text)
for match in matches:
	print(match)

1
2
3
4
5
6
7

# 3. 匹配电话号码

You can reach us at (123) 456-7890 or (987) 654-3210.

import re 

text = 'You can reach us at (123) 456-7890 or (987) 654-3210.'
pattern = r'\(\d{3}\) \d{3}-\d{4}'
matches = re.findall(pattern,text)
for match in matches:
	print(match)

1
2
3
4
5
6
7

# 4. 匹配 URL

Visit our website at http://www.example.com or https://example.org/page?query=1.

import re 

text = 'Visit our website at http://www.example.com or https://example.org/page?query=1.'
pattern = r'https?://[\w\.-]+\.[a-z]{2,}(?:/[\w\.-]*)?(?:\?[^\s\.]*)?'
matches = re.findall(pattern,text)
for match in matches:
	print(match)

1
2
3
4
5
6
7

# 5. 匹配所有单词

Here are some words: hello, world, python3, and regex_example.

import re 

text = 'Here are some words: hello, world, python3, and regex_example.'
pattern = r'\b[\w0-9]+\b'
matches = re.findall(pattern,text)
for match in matches:
	print(match)

1
2
3
4
5
6
7

# 6. 匹配html标签中的url

hi,how r u,i'm fine,3ku
<a href="https://the0n3.top/">hello,cola!</a>
<a href="https://baidu.com/">shit</a>
bye,bro

1
2
3
4

import re 

text = """hi,how r u,i'm fine,3ku
<a href="https://the0n3.top/">hello,cola!</a>
<a href="https://baidu.com/">shit</a>
bye,bro
"""
# 捕获组(pattern)
pattern = r'<a href="(https?://[\w-]+\.[a-z]{2,}/?)"'
matches = re.findall(pattern,text)
for match in matches:
	print(match)

1
2
3
4
5
6
7
8
9
10
11
12

参考：

[【Python编程快速上手--让繁琐工作自动化]

编辑

最后一次更新于: 2024/10/11, 23:22:45

← BurpSuite安装激活教程【踩坑】docker无法访问容器→