NLP data cleaning: removing HTML tags, URLs, digits, punctuation, and other noise

2023-04-07 Author: Python自学

[python] Code library

import re
import string

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # English stopword list
nltk.download('punkt')  # tokenizer data needed by word_tokenize; recent NLTK releases may ask for 'punkt_tab' instead

# Build the stopword set once instead of on every call
STOPWORDS = set(stopwords.words('english'))


def clean_text(text):
    text = re.sub(r'<.*?>', '', text)  # remove HTML tags
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = re.sub(r'\d+', '', text)  # remove digits
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    text = text.lower()  # convert to lowercase
    words = nltk.word_tokenize(text)  # tokenize
    words = [w for w in words if w not in STOPWORDS]  # drop stopwords
    return ' '.join(words)
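
To see what each step does without downloading any NLTK data, the same pipeline can be sketched with the standard library only. This is an illustration under stated assumptions, not the author's function: `clean_text_demo`, `DEMO_STOPWORDS`, and the sample strings are hypothetical names introduced here, the stopword list is a tiny hand-picked stand-in for NLTK's English corpus, and whitespace splitting replaces `nltk.word_tokenize`.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list for illustration only;
# the snippet above uses NLTK's English stopword corpus instead.
DEMO_STOPWORDS = {"a", "an", "the", "is", "are", "at", "in", "on",
                  "of", "to", "and", "for", "this"}


def clean_text_demo(text, stopwords=DEMO_STOPWORDS):
    """Same cleaning steps as above, with str.split() as the tokenizer."""
    text = re.sub(r'<.*?>', '', text)    # remove HTML tags
    text = re.sub(r'http\S+', '', text)  # remove URLs
    text = re.sub(r'\d+', '', text)      # remove digits
    # remove ASCII punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = text.lower()                  # convert to lowercase
    words = text.split()                 # whitespace tokenization (assumption)
    return ' '.join(w for w in words if w not in stopwords)


sample = "<p>Visit https://example.com for 3 great NLP tips!</p>"
print(clean_text_demo(sample))  # → visit great nlp tips
```

One consequence of the step order is worth knowing: punctuation is stripped before stopword filtering, so a contraction such as "don't" becomes "dont" and no longer matches a stopword entry. If that matters for your data, filtering stopwords before removing punctuation is one possible adjustment.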
