Preserving word frequencies when cleaning NLP data

Question

I am using the following code to clean a corpus:

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = set().union(stopwords, cities, countries, firstnames, lastnames, otherwords)
filtered_token = set(token) - to_remove
# Output: {'account', 'follow'}

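Because token is converted to a set, the frequencies of repeated words are lost, which degrades TF-IDF performance. I want the cleaning step to preserve the frequency of each word that remains in the output.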

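I have a large corpus, and removing the unwanted words with a manual loop takes a very long time (about a week), whereas the code above finishes the whole cleaning job in an hour and a half.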

I would like to get the following output as quickly as possible:

{'account', 'follow', 'follow', 'account'}

David Heffernan · Answer

Try this code; I hope it helps:

from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

to_remove = {'stopword', 'city', 'country', 'firstname', 'lastname', 'otherword'}  # replace with your actual set of words to remove

# Filter out the unwanted words. The list comprehension keeps every occurrence
# of a retained word, so the word frequencies are preserved as they are.
filtered_token = [word for word in token if word not in to_remove]

# Count the retained words if explicit frequencies are needed
counts = Counter(filtered_token)

print(filtered_token)
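
For a large corpus, the same filter can be applied document by document. The following is a minimal sketch, assuming the corpus is an iterable of token lists; the function name clean_corpus and the sample data are hypothetical and only serve as illustration.

```python
from collections import Counter

def clean_corpus(documents, to_remove):
    """Yield each document with unwanted tokens dropped and duplicates kept."""
    # Membership tests against a set take constant time on average
    to_remove = set(to_remove)
    for tokens in documents:
        yield [word for word in tokens if word not in to_remove]

# Hypothetical sample data, not taken from the question's corpus
documents = [
    ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi'],
    ['follow', 'the', 'account', 'in', 'delhi'],
]
to_remove = {'hi', 'is', 'the', 'in', 'delhi'}

cleaned = list(clean_corpus(documents, to_remove))
term_counts = [Counter(doc) for doc in cleaned]

print(cleaned)
# [['account', 'follow', 'follow', 'account'], ['follow', 'account']]
print(term_counts)
```

Each document is scanned once, so the work grows with the total number of tokens rather than with the size of the removal set.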

Luiggi Mendoza · Answer

from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

# Build the set of words to remove
to_remove = {'stopword', 'city', 'country', 'firstname', 'lastname', 'otherword'}

# Filter out the words to remove. Every occurrence of a kept word survives,
# so the number of times each word appears is preserved.
output_token = [word for word in token if word not in to_remove]

# Count the frequencies of the kept words
word_counts = Counter(output_token)

print(output_token)
print(word_counts)

Shog9 · Answer

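Understood. If you want to preserve word frequencies without relying on a list comprehension, you can count the occurrences of every word in the original list with collections.Counter, delete the counts of the unwanted words, and then rebuild the output list from the remaining counts. Note that the rebuilt list groups identical words together instead of keeping their original order.

from collections import Counter

token = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

to_remove = {'stopword', 'city', 'country', 'firstname', 'lastname', 'otherword'}

# Count how often each word occurs in the original list
counts = Counter(token)

# Drop the counts of the words to remove
for word in to_remove:
    counts.pop(word, None)

# Rebuild the output list, repeating each kept word according to its count
output = list(counts.elements())

print(output)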

Lazer · Answer

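The OP wants the duplicates of the words that survive filtering to keep their frequency, and this requirement is subtler than it first looks. Keeping the duplicates themselves is easy, either with a list comprehension or by copying the list and then looping over the removal set to delete every occurrence from the copy. The existing answers do not cover the second method, though.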

What I want to explore here, however, is what preserving word frequency actually means:

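There are two interpretations:

1. The frequency of each retained word relative to the other retained words. The existing answers already provide a solution for this.
2. The frequency of each retained word relative to the original word sequence, that is, counting the words that were filtered out. In the OP's example the filtered sequence is ['account', 'follow', 'follow', 'account'], so each word has a frequency of 0.5. That satisfies the first interpretation: the two words are equally frequent. In the original sequence, however, each of these words has a frequency of 0.25. To keep that particular value, use the second interpretation.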
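
The two interpretations can be compared side by side in a short sketch. It assumes a removal set that contains only the three unwanted words from the OP's example; the variable names are illustrative.

```python
from collections import Counter

tokens = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = {'hi', 'is', 'delhi'}  # assumed removal set for this example

kept = [word for word in tokens if word not in to_remove]
kept_counts = Counter(kept)

# Interpretation 1: frequency relative to the retained words only
relative_to_kept = {word: n / len(kept) for word, n in kept_counts.items()}

# Interpretation 2: frequency relative to the length of the original sequence
relative_to_original = {word: n / len(tokens) for word, n in kept_counts.items()}

print(relative_to_kept)      # {'account': 0.5, 'follow': 0.5}
print(relative_to_original)  # {'account': 0.25, 'follow': 0.25}
```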

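We want to filter out certain words while still retaining the information that words were removed. One way is to replace each unwanted word with a marker that records its presence. In this example I use the empty string '' as the marker: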

from collections import Counter

tokens = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']

counts = Counter(tokens)
print(counts)
# Counter({'hi': 2, 'account': 2, 'follow': 2, 'is': 1, 'delhi': 1})

# Define the sets of words to remove
stopwords = {'a', 'an', 'is', 'the', 'am', 'hi'}
city = {'london', 'phnom penh', 'beijing', 'paris', 'delhi'}
country = {'india', 'new zealand', 'bhutan', 'laos'}
firstname = {'santosh', 'sanjay', 'sunil', 'khanh', 'lan'}
lastname = {'chen', 'wu', 'zhao', 'laurent', 'moreau'}
otherword = {'vortex'}

# Merge the sets above into one
to_remove = stopwords.union(city, country, firstname, lastname, otherword)

# Filter the words: replace each unwanted word with the '' marker
filtered_tokens = tokens[:]
for i in range(len(filtered_tokens)):
    if filtered_tokens[i] in to_remove:
        filtered_tokens[i] = ""
print(filtered_tokens)
# ['', '', 'account', '', 'follow', 'follow', 'account', '']

# Count word frequencies after filtering
filtered_counts = Counter(filtered_tokens)
print(filtered_counts)
# Counter({'': 4, 'account': 2, 'follow': 2})

# Compute normalised frequencies (note: this updates the counter in place)
def normalised_counter(counter):
    total_word_count = sum(counter.values(), 0.0)
    for key in counter:
        counter[key] /= total_word_count
    return counter

print(normalised_counter(counts))
# Counter({'hi': 0.25, 'account': 0.25, 'follow': 0.25, 'is': 0.125, 'delhi': 0.125})

print(normalised_counter(filtered_counts))
# Counter({'': 0.5, 'account': 0.25, 'follow': 0.25})

# When processing `filtered_tokens` later, keep or drop the '' marker depending on what the task needs.
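
As a possible follow-up, the marker can be ignored when reporting frequencies and stripped when a task needs only the real words. This is a sketch that starts from the marker-substituted list printed above; the constant MARKER is a hypothetical name introduced for readability.

```python
from collections import Counter

# Marker-substituted sequence, as printed by the code above
filtered_tokens = ['', '', 'account', '', 'follow', 'follow', 'account', '']

MARKER = ''  # hypothetical constant naming the placeholder

# Frequencies of the real words, still measured against all positions
total = len(filtered_tokens)
freq_with_removed = {
    word: n / total
    for word, n in Counter(filtered_tokens).items()
    if word != MARKER
}

# Sequence with the marker stripped, for tasks that need only real words
real_tokens = [word for word in filtered_tokens if word != MARKER]

print(freq_with_removed)  # {'account': 0.25, 'follow': 0.25}
print(real_tokens)        # ['account', 'follow', 'follow', 'account']
```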

Addendum:

If you want to build a filtered list according to the first interpretation, but with list.remove() instead of a list comprehension:

filtered_token = token[:]  # shallow copy, so the original list is left untouched
for item in to_remove:
    # The membership test guards remove(), so no ValueError can be raised
    while item in filtered_token:
        filtered_token.remove(item)
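
The list.remove() variant and a list comprehension should give the same result, which the sketch below checks on the OP's tokens with an assumed removal set. Every `in` test and every remove() call scans the list, so the remove-based loop is likely to be much slower on a large corpus than the single-pass comprehension.

```python
tokens = ['hi', 'hi', 'account', 'is', 'follow', 'follow', 'account', 'delhi']
to_remove = {'hi', 'is', 'delhi', 'paris'}  # assumed removal set for this example

# Variant A: delete every occurrence from a copy with list.remove()
removed_in_place = tokens[:]
for item in to_remove:
    while item in removed_in_place:
        removed_in_place.remove(item)

# Variant B: single pass with a list comprehension
comprehension = [word for word in tokens if word not in to_remove]

print(removed_in_place)                   # ['account', 'follow', 'follow', 'account']
print(removed_in_place == comprehension)  # True
```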