192. Word Frequency - 凌冬's Personal Blog

192. Word Frequency

Difficulty: Medium

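Write a bash script that counts how often each word occurs in a text file, words.txt.

For simplicity, you may assume:

words.txt contains only lowercase letters and the space character ' '.
Each word consists of lowercase letters only.
Words are separated by one or more space characters.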
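
The third assumption is the one that affects tokenizing: a run of several spaces must not produce empty words. Below is a minimal sketch of one way to split such a line, assuming a made-up sample string (the variable names `line` and `tokens` are illustrative and not part of the problem):

```shell
# Hypothetical sample line with irregular spacing between words.
line='the   day is  sunny'

# tr maps every space to a newline, and -s squeezes each run of
# newlines into one, leaving exactly one word per line.
tokens=$(printf '%s\n' "$line" | tr -s ' ' '\n')
printf '%s\n' "$tokens"
```

awk's default field splitting also treats a run of blanks as a single separator, which is why the awk solutions need no extra handling for repeated spaces.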

Example:

Assume that words.txt has the following content:

the day is sunny the the
the sunny is is

Your script should output the following, sorted by descending frequency:

the 4
is 3
sunny 2
day 1

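Notes:

Don't worry about ordering words that share the same frequency; every word's frequency is guaranteed to be unique.
Can you do it in one line?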

Solution

Language: Bash

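# Read from the file words.txt and output the word frequency list to stdout.
awk '
{
    # Count every space-separated field on the current line.
    for (i = 1; i <= NF; i++) {
        words[$i]++
    }
}
END {
    # Print "word count" pairs; awk iterates the array in unspecified order,
    # so the ordering is left to sort.
    for (w in words) {
        print w, words[w]
    }
}' words.txt | sort -k2 -r -n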

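awk '{
    for (i = 1; i <= NF; i++) {
        res[$i] += 1  # res is indexed by the word itself, so repeated words add to the same entry
    }
}
END {
    for (k in res) {
        print k " " res[k]
    }
}' words.txt | sort -nr -k2
# -n: compare numerically; -r: reverse (descending) order; -k2: sort on the second column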
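
The `-n` flag matters as soon as a count reaches two digits. The sample input never gets that far, so here is a small sketch with invented counts (the words and numbers are hypothetical) comparing a text sort with a numeric sort on the second column. `LC_ALL=C` is set only to make the text comparison deterministic.

```shell
# Hypothetical counts chosen so that text order and numeric order differ.
counts=$(printf 'alpha 9\nbeta 10\ngamma 2\n')

# Without -n the second column is compared as text, so "9" ranks above "10".
lexical=$(printf '%s\n' "$counts" | LC_ALL=C sort -r -k2)

# With n the column is compared as a number; -k2,2 limits the key to field 2.
numeric=$(printf '%s\n' "$counts" | LC_ALL=C sort -k2,2nr)

printf 'text sort:\n%s\n\nnumeric sort:\n%s\n' "$lexical" "$numeric"
```

In the text sort, `beta 10` ends up last even though it has the largest count. The numeric sort puts it first.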

Other approaches

# Read from the file words.txt and output the word frequency list to stdout.
cat words.txt | xargs -n 1 | sort | uniq -c | sort -nr | awk '{print $2" "$1}'
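
The pipeline above relies on `xargs -n 1` to put each word on its own line. A possible variant does the same split with `tr -s`, which avoids starting one `echo` per word and does not give quotes or backslashes any special meaning (the problem rules those characters out, so both versions work on valid input). This is a sketch rather than a reference solution. It recreates the sample input in a temporary directory, and the names `workdir` and `result` are illustrative.

```shell
# Recreate the sample input in a temporary directory (location is illustrative).
workdir=$(mktemp -d)
printf 'the day is sunny the the\nthe sunny is is\n' > "$workdir/words.txt"

# 1. tr -s        : one word per line (runs of spaces collapse to one newline)
# 2. sort|uniq -c : group identical words and prefix each with its count
# 3. sort -rn     : highest count first
# 4. awk          : swap the columns into "word count"
result=$(tr -s ' ' '\n' < "$workdir/words.txt" | sort | uniq -c | sort -rn | awk '{print $2, $1}')
printf '%s\n' "$result"

rm -r "$workdir"
```

`uniq -c` only merges adjacent duplicate lines, so the first `sort` is needed before it. The second `sort` then orders the lines by count.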

 
