Keywords: relation mining
synonym/synonymy/synonymous/alias extraction/detection/discovery/identification/generation
Focus of this note
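- Generate synonym candidates;
Both cases can usually be reduced to the same form through some co-occurrence behavior (such as clicks), namely a judgment on each candidate pair;
- Construct one or more similarity functions (and the features they require) to decide whether a synonym relation holds;
This applies to most relation judgments, such as hypernym/hyponym relations;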
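The two unsupervised steps above can be sketched as follows. This is a minimal illustration, not a method taken from any of the papers listed in these notes: the `CLICKS` log, the Jaccard overlap used as the similarity function, and the `threshold` value are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical click log: query -> set of clicked document ids.
CLICKS = {
    "ny": {"d1", "d2", "d3"},
    "new york": {"d1", "d2", "d3", "d4"},
    "boston": {"d5", "d6"},
}

def jaccard(a, b):
    """Overlap of two click sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidate_pairs(clicks):
    """Step 1: every pair of queries sharing at least one clicked document."""
    for q1, q2 in combinations(sorted(clicks), 2):
        if clicks[q1] & clicks[q2]:
            yield q1, q2

def synonym_pairs(clicks, threshold=0.5):
    """Step 2: keep candidate pairs whose similarity reaches the threshold."""
    return [
        (q1, q2)
        for q1, q2 in candidate_pairs(clicks)
        if jaccard(clicks[q1], clicks[q2]) >= threshold
    ]

print(synonym_pairs(CLICKS))  # [('new york', 'ny')]
```

Swapping `jaccard` for another similarity function, or combining several, leaves the candidate-then-judge structure unchanged, which is why the same shape also fits other relation types.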
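- Generate synonym candidates;
- Build a training set, train a model, and predict;
In general, given sufficient resources, one gradually moves from unsupervised to supervised methods;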
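A supervised variant of the same pipeline might look like the sketch below. Everything here is assumed for illustration: the `CLICKS` log, the `LABELED` pairs, the two toy features, and the choice of a hand-written logistic regression (kept dependency-free; in practice a library classifier would likely be used instead).

```python
import math

# Hypothetical click log and hand-labeled candidate pairs (1 = synonym).
CLICKS = {
    "ny": {"d1", "d2", "d3"},
    "new york": {"d1", "d2", "d3", "d4"},
    "nyc": {"d1", "d2", "d3"},
    "boston": {"d5", "d6"},
    "new jersey": {"d4", "d7", "d8"},
    "big apple": {"d1", "d2", "d3"},
}
LABELED = [
    ("ny", "new york", 1),
    ("ny", "nyc", 1),
    ("ny", "boston", 0),
    ("new york", "new jersey", 0),
    ("boston", "new jersey", 0),
]

def features(q1, q2, clicks):
    """Two toy features for a pair: click-set overlap and token overlap."""
    c1, c2 = clicks.get(q1, set()), clicks.get(q2, set())
    t1, t2 = set(q1.split()), set(q2.split())
    click_sim = len(c1 & c2) / len(c1 | c2) if c1 | c2 else 0.0
    token_sim = len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0
    return [click_sim, token_sim]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent for logistic regression; returns (weights, bias)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

def predict(x, w, b):
    """Estimated probability that the pair is a synonym pair."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

X = [features(q1, q2, CLICKS) for q1, q2, _ in LABELED]
y = [label for _, _, label in LABELED]
w, b = train(X, y)
print(predict(features("big apple", "nyc", CLICKS), w, b))
```

The unsupervised similarity functions are reused here as features, which is one natural way to move from the unsupervised setup to the supervised one once labels become available.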
Click Similarity (ClickSim) [3]
Document Similarity (DocSim) [4]
Pseudo Document Similarity (PseudoDocSim) [1]
Query Context Similarity (QCSim) [1]
Mine synonyms from search queries and search result pages (SRPs) using user behavioral data;
Query to Query
- Generate a series of query pairs from sessions of the same user;
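Session-based pair generation can be sketched as below. The log layout `(user_id, session_id, query)` and the choice to count unordered pairs of distinct queries are assumptions for illustration; the cited papers may define sessions and pairs differently.

```python
from collections import Counter
from itertools import combinations

# Hypothetical search log rows: (user_id, session_id, query).
LOG = [
    ("u1", "s1", "iphone case"),
    ("u1", "s1", "iphone cover"),
    ("u1", "s2", "usb cable"),
    ("u2", "s3", "iphone cover"),
    ("u2", "s3", "iphone case"),
    ("u2", "s3", "iphone case"),
]

def session_query_pairs(log):
    """Count unordered pairs of distinct queries issued within the same session."""
    sessions = {}
    for user, session, query in log:
        sessions.setdefault((user, session), set()).add(query)
    counts = Counter()
    for queries in sessions.values():
        for pair in combinations(sorted(queries), 2):
            counts[pair] += 1
    return counts

print(session_query_pairs(LOG))
```

The resulting counts give a list of candidate query pairs, with the number of sessions in which each pair co-occurred available as a simple filtering signal.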
Decide whether a synonym pair meets the requirements
- Mandal, Aritra, Ishita K. Khan, and Prathyusha Senthil Kumar. "Query Rewriting using Automatic Synonym Extraction for E-commerce Search." eCOM@SIGIR. 2019.
- Lu, Hanqing, et al. "Unsupervised Synonym Extraction for Document Enhancement in E-commerce Search." (2021).
- (2012,Chakrabarti) A Framework for Robust Discovery of Entity Synonyms
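Microsoft; entity synonyms; based on click data; vertical search domains (e-commerce/video); how to use synonyms in vertical search;
Proposes two similarity measures: Pseudo Document Similarity (PseudoDocSim, which improves on ClickSim and DocSim) and Query Context Similarity (QCSim, which compensates for the shortcomings of ClickSim and DocSim);
- (2011,Cheng) Entity Synonyms for Structured Web Search
Microsoft; Click Similarity (ClickSim)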
- Cheng T, Lauw H W, Paparizos S. Fuzzy matching of web queries to structured data[C]//2010 IEEE 26th International Conference on Data Engineering (ICDE 2010). IEEE, 2010: 713-716.
The paper that first proposed ClickSim;
- (2001,Turney) Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
Document Similarity (DocSim)
- How to Build a Smart Synonyms Model | by Patrick O'Neill | Kensho Blog
Mines synonyms from Wikipedia redirects; Kaggle code available;
- kdwd_aliases_and_disambiguation | Kaggle
The associated Kaggle code;
- Kensho Derived Wikimedia Dataset | Kaggle
The associated Wikipedia data;
- Introducing the Kensho Derived Wikimedia Dataset | by Gabriel Altay | Kensho Blog
How the Wikipedia data is parsed; describes how the raw Wikipedia data is converted into the Kensho version of the dataset; Kaggle code available;
- check if two words are related to each other - Stack Overflow
- smallwat3r/synonym: CLI tool to find synonyms in 15 different languages.
A Linux command-line tool that returns synonyms by calling the API provided by Thesaurus;
