
Function Documentation

lcquad_query_aug_script_with_time.py
Main script that performs the augmentation on the LC-QuAD dataset. The problem can be portrayed as follows: first decompose the SPARQL query and extract the wikidataIds from it; then find the representation of each wikidataId's entity in the input English query; then find a similar entity (with its id) in the Wikidata Knowledge Base; and finally render the replacements back into the templates of the input English sentence and SPARQL query.

Function definitions:

load_data
load_property_info_df
load_pid_relate_entity_df
produce_data_dict_for_search:
Load the required data sources: mainly the LC-QuAD dataset, the English properties of Wikidata, and the entity-to-property mappings (i.e. the graph nodes and their edges in the Wikidata Knowledge Graph).

http_get_wiki_entity_property_info_by_id
info_extracter:
Get info about entities and properties using HTTP requests.

retrieve_all_kb_part
retrieve_all_kb_part_wide:
Decompose the SPARQL query and extract all entity ids and property ids with the prefix "wd" or "wdt".
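The extraction step can be sketched with a regex over the query text; the actual retrieve_all_kb_part may handle more prefixes and query shapes, but this illustrates the idea:

```python
import re

def extract_wikidata_ids(sparql: str):
    """Collect entity ids (wd:Q...) and property ids (wdt:P...) from a SPARQL query.
    A regex sketch; \\b keeps "wd:" from matching inside "wdt:"."""
    entity_ids = re.findall(r"\bwd:(Q\d+)", sparql)
    property_ids = re.findall(r"\bwdt:(P\d+)", sparql)
    return entity_ids, property_ids
```

For example, `extract_wikidata_ids("SELECT ?x WHERE { wd:Q76 wdt:P26 ?x . }")` yields the entity list `['Q76']` and the property list `['P26']`.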

find_query_direction_format
find_query_prop_format:
Build the forward or backward SPARQL query format for running on an HDT file.
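A minimal sketch of the two directions as query templates; the function names and exact query shape are assumptions, since the real functions presumably build formats suited to the HDT search API:

```python
def forward_query(entity_id: str, prop_id: str) -> str:
    # forward: the entity is the subject, the object ?o is unknown
    return f"SELECT ?o WHERE {{ wd:{entity_id} wdt:{prop_id} ?o . }}"

def backward_query(entity_id: str, prop_id: str) -> str:
    # backward: the entity is the object, the subject ?s is unknown
    return f"SELECT ?s WHERE {{ ?s wdt:{prop_id} wd:{entity_id} . }}"
```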

one_part_g_producer
py_dumpNtriple
search_triples_with_parse:
See https://github.com/svjack/DeepPavlov-Chinese-KBQA/blob/main/api_doc.md

entity_property_search
Generate the data consumed by the function load_pid_relate_entity_df.

merge_nest_list
get_match_blk_by_diff
get_match_blk
get_match_intersection
get_match_intersection_fill_gap
sent_list_match_to_df
sent_list_match_to_df_with_bnd
sent_list_match_to_df_bnd_cat:
Find the most similar substring shared by a sentence and a given string.
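The matching idea can be sketched with difflib's longest matching block; the real functions above build block matches, intersections, and boundary handling step by step, which this sketch collapses into one call:

```python
from difflib import SequenceMatcher

def most_similar_substring(sentence: str, target: str) -> str:
    """Return the span of `sentence` that best matches `target`,
    using the longest case-insensitive matching block as the anchor."""
    m = SequenceMatcher(None, sentence.lower(), target.lower())
    blk = m.find_longest_match(0, len(sentence), 0, len(target))
    return sentence[blk.a: blk.a + blk.size]
```

For example, matching "President of France" against "Who is the president of France ?" recovers the span "president of France".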

lemmatize_one_token
lemma_score_match_it:
Simplified lemmatization treatment for English.
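A minimal sketch of the simplified lemma treatment, with a crude suffix-stripping rule standing in for whatever lemmatizer the script actually uses:

```python
from difflib import SequenceMatcher

def lemmatize_token(token: str) -> str:
    """Crude English lemmatizer: strip a few common suffixes
    when a reasonable stem remains. A sketch, not a real lemmatizer."""
    t = token.lower()
    for suf in ("ies", "es", "s", "ing", "ed"):
        if t.endswith(suf) and len(t) - len(suf) >= 3:
            return t[: len(t) - len(suf)]
    return t

def lemma_score(a: str, b: str) -> float:
    """Score two tokens by the similarity of their lemmas."""
    return SequenceMatcher(None, lemmatize_token(a), lemmatize_token(b)).ratio()
```

Under this rule, "films" and "film" both reduce to the stem "film" and score 1.0.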

guess_sim_representation
guess_sim_representation_by_score
map_reduce_guess_sim_representation_by_score
Use the wikidataIds from the decomposition to find the most similar entity representation in the Wikidata Knowledge Base and retrieve its substring representation in the English sentence.

search_entity_rep_by_lang_filter
search_entity_rep_by_lang_filter_in_db
search_entity_rep_by_lang_filter_by_init_dict
See https://github.com/svjack/DeepPavlov-Chinese-KBQA/blob/main/api_doc.md

find_zh_str_entityid_by_linking
find_zh_str_entityid_by_db
find_en_str_entityid_by_trans_near_linking
Find the entityid (wikidataId) of a text representation. If the text is not in the Knowledge Base, use find_en_str_entityid_by_trans_near_linking to first find a "near" text (in the sense of LaBSE embedding distance) that is in the Knowledge Base and has an entityid, and use that instead.
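The "near" lookup reduces to a nearest-neighbor search in embedding space. A sketch assuming the LaBSE vectors for the KB texts and the query are already computed:

```python
import numpy as np

def nearest_text(query_vec, kb_texts, kb_vecs):
    """Return the KB text whose embedding has the highest cosine
    similarity with query_vec. kb_vecs has one row per KB text."""
    q = query_vec / np.linalg.norm(query_vec)
    m = kb_vecs / np.linalg.norm(kb_vecs, axis=1, keepdims=True)
    return kb_texts[int(np.argmax(m @ q))]
```

The returned text's entityid would then be looked up and reused for the original, out-of-KB text.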

produce_pid_tuple_on_s_dict
load_pid_tuple_on_s_dict_zh_entity_search_table
Generate the data for the English properties of Wikidata and the entity-to-property mappings.

search_sim_entity_by_property_count_by_dict
search_sim_entity_by_property_count_by_dict_add_fuzz
search_sim_entity_by_property_count_by_dict_add_fuzz_f_by_db
search_sim_entity_by_property_count
search_sim_entity_by_property_count_by_explode
Search for a similar entity by the counts of its properties across several different self-prebuilt collections.
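The property-count similarity can be sketched as counting the overlap between property multisets; the handling of multiple collections and the fuzzy/db variants are omitted:

```python
from collections import Counter

def sim_entity_by_property_count(target_props, candidates):
    """Rank candidate entities by how many properties (with multiplicity)
    they share with the target entity's property list.
    `candidates` maps entity id -> list of property ids."""
    target = Counter(target_props)
    def overlap(props):
        return sum((Counter(props) & target).values())
    return max(candidates, key=lambda eid: overlap(candidates[eid]))
```

With target properties ["P31", "P106", "P27"], a candidate sharing "P31" and "P106" outranks one sharing none.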

sim_representation_decomp
Main function to decompose an English sentence and its corresponding SPARQL query, finding the mapping between each wikidataId and its natural language representation.

phrase_validation
Judge whether a span of a sentence is a phrase of that sentence; a phrase is defined as some tokens joined by blanks.
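A minimal sketch of this validation under that definition:

```python
def phrase_validation(sentence: str, span: str) -> bool:
    """True when `span` is a run of whole, consecutive tokens of
    `sentence` joined by blanks."""
    tokens = sentence.split()
    span_tokens = span.split()
    n = len(span_tokens)
    return any(tokens[i:i + n] == span_tokens
               for i in range(len(tokens) - n + 1))
```

So "president of" is a phrase of "the president of France", while the mid-word fragment "resident of" is not.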

most_sim_token_in_sent
recurrent_decomp_entity_str_by_en_sent
most_sim_token_in_sent finds the token in the sentence most similar to an entity string, and recurrent_decomp_entity_str_by_en_sent applies it recursively to find the longest phrase most similar to the entity string.
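The recurrent expansion can be sketched as a greedy window growth around the most similar token; the function names and the greedy loop here are stand-ins for the recursive scheme in the script:

```python
from difflib import SequenceMatcher

def most_sim_token_idx(tokens, entity_str):
    """Index of the single token most similar to the entity string."""
    scores = [SequenceMatcher(None, t.lower(), entity_str.lower()).ratio()
              for t in tokens]
    return scores.index(max(scores))

def grow_phrase(sentence, entity_str):
    """Start from the most similar token and extend the window
    left/right while similarity to the entity string improves."""
    tokens = sentence.split()
    i = most_sim_token_idx(tokens, entity_str)
    lo, hi = i, i + 1
    def score(a, b):
        return SequenceMatcher(None, " ".join(tokens[a:b]).lower(),
                               entity_str.lower()).ratio()
    improved = True
    while improved:
        improved = False
        if lo > 0 and score(lo - 1, hi) > score(lo, hi):
            lo -= 1; improved = True
        if hi < len(tokens) and score(lo, hi + 1) > score(lo, hi):
            hi += 1; improved = True
    return " ".join(tokens[lo:hi])
```

Given "the president of France spoke" and the entity string "president of France", the window starts at "president" and grows rightwards to the full phrase.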

sp_string_by_desc_str_list
Split the sentence by the description vocabulary, defined like gensim's ENGLISH_CONNECTOR_WORDS:

from gensim.models.phrases import ENGLISH_CONNECTOR_WORDS
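A dependency-free sketch of the splitting, with a small hand-rolled connector set standing in for gensim's ENGLISH_CONNECTOR_WORDS (which is a frozenset of similar function words):

```python
# Stand-in for gensim's ENGLISH_CONNECTOR_WORDS.
CONNECTOR_WORDS = {"of", "the", "a", "an", "in", "on", "and", "or", "with"}

def split_by_connectors(sentence, connectors=CONNECTOR_WORDS):
    """Split a sentence into chunks at connector words, dropping the
    connectors themselves."""
    chunks, current = [], []
    for tok in sentence.split():
        if tok.lower() in connectors:
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(tok)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

For example, "president of the United States" splits into the chunks "president" and "United States".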

guess_most_sim_pharse_in_en_sent
map_guess_dict_to_in_sent_mapping
slice_guess_dict_in_sent_mapping_df
map_guess_dict_to_in_sent_mapping_multi_times
The natural language representation produced by sim_representation_decomp may not be a phrase but only a substring retrieved by map_reduce_guess_sim_representation_by_score, so these functions transform that representation into a phrase of the English sentence. They can be seen as a map from a possibly unreasonable substring to a reasonable phrase.
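The substring-to-phrase repair can be sketched as widening the fragment to whole-token boundaries of the sentence; the multi-pass mapping and slicing in the functions above are omitted:

```python
def snap_to_phrase(sentence: str, fragment: str) -> str:
    """Widen a (possibly mid-word) fragment of `sentence` to full
    token boundaries, so the result is a phrase of whole words."""
    start = sentence.find(fragment)
    if start < 0:
        return fragment  # fragment not present; leave it unchanged
    end = start + len(fragment)
    while start > 0 and not sentence[start - 1].isspace():
        start -= 1
    while end < len(sentence) and not sentence[end].isspace():
        end += 1
    return sentence[start:end]
```

For example, the unreasonable fragment "resident of Fra" inside "the president of France" snaps out to the phrase "president of France".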

produce_sim_representation_reconstruct_df
Construct a dataframe whose rows are entity representations and their metadata (wikidataId, similarity score, and so on).

maintain_entity_cut_on_en_sent
one_row_aug
Functions that perform the augmentation on the above dataframe.

sim_representation_reconstruct_by_df
Main function: use sim_representation_decomp to decompose and produce_sim_representation_reconstruct_df to reconstruct, which completes the whole augmentation task.

only_fix_script_ser.py
trans_emb_utils.py:
Some toolkits for producing translations and embeddings, with comparison utilities.