Function Documentation

ner_model.py:
A self-trained NER model that extracts entities and properties from Chinese questions. It uses adapter-hub's adapter-transformers for the NER downstream task. The E-TAG marks entities and the T-TAG marks properties.
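A minimal inference sketch of how such a token-classification model can be called; the checkpoint path and label names below are placeholders, not the repository's actual files (training itself goes through adapter-transformers):

```python
# Hypothetical sketch: run a token-classification (NER) model over a Chinese
# question and read off BIO tags. The checkpoint path is a placeholder.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained("path/to/ner-checkpoint")

question = "姚明的妻子是谁"
inputs = tokenizer(question, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits              # [1, seq_len, num_labels]
pred_ids = logits.argmax(dim=-1)[0].tolist()
tags = [model.config.id2label[i] for i in pred_ids]  # e.g. B-E, I-E, B-T, O
```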

tmp_classifier.py:
A self-trained BaggingClassifier that uses an MLP as the base model to classify Chinese questions into 5 classes (defined in abcde_dict), with a multilingual encoder (LaBSE) to embed the text into a dense space. Bagging is used because the 5 classes are unbalanced (some resampling is applied as well).
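A minimal sketch of this setup, assuming scikit-learn and sentence-transformers are installed; the toy questions and labels are placeholders (the real task uses the 5 classes from abcde_dict):

```python
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import BaggingClassifier
from sklearn.neural_network import MLPClassifier

encoder = SentenceTransformer("sentence-transformers/LaBSE")

questions = ["北京的人口是多少", "上海有多少人", "法国的首都是哪里",
             "姚明的妻子是谁", "珠峰有多高", "长江的源头在哪里"]
labels = [0, 0, 1, 1, 0, 1]           # toy template-class ids
X = encoder.encode(questions)          # dense multilingual sentence vectors

# Bagging over MLPs to soften class imbalance; newer scikit-learn names the
# argument `estimator`, older releases call it `base_estimator`.
clf = BaggingClassifier(estimator=MLPClassifier(max_iter=500),
                        n_estimators=5, max_samples=0.8, bootstrap=False)
clf.fit(X, labels)
print(clf.predict(encoder.encode(["东京有多少人口"])))
```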

ranker.py:
A self-trained BaggingClassifier that uses an MLP as the base model to classify Chinese questions into 2 classes. The task is similar to the CrossEncoder in sentence-transformers: build pair inputs (chinese_question, property_representation) and train a 0-1 classifier to find the highest-scoring pair, i.e. the property_representation that most reasonably represents the question. This can point to the evidence for an answer that satisfies the asking intent.
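A hypothetical usage sketch of such a pair ranker; `encoder` and `pair_clf` stand in for the trained LaBSE encoder and bagging classifier, and concatenating the two encodings is just one common featurization, not necessarily the repository's:

```python
import numpy as np

def rank_properties(question, property_reps, encoder, pair_clf):
    # Build one feature vector per (question, property) pair by
    # concatenating the two sentence encodings.
    q_vec = encoder.encode([question])[0]
    p_vecs = encoder.encode(property_reps)
    feats = np.hstack([np.tile(q_vec, (len(p_vecs), 1)), p_vecs])
    scores = pair_clf.predict_proba(feats)[:, 1]   # P(pair is reasonable)
    order = np.argsort(-scores)
    return [(property_reps[i], float(scores[i])) for i in order]
```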

kbqa_step.py:
Main script that performs the KBQA task.

Function definitions:

search_entity_rep_by_lang_filter_in_db:
finds the language representations of a wikidataId by setting the language flag (en and zh are supported) in a pre-built sqlite database; this DB can be thought of as a translation dictionary for entities across English and Chinese.
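A minimal sketch of such a lookup; the table and column names (entity_rep, wikidata_id, lang, rep) are assumptions about the pre-built database, not its actual schema:

```python
import sqlite3

def search_entity_rep(db_path, wikidata_id, lang="zh"):
    # Look up all surface forms of an entity in the requested language.
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT rep FROM entity_rep WHERE wikidata_id = ? AND lang = ?",
            (wikidata_id, lang),
        ).fetchall()
        return [r[0] for r in rows]
    finally:
        con.close()

# e.g. search_entity_rep("entities.db", "Q148", lang="zh") -> ["中国", ...]
```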

Zh_Rel_Ranker:
definition of the ranker object described above (see ranker.py).

query_parser_bu, find_top_rels_bu:
the main part of the query process, taken from DeepPavlov.

t3_statement_df:
performs a SPARQL-style query on the wikidata hdt file and represents the result as an [n, 3]-shaped pandas DataFrame (with columns named s, p, o, where s p o is the basic triple of a Knowledge Base).
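A sketch of this pattern, assuming the pyHDT package (import name `hdt`); the file path and query subject are placeholders:

```python
import pandas as pd
from hdt import HDTDocument

doc = HDTDocument("path/to/wikidata.hdt")
# Empty strings act as wildcards in pyHDT's triple-pattern search.
triples, cardinality = doc.search_triples(
    "http://www.wikidata.org/entity/Q148", "", ""
)
df = pd.DataFrame(list(triples), columns=["s", "p", "o"])  # [n, 3] frame
```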

fix_o:
a toolkit function that fixes some problems with the o field of the stream produced by the hdt query iterator when collecting that stream into a local N-Triples file.

py_dumpNtriple:
transforms one row of s p o produced by the hdt query iterator into the N-Triples file format.
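A simplified sketch of what this serialization step looks like; the escaping below covers only the basic cases, and the real fix_o / py_dumpNtriple may handle more:

```python
def dump_ntriple_row(s, p, o):
    # Format one (s, p, o) row as a single N-Triples line. Subjects and
    # predicates are treated as IRIs; the object may be a plain literal.
    def fmt(term, allow_literal=False):
        if term.startswith("http://") or term.startswith("https://"):
            return f"<{term}>"
        if allow_literal:
            if term.startswith('"'):            # already a quoted literal
                return term
            escaped = term.replace("\\", "\\\\").replace('"', '\\"')
            return f'"{escaped}"'
        return f"<{term}>"
    return f"{fmt(s)} {fmt(p)} {fmt(o, allow_literal=True)} ."

# dump_ntriple_row("http://example.org/Q1", "http://example.org/name", "宇宙")
# -> '<http://example.org/Q1> <http://example.org/name> "宇宙" .'
```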

one_part_g_producer:
initializes a knowledge graph object with the help of rdflib.
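A minimal rdflib sketch, with a placeholder file name:

```python
from rdflib import Graph

g = Graph()
g.parse("part_of_wikidata.nt", format="nt")  # load the dumped N-Triples file
print(len(g))                                # number of triples loaded
```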

drop_duplicates_by_col:
a toolkit function that drops duplicates from a pandas DataFrame by unifying the values of one column.

drop_duplicates_of_every_df:
a toolkit function that drops duplicates from a pandas DataFrame of any dtypes (useful when some cells in the DataFrame are not hashable, e.g. lists).
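A sketch of both helpers under common assumptions: the per-column case is pandas' built-in drop_duplicates, and the any-dtype case can be handled by deduplicating on a string rendering of each row:

```python
import pandas as pd

def drop_duplicates_any_dtype(df):
    # Render rows to strings so unhashable cells (e.g. lists) compare fine,
    # then keep the original rows.
    return df[~df.astype(str).duplicated()]

df = pd.DataFrame({"a": [[1, 2], [1, 2], [3]], "b": ["x", "x", "y"]})
print(drop_duplicates_any_dtype(df))       # keeps rows 0 and 2
print(df.drop_duplicates(subset=["b"]))    # dedup on one column only
```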

search_triples_with_parse:
performs a SPARQL-style query on the wikidata hdt file.

perm_top_sort:
finds the most similar text in a list of candidates to another text, using cosine distance between SentenceTransformer text encodings.
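A sketch with sentence-transformers; the model name is a common multilingual default, not necessarily the one used here:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
query = "出生日期"
candidates = ["生日", "逝世日期", "出生地"]

q_emb = model.encode(query, convert_to_tensor=True)
c_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(q_emb, c_embs)[0]     # cosine similarity per candidate
best = int(scores.argmax())
print(candidates[best], float(scores[best]))
```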

syn_sim_on_list:
finds the most similar text in a list of candidates to another text, using a distance defined by synonyms (only the Chinese parts of the text are kept).
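If the synonym distance comes from the open-source Chinese `synonyms` toolkit (an assumption on my part), the helper might look like this sketch:

```python
import re
import synonyms  # Chinese synonyms toolkit (pip install synonyms)

def syn_sim_on_list(query, candidates):
    # Keep only the Chinese characters, as the description says.
    zh = lambda s: "".join(re.findall(r"[\u4e00-\u9fff]+", s))
    scores = [synonyms.compare(zh(query), zh(c)) for c in candidates]
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best]
```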

t3_statement_ranking, choose_tmp_by_ranking:
use the ranker to produce a reasonable ranking of the many s p o collections against the Chinese question.

till_process_func:
some SPARQL parts carry decorations such as 'FILTER (?x = a ).', so s p o expands to s p o f. This function filters the result down to just the parts we care about.

fill_str, for_loop_detect:
decode the BIO-style output of the NER model into a dictionary with [E-TAG T-TAG O-TAG] as keys and lists of elements as values.
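A sketch of this kind of BIO decoding; tag names follow the E-TAG / T-TAG convention above, and the exact span handling in fill_str / for_loop_detect may differ:

```python
def decode_bio(tokens, tags):
    # Collect spans per tag type; "O" tokens are kept as single elements.
    out = {"E": [], "T": [], "O": []}
    span_type, span = None, ""
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if span_type:
                out[span_type].append(span)
            span_type, span = tag[2:], tok
        elif tag.startswith("I-") and span_type == tag[2:]:
            span += tok
        else:  # "O" or an inconsistent I- tag closes the current span
            if span_type:
                out[span_type].append(span)
            span_type, span = None, ""
            if tag == "O":
                out["O"].append(tok)
    if span_type:
        out[span_type].append(span)
    return out

# decode_bio(list("姚明的妻子"), ["B-E", "I-E", "O", "B-T", "I-T"])
# -> {"E": ["姚明"], "T": ["妻子"], "O": ["的"]}
```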

ner_entity_type_predict:
uses adapter-transformers to extract the entities and properties of a Chinese question.

keyword_rule_filter:
a rule-based fix on the output of tmp_classifier. As defined, every question containing "多大" ("how old / how big") as a sub-span drops the "COUNT"-style SPARQL template.
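The rule itself is simple enough to sketch directly; the template names here are placeholders:

```python
def keyword_rule_filter(question, candidate_templates):
    # Questions containing "多大" drop the COUNT-style template.
    if "多大" in question:
        return [t for t in candidate_templates if t != "COUNT"]
    return candidate_templates

# keyword_rule_filter("姚明多大了", ["COUNT", "SELECT"]) -> ["SELECT"]
```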

tmp_type_predict:
uses tmp_classifier to classify a Chinese question into the 5 templates defined in abcde_dict.

property_df_rep_disambiguation:
disambiguates the different properties found in the question.

do_search:
The main function: it takes a Chinese question as input and outputs the query result from the wikidata hdt Knowledge Base.
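A hypothetical end-to-end call, assuming do_search is importable from kbqa_step at module level; the question is illustrative only:

```python
from kbqa_step import do_search

# The return value is the query result drawn from the wikidata hdt
# Knowledge Base.
result = do_search("姚明的妻子是谁")
print(result)
```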