Commit a08cb952 authored by Adrien Klose's avatar Adrien Klose
Overview of Bioasq relevant tasks and data

parent 69ed2606
Create an overview of the tasks and involved data.
- there have been multiple tasks since 2013 up until now; the relevant ones are:
- BioASQ Task 1b: Introductory Biomedical Semantic QA
- BioASQ Task 2b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 3b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 4b: Biomedical Semantic QA (Involves IR, QA, Summarization) -> explicitly includes old questions
- BioASQ Task 5b: Biomedical Semantic QA (Involves IR, QA, Summarization) -> also includes old questions
- BioASQ Task 6b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 7b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 8b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task Synergy: Biomedical Semantic QA for COVID-19 -> supposedly does not include RDF triples
- BioASQ Task 9b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 10b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task Synergy: Biomedical Semantic QA for developing issues -> supposedly does not include RDF triples
- BioASQ Task 11b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 12b: Biomedical Semantic QA (Involves IR, QA, Summarization) -> currently running; more than 5000 training questions with golden answers, plus 500 new ones released in batches later
- from 4b onwards it is explicitly stated that old questions are included; this does not mean that the earlier tasks exclude old questions
- relevant/gold-standard RDF triples come from designated ontologies each year and are linked to the relevant question (but do the correct answers exist too? does a question state what kind of answer it expects?)
- questions in the dataset fall into one of 4 categories:
1. Yes/No questions -> easy to test/evaluate for now
- "Do CpG islands colocalise with transcription start sites?"
2. Factoid questions - require an entity, a number or a similar short expression as an answer
- "Which virus is best known as the cause of infectious mononucleosis?"
3. List questions - same as factoid but require a list of answers
- "Which are the Raf kinase inhibitors?"
4. Summary questions - require a short text summarizing the prominent relevant information
- "What is the treatment of infectious mononucleosis?"
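The four categories can be sketched as a small lookup, grounded in the descriptions above. Only the "yesno" type value appears literally in these notes; the spellings "factoid", "list" and "summary" are assumptions about the JSON.

```python
# Rough answer shape per question category, following the descriptions
# above. "yesno" is the spelling used in the dataset notes; the other
# three type strings are assumed spellings.
ANSWER_SHAPE = {
    "yesno":   "a single yes/no decision (easy to evaluate as accuracy)",
    "factoid": "an entity, a number or a similar short expression",
    "list":    "a list of such short answers",
    "summary": "a short text summarizing the relevant information",
}

def expected_shape(question: dict) -> str:
    # `question` is one record from the training JSON,
    # e.g. {"type": "yesno", "body": "Do CpG islands colocalise ...?"}
    return ANSWER_SHAPE[question["type"]]
```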
- http://participants-area.bioasq.org/Tasks/12b/trainingDataset/ includes 5046 questions
-> only registered users can download the dataset
-> from the end of March until the middle of May, new data will be released
- the dataset is provided in JSON format; the interesting fields per question for us are type, body, exact_answer and sometimes triples
- there exist 322 questions with the triples field, and 107 of those are yesno questions
- for those 107 questions, every triple has the fields o, p and s; in total there are 1916 triples
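The filtering behind these counts can be sketched as follows; the top-level "questions" key and the exact record layout are assumptions (the real file requires a registered account, so the sample below is illustrative):

```python
# Sketch of the filtering used for the counts above. `data` mimics the
# assumed layout of the training JSON; the triple's URIs are
# illustrative, not taken from the gold data.
data = {"questions": [
    {"type": "yesno", "body": "...", "exact_answer": "yes",
     "triples": [{"s": "http://linkedlifedata.com/resource/umls/id/C0008810",
                  "p": "http://example.org/property/prefLabel",
                  "o": "CpG island"}]},
    {"type": "summary", "body": "..."},  # no triples field
]}

with_triples = [q for q in data["questions"] if "triples" in q]
yesno_with_triples = [q for q in with_triples if q["type"] == "yesno"]
n_triples = sum(len(q["triples"]) for q in yesno_with_triples)
# On the real 12b training set these counts are 322, 107 and 1916.
```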
How are the triples to be interpreted, since some directly include the medical entities while others include links?
- the s field is always a link
- some of the links lead to working websites and some don't
- which links work seems to depend on which knowledge source is used
- links that include umls seem to work
- links like http://linkedlifedata.com/resource/#_503638303730008 don't seem to work
- the o field is sometimes a link, sometimes a resolved object and sometimes a code that probably needs to be resolved
- whether the code can be resolved depends on whether the source vocabulary can be inferred from s and p
- the problems with links from s persist
- differentiating code from a resolved object could be hard
- example for a code is "Trial NCT00000598"
- the p field is always a link
- the p field can be interpreted by taking the last segment of the link
- there are 5 institutions/base websites from which relations are used
- some of them include a more verbose description of the relation
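These interpretation rules can be sketched as two small helpers. Both are heuristics only: the example URIs are assumed forms, not gold triples, and as noted above, telling a code apart from a resolved object is the hard part.

```python
from urllib.parse import urlparse

def relation_name(p_uri: str) -> str:
    """Interpret a p field by taking the last segment of the link
    (after the final '#' or '/')."""
    parsed = urlparse(p_uri)
    return parsed.fragment or parsed.path.rsplit("/", 1)[-1]

def o_kind(o: str) -> str:
    """Rough first guess at the three observed shapes of the o field.
    Resolved objects may also contain digits, so the code/object split
    is only a heuristic."""
    if o.startswith("http://") or o.startswith("https://"):
        return "link"
    if any(ch.isdigit() for ch in o):  # e.g. "Trial NCT00000598"
        return "code"
    return "resolved object"
```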
Do the triples make sense?
- it heavily depends on the triple
- for now, triples based on umls and geneon (possibly more) seem to be the easiest to parse
Can we use the triples as graphs?
- while we could include the triples into/as graphs, for completeness' sake it would be better to create the graphs not from the triples but from the designated ontologies
- as the triples are now, parsing the source vocabularies first and then searching in them with the given codes seems more promising than evaluating the websites given by the links
- for now the best use case seems to be to go through the questions/triples by hand to find question+triple+answer combinations that are useful for a small experiment on whether triples improve the answer (accuracy in the case of yesno questions)
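For that hand-picked yes/no experiment, a minimal sketch of the two pieces needed is below: verbalizing a gold triple as context text and scoring accuracy. The helper names are hypothetical and the actual model call is left out.

```python
def triple_to_text(t: dict) -> str:
    """Verbalize one triple as context text, using the last URI segment
    for s/p/o when they are links (same heuristic as for the p field)."""
    def tail(v: str) -> str:
        return v.rsplit("/", 1)[-1].rsplit("#", 1)[-1]
    return f"{tail(t['s'])} {tail(t['p'])} {tail(t['o'])}"

def accuracy(preds: list, golds: list) -> float:
    """Yes/no evaluation: plain accuracy over 'yes'/'no' strings."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Each hand-picked question would then be answered once with and once
# without the verbalized triples as context, comparing the two accuracies.
```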
- Wikidata Diseases
- ?
- BioASQ
- ?
- Organizes challenges on biomedical semantic indexing and question answering
- There are 4 question types in the training data for QA
- Some of the training data includes RDF triples from specified ontologies
- Examples and more information in bioasq.txt
- Harry's database umls.db
- Database that is most likely an unidentified subset of UMLS