Commit a08cb952 authored by Adrien Klose's avatar Adrien Klose
Overview of Bioasq relevant tasks and data

parent 69ed2606
Create an overview of the tasks and involved data.
- there have been multiple tasks since 2013 up until now; the relevant ones are:
- BioASQ Task 1b: Introductory Biomedical Semantic QA
- BioASQ Task 2b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 3b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 4b: Biomedical Semantic QA (Involves IR, QA, Summarization) -> explicitly includes old questions
- BioASQ Task 5b: Biomedical Semantic QA (Involves IR, QA, Summarization) -> also includes old questions
- BioASQ Task 6b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 7b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 8b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task Synergy: Biomedical Semantic QA for COVID-19 -> supposedly does not include RDF triples
- BioASQ Task 9b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 10b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task Synergy: Biomedical Semantic QA for developing issues -> supposedly does not include RDF triples
- BioASQ Task 11b: Biomedical Semantic QA (Involves IR, QA, Summarization)
- BioASQ Task 12b: Biomedical Semantic QA (Involves IR, QA, Summarization) -> currently running; more than 5000 training questions with golden answers, plus 500 new ones released in batches later
- from 4b onwards it is explicitly stated that old questions are included; this does not mean that the earlier tasks exclude old questions
- relevant/gold-standard RDF triples come from designated ontologies each year and are linked to the relevant question (but do the correct answers exist too? does a question state what kind of answer it expects?)
- questions in the dataset fall into one of 4 categories:
1. Yes/No questions -> easy to test/evaluate for now
- "Do CpG islands colocalise with transcription start sites?"
2. Factoid questions - require an entity, a number or a similar short expression as an answer
- "Which virus is best known as the cause of infectious mononucleosis?"
3. List questions - same as factoid but require a list of answers
- "Which are the Raf kinase inhibitors?"
4. Summary questions - require a short text summarizing the prominent relevant information
- "What is the treatment of infectious mononucleosis?"
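The four categories can be sketched as a small lookup, grounded in the descriptions above. Only the "yesno" type value appears literally in these notes; the spellings "factoid", "list" and "summary" are assumptions about the JSON.

```python
# Rough answer shape per question category, following the descriptions
# above. "yesno" is the spelling used in the dataset notes; the other
# three type strings are assumed spellings.
ANSWER_SHAPE = {
    "yesno":   "a single yes/no decision (easy to evaluate as accuracy)",
    "factoid": "an entity, a number or a similar short expression",
    "list":    "a list of such short answers",
    "summary": "a short text summarizing the relevant information",
}

def expected_shape(question: dict) -> str:
    # `question` is one record from the training JSON,
    # e.g. {"type": "yesno", "body": "Do CpG islands colocalise ...?"}
    return ANSWER_SHAPE[question["type"]]
```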
- http://participants-area.bioasq.org/Tasks/12b/trainingDataset/ includes 5046 questions
-> only registered users can download the dataset
-> from the end of March until the middle of May, new data will be released
- the dataset is provided in JSON format; the interesting fields per question for us are type, body, exact_answer and sometimes triples
- there exist 322 questions with the triples field, and 107 of those are yesno questions
- for those 107 questions, every triple has the fields o, p and s; in total there are 1916 triples
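The filtering behind these counts can be sketched as follows; the top-level "questions" key and the exact record layout are assumptions (the real file requires a registered account, so the sample below is illustrative):

```python
# Sketch of the filtering used for the counts above. `data` mimics the
# assumed layout of the training JSON; the triple's URIs are
# illustrative, not taken from the gold data.
data = {"questions": [
    {"type": "yesno", "body": "...", "exact_answer": "yes",
     "triples": [{"s": "http://linkedlifedata.com/resource/umls/id/C0008810",
                  "p": "http://example.org/property/prefLabel",
                  "o": "CpG island"}]},
    {"type": "summary", "body": "..."},  # no triples field
]}

with_triples = [q for q in data["questions"] if "triples" in q]
yesno_with_triples = [q for q in with_triples if q["type"] == "yesno"]
n_triples = sum(len(q["triples"]) for q in yesno_with_triples)
# On the real 12b training set these counts are 322, 107 and 1916.
```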
How are the triples to be interpreted, since some directly include the medical entities while others include links?
- the s field is always a link
- some of the links lead to working websites and some don't
- which links work seems to depend on which knowledge source is used
- links that include umls seem to work
- links like http://linkedlifedata.com/resource/#_503638303730008 don't seem to work
- the o field is sometimes a link, sometimes a resolved object and sometimes a code that probably needs to be resolved
- whether the code can be resolved depends on whether the source vocabulary can be inferred from s and p
- the problems with links from s persist
- differentiating code from a resolved object could be hard
- example for a code is "Trial NCT00000598"
- the p field is always a link
- the p field can be interpreted by taking the last segment of the link
- there are 5 institutions/base websites from which relations are used
- some of them include a more verbose description of the relation
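These interpretation rules can be sketched as two small helpers. Both are heuristics only: the example URIs are assumed forms, not gold triples, and as noted above, telling a code apart from a resolved object is the hard part.

```python
from urllib.parse import urlparse

def relation_name(p_uri: str) -> str:
    """Interpret a p field by taking the last segment of the link
    (after the final '#' or '/')."""
    parsed = urlparse(p_uri)
    return parsed.fragment or parsed.path.rsplit("/", 1)[-1]

def o_kind(o: str) -> str:
    """Rough first guess at the three observed shapes of the o field.
    Resolved objects may also contain digits, so the code/object split
    is only a heuristic."""
    if o.startswith("http://") or o.startswith("https://"):
        return "link"
    if any(ch.isdigit() for ch in o):  # e.g. "Trial NCT00000598"
        return "code"
    return "resolved object"
```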
Do the triples make sense?
- it heavily depends on the triple
- for now, triples based on umls and geneon (possibly more) seem to be the easiest to parse
Can we use the triples as graphs?
- while we could include the triples into/as graphs, for completeness' sake it would be better to create the graphs not from the triples but from the designated ontologies
- as the triples are now, parsing the source vocabularies first and then searching in them with the given codes seems more promising than evaluating the websites given by the links
- for now the best use case seems to be to go through the questions/triples by hand to find question+triple+answer combinations that are useful for a small experiment on whether triples improve the answer (accuracy in the case of yesno questions)
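For that hand-picked yes/no experiment, a minimal sketch of the two pieces needed is below: verbalizing a gold triple as context text and scoring accuracy. The helper names are hypothetical and the actual model call is left out.

```python
def triple_to_text(t: dict) -> str:
    """Verbalize one triple as context text, using the last URI segment
    for s/p/o when they are links (same heuristic as for the p field)."""
    def tail(v: str) -> str:
        return v.rsplit("/", 1)[-1].rsplit("#", 1)[-1]
    return f"{tail(t['s'])} {tail(t['p'])} {tail(t['o'])}"

def accuracy(preds: list, golds: list) -> float:
    """Yes/no evaluation: plain accuracy over 'yes'/'no' strings."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

# Each hand-picked question would then be answered once with and once
# without the verbalized triples as context, comparing the two accuracies.
```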
- Wikidata Diseases
- ?
- BioASQ
- ?
- Organizes challenges on biomedical semantic indexing and question answering
- There are 4 question types in the training data for QA
- Some of the training data includes RDF triples from specified ontologies
- Examples and more information in bioasq.txt
- Harry's database umls.db
- Database that is most likely an unidentified subset of UMLS