Implementing the Paper Review
Team Asuna proposes a 3-step approach consisting of preprocessing, search, and re-ranking. For each document (text passage) in the retrieval collection, the following components are computed: an extractive summary (LexRank), a spam score, and premises and claims (TARGER). The final ranking is created using Random Forests fed with the following features: aggregated BM25F score, number of times the document was retrieved, number of tokens in the document, number of sentences in the document, number of premises in the document, number of claims in the document, Waterloo spam score, predicted argument quality, and predicted stance. The classifier was trained on the Touché 2020 and 2021 relevance judgments. The argument quality is predicted using DistilBERT fine-tuned on Webis-ArgQuality-20; the stance is also predicted using DistilBERT, fine-tuned on the provided stance dataset.
The overall retrieval approach is sound, well designed, and grounded in previous work on argument retrieval, offering some novel ideas such as computing document summaries. However, the presentation of the approach should be improved. In particular, I would ask the authors to NOT start every sentence from a new line. Section 2 is too short to make a whole section; I suggest moving the text from Section 2 into the Introduction. It is not clear why the approach is presented as consisting of two pipelines (pre-processing and search): what about the final re-ranking step with Random Forests? I suggest describing the approach as a 3-step one.
General tips:
- Format sentences to not start from a new line every time
- Move Section 2 into the Introduction
- Format the approach as a 3-step one (preprocessing, search, re-ranking) instead of a 2-step one
- Push the code to GitHub
- Put a link to the GitHub repo in the paper
- Refactor the code in the repo (formatting)
Questions:
- For the initial retrieval, are the original topic titles used, or are they lemmatized and stopwords removed?
  The original topic titles are used. Example: {query: 'What is better: A pc or a laptop?', lda_topics: [('laptop', 0.049109627), ('desktop', 0.030363074), ('computer', 0.016792683), ('online', 0.016539471), ('better', 0.011027149)]} -> we use the words 'laptop', 'desktop', 'computer', 'online', 'better' to build a new query. For each query we build 3 new queries using the LDA approach, each containing 5 words.
- Clarify which TARGER model was used.
  We used the combined dataset with fastText embeddings.
- What parameters are used for LexRank? One can exclude stopwords; what size of summaries is used (in sentences); is any threshold used?
  We used the LexRank algorithm provided by the sumy library with a tokenizer for the English language, a PlaintextParser (which transforms the document content into the document representation of the sumy library), and a sentence count of 1 (to get a single-sentence summary of the document contents).
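  A minimal sketch of this sumy LexRank setup (the variable document_text is an illustrative placeholder):

  ```python
  from sumy.parsers.plaintext import PlaintextParser
  from sumy.nlp.tokenizers import Tokenizer
  from sumy.summarizers.lex_rank import LexRankSummarizer

  document_text = "A PC is usually cheaper to upgrade. A laptop, however, is portable."

  parser = PlaintextParser.from_string(document_text, Tokenizer("english"))
  summarizer = LexRankSummarizer()

  # Request one sentence, i.e. a single-sentence extractive summary.
  summary = " ".join(str(sentence) for sentence in summarizer(parser.document, 1))
  ```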
- Is any weighting for the fields in BM25F used? What are the weights?
  Yes, we used {SUMMARY_WEIGHT: '1.0', PREMISES_WEIGHT: '1.0', CLAIMS_WEIGHT: '1.0', CONTENTS_WEIGHT: '2.0'}.
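  A hedged sketch of how such weights can be passed to Pyserini's SimpleSearcher; the index path and the exact field names are assumptions based on the fields described in the paper:

  ```python
  from pyserini.search import SimpleSearcher

  searcher = SimpleSearcher('indexes/touche-passages')  # hypothetical index path

  field_weights = {
      'summary': 1.0,    # extractive summary (LexRank)
      'premises': 1.0,   # TARGER premises
      'claims': 1.0,     # TARGER claims
      'contents': 2.0,   # cleaned document contents
  }

  # Pyserini combines the per-field BM25 scores according to these weights (BM25F-style).
  hits = searcher.search('pc vs laptop', k=30, fields=field_weights)
  ```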
- Did you use abstractive summarization or not (it is mentioned in Sec. 3.4)? -> Get rid of references to abstractive summaries.
  We initially intended to use abstractive summaries, but due to issues in calculating them for the whole corpus, we discarded the idea. We'll make sure to get rid of any references to abstractive summaries in the camera-ready version of the paper.
- How were the documents cleansed (in Sec. 3.4 it is mentioned that documents are cleaned)?
  The document cleaning process involves the following steps: punctuation removal, tokenizing, lemmatizing, and stop word removal.
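  A minimal sketch of such a cleaning step; the paper does not name the library, so using NLTK here is an assumption:

  ```python
  import string

  from nltk.corpus import stopwords
  from nltk.stem import WordNetLemmatizer
  from nltk.tokenize import word_tokenize

  def clean(text: str) -> list[str]:
      """Remove punctuation, tokenize, lemmatize, and drop stop words."""
      text = text.translate(str.maketrans('', '', string.punctuation))
      lemmatizer = WordNetLemmatizer()
      stop_words = set(stopwords.words('english'))
      return [
          lemmatizer.lemmatize(token)
          for token in word_tokenize(text.lower())
          if token not in stop_words
      ]
  ```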
- LDA: how many topics are extracted, and what parameters are used (e.g., what solver, shrinkage, etc.)?
  We extract 3 topics, each containing 5 words. We use gensim.corpora.Dictionary to map the words in the document content to integer ids and gensim.corpora.Dictionary.doc2bow to transform them into a bag-of-words representation. We use the default parameters of gensim.models.ldamodel.LdaModel (distributed=False, chunksize=2000, passes=1, update_every=1, alpha='symmetric', eta=None, decay=0.5, offset=1.0, eval_every=10, iterations=50, gamma_threshold=0.001, minimum_probability=0.01, random_state=None, ns_conf=None, minimum_phi_value=0.01, per_word_topics=False, callbacks=None, dtype=<class 'numpy.float32'>).
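  A minimal sketch of this gensim setup; documents_tokens stands for the cleaned, tokenized texts and holds illustrative placeholder values:

  ```python
  from gensim.corpora import Dictionary
  from gensim.models import LdaModel

  documents_tokens = [
      ['laptop', 'portable', 'battery', 'screen'],
      ['desktop', 'pc', 'upgrade', 'cheaper'],
  ]  # illustrative tokens, not the real corpus

  dictionary = Dictionary(documents_tokens)                             # word -> integer id
  corpus = [dictionary.doc2bow(tokens) for tokens in documents_tokens]  # bag-of-words

  lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3)       # gensim defaults otherwise
  topics = lda.show_topics(num_topics=3, num_words=5, formatted=False)
  topic_words = [word for _, words in topics for word, _ in words]
  ```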
- How many queries are generated from the original one, and how are they combined: e.g., using a logical OR, AND, or somehow else?
  For each query we generate 4 additional queries based on the topics generated by the LDA model: three queries based on the initial query and one query based on the query description. The words in each query are combined via logical OR.
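  A small illustration of the OR combination; whether the operator is written out explicitly or handled implicitly by the searcher's query parser is an assumption here:

  ```python
  topic_words = ['laptop', 'desktop', 'computer', 'online', 'better']
  expanded_query = ' OR '.join(topic_words)
  # -> 'laptop OR desktop OR computer OR online OR better'
  ```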
- What is src.search.synonyms? Also, what words are replaced with synonyms, and how are redundant words defined?
  We intended to add 4 additional queries, each of which contains the words from the previous LDA-generated query, with its most frequent word (by frequency of word occurrence in LDA topics) replaced by a synonym. However, we realised that the code for this isn't actually used in our TIRA submission, so we'll get rid of any references to synonyms in our camera-ready paper.
- Not clear: "For each of those synonym replaced queries we search for the top 30 documents including their score. To keep the importance of a document occurring in different extended queries we add a third of the score from all of the same duplicate documents together. After all these scores are computed the documents are re-ranked by their new score."
  Synonyms will be discarded in the camera-ready version of the paper.
- Table 1: What is the baseline?
  We scaled the quality scores from the Webis-ArgQuality-20-full and Webis-ArgQuality-20-topics datasets to [-1, 1], then calculated the mean quality score. The baseline consists of predicting the mean quality score for each document. The values in the table are the mean squared error between the predicted quality score and the actual quality score.
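  A minimal sketch of this baseline evaluation (the score arrays are illustrative placeholders, not the real data):

  ```python
  import numpy as np

  # Quality scores already scaled to [-1, 1].
  train_quality = np.array([0.3, -0.5, 0.8, 0.1])
  test_quality = np.array([0.2, -0.4, 0.6])

  baseline_prediction = train_quality.mean()  # predict the mean quality for every document
  baseline_mse = np.mean((test_quality - baseline_prediction) ** 2)
  ```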
- Random Forests: what hyper-parameters are used? Was any hyper-parameter tuning done, and how?
  We used the parameters max_depth=5 and random_state=42. No hyper-parameter tuning was done.
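  A hedged sketch, assuming scikit-learn's RandomForestClassifier; the feature values and labels below are placeholders, and ranking by the predicted probability of the relevant class is an assumption, not something stated in the answer above:

  ```python
  import numpy as np
  from sklearn.ensemble import RandomForestClassifier

  # One row per document: aggregated BM25F score, num_found, token count, sentence count,
  # premise count, claim count, spam score, predicted quality, predicted stance (illustrative).
  X_train = np.array([[12.3, 2, 150, 9, 3, 4, 0.7, 0.4, 1],
                      [ 8.1, 1,  80, 5, 1, 2, 0.2, -0.1, 0]])
  y_train = np.array([1, 0])  # relevance labels from the Touché qrels (placeholders)

  clf = RandomForestClassifier(max_depth=5, random_state=42)
  clf.fit(X_train, y_train)

  # Rank documents by the predicted probability of the "relevant" class.
  relevance_scores = clf.predict_proba(X_train)[:, 1]
  ```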
- It is not clear how the difference in document length was taken into account when training a Random Forests classifier and applying it to the 2022 collection. E.g., one of the features is the number of tokens, but the documents in 2020 and 2021 were long web documents, whereas in 2022 these are short passages.
  Good point. We trained the random forest on the 2021 qrels. The problem of differing document lengths will be pointed out in the camera-ready version of the paper.
- Clarify the feature "number of times the document was retrieved": documents have unique ids; was it retrieved for different topics or for different query variations?
  After executing all query variations, we count how often each document occurred in the combined results - the num_found value. We do this to address the fact that documents which are retrieved multiple times are probably more relevant for answering the original query.
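  A minimal sketch of this counting step; results_per_query is a placeholder for the per-query-variation lists of retrieved document ids:

  ```python
  from collections import Counter

  results_per_query = [
      ['doc1', 'doc2', 'doc3'],   # original query
      ['doc2', 'doc4'],           # LDA query 1
      ['doc2', 'doc3'],           # LDA query 2
  ]  # illustrative ids, not real results

  num_found = Counter(docid for hits in results_per_query for docid in hits)
  # e.g. num_found['doc2'] == 3
  ```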
- How many documents are re-ranked? Top-k (k equals?), or all documents returned for the combined query?
  We re-rank all documents returned by (1) the original query and (2) the 4 queries generated from the LDA topics.
- Could you clarify how the aggregated BM25F score was computed; what scores were aggregated?
  For each document retrieved, we initially use its retrieval score (given by the SimpleSearcher from Pyserini). If the document occurs multiple times, we add, for each further occurrence, its current score (the Pyserini score of the document in that query) divided by 3 (the number of topics) to the overall score of the document. The division by the number of topics is used to account for the many duplicate documents induced by using a large number of topics (a large number of queries).
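  A minimal sketch of this aggregation; hits_per_query is a placeholder for the Pyserini result lists of the query variations, with illustrative (docid, score) values:

  ```python
  NUM_TOPICS = 3

  hits_per_query = [
      [('doc1', 12.3), ('doc2', 10.1)],   # results of the original query
      [('doc2', 9.4), ('doc3', 8.7)],     # results of an LDA-expanded query
  ]

  aggregated_score = {}
  for hits in hits_per_query:
      for docid, score in hits:
          if docid not in aggregated_score:
              aggregated_score[docid] = score                # first retrieval: keep the full score
          else:
              aggregated_score[docid] += score / NUM_TOPICS  # duplicates: add a third of the score

  reranked = sorted(aggregated_score, key=aggregated_score.get, reverse=True)
  ```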
Other tips:
- In Sec. 3.1 it is claimed "We do not consider main claims and main premises...", but could you explain why?
- There is another claim in the paper: "This enables us to query the index with BM25F and fields for extractive summaries, abstractive summaries, premises, claims and cleaned documents." But there is no additional information about the abstractive summaries; my question is: are they used or not? If not, this should be removed.
- What are cleaned documents, and how was the cleansing done? It is not described in the paper.
- The paper contains a results section, but a Conclusion is missing.
- I suggest using keywords that reflect the approach rather than the used packages, e.g., Comparative queries, Argument retrieval, Argument quality.
- Abstract: ... we describe the team Asuna's participation --> the team's Asuna participation
- Introduction: to build larger queries --> to construct expanded queries
- Our project consists of two pipelines --> Our proposed approach
- Sec. 3.4: Pyserini needs a citation
- Sec. 4: LDA needs a citation
- Sec. 4.5: DistilBERt --> DistilBERT
- Sec. 4.7: aggregated BM25f-score --> BM25F score
- Sec. 5: "the stance classification still behaves a bit strange" --> what does strange mean? It is a bit of a colloquial term.
- Some citations are missing, e.g., Touché 2020, 2021, etc.