Efficient Data Preprocessing for Extractive Question Answering Models
DOI: https://doi.org/10.5269/bspm.78619
Abstract: This study presents a systematic approach to building a domain-specific question-answering (QA) dataset from Indian Lok Sabha parliamentary proceedings, with a primary focus on meticulous data preprocessing. Parliamentary transcripts are often lengthy, noisy, and unstructured, posing significant challenges for downstream natural language processing (NLP) tasks. To address this, we designed a comprehensive preprocessing pipeline involving cleaning, segmentation, annotation, normalization, and tokenization to convert raw transcripts into structured, high-quality QA-ready data. Each step was tailored to the linguistic and structural characteristics of parliamentary text. Experimental evaluation through an ablation study demonstrated that our preprocessing pipeline led to a significant performance improvement of 9.4% in Exact Match (EM) and 8.5% in F1 score when used to train a BERT-based QA model. Additionally, we conducted bias analysis and compared our dataset’s performance with standard benchmarks to validate its quality and relevance. This work underscores that robust preprocessing is foundational to creating reliable, domain-adapted QA datasets for extractive models.
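For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for extractive QA. It assumes the SQuAD-style convention (lowercasing, stripping punctuation and English articles before comparison); the abstract does not state which normalization the authors used, so the function names and normalization rules here are illustrative assumptions, not the paper's implementation.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper may normalize differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Two empty answers count as a match; one empty answer scores zero.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this convention, a predicted span that is a substring of the reference answer scores 0 on Exact Match but receives partial credit on F1, which is why the two numbers are usually reported together.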
License
When the manuscript is accepted for publication, the authors automatically agree to transfer the copyright to the SPM.
The journal uses the Creative Commons Attribution license (CC BY 4.0).



