Efficient Data Preprocessing for Extractive Question Answering Models
DOI: https://doi.org/10.5269/bspm.78619
Abstract: This study presents a systematic approach to building a domain-specific question-answering (QA) dataset from Indian Lok Sabha parliamentary proceedings, with a primary focus on meticulous data preprocessing. Parliamentary transcripts are often lengthy, noisy, and unstructured, posing significant challenges for downstream natural language processing (NLP) tasks. To address this, we designed a comprehensive preprocessing pipeline involving cleaning, segmentation, annotation, normalization, and tokenization to convert raw transcripts into structured, high-quality QA-ready data. Each step was tailored to the linguistic and structural characteristics of parliamentary text. Experimental evaluation through an ablation study demonstrated that our preprocessing pipeline led to a significant performance improvement of 9.4% in Exact Match (EM) and 8.5% in F1 score when used to train a BERT-based QA model. Additionally, we conducted bias analysis and compared our dataset’s performance with standard benchmarks to validate its quality and relevance. This work underscores that robust preprocessing is foundational to creating reliable, domain-adapted QA datasets for extractive models.
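The abstract reports gains in Exact Match (EM) and F1 but does not say how the two metrics were computed. The sketch below assumes the common SQuAD-style convention (lowercase, strip punctuation and English articles, then compare tokens); the function names and the sample strings are hypothetical and are not taken from the paper.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer span."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical answer strings, for illustration only.
    print(exact_match("The Lok Sabha.", "lok sabha"))
    print(f1_score("Ministry of Finance", "Finance Ministry"))
```

Under this convention, EM rewards only a fully matching span, while F1 gives partial credit for overlapping tokens, which is why the two figures are usually reported together.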
License
When the manuscript is accepted for publication, the authors automatically agree to transfer the copyright to the (SPM).
The journal uses the Creative Commons Attribution license (CC BY 4.0).



