Efficient Data Preprocessing for Extractive Question Answering Models
DOI:
https://doi.org/10.5269/bspm.78619
Abstract
This study presents a systematic approach to building a domain-specific question-answering (QA)
dataset from Indian Lok Sabha parliamentary proceedings, with a primary focus on meticulous data
preprocessing. Parliamentary transcripts are often lengthy, noisy, and unstructured, posing
significant challenges for downstream natural language processing (NLP) tasks. To address this, we
designed a comprehensive preprocessing pipeline involving cleaning, segmentation, annotation,
normalization, and tokenization to convert raw transcripts into structured, high-quality QA-ready
data. Each step was tailored to the linguistic and structural characteristics of parliamentary
text. Experimental evaluation through an ablation study demonstrated that our preprocessing
pipeline led to a significant performance improvement of 9.4% in Exact Match (EM) and 8.5% in F1
score when used to train a BERT-based QA model. Additionally, we conducted bias analysis and
compared our dataset’s performance with standard benchmarks to validate its quality and relevance.
This work underscores that robust preprocessing is foundational to creating reliable,
domain-adapted QA datasets for extractive models.
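
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Exact Match and token-level F1 are commonly computed for extractive QA, following the SQuAD-style convention of lowercasing and stripping punctuation and English articles before comparison. The abstract does not say which evaluation script the authors used, so the normalization rules and the function names `normalize_answer`, `exact_match`, and `f1_score` are assumptions made for illustration.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: SQuAD-style normalization; the paper's exact rules are not stated.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Dataset-level EM and F1 are then typically the averages of these per-question scores, taking the maximum over reference answers when a question has more than one.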
License
When a manuscript is accepted for publication, the authors automatically agree to transfer the copyright to the SPM.
The journal uses the Creative Commons Attribution license (CC BY 4.0).



