Efficient Data Preprocessing for Extractive Question Answering Models

Authors

  • Sivakumar S
  • Meenakshi S P

DOI:

https://doi.org/10.5269/bspm.78619

Abstract

This study presents a systematic approach to building a domain-specific question-answering (QA)
dataset from Indian Lok Sabha parliamentary proceedings, with a primary focus on meticulous data
preprocessing. Parliamentary transcripts are often lengthy, noisy, and unstructured, posing
significant challenges for downstream natural language processing (NLP) tasks. To address this, we
designed a comprehensive preprocessing pipeline involving cleaning, segmentation, annotation,
normalization, and tokenization to convert raw transcripts into structured, high-quality QA-ready
data. Each step was tailored to the linguistic and structural characteristics of parliamentary
text. Experimental evaluation through an ablation study demonstrated that our preprocessing
pipeline led to a significant performance improvement of 9.4% in Exact Match (EM) and 8.5% in F1
score when used to train a BERT-based QA model. Additionally, we conducted bias analysis and
compared our dataset's performance with standard benchmarks to validate its quality and relevance.
This work underscores that robust preprocessing is foundational to creating reliable,
domain-adapted QA datasets for extractive models.
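
The abstract reports gains in Exact Match (EM) and F1 but does not say how the two metrics were computed. The sketch below shows one common way to score a single predicted answer span against a reference, assuming SQuAD-style answer normalization (lowercasing, removal of punctuation and English articles, whitespace collapsing). The function names and the normalization rules are illustrative assumptions, not the authors' evaluation code.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumed SQuAD-style normalization; the paper may use different rules.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

EM rewards only a full match after normalization, while F1 gives partial credit for overlapping tokens, which is why the two scores are usually reported together for extractive QA. Dataset-level scores are typically the mean of these per-example values.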

Author Biography

  • Meenakshi S P

    School of Computer Science and Engineering,

    Assistant Professor Sr

Published

2025-11-01

Section

Conf. Issue: Applied Mathematics and Computing (ICAMC-25)