
Blind Source Separation Techniques for Detecting Hidden Texts and Textures in Document Images

Anna Tonazzini, Emanuele Salerno, Matteo Mochi, and Luigi Bedini

Istituto di Scienza e Tecnologie dell'Informazione - CNR

Via G. Moruzzi, 1, I-56124 PISA, Italy

anna.tonazzini@isti.cnr.it

Abstract. Blind Source Separation techniques, based both on Independent Component Analysis and on second order statistics, are presented and compared for extracting partially hidden texts and textures in document images. Barely perceivable features may occur, for instance, in ancient documents previously erased and then re-written (palimpsests), or for transparency or seeping of ink from the reverse side, or from watermarks in the paper. Detecting these features can be of great importance to scholars and historians. In our approach, the document is modeled as the superposition of a number of source patterns, and a simplified linear mixture model is introduced for describing the relationship between these sources and multispectral views of the document itself. The problem of detecting the patterns that are barely perceivable in the visible color image is thus formulated as the one of separating the various patterns in the mixtures. Some examples from an extensive experimentation with real ancient documents are shown and commented.

1 Introduction

Revealing the whole contents of ancient documents is an important aid to scholars who are interested in dating the documents, establishing their origin, or reading older and historically relevant writings they may contain. However, interesting document features are often hidden or barely detectable in the original color document. Multispectral acquisitions in the non-visible range, such as the ultraviolet or the near infrared, are a valid help in this respect. For instance, one method to reveal paper watermarks is to record an infrared image of the paper using transmitted illumination. Nevertheless, the watermark detected with this method is usually very faint and overlaps the contents of the paper surface. To make the watermark pattern, or whatever feature of interest, more readable and free from interference due to overlapped patterns, an intuitive strategy is to process multiple "views" of the document, for instance by arithmetic operations. In the case where a color scan is available, three different views can be obtained from the red, green, and blue image channels. When available, scans at non-visible wavelengths can be used alone or in conjunction with the visible ones. By processing the different color components, it is possible to extract some of the overlapped patterns and, sometimes, even to achieve a complete separation of all of them. Indeed, since all these color components contain the patterns in different "percentages", simple difference operations between the colors, after suitable regulation of the levels, can "cancel" one pattern and enhance the other. For the case of watermarks, another infrared image taken using only the reflected illumination can be used for this purpose [1]. On the other hand, some authors claim that subtracting the green channel from the red one can reveal hidden characters in charred documents [9]. These are, however, empirical, document-dependent strategies. We are looking, instead, for automatic, mathematically based techniques that are able to enhance or even extract the hidden features of interest from documents of any kind, without the need for adaptations to the specific problem at hand.

This work has been supported by the European Commission project "Isyreadet", under contract IST-1999-57462.

A. Campilho, M. Kamel (Eds.): ICIAR 2004, LNCS 3212, pp. 241–248, 2004. © Springer-Verlag Berlin Heidelberg 2004
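The empirical channel arithmetic described in the introduction can be illustrated with a minimal sketch. It assumes Python with NumPy; the function name `weighted_channel_difference`, the weight, and the toy 2 × 2 patterns are hypothetical and only show how a level-adjusted difference can cancel one pattern while keeping another.

```python
import numpy as np

def weighted_channel_difference(primary, secondary, weight):
    """Return primary - weight * secondary, rescaled to the range [0, 1]."""
    diff = primary.astype(float) - weight * secondary.astype(float)
    span = diff.max() - diff.min()
    if span == 0:
        return np.zeros_like(diff)
    return (diff - diff.min()) / span

# Toy data (hypothetical): two patterns present in both channels
# with different "percentages".
text = np.array([[1.0, 0.0], [0.0, 0.0]])    # main text pattern
hidden = np.array([[0.0, 0.0], [0.0, 1.0]])  # barely visible pattern
red = 0.9 * text + 0.4 * hidden
green = 0.9 * text + 0.1 * hidden

# With weight 1.0 the main text cancels and only the hidden pattern remains.
enhanced = weighted_channel_difference(red, green, weight=1.0)
```

In a real scan the weight would have to be tuned by hand for each document, which is the document-dependent step that blind techniques try to avoid.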

Our approach to this problem is to model all the document views as linear combinations of a number of independent patterns. The solution then consists in trying to invert this transformation. The overlapped patterns are usually the main foreground text; the background pattern, i.e. an image of the paper (or parchment, or whatever) support, which can contain different interfering features, such as stains, watermarks, etc.; and one or more extra texts or drawings, due to previously written and then erased texts (palimpsests), seeping of ink from the reverse side (bleed-through), transparency from other pages (show-through), and other phenomena. Although our linear image model roughly simplifies the physical nature of overlapped patterns in documents [8], it has already proved to give interesting results. Indeed, this model has been proposed in [4] to extract the hidden texts from color images of palimpsests, assuming that the mixture coefficients are evaluated by visual inspection. Nevertheless, in general, the mixture coefficients are not known, and the separation problem becomes one of blind source separation (BSS). It has been shown that an effective solution to BSS can be found if the source patterns are mutually independent. The independence assumption gives rise to separation techniques based on independent component analysis, or ICA [6]. Although the linear data model is somewhat simplified, and the independence assumption is not always justified, we already proposed ICA techniques for document image processing [10], and obtained good results with real manuscript documents.

In this paper we compare the performance of ICA techniques with simpler methods that only try to decorrelate the observed data. As is known, this requirement is weaker than independence, and, in principle, no source separation can be obtained by only constraining second-order statistics, at least if no additional requirement is satisfied. However, our present aim is the enhancement of the overlapped patterns, especially of those that are hidden or barely detectable, and we experimentally found that this can be achieved in most cases even by simple decorrelation. On the other hand, while the color components of an image are usually spatially correlated, the individual classes or patterns that compose the image are at least less correlated. Thus, decorrelating the color components gives a different representation where the now uncorrelated components of the image could coincide with the single classes. Furthermore, the second-order approach is always less expensive than ICA algorithms, and, due to the poor modeling or to the lack of independence of the patterns, the results from decorrelation can also be better than the ones from ICA.
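The remark that color channels are strongly correlated while the underlying patterns are much less so can be checked numerically on synthetic data. The following sketch is an illustration under stated assumptions, not the experimental setup of the paper: the Laplacian toy sources, the mixing matrix `A`, and the helper `max_abs_offdiag_corr` are all hypothetical.

```python
import numpy as np

def max_abs_offdiag_corr(data):
    """Largest absolute correlation between two different rows of data."""
    corr = np.corrcoef(data)
    off_diagonal = corr - np.diag(np.diag(corr))
    return np.abs(off_diagonal).max()

rng = np.random.default_rng(0)
T = 10_000
sources = rng.laplace(size=(3, T))   # three independent toy patterns
A = np.array([[1.0, 0.6, 0.3],
              [0.5, 1.0, 0.4],
              [0.2, 0.7, 1.0]])      # hypothetical mixing matrix
channels = A @ sources               # toy "color channels"

source_corr = max_abs_offdiag_corr(sources)    # close to zero
channel_corr = max_abs_offdiag_corr(channels)  # clearly nonzero
```

Any transformation that removes the correlation among the channels therefore moves the data toward a representation that at least shares this second-order property with the sources.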

2 Formulation of the Problem

Let us assume that each pixel (of index t in a total of T) of a multispectral scan of a document has a vector value x(t) of N components. Similarly, let us assume that there are M superimposed sources represented, at each pixel t, by the vector s(t). Since we consider images of documents containing homogeneous texts or drawings, we can also reasonably assume that the color of each source is almost uniform, i.e., we will have mean reflectance indices $A_{ij}$ for the j-th source at the i-th wavelength. Thus, we will have a collection of T samples from a random N-vector x, which is generated by linearly and instantaneously mixing the components of a random M-vector s through an N × M mixing matrix A:

$$x(t) = A\,s(t), \qquad t = 1, 2, \ldots, T \tag{1}$$

where the source functions $s_i(t)$, i = 1, 2, ..., M, denote the "quantity" of the M patterns that concur to form the color at point t. Estimating s(t) and A from knowledge of x(t) is called a problem of blind source separation (BSS). In this application, we assume that noise and blur can be neglected. When only the visible color scan is available, vector x(t) has dimension N = 3 (it is composed of the red, green, and blue channels). At the same time, most documents can be seen as the superposition of only three (M = 3) different sources, or classes, that we will call "background", "main text" and "interfering texture". In general, by using multispectral/hyperspectral sensors, the "color" vector can assume a dimension greater than 3. Likewise, we can also have M > 3 if additional patterns are present in the original document. In this paper, we only consider the case M = N, that is, the same number of sources as of observations, although in principle there is no difference with the general case.
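As a concrete reading of model (1), the sketch below builds a toy three-channel observation from three source images by applying the same mixing matrix at every pixel. It is only an illustration: the function name `mix_sources`, the array layout (one pixel per column), the 2 × 2 source images, and the entries of `A` are assumptions, not measured reflectance values.

```python
import numpy as np

def mix_sources(source_images, A):
    """Apply x(t) = A s(t) at every pixel t.

    source_images: array of shape (M, H, W), one image per source.
    A: mixing matrix of shape (N, M).
    Returns the N observed channels as an array of shape (N, H, W).
    """
    M, H, W = source_images.shape
    S = source_images.reshape(M, H * W)  # column t holds s(t)
    X = A @ S                            # column t holds x(t)
    return X.reshape(A.shape[0], H, W)

# Hypothetical toy sources: background, main text, interfering texture.
background = np.ones((2, 2))
main_text = np.array([[1.0, 0.0], [0.0, 0.0]])
interfering = np.array([[0.0, 0.0], [0.0, 1.0]])
sources = np.stack([background, main_text, interfering])

A = np.array([[1.0, 0.5, 0.2],
              [0.9, 0.3, 0.4],
              [0.8, 0.2, 0.6]])          # hypothetical mixing matrix (N = M = 3)
rgb = mix_sources(sources, A)
```

Flattening the image into an N × T matrix is a convenient convention for the later covariance computations, because the spatial arrangement of the pixels plays no role in the instantaneous model.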

It is easy to see that this model does not perfectly account for the phenomenon of interfering texts in documents, which derives from complicated chemical processes of ink diffusion and paper absorption. Just to mention one aspect, in the pixels where two texts are superimposed on each other, the resulting color is not the vector sum of the colors of the two components, but is likely to be some nonlinear combination of them. In [8], a nonlinear model is derived even for the simpler phenomenon of show-through. However, although the linear model is only a rough approximation, it has demonstrated its usefulness in different applications, as already mentioned above [4][10].

3 The Proposed Solutions: ICA, PCA, and Whitening

When no additional assumption is made, problem (1) is clearly underdetermined, since any nonsingular choice for A can give an estimate of s(t) that accounts for the evidence x(t). Even if no specific information is available, statistical assumptions can often be made on the sources. In particular, it can be assumed that the sources are mutually independent. If this assumption is justified, both A and s can be estimated from x. As mentioned in the introduction, this is the ICA approach [6]. If the prior distribution for each source is known, independence is equivalent to assuming a factorized form for the joint prior distribution of s:

$$P(s(t)) = \prod_{i=1}^{N} P_i(s_i(t)) \qquad \forall t \tag{2}$$

The separation problem can be formulated as the maximization of eq. (2), subject to the constraint x = As. This is equivalent to the search for a matrix W, $W = (w_1, w_2, \ldots, w_N)^T$, such that, when applied to the data $x = (x_1, x_2, \ldots, x_N)$, it produces the set of vectors $w_i^T x$ that are maximally independent, and whose distributions are given by the $P_i$. By taking the logarithm of eq. (2), the problem solved by ICA algorithms is then:

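$$\hat{W} = \arg\max_{W} \sum_{t} \sum_{i} \log P_i\left(w_i^T x(t)\right) + T \log |\det(W)| \tag{3}$$

Matrix $\hat{W}$ is an estimate of $A^{-1}$, up to arbitrary scale factors and permutations on the columns. Hence, each vector $\hat{s}_i = \hat{w}_i^T x$ is one of the original source vectors up to a scale factor.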
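To make objective (3) concrete, the sketch below evaluates it for a given candidate W. The paper leaves the priors generic, so this example assumes one common super-Gaussian choice, log P_i(s) = -log cosh(s) up to an additive constant. That prior, the function name `ica_log_likelihood`, and the array layout (one pixel per column) are assumptions made for illustration; an actual ICA algorithm would also need an optimizer over W, which is not shown.

```python
import numpy as np

def ica_log_likelihood(W, X):
    """Value of objective (3) for unmixing matrix W and data X.

    W: candidate unmixing matrix of shape (N, N).
    X: data matrix of shape (N, T), one pixel per column.
    Assumed prior: log P_i(s) = -log cosh(s), constant term dropped.
    """
    T = X.shape[1]
    Y = W @ X                                  # row i holds w_i^T x(t)
    # Numerically stable log cosh(y) = logaddexp(y, -y) - log 2.
    log_cosh = np.logaddexp(Y, -Y) - np.log(2.0)
    _, log_abs_det = np.linalg.slogdet(W)      # log |det(W)|
    return -log_cosh.sum() + T * log_abs_det

# With all-zero data the prior term vanishes and only T log|det W| remains.
value = ica_log_likelihood(2.0 * np.eye(2), np.zeros((2, 5)))
```

The term T log|det(W)| keeps the maximization from collapsing toward W = 0, which would otherwise make every output look maximally probable under a prior peaked at zero.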

Besides independence, a necessary extra condition for making separation possible is that all the sources, except at most one, must be non-Gaussian. To enforce non-Gaussianity, generic super-Gaussian or sub-Gaussian distributions can be used as priors for the sources. These have proven to give very good estimates for the mixing matrix and for the sources as well, regardless of the true source distributions, which, on the other hand, are usually unknown [2].
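One simple way to tell super-Gaussian from sub-Gaussian signals is the sign of the excess kurtosis, which is zero for a Gaussian. The sketch below is an illustration only; the helper `excess_kurtosis` and the three synthetic distributions are assumptions, and kurtosis is just one of several possible non-Gaussianity measures.

```python
import numpy as np

def excess_kurtosis(samples):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    centered = samples - samples.mean()
    return (centered ** 4).mean() / (centered ** 2).mean() ** 2 - 3.0

rng = np.random.default_rng(0)
n = 100_000
k_laplace = excess_kurtosis(rng.laplace(size=n))       # super-Gaussian: positive
k_uniform = excess_kurtosis(rng.uniform(-1, 1, n))     # sub-Gaussian: negative
k_gaussian = excess_kurtosis(rng.normal(size=n))       # close to zero
```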

Although we already obtained some promising results with this approach [10], there is no apparent physical reason why our original sources should be mutually independent, so, even if the data model (1) were correct, the ICA principle is not assured to be able to separate the different classes. However, it is intuitively clear that one can try to maximize the information content in each component of the data vector by decorrelating the observed image channels. To avoid cumbersome notation, and without loss of generality, let us assume zero-mean data vectors. We thus seek a linear transformation y(t) = Wx(t) such that $\langle y_i y_j \rangle = 0$, $\forall\, i, j = 1, \ldots, M$, $i \neq j$, where W is generally an M × N matrix and the notation ⟨·⟩ means expectation. In other words, the components of the transformed data vector y are orthogonal. It is clear that this operation is not unique, since, given an orthonormal basis of a subspace, any rigid rotation of it still yields an orthonormal basis of the same subspace.

It is well known that linear data processing can help to restore color text images, although the linear model is not fully justified. In [7], the authors compare the effect of many fixed linear color transformations on the performance of a recursive segmentation algorithm. They argue that the linear transformation that obtains maximum-variance components is the most effective. They thus derive a fixed transformation that, for a large class of images, approximates the Karhunen-Loeve transformation, which is known to give orthogonal output vectors, one of which has maximum variance. This approach is also called principal component analysis (PCA), and one of its purposes is to find the most useful among a number of variables [3]. Our data covariance matrix is the N × N matrix:

$$R_{xx} = \langle x x^T \rangle \approx \frac{1}{T} \sum_{t=1}^{T} x(t)\, x^T(t) \tag{4}$$
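The sample estimate in (4) can be written in a few lines. This is a minimal sketch assuming NumPy and a data matrix with one pixel per column; the function name `sample_covariance` is hypothetical, and the mean is removed explicitly because (4) presupposes zero-mean data.

```python
import numpy as np

def sample_covariance(X):
    """Estimate R_xx as (1/T) * sum_t x(t) x(t)^T after removing the mean.

    X: data matrix of shape (N, T), one pixel per column.
    """
    centered = X - X.mean(axis=1, keepdims=True)
    return centered @ centered.T / X.shape[1]

X = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
R_xx = sample_covariance(X)
```

For this toy input the two channels are proportional, so the off-diagonal entries of `R_xx` are nonzero and the matrix is singular.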

Since the data are normally correlated, matrix $R_{xx}$ will be nondiagonal. The covariance matrix of vector y is:

$$R_{yy} = \langle W x x^T W^T \rangle = W R_{xx} W^T \tag{5}$$

To obtain orthogonal y, $R_{yy}$ should be diagonal. Let us perform the eigenvalue decomposition of matrix $R_{xx}$, and call $V_x$ the matrix of the eigenvectors of $R_{xx}$, and $\Lambda_x$ the diagonal matrix of its eigenvalues, in decreasing order. Now, it is easy to verify that all of the following choices for W yield a diagonal $R_{yy}$:

$$W_o = V_x^T \tag{6}$$

$$W_w = \Lambda_x^{-1/2} V_x^T \tag{7}$$

$$W_s = V_x \Lambda_x^{-1/2} V_x^T \tag{8}$$
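The three choices (6)-(8) can be computed from one eigendecomposition and checked numerically. The sketch below is an illustration on synthetic data, assuming NumPy and a full-rank covariance matrix; the function name `decorrelation_matrices`, the toy mixing matrix, and the Laplacian sources are hypothetical.

```python
import numpy as np

def decorrelation_matrices(X):
    """Return R_xx and the matrices W_o, W_w, W_s of eqs. (6)-(8).

    X: data matrix of shape (N, T); R_xx is assumed to be full rank.
    """
    centered = X - X.mean(axis=1, keepdims=True)
    R_xx = centered @ centered.T / centered.shape[1]
    eigvals, V = np.linalg.eigh(R_xx)        # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to decreasing
    eigvals, V = eigvals[order], V[:, order]
    inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))
    W_o = V.T                                # PCA, eq. (6)
    W_w = inv_sqrt @ V.T                     # whitening, eq. (7)
    W_s = V @ inv_sqrt @ V.T                 # symmetric whitening, eq. (8)
    return R_xx, W_o, W_w, W_s

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.6, 0.3],
              [0.5, 1.0, 0.4],
              [0.2, 0.7, 1.0]])              # hypothetical mixing matrix
X = A @ rng.laplace(size=(3, 5000))
R_xx, W_o, W_w, W_s = decorrelation_matrices(X)

# R_yy = W R_xx W^T for each choice, following eq. (5).
R_yy_o = W_o @ R_xx @ W_o.T                  # diagonal, holds the eigenvalues
R_yy_w = W_w @ R_xx @ W_w.T                  # identity
R_yy_s = W_s @ R_xx @ W_s.T                  # identity
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; its eigenvalues come back in ascending order, hence the explicit reordering to match the decreasing order used in the text.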

Matrix $W_o$ produces a set of vectors $y_i(t)$ that are orthogonal to each other and whose squared Euclidean norms, normalized by T (i.e. their variances), are equal to the eigenvalues of the data covariance matrix. This is what PCA does [3]. By using matrix $W_w$, we obtain a set of orthogonal vectors of unit norm under the same normalization, i.e. orthogonal vectors located on a spherical surface (whitening, or Mahalanobis transform). This property still holds true if any whitening matrix is multiplied from the left by an orthogonal matrix. In particular, if we use matrix $W_s$ defined in (8), we have a whitening matrix with the further property of being symmetric. In [3], it is observed that application of matrix $W_s$ is equivalent to ICA when matrix A is symmetric. In general, ICA applies a further rotation to the output vectors, based on higher-order statistics.

4 Experimental Results and Concluding Remarks

Our experimental work has consisted in applying the above matrices to typical images of ancient documents, with the aim of emphasizing the hidden document features in the whitened vectors. For each test image, the results are of course different for different whitening matrices. However, it is interesting to note that the symmetric whitening matrix often performs better than ICA, and, in some cases, it can also achieve a separation of the different components, which is the final aim of BSS. Here, we show some examples from our extensive experimentation. The first example (Figure 1) describes the processing of an ancient manuscript which presents three overlapped patterns: a main text, an underwriting barely visible in the original image, and a noisy background with significant paper folds. We compared the results of the FastICA algorithm [5][10], the PCA, and the symmetric whitening, all applied to the RGB channels, and found that full separation and enhancement of the three classes is obtained by the symmetric orthogonalization only. ICA failure might depend, in this case, on the data model inaccuracy and/or the lack of mutual independence of the classes.

Fig. 1. Full separation with symmetric orthogonalization: (a) grayscale representation of the color scan of an ancient manuscript containing a partially hidden text; (b) first symmetric orthogonalization output from the RGB components of the color image; (c) second symmetric orthogonalization output from the same data set.

In Figure 2, we report another example where a paper watermark pattern is detected and extracted. In this case, we assume that the document consists of only two classes: the foreground pattern, with drawings and text, and the background pattern with the watermark, so that only two views are needed. We used two infrared acquisitions, the first taken under front illumination, the second taken with illumination from the back. In this case a good extraction is achieved by all three proposed methods. However, the best one is obtained with FastICA. Finally, Figure 3 shows a last example of extraction of a faint underlying pattern, using the RGB components. In this case, all three proposed methods performed similarly.
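For readers who want to try symmetric whitening on their own scans, the following sketch applies it to the channels of an image array and rescales each output to 8 bits for display. It is not the authors' implementation: the function name `symmetric_whitening_outputs`, the (H, W, N) array layout, the small eigenvalue floor, and the min-max display scaling are all assumptions made for illustration.

```python
import numpy as np

def symmetric_whitening_outputs(image):
    """Apply symmetric whitening to the channels of an image.

    image: array of shape (H, W, N), e.g. an RGB scan with N = 3.
    Returns an (H, W, N) uint8 array; each output channel is rescaled
    to [0, 255] independently for display.
    """
    H, W, N = image.shape
    X = image.reshape(H * W, N).T.astype(float)   # (N, T), one pixel per column
    X -= X.mean(axis=1, keepdims=True)            # zero-mean channels
    R_xx = X @ X.T / X.shape[1]
    eigvals, V = np.linalg.eigh(R_xx)
    eigvals = np.maximum(eigvals, 1e-12)          # assumed guard for near-singular R_xx
    W_s = V @ np.diag(1.0 / np.sqrt(eigvals)) @ V.T
    Y = W_s @ X
    lo = Y.min(axis=1, keepdims=True)
    hi = Y.max(axis=1, keepdims=True)
    span = np.where(hi > lo, hi - lo, 1.0)        # avoid division by zero
    Y8 = np.round(255.0 * (Y - lo) / span).astype(np.uint8)
    return Y8.T.reshape(H, W, N)

# Usage on a synthetic stand-in for a scanned page (random values).
rng = np.random.default_rng(1)
scan = rng.integers(0, 256, size=(16, 16, 3)).astype(np.uint8)
outputs = symmetric_whitening_outputs(scan)
```

Each output channel is defined only up to sign and scale, so a pattern may appear inverted (light on dark); inverting the displayed image does not change the information it carries.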

Fig. 2. Watermark detection: (a) infrared front view; (b) back illumination infrared view; (c) one FastICA output.

These experiments confirmed our initial intuition about the validity of BSS techniques for enhancing and separating the various features that appear overlapped in many ancient documents. No conclusions can instead be drawn about the superiority of one method over the others for all documents. We can only say that, when the main goal is to enhance partially hidden features, at least one of the three proposed methods succeeded in reaching this goal in all our experiments. The advantages of these techniques are that they are quite simple and fast, […]

Fig. 3. Detection of an underlying pattern: (a) grayscale version of the original color document; (b) underlying pattern detected by symmetric orthogonalization.

Composition of the Isyreadet consortium: TEA SAS (Catanzaro, Italy), Art Innovation (Oldenzaal, The Netherlands), Art Conservation (Vlaardingen, The Netherlands), Transmedia (Swansea, UK), Atelier Quillet (Loix, France), Acciss Bretagne (Plouzane, France), ENST (Brest, France), CNR-ISTI (Pisa, Italy), CNR-IPCF (Pisa, Italy).

References

1. http://www.art-innovation.nl/
2. Bell, A.J., Sejnowski, T.J.: Neural Computation (1995) 7:1129–1159.
3. Cichocki, A., Amari, S.-I.: Adaptive Blind Signal and Image Processing (2002) Wiley, New York.
4. Easton, R.L.: http://www.cis.rit.edu/people/faculty/easton/k-12/index.htm
5. Hyvärinen, A., Oja, E.: Neural Networks (2000) 13:411–430.
6. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis (2001) John Wiley, New York.
7. Ohta, Y., Kanade, T., Sakai, T.: Computer Graphics, Vision, and Image Processing (1980) 13:222–241.
8. Sharma, G.: IEEE Trans. Image Processing (2001) 10:736–754.
9. Swift, R.: http://www.cis.rit.edu/research/thesis/bs/2001/swift/thesis.html
10. Tonazzini, A., Bedini, L., Salerno, E.: Int. J. Document Analysis and Recognition (2004) in press.
