Blind Source Separation Techniques for Detecting Hidden Text


Blind Source Separation Techniques for Detecting Hidden Texts and Textures in Document Images

Anna Tonazzini, Emanuele Salerno, Matteo Mochi, and Luigi Bedini

Istituto di Scienza e Tecnologie dell'Informazione - CNR

Via G. Moruzzi, 1, I-56124 PISA, Italy

anna.tonazzini@http://doc.guandang.netr.it

Abstract. Blind Source Separation techniques, based both on Independent Component Analysis and on second order statistics, are presented and compared for extracting partially hidden texts and textures in document images. Barely perceivable features may occur, for instance, in ancient documents previously erased and then re-written (palimpsests), or for transparency or seeping of ink from the reverse side, or from watermarks in the paper. Detecting these features can be of great importance to scholars and historians. In our approach, the document is modeled as the superposition of a number of source patterns, and a simplified linear mixture model is introduced for describing the relationship between these sources and multispectral views of the document itself. The problem of detecting the patterns that are barely perceivable in the visible color image is thus formulated as the one of separating the various patterns in the mixtures. Some examples from an extensive experimentation with real ancient documents are shown and commented.

This work has been supported by the European Commission project "Isyreadet" (http://doc.guandang.net), under contract IST-1999-57462.

A. Campilho, M. Kamel (Eds.): ICIAR 2004, LNCS 3212, pp. 241–248, 2004.

© Springer-Verlag Berlin Heidelberg 2004

1 Introduction

Revealing the whole contents of ancient documents is an important aid to scholars who are interested in dating the documents or establishing their origin, or reading older and historically relevant writings they may contain. However, interesting document features are often hidden or barely detectable in the original color document. Multispectral acquisitions in the non-visible range, such as the ultraviolet or the near infrared, constitute a valid help in this respect. For instance, a method to reveal paper watermarks is to record an infrared image of the paper using transmitted illumination. Nevertheless, the watermark detected with this method is usually very faint and overlaps the contents of the paper surface. To make the watermark pattern, or whatever feature of interest, more readable and free from interferences due to overlapped patterns, an intuitive strategy is to process, for instance by arithmetic operations, multiple "views" of the document. In the case where a color scan is available, three different views can be obtained from the red, green, and blue image channels. When available, scans at non-visible wavelengths can be used alone or in conjunction with the visible ones. By processing the different color components, it is possible to extract some of the overlapped patterns, and, sometimes, even to achieve a complete separation of all of them. Indeed, since all these color components contain the patterns in different "percentages", simple difference operations between the colors, after suitable regulation of the levels, can "cancel" one pattern and enhance the other. For the case of watermarks, another infrared image taken using only the reflected illumination can be used for this purpose [1]. On the other hand, some authors claim that subtracting the green channel from the red one is able to reveal hidden characters in charred documents [9]. These are, however, empirical, document-dependent strategies. We are looking, instead, for automatic, mathematically based techniques that are able to enhance or even to extract the hidden features of interest from documents of any kind, without the need for adaptations to the specific problem at hand.
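To make the idea of "cancelling" one pattern by a level-regulated difference more concrete, here is a minimal sketch in Python with NumPy. Everything in it is invented for illustration: the two toy patterns, their weights in the two channels, and the variable names are assumptions, not values from the paper. In a real document the weights are unknown and would have to be tuned by hand, which is exactly the document-dependent step the authors want to avoid.

```python
import numpy as np

# Two hypothetical patterns on a short strip of pixels (made-up values).
main_text = np.array([0.0, 1.0, 1.0, 0.0, 0.0, 1.0])
hidden_text = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])

# Assumed "percentages" of each pattern in two color channels.
red = 0.9 * main_text + 0.5 * hidden_text
green = 0.6 * main_text + 0.1 * hidden_text

# Level regulation: scale green so that main_text has the same weight in
# both channels, then subtract. main_text cancels, hidden_text survives.
alpha = 0.9 / 0.6
difference = red - alpha * green

print(difference)  # proportional to hidden_text
```

The scale factor `alpha` is only computable here because the weights were chosen by us; the blind techniques discussed in the rest of the paper aim to obtain an equivalent result without that knowledge.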

Our approach to this problem is to model all the document views as linear combinations of a number of independent patterns. The solution then consists in trying to invert this transformation. The overlapped patterns are usually the main foreground text; the background pattern, i.e. an image of the paper (or parchment, or other) support, which can contain different interfering features, such as stains, watermarks, etc.; and one or more extra texts or drawings, due to previously written and then erased texts (palimpsests), seeping of ink from the reverse side (bleed-through), transparency from other pages (show-through), and other phenomena. Although our linear image model roughly simplifies the physical nature of overlapped patterns in documents [8], it has already proved to give interesting results. Indeed, this model has been proposed in [4] to extract the hidden texts from color images of palimpsests, assuming that the mixture coefficients are evaluated by visual inspection. Nevertheless, in general, the mixture coefficients are not known, and the separation problem becomes one of blind source separation (BSS). It has been shown that an effective solution to BSS can be found if the source patterns are mutually independent. The independence assumption gives rise to separation techniques based on independent component analysis, or ICA [6]. Although the linear data model is somewhat simplified, and the independence assumption is not always justified, we already proposed ICA techniques for document image processing [10], and obtained good results with real manuscript documents.

In this paper we compare the performance of ICA techniques with simpler methods that only try to decorrelate the observed data. As is known, this requirement is weaker than independence, and, in principle, no source separation can be obtained by only constraining second-order statistics, at least if no additional requirement is satisfied. However, our present aim is the enhancement of the overlapped patterns, especially of those that are hidden or barely detectable, and we experimentally found that this can be achieved in most cases even by simple decorrelation. On the other hand, while the color components of an image are usually spatially correlated, the individual classes or patterns that compose the image are at least less correlated. Thus, decorrelating the color components gives a different representation where the now uncorrelated components of the image could coincide with the single classes. Furthermore, the second-order approach is always less expensive than ICA algorithms, and due to the poor modeling or to the lack of independence of the patterns, the results from decorrelation can also be better than the ones from ICA.

2 Formulation of the Problem

Let us assume that each pixel (of index t in a total of T) of a multispectral scan of a document has a vector value x(t) of N components. Similarly, let us assume to have M superimposed sources represented, at each pixel t, by the vector s(t). Since we consider images of documents containing homogeneous texts or drawings, we can also reasonably assume that the color of each source is almost uniform, i.e., we will have mean reflectance indices $A_{ij}$ for the j-th source at the i-th wavelength. Thus, we will have a collection of T samples from a random N-vector x, which is generated by linearly and instantaneously mixing the components of a random M-vector s through an N×M mixing matrix A:

$$x(t) = A\,s(t), \qquad t = 1, 2, \ldots, T \tag{1}$$

where the source functions $s_i(t)$, i = 1, 2, ..., M, denote the "quantity" of the M patterns that concur to form the color at point t. Estimating s(t) and A from knowledge of x(t) is called a problem of blind source separation (BSS). In this application, we assume that noise and blur can be neglected. When only the visible color scan is available, vector x(t) has dimension N = 3 (it is composed of the red, green, and blue channels). However, most documents can be seen as the superposition of only three (M = 3) different sources, or classes, that we will call "background", "main text" and "interfering texture". In general, by using multispectral/hyperspectral sensors, the "color" vector can assume a dimension greater than 3. Likewise, we can also have M > 3 if additional patterns are present in the original document. In this paper, we only consider the case M = N, that is, the same number of sources as of observations, although in principle there is no difference with the general case.
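The mixing model (1) can be simulated in a few lines. The sketch below, in Python with NumPy, is only an illustration: the source maps are random numbers and the entries of `A` are made-up reflectances, not measurements from any document. It shows the non-blind case, where A is known and the sources follow by direct inversion; BSS addresses the case where A is not available.

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, M = 1000, 3, 3   # pixels, channels, sources (M = N, as in the paper)

# Hypothetical source maps s(t): background, main text, interfering texture.
s = rng.uniform(0.0, 1.0, size=(M, T))

# Hypothetical N x M mixing matrix: column j holds the mean reflectance of
# source j in each of the N channels (values invented for this example).
A = np.array([[0.9, 0.3, 0.6],
              [0.8, 0.2, 0.4],
              [0.6, 0.1, 0.5]])

x = A @ s   # eq. (1) applied to all pixels at once, shape (N, T)

# With A known and nonsingular, the sources are recovered by inversion.
s_rec = np.linalg.solve(A, x)
print(np.allclose(s_rec, s))
```

Storing the pixels as columns of an N×T array keeps the code close to the notation of (1): one matrix product mixes the whole image.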

It is easy to see that this model does not perfectly account for the phenomenon of interfering texts in documents, which derives from complicated chemical processes of ink diffusion and paper absorption. Just to mention one aspect, in the pixels where two texts are superimposed on each other, the resulting color is not the vector sum of the colors of the two components, but it is likely to be some nonlinear combination of them. In [8], a nonlinear model is derived even for the simpler phenomenon of show-through. However, although the linear model is only a rough approximation, it has demonstrated its usefulness in different applications, as already mentioned above [4][10].

3 The Proposed Solutions: ICA, PCA, and Whitening

When no additional assumption is made, problem (1) is clearly underdetermined, since any nonsingular choice for A can give an estimate of s(t) that accounts for the evidence x(t). Even if no specific information is available, statistical assumptions can often be made on the sources. In particular, it can be assumed that the sources are mutually independent. If this assumption is justified, both A and s can be estimated from x. As mentioned in the introduction, this is the ICA approach [6]. If the prior distribution for each source is known, independence is equivalent to assuming a factorized form for the joint prior distribution of s:

$$P(s(t)) = \prod_{i=1}^{N} P_i(s_i(t)) \qquad \forall t \tag{2}$$

The separation problem can be formulated as the maximization of eq. (2), subject to the constraint x = As. This is equivalent to the search for a matrix W, $W = (w_1, w_2, \ldots, w_N)^T$, such that, when applied to the data $x = (x_1, x_2, \ldots, x_N)$, it produces the set of vectors $w_i^T x$ that are maximally independent, and whose distributions are given by the $P_i$. By taking the logarithm of eq. (2), the problem solved by ICA algorithms is then:

$$\hat{W} = \arg\max_{W} \sum_{t} \sum_{i} \log P_i\left(w_i^T x(t)\right) + T \log\left|\det(W)\right| \tag{3}$$

Matrix $\hat{W}$ is an estimate of $A^{-1}$, up to arbitrary scale factors and permutations on the columns. Hence, each vector $\hat{s}_i = \hat{w}_i^T x$ is one of the original source vectors up to a scale factor.
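The step from the factorized prior (2) to the objective (3) can be made explicit. The following is a brief sketch, assuming that W is square and invertible and that the T pixels are treated as independent samples; both are assumptions of this note, consistent with the case M = N considered in the paper. It relies only on the standard change-of-variables rule for probability densities, with s = Wx.

$$
\begin{aligned}
p_x(x) &= \left|\det(W)\right| \prod_{i=1}^{N} P_i\left(w_i^T x\right), \\
\log \prod_{t=1}^{T} p_x(x(t)) &= \sum_{t=1}^{T} \sum_{i=1}^{N} \log P_i\left(w_i^T x(t)\right) + T \log\left|\det(W)\right| .
\end{aligned}
$$

The term $T \log|\det(W)|$ is thus the Jacobian of the transformation, counted once per pixel. For priors that are peaked at zero, dropping it would let the objective grow simply by shrinking W, so it acts as the normalization that keeps the maximization meaningful.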

Besides independence, to make separation possible a necessary extra condition for the sources is that all of them, except at most one, must be non-Gaussian. To enforce non-Gaussianity, generic super-Gaussian or sub-Gaussian distributions can be used as priors for the sources. These have proven to give very good estimates for the mixing matrix and for the sources as well, no matter what the true source distributions are, which, on the other hand, are usually unknown [2].
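As an illustration of how a generic non-Gaussianity assumption can drive the separation, the sketch below implements a textbook symmetric fixed-point ICA iteration with a tanh nonlinearity, in the spirit of FastICA [5]. It is a minimal sketch under stated assumptions, not the implementation used by the authors: the function names, the tanh choice, the iteration count, and the toy sources (one uniform, one Laplacian) are all choices made for this example.

```python
import numpy as np

def whiten(x):
    """Center N x T data and map it to identity covariance."""
    x = x - x.mean(axis=1, keepdims=True)
    eigval, eigvec = np.linalg.eigh(x @ x.T / x.shape[1])
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ x

def sym_decorrelate(w):
    """Symmetric orthogonalization: (W W^T)^(-1/2) W."""
    eigval, eigvec = np.linalg.eigh(w @ w.T)
    return eigvec @ np.diag(eigval ** -0.5) @ eigvec.T @ w

def fastica_sketch(x, n_iter=200, seed=0):
    """Fixed-point ICA on whitened data; returns the estimated sources."""
    z = whiten(x)
    n, t = z.shape
    rng = np.random.default_rng(seed)
    w = sym_decorrelate(rng.standard_normal((n, n)))
    for _ in range(n_iter):
        u = np.tanh(w @ z)                      # g(w_i^T z) at every pixel
        g_prime = (1.0 - u ** 2).mean(axis=1)   # E[g'(w_i^T z)] for each row
        w = sym_decorrelate(u @ z.T / t - g_prime[:, None] * w)
    return w @ z

# Toy data: two independent non-Gaussian sources, linearly mixed.
rng = np.random.default_rng(1)
s = np.vstack([rng.uniform(-1, 1, 5000), rng.laplace(size=5000)])
a = np.array([[1.0, 0.6], [0.4, 1.0]])
s_hat = fastica_sketch(a @ s)

# Absolute correlation between true (rows) and estimated (columns) sources.
corr = np.abs(np.corrcoef(np.vstack([s, s_hat]))[:2, 2:])
print(corr.round(2))
```

Each estimated component should correlate strongly with exactly one true source, in some order and with arbitrary sign, which matches the scale and permutation ambiguities mentioned in the text.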

Although we already obtained some promising results with this approach [10], there is no apparent physical reason why our original sources should be mutually independent, so, even if the data model (1) were correct, the ICA principle is not assured to be able to separate the different classes. However, it is intuitively clear that one can try to maximize the information content in each component of the data vector by decorrelating the observed image channels. To avoid cumbersome notation, and without loss of generality, let us assume to have zero-mean data vectors. We thus seek a linear transformation y(t) = Wx(t) such that $\langle y_i y_j \rangle = 0$, $\forall\, i, j = 1, \ldots, M$, $i \neq j$, where W is generally an M×N matrix and the notation $\langle \cdot \rangle$ means expectation. In other words, the components of the transformed data vector y are orthogonal. It is clear that this operation is not unique, since, given an orthonormal basis of a subspace, any rigid rotation of it still yields an orthonormal basis of the same subspace. It is well known that linear data processing can help to restore color text images, although the linear model is not fully justified. In [7], the authors compare the effect of many fixed linear color transformations on the performance of a recursive segmentation algorithm. They argue that the linear transformation that obtains maximum-variance components is the most effective. They thus derive a fixed transformation that, for a large class of images, approximates the Karhunen-Loeve transformation, which is known to give orthogonal output vectors, one of which has maximum variance. This approach is also called principal component analysis (PCA), and one of its purposes is to find the most useful among a number of variables [3]. Our data covariance matrix is the N×N matrix:

$$R_{xx} = \langle x x^T \rangle \approx \frac{1}{T} \sum_{t=1}^{T} x(t)\, x^T(t) \tag{4}$$

Since the data are normally correlated, matrix $R_{xx}$ will be nondiagonal. The covariance matrix of vector y is:

$$R_{yy} = \langle W x x^T W^T \rangle = W R_{xx} W^T \tag{5}$$

To obtain orthogonal y, $R_{yy}$ should be diagonal. Let us perform the eigenvalue decomposition of matrix $R_{xx}$, and call $V_x$ the matrix of the eigenvectors of $R_{xx}$, and $\Lambda_x$ the diagonal matrix of its eigenvalues, in decreasing order. Now, it is easy to verify that all of the following choices for W yield a diagonal $R_{yy}$:

$$W_o = V_x^T \tag{6}$$

$$W_w = \Lambda_x^{-1/2} V_x^T \tag{7}$$

$$W_s = V_x \Lambda_x^{-1/2} V_x^T \tag{8}$$
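The three choices (6)-(8) can be checked numerically. The sketch below, in Python with NumPy, builds them from the sample covariance (4) of some synthetic correlated data and evaluates (5) for each. The data, the mixing matrix, and the function name `decorrelation_matrices` are assumptions made for this example only.

```python
import numpy as np

def decorrelation_matrices(x):
    """Return (W_o, W_w, W_s) of eqs. (6)-(8) for zero-mean N x T data."""
    r_xx = x @ x.T / x.shape[1]              # sample covariance, eq. (4)
    eigval, v = np.linalg.eigh(r_xx)         # eigh returns ascending order
    order = np.argsort(eigval)[::-1]         # decreasing order, as in the text
    eigval, v = eigval[order], v[:, order]
    inv_sqrt = np.diag(eigval ** -0.5)
    return v.T, inv_sqrt @ v.T, v @ inv_sqrt @ v.T

# Synthetic correlated channels (illustrative values only).
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.4],
                   [0.2, 0.3, 1.0]])
x = mixing @ rng.uniform(-1, 1, size=(3, 2000))
x -= x.mean(axis=1, keepdims=True)
r_xx = x @ x.T / x.shape[1]

w_o, w_w, w_s = decorrelation_matrices(x)
r_o, r_w, r_s = (w @ r_xx @ w.T for w in (w_o, w_w, w_s))   # eq. (5)

print(np.round(r_o, 6))   # diagonal, entries are the eigenvalues of R_xx
print(np.round(r_w, 6))   # identity
print(np.round(r_s, 6))   # identity, and W_s itself is symmetric
```

`np.linalg.eigh` is used because the covariance matrix is symmetric; it returns eigenvalues in ascending order, so they are reordered to match the decreasing order assumed in the text.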

Matrix $W_o$ produces a set of vectors $y_i(t)$ that are orthogonal to each other and whose Euclidean norms are equal to the eigenvalues of the data covariance matrix. This is what PCA does [3]. By using matrix $W_w$, we obtain a set of orthogonal vectors of unit norms, i.e. orthogonal vectors located on a spherical surface (whitening, or Mahalanobis transform). This property still holds true if any whitening matrix is multiplied from the left by an orthogonal matrix. In particular, if we use matrix $W_s$ defined in (8), we have a whitening matrix with the further property of being symmetric. In [3], it is observed that application of matrix $W_s$ is equivalent to ICA when matrix A is symmetric. In general, ICA applies a further rotation to the output vectors, based on higher-order statistics.

4 Experimental Results and Concluding Remarks

Our experimental work has consisted in applying the above matrices to typical images of ancient documents, with the aim of emphasizing the hidden features of the document in the whitened vectors. For each test image, the results are of course different for different whitening matrices. However, it is interesting to note that the symmetric whitening matrix often performs better than ICA, and, in some cases, it can also achieve a separation of the different components, which is the final aim of BSS. Here, we show some examples from our extensive experimentation. The first example (Figure 1) describes the processing of an ancient manuscript which presents three overlapped patterns: a main text, an underwriting barely visible in the original image, and a noisy background with significant paper folds. We compared the results of the FastICA algorithm [5][10], the PCA, and the symmetric whitening, all applied to the RGB channels, and found that full separation and enhancement of the three classes is obtained by the symmetric orthogonalization only. ICA failure might depend, in this case, on the data model inaccuracy and/or the lack of mutual independence of the classes.

Fig. 1. Full separation with symmetric orthogonalization: (a) grayscale representation of the color scan of an ancient manuscript containing a partially hidden text; (b) first symmetric orthogonalization output from the RGB components of the color image; (c) second symmetric orthogonalization output from the same data set.

In Figure 2, we report another example where a paper watermark pattern is detected and extracted. In this case, we assume the document to consist of only two classes: the foreground pattern, with drawings and text, and the background pattern with the watermark, so that only two views are needed. We used two infrared acquisitions, the first taken under front illumination, the second taken with illumination from the back. In this case a good extraction is achieved by using all three proposed methods. However, the best one is obtained with FastICA. Finally, Figure 3 shows a last example of extraction of a faint underlying pattern, using the RGB components. In this case, all three proposed methods performed similarly.
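For readers who want to try symmetric whitening on the RGB channels of their own scans, the following sketch shows one possible way to organize the computation in Python with NumPy. It is not the authors' code: the function name `symmetric_whitening_rgb` is hypothetical, the input is a synthetic array standing in for a real scan, and image loading and the rescaling of the outputs for display are left out because the paper does not specify them.

```python
import numpy as np

def symmetric_whitening_rgb(image):
    """Map an H x W x 3 color image to three decorrelated H x W output maps."""
    h, w, n = image.shape
    x = image.reshape(-1, n).T.astype(float)   # N x T data matrix, T = H * W
    x -= x.mean(axis=1, keepdims=True)         # zero-mean channels
    eigval, v = np.linalg.eigh(x @ x.T / x.shape[1])
    w_s = v @ np.diag(eigval ** -0.5) @ v.T    # symmetric whitening, eq. (8)
    y = w_s @ x
    return y.T.reshape(h, w, n)

# Synthetic stand-in for a scan: three random patterns mixed into R, G, B
# with made-up weights (a real test would load an actual document image).
rng = np.random.default_rng(0)
patterns = rng.uniform(0.0, 1.0, size=(32, 32, 3))
mixing = np.array([[0.9, 0.5, 0.3],
                   [0.6, 0.7, 0.2],
                   [0.4, 0.3, 0.8]])
scan = patterns @ mixing.T
outputs = symmetric_whitening_rgb(scan)

print(outputs.shape)   # (32, 32, 3): one output map per channel
```

Each of the three output maps can then be inspected as a grayscale image. Whether one of them isolates a hidden pattern depends on the document, as the examples in this section show.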

Fig. 2. Watermark detection: (a) infrared front view; (b) back illumination infrared view; (c) one FastICA output.

These experiments confirmed our initial intuition about the validity of BSS techniques for enhancing and separating the various features that appear as overlapped in many ancient documents. No conclusions can instead be drawn about the superiority of one method over the others for all documents. We can only say that, when the main goal is to enhance partially hidden features, at least one of the three methods proposed always succeeded in reaching the goal in all our experiments. The advantages of these techniques are that they are quite simple and fast, [...]

Fig. 3. Detection of an underlying pattern: (a) grayscale version of the original color document; (b) underlying pattern detected by symmetric orthogonalization.

Composition of the Isyreadet consortium: TEA SAS (Catanzaro, Italy), Art Innovation (Oldenzaal, The Netherlands), Art Conservation (Vlaardingen, The Netherlands), Transmedia (Swansea, UK), Atelier Quillet (Loix, France), Acciss Bretagne (Plouzane, France), ENST (Brest, France), CNR-ISTI (Pisa, Italy), CNR-IPCF (Pisa, Italy).

References

1. http://www.art-innovation.nl/

2. Bell, A.J., Sejnowski, T.J.: Neural Computation (1995) 7:1129–1159.

3. Cichocki, A., Amari, S.-I.: Adaptive Blind Signal and Image Processing (2002) Wiley, New York.

4. Easton, R.L.: http://www.cis.rit.edu/people/faculty/easton/k-12/index.htm

5. Hyvärinen, A., Oja, E.: Neural Networks (2000) 13:411–430.

6. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis (2001) John Wiley, New York.

7. Ohta, Y., Kanade, T., Sakai, T.: Computer Graphics, Vision, and Image Processing (1980) 13:222–241.

8. Sharma, G.: IEEE Trans. Image Processing (2001) 10:736–754.

9. Swift, R.: http://www.cis.rit.edu/research/thesis/bs/2001/swift/thesis.html

10. Tonazzini, A., Bedini, L., Salerno, E.: Int. J. Document Analysis and Recognition (2004) in press.
