Blind Source Separation Techniques for Detecting Hidden Text
Abstract. Blind Source Separation techniques, based both on Independent Component Analysis and on second order statistics, are presented and compared for extracting partially hidden texts and textures in document images. Barely perceivable features may occ
Blind Source Separation Techniques for Detecting Hidden Texts and Textures in Document Images
Anna Tonazzini, Emanuele Salerno, Matteo Mochi, and Luigi Bedini
Istituto di Scienza e Tecnologie dell'Informazione - CNR
Via G. Moruzzi, 1, I-56124 PISA, Italy
anna.tonazzini@isti.cnr.it
Abstract. Blind Source Separation techniques, based both on Independent Component Analysis and on second order statistics, are presented and compared for extracting partially hidden texts and textures in document images. Barely perceivable features may occur, for instance, in ancient documents previously erased and then re-written (palimpsests), or for transparency or seeping of ink from the reverse side, or from watermarks in the paper. Detecting these features can be of great importance to scholars and historians. In our approach, the document is modeled as the superposition of a number of source patterns, and a simplified linear mixture model is introduced for describing the relationship between these sources and multispectral views of the document itself. The problem of detecting the patterns that are barely perceivable in the visible color image is thus formulated as the one of separating the various patterns in the mixtures. Some examples from an extensive experimentation with real ancient documents are shown and commented.
This work has been supported by the European Commission project "Isyreadet", under contract IST-1999-57462.
A. Campilho, M. Kamel (Eds.): ICIAR 2004, LNCS 3212, pp. 241–248, 2004.
© Springer-Verlag Berlin Heidelberg 2004
1 Introduction
Revealing the whole contents of ancient documents is an important aid to scholars who are interested in dating the documents or establishing their origin, or in reading older and historically relevant writings they may contain. However, interesting document features are often hidden or barely detectable in the original color document. Multispectral acquisitions in the non-visible range, such as the ultraviolet or the near infrared, constitute a valid help in this respect. For instance, a method to reveal paper watermarks is to record an infrared image of the paper using transmitted illumination. Nevertheless, the watermark detected with this method is usually very faint and overlaps the contents of the paper surface. To make the watermark pattern, or whatever feature of interest, more readable and free from interferences due to overlapped patterns, an intuitive strategy is to process, for instance by arithmetic operations, multiple "views" of the document. In the case where a color scan is available, three different views can be obtained from the red, green, and blue image channels. When available, scans at non-visible wavelengths can be used alone or in conjunction with the visible ones. By processing the different color components, it is possible to extract some of the overlapped patterns, and, sometimes, even to achieve a complete separation of all of them. Indeed, since all these color components contain the patterns in different "percentages", simple difference operations between the colors, after suitable regulation of the levels, can "cancel" one pattern and enhance the other. For the case of watermarks, another infrared image taken using only the reflected illumination can be used for this purpose [1]. On the other hand, some authors claim that subtracting the Green from the Red is able to reveal hidden characters in charred documents [9]. These are, however, empirical, document-dependent strategies. We are looking, instead, for automatic, mathematically based techniques that are able to enhance or even to extract the hidden features of interest from documents of any kind, without the need for adaptations to the specific problem at hand.
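The idea of a difference between channels after level regulation can be made concrete with a small numerical sketch. The two synthetic patterns, their weights in the two channels, and the helper name `cancel_pattern` are assumptions chosen for illustration only; they do not come from any real document.

```python
import numpy as np

def cancel_pattern(view_a, view_b, weight_a, weight_b):
    """Subtract view_b from view_a after rescaling it, so that the pattern
    that appears with weights (weight_a, weight_b) in the two views cancels."""
    return view_a - (weight_a / weight_b) * view_b

# Two hypothetical patterns on a strip of 6 pixels (illustrative values).
main_text = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
hidden_text = np.array([0.0, 1.0, 1.0, 0.0, 1.0, 0.0])

# Each channel contains both patterns in a different "percentage" (assumed).
red = 0.8 * main_text + 0.3 * hidden_text
green = 0.4 * main_text + 0.6 * hidden_text

# Regulate the levels so that the main text cancels, then take the difference.
residual = cancel_pattern(red, green, weight_a=0.8, weight_b=0.4)
```

In this sketch the weights are known by construction, so the main text cancels exactly and only a scaled copy of the hidden pattern remains. With a real document the weights are unknown, which is why such strategies remain document-dependent.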
Our approach to this problem is to model all the document views as linear combinations of a number of independent patterns. The solution then consists in trying to invert this transformation. The overlapped patterns are usually the main foreground text, the background pattern, i.e. an image of the paper (or parchment, or whatever) support, which can contain different interfering features, such as stains, watermarks, etc., and one or more extra texts or drawings, due to previously written and then erased texts (palimpsests), seeping of ink from the reverse side (bleed-through), transparency from other pages (show-through), and other phenomena. Although our linear image model roughly simplifies the physical nature of overlapped patterns in documents [8], it has already proved to give interesting results. Indeed, this model has been proposed in [4] to extract the hidden texts from color images of palimpsests, assuming that the mixture coefficients are evaluated by visual inspection. Nevertheless, in general, the mixture coefficients are not known, and the separation problem becomes one of blind source separation (BSS). It has been shown that an effective solution to BSS can be found if the source patterns are mutually independent. The independence assumption gives rise to separation techniques based on independent component analysis, or ICA [6]. Although the linear data model is somewhat simplified, and the independence assumption is not always justified, we already proposed ICA techniques for document image processing [10], and obtained good results with real manuscript documents.
In this paper we compare the performance of ICA techniques with simpler methods that only try to decorrelate the observed data. As is known, this requirement is weaker than independence, and, in principle, no source separation can be obtained by only constraining second-order statistics, at least if no additional requirement is satisfied. However, our present aim is the enhancement of the overlapped patterns, especially of those that are hidden or barely detectable, and we experimentally found that this can be achieved in most cases even by simple decorrelation. On the other hand, while the color components of an image are usually spatially correlated, the individual classes or patterns that compose the image are at least less correlated. Thus, decorrelating the color components gives a different representation where the now uncorrelated components of the image could coincide with the single classes. Furthermore, the second-order approach is always less expensive than ICA algorithms, and due to the poor modeling or to the lack of independence of the patterns, the results from decorrelation can also be better than the ones from ICA.
2 Formulation of the Problem
Let us assume that each pixel (of index t in a total of T) of a multispectral scan of a document has a vector value x(t) of N components. Similarly, let us assume to have M superimposed sources represented, at each pixel t, by the vector s(t). Since we consider images of documents containing homogeneous texts or drawings, we can also reasonably assume that the color of each source is almost uniform, i.e., we will have mean reflectance indices A_ij for the j-th source at the i-th wavelength. Thus, we will have a collection of T samples from a random N-vector x, which is generated by linearly and instantaneously mixing the components of a random M-vector s through an N×M mixing matrix A:
$$x(t) = A\,s(t), \qquad t = 1, 2, \ldots, T \tag{1}$$
where the source functions s_i(t), i = 1, 2, ..., M, denote the "quantity" of the M patterns that concur to form the color at point t. Estimating s(t) and A from knowledge of x(t) is called a problem of blind source separation (BSS). In this application, we assume that noise and blur can be neglected. When only the visible color scan is available, vector x(t) has dimension N = 3 (it is composed of the red, green, and blue channels). However, most documents can be seen as the superposition of only three (M = 3) different sources, or classes, that we will call "background", "main text" and "interfering texture". In general, by using multispectral/hyperspectral sensors, the "color" vector can assume a dimension greater than 3. Likewise, we can also have M > 3 if additional patterns are present in the original document. In this paper, we only consider the case M = N, that is, the same number of sources as of observations, although in principle there is no difference with the general case.
It is easy to see that this model does not perfectly account for the phenomenon of interfering texts in documents, which derives from complicated chemical processes of ink diffusion and paper absorption. Just to mention one aspect, in the pixels where two texts are superimposed on each other, the resulting color is not the vector sum of the colors of the two components, but it is likely to be some nonlinear combination of them. In [8], a nonlinear model is derived even for the simpler phenomenon of show-through. However, although the linear model is only a rough approximation, it has demonstrated its usefulness in different applications, as already mentioned above [4][10].
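The linear model of eq. (1) can be illustrated with a short simulation. This is only a sketch under stated assumptions: the image size, the random source maps, and the entries of the mixing matrix `A` are invented for illustration, and noise and blur are neglected as in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

height, width = 32, 32          # hypothetical image size
T = height * width              # number of pixels
M = N = 3                       # sources and observations (M = N, as in the text)

# Hypothetical source maps: the "quantity" of each pattern at each pixel.
background = np.ones((height, width))
main_text = (rng.random((height, width)) < 0.2).astype(float)
interfering = (rng.random((height, width)) < 0.1).astype(float)

# Stack the sources as rows: s has shape (M, T), one column per pixel t.
s = np.stack([background, main_text, interfering]).reshape(M, T)

# Assumed N x M mixing matrix: entry (i, j) weights source j in channel i.
A = np.array([[0.9, 0.2, 0.4],
              [0.8, 0.5, 0.1],
              [0.6, 0.3, 0.7]])

x = A @ s                       # eq. (1) applied to all pixels at once
views = x.reshape(N, height, width)

# If A were known and nonsingular, the sources would follow by inversion.
s_recovered = np.linalg.solve(A, x)
```

The last line shows the non-blind case. In the blind setting only `x` is available, and both `A` and `s` have to be estimated.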
3 The Proposed Solutions: ICA, PCA, and Whitening
When no additional assumption is made, problem (1) is clearly underdetermined, since any nonsingular choice for A can give an estimate of s(t) that accounts for the evidence x(t). Even if no specific information is available, statistical assumptions can often be made on the sources. In particular, it can be assumed that the sources are mutually independent. If this assumption is justified, both A and s can be estimated from x. As mentioned in the introduction, this is the ICA approach [6]. If the prior distribution for each source is known, independence is equivalent to assuming a factorized form for the joint prior distribution of s:
$$P(s(t)) = \prod_{i=1}^{N} P_i(s_i(t)) \quad \forall t \tag{2}$$
The separation problem can be formulated as the maximization of eq. 2, subject to the constraint x = As. This is equivalent to the search for a W, W = (w_1, w_2, ..., w_N)^T, such that, when applied to the data x = (x_1, x_2, ..., x_N), produces the set of vectors w_i^T x that are maximally independent, and whose distributions are given by the P_i. By taking the logarithm of eq. 2, the problem solved by ICA algorithms is then:
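$$\hat{W} = \arg\max_{W} \sum_{t} \sum_{i} \log P_i\left(w_i^T x(t)\right) + T \log\left|\det(W)\right| \tag{3}$$
Matrix Ŵ is an estimate of A^{-1}, up to arbitrary scale factors and permutations on the columns. Hence, each vector ŝ_i = ŵ_i^T x is one of the original source vectors up to a scale factor.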
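As an illustration of eq. (3), the objective can be evaluated for a candidate W as in the sketch below. The 1/cosh prior is an assumption (one common super-Gaussian choice), not necessarily the prior used by the authors, and the function names are hypothetical. An actual ICA algorithm would maximize this quantity over W; that optimization is not shown here.

```python
import numpy as np

def log_prior_super_gaussian(u):
    """Log-density, up to an additive constant, of the assumed prior
    P(u) proportional to 1 / cosh(u)."""
    # logaddexp(u, -u) - log(2) equals log(cosh(u)) and avoids overflow.
    return -(np.logaddexp(u, -u) - np.log(2.0))

def ica_log_likelihood(W, x):
    """Objective of eq. (3): sum over pixels t and components i of
    log P_i(w_i^T x(t)), plus T * log|det(W)|.
    W has shape (N, N) with rows w_i^T; x has shape (N, T)."""
    T = x.shape[1]
    u = W @ x                                # row i holds w_i^T x(t) for all t
    _, logabsdet = np.linalg.slogdet(W)      # log|det(W)|
    return log_prior_super_gaussian(u).sum() + T * logabsdet

# Hypothetical data: 3 channels, 500 pixels.
rng = np.random.default_rng(0)
x = rng.laplace(size=(3, 500))
value_identity = ica_log_likelihood(np.eye(3), x)
value_scaled = ica_log_likelihood(2.0 * np.eye(3), x)
```

The `T * log|det(W)|` term is what keeps the maximization from collapsing W toward zero, where every prior term would be at its maximum.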
Besides independence, to make separation possible a necessary extra condition for the sources is that they all, but at most one, must be non-Gaussian. To enforce non-Gaussianity, generic super-Gaussian or sub-Gaussian distributions can be used as priors for the sources. These have proven to give very good estimates for the mixing matrix and for the sources as well, no matter what the true source distributions are, which, on the other hand, are usually unknown [2].
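The distinction between super-Gaussian and sub-Gaussian distributions can be checked numerically. The text does not say how non-Gaussianity is measured; the sketch below assumes excess kurtosis as the measure, and the function name and sample sizes are illustrative.

```python
import numpy as np

def excess_kurtosis(samples):
    """Sample excess kurtosis: about 0 for a Gaussian, positive for
    super-Gaussian and negative for sub-Gaussian distributions."""
    centered = samples - samples.mean()
    variance = np.mean(centered ** 2)
    return np.mean(centered ** 4) / variance ** 2 - 3.0

rng = np.random.default_rng(0)
n = 50_000
laplace_kurt = excess_kurtosis(rng.laplace(size=n))       # super-Gaussian
uniform_kurt = excess_kurtosis(rng.uniform(-1, 1, n))     # sub-Gaussian
gauss_kurt = excess_kurtosis(rng.normal(size=n))          # close to zero
```

A sparse pattern such as thin text strokes on a mostly empty page would be expected to fall on the super-Gaussian side, but this is an assumption about typical documents, not a result from the paper.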
Although we already obtained some promising results by this approach [10], there is no apparent physical reason why our original sources should be mutually independent, so, even if the data model (1) were correct, the ICA principle is not assured to be able to separate the different classes. However, it is intuitively clear that one can try to maximize the information content in each component of the data vector by decorrelating the observed image channels. To avoid cumbersome notation, and without loss of generality, let us assume to have zero-mean data vectors. We thus seek a linear transformation y(t) = Wx(t) such that <y_i y_j> = 0 for all i, j = 1, ..., M, i ≠ j, where W is generally an M×N matrix and the notation <·> means expectation. In other words, the components of the transformed data vector y are orthogonal. It is clear that this operation is not unique, since, given an orthonormal basis of a subspace, any rigid rotation of it still yields an orthonormal basis of the same subspace.
It is well known that linear data processing can help to restore color text images, although the linear model is not fully justified. In [7], the authors compare the effect of many fixed linear color transformations on the performance of a recursive segmentation algorithm. They argue that the linear transformation that obtains maximum-variance components is the most effective. They thus derive a fixed transformation that, for a large class of images, approximates the Karhunen-Loeve transformation, which is known to give orthogonal output vectors, one of which has maximum variance. This approach is also called principal component analysis (PCA), and one of its purposes is to find the most useful among a number of variables [3]. Our data covariance matrix is the N×N matrix:
$$R_{xx} = \langle x x^T \rangle \approx \frac{1}{T} \sum_{t=1}^{T} x(t)\, x^T(t) \tag{4}$$
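The estimate in eq. (4) can be computed directly from the channels of a scanned image, as sketched below. The image here is synthetic, and the function names and the (H, W, N) array layout are assumptions made for illustration.

```python
import numpy as np

def channels_to_data(image):
    """Reshape an (H, W, N) image into zero-mean data of shape (N, T),
    where T = H * W is the number of pixels."""
    n_channels = image.shape[-1]
    x = image.reshape(-1, n_channels).T.astype(float)
    return x - x.mean(axis=1, keepdims=True)

def sample_covariance(x):
    """Estimate R_xx as in eq. (4): (1/T) * sum over t of x(t) x(t)^T."""
    T = x.shape[1]
    return (x @ x.T) / T

# Hypothetical 3-channel image whose channels share a common pattern.
rng = np.random.default_rng(0)
pattern = rng.random((16, 16))
image = np.stack(
    [pattern + 0.1 * rng.random((16, 16)) for _ in range(3)], axis=-1
)

x = channels_to_data(image)
R_xx = sample_covariance(x)
```

Because the three synthetic channels share the same underlying pattern, the off-diagonal entries of `R_xx` are far from zero, which is the situation the decorrelating transformations are meant to remove.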
Since the data are normally correlated, matrix R_xx will be nondiagonal. The covariance matrix of vector y is:
$$R_{yy} = \langle W x x^T W^T \rangle = W R_{xx} W^T \tag{5}$$
To obtain orthogonal y, R_yy should be diagonal. Let us perform the eigenvalue decomposition of matrix R_xx, and call V_x the matrix of the eigenvectors of R_xx, and Λ_x the diagonal matrix of its eigenvalues, in decreasing order. Now, it is easy to verify that all of the following choices for W yield a diagonal R_yy:
$$W_o = V_x^T \tag{6}$$
$$W_w = \Lambda_x^{-1/2} V_x^T \tag{7}$$
$$W_s = V_x \Lambda_x^{-1/2} V_x^T \tag{8}$$
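The three choices of W in eqs. (6)-(8) can be checked numerically. The sketch below takes Λ_x to the power -1/2 in (7) and (8), as whitening requires; the test data, the mixing matrix, and the function name are assumptions made for illustration.

```python
import numpy as np

def decorrelation_matrices(R_xx):
    """Return (W_o, W_w, W_s) of eqs. (6)-(8) from a covariance matrix.
    Assumes R_xx is symmetric with strictly positive eigenvalues."""
    eigvals, V = np.linalg.eigh(R_xx)            # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder to decreasing
    eigvals, V = eigvals[order], V[:, order]
    inv_sqrt = np.diag(1.0 / np.sqrt(eigvals))   # Lambda_x^(-1/2)
    W_o = V.T                                    # PCA, eq. (6)
    W_w = inv_sqrt @ V.T                         # whitening, eq. (7)
    W_s = V @ inv_sqrt @ V.T                     # symmetric whitening, eq. (8)
    return W_o, W_w, W_s

# Hypothetical correlated zero-mean data: N = 3 channels, T = 2000 pixels.
rng = np.random.default_rng(0)
mixing = np.array([[1.0, 0.5, 0.2],
                   [0.3, 1.0, 0.4],
                   [0.2, 0.6, 1.0]])
x = mixing @ rng.laplace(size=(3, 2000))
x -= x.mean(axis=1, keepdims=True)
R_xx = x @ x.T / x.shape[1]

W_o, W_w, W_s = decorrelation_matrices(R_xx)
R_pca = W_o @ R_xx @ W_o.T       # diagonal, eigenvalues on the diagonal
R_white = W_w @ R_xx @ W_w.T     # identity
R_sym = W_s @ R_xx @ W_s.T       # identity

# An orthogonal matrix Q applied on the left of a whitening matrix
# gives another whitening matrix.
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
R_rotated = (Q @ W_w) @ R_xx @ (Q @ W_w).T
```

All three matrices diagonalize the covariance of eq. (5). They differ only in the scaling and rotation of the outputs, which is why second-order statistics alone cannot single out one of them.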
Matrix W_o produces a set of vectors y_i(t) that are orthogonal to each other and whose variances are equal to the eigenvalues of the data covariance matrix. This is what PCA does [3]. By using matrix W_w, we obtain a set of orthogonal vectors of unit norms, i.e. orthogonal vectors located on a spherical surface (whitening, or Mahalanobis transform). This property still holds true if any whitening matrix is multiplied from the left by an orthogonal matrix. In particular, if we use matrix W_s defined in (8), we have a whitening matrix with the further property of being symmetric. In [3], it is observed that application of matrix W_s is equivalent to ICA when matrix A is symmetric. In general, ICA applies a further rotation to the output vectors, based on higher-order statistics.
4 Experimental Results and Concluding Remarks
Our experimental work has consisted in applying the above matrices to typical images of ancient documents, with the aim of emphasizing the hidden document features in the whitened vectors. For each test image, the results are of course different for different whitening matrices. However, it is interesting to note that the symmetric whitening matrix often performs better than ICA, and, in some cases, it can also achieve a separation of the different components, which is the final aim of BSS. Here, we show some examples from our extensive experimentation. The first example (Figure 1) describes the processing of an ancient manuscript which presents three overlapped patterns: a main text, an underwriting barely visible in the original image, and a noisy background with significant paper folds. We compared the results of the FastICA algorithm [5][10], the PCA, and the symmetric whitening, all applied to the RGB channels, and found that full separation and enhancement of the three classes is obtained by the symmetric orthogonalization only. ICA failure might depend, in this case, on the data model inaccuracy and/or the lack of mutual independence of the classes.
Fig. 1. Full separation with symmetric orthogonalization: (a) grayscale representation of the color scan of an ancient manuscript containing a partially hidden text; (b) first symmetric orthogonalization output from the RGB components of the color image; (c) second symmetric orthogonalization output from the same data set.
In Figure 2, we report another example where a paper watermark pattern is detected and extracted. In this case, we assume the document to consist of only two classes: the foreground pattern, with drawings and text, and the background pattern with the watermark, so that only two views are needed. We used two infrared acquisitions, the first taken under front illumination, the second taken with illumination from the back. In this case a good extraction is achieved by using all three proposed methods. However, the best one is obtained with FastICA. Finally, Figure 3 shows a last example of extraction of a faint underlying pattern, using the RGB components. In this case, all three proposed methods performed similarly.
Fig. 2. Watermark detection: (a) infrared front view; (b) back illumination infrared view; (c) one FastICA output.
Fig. 3. Detection of an underlying pattern: (a) grayscale version of the original color document; (b) underlying pattern detected by symmetric orthogonalization.
These experiments confirmed our initial intuition about the validity of BSS techniques for enhancing and separating the various features that appear as overlapped in many ancient documents. No conclusions can be drawn, instead, about the superiority of one method over the others for all documents. We can only say that, when the main goal is to enhance partially hidden features, at least one of the three methods proposed always succeeded in reaching this goal in all our experiments. The advantages of these techniques are that they are quite simple and fast, [...]
Composition of the Isyreadet consortium: TEA SAS (Catanzaro, Italy), Art Innovation (Oldenzaal, The Netherlands), Art Conservation (Vlaardingen, The Netherlands), Transmedia (Swansea, UK), Atelier Quillet (Loix, France), Acciss Bretagne (Plouzane, France), ENST (Brest, France), CNR-ISTI (Pisa, Italy), CNR-IPCF (Pisa, Italy).
References
1. http://www.art-innovation.nl/
2. Bell, A.J., Sejnowski, T.J.: Neural Computation (1995) 7:1129–1159.
3. Cichocki, A., Amari, S.-I.: Adaptive Blind Signal and Image Processing (2002) Wiley, New York.
4. Easton, R.L.: http://www.cis.rit.edu/people/faculty/easton/k-12/index.htm
5. Hyvärinen, A., Oja, E.: Neural Networks (2000) 13:411–430.
6. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis (2001) John Wiley, New York.
7. Ohta, Y., Kanade, T., Sakai, T.: Computer Graphics, Vision, and Image Processing (1980) 13:222–241.
8. Sharma, G.: IEEE Trans. Image Processing (2001) 10:736–754.
9. Swift, R.: http://www.cis.rit.edu/research/thesis/bs/2001/swift/thesis.html
10. Tonazzini, A., Bedini, L., Salerno, E.: Int. J. Document Analysis and Recognition (2004) in press.