Silicon Technologies for Speaker Independent Speech Processing and Recognition Systems in Noisy Environments

Speech Recognition, Technologies and Applications

[...] recognition SoC chip is explained in this section. The robust processing step, which involves an application specific matrix processor for noise removal based on [...] speech signal and subspace based [...], is further discussed. This ASIP matrix [...] QR decomposition unit, matrix [...] Levinson-Durbin Toeplitz matrix solver, fast matrix transposition module. Discussion of [...] based system in Altera FPGA is carried out in the final section of this chapter.

Introduction to HMM based speech recognition system

[...] three categories, namely Isolated, Connected [...] recognition accuracy for very large vocabulary [...]. Connected speech (or more correctly, connected utterances) recognition is similar to isolated word recognition [...] utterances are therefore possible to be recognized. Continuous speech recognition is a method for recognizing spontaneous speech.

The system is able to recognize [...] that recognizes a specific speaker's speech, while speaker-independent systems can be used to detect speech by any unspecified speaker. Currently speaker independent [...] based quantizers which have high recognition [...]. [...] recognition system the training data must be [...] all kinds of speakers [...] vocabulary size, the higher the recognition [...]. In isolated digit recognition systems [...] achieve higher accuracy by storing finer models of the digits. Further, if the [...] is increased there is a significant reduction in the computational performance of the system. The training data needs to be generated [...].

The Isolated Word Recognition problem can be divided into two parts, namely Front End [...]. Typically, the front [...]. In [...] system we have also implemented [...] robust to the noise. The first stage in any speech recognition [...] is modeling the input speech signal based on certain objective parameters, also called the Front End parameters. Modeling of the input speech signal involves three basic operations: spectral modeling, feature extraction, and parametric transformation (Figure 1). [...] module can be added to the front end processing module which will improve [...]

3 Speech recognition architecture

NIOS II is a soft processor which can be realized in any of Altera's FPGA development kits. It is based on a 32-bit RISC architecture and is a natural choice in projects where CPU performance is essential. The NIOS processor can be run at different frequencies, based on which the computational capability of the processor can be chosen. The NIOS processor is available in three different speed grades and can be extended [...] instruction sets, and so forth. By doing so, it is possible [...] part of the system [...]

By simulating IP (firmware modules) as software objects, a system can be developed to an advanced state before it needs to be tested on the actual target. Another benefit of this approach is that [...] implementation in Altera FPGAs with separate 32-bit instruction and data buses running at [...] data from both on-chip and external [...]. The NIOS processor has 32 32-bit general purpose registers and 16 32-bit control registers, an Arithmetic Logic Unit (ALU), Exception Unit, Instruction Cache and Data Cache. This flexibility allows the user to balance the required performance of the target application [...] area cost of the soft [...] Altera are connected through a system bus [...] (memory mapped I/O). All the system [...] order they were put on; the item at the lowest memory location of the stack goes first. [...] interrupts if the Interrupt Enable (IE) bit in the Machine Status Register (MSR) is set to 1. On an interrupt the instruction in [...] complete, one has to manually enable the interrupt enable [...] control the NIOS processor must be done in C/C++. The NIOS tool has GNU based built-in C/C++ compilers and a debugger to generate the necessary machine code for the NIOS processor (Agarwal 2001). The NIOS processor supports [...], half-word (16 bits), and [...] must be on word boundaries, half-word on half-word [...] bus masters to be added simultaneously and offers excellent arbitration capabilities with [...] unique kind of hardware/software interface called custom instruction, which acts as a hardware mapped instruction to the NIOS processor ([...] 2006). We can also accelerate the software function in the NIOS [...] system [...] be integrated into the design to accelerate the underlying software. Compared to complete software [...] system

[...] acceleration improves performance by 20X. Our design utilizes this custom [...] counters, Ethernet controller, [...] controller, Flash controller, user [...] components, PLLs, [...] LCD [...] compute the execution time of a software routine or used to produce [...] intervals so as to signal some of the hardware peripherals. Hardware [...] connected to the system in two different ways. The hardware component can be configured [...] custom instruction component [...] processor through [...] instruction. NIOS [...] four different kinds of custom instruction technology, namely Combinational, Multi-cycle, Extended, and Internal Register File based custom instructions.

A custom instruction module can also be connected to the [...] can connect some of the custom instruction signals to external [...] also be interfaced to the NIOS system through the Avalon Slave or Master interface. Avalon Slave devices can have interrupts and they [...] of the processor through the interrupts. These interrupts can be prioritized manually.

Fig 5 NIOS Architecture

[...] fixed point architecture [...] finite word length effects

All DSP based designs strongly depend on the floating to fixed point [...] the DSP algorithm may not be implementable in floating point form. Fixed point analysis

of the system is extremely important to understand the nonlinear nature of the quantization characteristics. This leads to certain constraints and assumptions on quantization [...] example that the word-length [...] after [...] quantization (Meng 2004) [...] signals, assumed to be uniformly distributed, with a white and uncorrelated [...] added whenever a truncation occurs. This approximate model [...] is dramatically affected by word-length in a uniform word-length structure, decreasing at approximately [...] that it is not necessary to have highly accurate models of quantization error power in order to predict the required signal width. In a multiple word-length system realization, the implementation [...] be adjusted much more finely, and so the resulting implementation tends to be more sensitive to [...] output power resulting from an infinite precision implementation defines the signal-to-noise ratio. In order to predict the quantization effect of a particular word-length and scaling annotation, [...] propagate the word-length values and scaling from the inputs of each atomic operation to the operation output (Haykin 1992). The precision of the output not only depends on [...] of the inputs, it also depends on the algorithm to be implemented.

For example, in the fixed point implementation of the complex FFT [...] each stage of computation [...] FFT lengths more bits of precision are lost. The feature extraction stage was implemented in the NIOS processor with fixed precision inputs. The following plots describe the fixed point characteristics of the algorithm.

Fig 6 Fixed point MFCC implementation
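The relation between word length and quantization noise discussed above can be checked numerically. The following is a minimal sketch, not the chapter's fixed point model: it assumes a test signal uniformly distributed in [-1, 1) and uses rounding (rather than truncation) to a grid with a chosen number of fractional bits. The function names and the bit widths 7, 11 and 15 are illustrative choices.

```python
import numpy as np

def quantize(x, frac_bits):
    """Round x to a fixed point grid with `frac_bits` fractional bits."""
    scale = 2.0 ** frac_bits
    return np.round(x * scale) / scale

def snr_db(reference, approx):
    """Signal-to-quantization-noise ratio in dB."""
    noise = reference - approx
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

# Assumed test signal: uniformly distributed samples in [-1, 1).
rng = np.random.default_rng(0)
signal = rng.uniform(-1.0, 1.0, size=100_000)

# Measure the SNR for a few illustrative word lengths.
snr_by_bits = {b: snr_db(signal, quantize(signal, b)) for b in (7, 11, 15)}
for bits, snr in snr_by_bits.items():
    print(f"{bits:2d} fractional bits: SNR = {snr:6.2f} dB")
```

Under these assumptions each additional bit improves the SNR by roughly 6 dB, which is the behaviour the uniform white noise model predicts for a uniform word-length structure.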

[...] (Vaseghi 2004) the signal to noise ratio may vary significantly, the word may be stretched too long or too [...] model the noise HMM and subtract it from the actual speech HMM (Hermus 2007). The receiver must incorporate enough programmable parameters to be reconfigurable to take [...] reduction algorithm based on Singular Value Decomposition [...] reduce the [...] characteristics of the speech signal (Hemkumar 1991).

4.3 FCTSVD algorithm (Abut 2005)

1 Estimate the [...] periods in the observed speech signal
2 Form the Hankel matrix Hy from [...] silence periods
3 Initialize the order of retained singular values of Hx
4 Let S = S + 1 and reconstruct the estimated matrix of Hx, Hx-, using the first [...] values
5 [...]
6 Compute the Frobenius constrained norm metric and [...] error is less than 0.0098, else go to 4

4.4 Scaling

[...] speech increases, the values of the variable [...] the formal algorithm saturate, whereas the log-Viterbi algorithm used in this hardware does not suffer this [...] only additions rather than multiplications.

4.5 Initial estimates of HMM parameters

[...] were used for the A and pi matrices. However, [...] matrix cannot be initialized with random values as it has influence in convergence of the

algorithm. Since continuous hidden Markov models [...] of B, mean, and variance are obtained using the segmental K-means algorithm.

5 Project modules

1 First module is concerned with [...] and feature extraction (FRONT END PROCESSING SOFTWARE EXECUTED IN NIOS 2 PROCESSOR)
2 [...] parameters required for [...] and the application where the system is to be deployed (TRAINING, OFFLINE, DONE IN MATLAB) [...]
3 Maximum Likelihood based word recognition (PARALLEL HARDWARE)

[...] CONFIGURATION

based on 2c controller (mPu 2 AUDIO Codecster Module for Audio Codec data Retrieval with integrated SRAMry controllerfor hardware recognition part with efficient modmanagement unit based on FSMs4 The speech Controller has got the following modules builtterbi based Speech Recognition unit with memory controllers for Model parameterRAMSInput Frame buffers for feature storage with memory controller for feature storageRAMbuffermodel output storageEfficient mode management unit to switch between variLED Display unit to finally display the results6 Custom Singular Value Decomposition uniSoftwareHardware Modules interfacedCustom Instructionsv Audio serial 2 ParallelFFt based featureModule(Avalon Master)Mel Filter banksg Output Frame BuffersSpeech Recognition Modentroller tov Software backsubstitutionfor SVD Speech RecognitionTable 2 Isolated Word Recognition System Hardware/Software Partition432MHZ125MHZ

[...] configured via I2C [...] requests to read are ignored. The device is configured by writing data to internal registers [...] configured by transferring data and address of the internal registers serially through the I2C_data pin. The clock signal is applied to the I2C_clk pin. [...] USB/Normal mode master clock (AUD_XCLK, from which AUD_BCLK is generated). USB mode must [...] FIXED [...] 96 kHz) [...] 11.2896 MHz (44.1 kHz, 88.2 kHz)

This implementation utilizes normal mode clock generation at 18.432 MHz. Transfer is initiated by pulling MPU_DATA low while MPU_CLK is high. The data for the configuration of a particular internal register has 3 bytes:

Byte 1: {ADDR[6:0], 0}. ADDR[6:0] is the DEVICE ADDRESS; the resulting address byte is ALWAYS 0x34 (7-bit address 0x1A followed by the r/w bit). The last bit is the r/w bit, which is always 0 (write), since the WM8731 is write-only.
Byte 2: {REG[6:0], DATA[8]}. REG[6:0] is the 7-bit register address, DATA[8] is the MSB of [...]
[...] MPU_DATA is driven low by the CODEC between [...] confirmation.

The following operations need to be [...] to make the device operate in the intended [...]: write 0x0 to AUDIO RESET [...] device: write '0 to WM8731 POWER DOWN CTL, 7 bit [...] Turn on master mode: AUDIO INTERFACE FMT [...]

5.3 How this hardware system works

[...] Codec is configured via the CPU 2 I2C interface with the following specification:
WM8731_POWER_DOWN_CTL is used to [...] the device
WM8731_ANALOG_PATH_CTL register is set [...] enable the mic [...] facility
WM8731_SAMPLING_CTL register is set to 16h100E to fix the audio codec in NORMAL MODE with an ADC sampling frequency of 8 kHz. Codec operating frequency [...]2 MHz
Step 2: The serial input bit stream is converted to parallel data using a custom Avalon Master interface and is stored in the SRAM module. The storage of audio will be interrupted by an external user controlled switch to start the processing step.
[...] CONFIGURED) to start the feature processing of [...]
Step 4: In software the speech start and end points are detected, we perform windowing [...] use short time Fourier analysis on the speech signal on 30 ms with

Step 6: Evaluate the distance between the speech signals and do clustering using the Mixture based Block quantizer based on Mahalanobis distance [...] clustering performed
Step 7: The features are extracted and stored in the INPUT FRAME BUFFER of the Speech Recognition module
Step 8: Steps 1 to 6 will continue until the end of frame is detected by the hardware module [...] populated and each stage output is stored in OUTPUT [...]

[...] implementation of continuous hidden Markov model [...]


Log Viterbi implementation: Output probability calculation is the computationally intensive [...] lots of multiply and add operations [...] recognition algorithm is [...]

5.6 Hardware design

Our architecture (Fig 11) concentrates on the three major issues: Power, Memory (Throughput) and vocabulary size. There is always a trade-off between the operating frequency and the recognition vocabulary, word accuracy, noise suppression etc. [...] HMM based architecture which uses continuous HMM for the implementation (Cho 2002). Two essential steps in the recognition algorithm:
1 Output probability calculation
2 [...]

[...] vocabulary mode [...] Audio is stored in SRAM by [...] processor for features after [...]
5 Processor starts processing the samples to extract features and [...] complete [...] signal of Speech controller

Fig 1 Components of a speech recognition system

[...] Frequency Cepstral Co-efficient. The basic idea behind linear predictive coding (LPC) analysis is that a speech sample can be approximated as a linear combination of past speech samples. By minimizing the sum of the squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones, a unique set of predictor coefficients is determined. Speech is modeled as the output of a linear, time-varying system excited by either quasi-periodic pulses (during voiced speech) or random noise (during unvoiced speech). The linear prediction method provides a robust, reliable, and accurate method for estimating the parameters that characterize the linear time-varying system representing the vocal tract. In linear prediction (LP) the signal s(n) is modeled as a linear combination of the previous samples:

\[ s(n) = \sum_{i=1}^{N_{LP}} a_{LP}(i)\, s(n-i) + e(n) \]

where a_LP(i) are the coefficients that need to be decided, N_LP is the order of the predictor, i.e. the number of coefficients in the model, and e(n) is the model error, the residual. There exist several methods for calculating the coefficients. The coefficients of the model that approximates the signal within the analysis window (the frame) may be used as features, but usually further processing is applied. The higher the order of the LP filter used, the better the model prediction of the signal will be. A lower order model, on the other hand, captures the trend of the signal, ideally the formants. This gives a smoothened spectrum. The LP coefficients give uniform weighting to the whole spectrum, which is not consistent with the human auditory system. For voiced regions of speech the all-pole model of LPC provides a good approximation to the vocal tract spectral envelope. During unvoiced and nasalized regions of speech the LPC model is less effective than in voiced regions.
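One common way to obtain the predictor coefficients is the autocorrelation method solved with the Levinson-Durbin recursion, the same Toeplitz solver the chapter lists among its matrix processor units. The sketch below is illustrative only: the function names, the predictor order and the synthetic second-order test signal are assumptions, and the sign convention is s(n) ≈ Σ a(i) s(n-i).

```python
import numpy as np

def autocorrelation(frame, max_lag):
    """Biased autocorrelation r[0..max_lag] of one analysis frame."""
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(max_lag + 1)])

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations sum_j a[j] r[|i-j|] = r[i].

    Returns the predictor coefficients a[1..order] (convention
    s(n) ~ sum_i a[i] s(n-i)) and the final prediction error energy.
    """
    a = np.zeros(order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        # Reflection coefficient for stage i.
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / err
        new_a = a.copy()
        new_a[i] = k
        new_a[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], err

# Hypothetical test signal: s(n) = 1.3 s(n-1) - 0.6 s(n-2) + e(n).
rng = np.random.default_rng(0)
excitation = rng.standard_normal(20_000)
s = np.zeros_like(excitation)
for n in range(2, len(s)):
    s[n] = 1.3 * s[n - 1] - 0.6 * s[n - 2] + excitation[n]

coeffs, residual_energy = levinson_durbin(autocorrelation(s, 2), order=2)
print(coeffs)   # should be close to the generating coefficients 1.3 and -0.6
```

The recursion needs on the order of N_LP squared operations instead of the cubic cost of a general linear solver, which is why it suits a hardware Toeplitz unit.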

The computation involved in LPC processing is [...] lies in its ability to provide [...] of speech and in its [...]. The features derived using cepstral analysis outperform those that do not use it, and filter bank methods outperform [...] MFCC [...] they are less speaker dependent and more speaker independent.

Fourier transform based MFCC feature extraction method for front end processing (Figure 2): frame blocking, windowing, [...] computation.

\[ X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N}, \qquad 0 \le k \le N-1 \]

[...] an FFT routine. After windowing the speech signal, the Discrete Fourier Transform (DFT) is used to transfer these time-domain samples into frequency-domain ones. Direct computation of the DFT requires on the order of N^2 operations, assuming that the trigonometric functions [...], while the FFT algorithm only requires on the order of N log2 N operations, [...] widely used for speech processing to transfer speech data from the time domain to [...]

\[ X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi k n / N} \]

[...] and imaginary outputs

The square root is a monotonically increasing function and can be ignored if only the relative [...] the magnitude [...] of interest (ignoring the increased dynamic range):

\[ |X(k)|^2 = \mathrm{Re}(X(k))^2 + \mathrm{Im}(X(k))^2 \]

This computation still requires two real multiplications and [...]. A well-known approximation to the absolute value function is given by

\[ |A_r + jA_i| \approx |A_r| + |A_i| \]

A less frequently used approximation is only slightly more complex to implement but offers far better performance (refer table [...]): [...]

The above approximation [...] FFT outputs and their spectral magnitudes are taken. Human auditory [...] nonlinearity and Mel filter banks to incorporate frequency nonlinearity [...] triangular filter banks with 102 coefficients evenly spaced in Mel [...] and the cepstral vectors are extracted based on the following equation 6 (refer Figure 3):

\[ \mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \]
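The trade-off between the two multiplier-free magnitude approximations can be illustrated numerically. The coefficients of the chapter's second approximation are not recoverable from the text, so the sketch below assumes the widely used "alpha max plus beta min" form with alpha = 1 and beta = 1/2 (a shift-only choice); the function names are hypothetical.

```python
import numpy as np

def magnitude_sum(re, im):
    """|A_r + jA_i| approximated by |A_r| + |A_i|."""
    return np.abs(re) + np.abs(im)

def magnitude_alpha_max_beta_min(re, im, alpha=1.0, beta=0.5):
    """|A_r + jA_i| approximated by alpha*max(|A_r|,|A_i|) + beta*min(|A_r|,|A_i|).

    alpha and beta are assumed values, not taken from the chapter.
    """
    a, b = np.abs(re), np.abs(im)
    return alpha * np.maximum(a, b) + beta * np.minimum(a, b)

# Sweep unit-magnitude phasors so the true magnitude is exactly 1.
angles = np.linspace(0.0, 2.0 * np.pi, 3601)
re, im = np.cos(angles), np.sin(angles)

err_sum = np.max(np.abs(magnitude_sum(re, im) - 1.0))
err_amb = np.max(np.abs(magnitude_alpha_max_beta_min(re, im) - 1.0))
print(f"worst-case error, |Ar|+|Ai|       : {err_sum:.3f}")
print(f"worst-case error, max + 0.5 * min : {err_amb:.3f}")
```

Both forms avoid the square root and the two multiplications of the exact magnitude; under the assumed coefficients the second form has a clearly smaller worst-case error at the cost of one comparison and one shift.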

\[
H_m(k) =
\begin{cases}
0, & k < f(m-1) \\
\dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\
\dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) \le k \le f(m+1) \\
0, & k > f(m+1)
\end{cases}
\]

Fig 3 Mel Filter Banks

[...] symmetric and real, the inverse DFT is reduced to the discrete cosine transform (DCT). This transformation decorrelates features, which leads to using diagonal covariance matrices instead of full covariance matrices while modeling the feature coefficients by linear combinations of Gaussian functions. Therefore complexity and computational cost can be reduced. This is especially useful for speech recognition systems. Since the DCT gathers most of the information in the signal into its lower order coefficients, by discarding the higher order coefficients a significant reduction in computational cost can be achieved.
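The triangular filter definition and the DCT step can be sketched in floating point as follows. This is an illustration under assumed parameters (20 filters, a 256-point FFT, 8 kHz sampling, 13 cepstra), not the chapter's fixed point implementation; the function names and the bin-rounding rule are assumptions.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(num_filters, fft_size, sample_rate):
    """Triangular filters H_m(k) with centre bins f(m) evenly spaced in mel."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                             num_filters + 2)
    bins = np.floor((fft_size + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((num_filters, fft_size // 2 + 1))
    for m in range(1, num_filters + 1):
        left, centre, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, centre):            # rising slope
            bank[m - 1, k] = (k - left) / (centre - left)
        for k in range(centre, right):           # falling slope, peak at centre
            bank[m - 1, k] = (right - k) / (right - centre)
    return bank

def mfcc_from_power_spectrum(power_spectrum, bank, num_ceps):
    """Log filter bank energies followed by a DCT-II, keeping num_ceps terms."""
    log_energies = np.log(bank @ power_spectrum + 1e-10)
    m = np.arange(len(log_energies))
    n = np.arange(num_ceps)[:, None]
    dct_basis = np.cos(np.pi * n * (m + 0.5) / len(log_energies))
    return dct_basis @ log_energies

# Assumed front end parameters for the example.
bank = mel_filter_bank(num_filters=20, fft_size=256, sample_rate=8000)
frame = np.hamming(256) * np.sin(2.0 * np.pi * 440.0 * np.arange(256) / 8000.0)
power = np.abs(np.fft.rfft(frame)) ** 2
cepstra = mfcc_from_power_spectrum(power, bank, num_ceps=13)
print(cepstra.shape)
```

Keeping only the first few DCT outputs is what allows the later Gaussian models to use diagonal covariance matrices at low cost.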

Typically the number of coefficients k for recognition ranges between 8 and 13. [...] cepstral coefficients to overall [...] window to minimize these sensitivities. We have used weighing by a band pass filter of the [...] feature vector for forming the speech frames. They can be used with the cepstral derivative [...] give acceptable recognition accuracy. Cepstral representation [...] component of the spectrum. In practical applications [...]

\[ \Delta C_m(t) = \frac{\partial C_m(t)}{\partial t} \approx \mu \sum_{k=-K}^{K} k\, C_m(t+k), \qquad 0 \le m \le M \]

where \mu is a normalization factor.
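The delta approximation above can be sketched as a short regression over neighbouring frames. The window half-width K = 2, the edge padding, and the choice of the normalization factor as 1 / (2 Σ k²) are assumptions made for this illustration; the chapter does not state its values.

```python
import numpy as np

def delta(features, half_window=2):
    """Regression-based time derivative of a (num_frames, num_coeffs) array.

    Implements mu * sum_{k=-K..K} k * C(t + k) with mu = 1 / (2 * sum k^2),
    repeating the first and last frames at the edges (assumed padding rule).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((half_window, half_window), (0, 0)), mode="edge")
    mu = 1.0 / (2.0 * sum(k * k for k in range(1, half_window + 1)))
    out = np.zeros_like(features)
    for k in range(-half_window, half_window + 1):
        out += k * padded[half_window + k: half_window + k + num_frames]
    return mu * out

# Hypothetical static features: 50 frames of 13 coefficients each.
rng = np.random.default_rng(0)
static = rng.standard_normal((50, 13))
d1 = delta(static)            # delta coefficients
d2 = delta(d1)                # delta-delta coefficients
feature_vectors = np.hstack([static, d1, d2])
print(feature_vectors.shape)
```

With 13 static terms per frame, stacking static, delta and delta-delta parts gives the 39-dimensional vectors used later for the continuous HMM.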

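Typical feature vector (Figure 4):

\[ [\, c_1(t), c_2(t), \ldots, c_M(t),\ \Delta c_1(t), \Delta c_2(t), \ldots, \Delta c_M(t),\ \Delta\Delta c_1(t), \Delta\Delta c_2(t), \ldots, \Delta\Delta c_M(t) \,] \]

The feature vector consists of both the static part and the dynamic part of the speech signal.

Fig 4 Representation of Delta and Delta-Delta parameters

[...] efficiently compute P(O|λ), the probability of the observation sequence, given the model [...] corresponding [...] sense (i.e. best explains the observation)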

The Viterbi algorithm [...] find the optimal [...] the model parameters λ = (A, B, π) to [...] P(O|λ). This is by far the most difficult problem of HMM. We choose λ = (A, B, π) in such a way that its likelihood, P(O|λ), is locally maximized using an iterative procedure like the Baum-Welch method (L Rabiner 1993). The base speech recognizer works with noiseless HMM states, and the matrix processor is used as a pre-conditioning block to generate the noiseless HMM models from the noisy [...] (Vaseghi), in which a continuous HMM model [...] to model the HMM states. A model is characterized by the number of states N, the number of distinct observation symbols M, the state transition probability matrix A, the initial probability matrix π, and the output observation probability for a feature x_t in state i, b_i(x_t).

1) Initialization

\[ \delta_1(i) = \log \pi_i + \log b_i(x_1), \qquad \psi_1(i) = 0 \]

2) Recursion

\[ \delta_t(j) = \max_i \left[ \delta_{t-1}(i) + \log a_{ij} \right] + \log b_j(x_t), \qquad \psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i) + \log a_{ij} \right] \]

3) Termination

\[ \log P(O|\lambda) = \max_i \left[ \delta_T(i) + \log a_{iN} \right], \qquad q^{*}_T = \arg\max_i \left[ \delta_T(i) + \log a_{iN} \right] \]

The probability of the observation vectors, P(O|λ), has to be maximized for different model parameter values which correspond to HMM models for different words. The [...] Forward and Backward procedures as described in (Karthikeyan-ASICON 2007). Since the [...] of the Viterbi algorithm results in underflow, because very low probability values are multiplied recursively over the speech frame window, [...] algorithm is implemented which is different from the methods given in (Karthikeyan-ASICON [...] implementation of Forward, Backward as well as the Viterbi [...] of the above algorithm. Since the Forward algorithm [...] which is being replaced in the modified forward algorithm
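The log-domain recursion replaces the products of the standard Viterbi algorithm with additions, which is what avoids underflow over long frame windows. The sketch below is a software illustration with a hypothetical two-state model; it omits the exit transition term log a_iN in the termination step, and the function and variable names are assumptions.

```python
import numpy as np

def log_viterbi(log_pi, log_a, log_b):
    """Best state path in the log domain.

    log_pi: (N,) log initial probabilities
    log_a:  (N, N) log transition probabilities, log_a[i, j] for i -> j
    log_b:  (T, N) log output probabilities of each frame in each state
    Returns (best log score, best state path). Exit transitions are omitted.
    """
    num_frames, num_states = log_b.shape
    delta = np.empty((num_frames, num_states))
    psi = np.zeros((num_frames, num_states), dtype=int)
    delta[0] = log_pi + log_b[0]                       # initialization
    for t in range(1, num_frames):                     # recursion
        scores = delta[t - 1][:, None] + log_a         # scores[i, j]
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(num_states)] + log_b[t]
    path = np.empty(num_frames, dtype=int)             # termination + backtrack
    path[-1] = np.argmax(delta[-1])
    for t in range(num_frames - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return delta[-1, path[-1]], path

# Hypothetical 2-state model and 3 frames of output probabilities.
log_pi = np.log(np.array([0.8, 0.2]))
log_a = np.log(np.array([[0.6, 0.4],
                         [0.3, 0.7]]))
log_b = np.log(np.array([[0.9, 0.2],
                         [0.1, 0.8],
                         [0.1, 0.7]]))
best_score, best_path = log_viterbi(log_pi, log_a, log_b)
print(best_score, best_path)
```

In a word recognizer this routine would be run once per word model, and the word whose model yields the highest score is selected.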

[...] we have used the modified forward algorithm, backward algorithm as well as the Viterbi algorithm.

3 The Baum-Welch re-estimation

The third, and by far the most difficult, problem is to adjust the model parameters (A, B, π) to maximize the probability of the observation sequence. There is no known way to analytically solve for the model which maximizes the probability of the observation sequence. In fact, given any finite observation sequence as training data, there is no optimal way of estimating the model parameters. We can, however, choose λ = (A, B, π) such that P(O|λ) is locally maximized using an iterative procedure such as the Baum-Welch method. To describe the procedure for re-estimation (iterative update and improvement) of HMM parameters, we first define ξ_t(i, j), the probability of being in state S_i at time t, and state S_j at time t+1, given the model and the observation sequence.

[...] either ML or MAP classification rules, we need to create a model of the probability p(o|j) for each of the different possible classes. The PDF [...] Gaussian distribution. We can create a Gaussian model by just finding the sample mean μ and the sample covariance matrix U:

\[ p(o) = \frac{1}{(2\pi)^{m/2}\, |U|^{1/2}} \exp\!\left( -\tfrac{1}{2} (o - \mu)^{T} U^{-1} (o - \mu) \right) \]

where m is the dimension of the feature vector. The probability of being in state S_i at time t, and state S_j at time t+1, given the model and the observation sequence, is

\[ \xi_t(i, j) = P(q_t = S_i,\ q_{t+1} = S_j \mid O, \lambda) \]

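4 Cov[...]

The covariance matrix used in the model based speech recognition problem which uses N [...] univariate Gaussian [...] HMM modeling with m dimensional features can be considered in the following ways. The following 39 dimensional feature vectors are considered for designing the continuous HMM based speech recognizer:

\[ [\, c_1(t), c_2(t), \ldots, c_M(t),\ \Delta c_1(t), \Delta c_2(t), \ldots, \Delta c_M(t),\ \Delta\Delta c_1(t), \Delta\Delta c_2(t), \ldots, \Delta\Delta c_M(t),\ E(t), \Delta E(t) \,] \]

where \Delta C_m(t) and \Delta\Delta C_m(t) can be represented as below:

\[ \Delta C_m(t) = \frac{\partial C_m(t)}{\partial t} \approx \mu \sum_{k=-K}^{K} k\, C_m(t+k), \qquad \Delta\Delta C_m(t) \approx \mu \sum_{k=-K}^{K} k\, \Delta C_m(t+k), \qquad 0 \le m \le M \]

Complete [...] matrix (distance measure: Mahalanobis distance)
The complete covariance matrix, when considered, results in very high implementation complexity and cannot be easily achieved with the existing hardware.

The second [...] parameter tying (Pihl 1996). In this method [...] all the states, and other statistical characteristics are considered different [...] use of a common covariance matrix for all the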

clusters obtained during GMM block quantization and considering mean, number of [...] observation output different for each state [...] hardware. Our [...] the covariance [...] are block diagonal is valid since the use of an orthogonal transform like the DCT decorrelates the cepstral vectors. The correlation only exists between the time difference cepstral vectors, delta cepstral vectors, and the delta-delta cepstral vectors. So we can construct the covariance matrix as three element block diagonal [...], for which the inverse matrix can be easily found using Singular [...].

The last method is to consider the covariance matrix to be diagonal, which yields the simplest hardware architecture. The inverse diagonal values are stored in memory locations and only multiply operations are performed, so this method is computationally less intensive.
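The diagonal covariance case can be sketched as follows: the inverse variances and the constant term are computed once offline, so evaluating a frame needs only subtractions, multiplications and one accumulation. The function names and the random 39-dimensional test data are assumptions for illustration, not the chapter's hardware design.

```python
import numpy as np

def precompute_diag_gaussian(variance):
    """Offline step: store inverse variances and the log normalization term."""
    variance = np.asarray(variance, dtype=float)
    inv_var = 1.0 / variance
    log_norm = -0.5 * (len(variance) * np.log(2.0 * np.pi) + np.sum(np.log(variance)))
    return inv_var, log_norm

def diag_gaussian_log_likelihood(x, mean, inv_var, log_norm):
    """Per-frame step: log N(x; mean, diag(variance)) using multiplies only."""
    diff = x - mean
    return log_norm - 0.5 * np.sum(diff * diff * inv_var)

# Hypothetical 39-dimensional state model and one feature vector.
rng = np.random.default_rng(0)
mean = rng.standard_normal(39)
variance = rng.uniform(0.5, 2.0, size=39)
x = rng.standard_normal(39)

inv_var, log_norm = precompute_diag_gaussian(variance)
log_b = diag_gaussian_log_likelihood(x, mean, inv_var, log_norm)
print(log_b)
```

The result is already in the log domain, so it can feed a log-Viterbi recursion directly without an exponential or a division at run time.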

Present hardware based recognizers implement this [...] degrade the recognition performance of the system, as it doesn't efficiently represent the correlation introduced by the vector quantizer. Earlier proposed implementations are based on this method only (Karthikeyan-ASICON 2007). Where E(·) represents the statistical expectation operator [...] the cepstral vector [...] completely diagonal [...] correlation into the feature vectors through vector quantization [...] dynamic feature vector set [...] consider the feature vectors [...] among the two dynamic feature sets, delta and delta-delta feature vectors, the static feature [...] matrix can be easily obtained by linear equation solvers. Computation of the Singular [...] matrix A [...] be accelerated by the parallel two-sided Jacobi method with some pre-processing steps which would concentrate the Frobenius norm near [...] algorithm. However, the gain in speed as measured by total parallel execution time depends decisively on how efficient the implementation of the distributed QR and LQ factorizations is [...] given parallel architecture.
