Title:
APPARATUS, COMPUTING DEVICE AND METHOD FOR SPEECH ANALYSIS
Document Type and Number:
WIPO Patent Application WO/2023/099917
Kind Code:
A1
Abstract:
According to aspects of the disclosure, an apparatus, computing device and method for speech analysis is provided. The apparatus comprises one or more processors. The apparatus further comprises a memory coupled to the one or more processors, the memory comprising instructions which, when executed by the one or more processors, cause the one or more processors to implement a comprehensibility module configured to receive audio data comprising a speech sample. The comprehensibility module is further configured to extract phonemes from the speech sample to provide a string of phonemic symbols. The comprehensibility module is further configured to analyse the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols. The comprehensibility module is further configured to obtain speech metadata based on the word string, the speech metadata indicative of the meaning of the word string. The memory further comprises instructions which, when executed by the one or more processors, cause the one or more processors to implement a comparison module configured to compare the speech metadata with reference metadata indicative of the meaning of a reference word string. The comparison module is further configured to generate a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string.

Inventors:
JONES MATT (GB)
Application Number:
PCT/GB2022/053076
Publication Date:
June 08, 2023
Filing Date:
December 02, 2022
Assignee:
LEARNLIGHT UK LTD (GB)
International Classes:
G09B19/06; G06F40/247; G06F40/284; G06F40/30; G10L15/16
Foreign References:
CN108831212A (2018-11-16)
US5857173A (1999-01-05)
CN104810017B (2018-07-17)
Other References:
LI DENG ET AL: "Deep Learning: Methods and Applications", FOUNDATIONS AND TRENDS IN SIGNAL PROCESSING, vol. 7, no. 3-4, 30 June 2014 (2014-06-30), pages 197 - 387, XP055365438, ISSN: 1932-8346, DOI: 10.1561/2000000039
XIE SHASHA ET AL: "Exploring Content Features for Automated Speech Scoring", 3 June 2012 (2012-06-03), pages 103 - 111, XP055941962, Retrieved from the Internet [retrieved on 20220713]
Attorney, Agent or Firm:
HGF LIMITED (GB)
Claims:
CLAIMS

1. An apparatus for speech analysis, the apparatus comprising: one or more processors; and a memory coupled to the one or more processors, the memory comprising instructions which, when executed by the one or more processors, cause the one or more processors to implement: a comprehensibility module to: receive audio data comprising a speech sample; extract phonemes from the speech sample to provide a string of phonemic symbols; analyse the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols; and obtain speech metadata based on the word string, the speech metadata indicative of the meaning of the word string; and a comparison module to: compare the speech metadata with reference metadata indicative of the meaning of a reference word string; and generate a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string.

2. The apparatus of claim 1, wherein to obtain speech metadata based on the word string comprises one or more of: identifying one or more significant words of the word string; identifying one or more words of the word string related to a particular theme; identifying one or more words of the word string related to a particular sentiment; identifying the syntax of one or more words of the word string; and identifying one or more words of the word string related to a particular entity.

3. The apparatus of claim 1 or claim 2, wherein to analyse the string of phonemic symbols comprises: identifying the most likely words formed by the string of phonemic symbols and where word separation is most likely; and based on the identified most likely words, identifying the most likely combination of words formed by the string of phonemic symbols.

4. The apparatus of any preceding claim, wherein the comprehensibility module is further to: receive a training dataset comprising a string of phonemic symbols; and analyse the string of phonemic symbols to find the most likely combinations of phonemic symbols.

5. The apparatus of any preceding claim, wherein the comparison module is further to: compare the word string of phonemic symbols with a reference word string; and generate a further score based on the comparison, the further score based on the accuracy of the speech in the speech sample.

6. The apparatus of any preceding claim, wherein the comprehensibility module comprises: a transcription neural network having: an input layer having input nodes for receiving audio data comprising the speech sample, and an output layer coupled to the input layer through one or more neural network layers, the output layer having output nodes for outputting a word string of phonemic symbols, wherein the transcription neural network is arranged through training to map the audio data directly to the word string of phonemic symbols based on the most likely combinations of phonemic symbols; and a context neural network having: an input layer having input nodes for receiving the word string of phonemic symbols, and an output layer coupled to the input layer through one or more neural network layers, the output layer having output nodes for outputting speech metadata based on the word string, the speech metadata indicative of the meaning of the word string, wherein the context neural network is arranged through training to map the word string of phonemic symbols directly to the speech metadata based on the meaning of each word of the word string.

7. The apparatus of any preceding claim, wherein the speech and/or reference metadata comprises contextual information that provides context to the word string it is obtained from.

8. The apparatus of any preceding claim, wherein the comparison module is further to select the reference word string and/or the reference metadata based on the word string of phonemic symbols and/or the speech metadata.

9. The apparatus of any of claims 1 to 7, wherein the comparison module is further to receive an input indicative of the reference word string, and select the reference word string and/or the reference metadata based on the received input.

10. The apparatus of any preceding claim, wherein the speech and/or reference metadata comprises one or more parameters, the parameters comprising key phrases, entities, sentiment analysis and syntax.

11. The apparatus of any preceding claim, wherein the comparison module is to: for each parameter of a plurality of parameters of the speech metadata, compare the speech metadata with reference metadata, and generate a score based on the comparison; and output a final score based on the plurality of scores.

12. The apparatus of claim 11, wherein the final score is a weighted average score of the plurality of scores.

13. The apparatus of claim 12, wherein the weighted average score is standardised to produce the final score by comparing the weighted average score to a plurality of historical scores.

14. The apparatus of any preceding claim, wherein the comparison module is to output feedback based on the generated score or the final score.

15. The apparatus of claim 14, wherein the feedback is output in real time following the receipt of the audio data.

16. The apparatus of claim 14 or claim 15, wherein the comprehensibility module is further to sequentially receive audio data comprising each speech sample of a plurality of speech samples such that feedback is output by the comparison module before the comprehensibility module has finished sequentially receiving the audio data.

17. A computing device for speech analysis, the computing device comprising: the apparatus of any preceding claim; a display; and a microphone, wherein audio data is received by the comprehensibility module implemented by the one or more processors of the apparatus from the microphone, and wherein the comparison module implemented by the one or more processors of the apparatus is to output a generated score and/or feedback to the display.

18. The computing device of claim 17, wherein the comparison module is to select the reference word string and/or reference metadata based on the contents of the display.

19. A computer implemented method for speech analysis, the method comprising: receiving audio data comprising a speech sample; extracting phonemes from the speech sample to provide a string of phonemic symbols; analysing the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols; obtaining speech metadata based on the word string, the speech metadata indicative of the meaning of the word string; comparing the speech metadata with reference metadata indicative of the meaning of a reference word string; and generating a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string.

20. The computer implemented method of claim 19, the method further comprising: comparing the word string of phonemic symbols with a reference word string; and generating a further score based on the comparison, the further score based on the accuracy of the speech in the speech sample.

21. The computer implemented method of claim 19 or claim 20, the method further comprising selecting the reference word string and/or the reference metadata based on the word string of phonemic symbols and/or the speech metadata.

22. The computer implemented method of claim 19 or claim 20, the method further comprising receiving an input indicative of the reference word string, and selecting the reference word string and/or the reference metadata based on the received input.

23. The computer implemented method of any of claims 19 to 22, the method further comprising: for each parameter of a plurality of parameters of the speech metadata, comparing the speech metadata with reference metadata, and generating a score based on the comparison; and outputting a final score based on the plurality of scores.

24. A computer readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of the method of any of claims 19 to 23.

25. Means for speech analysis comprising: means for processing comprising processing hardware; and means for memory coupled to the means for processing, the means for memory comprising instructions which, when executed by the means for processing, cause the means for processing to implement: means for comprehensibility to: receive audio data comprising a speech sample; extract phonemes from the speech sample to provide a string of phonemic symbols; analyse the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols; and obtain speech metadata based on the word string, the speech metadata indicative of the meaning of the word string; and means for comparison to: compare the speech metadata with reference metadata indicative of the meaning of a reference word string; and generate a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string.


Description:
APPARATUS, COMPUTING DEVICE AND METHOD FOR SPEECH ANALYSIS

Technical Field

[0001] The present disclosure relates to an apparatus, computing device and method for speech analysis. In particular, the disclosure relates to an apparatus for scoring a speech sample of a user.

Background

[0002] Software applications for learning languages have made language learning more accessible. In these applications, on a display of a device, a user can be shown a phrase, or a question that requires the phrase to be given as the answer. The user can then record themselves saying the phrase and play back the recording to self-assess whether they recited the phrase correctly. However, as the user does not speak the language they are learning fluently, self-assessment of their own speech in that language is difficult and can result in a user believing they have said the right phrase when they have not, or vice versa. Thus, self-assessment is not an effective way for a user to learn a language.

[0003] The present disclosure has been devised in the foregoing context.

Summary

[0004] One way of improving language learning is to remove the reliance on self-assessment by providing an application that gives a user a specific phrase to say, for example by showing the user the phrase or a question where the phrase is the answer, and then assesses the user saying that specific phrase. For example, when a user records themselves saying the phrase, the application may convert the spoken phrase to text and compare this text to the text of the specific phrase that was required by the device. By comparing the two texts, the application is able to assess the difference between the words the user should have said and the words the user did say and provide such an assessment to the user.

[0005] However, the present inventors have realised that, whilst direct text comparison is seemingly advantageous, it would only provide an assessment based on the pronunciation of the words by the user and would discriminate based on a user’s accent and dialect, as the application may convert the speech to text based on one particular accent or dialect, resulting in errors if a user’s accent or dialect differs from this. This excludes learners from certain backgrounds from receiving a correct assessment, meaning they may be unable to use such an application to learn a language. The present inventors have realised that a more effective means of learning a language would be to identify whether the user understands what they are meant to be saying rather than whether they say it in exactly the right way.

[0006] Thus, viewed from one aspect, the present disclosure provides an apparatus for speech analysis. The apparatus comprises one or more processors. The apparatus further comprises a memory coupled to the one or more processors, the memory comprising instructions which, when executed by the one or more processors, cause the one or more processors to implement a comprehensibility module configured to receive audio data comprising a speech sample. The comprehensibility module is further configured to extract phonemes from the speech sample to provide a string of phonemic symbols. The comprehensibility module is further configured to analyse the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols. The comprehensibility module is further configured to obtain speech metadata based on the word string, the speech metadata indicative of the meaning of the word string. The memory further comprises instructions which, when executed by the one or more processors, cause the one or more processors to implement a comparison module configured to compare the speech metadata with reference metadata indicative of the meaning of a reference word string. The comparison module is further configured to generate a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string.

[0007] Viewed from another aspect, the present disclosure provides a computing device for speech analysis. The computing device comprises the apparatus as described herein, a display and a microphone. Audio data is received by the comprehensibility module implemented by the one or more processors of the apparatus from the microphone. The comparison module implemented by the one or more processors of the apparatus is configured to output a generated score and/or feedback to the display.

[0008] Viewed from another aspect, the present disclosure provides a computer implemented method for speech analysis. The method comprises receiving audio data comprising a speech sample. The method further comprises extracting phonemes from the speech sample to provide a string of phonemic symbols. The method further comprises analysing the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols. The method further comprises obtaining speech metadata based on the word string, the speech metadata indicative of the meaning of the word string. The method further comprises comparing the speech metadata with reference metadata indicative of the meaning of a reference word string. The method further comprises generating a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string.
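[0008a] By way of illustration only, the method of this aspect may be sketched as a simple Python pipeline. The helper implementations below are toy stand-ins for the acoustic, decoding, metadata and comparison steps; they are illustrative assumptions and are not prescribed by the disclosure.

def extract_phonemes(audio_data):
    # Stand-in for the acoustic step: a real implementation would extract
    # phonemes from the audio data and map them to phonemic symbols.
    return audio_data

def identify_word_string(phonemic_symbols):
    # Stand-in for identifying the most likely combination of words; here the
    # symbols are assumed to already form a word string.
    return phonemic_symbols

def obtain_metadata(word_string):
    # Toy "meaning" metadata: the set of content words in the word string.
    stop_words = {"a", "an", "the", "to", "please"}
    return {w for w in word_string.lower().split() if w not in stop_words}

def compare_metadata(speech_metadata, reference_metadata):
    # Score based on how much of the reference meaning the speech conveys.
    if not reference_metadata:
        return 0.0
    return len(speech_metadata & reference_metadata) / len(reference_metadata)

def analyse_speech(audio_data, reference_word_string):
    phonemic_symbols = extract_phonemes(audio_data)
    word_string = identify_word_string(phonemic_symbols)
    speech_metadata = obtain_metadata(word_string)
    reference_metadata = obtain_metadata(reference_word_string)
    return compare_metadata(speech_metadata, reference_metadata)

print(analyse_speech("please call stella", "call stella please"))  # 1.0

Running the sketch scores "please call stella" against the reference "call stella please" as 1.0, reflecting that the meaning, rather than the exact wording, is compared.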

[0009] Viewed from another aspect, the present disclosure provides a computer readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the steps of the method as described herein.

[0010] In accordance with the above aspects of the disclosure, obtaining the meaning of the word string identified from a user’s speech sample, comparing this to the meaning of a reference word string that the user was required to say and generating a score based on the comparison enables assessment of a user’s comprehensibility of a word string rather than just how they pronounce each word of the word string. This removes the discrimination based on the accent and dialect of a user.

[0011] Moreover, in each language, there are often several ways of communicating what is required. The user may therefore be communicating effectively in that they are communicating what is required but may be using words different to those expected. Thus, if they were to be assessed just on their pronunciation, they would be assessed to have bad pronunciation and so would be scored poorly even though they communicated what was required. By understanding the meaning of the words a user is saying and comparing this meaning to the meaning required by the application, if a user communicates effectively in that they are communicating what is required, they will be assessed to have good comprehensibility and be scored highly even if the words they use don’t match the words of the phrase they have been prompted to say.

[0012] Thus, the apparatus as described herein is a more accurate way of assessing a user’s language learning skills and helping a user to learn a language, because it helps the user to improve their comprehensibility rather than just their pronunciation by determining whether the user has got the meaning across effectively rather than just whether the learner has repeated the exact phrase correctly. This also enables more advanced means of language learning than getting a user to repeat a phrase. For example, this enables a conversation to be had between the user and the application where, as long as the user gets the correct meaning across, they will be assessed to have understood the language. Thus, the present application ensures that users learning a language can communicate effectively (be understood) regardless of their background, accent and dialect.

[0013] In embodiments, to obtain speech metadata based on the word string may comprise one or more of identifying one or more significant words of the word string; identifying one or more words of the word string related to a particular theme; identifying one or more words of the word string related to a particular sentiment; identifying the syntax of one or more words of the word string; and identifying one or more words of the word string related to a particular entity.

[0014] In embodiments, to analyse the string of phonemic symbols may comprise identifying the most likely words formed by the string of phonemic symbols and where word separation is most likely; and based on the identified most likely words, identifying the most likely combination of words formed by the string of phonemic symbols.

[0015] In embodiments, the comprehensibility module may be further configured to receive a training dataset comprising a string of phonemic symbols; and analyse the string of phonemic symbols to find the most likely combinations of phonemic symbols.

[0016] In embodiments, the comprehensibility module may comprise a transcription neural network having an input layer having input nodes for receiving audio data comprising the speech sample. The transcription neural network may also have an output layer coupled to the input layer through one or more neural network layers, the output layer having output nodes for outputting a word string of phonemic symbols. The transcription neural network may be configured through training to map the audio data directly to the word string of phonemic symbols based on the most likely combinations of phonemic symbols. The comprehensibility module may further comprise a context neural network having an input layer having input nodes for receiving the word string of phonemic symbols. The context neural network may also have an output layer coupled to the input layer through one or more neural network layers, the output layer having output nodes for outputting speech metadata based on the word string, the speech metadata indicative of the meaning of the word string. The context neural network may be configured through training to map the word string of phonemic symbols directly to the speech metadata based on the most likely meaning of each word of the word string.
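[0016a] By way of illustration only, the transcription and context neural networks described above might be sketched as follows, assuming a PyTorch-style implementation. The use of LSTM layers, the layer sizes and the vocabulary size are illustrative assumptions rather than details taken from the disclosure.

import torch
import torch.nn as nn

class TranscriptionNetwork(nn.Module):
    """Maps audio features to scores over a vocabulary of phonemic-symbol words."""
    def __init__(self, n_audio_features=80, hidden=256, vocab_size=10000):
        super().__init__()
        self.encoder = nn.LSTM(n_audio_features, hidden, batch_first=True)  # intermediate layer(s)
        self.output = nn.Linear(hidden, vocab_size)  # output nodes for the word string

    def forward(self, audio_frames):            # (batch, time, n_audio_features)
        states, _ = self.encoder(audio_frames)
        return self.output(states)              # per-frame scores over the vocabulary

class ContextNetwork(nn.Module):
    """Maps an identified word string to scores over a set of metadata labels."""
    def __init__(self, vocab_size=10000, embed=128, hidden=256, n_metadata_labels=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed)
        self.encoder = nn.LSTM(embed, hidden, batch_first=True)
        self.output = nn.Linear(hidden, n_metadata_labels)

    def forward(self, word_ids):                # (batch, n_words)
        states, _ = self.encoder(self.embedding(word_ids))
        return self.output(states)              # per-word metadata scores

audio = torch.randn(1, 120, 80)                       # one utterance: 120 frames of 80 features
word_scores = TranscriptionNetwork()(audio)           # shape (1, 120, 10000)
word_ids = torch.randint(0, 10000, (1, 7))            # an identified word string of 7 words
metadata_scores = ContextNetwork()(word_ids)          # shape (1, 7, 32)

Each network in the sketch has an input layer, one or more intermediate layers and an output layer, mirroring the structure described above; training, which is not shown, would arrange the mappings described.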

[0017] In embodiments, the comparison module may be further configured to compare the word string of phonemic symbols with a reference word string; and generate a further score based on the comparison, the further score based on the accuracy of the speech in the speech sample.

[0018] In embodiments, the speech and/or reference metadata may comprise contextual information that provides context to the word string it is obtained from.

[0019] In embodiments, the comparison module may be further configured to select the reference word string and/or the reference metadata based on the word string of phonemic symbols and/or the speech metadata.

[0020] In embodiments, the comparison module may be further configured to receive an input indicative of the reference word string, and select the reference word string and/or the reference metadata based on the received input.

[0021] In embodiments, the speech and/or reference metadata may comprise one or more parameters, the parameters comprising key phrases, entities, sentiment analysis and syntax.

[0022] In embodiments, the comparison module may be configured to, for each parameter of a plurality of parameters of the speech metadata, compare the speech metadata with reference metadata, and generate a score based on the comparison. The comparison module may be further configured to output a final score based on the plurality of scores.
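[0022a] By way of illustration only, the per-parameter comparison and final score might be sketched as follows; holding the metadata as sets per parameter, scoring by set overlap and the particular weights are illustrative assumptions.

def parameter_score(speech_items: set, reference_items: set) -> float:
    # Score for one parameter: fraction of reference items present in the speech metadata.
    if not reference_items:
        return 1.0
    return len(speech_items & reference_items) / len(reference_items)

def final_score(speech_metadata: dict, reference_metadata: dict, weights: dict) -> float:
    scores = {param: parameter_score(speech_metadata.get(param, set()), ref_items)
              for param, ref_items in reference_metadata.items()}
    total_weight = sum(weights[param] for param in scores)
    # Weighted average of the per-parameter scores gives the final score.
    return sum(scores[param] * weights[param] for param in scores) / total_weight

speech = {"key_phrases": {"call stella", "bring these things"}, "entities": set(), "sentiment": {"NEUTRAL"}}
reference = {"key_phrases": {"call stella", "bring these things"}, "entities": {"Stella"}, "sentiment": {"NEUTRAL"}}
weights = {"key_phrases": 0.5, "entities": 0.3, "sentiment": 0.2}
print(final_score(speech, reference, weights))  # approximately 0.7: key phrases and sentiment match, the entity was missed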

[0023] In embodiments, the final score may be a weighted average score of the plurality of scores.

[0024] In embodiments, the weighted average score may be standardised to produce the final score by comparing the weighted average score to a plurality of historical scores.
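[0024a] By way of illustration only, one way to standardise the weighted average score against historical scores is a percentile rank, as sketched below; the disclosure requires only a comparison with a plurality of historical scores, so the percentile approach is an assumption.

from bisect import bisect_right

def standardise(weighted_average: float, historical_scores: list) -> float:
    """Return the fraction of historical scores at or below this score (0..1)."""
    ordered = sorted(historical_scores)
    return bisect_right(ordered, weighted_average) / len(ordered)

history = [0.35, 0.42, 0.58, 0.61, 0.70, 0.74, 0.81, 0.88]
print(standardise(0.72, history))  # 0.625: better than 62.5% of the historical scores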

[0025] In embodiments, the comparison module may be configured to output feedback based on the generated score or the final score.

[0026] In embodiments, the feedback may be output in real time following the receipt of the audio data.

[0027] In embodiments, the comprehensibility module may be further configured to sequentially receive audio data comprising each speech sample of a plurality of speech samples such that feedback is output by the comparison module before the comprehensibility module has finished sequentially receiving the audio data.

[0028] In embodiments, the comparison module may be configured to select the reference word string and/or reference metadata based on the contents of the display.

[0029] In embodiments, the method may further comprise comparing the word string of phonemic symbols with a reference word string; and generating a further score based on the comparison, the further score based on the accuracy of the speech in the speech sample.

[0030] In embodiments, the method may further comprise selecting the reference word string and/or the reference metadata based on the word string of phonemic symbols and/or the speech metadata.

[0031] In embodiments, the method may further comprise receiving an input indicative of the reference word string, and selecting the reference word string and/or the reference metadata based on the received input.

[0032] In embodiments, the method may further comprise, for each parameter of a plurality of parameters of the speech metadata, comparing the speech metadata with reference metadata, and generating a score based on the comparison. The method may further comprise outputting a final score based on the plurality of scores.

[0033] A computer program and/or the code/instructions for performing such methods as described herein may be provided to an apparatus, such as a computer, on a computer readable medium or computer program product. The computer readable medium could be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, or a propagation medium for data transmission, for example for downloading the code over the Internet. Alternatively, the computer readable medium could take the form of a physical computer readable medium such as semiconductor or solid-state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disc, or an optical disk, such as a CD-ROM, CD-R/W or DVD.

[0034] Many modifications and other embodiments of the inventions set out herein will come to mind to a person skilled in the art to which these inventions pertain in light of the teachings presented herein. Therefore, it will be understood that the disclosure herein is not to be limited to the specific embodiments disclosed herein. Moreover, although the description provided herein provides example embodiments in the context of certain combinations of elements, steps and/or functions, different combinations of elements, steps and/or functions may be provided by alternative embodiments without departing from the scope of the invention.

Brief Description Of The Drawings

[0035] Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which like reference numerals are used to depict like parts. In the drawings:

Figure 1 shows a schematic illustration of an apparatus for speech analysis in accordance with aspects of the present disclosure;

Figure 2 shows a schematic illustration of an apparatus for speech analysis in accordance with aspects of the present disclosure;

Figure 3 shows an illustration of a comprehensibility module in accordance with aspects of the present disclosure;

Figure 4 shows a schematic illustration of a computing device for speech analysis in accordance with aspects of the present disclosure; and

Figure 5 shows a method for speech analysis in accordance with aspects of the present disclosure.

Detailed Description

[0036] Hereinafter, embodiments of the disclosure are described with reference to the accompanying drawings. However, it should be appreciated that the disclosure is not limited to the embodiments, and all changes and/or equivalents or replacements thereto also belong to the scope of the disclosure. The same or similar reference denotations may be used to refer to the same or similar elements throughout the specification and the drawings.

[0037] As used herein, the terms “have,” “may have,” “include,” or “may include” a feature (e.g., a number, function, operation, or a component such as a part) indicate the existence of the feature and do not exclude the existence of other features.

[0038] As used herein, the terms “A or B,” “at least one of A and/or B,” or “one or more of A and/or B” may include all possible combinations of A and B. For example, “A or B,” “at least one of A and B,” “at least one of A or B” may indicate all of (1) including at least one A, (2) including at least one B, or (3) including at least one A and at least one B.

[0039] As used herein, the terms “configured (or set) to” may be interchangeably used with the terms “suitable for,” “having the capacity to,” “designed to,” “adapted to,” “made to,” or “capable of” depending on circumstances.

[0040] It is to be understood that the singular forms “a,” “an,” and “the” include plural references unless the context clearly dictates otherwise.

[0041] The terms as used herein are provided merely to describe some embodiments thereof, but not to limit the scope of other embodiments of the disclosure. All terms including technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments of the disclosure belong. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.

[0042] As will be appreciated upon reading the detailed description, a phoneme is a perceptually distinct unit of sound in a specified language that distinguishes one word from another. Phonemes form spoken words and therefore any reference to phonemes also refers to spoken words. A phonemic symbol is the symbol used to represent the individual phoneme sound. A phonemic symbol therefore corresponds to a particular phoneme. Thus, by identifying phonemes, the corresponding phonemic symbol can be identified. When phonemes are identified in speech they can therefore be converted into a string of phonemic symbols. As phonemes form spoken words, phonemic symbols form written words. Therefore a string of phonemic symbols may be a string of words, the words corresponding to the spoken words formed by the corresponding phonemes. A word string is used interchangeably with a phrase throughout the application, where a phrase or word string can comprise any number of words and any number of sentences. For example, a phrase or word string may comprise one word.
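[0042a] By way of illustration only, the correspondence between phonemes, phonemic symbols and words can be represented as a small Python mapping; the IPA symbols and example words below are illustrative assumptions.

word_to_phonemic_symbols = {
    "ship":  ["ʃ", "ɪ", "p"],
    "sheep": ["ʃ", "iː", "p"],  # /ɪ/ versus /iː/ is the phoneme that distinguishes the two words
}
# A spoken phrase therefore maps to a string of phonemic symbols:
phrase = ["ship", "sheep"]
string_of_phonemic_symbols = [s for word in phrase for s in word_to_phonemic_symbols[word]]
print(string_of_phonemic_symbols)  # ['ʃ', 'ɪ', 'p', 'ʃ', 'iː', 'p']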

[0043] As will be appreciated upon reading the detailed description, whilst the below description focuses on English, it can be equally applied to any language to enable any language to be learnt.

[0044] With reference now to the figures, Figure 1 shows a schematic illustration of an apparatus 100 for speech analysis in accordance with aspects of the present disclosure. The apparatus 100 comprises one or more processors 126 and a memory 128 coupled to the one or more processors 126, the memory 128 comprising instructions 130 which, when executed by the one or more processors 126, cause the one or more processors 126 to implement a comprehensibility module 102 and a comparison module 104. Thus, the apparatus 100 effectively comprises the comprehensibility module 102 and the comparison module 104.

[0045] Throughout the application, any steps referred to herein as being performed by the comprehensibility module 102 and comparison module 104 may be implemented by the one or more processors 126 executing instructions 130 stored in the memory 128. Moreover, any reference herein to inputs to the comprehensibility module 102 and outputs from the comparison module 104 may be inputs to and outputs from the one or more processors 126 and/or the apparatus 100.

[0046] The comprehensibility module 102 is configured to receive audio data comprising a speech sample, for example from input 106. The comprehensibility module 102 is further configured to extract phonemes from the speech sample to provide a string of phonemic symbols. The comprehensibility module 102 is further configured to analyse the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols. The comprehensibility module 102 is further configured to obtain speech metadata based on the word string, the speech metadata indicative of the meaning of the word string. The comparison module 104 is configured to compare the speech metadata with reference metadata indicative of the meaning of a reference word string. The comparison module 104 is configured to generate a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string. The comparison module 104 may output the score, for example, at output 108.

[0047] The memory 128 of the apparatus 100 may comprise a plurality of reference phrases in at least one language, for example, English. A user wishing to learn a language may select the language they wish to learn. The apparatus 100 may instruct or prompt the user to say a particular reference phrase of the plurality of reference phrases, as discussed in more detail in relation to Figure 4. A user may then say a phrase in response to the prompt. The phrase spoken by the user may differ from the reference phrase based on the user’s ability to speak the language. The apparatus 100 may determine the difference in order to score the user’s ability to speak the language and provide feedback to the user on how to improve.

[0048] Thus a reference phrase, or reference word string, is a phrase that a user would say if they were fluent in a language, and may be a phrase that they are prompted to say. A reference phrase is used to determine a user’s ability to speak in a particular language. Where a user says the reference phrase, this would give the optimal score because the user has been able to speak the particular language correctly. A reference phrase may be a predetermined phrase and may be stored in the memory 128.

[0049] In response to a prompt to say a reference phrase, for example, by a display, the user may record themselves saying the reference phrase using a microphone such that the reference phrase is converted to audio data. The audio data may comprise a speech sample comprising the whole reference phrase spoken by the user. Alternatively, the audio data may comprise a speech sample comprising a reference phrase forming part of the whole reference phrase spoken by the user. For example, the reference phrase that the user has been prompted to say may be divided into smaller reference phrases and a speech sample may be received for each smaller phrase spoken. Whether the phrase is the whole phrase the user was prompted to say or only part of it, the speech sample still comprises a phrase spoken by the user and the user is still prompted to say a reference phrase.

[0050] The comprehensibility module 102 is configured to receive audio data comprising a speech sample. The audio data may be received at an input of the apparatus 100, for example, from a microphone. The microphone may convert the sound output from a user into audio data. The audio data may be a stream of bits. The audio data may be received directly from an internal or external microphone. Alternatively, the audio data may be received from another device, for example, over a communications channel. The audio data may be encoded into one of the industry formats such as M4A, FLAC, MP3 or MP4. The comprehensibility module 102 may receive the raw audio data or the encoded audio data. The audio data comprising each speech sample of a plurality of speech samples may be sequentially received. The audio data for each speech sample may be streamed or audio data comprising a plurality of speech samples may be stored and then batch processed. Streaming of the audio data, and consequently processing the audio data comprising the speech sample as soon as it is received, is advantageous to provide real time feedback to a user on their speech, i.e. before the user has finished speaking. In particular, where a speech sample is part of a phrase that a user has been prompted to say, feedback can be provided on that speech sample before the user has finished saying the phrase.

[0051] The comprehensibility module 102 is further configured to extract phonemes from the speech sample to provide a string of phonemic symbols. The phonemes may be determined by the comprehensibility module 102 and converted to phonemic symbols due to the correspondence between phonemes and phonemic symbols, as discussed above.

[0052] The comprehensibility module 102 may be trained using speech data to recognise phonemes and their corresponding phonemic symbol. For example, the comprehensibility module 102 may receive training audio data comprising speech samples that have been labelled with the phonemic symbols corresponding to the phonemes in the speech samples. For example, the comprehensibility module 102 may be trained on the Kaggle speech accent dataset, which contains audio data from speakers and the corresponding text that the speakers read. The comprehensibility module 102 may learn from the training data which phoneme of the speech sample corresponds to which phonemic symbol of the corresponding text such that when the trained comprehensibility module 102 receives the audio data comprising the speech sample it can convert the audio data into a string of phonemic symbols.

[0053] The comprehensibility module 102 is further configured to analyse the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols. A string of phonemic symbols refers to phonemic symbols that are most likely to correspond to the phonemes of the speech sample, whereas a word string of phonemic symbols refers to phonemic symbols that form likely words or phrases from the speech sample, as discussed in more detail below. The word string of phonemic symbols comprises words with spaces between them, and may also comprise punctuation and capital letters. The extraction of phonemes and identification of a word string of phonemic symbols may be performed together. For example, there may be no output of a string of phonemic symbols as the audio data may be converted straight to a word string by performing the extraction and identification.

[0054] A first stage of analysing the string of phonemic symbols may comprise identifying words formed by consecutive phonemic symbols in the string and grouping the phonemic symbols into words. Identifying words formed by consecutive phonemic symbols in the string may include identifying the likely spaces between words. Identifying words formed by consecutive phonemic symbols in the string may also include determining if consecutive phonemic symbols form a word. Where a word cannot be formed by consecutive phonemic symbols, which indicates one or more phonemic symbols are incorrect, the likely incorrect phonemic symbols in the string may be replaced in order for words to be formed. This assists in the determination and correction of any incorrect phonemic symbols. For example, the comprehensibility module 102 may extract one or more phonemes incorrectly from the speech sample and so one or more phonemic symbols in the string of phonemic symbols may be incorrect. An incorrect phonemic symbol is one that does not correspond to the phoneme output by the user. The output of this analysis may provide a word string with a space between each word. This analysis may also comprise determining where punctuation should be in the string of phonemic symbols. In this example, the word string may comprise punctuation.
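[0054a] By way of illustration only, the first stage of analysis described in paragraph [0054] might be sketched as follows; the tiny lexicon and greedy longest-match segmentation are illustrative assumptions, and a practical implementation would score alternative segmentations probabilistically.

LEXICON = {            # word -> its phonemic symbols, joined for easy matching
    "this": "ðɪs",
    "is":   "ɪz",
    "fast": "fɑːst",
}

def segment(phonemic_string: str) -> list:
    """Group a string of phonemic symbols into dictionary words, i.e. find the likely spaces."""
    words, i = [], 0
    while i < len(phonemic_string):
        for word, symbols in sorted(LEXICON.items(), key=lambda kv: -len(kv[1])):
            if phonemic_string.startswith(symbols, i):
                words.append(word)
                i += len(symbols)
                break
        else:
            i += 1         # unmatched symbol: likely extracted incorrectly, so skip it
    return words

print(segment("ðɪsɪzfɑːst"))  # ['this', 'is', 'fast']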

[0055] A second stage of analysing the string of phonemic symbols may comprise identifying the likelihood of the combination of words formed by the string of phonemic symbols in order to identify whether the combination of words in the word string identified by the first stage of analysis above is likely to be correct. The comprehensibility module 102 may utilise context to identify the likelihood of the combination of words. For each word, the comprehensibility module 102 may determine if a particular word is likely to be used in combination with the previous and subsequent words. This assists in the determination and correction of any incorrect phonemic symbols and any incorrect words. The first stage and second stage of analysis may be repeated until the comprehensibility module 102 has corrected any incorrect phonemic symbols it identifies and so produces a correct or at least highly probable word string of phonemic symbols.

[0056] For the first and/or second stage of analysis, where the comprehensibility module 102 determines that a string of phonemic symbols or a combination of words is unlikely to be used together, and so identifies that one or more phonemic symbols may be incorrect, the comprehensibility module 102 may replace one or more of the phonemic symbols or words with more likely phonemic symbols or words. To determine whether to replace one or more of the phonemic symbols or words with more likely phonemic symbols or words, the comprehensibility module 102 may take into account the likelihood that a user said the particular phonemic symbol or word and the likelihood of the combination of phonemic symbols, which includes the likelihood of the combination of words. The likelihood that a user said the particular phonemic symbol or word includes the likelihood of the corresponding phoneme in the speech sample being correctly detected. For example, where the comprehensibility module 102 determines a phoneme is highly likely to have been correctly detected, the comprehensibility module may determine the corresponding phonemic symbol is correct and should not be replaced.

[0057] However, where the comprehensibility module 102 determines it is highly likely that a user said a word but where the word and the consecutive word are rarely used together, the comprehensibility module 102 may nevertheless determine that a user must’ve said a different word. For example, if the comprehensibility module 102 determined a string of phonemic symbols “fast snail”, it may determine that the likelihood of this combination of words is very low and may determine that a more likely combination of words is “fast snake”. Whilst the likelihood of the extracted phonemes from the speech sample being “fast snail” may be higher than “fast snake”, the likelihood of the user saying “fast snake” is higher than “fast snail” as the words “fast” and “snake” are combined more often than “fast” and “snail”. Therefore the comprehensibility module 102 may replace “snail” with “snake” and so identify the word string “fast snake” instead.
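[0057a] By way of illustration only, the trade-off described in paragraph [0057] between the likelihood that the user said a word and the likelihood of the word combination might be sketched as follows; the probability values and the simple product of the two likelihoods are illustrative assumptions.

def best_word(previous_word, candidates, acoustic_prob, bigram_prob):
    # Choose the candidate maximising acoustic likelihood x combination likelihood.
    return max(candidates,
               key=lambda w: acoustic_prob[w] * bigram_prob.get((previous_word, w), 1e-6))

acoustic_prob = {"snail": 0.60, "snake": 0.35}      # "snail" sounds slightly more likely...
bigram_prob = {("fast", "snail"): 0.0005, ("fast", "snake"): 0.02}  # ...but is rarely said after "fast"

print(best_word("fast", ["snail", "snake"], acoustic_prob, bigram_prob))  # snake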

[0058] For the string of phonemic symbols, where combinations of phonemic symbols are common, they are the most likely combinations and are therefore more likely to be retained in the word string. Combinations of phonemic symbols that are not common are less likely combinations and are therefore more likely to be replaced in the word string with more common phonemic symbols. The most likely combinations of phonemic symbols may therefore refer to how common the phonemic symbols are. For example, the combination of phonemic symbols “th” and “is” are more common than the combination of phonemic symbols “t” and “is” and therefore if the string of phonemic symbols provides consecutive phonemic symbols “t” and “is”, the “t” may be replaced with “th”.

[0059] There may be no replacement of phonemic symbols when identifying the word string of phonemic symbols where the phonemic symbols are likely combinations of phonemic symbols and/or where the comprehensibility module 102 is certain that the phonemic symbols correspond to the phonemes of the speech sample.

[0060] To understand the most likely combinations of phonemic symbols, the comprehensibility module 102 may be trained using speech data to recognise common phonemic symbols, common combinations of phonemic symbols, common words and common combinations of words. For example, the comprehensibility module 102 may receive training data comprising a word string of phonemic symbols and may log how many times each pair of phonemic symbols appears. For example, for the word string “to see the”, the comprehensibility module 102 may increase the count of the pair “to see” by one and increase the count of the pair “see the” by one. A trained comprehensibility module 102 would therefore know the likelihood of a combination of phonemic symbols being present in a word string.

[0061] Additionally or alternatively, the comprehensibility module 102 may be trained using a string of phonemic symbols, for example, a block of text such as a paragraph, or a book. The comprehensibility module 102 may receive a training text dataset comprising a string of phonemic symbols and learn the most likely combinations of phonemic symbols and words using the training text dataset. This allows the comprehensibility module 102 to more accurately identify the word string of phonemic symbols, as explained above. Where the reference phrase that the user is prompted to say is known before the user speaks, the training text dataset may be the reference phrase, such that the combinations of words in the reference phrase are learnt by the comprehensibility module. This reduces the likelihood of the comprehensibility module 102 wrongly correcting the string of phonemic symbols by increasing the likelihood of the combinations of phonemic symbols in the string of phonemic symbols, and so increases the accuracy of the identified word string of phonemic symbols.
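[0061a] By way of illustration only, the pair-counting training described in paragraph [0060] might be sketched as follows, counting word pairs with collections.Counter; treating the pairs at the word level mirrors the “to see” / “see the” example above.

from collections import Counter

def count_pairs(training_word_strings):
    pair_counts = Counter()
    for word_string in training_word_strings:
        words = word_string.split()
        for first, second in zip(words, words[1:]):
            pair_counts[(first, second)] += 1   # e.g. increase the count of the pair "to see" by one
    return pair_counts

counts = count_pairs(["to see the", "to see a film"])
print(counts[("to", "see")])   # 2
print(counts[("see", "the")])  # 1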

[0062] When the word string of phonemic symbols has been identified, this may comprise punctuation and capital letters. For example, the comprehensibility module 102 may be configured to split the combination of phonemic symbols into sentences and apply full stops and capital letters. The comprehensibility module may be trained to do this, for example, by receiving training data comprising sentences including punctuation and capital letters and building statistics on such training data such as the words typically used at the beginning and end of a sentence and the typical number of words in a sentence.

[0063] The comprehensibility module 102 is further configured to obtain speech metadata based on the word string of phonemic symbols, the speech metadata indicative of the meaning of the word string. The meaning of a word string may be the point or purpose of the word string. Speech metadata and reference metadata may comprise the same parameters and be obtained in the same way. The only difference is that speech metadata is obtained from the word string identified from the speech sample and reference metadata is obtained from a reference word string. The reference word string and the speech word string are in the same language. Thus, the comprehensibility module 102 may be further configured to obtain reference metadata based on the reference word string, the reference metadata indicative of the meaning of the reference word string.

[0064] Speech metadata may comprise a collection of phrases, each phrase comprising one or more words. Speech metadata may comprise a word string smaller than the word string identified from the speech sample. Speech metadata may comprise contextual information that provides context to the word string it is obtained from. For example, the speech metadata may comprise key phrases identified from the word string. The memory 128 may comprise a list of key phrases, which may be utilised by the comprehensibility module 102 in order to identify key phrases in a word string. The speech metadata may comprise one or more of key phrases, entities, syntax, themes, and sentiment of the word string. In some examples, the speech metadata may only comprise key phrases.
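[0064a] By way of illustration only, the speech metadata described above might be held in a structure such as the following; the dataclass layout and the example part-of-speech tags are illustrative assumptions and not the disclosure’s data model.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SpeechMetadata:
    key_phrases: list = field(default_factory=list)   # each key phrase comprises one or more words
    entities: list = field(default_factory=list)
    themes: list = field(default_factory=list)
    sentiment: Optional[str] = None
    syntax: list = field(default_factory=list)         # (word, part of speech) pairs

metadata = SpeechMetadata(
    key_phrases=["please call stella"],
    entities=["Stella"],
    sentiment="NEUTRAL",
    syntax=[("please", "INTJ"), ("call", "VERB"), ("stella", "PROPN")],
)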

[0065] The comprehensibility module 102 may be able to determine which part of speech each word is from and therefore whether the word or phrase comprising the word is a key phrase, i.e. containing a significant word. The parts of speech for the English language include a noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection. In an example, the comprehensibility module 102 may be able to determine that “Sarah” is a noun and therefore determine that “Sarah” is a key phrase. The comprehensibility module 102 may be trained to understand the key phrases for example, by receiving training data including key phrases and learning which words are which part of speech and which parts of speech are key phrases.

[0066] The speech metadata may comprise phrases from the word string identified as relating to a particular theme. These phrases may be identified as key phrases. The comprehensibility module 102 may be able to determine whether each phrase relates to a particular theme and extract phrases relating to the particular theme. The comprehensibility module 102 may be trained to understand the phrases relating to a particular theme for example, by receiving training data including phrases labelled with their corresponding theme and learning which phrases relate to which theme.

[0067] The speech metadata may comprise phrases from the word string identified as relating to a particular sentiment. The comprehensibility module 102 may be able to determine whether each phrase relates to a particular sentiment and extract phrases relating to the particular sentiment. The comprehensibility module 102 may be trained to understand the phrases relating to a particular sentiment for example, by receiving training data including phrases labelled with their corresponding sentiment and learning which phrases relate to which sentiment.

[0068] The speech metadata may comprise phrases from the word string identified as relating to a particular entity. The comprehensibility module 102 may be able to determine whether each phrase relates to a particular entity and extract phrases relating to the particular entity. The comprehensibility module 102 may be trained to understand the phrases relating to a particular entity for example, by receiving training data including phrases labelled with their corresponding entity and learning which phrases relate to which entity.

[0069] The speech metadata may comprise the syntax of words in the word string. The comprehensibility module 102 may be able to determine the syntax of the words in the word string and whether a sentence is well formed. This may be done by determining which part of speech each word is from, as explained above. The comprehensibility module 102 may be trained to understand the syntax of each word, for example, by receiving training data including words labelled with their corresponding syntax.

[0070] The speech metadata may be obtained based on a word string of phonemic symbols comprising punctuation and capital letters. Additionally or alternatively, where a word string of phonemic symbols comprises punctuation and capital letters, these may be removed before the speech metadata is obtained.

[0071] The comparison module 104 is configured to compare the speech metadata with reference metadata indicative of the meaning of a reference word string. The comparison module may receive an input indicative of the reference word string, and select the reference word string and/or the reference metadata based on the received input, as explained in more detail in relation to Figure 4. Where the speech metadata is obtained based on a word string of phonemic symbols comprising punctuation and capital letters, the reference metadata is also obtained based on a reference word string of phonemic symbols comprising punctuation and capital letters. Where punctuation and capital letters are removed from the word string of phonemic symbols before the speech metadata is obtained, punctuation and capital letters are also removed from the reference word string before the reference metadata is obtained.

[0072] The phrase “word string of phonemic symbols”, as identified by the comprehensibility module, may be interchangeably used with the phrase “speech word string”. The comparison module 104 may be trained using a training dataset of phrases being spoken. For example, for the English language, the Kaggle speech accent dataset discussed above may be used to train the comparison module 104.

[0073] The speech metadata may comprise one or more of key phrases, entities and syntax. The reference metadata may comprise the same elements as the speech metadata and so may comprise one or more of key phrases, entities and syntax. This is because the key phrases, entities and whether the correct syntax is used should be treated as more important than other words in the word string because they provide the meaning of the word string and so show a user’s understanding of the language. For example, if a user does not say “a”, which is not a key phrase or entity, the word string and the meaning of the word string could still be understood. Moreover, if a user says “her”, rather than “him”, as the syntax is the same, this shows a higher understanding by a user than if the syntax was different.

[0074] A key phrase may be a noun phrase that describes a particular thing. For example, "a beautiful day" is a noun phrase that includes an article ("a") and an adjective ("beautiful"). A key phrase may comprise significant words of the word string and/or words of the word string related to a particular theme. For example, key phrase extraction from a word string about a basketball game might return the names of the teams, the name of the venue, and the final score. An entity may be a textual reference to the unique name of a real-world object such as people, places, and commercial items, and to precise references to measures such as dates and quantities. The syntax may be the part of speech of a word of the word string, for example, whether the word is a noun, verb or adjective. Each key phrase, entity and syntax may have an associated confidence value, as discussed in more detail below. The speech metadata may further comprise the sentiment of the word string and/or words within the word string.

[0075] For example, for a word string “Please call Stella. Ask her to bring these things”, a key phrase would be “Please call Stella”, which may be associated with a confidence value 0.99. An entity would be “Stella”, which may be associated with entity type “Person” and a confidence value of 0.99. The syntax of each word may be as shown in the table below.

Table 1: Syntax of an example word string

[0076] By obtaining and comparing speech metadata, it can be determined if the user has said the important and meaningful words and therefore has understood the phrase, indicating whether the user has learnt the language. For example, if a user misses “please”, this does not mean they have not understood the phrase because the meaning and understanding of the phrase is the same, indicating the user has learnt the language. Alternatively, if they do not say “call” then this changes the meaning and understanding of the phrase and so may indicate that the user has misunderstood the phrase and has not learnt the language.

[0077] The comparison module 104 may compare the speech metadata to the reference metadata in a plurality of ways in order to generate one or more scores. For example, the comparison module 104 may be configured to, for each parameter of the plurality of parameters of the speech metadata, compare the speech metadata with reference metadata, and generate one or more scores based on the comparison. The one or more scores may be final scores generated by the comparison module or the one or more scores may be further manipulated to generate a final score, as described in more detail below. The comparison module 104 may output the final score, which may also be output by apparatus 100. Examples of how the scores are generated for each parameter of speech metadata and how the final score is generated will now be described.

[0078] The comparison module 104 may compare the key phrases within the speech metadata to the key phrases within the reference metadata. Based on the comparison, the comparison module 104 may generate a key phrase score. The key phrase score may be representative of the number of key phrases of the speech metadata that are in the reference metadata. To compare the key phrases within the speech metadata to the key phrases within the reference metadata, for each key phrase of the speech metadata, the comparison module 104 may compare the key phrase to key phrases in the reference metadata and increment a variable by 1 when it finds the key phrase in the reference metadata. All variables mentioned herein may be reset to 0 before the comparison module 104 begins comparing the speech metadata and reference metadata. Where a key phrase of the speech metadata is not in the reference metadata, no value is added or subtracted from the variable. The number in the variable may therefore contain the number of key phrases in the speech metadata that are also in the reference metadata. The key phrase score may be based on the number in the variable once all of the key phrases of the speech metadata have been compared. For example, the key phrase score may equal the number of key phrases in the speech metadata that are also in the reference metadata, as provided by the variable, divided by the number of key phrases in the reference metadata. One way of obtaining the key phrase score is to utilise two variables: a first variable as described above that, once all of the key phrases of the speech metadata have been compared, contains the number of key phrases in the speech metadata that are also in the reference metadata, and a second variable that contains the number of key phrases in the reference metadata, and to divide the first variable by the second variable. In this example, where all the key phrases in the reference metadata are also in the speech metadata, this would provide a key phrase score of 1, which may be the maximum score, indicating that a user has recited the phrase well and has learnt the language well. The minimum score would be 0, where there has been no increment of the variable by 1 because none of the key phrases from the reference metadata were present in the speech metadata, indicating that a user has not recited the phrase well and has not learnt the language well.
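[0078a] By way of illustration only, the two-variable key phrase score described in paragraph [0078] might be sketched as follows.

def key_phrase_score(speech_key_phrases, reference_key_phrases):
    found = 0                                     # first variable, reset to 0 before the comparison
    total_reference = len(reference_key_phrases)  # second variable
    for phrase in speech_key_phrases:
        if phrase in reference_key_phrases:
            found += 1                            # increment by 1 when the key phrase is found
        # if the key phrase is not in the reference metadata, nothing is added or subtracted
    return found / total_reference if total_reference else 0.0

speech = ["call stella", "bring these things"]
reference = ["please call stella", "bring these things"]
print(key_phrase_score(speech, reference))  # 0.5: only one of the two reference key phrases was matched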

[0079] Each key phrase may have an associated confidence value. For example, the confidence value associated with each key phrase may indicate the level of confidence that the comparison module 104 has that the key phrase is a noun phrase. Additionally or alternatively, the confidence value associated with each key phrase may indicate the likelihood that the key phrase or each word of the key phrase is correct, as discussed previously. Additionally or alternatively, the confidence value associated with each key phrase may indicate the difficulty of the key phrase for a user to say. For example, where users often say the first key phrase of the speech word string but not the second key phrase of the speech word string, this may indicate that the second key phrase is harder to say and so the confidence value of the first key phrase may be lower than for the second key phrase. Therefore, where the confidence values are added to provide a final score, the score is higher where the user has said more difficult key phrases. The comparison module 104 may have access to a list comprising key phrases and their associated confidence values, which may be stored in memory 128. The comparison module 104 may generate the confidence values. When comparing the key phrases within the speech metadata to the key phrases within the reference metadata, for each key phrase of the speech metadata, if the key phrase is also in the reference metadata, the confidence value of the key phrase in the speech metadata may be added to the variable, rather than the variable being incremented by 1. Where a key phrase of the speech metadata is not in the reference metadata, no value is added or subtracted from the variable.

[0080] The confidence value associated with each key phrase in the speech metadata may differ from the confidence value associated with the same key phrase in the reference metadata. For example, the confidence value may depend on the other words in the sentence containing the key phrase. Therefore, even though the same key phrase may be in the reference word string and the speech word string, the confidence value of the key phrase in the speech word string may differ from the confidence value of the key phrase in the reference word string.

[0081] Thus, when comparing the key phrases within the speech metadata to the key phrases within the reference metadata, for each key phrase of the speech metadata, if the key phrase is also in the reference metadata, a difference value of 1 plus the difference between the confidence value of the key phrase in the speech metadata and the confidence value of the key phrase in the reference metadata may be added to a variable, rather than the confidence value being added to the variable or the variable being incremented by 1. Again, where a key phrase of the speech metadata is not in the reference metadata, no value is added or subtracted from the variable. In some examples, if the key phrase in the speech metadata is also in the reference metadata, before a value is added to the variable, it may be adjusted by being multiplied by p(1/lr - 1) + 1, where p is the percentage of common key phrases between the speech metadata and reference metadata and lr is the number of key phrases in the reference metadata. For example, the difference value may be adjusted by being multiplied by p(1/lr - 1) + 1 before being added to the variable. The key phrase score may be calculated by finding the average of the difference values over all key phrases of the speech metadata or the average of the adjusted difference values over all key phrases of the speech metadata, for example, by dividing the value in the variable by the number of key phrases in the speech metadata.

[0082] In the examples above, when comparing the key phrases within the speech metadata to the key phrases within the reference metadata, for each key phrase, if the key phrase in the speech metadata is not in the reference metadata, instead of no value being added to the variable, the difference between the confidence value of the key phrase in the speech metadata and the average confidence value of all the key phrases in the reference metadata may be added to the variable.

[0083] The comparison module 104 may compare the key phrases in the speech metadata to the key phrases in the reference metadata in a plurality of ways in order to generate different key phrase scores. Multiple key phrase scores may be generated for the speech metadata. The comparison module 104 may perform a comparison labelled as a key differential comparison to generate a first key phrase score. In this comparison, for each key phrase, if the key phrase in the speech metadata is also in the reference metadata, the difference value of 1 plus the difference between the confidence value of the key phrase in the speech metadata and the confidence value of the key phrase in the reference metadata is calculated and adjusted by multiplying it by p(1/lr - 1) + 1, and the adjusted difference value is added to a first key phrase variable. Where a key phrase of the speech metadata is not in the reference metadata, no value is added or subtracted from the first key phrase variable. The first key phrase score is then calculated by dividing the total in the first key phrase variable by the number of key phrases in the speech metadata to average the adjusted difference value of each key phrase over all the key phrases in the speech metadata. Thus, the first key phrase score is the average adjusted difference value of each key phrase.

[0084] The comparison module 104 may perform a comparison labelled as a key differential adjusted comparison to generate a second key phrase score. In this comparison, for each key phrase, if the key phrase in the speech metadata is also in the reference metadata, the difference value of 1 plus the difference between the confidence value of the key phrase in the speech metadata and the confidence value of the key phrase in the reference metadata is calculated and added to a second key phrase variable without adjustment. Where a key phrase of the speech metadata is not in the reference metadata, instead of no value being added to the second key phrase variable, the difference between the confidence value of the key phrase in the speech metadata and the average confidence value of all the key phrases in the reference metadata is added to the second key phrase variable.

[0085] The second key phrase score is then calculated by dividing the total in the second key phrase variable by the number of key phrases in the speech metadata to find the average. If there are no common key phrases between the speech metadata and the reference metadata, the differential adjusted comparison is the preferable comparison. The comparison module 104 may perform the differential comparison and/or the differential adjusted comparison to generate the first and/or second key phrase scores. Each key phrase score may be between 0 and 1 . The more key phrases of the speech metadata that are in the reference metadata, the higher the key phrase score. Thus, a higher score indicates the user has learnt the language well.
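A minimal Python sketch of the key differential and key differential adjusted comparisons follows. It assumes the speech and reference metadata are supplied as mappings from key phrase to confidence value, treats p as the share of reference key phrases that are common to both, and uses p(1/lr - 1) + 1 as the adjustment factor; these are assumptions, as the exact formula and data structures are not fully specified in the source.

```python
def key_differential(speech, reference):
    """Key differential comparison (first key phrase score) - a sketch.

    `speech` and `reference` map key phrases to confidence values. The
    adjustment factor p * (1 / l_r - 1) + 1 is an assumed reconstruction.
    """
    if not speech:
        return 0.0
    l_r = len(reference)
    common = [phrase for phrase in speech if phrase in reference]
    p = len(common) / l_r if l_r else 0.0              # assumed: share of common key phrases
    adjust = (p * (1.0 / l_r - 1.0) + 1.0) if l_r else 1.0
    total = 0.0                                        # first key phrase variable
    for phrase, conf in speech.items():
        if phrase in reference:
            diff = 1.0 + (conf - reference[phrase])
            total += diff * adjust
        # key phrases missing from the reference contribute nothing here
    return total / len(speech)                         # average over speech key phrases


def key_differential_adjusted(speech, reference):
    """Key differential adjusted comparison (second key phrase score) - a sketch."""
    if not speech:
        return 0.0
    avg_ref = sum(reference.values()) / len(reference) if reference else 0.0
    total = 0.0                                        # second key phrase variable
    for phrase, conf in speech.items():
        if phrase in reference:
            total += 1.0 + (conf - reference[phrase])
        else:
            total += conf - avg_ref                    # fall back to the average reference confidence
    return total / len(speech)
```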

[0086] The comparison module 104 may compare the entities within the speech metadata to the entities within the reference metadata. Based on the comparison, the comparison module 104 may generate an entity score. The entity score may be representative of the number of entities of the speech metadata that are in the reference metadata. To compare the entities within the speech metadata to the entities within the reference metadata, for each entity of the speech metadata, the comparison module 104 may compare the entity to entities in the reference metadata and increment a variable by 1 when it finds the entity in the reference metadata. Where an entity of the speech metadata is not in the reference metadata, no value is added or subtracted from the variable. The number in the variable may therefore contain the number of entities in the speech metadata that are also in the reference metadata. The entity score may be based on the number in the variable once all of the entities of the speech metadata have been compared. For example, the entity score may equal the number of entities in the speech metadata that are also in the reference metadata, as provided by the variable, divided by the number of entities in the reference metadata. One way of obtaining the entity score is to utilise two variables, a first variable as described above that, once all of the entities of the speech metadata have been compared, contains the number of entities in the speech metadata that are also in the reference metadata and a second variable that contains the number of entities in the reference metadata and to divide the first variable by the second variable. In this example, where all the entities in the reference metadata are also in the speech metadata, this would provide an entity score of 1 , which may be the maximum score, indicating that a user has recited the phrase well and has learnt the language well. The minimum score would be 0, where there has been no increment of the variable by 1 because none of the entities from the reference metadata were present in the speech metadata, indicating that a user has not recited the phrase well and has not learnt the language well.

[0087] Each entity may have an associated confidence value. For example, the confidence value associated with each entity may indicate the level of confidence that the comparison module 104 has that the entity type has been correctly detected, the entity type being the type of entity, such as people, places, items, dates and quantities. Additionally or alternatively, the confidence value associated with each entity may indicate the likelihood that the word forming the entity is correct, as discussed previously. Additionally or alternatively, the confidence value associated with each entity may indicate how difficult it is for the user to say the entity, as described previously with respect to key phrases. The comparison module 104 may have access to a list comprising entities and their associated confidence values, which may be stored in memory 128. The comparison module 104 may generate the confidence values. When comparing the entities within the speech metadata to the entities within the reference metadata, for each entity of the speech metadata, if the entity is also in the reference metadata, the confidence value of the entity in the speech metadata may be added to a variable, rather than the variable being incremented by 1. Where an entity of the speech metadata is not in the reference metadata, no value is added or subtracted from the variable.

[0088] The confidence value associated with each entity in the speech metadata may differ from the confidence value associated with the same entity in the reference metadata. For example, the confidence value may depend on the other words in the sentence containing the entity. Therefore, even though the same entity may be in the reference word string and the speech word string, the confidence value of the entity in the speech word string may differ from the confidence value of the entity in the reference word string.

[0089] Thus, when comparing the entities within the speech metadata to the entities within the reference metadata, for each entity of the speech metadata, if the entity is also in the reference metadata, a difference value of 1 plus the difference between the confidence value of the entity in the speech metadata and the confidence value of the entity in the reference metadata may be added to a variable, rather than the confidence value being added to the variable or the variable being incremented by 1. Again, where an entity of the speech metadata is not in the reference metadata, no value is added or subtracted from the variable. In some examples, if the entity in the speech metadata is also in the reference metadata, before a value is added to the variable, it may be adjusted by being multiplied by p(1/lr - 1) + 1, where p is the percentage of common entities between the speech metadata and reference metadata and lr is the number of entities in the reference metadata. For example, the difference value may be adjusted by being multiplied by p(1/lr - 1) + 1 before being added to the variable. The entity score may be calculated by taking the average of the difference values over all entities of the speech metadata or the average of the adjusted difference values over all entities of the speech metadata, for example, by dividing the value in the variable by the number of entities in the speech metadata.

[0090] In the examples above, when comparing the entities within the speech metadata to the entities within the reference metadata, for each entity, if the entity in the speech metadata is not in the reference metadata, instead of no value being added to the variable, the difference between the confidence value of the entity in the speech metadata and the average confidence value of all the entities in the reference metadata may be added to the variable.

[0091] The comparison module 104 may compare the entities in the speech metadata to the entities in the reference metadata in a plurality of ways in order to generate different entity scores. Multiple entity scores may be generated for the speech metadata. The comparison module 104 may perform a comparison labelled as an entity differential comparison to generate a first entity score. In this comparison, for each entity, if the entity in the speech metadata is also in the reference metadata, the difference value of 1 plus the difference between the confidence value of the entity in the speech metadata and the confidence value of the entity in the reference metadata is calculated and adjusted by multiplying it by p(1/lr - 1) + 1, and the adjusted difference value is added to a first entity variable. Where an entity of the speech metadata is not in the reference metadata, no value is added or subtracted from the first entity variable. The first entity score is then calculated by dividing the total in the first entity variable by the number of entities in the speech metadata to average the adjusted difference value of each entity over all the entities in the speech metadata. Thus, the first entity score is the average adjusted difference value of each entity.

[0092] The comparison module 104 may perform a comparison labelled as an entity differential adjusted comparison to generate a second entity score. In this comparison, for each entity, if the entity in the speech metadata is also in the reference metadata, the difference value of 1 plus the difference between the confidence value of the entity in the speech metadata and the confidence value of the entity in the reference metadata is calculated and added to a second entity variable without adjustment. Where an entity of the speech metadata is not in the reference metadata, instead of no value being added to the second entity variable, the difference between the confidence value of the entity in the speech metadata and the average confidence value of all the entities in the reference metadata is added to the second entity variable. The second entity score is then calculated by dividing the total in the second entity variable by the number of entities in the speech metadata to find the average. If there are no common entities between the speech metadata and the reference metadata, the comparison to generate the second entity score is the preferable comparison. The comparison module 104 may perform the entity differential comparison and/or the entity differential adjusted comparison to generate the first and/or second entity scores. Each entity score may be between 0 and 1. The more entities of the speech metadata that are in the reference metadata, the higher the entity score. Thus, a higher score indicates the user has learnt the language well.
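Because the entity and syntax comparisons follow the same counting and differential logic as the key phrase comparisons, the helpers sketched above can simply be reused; the hypothetical example below assumes the entity and syntax metadata are represented in the same way.

```python
# Hypothetical reuse of the earlier sketches for entities and per-word syntax tags.
speech_entities = {"tomorrow": 0.90, "Dr Smith": 0.80}
reference_entities = {"tomorrow": 0.95, "Dr Smith": 0.85}
first_entity_score = key_differential(speech_entities, reference_entities)
second_entity_score = key_differential_adjusted(speech_entities, reference_entities)

speech_syntax = {"I": "PRON", "would": "AUX", "like": "VERB"}
reference_syntax = {"I": "PRON", "want": "VERB", "like": "VERB"}
syntax_score = overlap_score(speech_syntax.values(), reference_syntax.values())
```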

[0093] The comparison module 104 may compare the syntax within the speech metadata to the syntax within the reference metadata. Based on the comparison, the comparison module 104 may generate a syntax score. For example, the syntax within the speech metadata may be identified for each word in the speech word string. Therefore, for each word in the speech word string, the comparison module 104 may determine whether the syntax of the word is the same as the syntax of the corresponding word in the reference word string. Thus, even if the word is different, if the syntax is the same, this indicates a higher understanding by the user than if the syntax is different.

[0094] The syntax score may be representative of how much of the same syntax of the speech metadata is in the reference metadata. Each syntax may correspond to a word in the word string. To compare the syntax within the speech metadata to the syntax within the reference metadata, for each syntax of the speech metadata, and so each word of the speech word string, the comparison module 104 may compare the syntax to syntax in the reference metadata and increment a variable by 1 when it finds the syntax in the reference metadata. Where a syntax of the speech metadata is not in the reference metadata, no value is added or subtracted from the variable. The number in the variable may therefore contain the number of syntax in the speech metadata that are also in the reference metadata. The syntax score may be based on the number in the variable once all of the syntax of the speech metadata has been compared. For example, the syntax score may equal the number of syntax in the speech metadata that are also in the reference metadata, as provided by the variable, divided by the number of syntax in the reference metadata. One way of obtaining the syntax score is to utilise two variables, a first variable as described above that, once all of the syntax of the speech metadata has been compared, contains the number of syntax in the speech metadata that are also in the reference metadata and a second variable that contains the number of syntax in the reference metadata and to divide the first variable by the second variable. In this example, where all the syntax in the reference metadata is also in the speech metadata, this would provide a syntax score of 1 , which may be the maximum score, indicating that a user has recited the phrase well and has learnt the language well. The minimum score would be 0, where there has been no increment of the variable by 1 because none of the syntax from the reference metadata were present in the speech metadata, indicating that a user has not recited the phrase well and has not learnt the language well.

[0095] The syntax of each word may have an associated confidence value. For example, the confidence value associated with the syntax of a word may indicate the level of confidence that the comparison module 104 has that the syntax has been correctly detected for that word. Additionally or alternatively, the confidence value associated with each syntax may indicate how difficult it is for the user to say the syntax, as described previously with respect to key phrases. The comparison module 104 may have access to a list comprising words, their syntax and their associated confidence values, which may be stored in memory 128. The comparison module 104 may generate the confidence values. When comparing the syntax within the speech metadata to the syntax within the reference metadata, for each syntax of the speech metadata, if the syntax is also in the reference metadata, the confidence value of the syntax in the speech metadata may be added to a variable, rather than the variable being incremented by 1 . Where a syntax of the speech metadata is not in the reference metadata, no value is added to the variable.

[0096] The confidence value associated with each syntax in the speech metadata may differ from the confidence value associated with the same syntax in the reference metadata. For example, the confidence value may depend on the other words in the sentence containing the syntax. Therefore, even though the same syntax may be in the reference word string and the speech word string, the confidence value of the syntax in the speech word string may differ from the confidence value of the syntax in the reference word string.

[0097] Thus, when comparing the syntax within the speech metadata to the syntax within the reference metadata, for each syntax of the speech metadata, if the syntax is also in the reference metadata, the difference value of 1 plus the difference between the confidence value of the syntax in the speech metadata and the confidence value of the syntax in the reference metadata may be added to a variable, rather than the confidence value being added to the variable or the variable being incremented by 1. Again, where a syntax of the speech metadata is not in the reference metadata, no value is added to the variable. In some examples, if the syntax in the speech metadata is also in the reference metadata, before a value is added to the variable, it may be adjusted by being multiplied by p(1/lr - 1) + 1, where p is the percentage of common syntax between the speech metadata and reference metadata and lr is the number of syntax in the reference metadata. For example, the difference value may be adjusted by being multiplied by p(1/lr - 1) + 1 before being added to the variable. The syntax score may be calculated by taking the average of the difference values over all syntax of the speech metadata or the average of the adjusted difference values over all syntax of the speech metadata, for example, by dividing the value in the variable by the number of syntax in the speech metadata.

[0098] In the examples above, when comparing the syntax within the speech metadata to the syntax within the reference metadata, for each syntax, if the syntax in the speech metadata is not in the reference metadata, instead of no value being added to the variable, the difference between the confidence value of the syntax in the speech metadata and the average confidence value of all the syntax in the reference metadata may be added to the variable.

[0099] The comparison module 104 may compare the syntax in the speech metadata to the syntax in the reference metadata in a plurality of ways in order to generate different syntax scores. Multiple syntax scores may be generated for the speech metadata. The comparison module 104 may perform a comparison labelled as a syntax differential comparison to generate a first syntax score. In this comparison, for each syntax, if the syntax in the speech metadata is also in the reference metadata, the difference value of 1 plus the difference between the confidence value of the syntax in the speech metadata and the confidence value of the syntax in the reference metadata is calculated and adjusted by multiplying it by p(1/lr - 1) + 1, and the adjusted difference value is added to a first syntax variable. Where a syntax of the speech metadata is not in the reference metadata, no value is added to the first syntax variable. The first syntax score is then calculated by dividing the total in the first syntax variable by the number of syntax in the speech metadata to average the adjusted difference value of each syntax over all the syntax in the speech metadata. Thus, the first syntax score is the average adjusted difference value of each syntax.

[0100] The comparison module 104 may perform a comparison labelled as a syntax differential adjusted comparison to generate a second syntax score. In this comparison, for each syntax, if the syntax in the speech metadata is also in the reference metadata, the difference value of 1 plus the difference between the confidence value of the syntax in the speech metadata and the confidence value of the syntax in the reference metadata is calculated and added to a second syntax variable without adjustment. Where a syntax of the speech metadata is not in the reference metadata, instead of no value being added to the second syntax variable, the difference between the confidence value of the syntax in the speech metadata and the average confidence value of all the syntax in the reference metadata is added to the second syntax variable. The second syntax score is then calculated by dividing the total in the second syntax variable by the number of syntax in the speech metadata to find the average. If there are no common syntax between the speech metadata and the reference metadata, the comparison to generate the second syntax score is the preferable comparison. The comparison module 104 may perform the syntax differential comparison and/or the syntax differential adjusted comparison to generate the first and/or second syntax scores. Each syntax score may be between 0 and 1. The more syntax of the speech metadata that is in the reference metadata, the higher the syntax score. Thus, a higher score indicates the user has learnt the language well.

[0101] The comparison module 104 may also compare the speech word string to the reference word string that the reference metadata is extracted from and may generate a score based on the comparison. Such a score is based on the accuracy of the speech in the speech sample. However, it is also based on the accuracy of the error correction of the comprehensibility module 102 as the comprehensibility module 102 may obtain incorrect phonemic symbols, as discussed above. Moreover, it is based on a user's accent or dialect because the comprehensibility module may obtain incorrect phonemic symbols due to the user's accent or dialect. Thus, whilst comparison of the word strings alone is less accurate, the comparison of the speech word string and the reference word string in combination with the comparison of the speech and reference metadata provides an effective and more accurate score and thus an effective and more accurate means of learning a language. Additionally, this combined comparison reduces the error contribution due to not merely relying on the similarity between two word strings.

[0102] Thus, in addition to one or more of the above comparisons, the comparison module 104 may compare the speech word string to the reference word string in order to determine how much the speech word string deviates from the reference word string. Based on the comparison, the comparison module 104 may generate a text score. The text score may be based on a comparison of the speech word string as a whole. The text score may be representative of the number of words in the speech word string that are in the reference word string, taking into account the positions of the words. The text score may be based on the removed words, i.e. the words in the reference word string that are not in the speech word string, the inserted words, i.e. the words in the speech word string that are not in the reference word string, and the words that have changed position, i.e. the words in the speech word string that are in a different position than the corresponding words in the reference word string. The inserted and removed words may contribute a score of 1. The words that have changed position may contribute a score of less than 1. Regarding the words that have changed position, the farther the word is from its original position in the reference word string, the higher the score is (indicating the recitation by the user is not as good). The text score may be calculated by summing the scores for the inserted words, removed words and words that have changed position and normalising the total score to be between 0 and 1.

[0103] The comparison module 104 may generate a first text score by performing a comparison labelled as a free text differential comparison. In this comparison, for each word of the speech word string, the comparison module 104 may compare the word to the words in the reference word string. Where the word is in the reference string in the correct position, a 0 may be added to the total text score, which may be assigned as a variable. Where the word is in the reference string in a different position, a score of less than 1 may be added to the total text score, based on the distance between the position of the word in the speech word string and the position of the word in the reference word string. Where the word is not in the reference word string, a 1 may be added to the total text score. Additionally, for each word of the reference word string, the comparison module 104 may compare the word to the words in the speech word string. Where the word is not in the speech word string, a 1 may be added to the total text score. As mentioned above, once the words of the speech word string and reference word string have been compared, the total text score may be divided by the number of distinct words in the reference word string and the speech word string, which provides a first text score of between 0 and 1 .
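A minimal sketch of the free text differential comparison is shown below. The distance-based penalty for moved words is an assumption: the source only states that it is less than 1 and grows with the distance from the original position.

```python
def free_text_differential(speech_words, reference_words):
    """Free text differential comparison (first text score) - a sketch.

    Lower is better: 0 means every word appears in its original position.
    """
    total = 0.0
    max_len = max(len(speech_words), len(reference_words), 1)
    for i, word in enumerate(speech_words):
        if i < len(reference_words) and reference_words[i] == word:
            continue                                   # correct position: adds 0
        elif word in reference_words:
            j = reference_words.index(word)            # first occurrence only, for brevity
            total += min(abs(i - j) / max_len, 0.99)   # moved word: assumed penalty < 1
        else:
            total += 1.0                               # inserted word
    for word in reference_words:
        if word not in speech_words:
            total += 1.0                               # removed word
    distinct = len(set(speech_words) | set(reference_words))
    return total / distinct if distinct else 0.0


print(free_text_differential("please call me tomorrow".split(),
                             "please call me back tomorrow".split()))
```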

[0104] In addition to generating a first text score, the comparison module 104 may generate a text output comprising the speech text string in which words that have changed position, inserted words and removed words are each prepended with a distinguishing marker, for example inserted words prepended with "+".

[0105] The comparison module 104 may generate a second text score by performing a comparison labelled as an average free text differential comparison. To perform the average free text differential comparison, the comparison module 104 may perform the same steps as for the free text differential comparison. The only difference between the average free text differential comparison and the free text differential comparison is that rather than adding 1 to the total text score for inserted and removed words, a score of less than 1 is added to the total text score. The score to be added to the total text score for each inserted and removed word is determined by the best Levenshtein distance to the word in the same or a similar position in the reference, the best Levenshtein distance being the smallest distance between the words. A Levenshtein distance is a string metric for measuring the difference between two sequences. For example, the Levenshtein distance between two words may be the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. For example, the Levenshtein distance between "for" and "far" is 1. The Levenshtein distance is a known metric and so will not be discussed in detail. By basing the score contribution on the Levenshtein distance, words that have been incorrectly converted from the audio data by the comprehensibility module 102 but are similar to words in the reference do not count as a complete miss, enabling more accurate assessment of a user's speech sample and more accurate feedback to be provided to a user. In some examples, the comparison module 104 may generate the first text score or the second text score.
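For reference, a standard two-row implementation of the Levenshtein distance is sketched below; it reproduces the "for"/"far" example given above.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string `a` into string `b`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


print(levenshtein("for", "far"))  # 1, as in the example above
```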

[0106] The comparison module 104 may generate a third text score by performing a comparison labelled as a Levenshtein distance comparison. The comparison comprises calculating a total score based on the best Levenshtein distance from the reference word string to the speech word string. The third text score may then be calculated by dividing the total score by the length of the longer string out of the reference word string and the speech word string, which provides a third text score between 0 and 1 .

[0107] The comparison module 104 may generate a fourth text score by performing a comparison labelled as an average Levenshtein distance comparison. The average Levenshtein distance comparison may find the best Levenshtein distance for each word of the speech word string. To calculate the best Levenshtein distance for each word of the speech word string, a search window may be set up within the reference word string, centred at an index point that has the same relative position in the reference word string that the word has in the speech word string. This is because the word in the speech word string should be in approximately the same relative position in the reference word string. If the word is found at that position in the reference word string, the best Levenshtein distance will be 0; if not, the best Levenshtein distance will be above 0.

[0108] Once the Levenshtein distance has been calculated for each word, the average Levenshtein distance comparison may comprise summing the Levenshtein distances to provide a total score comprising the sum of the best Levenshtein distance for each word of the speech word string. The comparison may then comprise averaging the total score. The fourth text score may then be calculated by dividing the averaged total score by the length of the longer string out of the reference word string and the speech word string, which provides a fourth text score between 0 and 1 . The first, second, third and/or fourth text scores may provide a lower score for better recitation of the word string and may therefore be inverted when added to the final score. In some examples, the comparison module 104 may generate the third text score or the fourth text score.
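The sketch below illustrates the average Levenshtein distance comparison, reusing the levenshtein helper from the previous sketch. The window size and the interpretation of "the length of the longer string" as character length are assumptions; the source does not specify either.

```python
def average_levenshtein_score(speech_words, reference_words, window=2):
    """Average Levenshtein distance comparison (fourth text score) - a sketch.

    For each speech word, a search window in the reference word string is
    centred on the index with the same relative position, and the best
    (smallest) Levenshtein distance within the window is kept.
    """
    if not speech_words or not reference_words:
        return 1.0
    distances = []
    for i, word in enumerate(speech_words):
        # index with the same relative position in the reference word string
        centre = round(i * (len(reference_words) - 1) / max(len(speech_words) - 1, 1))
        lo = max(0, centre - window)
        hi = min(len(reference_words), centre + window + 1)
        best = min(levenshtein(word, ref) for ref in reference_words[lo:hi])
        distances.append(best)
    averaged = sum(distances) / len(distances)
    longer = max(len(" ".join(speech_words)), len(" ".join(reference_words)))
    return averaged / longer if longer else 0.0
```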

[0109] The speech word string may be compared to the reference word string where both word strings comprise punctuation and capital letters. Additionally or alternatively, the speech word string may be compared to the reference word string where both word strings do not comprise punctuation and capital letters, for example, because the punctuation and capital letters are removed.

[0110] The comparison module 104 may be configured to compare the speech metadata with reference metadata indicative of the meaning of a reference word string in order to generate the key phrase score, entity score, and syntax score. The comparison module 104 may also be configured to compare the words within the speech word string to the words within the reference word string in order to generate the text score in addition to the key phrase score, entity score, and syntax score. In some examples, the comparison module 104 may be configured to perform one or more of the key differential comparison, key differential adjusted comparison, entity differential comparison, entity differential adjusted comparison, syntax differential comparison, syntax differential adjusted comparison, free text differential comparison, average free text differential comparison, Levenshtein distance comparison and average Levenshtein distance comparison. Thus, the comparison module 104 may be configured to generate and, optionally, output one or more scores, the one or more scores comprising one or more of the first key phrase score, second key phrase score, first entity score, second entity score, first syntax score, second syntax score, first text score, second text score, third text score and fourth text score. For example, the comparison module may generate a matrix of differential scores comprising one or more of the first key phrase score, second key phrase score, first entity score, second entity score, first syntax score, second syntax score, first text score, second text score, third text score and fourth text score. The variables mentioned above may be stored in memory within or external to the comparison module 104. The comparison module 104 may be configured to normalise each of the scores to between 0 and 1.

[0111] The comparison module 104 may be configured to perform each of the comparisons above twice, once where the speech word string and reference word string comprise punctuation and capital letters and once where the speech word string and reference word string do not comprise punctuation and capital letters. Therefore the one or more scores may comprise the first key phrase score, second key phrase score, first entity score, second entity score, first syntax score, second syntax score, first text score, second text score, third text score and fourth text score with and without punctuation and capital letters.

[0112] Once the one or more scores have been calculated, the comparison module 104 may combine the scores into an average score. For example, the average score may be the average of the one or more scores. In some examples, the comparison module 104 may combine the one or more scores into a weighted average score, where each score has a corresponding weighting indicating the contribution of the score to the weighted average score. For example, the first key phrase score may have a higher weighting than the first entity score and so may provide a higher contribution to the average score than the first entity score, indicating that the first key phrase score is of higher importance than the first entity score. For example, the weighted average score may comprise 20% of the first key phrase score and 10% of the first entity score. The final score may be the average score or weighted average score.
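A minimal sketch of combining differential scores into a weighted average is given below. The score names, values and weightings are illustrative assumptions; the source only gives 20% of the first key phrase score and 10% of the first entity score as an example.

```python
# Illustrative scores (higher is better) and weightings summing to 1.
scores = {
    "first_key_phrase": 0.80,
    "first_entity": 0.70,
    "first_syntax": 0.90,
    "first_text": 0.25,  # raw text scores are lower-is-better
}
scores["first_text"] = 1.0 - scores["first_text"]  # invert so that higher is better

weights = {
    "first_key_phrase": 0.4,
    "first_entity": 0.2,
    "first_syntax": 0.2,
    "first_text": 0.2,
}
weighted_average = sum(scores[name] * weights[name] for name in scores)
print(weighted_average)
```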

[0113] In order to determine the corresponding weight for each score, training data may be used. For example, the training data may comprise speech data samples of a plurality of people reading one or more reference word strings, each speech data sample labelled with a correct final score. For each reference word string, each speech data sample may be input into the comprehensibility module 102 and, in response, the comparison module 104 may output a weighted average score based on the input speech data sample, using particular weightings. The weighted average score output by the comparison module 104 may be compared to the correct final score for that speech data sample. Where the weighted average score and correct final score do not match, the weightings may be changed until they match or are at least very similar for each of the speech data samples in the training data.
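One way of choosing the weightings from labelled training data is sketched below as a naive grid search over candidate weightings, keeping the candidate whose weighted average scores best match the correct final scores. The sample values and candidate grid are purely illustrative, and a practical system might instead use a gradient-based optimiser.

```python
from itertools import product

# Per-sample differential scores paired with the labelled correct final score.
samples = [
    ({"key": 0.9, "entity": 0.8, "text": 0.7}, 0.85),
    ({"key": 0.4, "entity": 0.5, "text": 0.3}, 0.40),
]

# Candidate weightings over a coarse grid, restricted to those summing to 1.
candidates = [c for c in product([0.2, 0.4, 0.6], repeat=3) if abs(sum(c) - 1.0) < 1e-9]

def squared_error(weights):
    w = dict(zip(["key", "entity", "text"], weights))
    return sum((sum(s[k] * w[k] for k in w) - target) ** 2 for s, target in samples)

best_weights = min(candidates, key=squared_error)
print(best_weights)
```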

[0114] The training data may be the Kaggle speech accent dataset as described previously. For example, for users trying to learn English, training data may be audio data of the English speakers from the Kaggle speech accent dataset. Each piece of audio data comprising a speech data sample may be labelled, for example manually, with a correct final score. For example, each speech data sample may be input into the comprehensibility module 102 and, in response, the comparison module 104 may generate and, optionally, output one or more scores based on the speech data sample. The correct final score may then be manually chosen based on the output scores.

[0115] The comparison module 104 may standardise the weighted average score to produce the final score. The comparison module 104 may standardise the weighted average score using the training data mentioned above. Alternatively, the comparison module 104 may standardise the weighted average score using a plurality of historical scores. Standardisation is a scaling method for statistically adjusting the scores so that the values are centred around a mean with a unit standard deviation. If the mean and standard deviation of standardised scores are calculated, they will be 0 and 1 respectively. As mentioned above, for each reference word string, each speech data sample may be input into the comprehensibility module 102 and, in response, the comparison module 104 may output a weighted average score. Thus, for the training data mentioned above that uses a plurality of speech data samples, a plurality of weighted average scores may be output from the comparison module 104. The plurality of weighted average scores forms a normal distribution. The mean and, optionally, standard deviation are derived from the normal distribution formed. When the audio data comprising a speech sample is received by the comprehensibility module 102, the comparison module may then output a final score, the final score being the weighted average score standardised by subtracting the derived mean and optionally dividing by the derived standard deviation. In another example, the one or more scores may be standardised before the final score is generated. For example, the matrix of differential scores may be standardised to produce a matrix of standardised differential scores.
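The standardisation step can be sketched as ordinary z-score scaling against the distribution of weighted average scores obtained from the training data or historical results; the numbers below are illustrative only.

```python
import statistics

def standardise(score, distribution_scores):
    """Standardise a weighted average score against a distribution of
    weighted average scores (subtract the mean, divide by the standard deviation)."""
    mean = statistics.mean(distribution_scores)
    stdev = statistics.pstdev(distribution_scores) or 1.0
    return (score - mean) / stdev


historical = [0.42, 0.55, 0.61, 0.48, 0.70, 0.52]  # illustrative weighted average scores
final_score = standardise(0.58, historical)
print(final_score)
```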

[0116] The final score may be the standardised weighted average score. This final score may be between 0 and 1, where 1 means the user has fully learnt the language, i.e. the reference word string has the same meaning as the speech word string, and 0 means the user has not learnt the language at all, i.e. the reference word string has a completely different meaning to the speech word string. Thus, the final score may be referred to as a fluency score. The comparison module may be configured to output feedback based on the final score, as discussed below in relation to Figure 4. By comparing the speech metadata to the reference metadata to produce a final score, the comparison module 104 is able to compare the meaning of the speech metadata to the meaning of the reference metadata and can therefore assess whether the user has understood what they needed to say and has effectively communicated the same meaning as the reference word string. The final score will therefore score a user's comprehensibility rather than ability to repeat a phrase and therefore provides a more effective means of scoring a user's ability to learn a language because it is not dependent on how a user says a phrase. Thus, the apparatus 100 comprising the comparison module 104 provides a method of scoring the output from a user without discriminating between users. This provides a more accurate scoring method for whether a user has learnt a language because they are scored on their understanding rather than just their pronunciation.

[0117] Figure 2 shows a schematic illustration of an apparatus 200 for speech analysis in accordance with aspects of the present disclosure. The apparatus 200 is an example of apparatus 100 of Figure 1, with comprehensibility module 202 being an example of comprehensibility module 102 and comparison module 204 being an example of comparison module 104. The apparatus 200 may also include the one or more processors 126 and memory 128 coupled to the one or more processors 126; however, these are not shown in the figure. The apparatus 200 comprises the comprehensibility module 202 and the comparison module 204. The comprehensibility module 202 comprises a transcription module 210 and a context module 212. The transcription module 210 converts the received audio data comprising a speech sample into a word string of phonemic symbols. To do this, the transcription module 210 receives audio data comprising a speech sample, extracts phonemes from the speech sample to provide a string of phonemic symbols and analyses the string of phonemic symbols to identify a speech word string based on the most likely combinations of phonemic symbols. These steps may be performed in the same way as described above in relation to Figure 1. The context module 212 obtains speech metadata indicative of the meaning of the word string from the word string received from the transcription module. This step may be performed in the same way as described above in relation to Figure 1.

[0118] One way in which the transcription module 210 may perform the conversion of the received audio data comprising a speech sample into a word string is using a neural network. Thus, the transcription module 210 may be a transcription neural network, as described in more detail in relation to Figure 3. Moreover, one way in which the context module 212 performs the extraction of the speech metadata indicative of the meaning of the word string from the word string is using a neural network. Thus, the context module 212 may be a context neural network, as described in more detail in relation to Figure 3.

[0119] Figure 3 shows an illustration of a comprehensibility module 302 in accordance with aspects of the present disclosure. The comprehensibility module 302 is an example of comprehensibility module 102 and comprehensibility module 202. The comprehensibility module 302 comprises a transcription neural network 310 and a context neural network 312. Transcription neural network 310 is an example of transcription module 210 and context neural network 312 is an example of context module 212. The transcription neural network 310 comprises an input layer 316 having input nodes 314 for receiving audio data comprising the speech sample, and an output layer 320 coupled to the input layer 316 through one or more neural network layers, the output layer 320 having output nodes 318 for outputting a speech word string. The transcription neural network 310 is configured through training to map the audio data directly to the speech word string based on the most likely combinations of phonemic symbols. The context neural network 312 comprises an input layer 324 having input nodes 322 for receiving the speech word string, and an output layer 328 coupled to the input layer 324 through one or more neural network layers, the output layer 328 having output nodes 326 for outputting speech metadata based on the word string, the speech metadata indicative of the meaning of the word string. The context neural network 312 is configured through training to map the speech word string directly to the speech metadata based on the meaning of each word of the word string. Each of the transcription neural network 310 and context neural network 312 may be formed of multiple convolutional neural networks. For example, the transcription neural network 310 and context neural network 312 may each comprise a neural network for each language and/or for each reference phrase.

[0120] In order to map the audio data directly to the speech word string, the transcription neural network 310 may be trained using speech data to recognise phonemes and their corresponding phonemic symbol. For example, the transcription neural network 310 may receive training audio data comprising speech samples that have been labelled with the phonemic symbols corresponding to the speech samples, as discussed in relation to Figure 1. For each speech sample in the training dataset, the training audio data comprising a speech sample may be input into the transcription neural network 310 and the transcription neural network may output a string of phonemic symbols. The output string of phonemic symbols may be compared to the string of phonemic symbols associated with the speech sample in the training data. Where the strings of phonemic symbols do not match, the parameters of the transcription neural network 310 may be adjusted. This process may be repeated until the strings of phonemic symbols are the same for each speech sample. Therefore, the transcription neural network 310 may be trained to know which phoneme of the speech sample corresponds to which phonemic symbol such that when the trained transcription neural network 310 receives the audio data comprising the speech sample it can map the audio data into a string of phonemic symbols.

[0121] The transcription neural network 310 may also receive text data in order to train the transcription neural network on likely combinations of words. The text data may be generic text data such as a book or may be specific to the likely input such as the reference phrase that a user is prompted to say. Thus, the transcription neural network 310 may be trained specifically for the phrase that the user has been prompted to say. This increases the accuracy of the word string output by the transcription neural network 310.

[0122] In order to map the speech word string directly to the speech metadata, the context neural network 312 may be trained by receiving text data labelled with associated metadata, for example, key phrases. The context neural network 312 may output metadata based on the text data. This metadata may then be compared to the associated metadata. Where the output metadata and associated metadata do not match, the parameters of the context neural network 312 may be adjusted. This process may be repeated until the metadata matches the associated metadata. Further details on the methods of training the transcription neural network 310 and the context neural network 312 have been mentioned above in relation to the comprehensibility module 102 of Figure 1. These details apply equally to the transcription neural network 310 and the context neural network 312 of Figure 3. Moreover, other methods of training neural networks may equally apply.
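The generic training loop described above (predict, compare with the label, adjust the parameters, repeat) can be sketched as follows using PyTorch. The network shape, loss function and optimiser are assumptions and are not specified in the source; the sketch is not tied to either the transcription neural network 310 or the context neural network 312.

```python
import torch
from torch import nn

# Assumed toy network: 128-dimensional input features mapped to 40 output classes.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 40))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(features, labels):
    """One update: predict, compare the output with the labelled target, adjust."""
    optimiser.zero_grad()
    predictions = model(features)         # e.g. per-frame phoneme or metadata logits
    loss = loss_fn(predictions, labels)   # mismatch between output and label
    loss.backward()                       # gradients indicate how to adjust the parameters
    optimiser.step()
    return loss.item()

# Illustrative call with random data standing in for labelled training samples.
features = torch.randn(8, 128)
labels = torch.randint(0, 40, (8,))
train_step(features, labels)
```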

[0123] Figure 4 shows a schematic illustration of a computing device 450 for speech analysis in accordance with aspects of the present disclosure. The computing device 450 comprises an apparatus 400, which may be apparatus 100 of Figure 1 or apparatus 200 of Figure 2. The computing device 450 also comprises a display 426 and a microphone 428.

[0124] The processor 126 may output a reference phrase or instructions to the display 426. In response, the display 426 may provide a display related to a particular reference phrase of a plurality of reference phrases in order to prompt the user to attempt to say that phrase. For example, the display 426 may display the reference phrase in the chosen language for the user to then attempt to repeat the phrase or may display a prompt in the chosen language for a user to attempt to say the phrase. For example, the display 426 may display a question in the chosen language where the reference phrase would be the correct answer. Alternatively, the display 426 may display the reference phrase or question in the user’s native language for the user to then attempt to say the phrase in their chosen language. The native language of the user of the apparatus 100 may be selected by the user. Alternatively, the user may not be prompted to say a particular reference phrase. For example, the display 426 may be related to a particular theme for the user to talk about. In this example, a reference phrase may be determined based on the phrase spoken by the user.

[0125] The contents of the display may be used by the comprehensibility module 402 and/or comparison module 404 for selecting the reference word string and/or reference metadata. The display 426 or the processor 126 may output the reference phrase to the comprehensibility module 402 for the comprehensibility module 402 to extract the reference metadata from the reference phrase.

[0126] When prompted by the display to say a particular reference phrase, a user may activate a microphone, for example, by pressing on a touch screen of the display 426. When activated, the microphone generates audio data of the user speaking and transmits the audio data to the comprehensibility module 402 of the apparatus 400, for example, via connection 106. After the comprehensibility module 402 has converted the audio data into a word string and extracted speech metadata and the comparison module 404 has generated the final score, the comparison module 404 then outputs 108 the generated final score and/or feedback to the display. The feedback output to the display may be based on the final score. For example, where the final score is close to 1, the feedback may be positive, and where the final score is close to 0, the feedback may be negative. The feedback may be based on the one or more differential scores. For example, where the first text score is low but the first key phrase score is high, the feedback may be that the user has good comprehensibility but needs to work on their connecting words or word order.

[0127] The feedback may be output in real time following the receipt of the audio data. As mentioned previously, the user may be prompted to say a particular phrase which may comprise a number of speech samples. The microphone 428 may sequentially receive each speech sample and, after the first speech sample has been received, a score may be generated by the apparatus. The score and feedback may be output to the display before the user has finished the phrase.

[0128] Whilst the apparatus 400 is illustrated as being within the computing device 450, it may be separate from the computing device; for example, it may communicate wirelessly with the computing device 450.

[0129] Figure 5 shows a method 500 for speech analysis in accordance with aspects of the present disclosure. The method 500 is implemented by a computer. The method 500 comprises receiving 502 audio data comprising a speech sample and extracting 504 phonemes from the speech sample to provide a string of phonemic symbols. The method 500 further comprises analysing 506 the string of phonemic symbols to identify a word string of phonemic symbols based on the most likely combinations of phonemic symbols and obtaining 508 speech metadata based on the word string, the speech metadata indicative of the meaning of the word string. The method 500 further comprises comparing 510 the speech metadata with reference metadata indicative of the meaning of a reference word string and generating 512 a score based on the comparison, the score based on the difference between the meaning of the word string and the meaning of the reference word string.

[0130] The method 500 may further comprise comparing the word string of phonemic symbols with a reference word string, and generating a further score based on the comparison, the further score based on the accuracy of the speech in the speech sample. The method 500 may further comprise selecting the reference word string and/or the reference metadata based on the word string of phonemic symbols and/or the speech metadata. The method 500 may further comprise receiving an input indicative of the reference word string, and selecting the reference word string and/or the reference metadata based on the received input. The method 500 may further comprise, for each parameter of a plurality of parameters of the speech metadata, comparing the speech metadata with reference metadata, and generating a score based on the comparison, and outputting a final score based on the plurality of scores.

[0131] The steps of the method may be carried out by execution by a computer of instructions stored in a computer readable storage medium. The computer readable storage medium may be a transitory or a non-transitory computer readable storage medium.

[0132] Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

[0133] A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in a baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

[0134] Computer code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, radio frequency (RF), etc., or any suitable combination thereof.

[0135] Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java™, Smalltalk™, C++, or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

[0136] Where functional units have been described as modules, the modules may be implemented as circuitry, the circuitry may be general purpose processing circuitry configured by program code to perform specified processing functions. The circuitry may also be configured by modification to the processing hardware. Configuration of the circuitry to perform a specified function may be entirely in hardware, entirely in software or using a combination of hardware modification and software execution. Program instructions may be used to configure logic gates of general purpose or special-purpose processor circuitry to perform a processing function.

[0137] Circuitry may be implemented, for example, as a hardware circuit comprising processors, microprocessors, circuits, circuit elements (e.g., transistors, resistors, capacitors, inductors, and so forth), integrated circuits, application specific integrated circuits (ASIC), programmable logic devices (PLD), digital signal processors (DSP), field programmable gate arrays (FPGAs), logic gates, registers, semiconductor devices, chips, microchips, chip sets, and the like.

[0138] The processors may comprise a general purpose processor, a network processor that processes data communicated over a computer network, or other types of processor including a reduced instruction set computer (RISC) or a complex instruction set computer (CISC). The processor may have a single or multiple core design. Multiple core processors may integrate different processor core types on the same integrated circuit die.

[0139] Many variations of the methods described herein will be apparent to the skilled person. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings), may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.

[0140] The invention is not restricted to the details of any foregoing embodiments. The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed. The claims should not be construed to cover merely the foregoing embodiments, but also any embodiments which fall within the scope of the claims.