


Title:
ANALYSIS SYSTEM
Document Type and Number:
WIPO Patent Application WO/2014/108559
Kind Code:
A1
Abstract:
A method and system for directed analysis of a website uses techniques for following links and ranking links so as to efficiently extract information from a site for analysis.

Inventors:
SIVACKI NIKOLA (GB)
GRIFFITH GARETH (GB)
HEGARTY DANIEL (GB)
Application Number:
PCT/EP2014/050594
Publication Date:
July 17, 2014
Filing Date:
January 14, 2014
Assignee:
WONGA TECHNOLOGY LTD (IE)
International Classes:
G06F21/51
Foreign References:
US20100023850A12010-01-28
Attorney, Agent or Firm:
LOVELESS, Ian Mark (16 Theobalds Road, London WC1X 8PL, GB)
Claims:
CLAIMS

1. A system for providing an authorisation message in response to a device in communication with a website, comprising:

- means for receiving an initial link to the website and for retrieving a first page of the website using the initial link;

- means for retrieving links to pages within the website from the first page and adding to a link list;

- means for retrieving further pages by following links on the link list by operating a routine comprising the following repeated steps:

- selecting a link from the link list;

- retrieving the page designated by the selected link;

- extracting links from the retrieved page;

- adding the links extracted from the retrieved page to the link list;

- means for extracting information from the pages retrieved and for producing an authorisation message from the extracted information.

2. A system according to claim 1, wherein the routine comprises scoring links on the link list and the step of selecting comprises selecting the top scoring link.

3. A system according to claim 2, wherein the means for retrieving further pages is configurable to operate N multiple instances of the routine whereby N links on the link list may be followed in parallel.

4. A system according to claim 2 or 3, wherein the step of scoring links comprises searching for keywords within the links and assigning a score to each link based on the presence or absence of the keywords.

5. A system according to any preceding claim, wherein the system is arranged to limit the time taken to produce the authorisation message by limiting the time between receiving the initial link and producing the authorisation message.

6. A system according to any preceding claim, wherein the system is arranged to limit the time taken to produce the authorisation message by limiting the number of repetitions of the routine.

7. A system according to any preceding claim, wherein the system is arranged to limit the time taken to produce the authorisation message by limiting the number of links selected from the link list.

8. A system according to any preceding claim, wherein the means for receiving an initial link is arranged to receive the link from a client device browsing the website.

9. A system according to any preceding claim, wherein the system is a client device.

10. A system according to any of claims 1 to 8, wherein the system is provided on a server.

11. A system for providing an authorisation message as a result of a communication with a website, comprising:

- means for aggregating data related to the website, and

- means for determining whether to assert an authorisation message using the aggregated data;

- wherein the means for aggregating data is arranged to derive multiple values related to the website and to represent the multiple values as a multidimensional vector.

12. A system according to claim 11, wherein the means for aggregating data related to the website comprises means for following links on the website based upon a link scoring process.

13. A system according to claim 12, wherein the link scoring process comprises matching each link to keywords and retrieving a corresponding score.

14. A system according to claim 13, wherein the link scoring process comprises following each link in turn on a web page with the highest score.

15. A system according to claim 13, wherein the link scoring process includes a parameter defining the maximum number of links to follow.

16. A system according to any of claims 11 to 15, wherein the means for aggregating includes a time limit for aggregating data.

17. A system according to any of claims 11 to 16, wherein the means for aggregating data further comprise means for retrieving data from other sources using one or more keywords from the website.

18. A system according to claim 17, wherein the other sources include search engines, social networking sites, review sites and other reference sites.

19. A system according to claim 17 or 18, wherein the means for aggregating data comprises multiple threads operable in parallel to retrieve data from the website and the other sources.

20. A system according to any of claims 11 to 19, wherein the means for determining whether to assert a signal comprises means for reducing the aggregated data to a vector.

21. A system according to any preceding claim, wherein the means for determining whether to assert an authorisation message comprises means for reducing the vector to a proceed or deny signal.

22. A method for providing an authorisation message in response to a device in communication with a website, comprising:

- receiving an initial link to the website and retrieving a first page of the website using the initial link;

- retrieving links to pages within the website from the first page and adding to a link list;

- retrieving further pages by following links on the link list by operating a routine comprising the following repeated steps:

- selecting a link from the link list;

- retrieving the page designated by the selected link;

- extracting links from the retrieved page;

- adding the links extracted from the retrieved page to the link list;

- extracting information from the pages retrieved and producing an authorisation message from the extracted information.

23. A method according to claim 22, wherein the routine comprises scoring links on the link list and the step of selecting comprises selecting the top scoring link.

24. A method according to claim 23, wherein the method is configurable to operate N multiple instances of the routine whereby N links on the link list may be followed in parallel.

25. A method according to claim 23 or 24, wherein the step of scoring links comprises searching for keywords within the links and assigning a score to each link based on the presence or absence of the keywords.

26. A method according to any of claims 22 to 25, wherein the method is arranged to limit the time taken to produce the authorisation message by limiting the time between receiving the initial link and producing the authorisation message.

27. A method according to any of claims 22 to 26, wherein the method is arranged to limit the time taken to produce the authorisation message by limiting the number of repetitions of the routine.

28. A method according to any of claims 22 to 27, wherein the method is arranged to limit the time taken to produce the authorisation message by limiting the number of links selected from the link list.

29. A method according to any of claims 22 to 28, comprising receiving the initial link from a client device browsing the website.

30. A method according to any preceding claim, wherein the method is operable on a client device.

31. A method according to any of claims 22 to 29, wherein the method is operable on a server.

32. A method for providing an authorisation message as a result of a communication with a website, comprising:

- aggregating data related to the website, and

- determining whether to assert an authorisation message using the aggregated data;

- wherein the step of aggregating data comprises deriving multiple values related to the website and representing the multiple values as a multidimensional vector.

33. A method according to claim 32, wherein aggregating data related to the website comprises following links on the website based upon a link scoring process.

34. A method according to claim 33, wherein the link scoring process comprises matching each link to keywords and retrieving a corresponding score.

35. A method according to claim 34, wherein the link scoring process comprises following each link in turn on a web page with the highest score.

36. A method according to claim 34, wherein the link scoring process includes a parameter defining the maximum number of links to follow.

37. A method according to any of claims 32 to 36, wherein the aggregating includes a time limit for aggregating data.

38. A method according to any of claims 32 to 37, wherein the aggregating data further comprises retrieving data from other sources using one or more keywords from the website.

39. A method according to claim 38, wherein the other sources include search engines, social networking sites, review sites and other reference sites.

40. A method according to claim 38 or 39, wherein the aggregating data comprises multiple threads operable in parallel to retrieve data from the website and the other sources.

41. A method according to any of claims 32 to 40, wherein determining whether to assert a signal comprises reducing the aggregated data to a vector.

42. A method according to any of claims 32 to 41, wherein determining whether to assert an authorisation message comprises reducing the vector to a proceed or deny signal.

43. A server system comprising a processor and memory storing code which when executed undertakes the steps of any of claims 22 to 42.

44. A computer program product comprising code which when executed undertakes the steps of any of claims 22 to 42.

Description:
ANALYSIS SYSTEM

BACKGROUND OF THE INVENTION

This invention relates to methods and systems for verification of websites and other sources of data relating to an entity to improve security.

Online systems are increasingly being used in which a client device connects with a website system over a communication path, such as the Internet, and in which a third party software module forms part of the communication chain. An example of such an approach is a so called popup or plugin used in a Web browser to request data from a user and provide data to a remote system while the user interacts with a website server. Such arrangements are used, for example, in security systems in which a browser may be redirected to a remote third party site to exchange information prior to continuing interaction with a website system.

SUMMARY OF THE INVENTION

We have appreciated that arrangements such as those described above can provide additional security to the website system by adding an extra authentication process with a trusted remote third party. However, the exchange of data potentially presents a risk to the trusted remote third party from the unknown website system.

In broad terms, a system embodying the invention comprises a system for providing communication between a client device, a website system and a remote third party system, in which the remote third party system or the client device includes means for aggregating data related to the website system, and means for determining whether to assert a permission signal using the aggregated data. Preferably, the aggregation comprises a crawling algorithm. Preferably, the aggregation comprises reducing unrelated data sources to a multidimensional vector.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will now be described in more detail by way of example with reference to the drawings, in which:

Figure 1: is a functional diagram of the key components of a system embodying the invention;

Figure 2: is an overview of the key functional components of the remote system component embodying the invention;

Figure 3: is a flow diagram showing data collection using a crawling process;

Figure 4: shows the process of accessing and parsing multiple external data sources concurrently;

Figure 5: shows the aggregation of data from various sources; and

Figure 6: shows the output module of Figure 1.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The invention may be embodied in methods of operating client devices, methods of using a system involving a client device, client devices, modules within client devices and computer instructions for controlling operation of client devices. Client devices include personal computers, smart phones, tablet devices and other devices usable to access remote services. For ease of understanding, a system embodying the invention will be described first, followed by details of client devices, methods and message flows.

Overview

A system embodying the invention is shown in Figure 1 and comprises a website server 104 for providing web pages, one or more client devices 100 for receiving, presenting and interacting with the web pages and an analysis module 14 having a processor and memory separate from both the website server and client devices for providing additional interaction with the client device. The client device connects with the website server 104 and analysis module 14 over a network 16, preferably the internet, but other technologies, whether wired or wireless, may be used in the communication path. The analysis module 14 may be a self-contained system holding data available to client devices, but can also be a system that provides connectivity to other sources of data and functionality, shown by communication path 15. The analysis module 14 may thereby both retrieve data from other systems for provision to the client device and provide instructions to other systems as a consequence of interaction with the client device. Preferably, the analysis module comprises a process hosted at a remote system connected to the client device by the internet. The analysis module is coupled to an output module 17 having a processor and memory that can assert an output to the client device over network 16.

In an alternative embodiment, the functionality provided by the analysis module and output module described below may be incorporated within the client device, in which case the analysis module may be a web browser plug-in, a JavaScript process or dedicated functionality within the client device for retrieving data from a website, analysing it and determining whether to proceed as described below.

In a further alternative embodiment, the functionality of the analysis module, output module and client device may be provided at a computer system, such as a server, PC or cloud system, such that the computer system may connect to the website server and perform the retrieval and analysis steps described. One such arrangement may be a computer system for autonomously retrieving and checking a website, comprising a processor and memory arranged to undertake the checking steps to provide an output authorisation message.

In the various possible embodiments described, the analysis module and output module may be implemented as a processor and a memory storing program code for execution by the processor. The processor may be a general purpose processor or a dedicated processor.

The message flow in a system embodying the invention is shown in greater detail in Figure 2. The flow shows the process when a client device interacts with a website server 104, here shown as a client website, and the analysis module 14 intercepts and becomes part of the communication. The analysis module 14 accepts a request from a client application executing at the client device, containing a URL. This URL is accessed by a crawler module 101 of the analysis module 14. The client 100 issuing the request initiates the process. The client also sends a request containing a URL of the website representing the website server.

The crawler module 101 accepts the URL and company name in the request and proceeds to "crawl" the website referenced by the received URL, gathering various relevant data during the crawling, such as the existence of an SSL certificate, the average read and open time for pages on the website and the actual content of accessed pages on the website. The term "crawling" is understood to mean a process of stepping through selected links within a website to extract information which may then be analysed. The crawler sends content from the client website 104 to a parser 102, which extracts various features, such as the existence of certain keywords anywhere in the content (from a configurable keyword list). The crawler 101 also accesses various third-party websites which may be queried using the company name, and the responses from the websites are parsed in a manner specific to those websites. For example, a search engine 105 such as Google™ may be queried to obtain the number of websites indexed by Google™ referencing the website, generating an integer number. The existence of entries on Twitter™ 106 may be retrieved. In the case of LinkedIn™ 107, the presence of the company on LinkedIn™ is checked (thus returning a binary feature of yes/no).

A Risk Engine 103 gathers all the features produced above and calculates a score which is used to determine the next steps in the process. The risk score may impact the further behaviour of the client. For example, if the website is deemed high risk, then interaction with the website may be restricted or the provision of additional services via the website or a third party may be terminated.

We have appreciated that the process of retrieving data from (crawling) websites needs to be completed in a limited amount of time, during the interaction of the user with a website. This imposes some limitations, namely in the number of pages the crawler can collect. Because of this, the crawler needs to follow the links that are most likely to lead to pages with content of likely interest when extracting internal features.

The crawler should therefore follow links that are likely to lead to interesting data such as target words. So, for a word set relating to a topic (say, customer service), the system stores another set of short words that are likely to be contained in the links pointing to pages that would contain the target words. This set could be, for example, 'contact', 'help' and 'customer'. Since the system cannot know in advance which page will contain the target words, this shorter word set is used to navigate through the links towards the pages that are likely to contain them, or other useful data. Using these words, certain links are scored higher while others are filtered out, so the short set directs the crawling process, minimising the total number of pages requiring crawling before the target features are collected.

As described above, the embodying system may be used as part of a web interface, which generates a request that is handled by a web service; the request is inserted into a queue data structure. The system then polls this queue periodically and reads the request. After reading the request, it extracts the URL of the company website from it and extracts the main string from the URL as the company name. These two are used by the internal and external crawlers to crawl the website and 3rd party services for parsing and generating features. The features are then passed on to a statistical model, here a Risk Engine 103, which outputs a score for that feature vector. A request is made to the service, with three parameters passed:

1. the URL of the company to be processed by the system

2. maximum number of pages to crawl (in the directed crawler)

3. maximum number of seconds for crawling to take

If the last two parameters are omitted, default values such as 50 (pages) and 0 (no time limit) may be used. The webservice system responds with a message 'Processing for companyname.co.uk started', noting that the request is valid and that all request threads have been activated. After waiting for 15 seconds or more, calling get_score with the company URL as the argument returns the score calculated by the model.
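The request-handling step above can be sketched in Python. The function name prepare_request and the company-name extraction are illustrative, not from the patent; only the default values (50 pages, 0 meaning no time limit) come from the text.

```python
# Hypothetical sketch of the request-handling step: apply the default
# crawl limits when optional parameters are omitted, and take the main
# string of the URL as the company name.
from urllib.parse import urlparse

DEFAULT_MAX_PAGES = 50
DEFAULT_MAX_SECONDS = 0        # 0 means no time limit

def prepare_request(url, max_pages=None, max_seconds=None):
    """Build the internal request record polled from the queue."""
    host = urlparse(url if "//" in url else "//" + url).netloc
    if host.startswith("www."):
        host = host[4:]
    company = host.split(".")[0]   # main string of the URL, e.g. 'companyname'
    return {
        "url": url,
        "company": company,
        "max_pages": DEFAULT_MAX_PAGES if max_pages is None else max_pages,
        "max_seconds": DEFAULT_MAX_SECONDS if max_seconds is None else max_seconds,
    }
```

Called with only a URL, the sketch falls back to the 50-page, unlimited-time defaults described above.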

Crawling Process

As discussed above, for security systems and the like, processing time is an important factor. A client device should interact with a non-trusted website for a minimum amount of time. Similarly, the remote service wishes to interact with a non-trusted website for a minimum time and then assert a signal ceasing communication with the website and instructing the client device that interaction with the website to obtain a service should cease. Accordingly, we have appreciated the need for directing and constraining the manner of, and the time spent in, the data retrieval "crawling" process on the target website. The process of following links and retrieving data from pages designated by those links may be referred to as "crawling", as already mentioned.

The crawling process therefore needs to be constrained in some way, to balance the duration of crawling against the quality of crawled data, and preferably to make this constraint tuneable. This is done by applying additional heuristics to the crawling process, whereby each link considered for crawling is scored for 'quality' and only top scoring links are followed. The crawling process ends once the predefined maximum number of pages has been crawled or once the allocated time duration has been exceeded. Other ways of constraining the time period may be used in addition, or as alternatives.
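One possible implementation of the time constraint, sketched in Python, keeps a running average of per-page crawl times and stops when the remaining budget is less than that average. The function name should_continue is illustrative, not from the patent; deadline and now stand in for real wall-clock readings.

```python
# Sketch of a time-based stopping rule: crawling stops when the time
# remaining before the deadline is less than the running average cost
# of crawling one more page.
def should_continue(deadline, now, page_times):
    """Return True if there is, statistically, time for one more page."""
    if not page_times:                 # no history yet: rely on the deadline alone
        return now < deadline
    avg = sum(page_times) / len(page_times)
    return (deadline - now) >= avg
```

As the description notes, this does not guarantee the deadline is met; it only makes meeting it the statistical expectation.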

Different scenarios might afford different durations of time for the crawling process before output of a signal derived from the crawled data is required, so the requested duration is an input parameter to the crawling process (as is the maximum number of links to crawl). The crawler does not know in advance how many pages it will end up crawling, nor what the total duration will be, since these depend on the structure of the website, but also on current parameters of the network, such as the client device, bandwidth, latency, contention, etc. The requirement that the crawling process ends within N seconds with some acceptable loss in quality of retrieved data means the system should adapt to potentially changing parameters of the network, without missing the synchronisation point, which could be more costly than partially missing data.

The process operates by retrieving links on a given page, scoring those links, adding the links to a list and ranking them in order of descending score. The link or links at the top of the list are then followed and links retrieved from the page designated by each link. The process of retrieving, scoring and adding to the same list is repeated for that page. In this way, a single list of the links most likely to produce useful data is continually maintained and updated during the process. At any given point in time, the links scored at or near the top of the list are the ones to be followed next, irrespective of whether they were retrieved from a high level or low level page within the website structure.

Figure 3 shows the process for directed crawling. At step 401 the web page link is inserted into the crawling process. Links from the home page of the website are then retrieved and scored and the top scoring link selected at step 402. The link scoring stage is key to balancing crawling time against crawling accuracy. The score for each link is calculated by matching keywords in the link description, increasing the score of the link if certain keywords are present and decreasing the score if others are present anywhere in the link. The list of words used for scoring is maintained in a score list 410.

The link scoring is presented in the following pseudocode.

LINK_URL = LINKS(i)
LINK_SCORE = 0
for WORD in KEYWORDS do
    if WORD exists in LINK_URL then
        LINK_SCORE = LINK_SCORE + SCORES(WORD)
    end
end
if (LINK_SCORE > THRESHOLD) then
    NEW_LINKS = CRAWL_LINK(LINK_URL)
    LINKS.ADD(NEW_LINKS)
end

The quality of the keywords selected impacts the crawling process. A better quality set of keywords directing the crawling process allows for a shorter duration of crawling, since better quality links are being followed and the desired data is likely to be reached sooner. Together with tuning the score threshold for acceptable links, the keyword set in the score list 410 forms a configurable set of parameters that are tuned to meet the requests for the maximum amount of time allowed for crawling. The set of keywords may be formed either by applying insight into what substrings are probably contained in the links of interest, or by applying an algorithm which is given a set of good links and extracts substrings that tend to occur in them.

The implementation of the crawling process involves crawling each link in a separate thread, to maximise concurrency. The process thus continues by crawling the web page at step 403 until the link limit passed as a parameter is reached at step 404, in which case it terminates at step 408, or continues to extract the links on the page at step 405, filter the links at step 406 based on a filter list 409 and then score the links at step 407. The process thus follows links from each page in parallel based on the top scoring link on each page. Several of these processes are executed concurrently and they synchronise after parsing, when their outputs are aggregated into the unified feature vector as previously described.
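The scoring rule and the single ranked link list can be combined into a minimal, runnable Python sketch. The keyword scores, threshold and site graph below are illustrative, and the page fetch is stubbed as a dictionary lookup rather than a real HTTP request; the real process also runs these steps across threads, which is omitted here.

```python
# Minimal sketch of the directed-crawl loop: score links by keyword,
# keep one ranked list, always follow the current top-scoring link,
# and stop at the page limit.
KEYWORD_SCORES = {"contact": 2, "help": 3, "faq": 1}   # illustrative
THRESHOLD = 0

def score_link(url):
    """Sum the scores of all keywords appearing in the link URL."""
    return sum(s for word, s in KEYWORD_SCORES.items() if word in url)

def directed_crawl(start, fetch_links, max_pages):
    """Follow the highest-scoring link on a single shared list until
    the page limit is reached; returns pages in crawl order."""
    visited, link_list = [], [start]
    while link_list and len(visited) < max_pages:
        link_list.sort(key=score_link, reverse=True)
        link = link_list.pop(0)               # select the top-scoring link
        if link in visited:
            continue
        visited.append(link)                  # "retrieve" the page
        for new in fetch_links(link):         # extract links from the page
            if new not in visited and score_link(new) > THRESHOLD:
                link_list.append(new)         # only scored links survive
    return visited
```

Run against a small in-memory site, the crawler visits the contact and help pages first and never fetches a page whose link scores zero.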

After each link is crawled, the system time is consulted to establish whether the allocated duration for the crawling has been exceeded. If so, the crawling does not proceed and the crawled links are not scored for the next iteration. The time granularity of crawling corresponds to the time it takes to crawl an individual page (which is unknown in advance), after which the check is performed, so each thread needs to know when to stop crawling so as not to exceed the time limit. This is achieved by each thread maintaining an average of the time needed to crawl previous pages. Each thread contains the data of when it would need to finish, and a comparison is made between this time limit, the current time and the average time required to crawl a page. If the remaining available time (after a link has just been crawled) is less than the average time needed for crawling, the crawling process ends. This way, although the time of crawling is not guaranteed to be less than requested, the system gives a statistical expectation that this will be the case. An example of this might be that given two links, the crawler needs to decide in which order to crawl them (prioritise), since there might not be time to crawl the second one:

• http://www.site.com/contactus.php

• http://www.site.com/faq.php

and the link keywords configuration contains the following keywords and scores.

• where: 1

• address: 1

• contact: 2

• faq: 1

• help: 3

The crawler process will score the first link with score 2 (having found the substring 'contact' in it) and the second link with score 1 (having found the substring 'faq'). It will therefore choose to crawl the first link first.
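This two-link prioritisation can be checked directly in Python, using the keyword scores listed above:

```python
# The worked example above: 'contactus.php' scores 2 (substring
# 'contact') and 'faq.php' scores 1 (substring 'faq'), so sorting by
# descending score puts the contact page first.
KEYWORD_SCORES = {"where": 1, "address": 1, "contact": 2, "faq": 1, "help": 3}

def score_link(url):
    return sum(s for word, s in KEYWORD_SCORES.items() if word in url)

links = ["http://www.site.com/contactus.php", "http://www.site.com/faq.php"]
ranked = sorted(links, key=score_link, reverse=True)  # contact page first
```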

Multiple threads or processes may be used for the process of analysing the website as well as for retrieving data from other sources. To achieve this, a main thread is provided to coordinate one or more sub-threads. The additional crawling threads are started by the main thread, which requires these additional threads to finish before it proceeds with aggregating their outputs into a unified feature vector. Each of the started threads receives a link to crawl and outputs the links extracted from the crawled page into the central list, administered by the main thread. As each of the threads produces a set of links, they are all scored and entered into the main list, from which the main thread retrieves the top scoring links (among all the links in the list) and passes them to the started threads. The centralised administration of the list by the main thread is provided to guarantee aggregation of the results from the threads and consistent following only of top scoring links during the crawling process.

Figure 4 shows the process for retrieving data related to the website in question from other sources, explaining the process of accessing and parsing multiple external data sources concurrently. The figure separates the functions performed within the remote service (left of interface 505) and on 3rd party servers (right of interface 505). For conciseness, the figure shows only three threads on the left side of interface 505, but other external sources of data would be handled in the same way. Similarly, only one example of a 3rd party server process is provided (Twitter™ server 502), but additional servers would be handled in the same way. The system starts jobs in several threads at the same time - one job for each external service accessed at step 500. Taking the example in the figure of Twitter™, the thread processing that request 501 initiates an external request to the Twitter search service 502, querying for mentions of the company within Twitter™. The service returns all mentions and the calling thread within the system 503 proceeds to parse and process the response to generate the required features.

At the synchronisation point 504, the main thread waits until all request threads have completed processing and then combines the results of their parsing into the unified feature vector as described above. After this, the feature vector can be passed on to other parts of the system as an output signal.
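The fan-out and synchronisation pattern described above may be sketched with Python threads. The source functions below are stand-ins for real Twitter™ or search-engine queries and their parsers; only the start/join/combine structure reflects the description.

```python
# Illustrative sketch of the fan-out/synchronise pattern: one worker
# thread per external source, joined by the main thread before the
# per-source results are combined into a single feature vector.
import threading

def _worker(name, fn, results):
    results[name] = fn()          # query and parse one external source

def aggregate(sources):
    """Run every source in its own thread, wait for all of them (the
    synchronisation point), then combine results in a fixed order."""
    results, threads = {}, []
    for name, fn in sources.items():
        t = threading.Thread(target=_worker, args=(name, fn, results))
        t.start()
        threads.append(t)
    for t in threads:             # main thread waits for all request threads
        t.join()
    return [results[name] for name in sorted(sources)]
```

Combining in a fixed (here alphabetical) order keeps each dimension of the resulting vector tied to one source, as the data aggregation section below requires.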

Data Aggregation

The process of following links to pages selected according to the crawling process described above, or according to another retrieval process, produces a set of pages that can be analysed to determine whether the website as a whole is deemed safe for use by the system. If so, an output signal may be asserted to allow the device to continue with a process at the website. The retrieval of data from the website may include retrieving words, graphics, certificates and other indicators of authenticity. In particular, a list of keywords may be compared to words on the pages of the website visited. If such words are found anywhere on the visited pages then a "feature" is determined to be present. Similarly, the presence of items such as SSL certificates, kitemarks and the like may also indicate that corresponding features are present. The existence of such features may be reduced to a value which, along with values for other features, may be handled as a multi-dimensional vector.

Figure 5 shows graphically how data from various sources may be reduced to such a vector. The system 14 as a whole may be considered as two logical or physical parts: one part 141 for retrieving and processing the data from the client website 104 that the client device is viewing, and a second part 142 for retrieving and processing the data from 3rd party web services, such as Google™, Twitter™, and so on. Module 141 is responsible for crawling the client website, producing features for the feature vector 200. The feature vector 200 represents a unified numerical vector that may be used by a statistical prediction module. The crawling of pages may be performed in several threads 203, so that the retrieving of pages is performed concurrently. When the crawling of all pages is finished, module 141 proceeds to parse the data (producing the features for the unified feature vector 200). Module 142 crawls 3rd party websites and services, to produce the remainder of the features for the unified feature vector. This is also done using several threads, as explained above. The unified feature vector 200 is ready to be processed once all stages in 141 and 142 have finished, so a synchronisation point exists here and is explained in more detail below.

The data aggregation related to a given website 104 therefore comprises a combination of generating a scalar value from each of multiple sources and representing these as a vector, each dimension relating to a source. The individual scalar values may be calculated from matters such as the frequency of occurrence of certain words, values extracted such as the validity of certificates, and other such sources already mentioned. In this way, the complex question of determining the authenticity of a given website may be reduced to a single vector for subsequent processing by a decision engine.

For example, when analysing a website, the crawler looks for certain keywords that are indicative of credibility or authenticity. For example, the presence of words like 'customer service', 'company history', 'our vision', 'our address' or 'live chat assistance' somewhere on the website indicates credibility. The crawler reads groups of such pre-defined words and outputs the corresponding feature with the value '1' if any of the words from a group is present on any page, and '0' otherwise. Some groups of words can have negative value as well, of course. This is all configurable: a set of files defining the word sets is administered together with the system and used to initialise the parser.
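The keyword-group check just described can be sketched as below. The word group is taken from the examples in the text; a real deployment would load the groups from the configuration files mentioned.

```python
# Binary keyword-group feature: 1 if any word from the group appears on
# any crawled page, 0 otherwise. Word group taken from the text's examples.
CREDIBILITY_WORDS = {"customer service", "company history", "our vision",
                     "our address", "live chat assistance"}

def keyword_feature(pages, word_group):
    """pages: list of page texts from the crawl.
    Returns 1 if any phrase from the group occurs on any page, else 0."""
    text = " ".join(page.lower() for page in pages)
    return 1 if any(word in text for word in word_group) else 0
```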

The parsing of external data can be very diverse and may include additional processing. For example, in the case of Twitter™, the processing may include calculating a sentiment score (a real number) of the mentions, using additional statistical packages in step 503. As another example, in the case of accessing domain registrar data, the returned absolute date of registration may be transformed into an integer offset noting the number of days from the current date, so a date of one year ago would be transformed to the integer 365. The key is that after stage 504 all these features are ready and aggregated into a single numerical vector, which itself is ready to be aggregated with the internal features into a unified feature vector. An example of the aggregated feature vector, with both internal and external features, is given here:
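The registrar-date transformation described above may be sketched as follows. The reference date is passed explicitly here so the example is deterministic; in operation it would be the current date.

```python
# Transform an absolute registration date into an integer offset in days,
# as described for the registrar-data feature.
from datetime import date

def days_since(registered: date, today: date) -> int:
    """Number of days between the registration date and the reference
    date, e.g. a registration exactly one year earlier yields 365."""
    return (today - registered).days
```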

   FEATURE                EXAMPLE  DESCRIPTION
1  ssl_redirect           0        HTTP redirects to HTTPS
2  ssl_expiry_days        233      days until SSL certificate expiry
3  time_open              0.01     avg time to access a page
4  time_read              0.0084   avg time to read a page
5  links_all              394      total links on crawled pages
6  links_local_ratio      0.637    fraction of local links
7  img_ratio              0.0480   fraction of image links
8  wc_seals               0        presence of SSL seals
9  wc_font                1        font change detected via CSS
10 domain_length          18       characters in domain
11 Ndigits                0        # of digits in domain
12 Credibility            1        presence of credibility words
13 twitter_sentiment_avg  0.870    sentiment score of tweets
14 twitter_sentiment_std  0.177    sentiment std of tweets
15 Linkedin_present       0        company present on LinkedIn™
16 Google_inlinks         306000   # of 3rd-party links pointing to site
17 whois_domain_in_email  1        domain present in whois contact email
18 whois_expiry_days      421      days until domain expiry

The feature column is the name of the feature, the example column is a sample value produced by the system and the description column contains a description of the feature. The first 12 features above are internal features, produced by the directed crawler; the rest are external, produced by querying 3rd-party services.

The derivation of each feature within the vector may be done in a variety of ways. Taking the example of word checking, there may be a "feature" for each group of words, such as the presence of words like 'customer service', 'company history', 'our vision', 'our address' or 'live chat assistance' somewhere on the website, which indicates "credibility" as shown at location 12 above. This may be a binary value indicating the presence or absence of the selected words within the website. Similar lists of words may be used to derive other features covering aspects such as security, ease of use and so on.

Various features may be explicitly security related. The SSL related features at locations 1 and 2 of the example vector show the existence of an HTTPS redirect on the home page of the site in question and the number of days to SSL certificate expiry.

Some features relate to the way in which the crawling process operates, such as features at locations 5 and 6 in the example vector. These show the total number of links followed in the crawling process and the fraction of these deemed to be local links, rather than external links. This provides a measure of the size of the website.
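The crawl-derived features at locations 5 and 6 can be sketched as below. The helper classifies a link as local by comparing the URL's host to the site's own domain; the real link extraction is of course done by the crawler itself, and the link list here is hypothetical.

```python
# Sketch of crawl-derived features: total links followed and the fraction
# of those that stay on the site's own domain (a proxy for site size).
from urllib.parse import urlparse

def link_features(links, site_domain):
    """links: URLs gathered during the crawl; site_domain: the client
    website's host. Returns the links_all and links_local_ratio features."""
    total = len(links)
    local = sum(1 for url in links if urlparse(url).netloc == site_domain)
    return {"links_all": total,
            "links_local_ratio": local / total if total else 0.0}
```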

The aggregated feature vector thus provides a representation of apparently disparate sources of data related to a website. The unified feature vector may be passed on to a separate output module that uses it to classify the website into one of several predefined categories. This external module would contain an implementation of a statistical learning model, which would also be trained on this data. The fact that the vector is composed of numerical values (real numbers and integers) means that the data can be fed directly into such a statistical model with minimal modification. The output classifications from the model would be used as a signal to further influence the behaviour of the system (including, potentially, the client). The output module may be written in a variety of languages known to the skilled person and need not be discussed further.

The output module for generation of the authorisation message or signal is shown in greater detail in Figure 6. The feature vector, as previously described, is received and provided to a prediction engine 601. This is the module that contains a prediction model, which has previously been trained in machine learning module 603. The vectors are also stored in the database 602 (in addition to being sent to the prediction engine), for training the model in the future.

Upon receiving the vector for a particular website, the machine learning module feeds the vector into the previously trained model and produces the output score, which is used to further control the client device (or an external system which interacts with the client device). Website classifications may contain previous correct signals for historical websites and are used to train the model in the machine learning module 603. The criteria for classification can be diverse and correspond to the desired meaning of the authorisation signal.
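A highly simplified stand-in for the scoring step follows. The patent leaves the choice of statistical model open; here a previously "trained" linear model scores the unified feature vector and a threshold turns the score into an authorisation signal. The weights, threshold and labels are invented for illustration.

```python
# Simplified sketch of the prediction step: a trained linear model scores
# the feature vector, and a threshold yields the authorisation signal.
# Weights and threshold are illustrative, not from the patent.
def score(vector, weights, bias=0.0):
    """Linear model score: bias plus the weighted sum of the features."""
    return bias + sum(w * x for w, x in zip(weights, vector))

def authorise(vector, weights, threshold=0.0):
    """Turn the model score into a two-class authorisation signal."""
    return "AUTHORISED" if score(vector, weights) > threshold else "DECLINED"
```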