I was pleased to be invited recently to an Open Meeting of the Euforbia project where a detailed insight into the work was given. As the project is nearing its conclusion, the consortium is seeking a variety of views as they examine possible future work.
In this short paper I will outline my impressions of the work done and end by suggesting areas in which I believe it can be carried forward. These include, but are not restricted to, possible collaborations with ICRA.
Readers not familiar with Euforbia are encouraged to read the following introduction to the project:
Experiments aboUt the Filtering of internet documents accORding to an unBIAsed and semantic-rich approach.
This project aims to demonstrate the possibility of using an unbiased and semantic-rich approach for executing an accurate (rating and) filtering of Internet documents.
The principal objective of this project is to contribute to the production and use of new generations of Internet filtering systems, more powerful and flexible than the existing ones, and easier to adapt to cultural, political or religious differences. This objective requires creating systems endowed with the two following properties:
A prototype software system operating on user relevant data and satisfying the above quoted properties will be realised by the consortium under the control of an "EUFORBIA user group", active during the whole life span of the project.
Technically the proposal consists of the combination of two established techniques. The first (NKRL-like tools) will be used for "annotating" the documents, i.e., for associating them with an in-depth, formalised and neutral description ("annotation") of the "semantic content" of these documents. The second (Milan model for filtering accesses) generalises and makes more flexible approaches based on the PICS standard by organising the important notions of the domain into a hierarchy of concepts rather than a set of simple keywords. The rules based on the annotations will be used according to two strategies. Firstly, they will be used to calculate the rating categories and the associated values needed by the Milan model; secondly, they will be employed autonomously to filtrate the Internet documents.
The results will consist mainly of:
The overlap between the two approaches is very clear. In both cases, an academically-defined descriptor set is applied to online content. Filters can then block/allow access depending on policies set by the user. Both systems allow whole websites, sections of websites or specific pages to be "annotated" (ICRA would say "labelled"). These classifications/ratings/annotations/labels - call them what you will - can be applied by the content provider or by a third party.
There are, however, significant and interesting differences between the Euforbia and ICRA approaches:
Aside from these general points, there is a very significant difference in approach to the technologies and standards used by the two organisations. The ICRA descriptor set is not a technical standard; it's a set of questions that can be written on a single sheet of A4. Likewise, (as I understand it) NKRL does not, of itself, define a technical standard. Here's where the two diverge dramatically: in its short history, ICRA has expressed its vocabulary and promoted it using the PICS standard (as defined by W3C [2]). Euforbia does not follow any such open standard. The Milan model is based on PICS, but uses a non-compliant implementation of it.
To elaborate on that last point: according to the Milan Model, a web page may include a PICS label in the normal way but may also carry a short meta tag that directs software to look in a separate location to find the NKRL annotation. This "de-referenced meta data" feature is one ICRA would very much like to have seen in PICS, as it would make the labelling of some sites much easier, but it is not part of the standard (at least, not directly). Therefore, ICRA doesn't support it. I did not spot any other deviation from the PICS standard, but this one feature alone means that PICS-compliant filters (notably Internet Explorer's Content Advisor) can't work with the Milan model as currently specified. De-referenced labels would suit some large content providers, as they would provide a method for ratings to be added to a resource after it has been published, probably by a colleague or perhaps a third party organisation.
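To make the distinction concrete, here is a minimal Python sketch of how label-reading software might look for both kinds of information in a page. The inline label uses the usual `PICS-Label` meta tag. The pointer tag name `annotation-location`, the URLs and the rating values are all invented for illustration; this paper does not reproduce the actual Milan model syntax, so the sketch should not be read as that syntax.

```python
from html.parser import HTMLParser


class LabelFinder(HTMLParser):
    """Collects an inline PICS label and a (hypothetical) annotation pointer."""

    def __init__(self):
        super().__init__()
        self.inline_label = None    # label text embedded in the page itself
        self.annotation_url = None  # where a de-referenced annotation would live

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = {name.lower(): (value or "") for name, value in attrs}
        if attrs.get("http-equiv", "").lower() == "pics-label":
            self.inline_label = attrs.get("content")
        # "annotation-location" is an invented name, used only for illustration.
        elif attrs.get("name", "").lower() == "annotation-location":
            self.annotation_url = attrs.get("content")


def find_labels(html):
    """Return (inline_label, annotation_url); either may be None."""
    finder = LabelFinder()
    finder.feed(html)
    return finder.inline_label, finder.annotation_url


# Placeholder page: the vocabulary URL and rating values are not real.
PAGE = """
<html><head>
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://example.org/vocab" l r (a 0))'>
<meta name="annotation-location" content="http://example.org/annotations/page1">
</head><body>Hello</body></html>
"""

label, pointer = find_labels(PAGE)
```

A filter that only understands the inline label would simply ignore the second tag, which is why the extra pointer goes unseen by existing PICS-only software.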
As I see it, these differences boil down to:
ICRA has a limited descriptor set which is only of relevance to child protection. The rating is delivered using open standards and therefore suitable label generating and reading technology can be developed by anyone.
Euforbia uses vastly more detailed rating vocabularies which give a much fuller description of the semantic content. This has potential application not only in child protection but in wider meta data applications, notably search engines. But efforts to achieve wide take-up are going to be handicapped by the use of proprietary systems.
There is a final difference - the complexity of the rating vocabularies has a real impact on the annotation/labelling process. Professor Zarri was at pains to point out that more user-friendly label generators were envisaged, but at present, annotating a site, particularly using NKRL, is a very complicated process. Even with well-designed tools, a complex rating system must require a relatively complex label generator. Likewise, the filtering options are, of necessity, complicated.
The ICRA descriptor set inevitably leads to a minimum of 45 block/allow decisions - 90 buttons on ICRAfilter! The ICRA solution to this was to utilise another part of the PICS standard, PICSRules files (PRFs) or "templates" as we refer to them colloquially. These text files contain filtering rules and URLs to be blocked/allowed and so provide a shortcut to setting up the filter. I suggest that a similar shortcut method would need to be found for the Euforbia filtering solutions.
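The idea of a template as a shortcut can be sketched roughly as follows. This is a deliberately simplified Python model, not PICSRules syntax: the `Template` class, the `decide` function and the descriptor names are assumptions made for illustration, and the precedence shown (URL lists first, then the label) is one plausible choice rather than a description of ICRAfilter's behaviour.

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    """A simplified stand-in for a filtering template (not PICSRules syntax)."""
    allow_prefixes: list = field(default_factory=list)
    block_prefixes: list = field(default_factory=list)
    blocked_descriptors: set = field(default_factory=set)
    allow_unlabelled: bool = False


def decide(template, url, descriptors=None):
    """Return "allow" or "block" for a URL and its label, if it has one.

    descriptors is the set of descriptor names the label asserts,
    or None when the resource carries no label at all.
    """
    # In this sketch, explicit URL lists take precedence over any label.
    if any(url.startswith(p) for p in template.allow_prefixes):
        return "allow"
    if any(url.startswith(p) for p in template.block_prefixes):
        return "block"
    if descriptors is None:
        return "allow" if template.allow_unlabelled else "block"
    # Block when the label asserts any descriptor the template rules out.
    if descriptors & template.blocked_descriptors:
        return "block"
    return "allow"


# A hypothetical ready-made template: one object replaces many button clicks.
family = Template(
    allow_prefixes=["http://example.org/homework/"],
    block_prefixes=["http://example.com/casino/"],
    blocked_descriptors={"violence", "gambling"},
)

verdict = decide(family, "http://example.net/news", {"violence"})
```

The point of the sketch is that a user loads one prepared object instead of making dozens of individual block/allow decisions; a richer vocabulary would need the same kind of packaging.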
IF - I stress the conditional - Euforbia wanted to establish NKRL/Milan model rating and filtering across the board, the task would be enormous. Content providers would need to be encouraged to annotate their sites, or a third party body would need to take on the not-insubstantial task of rating the web through what in PICS terms is a label bureau. ICRA and its RSACi predecessor have been trying to make this happen with a simple system for more than 6 years and only now are we seeing any real progress. Similarly, the filtering tools would need to undergo further development to bring them to market readiness.
In such a scenario, Euforbia would be competing with ICRA for the attention of the major content providers and filter manufacturers. However, unless I completely misread the mood of the meeting, this is NOT how Euforbia members are thinking of proceeding.
Good.
The Euforbia work, and its antecedents such as Concerto [3], has enormous potential as a system for describing the semantic content of digital resources. The filtering solutions are very interesting and, as stated earlier, both appear to be superior to ICRAfilter in some respects.
As I discussed briefly with the Euforbia group, ICRA is likely to be involved very shortly in developing the successor to PICS: PICS-RDF. Prof. Zarri has already looked at how NKRL could be expressed in RDF and has published work on this which he has kindly forwarded to me [4]. The Milan model has identified areas in which the old PICS standard does not offer everything wanted. All this surely points to a collaborative effort on the work to define PICS-RDF.
The whole point of RDF is that it provides a syntax through which different descriptive schemas can be applied to any digital resource that has a URI. PICS-RDF needs to work within that framework whilst defining how PICS features, such as gen true/false flags, lifespan parameters, etc., can be incorporated. RDF itself is still under development but some work has already been done on RDF-PICS [5].
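The underlying model can be illustrated with a few lines of Python. RDF statements are triples of subject, predicate and object, and predicates from different vocabularies can describe the same URI side by side. In this sketch only the Dublin Core namespace is real; the rating and semantic namespaces, the resource URI and all the values are placeholders, and plain tuples stand in for a proper RDF serialisation.

```python
from collections import defaultdict

DC = "http://purl.org/dc/elements/1.1/"          # Dublin Core element set
RATING = "http://example.org/rating-vocab#"      # placeholder, not a real schema
SEMANTIC = "http://example.org/semantic-vocab#"  # placeholder, not a real schema

PAGE = "http://example.org/articles/42"

# Each statement is (subject, predicate, object), as in the RDF model.
triples = [
    (PAGE, DC + "title", "A history of lighthouses"),
    (PAGE, DC + "creator", "A. N. Author"),
    (PAGE, RATING + "violence", "none"),
    (PAGE, SEMANTIC + "topic", "maritime navigation"),
]

VOCABULARIES = (DC, RATING, SEMANTIC)


def describe(resource, statements):
    """Group every statement about one resource by the vocabulary it uses."""
    grouped = defaultdict(dict)
    for subject, predicate, value in statements:
        if subject != resource:
            continue
        for namespace in VOCABULARIES:
            if predicate.startswith(namespace):
                # Strip the namespace to leave the local property name.
                grouped[namespace][predicate[len(namespace):]] = value
                break
    return dict(grouped)


description = describe(PAGE, triples)
```

A filter would read only the rating statements and a search engine only the descriptive ones, yet both draw on the same set of statements about the same URI.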
If we want to see a standard syntax that can express both simple rating vocabularies such as ICRA's and much more complex vocabularies such as NKRL, and if we want those vocabularies to be delivered using techniques that users of PICS desire but that are not universally specified, then collaboration on defining PICS-RDF seems the obvious way forward.
The intention is that Ernest Miller and Raul Ruiz, operating as Filter Tobacco and based at Yale University (see appendix), should work with ICRA and the W3C to define PICS-RDF. We are about to begin seeking support for this initiative and assessing its potential more generally. It is important to note here that the project has not yet been finalised or even approved in principle by the ICRA board. It is my hope that members of the Euforbia group will be interested in supporting the initiative. Quite how the project will progress is not yet clear. It's very early days, but an initial signal of support and interest would be very beneficial.
Ernie Miller and Raul Ruiz are also responsible for managing the open source project for ICRAfilter. (The source code should be available on SourceForge any day now [6].) It is envisaged that the PICS-RDF work should include additions so that the filter can also work with PICS-RDF. The development of additional modules in Euforbia's two demonstration filter solutions to work with NKRL/Milan model as expressed in PICS-RDF would be a significant boost to the standards-definition project, and, I would suggest, to the filters themselves.
In short, we have a chance to define the standards and to present the relevant filtering technologies.
ICRA has long moved away from the idea that we are the only people who should generate ICRA labels. On the contrary, we have taken steps to promote the hosting of label generators far and wide. A tool that generates NKRL/Milan model labels in addition to ICRA labels could very possibly be hosted in many places, including the ICRA website.
Along with ICRA and NCSR Demokritos (Greece), Optenet (Spain) is developing a platform through which a variety of filtering solutions can be integrated [7]. ICRAfilter is at the core, with other filtering modules, such as image analysers and list-based solutions, accessed through the PICS label bureau protocol. All the various modules act as proxy servers - as does the Milan filter - and integration there seems like a natural fit. Sift is ICRA's only current IAP project.
During the meeting, we discussed, at length, the issue of motivating content providers to include meta data. Governments and publishers have a vested interest in doing this but for the majority of webmasters - amateur or professional - there is little to be gained by adding highly descriptive meta data today. Over time, however, all that can be recorded about human knowledge and experience will exist electronically. In this not-too-distant scenario, meta data is going to become more and more critical.
This issue is well understood by the IT community. There are numerous meta data initiatives and standards, all of which reflect, in one way or another, a particular perceived need. The prime example is Dublin Core [8], a set of 15 meta data types designed primarily to describe books and documents, but it falls down when trying to describe, for example, a film or song.
RDF is being developed as the answer to this - a common syntax which enables different meta data schemas to be applied - but that's not the whole answer. Adding meta data is only useful if there's something that can read it. And the major search engines don't. To get a top ranking on Google you need to:
You can add all the keywords and descriptions you like but if Google takes any notice of them at all, they're very secondary. There's good reason for this - webmasters were forever adding irrelevant meta data in the hope of fooling the search engines.
There may be scope here for commercial exploitation, either through working with a major search engine - and for the time being Google really is the only show in town - or perhaps through setting up afresh.
ICRA has talked about this a great deal in the past but is hampered by its non-profit constitution. We have also considered setting up a commercial arm and this might still be possible.
PICS isn't enough. There is a PICS-based search engine/directory but it has few takers (and is entirely American) [9].
As a broad vision, the idea that the majority of the web's traffic is directed at resources that have meta data that includes NKRL, Dublin Core, ICRA ratings and more, all used by search engines that deliver what the user actually asked for is ... well, just that: a broad vision. Commercialisation might just be an important part of the effort to make it a reality. The key factor would be to deliver a positive reason for adding accurate and detailed meta data so that traffic to the site (or at least, relevant traffic) increased. At present, adding an ICRA label is often seen as providing a mechanism by which traffic can only be reduced (this is an oversimplification but holds some water).
Euforbia offers some very interesting and highly developed ideas about rating and filtering. The level of description available is far superior to anything a PICS rating could convey. However, as a rating and filtering solution for the limited purposes of child protection, this complexity works against it. Furthermore, the use of proprietary techniques discourages the development of tools by other parties.
There are some relatively easy ways in which ICRA and the Euforbia consortium could work together, but of prime importance, I would suggest, is that both groups become engaged in the project to define PICS-RDF. This would provide a better-defined platform through which both simple and complex rating vocabularies could be applied in a real world situation. Commercialising the system is then, perhaps, a more attractive proposition.
Phil Archer
Chief Technical Officer
Internet Content Rating Association
20/9/02
Filter Tobacco is an organization (soon-to-be non-profit) dedicated to giving parents the tools to protect their children from tobacco marketing on the Internet. Filter Tobacco aims to accomplish this goal in two ways: (1) by providing parents with free, customisable, and easy-to-use filtering software that will allow them to determine which tobacco-related sites their children should have access to; and (2) by encouraging websites that sell tobacco products or promote tobacco use to self-rate their websites using the ICRA system and take other steps to inhibit access by minors.
Filter Tobacco was established in March 2002 in memory of Salvador "Uncle Chava" Orozco, who began smoking as a teenager and died at age 44 of lung cancer.
Mr. Miller is a free-lance contractor. Currently, his main project is to establish a new science museum for economics, The Museum of Money and Financial Institutions, in Lower Manhattan. He is also a Fellow of the Information Society Project at Yale Law School, where he pursues research and writing on cyberlaw, intellectual property, and First Amendment issues. Mr. Miller directs the ISP's Metadata Initiative and is the lead editor for LawMeme, an interactive on-line weblog that delivers the latest technology law and policy news. Mr. Miller attended Yale Law School, where he was president and co-founder of the Law and Technology Society. He has worked for both the Center for Democracy and Technology and the Electronic Frontier Foundation. Prior to his legal career, Mr. Miller was an honors graduate of the U.S. Naval Academy and had served as an enlisted man in the U.S. Marine Corps, where he handled advanced cryptographic communications equipment. Mr. Miller built his first computer in high school by using a Motorola 8060 processor and programming it in assembly language. Salvador Orozco was Mr. Miller's uncle.
Mr. Ruiz is an undergraduate at Yale University majoring in Economics and Electrical Engineering. He currently serves as the Director of Information Technology for the Yale Law and Technology Society. Mr. Ruiz is the founder and President of the Yale Pre-Law Society and serves on the Executive Committee of the Yale Chapter of the IEEE. Mr. Ruiz is also an editor and technical director for LawMeme, an interactive weblog specializing in technology law and policy news. Mr. Ruiz has served as an intern to Senator Bob Graham's Office, the Florida State Attorney's Office, and the Microsoft Corporation.