Wikipedia talk:Category intersection

This is an old revision of this page, as edited by Vegaswikian (talk | contribs) at 08:23, 31 August 2006 (automatic generation vs the old hype for xml). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.



About this proposal

This proposal was started by User:Rick Block and User:SamuelWantman. The initial discussions leading to the proposal are in Archive 1. There was also discussion about the different options here.

Please leave comments about the proposal on this page. Thank you for your input.


Laurence Fishburne

I haven't read all of this, but shouldn't the Laurence Fishburne article in your example end up with categories such as "MASH" rather than "MASH Actors", and "Miami Vice" rather than "Miami Vice Actors"? If someone wants to know the actors in Miami Vice, they should be able to check the intersection of "Miami Vice" and "Actors". Just a thought. --Kbdank71 11:01, 30 August 2006 (UTC)

I think the counter-example here is an actor who later becomes associated with Miami Vice, but not in an acting role. For example, if a famous actor produced or directed Miami Vice. Looking at the "American people" example, you might similarly ask why you can't intersect "American" and "people". I think these sorts of cases will make it difficult to pin down the concept of a primary category, but I do like this proposal. I just haven't had time to read it again and decide which bits I like best. Carcharoth 11:09, 30 August 2006 (UTC)
It is possible to have the category "Actors" and the category "MASH". People would just need to realize that the intersection is not necessarily "people who acted in MASH", but is instead "People involved with MASH who also are Actors". I suspect this will get to be a commonly understood distinction. CFD will probably have discussions about whether a category should be replaced with an intersection or vice versa. -- Samuel Wantman 19:54, 30 August 2006 (UTC)
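
The distinction drawn in this thread can be sketched with plain Python sets. This is only an illustration under stated assumptions: the article names and memberships are invented placeholders, and `intersect` is a hypothetical helper, not part of the proposal or of MediaWiki.

```python
# Hypothetical category memberships, invented purely for illustration.
categories = {
    "Actors": {
        "Actor who appeared in MASH",
        "Actor who only directed MASH",
        "Actor unrelated to MASH",
    },
    "MASH": {
        "Actor who appeared in MASH",
        "Actor who only directed MASH",
        "Writer for MASH",
    },
}


def intersect(categories, *names):
    """Return the articles that appear in every one of the named categories."""
    sets = [categories[name] for name in names]
    return set.intersection(*sets) if sets else set()


# "People involved with MASH who are also actors", not "people who acted in MASH".
involved_actors = intersect(categories, "Actors", "MASH")
```

Under these assumed memberships the result contains the actor who only directed as well as the one who appeared on screen, which is why the intersection reads as "people involved with MASH who are also actors".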

Random notes

Some of these may have been mentioned already, but just in case....

Existing software that's good to be aware of:

  • The intersection mediawiki extension
    I don't have experience with this... you can try pinging the contributors (n:en:User:IlyaHaykinson, n:en:User:Amgine, w:de:Benutzer:Unendlich, m:User:Dangerman) and seeing if they have comments about real-world use (eg. is it slow? what sort of categorization schemes do they find work well?)
  • Duesentrieb's CatScan on the toolserver
    Lets you experiment with category intersection using the current categorization scheme. Allows searching to some limited depth, but it can get pretty slow
  • Semantic MediaWiki extension
    The ultimate point of where this proposal is going... lets you answer questions like "list all female mayors who were in office after 1950". If you're serious about wanting complicated UIs and such, you may want to look into when it might be feasible for it to be used on something like Wikipedia. (mp3 from wikimania 2006)
    They also mentioned that their queries can be pretty slow.

This proposal calls for reworking most of the categorization system. I wanted to note that at least some reworking is required to make categories useful for automated tools, because the current categorization system is set up to support only human use. Automated use of current categories (e.g. CatScan) can work up to a point, but currently, if you try to search as far as possible in an effort to get complete results, you get increasingly unrelated results the deeper you go. [1] [2] --Interiot 12:08, 30 August 2006 (UTC)

Thanks for the link to the Category Ladder tool. I'd been looking for something like that for ages! I was having to manually construct category trees, and this does it for you! Carcharoth 19:37, 30 August 2006 (UTC)
Yes, this proposal calls for reworking the categorization system. The German language Wikipedia has already repopulated primary categories and has removed intersection categories. I think it would be a difficult battle to get that to happen here before category intersection is implemented. Once implemented, more and more people will climb on board, and categories can be quickly depopulated, repopulated and redefined. -- Samuel Wantman 20:07, 30 August 2006 (UTC)
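
The depth-limited subcategory search mentioned earlier in this thread can be sketched as a breadth-first walk. This is a minimal sketch, not how CatScan is actually implemented: the `subcats` mapping, the category names, and the function `subcategories_within` are all hypothetical.

```python
from collections import deque

# Hypothetical subcategory graph: each key maps to its direct subcategories.
subcats = {
    "Root": ["Child A", "Child B"],
    "Child A": ["Grandchild"],
    "Grandchild": ["Loosely related topic"],
}


def subcategories_within(subcats, root, max_depth):
    """Map each category reachable from root to its depth, up to max_depth."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        if depths[cat] == max_depth:
            continue  # do not descend below the depth limit
        for child in subcats.get(cat, ()):
            if child not in depths:  # also guards against cycles in the graph
                depths[child] = depths[cat] + 1
                queue.append(child)
    return depths


shallow = subcategories_within(subcats, "Root", 1)
deep = subcategories_within(subcats, "Root", 3)
```

With these assumed data, the shallow search stops at the direct children, while the deeper search also reaches "Loosely related topic", the kind of drift described above.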

This is very interesting

I need time to think about it, but it looks like a great extension to the category functionality (with possible caveats of "people could be resistant to change" and "I wonder what the processor load would be?"). Syrthiss 12:13, 30 August 2006 (UTC)

Categories are currently implemented with a database, and the "large category" performance issue was addressed by limiting the search result. The intersection search results would be similarly limited, so I think the question boils down to how efficiently the underlying database can do intersection queries and how much such a feature might be used. In a real-time sense, I think an intersection query could be quite fast (Google seems to do OK, for example). It might be interesting to know what percentage of database queries are currently category-related. I don't know, but I'd guess not much (I'd guess there are way more history, watchlist, and previous version related queries than category queries). -- Rick Block (talk) 13:55, 30 August 2006 (UTC)
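
A limited intersection query of the kind discussed here might look like the following self-join, shown with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `category_links` table, its `page` and `category` columns, the sample rows, and the default limit of 200 are all invented for illustration and do not describe the actual MediaWiki schema or settings.

```python
import sqlite3


def intersection_query(conn, cat_a, cat_b, limit=200):
    """Return up to `limit` pages that belong to both categories."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.page
        FROM category_links AS a
        JOIN category_links AS b ON a.page = b.page
        WHERE a.category = ? AND b.category = ?
        ORDER BY a.page
        LIMIT ?
        """,
        (cat_a, cat_b, limit),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE category_links (page TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO category_links VALUES (?, ?)",
    [
        ("Article 1", "Actors"),
        ("Article 1", "MASH"),
        ("Article 2", "Actors"),
        ("Article 3", "MASH"),
        ("Article 4", "Actors"),
        ("Article 4", "MASH"),
    ],
)

both = intersection_query(conn, "Actors", "MASH")
first_only = intersection_query(conn, "Actors", "MASH", limit=1)
```

The `LIMIT` clause plays the role of the result cap mentioned above. How fast such a join runs on a real installation would depend on indexing and load, which this sketch does not measure.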

Support

I will have to check the implementation, but I wholeheartedly support this concept. Subcategorization constitutes, I dare say, 50% of the controversy at WP:CFD ... and boolean categories constitute the other 50%. But more importantly than making CFD easier, this will actually make categories useful to researchers -- which must be the guiding purpose of any project on Wikipedia. Right now, our category system is barely an improvement over traditional text searches. --M@rēino 14:03, 30 August 2006 (UTC)

I add my full support! -- SatyrTN (talk | contribs) 15:28, 30 August 2006 (UTC)

Query on intersecting

You have in your Laurence Fishburne example the categories "People from Georgia (US State)" and "People from Augusta, Georgia". Now forgive me if my US geography is wrong, but isn't it possible to extrapolate the former from the latter? If we put someone into "People from Augusta, Georgia", shouldn't it be possible to make the database populate Laurence Fishburne into "People from Georgia (US State)", and even "American people"?

Other than that, yes, it looks like a good idea, and would solve a lot of issues with categorisation. Be interested in a developer's comments. Well done! Steve block Talk 21:55, 30 August 2006 (UTC)

The issue with these categories, "People from Georgia (US State)" and "People from Augusta, Georgia", is how they are defined and how they will be used. We're assuming that people will want to find "Actors from Georgia" or "Politicians from Augusta" etc... So you need to have an intersection that arrives with the correct population of articles. The intersection will depend upon how the categories are defined. Right now, the subcategories of "American people" have people who are American citizens, but the "People from Georgia" are not necessarily citizens. Politicians from Augusta may be people who are neither citizens nor people originally from Georgia. The distinctions between these categories can and should be made clearer, and they might change when category intersection is implemented. It would be possible to categorize any person closely associated with a location with the location. So ex-mayor Willie Brown could be put in category:San Francisco not because he was born there, but because he is a person very much connected with the city. He might also be put in the categories "People born in Texas" and "American people". There are other ways to do this as well. These sorts of conversations need to happen with quite a few categories. The criteria for making decisions about this should be, "What will people be looking for?" and "What are the primary distinctions?". -- Samuel Wantman 23:38, 30 August 2006 (UTC)
Hmm. How would you stop Willie Brown showing up as a mayor of Texas? And I'm still unclear as to why it wouldn't be possible for the database to extrapolate back and see someone categorised as a mayor of San Francisco as being categorised as a mayor. Steve block Talk 00:00, 31 August 2006 (UTC)
The extrapolation is much more involved. We have not put that into this proposal. I'm thinking Willie might be categorized as "Mayors of San Francisco" (perhaps duplicated in "Mayors"), "People born in Texas", "San Francisco", "California Assemblyman", "American people", "Mineola, Texas", "Members of the California State Assembly", "People of African descent", "Politicians", "San Francisco State University alumni", "Alpha Phi Alpha brothers", "Freemasons", "1934 births", "Living people". This is more categories than the article currently has (my mistake using an undercategorized article as an example!). I don't think you can say that "San Francisco mayors" is the intersection of "San Francisco" and "Mayor". He is a San Franciscan and a Mineola, Texan who was a mayor. That doesn't make him mayor of both places. So "San Francisco Mayors" is not an intersection and would be a primary category. The value in this proposal is not that it is going to lower the number of categories of all articles. For many articles, like this example, there will be more categories. The value is that you will be able to find the intersection of these categories: "San Franciscan African-American members of the CA State Assembly", "Freemasons from Texas", "San Francisco St. Univ. alumni born in 1934", "Politician Alpha Phi Alpha brothers", etc... It might be possible to add "Category unions" as the next step in the process. Unions would remove the need for duplications. I don't think duplication between parent and children primary categories will be a big problem. Perhaps only 2 or 3 levels in the hierarchy if at all. To make this work, categories must be fully populated. The first option describes methods of keeping categories fully populated that are the union of other categories. This system could be used in the other options as well. -- Samuel Wantman 02:08, 31 August 2006 (UTC)
It might be possible to automatically "upward populate", but I think this would be a little tricky to get right. The focus of this proposal is intersections, and it basically doesn't address the issue of primary categories that are subsets of other primary categories (e.g. mayors of San Francisco and mayors, or people from Augusta, Georgia and People from Georgia). It's fairly clear that for intersections to work, these categories have to be fully populated. This proposal just doesn't include a solution. -- Rick Block (talk) 02:14, 31 August 2006 (UTC)
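
The "upward populate" idea raised in this thread could be sketched as a walk up a parent map. The proposal itself does not include this step, so everything here is an assumption: the `parents` mapping, the function `upward_populate`, and the choice of which categories count as subsets of which.

```python
# Hypothetical subset relationships between primary categories (child -> parents).
parents = {
    "Mayors of San Francisco": ["Mayors"],
    "People from Augusta, Georgia": ["People from Georgia (US State)"],
}


def upward_populate(direct_categories, parents):
    """Return the direct categories plus every ancestor implied by `parents`."""
    result = set()
    stack = list(direct_categories)
    while stack:
        cat = stack.pop()
        if cat in result:
            continue  # already handled; also protects against cyclic parent maps
        result.add(cat)
        stack.extend(parents.get(cat, ()))
    return result


populated = upward_populate({"Mayors of San Francisco", "People born in Texas"}, parents)
```

In this sketch an article placed in "Mayors of San Francisco" also lands in "Mayors", while "People born in Texas" implies nothing further because no parent is declared for it. Deciding which parent links are actually valid is the tricky part noted above, and the code does not settle it.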

Wikipedia:Category math feature

Have you seen Wikipedia:Category math feature? Not sure how similar this is, but there's a good consensus on the talk page Wikipedia talk:Category math feature#Straw poll, although drawn from a small pool which has grown over time. Steve block Talk 22:38, 30 August 2006 (UTC)

Prima facie support with question

Having just scanned the (seemingly very well-prepared) proposal, I think I'd support its development; the Laurence Fishburne examples suggest it might be very useful. I've printed it out and will read more carefully anon. One question for now, with apologies if I've scanned past the answer: Is there an intention to provide folk with (say) an option on their preference page to use or ignore the system...?
Looks like some sterling work!  Best wishes, David Kernow 01:12, 31 August 2006 (UTC)

We're not thinking of a user preference setting to use or ignore the system. The basic point is to replace manually updated "intersection" categories with automatically dynamically generated ones. -- Rick Block (talk) 01:58, 31 August 2006 (UTC)
Ah, okay; thanks for clarification. Regards, David 02:45, 31 August 2006 (UTC)

automatic generation vs the old hype for xml

In the dark ages when xml was first coming out, it was proposed by some as a way to search and get good hits. So you could search for, say, duck as recipe=duck to only get hits on duck when it was in a recipe. I always thought that was a needed improvement. So I see this proposal as a step in the right direction. It would be nice if it was dynamic rather than something generated. But I guess that will depend on performance, and something is better than nothing. Vegaswikian 03:02, 31 August 2006 (UTC)

I'm not sure I understand the distinction you're making between dynamic vs. generated. The intent is that these intersections reflect a database query executed at the time the page is requested to be viewed, i.e. the list of matching articles is generated dynamically from the current contents of the intersected categories. -- Rick Block (talk) 04:14, 31 August 2006 (UTC)
We are both looking for the same result then. I was a little worried by the use of automatically generated above. As far as I'm concerned, the result is dynamic. Vegaswikian 04:17, 31 August 2006 (UTC)

Can't wait...

Good work. I hope people won't bicker too long about the details of this proposal because such a feature is way overdue. I think it makes sense to fast-track this as soon as possible and then reflect in a few months on how well the proposed interface works so that we can make improvements on it. In that spirit, I support the proposal as is. Pascal.Tesson 07:53, 31 August 2006 (UTC)