Wikipedia talk:AutoWikiBrowser
Things to do
- Remove categories.
- Add spell checker.
- Ability to create list from searching the database dump.
- Disambiguation tools
- HTML entities to unicode
What does remove categories mean? Do you mean, remove deleted categories from a page? In addition, what is a "database dump"? I am currently working on a spell checker, though. — MATHWIZ2020 TALK | CONTRIBS 23:49, 5 January 2006 (UTC)
- Yes, that is what I mean about categories. The database dump can be found here; it is the entire Wikipedia database in XML format. How are you going about the spell checker? I have a plan to implement one fairly simply, but am waiting on Microsoft to release the next beta of WinFX. Martin 09:43, 6 January 2006 (UTC)
- I started a script to fix misspellings with navigation popups autoedits, but haven't worked on it in a while. I recently was thinking I could change the script to make an array of words from Wikipedia:Lists of common misspellings/For machines and correct them using the AWB - I just need more C# training. Why are you waiting for the next WinFX beta? I know it's the new Vista programming API, but how will that help you spell check? — MATHWIZ2020 TALK | CONTRIBS 20:48, 6 January 2006 (UTC)
- Because it has an extension to allow spell checking in textboxes and richtextboxes, so it would be really simple, and a lot better as it suggests the correct spelling just like a word processor does, which would be really cool. Martin 22:15, 6 January 2006 (UTC)
- Wow... there goes my project. (Does it have an "add" function to add words such as Wikipedia to the dictionary?) Even if you do get your hands on a copy of WinFX beta, how will you extend its benefits to all users? Microsoft is usually very protective. — MATHWIZ2020 TALK | CONTRIBS 22:22, 6 January 2006 (UTC)
- WinFX will be freely available, in the same way .NET is. I already have a beta, but AFAIK they haven't released the beta with this functionality yet. I can't remember when it will be released properly, but I'll certainly be ready! Martin 22:27, 6 January 2006 (UTC)
- Check out how to use Word's spellchecker from a C# application - without WinFX. — MATHWIZ2020 TALK | CONTRIBS 22:29, 6 January 2006 (UTC)
- Word must be installed on your computer for the above to work - what a pity. — MATHWIZ2020 TALK | CONTRIBS 22:30, 6 January 2006 (UTC)
- Or maybe we should wait for WinFX - read about how much easier that option is - and it doesn't require Word. — MATHWIZ2020 TALK | CONTRIBS 22:31, 6 January 2006 (UTC)
- In the website above, people left comments about testing the spellchecker - it seems as if it is already out. — MATHWIZ2020 TALK | CONTRIBS 22:33, 6 January 2006 (UTC)
- I have already made it interact with Word, but it is not very satisfactory. I don't think the public beta has the spell checker, but I'll have another look. Thanks, Martin 22:46, 6 January 2006 (UTC)
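The list-driven approach mentioned above (build a table of known misspellings and apply it to article text) can be sketched roughly as follows. This is only an illustration in Python, not AWB code: the function names are made up, and the `wrong->right` line format is an assumption about how the "For machines" list is laid out.

```python
import re

def parse_misspellings(text):
    """Parse lines of the assumed form 'wrong->right' into a dict."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if "->" not in line:
            continue
        wrong, right = line.split("->", 1)
        # Entries offering several corrections need a human decision; skip them.
        if "," in right:
            continue
        table[wrong.strip().lower()] = right.strip()
    return table

def fix_misspellings(text, table):
    """Replace whole-word misspellings, keeping a leading capital letter."""
    def repl(match):
        word = match.group(0)
        fixed = table.get(word.lower())
        if fixed is None:
            return word
        if word[0].isupper():
            return fixed[0].upper() + fixed[1:]
        return fixed
    return re.sub(r"[A-Za-z]+", repl, text)

# Hypothetical sample entries, for illustration only.
SAMPLE = "abandonned->abandoned\nrecieve->receive\nbeleive->believe, belive"
table = parse_misspellings(SAMPLE)
result = fix_misspellings("Recieve the abandonned cat; I beleive it.", table)
```

Ambiguous entries are left alone on purpose, since an automatic tool cannot know which correction was intended.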
AWB text reformatting clutters diffs
Hi there AWB developers. I've noticed that AWB-assisted edits have a habit of reformatting wikitext in addition to the noted changes in the edit summaries. It would be nice if the reformatting were performed as a separate edit before the intended change (with an edit summary like "reformatting wikitext"). This would lead to much clearer diffs and a better reflection of what was actually done to the article. None of this applies, of course, if this behavior has changed in more recent versions or if what I have seen is a result of the AWB user's actions and not the software itself. Mike Dillon 16:34, 5 January 2006 (UTC)
- The edit summary is always "AWB-assisted" followed by a phrase of the user's choice. Therefore, it is the user who wrote the incorrect edit summary, not the AWB itself. — MATHWIZ2020 TALK | CONTRIBS 22:48, 5 January 2006 (UTC)
- How is it "incorrect" if the user doesn't know it's happening or doing it intentionally? I don't have access to AWB since I have no Windows machine to run it, so I don't know if they see a diff before saving or if they have to request it just like on the primary web interface. Does a user really know that AWB realphabetized the categories? Is that an automatic behavior, or did the editor I observed do this intentionally and neglect to note it? Not to sound like I'm on a witch hunt or something, as the diff issue is a pretty minor inconvenience. I would say that AWB should strive toward being neutral on the original wikitext formatting, as far as possible. Reformatting is an excellent functionality to expose to the user by choice, but not if they aren't aware of it. Mike Dillon 03:59, 6 January 2006 (UTC)
- Diff by default, yes.--SarekOfVulcan 04:10, 6 January 2006 (UTC)
- The user is 100% aware of all changes before saving, and they can introduce any extra changes they want. Martin 09:48, 6 January 2006 (UTC)
- It has been tweaked a bit more recently as well, so doesn't make quite as many changes. Martin 23:36, 5 January 2006 (UTC)
- How? I know it doesn't fix dates anymore, but what other features have been removed? — MATHWIZ2020 TALK | CONTRIBS 23:49, 5 January 2006 (UTC)
- I tweaked it so it didn't need to remove spaces before and after ==, which it did before to make other fixes easier. Martin 00:11, 6 January 2006 (UTC)
- Good to hear. Mike Dillon 03:59, 6 January 2006 (UTC)
- I noticed that tweak when I was reviewing the code for the first time, but I didn't know that was new recently. In addition, when I was reviewing the code, you seemed to use * and ? differently than listed at Regex. The article says ? matches 0 or 1 recurrences of the character, and * 0 or more, but somewhere, you used a *?. This leads me to believe that, in C#, * means 1 or more, the equivalent of + in most systems. Is this correct? — MATHWIZ2020 TALK | CONTRIBS 20:48, 6 January 2006 (UTC)
- The *? is a single regex atom. It means 0 or more, but it says to use stingy matching instead of the default greedy matching of *. Unfortunately, the Regex article doesn't address greediness, but basically, a greedy regex will match as many characters as possible until it fails, while a stingy regex will match only until the atom that follows can match. There is a better explanation of greediness in Chapter 4 of the canonical Mastering Regular Expressions (search for "greedy" in the text). Mike Dillon 03:23, 7 January 2006 (UTC)
- P.S. Search for "laziness" and "non-greedy" to get the explanation of lazy/stingy regexes, or better yet, read the whole thing ;) Mike Dillon 03:31, 7 January 2006 (UTC)
- Thanks for the link Mike! I need to read up on Regexs. Martin 11:49, 7 January 2006 (UTC)
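The greedy versus stingy distinction described above can be seen in a small experiment. The sketch below uses Python's `re` module rather than C#; for `*`, `?` and `*?` the two engines behave the same way, but the sample text is made up for illustration.

```python
import re

text = "== History == and == Notes =="

# Greedy: .* grabs as much as it can, then backs off to the LAST "==".
greedy = re.search(r"==(.*)==", text).group(1)

# Stingy (lazy): .*? stops as soon as the following "==" can match.
lazy = re.search(r"==(.*?)==", text).group(1)

# In both forms * still means "zero or more": b*? happily matches no b at all.
zero_ok = re.match(r"ab*?c", "ac") is not None
```

Here `greedy` captures everything between the first and the last `==`, while `lazy` captures only the first heading text.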
another thing
Another thing that should be added is the ability to change categories with a modifier after them, for example {{Category:Wikipedians in the United States|Jtkiefer}}. The way AWB currently handles them, if I tried to change that over to, say, {{Category:Wikipedians}} or {{Category:Wikipedians|Jtkiefer}}, I'd end up with something like {{category|Jtkiefer}}, which is an unusable category and causes problems. JtkieferT | C | @ ---- 23:41, 5 January 2006 (UTC)
- I will work on that with Martin. Thanks for notifying me of the problem! — MATHWIZ2020 TALK | CONTRIBS 23:49, 5 January 2006 (UTC)
- It doesn't have any logic to remove keys, but otherwise it handles that fine, see this. Thanks, Martin 23:52, 5 January 2006 (UTC)
- Thanks for the sandbox demonstration, but what's a "key"? — MATHWIZ2020 TALK | CONTRIBS 00:04, 6 January 2006 (UTC)
- The key is the bit after the pipe " | ". If a category has a key, it is sorted alphabetically by its key and not by its name. Martin 00:13, 6 January 2006 (UTC)
- Oh - I always just referred to that as the modified page name. — MATHWIZ2020 TALK | CONTRIBS 00:24, 6 January 2006 (UTC)
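Renaming a category while keeping its sort key, as discussed above, can be sketched like this. It is an illustration in Python, not AWB's implementation: `rename_category` is a made-up name, and the code assumes the usual `[[Category:Name|key]]` link form in the wikitext.

```python
import re

def rename_category(wikitext, old, new):
    """Swap category `old` for `new`, preserving any |sort key."""
    pattern = re.compile(
        r"\[\[\s*Category\s*:\s*" + re.escape(old) + r"\s*(\|[^\]]*)?\]\]",
        re.IGNORECASE,
    )
    # group(1) is the optional "|key" part; keep it unchanged when present.
    return pattern.sub(
        lambda m: "[[Category:" + new + (m.group(1) or "") + "]]",
        wikitext,
    )

with_key = rename_category(
    "[[Category:Wikipedians in the United States|Jtkiefer]]",
    "Wikipedians in the United States",
    "Wikipedians",
)
without_key = rename_category(
    "[[Category:Wikipedians in the United States]]",
    "Wikipedians in the United States",
    "Wikipedians",
)
```

Capturing the key as an optional group means the same replacement works for categories with and without a key.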
Open Source
Due to the recent success of Firefox, OpenOffice, and other open source programs, I was wondering what the general consensus would be on making the AWB open source. I could release the source code and then make a page, maybe Wikipedia talk:AutoWikiBrowser/Open source, where anyone could request features, and developers could post code. If I implemented such a plan, I would also make available to developers an extensive list of the changes in each version of the AWB. Any ideas? — MATHWIZ2020 TALK | CONTRIBS 23:49, 5 January 2006 (UTC)
- At the moment we make people register to avoid anyone abusing the software, and being completely open would make that impossible. Any features can be requested here. It would be cool to have more people developing it though. Martin 23:52, 5 January 2006 (UTC)
- Okay, I understand. It would make the code more susceptible to abuse. — MATHWIZ2020 TALK | CONTRIBS 00:04, 6 January 2006 (UTC)
Feature requests
I have two requests. Every now and then I notice the alert about a long article having stub status. Could there be a button (or something) that would quickly remove all the stub templates. I hate scrolling through the text and finding it.
Secondly, this tool works wonderfully for fixing typos. I imagine it could do the same for disambiguating pages, but I'm really not sure how it would work. The idea I have is that the program receives the different terms from the user (which he collected from the disambiguation page). After getting the list of pages linking to the DAB page, the user goes through each one. If it's the first term (say, pop music), he clicks on it (or maybe presses a hotkey) and the link changes to that term, say [[pop]] -> [[pop music|pop]]. Clicking on option two (pop art) results in changing [[pop]] -> [[pop art|pop]]. I hope I've explained myself alright, it's difficult to describe. Let me know if you have questions or any suggestions. Oh, and welcome back! :) Gflores Talk 07:48, 6 January 2006 (UTC)
- Tools for disambiguation are a really good idea. At the moment I am working on scanning the database dump, but this will probably be my next target (that and introducing a spell checker, but I am waiting on Microsoft for that). Martin 13:17, 6 January 2006 (UTC)
- If you use navigation popups, you can access a similar feature via the popup. If you hover over a link to a disambig page, the bottom of the popup lists all the links on the page. Clicking on one replaces the link [[Pop]] with, e.g., [[Pop music|Pop]] (it adds the correct page while keeping the text seen the same). — MATHWIZ2020 TALK | CONTRIBS 20:48, 6 January 2006 (UTC)
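The link rewrite behind such a disambiguation tool can be sketched as below. This is a hypothetical helper in Python, not part of AWB or navigation popups, and it matches the title exactly (it does not handle the first-letter case folding that wiki titles allow).

```python
import re

def retarget_link(wikitext, dab_title, target):
    """Point [[dab_title]] and [[dab_title|label]] at `target`,
    keeping the text the reader sees unchanged."""
    pattern = re.compile(
        r"\[\[" + re.escape(dab_title) + r"(?:\|([^\]]*))?\]\]"
    )
    def repl(m):
        # With no existing label, the old title becomes the label.
        label = m.group(1) if m.group(1) is not None else dab_title
        return "[[" + target + "|" + label + "]]"
    return pattern.sub(repl, wikitext)

plain = retarget_link("He painted [[pop]] pieces.", "pop", "pop art")
piped = retarget_link("A [[pop|Pop]] single.", "pop", "pop music")
```

A real tool would call something like this once per page, after the user has picked the intended target for that page.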
Minor regex request: I'm looking for a regular expression to fix bad links. Essentially, it needs to look for links in this form... [[http://www.abc.com]] and change it to this [http://www.abc.com]. Same with [[http://www.abc.com link]]. Sometimes, links are like this [[http://www.abc.com|link]]. This needs to be changed accordingly to [http://www.abc.com link]. Any help is appreciated. I read a little about regex and have used this in AWB... \[\[([Hh]ttp:[^\]\]]+)]] However, it doesn't handle the latter case (the '|') and may find false positives. If you have time. Thanks. Gflores Talk 18:04, 6 January 2006 (UTC)
- That can be completed with some regex. I'll work on the code. — MATHWIZ2020 TALK | CONTRIBS 20:48, 6 January 2006 (UTC)
- I just wanted to say thanks for working on this item specifically. Currently the bad link cleanup process takes many hours for each dump and this would speed up the task considerably. --PS2pcGAMER (talk) 22:30, 6 January 2006 (UTC)
- Note to Martin - try:
replace \\[\\[http:\\/\\/(.*)\\]\\] with [http://$1]
replace \\[http:\\/\\/(.*)\\|(.*)\\] with [http://$1 $2]
- This removes the double [ from http links, and changes the pipe to a space. It finds all links beginning with http://, which means it will also do this to links to articles such as [[http://]]. In addition, it will not fix links beginning with hTTP - I did this since, if you try [1], Wikipedia does not recognize it as a link. Wikipedia only recognizes external links that begin with an all-lowercase http, but the regex could be easily tweaked to fix any case. — MATHWIZ2020 TALK | CONTRIBS 17:59, 7 January 2006 (UTC)
- Martin - I just tried the above regex. I added it between lines 58 and 60 in Parsers.cs, and it works. — MATHWIZ2020 TALK | CONTRIBS 19:18, 7 January 2006 (UTC)
- Another note - you have to have the two regex replaces listed in the order above. For example, if you have [[2]] and do regex replace one and then two, you get Google - in the reverse order, you still have [3]. — MATHWIZ2020 TALK | CONTRIBS 22:57, 7 January 2006 (UTC)
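The two-step external link repair described above can be tried out with a short Python sketch. It is not the code that went into Parsers.cs: the function name is made up, negated character classes are used in place of (.*) so that a match cannot run across several links on one line, and accepting https as well as http is an addition.

```python
import re

def fix_external_links(wikitext):
    """Repair [[http://...]] style links in two ordered steps."""
    # Step 1: drop the doubled brackets around an external link.
    text = re.sub(r"\[\[(https?://[^\[\]]*)\]\]", r"[\1]", wikitext)
    # Step 2: a pipe directly after the URL becomes a space.
    text = re.sub(r"\[(https?://[^\s\[\]|]*)\|([^\[\]]*)\]", r"[\1 \2]", text)
    return text

bare = fix_external_links("[[http://www.abc.com]]")
spaced = fix_external_links("[[http://www.abc.com link]]")
piped = fix_external_links("[[http://www.abc.com|link]]")
both = fix_external_links("[[http://a.com|A]] and [[http://b.com]]")
```

As noted above, the order matters: the pipe fix in step 2 only looks at single-bracket links, so it relies on step 1 having run first.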
- Wikipedia:Bad links also has some bad characters for internal links. I have developed these regexs to fix them:
fixes double space: replace \\[\\[(.*)  (.*)\\]\\] with [[$1 $2]]
fixes space at beginning: replace \\[\\[ (.*)\\]\\] with [[$1]]
fixes space before "#": replace \\[\\[(.*) #(.*)\\]\\] with [[$1#$2]]
fixes double underscore: replace \\[\\[(.*)__(.*)\\]\\] with [[$1_$2]]
fixes underscore at beginning: replace \\[\\[_(.*)\\]\\] with [[$1]]
fixes underscore before "#": replace \\[\\[(.*)_#(.*)\\]\\] with [[$1#$2]]
- I just tested them, putting them after the two lines above. The reason why I have separate regexs for spaces and underscores is because I don't want to change a link such as January__1#External_links to January 1#External_links - I want the link to use all spaces or all underscores. — MATHWIZ2020 TALK | CONTRIBS 23:06, 7 January 2006 (UTC)
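The internal link fixes above can also be sketched as one pass that tidies only the link target. This is an illustration in Python with made-up names, not the code tested in AWB. It differs from the listed patterns in that it isolates each link first, because a greedy (.*) can stretch across several links on the same line.

```python
import re

# One [[target]] or [[target|label]] link; target and label are captured separately.
LINK = re.compile(r"\[\[([^\[\]|]*)((?:\|[^\[\]]*)?)\]\]")

def tidy_target(target):
    """Apply the space and underscore fixes to a link target only."""
    target = re.sub(r" {2,}", " ", target)    # double space -> single space
    target = re.sub(r"_{2,}", "_", target)    # double underscore -> single underscore
    target = re.sub(r"^[ _]+", "", target)    # space or underscore at the beginning
    target = re.sub(r"[ _]+#", "#", target)   # space or underscore before "#"
    return target

def fix_internal_links(wikitext):
    """Tidy every internal link target, leaving labels untouched."""
    return LINK.sub(
        lambda m: "[[" + tidy_target(m.group(1)) + m.group(2) + "]]",
        wikitext,
    )

a = fix_internal_links("[[January__1#External_links]]")
b = fix_internal_links("[[ Foo  bar #Baz|x  y]]")
c = fix_internal_links("[[_Foo_#Bar]] and [[Plain link]]")
```

Spaces and underscores are collapsed separately, so a link keeps whichever separator it already uses.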
Another request: according to Wikipedia:Manual of Style (headings), the sections should be in the following order at the end:
- See also
- Notes
- References
- External links
or
- See also
- References
- Notes
- External links
I know the AWB separates the categories, language links, FA templates, and Persondata templates and puts them in the correct order - could you do the same with the above sections, i.e., could you write code to separate them and then put them in the correct order? Thanks. — MATHWIZ2020 TALK | CONTRIBS 21:09, 7 January 2006 (UTC)
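A heading sorter along these lines could look roughly like the Python sketch below. The names are hypothetical and it is only a sketch under assumptions: headings are matched by exact title, only the run of recognised level-2 sections at the very end of the page is reordered, and categories and language links are assumed to have been separated out beforehand.

```python
import re

# First order listed above; pass a different list for the other accepted order.
ORDER = ["See also", "Notes", "References", "External links"]

# A level-2 heading such as "== See also ==" (level 3 and deeper are ignored).
HEADING = re.compile(r"^==[ \t]*([^=].*?)[ \t]*==[ \t]*$", re.MULTILINE)

def sort_end_sections(wikitext, order=ORDER):
    """Reorder the trailing run of known sections to follow `order`."""
    heads = list(HEADING.finditer(wikitext))
    # Walk back from the last heading while the titles are recognised.
    start = len(heads)
    while start > 0 and heads[start - 1].group(1) in order:
        start -= 1
    tail = heads[start:]
    if len(tail) < 2:
        return wikitext  # nothing to reorder
    bounds = [h.start() for h in tail] + [len(wikitext)]
    sections = []
    for i, h in enumerate(tail):
        body = wikitext[bounds[i]:bounds[i + 1]]
        if not body.endswith("\n"):
            body += "\n"  # keep sections from running together after the move
        sections.append((order.index(h.group(1)), body))
    sections.sort(key=lambda s: s[0])
    return wikitext[:bounds[0]] + "".join(body for _, body in sections)

src = ("Intro\n== History ==\nh\n== External links ==\ne\n"
       "== See also ==\ns\n== References ==\nr\n")
out = sort_end_sections(src)
```

Sections that are not in the list, such as History here, are never moved.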
1.6.2
FYI: There is a 1.6.2 listed under the list of changes, but the check page doesn't show this version as enabled. — MATHWIZ2020 TALK | CONTRIBS 17:28, 7 January 2006 (UTC)
In addition, I was looking through the code of 1.6, and, in AboutBox.cs, on line 131, there is a typo: guidlines should be guidelines. In AssemblyInfo.cs, the copyright date should be 2006 on line 13. Can I have the source for 1.6.2? — MATHWIZ2020 TALK | CONTRIBS 19:08, 7 January 2006 (UTC)
- Sure, I'm busy at the moment, but I'll get all of the above sorted tomorrow evening, thanks for the regexs! Martin 22:21, 7 January 2006 (UTC)
- I understand - we're all busy at some time or another. Thanks for all your work on the AWB, and especially for returning! — MATHWIZ2020 TALK | CONTRIBS 22:58, 7 January 2006 (UTC)
1.7
Note to all: a new version (1.7?) is on its way! I have already added the following to it:
- Heading sorter
- Bad link repair (both internal and external)
- Improved script to see if user is logged in
- Other minor changes (e.g., updated copyright year, typos)
I e-mailed the source to Martin, who will make a few changes of his own (I'm not sure what they will be yet). Depending on how long Martin takes, it should be out today or tomorrow. — MATHWIZ2020 TALK | CONTRIBS 19:34, 8 January 2006 (UTC)
- Assuming we can fix some technical difficulties. — MATHWIZ2020 TALK | CONTRIBS 22:01, 8 January 2006 (UTC)