Wikipedia talk:WikiProject Molecular Biology/Gene Wiki/Archive 3

Gene Wiki – Discussion


Duplicates?

Hi! Should File:PBB GE MYH6 204737 s at fs.png and File:PBB GE MYH7 204737 s at fs.png be duplicates? And File:PBB GE PSG3 209738 x at fs.png and File:PBB GE PSG6 209738 x at fs.png? --MGA73 (talk) 19:27, 5 March 2010 (UTC)

Hi MGA73... Yes, you've hit on a slight oversight when we uploaded those images. There is generally a one-to-one relationship between gene symbols (e.g., "MYH6" in your first example above) and probe sets (e.g., "204737_s_at"), but in a few cases that relationship is many-to-one. And that leads to duplicate images with slightly different names. Unfortunately, there are independent links to those two images (from MYH6 and MYH7, respectively) so I don't think we can just delete them. If I had access to more programmer resources I'd get a bot made to correct these duplications. But, absent that, I'm open to suggestions. Thankfully, I don't think these imperfections are so common/impactful that they have a strong negative impact on Wikipedia as a whole. Do you agree? Cheers, AndrewGNF (talk) 19:40, 5 March 2010 (UTC)
Actually gene stuff is not my strong side. I just thought it was a mistake during transfer to Commons. It is possible to use a redirect to avoid duplicate images. But if more pages "use" the same image it would be a good idea to fix the description of the image to say that the image works for both MYH6 and MYH7 or whatever :-) --MGA73 (talk) 20:01, 5 March 2010 (UTC)
Sadly, it was our original mistake and not one that was introduced in the transfer. I've added it to our list of things to do as soon as we can find the bandwidth! Cheers, AndrewGNF (talk) 23:28, 5 March 2010 (UTC)
That sounds good. Once you have a plan you could consider talking to User:Multichill. Perhaps he can be of assistance. --MGA73 (talk) 19:57, 7 March 2010 (UTC)
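The many-to-one relationship described above can be checked mechanically. Below is a minimal sketch, assuming the symbol-to-probe-set mapping is available as a list of pairs. The four MYH/PSG pairs come from the file names in this thread; `GENE_A`, its probe set, and the function name are made up for illustration and are not part of any bot code.

```python
from collections import defaultdict

# (gene symbol, probe set) pairs. The MYH and PSG entries are taken from the
# file names discussed in this thread; GENE_A is a made-up one-to-one entry.
pairs = [
    ("MYH6", "204737_s_at"),
    ("MYH7", "204737_s_at"),
    ("PSG3", "209738_x_at"),
    ("PSG6", "209738_x_at"),
    ("GENE_A", "000000_at"),
]


def shared_probe_sets(pairs):
    """Return {probe_set: sorted gene symbols} for probe sets mapped to 2+ genes."""
    genes_by_probe = defaultdict(set)
    for gene, probe in pairs:
        genes_by_probe[probe].add(gene)
    return {
        probe: sorted(genes)
        for probe, genes in genes_by_probe.items()
        if len(genes) > 1
    }


duplicates = shared_probe_sets(pairs)
for probe, genes in sorted(duplicates.items()):
    print(probe, "->", ", ".join(genes))
```

Each probe set reported this way corresponds to one image that was uploaded under several gene-specific file names, which is the group a redirect or a shared description would need to cover.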


Just found my way to Template:PDB Gallery, um the description there could do with some expansion! Was wondering if I could add it to huntingtin, for example? Lee∴V (talkcontribs) 01:01, 23 March 2010 (UTC)

The first crystallographic structures of huntingtin were published fairly recently (September of 2009). This date was after the last manual upload of PDB graphics. Unfortunately there is no automatic method for uploading new structures, so if you want a graphic of a new structure, you will need to upload it yourself. I have uploaded a graphic for huntingtin and added it to {{PBB/3064}} which is transcluded into the huntingtin article. There are six additional structures; however, at the resolution of the graphics displayed, these look virtually identical to the first one, so I don't think it would be worth the trouble of including them in a picture gallery. PDB links to these additional structures are included in the template, so anyone who is interested can easily find them. Cheers. Boghog (talk) 05:40, 23 March 2010 (UTC)
Many thanks - once again - Boghog! I still think the template notes should be expanded a little - at least something like 'this template is used to display a gallery of PDB images, please ask at the gene portal for further info...' Lee∴V (talkcontribs) 12:12, 23 March 2010 (UTC)
I added a note to the template docs -- just a simple link to ask more here. You're the first one to inquire, so let's wait to see if there's more people who are interested before we invest a lot of work in adding more info. Cheers, AndrewGNF (talk) 16:09, 23 March 2010 (UTC)
Heya, I've just been thinking about the PDB gallery. Almost all the examples of an 'Infobox protein family' have a list of pdb structures at the bottom. Is it worth thinking about doing something similar for those pages? I reckon it would look pretty good :) Abergabe (talk) 13:33, 13 August 2010 (UTC)
Sounds perfectly reasonable and sensible to me. I'd say mock one or two of the changes up so we all can comment on it. Cheers, AndrewGNF (talk) 14:39, 13 August 2010 (UTC)

I thought I'd point out that the gallery generated by this Template is broken on the Thrombin page. Not sure if it's a problem with the template itself or with the way it was implemented, but it produces a mess of code on the page (though fortunately hidden until you expand the section). I'm going to comment out the template but leave it there for someone to fix. GiftigerWunsch [TALK] 13:06, 17 May 2010 (UTC)

Apparently there are some characters in some of the graphic captions that interfered with the template. I have commented these out and the {{PDB_Gallery/2147}} template now displays properly. Thanks for alerting us to the problem. Cheers. Boghog (talk) 14:51, 17 May 2010 (UTC)

In the current implementation, the GO sub-box contains links to the individual GO terms, but not to the protein itself. Would it be possible to add this link? E.g. Aurora A kinase should link here. MichaK (talk) 09:50, 27 May 2010 (UTC)

Hi MichaK, yes, that makes a lot of sense. The question though is where to put the link. What do you (and others) think of this prototype? Cheers, AndrewGNF (talk) 04:15, 28 May 2010 (UTC)
Hi Andrew, great! I think the link caption needs to be more descriptive. Perhaps "Source: Amigo" or "view ITK at Amigo"? best, MichaK (talk) 15:44, 28 May 2010 (UTC)
Good suggestion and reasonable looking prototype. I have added a second link to the EMBL-EBI ontology entry and modified the format to add "source". How does it look now? Boghog (talk) 18:00, 28 May 2010 (UTC)
This looks good. I can't decide though if it's confusing to switch categories: Cellular Component / Biological Function / Source doesn't seem right. But it's more consistent with the layout. I'm fine either way (two columns or colspan=2). Thanks! MichaK (talk) 20:26, 28 May 2010 (UTC)
I tend to agree with MichaK that Source doesn't quite fit with the other section headings. I made this change that tried to merge the best of both. Thoughts? Cheers, AndrewGNF (talk) 22:53, 28 May 2010 (UTC)
The latest version looks great. I was trying to make the format look more consistent, but I would agree that "source" is not an ontology category. Boghog (talk) 08:19, 29 May 2010 (UTC)
I'm in favor of the latest version. MichaK (talk) 06:25, 31 May 2010 (UTC)

While we are at it, I would like to propose that the available structure section of the {{GNF_Protein_box}} be made collapsible. The list of PDB links can be very long (see HBA1 for example) and can overwhelm the rest of the content in the infobox. I have edited the sandbox in this change. The code is a bit messy, but it seems to work (see prototype). Does this look OK? Boghog (talk) 08:19, 29 May 2010 (UTC)

Yes, absolutely. You did the same in Pfam box. Looks great. As a critical comment, I do not like the expression profiles by ProteinBoxBot. Did anyone ever look at them? See this profile for rhodopsin: [1]. It shows Parietal lobe. It does not even show eye, not to mention retina, where it is actually located. Biophys (talk) 15:47, 30 May 2010 (UTC)

Amazing work

Hi all, the work you are doing here is amazing. I am a PhD student in Bioinformatics, as well as a Wikipedia-administrator-on-retirement, and I'm very much interested in the details on how you've designed bots to create the gene records, and where the data is coming from. Who should I talk to about this topic? :-) Cheers, Venullian (talk) 14:04, 8 June 2010 (UTC)

Hi Venullian, there are so many people involved in the project (in the WP spirit) that the best forum is probably this talk page. I suppose I initiated the project way-back-when and am now actively thinking about how to mine the contributed data, so you're welcome to ask questions of me directly (by talk or email) if you like. But really, it's pretty much a self-sustaining machine now based on the efforts of many wikipedians. (You can see some recent edit statistics in this recent paper...) Cheers, AndrewGNF (talk) 15:54, 8 June 2010 (UTC)
Great, thanks Andrew! I was hoping there would be a citation :-) I'll read that first and might e-mail you when I have some remaining questions. Keep up the great work anyways! Cheers, Venullian (talk) 04:53, 10 June 2010 (UTC)

Biological wikis

I started List of biological wikis, but it needs improvements if anyone is interested.Biophys (talk) 17:00, 13 July 2010 (UTC)

Gene Wiki project template?

Is there a template that provides an infobox/link to the Gene Wiki portal project? I was just noticing that from a gene's article or discussion page there's no way to navigate to the main Gene Wiki portal page. Such a thing would help spread the word about the project and attract potential contributors. I'm imagining something like {{Wikiproject Gene|class=Stub|importance=Low}} which would expand to a box stating something like "This article is within the scope of the Gene WikiProject. To participate, visit the WikiProject for more information...". I'd recommend something like this appear at the bottom of every gene's page as well as at the top of the gene's discussion page. Kudos++ on this project, btw! SteveChervitzTrutane (talk) 02:17, 21 August 2010 (UTC)

Hi Steve, I don't know about adding it to the bottom of gene pages in the main namespace, but certainly a banner on the talk page would be a great idea. I've added it to the list of things to do (which, sigh, is growing much too long...) Cheers, AndrewGNF (talk) 16:11, 23 August 2010 (UTC)
OK, how does this look? Boghog (talk) 21:19, 25 August 2010 (UTC)
I like it a lot. I don't know that we need a whole separate class/importance rating than the MCB banner, but I'm not opposed to it either. Nice! Cheers, AndrewGNF (talk) 23:02, 27 August 2010 (UTC)

SWL

A pilot project for integrating semantic wikilinks (SWLs) into gene articles, listed at User:ProteinBoxBot/Ideas, seems interesting. The first two bullet points suggest finding some facts about a set of genes and encoding them with SWLs. There are several databases from which such information could be gathered en masse, including MIPS, IntAct, BioGRID and others.

However, the quality of these datasets has been brought into question -- see for example Literature-curated protein interaction datasets, 2009. The maintainers of the MINT repository have done some work to address concerns about reliability by introducing a scoring system for interaction confidence (MINT, the molecular interaction database: 2009 update). Would MINT be the most promising dataset to use for encoding semantic wikilinks? Given concerns about protein interaction datasets, what should be the threshold for reliability for encoding a SWL? Would the SWLs only consider interactions between human proteins? Emw (talk) 03:28, 24 August 2010 (UTC)

Hi Emw... Good points raised. For the mass seeding of protein interactions, we used BioGRID and filtered for interactions that were supported by either two publications or two methods. But agreed, the (ir)reproducibility of protein interaction data definitely gives me pause. I'm certainly open to discussion of the best resources to use here...
Can I suggest that the PPI data issue is a separate one from the use of {{SWL}} ? I've been stewing a lot over SWL's recently, and still with high enthusiasm. While creating the "PPI" type of SWLs was convenient to start, I think there are probably better examples to demonstrate the utility. Or at least more complex examples. To demonstrate the proof of concept, I'm thinking of creating the SWLs that allow for a multi-faceted query like "Show me all genes that were associated to type 2 diabetes via GWAS, that are also kinases, and that also localize to the plasma membrane." Or something like that. This is definitely an area that can use more brainstorming... Cheers, AndrewGNF (talk) 06:08, 24 August 2010 (UTC)
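The filter mentioned above (keep an interaction only if it is supported by two publications or two methods) could look roughly like the sketch below. This is an illustration only, not the script that was actually used for the seeding: the record layout, the protein names, the PubMed IDs and the function name are all assumptions, and a real BioGRID export has many more columns.

```python
from collections import defaultdict

# Hypothetical records: (protein A, protein B, PubMed ID, detection method).
records = [
    ("P1", "P2", "pmid:1", "two-hybrid"),
    ("P2", "P1", "pmid:2", "two-hybrid"),        # same pair, second publication
    ("P1", "P3", "pmid:3", "two-hybrid"),
    ("P1", "P3", "pmid:3", "affinity capture"),  # same paper, second method
    ("P2", "P3", "pmid:4", "two-hybrid"),        # one paper, one method
]


def well_supported(records, min_support=2):
    """Return the set of protein pairs backed by >= min_support papers or methods."""
    pubs = defaultdict(set)
    methods = defaultdict(set)
    for a, b, pmid, method in records:
        pair = frozenset((a, b))  # treat A-B and B-A as the same interaction
        pubs[pair].add(pmid)
        methods[pair].add(method)
    return {
        pair
        for pair in pubs
        if len(pubs[pair]) >= min_support or len(methods[pair]) >= min_support
    }


kept = well_supported(records)
for pair in sorted(sorted(p) for p in kept):
    print(" - ".join(pair))
```

Raising `min_support`, or requiring both conditions instead of either, would be one way to tighten the reliability threshold asked about above.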
I agree that the issue of PPI dataset reliability can be separated from the issue of how semantic data should be linked. For the latter, I think an obstacle is determining the controlled vocabularies, or ontologies, to use for describing relationships for genes across multiple domains: molecular function, cellular components, health and experiments.
Existing ontologies, like those used by the Gene Ontology project, seem like they could fulfill part of the need (and conveniently already have some presence in many Gene Wiki articles). In your query, for example, I believe existing GO terms could be used to determine genes that are both kinases (molecular function GO:0016301) and localized to the plasma membrane (cellular component GO:0005886). The domain of health is outside the Gene Ontology project's scope, but supplemental (and possibly less stable) ontologies for human disease exist, like this human disease ontology from the OBO Foundry. The OBO Foundry also has ontologies for experiments, which may be useful for determining the type of evidence used to determine a gene's association with a health condition. A list of OBO ontologies sorted by domain is available at http://www.obofoundry.org/index.cgi?sort=domain&show=ontologies. Emw (talk) 02:54, 25 August 2010 (UTC)
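As a rough illustration of how such a multi-faceted query reduces to a set operation once the annotations exist, here is a minimal sketch. The two GO identifiers are the ones quoted above; the gene names, the `GWAS:` label format and the function name are hypothetical and do not describe how SWLs are actually stored.

```python
# Hypothetical annotations: gene -> set of controlled terms.
# GO:0016301 (kinase) and GO:0005886 (plasma membrane) are the terms named
# above; the "GWAS:..." label is a made-up stand-in for a disease annotation.
annotations = {
    "GENE_A": {"GO:0016301", "GO:0005886", "GWAS:type 2 diabetes"},
    "GENE_B": {"GO:0016301", "GWAS:type 2 diabetes"},
    "GENE_C": {"GO:0016301", "GO:0005886"},
    "GENE_D": {"GO:0005886", "GWAS:type 2 diabetes"},
}


def genes_with_all(annotations, required):
    """Return the sorted genes whose annotation set contains every required term."""
    required = set(required)
    return sorted(gene for gene, terms in annotations.items() if required <= terms)


hits = genes_with_all(
    annotations,
    ["GWAS:type 2 diabetes", "GO:0016301", "GO:0005886"],
)
print(hits)
```

One limitation of this sketch: GO is hierarchical, so a gene annotated only with a more specific child term would be missed unless annotations are first propagated up to the parent terms.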
If using these existing ontologies makes sense, the next task I see would be annotating Gene Wiki articles with those controlled terms. An initial part of this would presumably be incorporating existing annotations, which fortunately often accompany those ontologies. For example, the same group that developed the human disease ontology has used it to annotate the human genome: http://projects.bioinformatics.northwestern.edu/do_rif/.
In that initial annotation effort, what would be the advantage of using SWLs in the body of the article (as present in the examples at Category:SWL) rather than augmenting the existing PBB infoboxes, similar to how Gene Ontology terms appear? Emw (talk) 11:39, 25 August 2010 (UTC)
On the issue of controlled vocabularies and ontologies... This is an issue that I've been debating recently with my colleagues. I'd argue that we shouldn't require that SWLs use any specific ontology. I think the same thing applies to naming articles in WP. One person may create an article at Cancer research. Another creates one at Oncology. Probably they eventually will be tagged and merged, and a pointer will be left in its place. Why not the same thing for SWL types?
On the other hand, if an experienced ontologist later wants to relate SWL types to specific terms in an ontology, then that could be noted on the category page (e.g., Category:SWL/phosphorylated_by). Bottom line, we enable contributors to use the ontology but we don't require it. What do you think?
As for the benefit of using SWLs versus infoboxes? I'm hoping that the SWLs are a bit more friendly than the infobox code, especially for domain experts who often don't have much experience with wikitext (and not a lot of motivation to learn). Plus, I think SWLs are more generic because it's not limited to ontologies. Thoughts? Cheers, AndrewGNF (talk) 23:14, 27 August 2010 (UTC)
I agree that enabling but not requiring contributors to use ontologies is ideal. I think this project -- and the wider effort to move toward a Semantic Wikipedia -- could be greatly helped along if there were a way for editors to see some immediate practical effect of their work. It seems like it would be especially difficult to gain momentum in machine-semanticizing articles via SWLs unless that were possible.
The most advanced project in this vein seems to be Semantic MediaWiki. As covered in the July 2010 article Wikipedia to Add Meaning to Its Pages, the Wikimedia Foundation is aware of SMW, but there are apparently questions of whether it could be integrated into Wikipedia's instance of MediaWiki without significantly degrading the site's performance. Emw (talk) 14:48, 18 September 2010 (UTC)

Unused files

There are a few thousand unused File:PBB Protein xxxx image.jpg files and, as suggested on User_talk:AndrewGNF/Archive2#To_Commons, it should be safe to delete them. Before we start a DR I would like to hear if other users think that they should not be deleted (and why). AndrewGNF has a lot of things to do, so there are still some files in use that should be replaced if someone has some time to help. --MGA73 (talk) 21:12, 26 August 2010 (UTC)

I think it would be fine to delete the unused PBB images. I see no pressing need to replace the ~100 PBB images that remain in use; they are still accurate, just not as visually high-quality as those produced by PDBbot. These images represent a few types of edge cases that weren't easily fixable during PDBbot's large-scale generation and upload of replacement images. Emw (talk) 04:38, 27 August 2010 (UTC)
Thank you. I started a DR here. I hope you will comment. --MGA73 (talk) 20:01, 27 August 2010 (UTC)

Multiple single quotes

Some protein names have multiple single quotes, like PPP2R3A: Serine/threonine-protein phosphatase 2A regulatory subunit B'' subunit alpha. This normally initiates an italic section. I've manually added the nowiki tag around the '' in the page title, but it's still there in the sections edited by PBB. Could you please check the bot to add nowiki tags when necessary? Thanks, MichaK (talk) 12:10, 7 September 2010 (UTC)

Sorry I missed this before. Wow, that's quite an edge case. I can only find nine examples of human genes with two adjacent single quotes in the gene title or aliases, and only four of them had WP pages (PPP2R3A, PPP2R3B, PPP2R3C, BDP1). Looks like you caught most of them, I did the few remaining ones I saw. Thanks for catching that... Cheers, AndrewGNF (talk) 16:27, 28 September 2010 (UTC)
I agree that this is quite a weird edge case. I wasn't sure if in some of the places, the descriptions etc. might be overwritten at some point by PBB. Did you also check / disable this? MichaK (talk) 05:49, 29 September 2010 (UTC)
I think the general philosophy is to avoid bot updates of text that has been edited by humans. Hence there will be little risk of a bot overwriting the nowiki tags. To be on the safe side, it should be straightforward to include a regular expression in the bot script to search for adjacent single quotes and add nowiki tags if not already present. However, considering that adjacent single quotes occur very infrequently in gene/protein names, I am not sure it is worth the effort. Boghog (talk) 18:15, 29 September 2010 (UTC)
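For reference, the regular-expression approach suggested above might be sketched as follows. This is not taken from the bot script; the function name is hypothetical, and the pattern only recognises the plain `<nowiki>...</nowiki>` form, not a self-closing `<nowiki />` tag.

```python
import re

# Either an existing <nowiki>...</nowiki> span (left untouched) or a run of
# two or more adjacent single quotes (wrapped in nowiki tags).
_PATTERN = re.compile(r"(<nowiki>.*?</nowiki>)|('{2,})", re.DOTALL)


def protect_quotes(text):
    """Wrap runs of adjacent single quotes in nowiki tags unless already wrapped."""

    def repl(match):
        if match.group(1) is not None:
            return match.group(1)  # already protected: keep as-is
        return "<nowiki>" + match.group(2) + "</nowiki>"

    return _PATTERN.sub(repl, text)


name = "Serine/threonine-protein phosphatase 2A regulatory subunit B'' subunit alpha"
protected = protect_quotes(name)
print(protected)
```

Because existing nowiki spans are matched first and returned unchanged, running the function a second time leaves the text as it is. It should only be applied to gene/protein name fields, since in ordinary article prose the same pattern would also disable intentional italic and bold markup.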
Agreed, I think we're pretty much at the point where bot edits will be limited to the PBB templates. Too many weird cases we'd have to account for to do auto-editing of the main articles... Cheers, AndrewGNF (talk) 21:54, 29 September 2010 (UTC)
[the discussion moved towards the PBB Summaries, I've moved them to a new section]

PBB Summary

So the warnings "The PBB_Summary template is automatically maintained by Protein Box Bot. See Template:PBB_Controls to Stop updates." are more or less obsolete now? In this case I'd suggest removing them (and the PBB_Summary template?), as editors (like me) then refrain from editing the summaries. I know that it is possible to tell PBB not to edit the summary, but if I have the impression that PBB is going to update the summaries regularly, then I'm reluctant to disable this. MichaK (talk) 08:46, 30 September 2010 (UTC)

The intention of the PBB_Summary is only to provide seed text for the article. Editors are strongly encouraged to make edits to the text and then to turn off the automatic updates to preserve these edits. Unique content not available anywhere else is obviously more valuable than mirroring the content of another database. Concerning the templates, these I think are still potentially useful since they provide a mechanism for keeping track of which summaries have not been edited by humans. If the Entrez database summary has been updated and if and only if no human edits have been made, it would make sense to make a corresponding update in the Gene Wiki article. Boghog (talk) 09:29, 30 September 2010 (UTC)
Yes, as Boghog says, human editors are strongly encouraged to edit the PBB Summary section and to even remove the PBB Summary template when they do. Although Boghog points out how these templates can be useful, I think we've heard the feedback multiple times now where editors are refraining from editing because they see official-looking templates. So I'm even contemplating the wholesale removal of all PBB templates (both the PBB_Summary and PBB_Controls) from the main namespace pages. (We could add PBB_Controls to the PBB templates themselves in case users want to block bot edits there.) Unintentionally discouraging human editors might be too high a price to pay for having an easy marker of whether the summary had been previously edited... Thoughts? Cheers, AndrewGNF (talk) 21:00, 30 September 2010 (UTC)
The easiest might be to change the wording of the comment along the lines of "To keep the protein summaries up-to-date before a human editor has turned their attention to this page, the PBB_Summary template is automatically maintained by Protein Box Bot. Feel free to remove the PBB_Summary template and improve this section." MichaK (talk) 05:11, 1 October 2010 (UTC)
I have no objections to removing the PBB templates. This would certainly make the text easier to read and reduce the risk of scaring off editors. Concerning moving the PBB_Controls to the PBB templates, if the only remaining purpose of the controls is to provide a mechanism to block updates to the PBB template, why not use the {{nobots}} template instead? Boghog (talk) 05:27, 1 October 2010 (UTC)
Agreed, {{nobots}} would be even better. MichaK, I think I'm going to err on the side of just removing those templates. Although in theory it does allow for updating untouched summaries, in practice we've sadly never done it. I think on balance, we'll be best served by just treating those as a one-time seeding of content. I've added this to our running task list. Cheers, AndrewGNF (talk) 16:50, 1 October 2010 (UTC)

Guanylate cyclase-C receptor

According to various sources, the guanylate cyclase-C receptor is the target of the drug linaclotide. Any idea what this receptor is? Do we have a page about it/its gene? Thanks, ἀνυπόδητος (talk) 14:59, 28 September 2010 (UTC)

I am not 100% certain, but it appears to be guanylyl cyclase c. Cheers. Boghog (talk) 15:08, 28 September 2010 (UTC)
Thanks. I'll ask as WP:PHARM as well, perhaps I'll reach 100% there... Cheers, ἀνυπόδητος (talk) 15:32, 28 September 2010 (UTC)

Selectin

The article Selectin has an infobox containing the sushi domain, which isn't mentioned anywhere in the article. Then again, the section "Examples" lists "Human genes encoding proteins containing this domain" (without explaining what "this" domain is). Some list entries definitely contain sushi. Could someone clarify this? Thanks --ἀνυπόδητος (talk) 18:58, 11 October 2010 (UTC)

While all selectins contain the sushi domain, not all sushi domain containing proteins are selectins. Therefore I have split out the material concerning this domain from selectin into a newly created sushi domain article. Cheers. Boghog (talk) 19:39, 11 October 2010 (UTC)
Great, thanks --ἀνυπόδητος (talk) 09:20, 12 October 2010 (UTC)

As part of an unrelated task, I've accidentally generated a list of unused PDB Galleries. Posted here in case it's of use:

10395 - 11166 - 1401 - 1454 - 1471 - 156 - 1742 - 2919 - 335 - 336 - 344 - 350 - 367 - 3952 - 4179 - 462 - 5176 - 5447 - 5617 - 596 - 627 - 6310 - 633 - 637 - 6426 - 6658 - 7143 - 718 - 728358 - 7293 - 835 - 847 - 8654 - 9370 - _26002

Inter-alpha-trypsin inhibitor

We have got ITIH1, ITIH2, ITIH3, and ITIH4, but we seem to be lacking an article on the inter-alpha-trypsin inhibitor itself. Or did I miss anything? --ἀνυπόδητος (talk) 16:07, 6 December 2010 (UTC)

FYI, I have started translating the German article on Kunitz domains. It focuses on the pharmaceutical use, so any material on its biological functions would be welcome. --ἀνυπόδητος (talk) 17:13, 6 December 2010 (UTC)

I created a stub for the former and, using material generated by User:Biophys, expanded the latter. Cheers. Boghog (talk) 20:41, 6 December 2010 (UTC)
Good work as always :-) I'll probably nominate Kunitz domains for DYK in a day or two. --ἀνυπόδητος (talk) 08:06, 7 December 2010 (UTC)

Glucocerebrosidase

Glucocerebrosidase (GBA) lists "glucosylceramidase" and "β-glucosidase" as synonyms, while Glucosylceramidase (EC 3.2.1.45) seems to imply that GBA is a type of glucosylceramidase (and the only one listed). Beta-glucosidase has EC 3.2.1.21 and contains a box named "glucosidase, beta; acid (includes glucosylceramidase)". Perhaps I'm being silly, but I don't get the relationship between these terms. Can someone enlighten me? --ἀνυπόδητος (talk) 17:52, 30 December 2010 (UTC)

This is out of my field, but it appears that EC 3.2.1.21 and EC 3.2.1.45 are two very closely related enzymes with similar substrate specificity and therefore similar names:

Enzyme | Enzyme commission number | Human genes | Aliases
Beta-glucosidase | EC 3.2.1.21 | GBA3 | amygdalase, beta-D-glucoside glucohydrolase, cellobiase, gentiobiase
Glucosylceramidase | EC 3.2.1.45 | GBA, GBA2 | acid beta-glucosidase, beta-glucocerebrosidase, D-glucosyl-N-acylsphingosine glucohydrolase

I hope this helps. Cheers. Boghog (talk) 21:31, 30 December 2010 (UTC)

Thanks. I hope the articles are a bit clearer now.
And thank you for adding Navboxes and clarifying the nomenclature. Cheers. Boghog (talk) 13:34, 31 December 2010 (UTC)