Problem with publication information by BDII-Local #135

Open
jas01 opened this issue Mar 17, 2015 · 33 comments

@jas01

jas01 commented Mar 17, 2015

I got an alert from midmon.egi.eu about the information my BDII publishes:

CRITICAL - errors 78, warnings 26, info 52
Summary per type of error, warning and info message:
E022 - Default value published (GLUE2ComputingShareWaitingJobs): 26
E023 - Default value published (GLUE2ComputingShareEstimatedAverageWaitingTime): 26
E024 - Default value published (GLUE2ComputingShareEstimatedWorstWaitingTime): 26
I046 - Number of seconds higher than 1 million (GLUE2ComputingShareEstimatedAverageWaitingTime): 26
I047 - Number of seconds higher than 1 million (GLUE2ComputingShareEstimatedWorstWaitingTime): 26
W025 - Incoherent number of total jobs (GLUE2ComputingShareTotalJobs): 26
@jrha jrha added this to the 15.4 milestone Mar 17, 2015
@Pansanel
Contributor

Which version of glite-info-provider-service are you running?

@jas01
Author

jas01 commented Mar 18, 2015

On my local BDII I currently have:

glite-info-provider-service-1.13.4-1.el6.noarch

Regards

@Pansanel
Contributor

Can you try to downgrade to 1.13.3-1 to see if it works better?

@Pansanel
Contributor

Disregard the previous comment. I have found where the code is:
/usr/libexec/info-dynamic-pbs
Which version of lcg-info-dynamic-scheduler-pbs are you using?

@jas01
Author

jas01 commented Mar 18, 2015

The downgrade doesn't change anything. But you already knew that ;-)

I've split my Torque server and the CREAM CE onto separate machines.

Both have:

lcg-info-dynamic-scheduler-pbs-2.4.5-1.el6.noarch

regards.

@jrha
Member

jrha commented Mar 18, 2015

Is this a problem with the Quattor template library or a middleware problem?

@Pansanel
Contributor

It may be related to Quattor if it is a GIP parser configuration problem, but that is not certain.
We have the same rpm (lcg-info-dynamic-scheduler-pbs-2.4.5-1.el6.noarch).
Can you check the content of /var/lib/bdii/gip/plugin/lcg-info-dynamic-ce and try to execute it manually?

@jas01
Author

jas01 commented Mar 18, 2015

/var/lib/bdii/gip/plugin/lcg-info-dynamic-ce is a very simple script:

[root@torque-grid ~]# more /var/lib/bdii/gip/plugin/lcg-info-dynamic-ce
#!/bin/sh
cat /swareas/dteam/gip-ce-dynamic-info-maui.cache
[root@torque-grid ~]#

and /swareas/dteam/gip-ce-dynamic-info-maui.cache contains a lot of information. Restricting it to one VO:

dn: GlueCEUniqueID=cream-ce-grid.obspm.fr:8443/cream-pbs-glast.org,Mds-Vo-name=resource,o=grid
GlueCEInfoLRMSVersion: 2.5.13
GlueCEInfoTotalCPUs: 112
GlueCEPolicyAssignedJobSlots: 112
GlueCEStateFreeCPUs: 0
GlueCEStateFreeJobSlots: 0
GlueCEPolicyMaxRunningJobs: 110
GlueCEPolicyMaxWallClockTime: 4320
GlueCEPolicyMaxCPUTime: 1440
GlueCEStateStatus: Production

dn: GLUE2ShareID=GLAST-ORG_GLAST-ORG_torque-grid.obspm.fr_ComputingElement,GLUE2ServiceID=torque-grid.obspm.fr_ComputingElement,GLUE2GroupID=resource,o=glue
GLUE2ComputingShareFreeSlots: 0
GLUE2ComputingShareMaxRunningJobs: 110
GLUE2ComputingShareMaxWaitingJobs: 220
GLUE2ComputingShareMaxTotalJobs: 330
GLUE2ComputingShareServingState: Production
GLUE2EntityCreationTime: 2015-03-18T14:55:02Z
GLUE2ComputingShareMaxWallTime: 4320
GLUE2ComputingShareDefaultWallTime: 4320
GLUE2ComputingShareMaxCPUTime: 1440
GLUE2ComputingShareDefaultCPUTime: 1440
GLUE2ComputingShareMaxSlotsPerJob: 1
GLUE2ComputingShareMaxMainMemory: 2000
GLUE2ComputingShareMaxVirtualMemory: 20000

But the attributes named in the Nagios alert are not in this file, for example:

[root@torque-grid ~]# grep GLUE2ComputingShareEstimatedAverageWaitingTime /swareas/dteam/gip-ce-dynamic-info-maui.cache
[root@torque-grid ~]# grep GLUE2ComputingShareEstimatedWorstWaitingTime /swareas/dteam/gip-ce-dynamic-info-maui.cache
[root@torque-grid ~]#
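
The two greps above can be folded into one check that covers every attribute from the alert. This is only a sketch: the helper name `missing_attrs` is made up for illustration, the attribute names come from the alert at the top of this issue, and the cache path is the one quoted in this comment.

```shell
#!/bin/sh
# Print every attribute name that never appears in the given LDIF file.
# Usage: missing_attrs FILE ATTR [ATTR...]   (hypothetical helper)
missing_attrs() {
    file=$1
    shift
    for attr in "$@"; do
        # Match the attribute name at the start of a line, followed by a colon
        if ! grep -q "^${attr}:" "$file"; then
            printf '%s\n' "$attr"
        fi
    done
}

# Example call (not run here):
# missing_attrs /swareas/dteam/gip-ce-dynamic-info-maui.cache \
#     GLUE2ComputingShareWaitingJobs \
#     GLUE2ComputingShareEstimatedAverageWaitingTime \
#     GLUE2ComputingShareEstimatedWorstWaitingTime
```

An empty output would mean all listed attributes are present in that file; it says nothing about whether their values are sensible.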

@Pansanel
Contributor

Which version of template-library-grid are you using? A lot of work was done on GLUE2 static publishing in November. Are you sure you are up to date?

@jouvin
Contributor

jouvin commented Mar 18, 2015

We started with an email exchange with @jas01 and he confirmed that he was using 14.10.0. As you said, this is not something we see at other sites, so it is probably due to some local conditions; my guess is that it comes from a wrong Quattor configuration at the site.

We are running the same version of glite-info-provider-service as @jas01 so I'll try to run the GLUE validator on our config again to check that we don't have the same problem...

@jouvin
Contributor

jouvin commented Mar 19, 2015

I have checked and cannot reproduce the problem on our CE... which doesn't mean that there is no problem! I'm trying to think about what could help to troubleshoot the problem...

@jouvin
Contributor

jouvin commented Mar 19, 2015

From what I can see running glue-validator (*), I have the feeling that something is not working with the GIP cache. This cache is made of 2 files: /swareas/dteam/gip-ce-dynamic-info-maui.cache and /swareas/dteam/gip-ce-dynamic-info-scheduler.cache. The problematic information comes from the latter (gip-ce-dynamic-info-scheduler.cache). Could you check that this file is created and that you get no error from /var/spool/maui/gip-info-dynamic-scheduler-plugin.sh when running the dynamic-info-scheduler plugin?

(*) glue-validator -H topbdii.grif.fr -p 2170 -b GLUE2DomainID=OBSPM,GLUE2GroupID=grid,o=glue -k -v 3 egi-profile
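
A quick way to run the first part of that check, sketched under the assumption that "created" means the file exists and is not empty. The function name `check_cache_files` is hypothetical; the two paths are the ones named in this comment.

```shell
#!/bin/sh
# Print one status line per file: MISSING, EMPTY or OK.
# Returns non-zero if any file is missing or empty. (hypothetical helper)
check_cache_files() {
    rc=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "MISSING $f"
            rc=1
        elif [ ! -s "$f" ]; then
            echo "EMPTY $f"
            rc=1
        else
            echo "OK $f"
        fi
    done
    return $rc
}

# Example call (not run here):
# check_cache_files /swareas/dteam/gip-ce-dynamic-info-maui.cache \
#                   /swareas/dteam/gip-ce-dynamic-info-scheduler.cache
```

This does not replace running the plugin script by hand and reading its error output; it only tells you whether the cache files are there at all.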

@jas01
Author

jas01 commented Mar 23, 2015

Sorry, I didn't see your email because it was tagged as spam.

> From what I can see running glue-validator (*), I have the feeling that something is not working with the GIP cache. This cache is made of 2 files: /swareas/dteam/gip-ce-dynamic-info-maui.cache and /swareas/dteam/gip-ce-dynamic-info-scheduler.cache. The problematic information is coming from the latter one (gip-ce-dynamic-info-scheduler.cache). Could you check that this file is created and that you have not error in /var/spool/maui/gip-info-dynamic-scheduler-plugin.sh when running the dynamic-info-scheduler plugin.

You're right, that is the problem. I was missing the symbolic link between

/var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif

and

/var/glite/static-file-all-CE-pbs.ldif

I created that symlink manually and everything seems correct now.

I don't understand how that link disappeared. I tried running ncm-ncd and it didn't recreate the link. Should I modify my own Quattor template so that it creates this link (for the next time I reinstall the server)?

Regards
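
For reference, the manual fix can be written so that it is safe to re-run. This is a sketch only: `ensure_symlink` is a made-up helper name, and the link direction in the example call (the /var/glite entry pointing at the file under /var/lib/bdii/gip/ldif) is an assumption taken from the `ls -al` output posted later in this thread, not from this comment.

```shell
#!/bin/sh
# Create LINK pointing at TARGET unless LINK already exists.
# Never overwrites anything. Usage: ensure_symlink TARGET LINK (hypothetical helper)
ensure_symlink() {
    target=$1
    link=$2
    if [ -L "$link" ]; then
        # Already a symlink: report where it points and leave it alone
        echo "already a symlink: $link -> $(readlink "$link")"
    elif [ -e "$link" ]; then
        echo "exists but is not a symlink, leaving untouched: $link" >&2
        return 1
    else
        ln -s "$target" "$link" && echo "created: $link -> $target"
    fi
}

# Example call (not run here; direction is an assumption, see above):
# ensure_symlink /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif \
#                /var/glite/static-file-all-CE-pbs.ldif
```

A link created by hand like this is lost on reinstall, which is why the question about the Quattor template still stands.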

@pigay
Contributor

pigay commented Apr 8, 2015

Dear all,

We ran into the same error in midmon (https://midmon.egi.eu/nagios/cgi-bin/extinfo.cgi?host=sitebdii.m3pec.u-bordeaux1.fr&type=2&service=org.bdii.GLUE2-Validate)

We run the same version of lcg-info-dynamic-scheduler-pbs as @jas01, also on the 14.10 templates, but we can't find any error when running /var/lib/bdii/gip/plugin/lcg-info-dynamic-ce or /var/spool/maui/gip-info-dynamic-scheduler-plugin.sh.

Can you help us?

pierre

@pigay
Contributor

pigay commented Apr 17, 2015

Hi,

we investigated a bit further on our problem.

We didn't find any occurrence of (for example) GLUE2ComputingShareEstimatedAverageWaitingTime in any of the /usr/libexec/lcg-info-dynamic-* scripts we have on the CE.

we have the following rpms:

  • lcg-info-dynamic-maui-2.2.0-3.noarch
  • lcg-info-dynamic-scheduler-pbs-2.4.5-1.el6.noarch

We wonder which piece of software /should/ update these entries with the dynamic value...

Thanks in advance,

Pierre

@Pansanel
Contributor

Hi Pierre,

Both RPMs are up to date. What is the content of the following file:
/etc/glite-ce-glue2/glite-ce-glue2.conf

Jerome

@pigay
Contributor

pigay commented Apr 17, 2015

Dear Jérôme,

This file doesn't exist. We only have a /etc/glite-ce-glue2/glite-ce-glue2.conf.template from glite-ce-cream-utils-1.3.5-1.el6.x86_64.rpm

How is the .conf file supposed to be created?

Thanks,

Pierre

@jouvin
Contributor

jouvin commented Apr 18, 2015

Have you checked that the problem is not the same as for @jas01, the missing symlink? This symlink is needed because of a bug introduced in lcg-info-dynamic-scheduler and recently fixed (but maybe not yet released).

glite-ce-glue2.conf should exist but in /etc/bdii/gip.

To troubleshoot this kind of problem, you may want to look at /var/log/maui-monitoring.ncm-cron.log. BTW, are you running the CE and the batch system on the same machine? Do you have GIP_CE_USE_CACHE true or false in your configuration? If true, you should have two plugins in /var/lib/bdii/gip/plugin/ on your CE that are just a cat of some file: check that these files exist and look at their content.
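
The last check can be scripted roughly as follows. It is a sketch that assumes each plugin is a tiny shell script whose only real line starts with `cat ` followed by one or more paths without spaces, as in the lcg-info-dynamic-ce example earlier in this thread; `check_cat_plugins` is a hypothetical name.

```shell
#!/bin/sh
# For each plugin script, report whether every file it cats exists and is non-empty.
# Assumes lines of the form "cat PATH [PATH...]" with no spaces inside paths.
check_cat_plugins() {
    for plugin in "$@"; do
        # Keep only the arguments of lines that start with "cat ", one path per line
        sed -n 's/^cat[[:space:]]\{1,\}//p' "$plugin" | tr ' ' '\n' |
        while read -r path; do
            [ -n "$path" ] || continue
            if [ -s "$path" ]; then
                echo "OK $plugin -> $path"
            else
                echo "BAD $plugin -> $path"
            fi
        done
    done
}

# Example call (not run here):
# check_cat_plugins /var/lib/bdii/gip/plugin/*
```

A `BAD` line means the cached file is missing or empty, which would point at the cron job that is supposed to refresh it rather than at the plugin itself.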

@pigay
Contributor

pigay commented Apr 26, 2015

Dear Michel, thanks for helping. Sorry for the late reply, I was on holiday.

The symlink looks correct:

[root@ce0 ~]# ls -al /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif /var/glite/static-file-all-CE-pbs.ldif
lrwxrwxrwx 1 root root    50 17 avril 16:49 /var/glite/static-file-all-CE-pbs.ldif -> /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif
-rw-r--r-- 1 ldap ldap 69924 24 déc.  14:46 /var/lib/bdii/gip/ldif/static-file-all-CE-pbs.ldif

/etc/bdii/gip/glite-ce-glue2.conf is present and looks OK, as far as I can see. The file /etc/glite-ce-glue2/glite-ce-glue2.conf mentioned by @Pansanel is missing, however.

In /var/log/maui-monitoring.ncm-cron.log, I see that the cron runs every 5 minutes without errors.

We run 2 CEs, each with its own batch system (and a separate set of WNs). GIP_CE_USE_CACHE is set to true in our templates.

Concerning the plugins, we have:

[root@ce0 ~]# ll /var/lib/bdii/gip/plugin/
total 12
-rwxr-xr-x 1 ldap ldap 49 14 mai    2014 lcg-info-dynamic-ce
-rwxr-xr-x 1 ldap ldap 54 14 mai    2014 lcg-info-dynamic-scheduler-wrapper
-rwxr-xr-x 1 ldap ldap 97 14 mai    2014 lcg-info-dynamic-software-wrapper

Each file contains a cat command (and runs without errors)...

@jrha
Member

jrha commented Apr 27, 2015

I don't mind this kind of discussion here, but is this problem specific to the grid template library?
If not, is there a better forum for this where more people could benefit from the discussion?

@pigay
Contributor

pigay commented Apr 27, 2015

Dear @jrha, you're probably right. But I think I got it.

@jouvin: by the way, I noticed a thread on the quattor-grid mailing list that I had missed, related to #113. We are not synchronized with this PR (we are on umd3-14.10, I think), so it seems normal that we are missing GLUE2 information...

I'll try to get the correct version and post to quattor-grid if I have trouble.

Sorry for the noise.

@pigay
Contributor

pigay commented Apr 28, 2015

In the end, it turns out to be a template-library-grid issue.

As @jas01 stated before I polluted this thread, this is a missing symlink problem (I had the same issue with /var/glite/ComputingShare.ldif).

I couldn't find how these links are created in the templates.

@jrha
Member

jrha commented Apr 28, 2015

☺️

@jouvin
Contributor

jouvin commented May 5, 2015

I had a look at the problem and the proposed fix, and I am not completely sure this is the right solution. I now remember that the problem comes from lcg-info-dynamic-scheduler, which in EMI3 releases was ignoring ldif_file and ldif_file_glue2 in its config file (/etc/bdii/gip/lcg-info-dynamic-scheduler-pbs.conf): see https://ggus.eu/index.php?mode=ticket_info&ticket_id=110336 for the explanation.
I am inclined to reject the PR and consider that the correct fix is the new version of lcg-info-dynamic-scheduler. I'll try to assess this before the 15.4 release...
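
To see what the config file currently says for those two options, something like the following may help. It is a sketch: the helper name `show_ldif_settings` is made up, and the assumption that each option sits on its own line as `name = value` or `name: value` is mine, not taken from the package documentation.

```shell
#!/bin/sh
# Print the ldif_file and ldif_file_glue2 lines from a config file.
# Assumes "name = value" or "name: value" syntax (unverified assumption).
show_ldif_settings() {
    grep -E '^[[:space:]]*(ldif_file|ldif_file_glue2)[[:space:]]*[:=]' "$1"
}

# Example call (not run here):
# show_ldif_settings /etc/bdii/gip/lcg-info-dynamic-scheduler-pbs.conf
```

If the paths printed here are correct but the plugin still reads /var/glite, that would be consistent with the options being ignored as described in the GGUS ticket.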

@jouvin
Contributor

jouvin commented May 6, 2015

See #137 for the discussion explaining why it is a workaround rather than a real fix: the problem was introduced by the need to work around a bug in lcg-info-dynamic-scheduler. #141 should be the real fix.

@jouvin
Contributor

jouvin commented May 7, 2015

I have been very bad at following this issue, my apologies... Anyway, checking the status of everything after a comment on #141, I realize that there may be some ambiguity about whether you are running a fixed version of the GIP plugin. The real culprit is not lcg-info-dynamic-scheduler-pbs but the underlying module dynsched-generic. The problem is fixed in 2.5.5-1.
As for the short term: we are very close to the 15.4 release, without enough time to properly test something that is difficult to test (some effects take time to show up), so I suggest that we merge #137 (which is an acceptable workaround) and work quietly on #141 afterwards...
Any objection?
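
To remove that ambiguity on a given node, the installed dynsched-generic version can be compared against 2.5.5-1. The sketch below relies on `sort -V`, a GNU coreutils extension whose ordering only approximates RPM's own version comparison; `version_ge` is a hypothetical helper name.

```shell
#!/bin/sh
# Succeed when version $1 is greater than or equal to version $2.
# Uses sort -V (GNU coreutils), which approximates RPM version ordering.
version_ge() {
    # After sorting, the smallest version comes first; $1 >= $2 iff that is $2
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use on an RPM-based node (not run here):
# installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' dynsched-generic)
# if version_ge "$installed" 2.5.5-1; then
#     echo "dynsched-generic $installed should contain the fix"
# else
#     echo "dynsched-generic $installed predates the fix"
# fi
```

Release suffixes such as `.el6` make the comparison fuzzy, so treat the result as a hint and check the package changelog when in doubt.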

@jouvin
Contributor

jouvin commented May 7, 2015

In fact, the required version of dynsched-generic has been released only in the EMI repos (a while ago)... I am trying to understand the real status...

@jouvin
Contributor

jouvin commented May 8, 2015

For the record, the new version with the bug fix required for this PR will be released by UMD at the end of May. So let's delay it until 15.6.

@jouvin jouvin modified the milestones: 15.6, 15.4 May 8, 2015
@jrha jrha closed this as completed in #137 May 11, 2015
@jrha jrha reopened this May 11, 2015
@jrha
Member

jrha commented May 11, 2015

#137 is a workaround, the real fix should be incorporated into #141.

jouvin added a commit to jouvin/template-library-grid that referenced this issue Oct 8, 2015
@jrha jrha modified the milestones: 15.10, 15.8 Oct 29, 2015
@jrha jrha modified the milestones: 16.2, 15.12 Dec 7, 2015
@jouvin
Contributor

jouvin commented Jan 18, 2016

After closing #141 (see comments inside), I propose to restart the work on this issue with the latest version of GIP CE as provided in #162 and intended to be released in 16.2. The fixed UMD component should have been available for a long time now, and I suggest removing the workarounds and properly integrating the working version.

@jouvin jouvin self-assigned this Jan 18, 2016
@jrha
Member

jrha commented Feb 12, 2016

@jouvin - #162 is merged, what is left to do?

@jrha
Member

jrha commented Feb 17, 2016

@jouvin ping?

@jrha
Member

jrha commented Feb 22, 2016

@jouvin and @jas01 if this is fixed please close this issue, otherwise please move the milestone to 16.4.
