HPC Nightmares & HPC Best Practices

Remember that time you hit Enter instead of Backslash and deleted your entire virtual infrastructure? We are looking for volunteer speakers. If you manage or use HPC resources (even modest ones) please add your voice.

This open round table discussion will offer a chance to hear stories from the datacenter as well as some of the best ways to get real work done in HPC.

Should you participate?

At some level, anything that gets the job done is a best practice. We're all guilty of implementing a hack to get things working. Perhaps if we join forces, we can all reduce the number of hacks and increase the quality of our HPC.

Not all HPC is thousands of cores. Real work gets done on single systems with one GPU. Students, users and administrators at all levels can teach us something new.

Stories we want to hear:

• When things went wrong and couldn't be fixed

• When things went wrong and you managed to save the day

• What toolchain you use to manage your infrastructure/software/users/etc.

• A problem that has been frustrating you and that no one has stepped up to fix

I have a story!

Please send a message to Eliot Eshelman (either via Meetup or [masked]). I will publish teasers as stories come in.

  • Robert T.

    Also, here is a question: does anyone use dotkit?
    It appears to be another reincarnation of modules.
    I remember talking to someone at the Broad, and they mentioned that they were using it.

    https://computing.llnl.gov/?set=jobs&page=dotkit

    June 5

    • Eliot E.

      You're a fountain of useful information! I've not heard of dotkit. At first glance, I'm not certain why one would select it over Lmod environment modules. Does anyone else have a better understanding of the nuances?

      June 6

    • Shawn D.

      We are moving to Lmod... Mostly just to keep general compatibility for users so they don't have to deal with that and the switch to Slurm.

      June 8

  • Robert T.

    Also, here is a link to a paper from 2004, and if you look at table 1, you may still see some of the same issues today on your cluster.
    http://genome.cshlp.org/cgi/reprintframed/14/5/971

    You might recognize one of the authors too...

    1 · June 2

  • Robert T.

    Also (not really HPC related), someone had talked about PXE/DHCP booting, and I mentioned DHCP snooping to avoid rogue DHCP servers. Here is a link to a Cisco article about it, but many managed non-Cisco switches have similar features: http://www.cisco.com/c/en/us/td/docs/switches/lan/catalyst6500/ios/12-2SX/configuration/guide/book/snoodhcp.html

    They have other features too, like dynamic ARP inspection, to help keep people from spoofing and poisoning the ARP cache...

    2 · May 30

  • Robert T.

    Someone had asked about the Joyent outage; here is a link to the summary...

    http://www.joyent.com/blog/postmortem-for-outage-of-us-east-1-may-27-2014

    1 · May 30

  • Steve Z.

    Hi Eliot, thank you for your information on the NVIDIA GRID K1 and K2 GPU cards. It turns out that those cards and the Quadro K6000 and K5000 are for a Windows Server 2012 Hyper-V host and Windows 7 Enterprise / Windows 8 Enterprise guest OS with RemoteFX supporting vGPU. This solves one of my problems.

    Could we use the GRID K1 or K2 to do CUDA/OpenCV computing using vGPU in VMs on Windows Server 2012 Hyper-V? The GRID K1 and K2 are GPU cards with CUDA cores inside, so why not? Does vGPU support CUDA drivers in Windows Server 2012 Hyper-V VMs?

    Thanks again,

    Steve

    May 30

    • Eliot E.

      You can use the NVIDIA GRID GPUs for compute, but I think you will find that they are under-powered. Their primary purpose is display. For the best compute performance, you would want a Quadro K6000. It has performance equivalent to the Tesla K40.

      May 30

  • Won

    It was my first time at this meetup, and it was very useful. Thanks, Eliot and everyone who shared their own experience. BTW, I have found a nice general introductory article about what we talked about yesterday:
    http://blog.ajdecon.org/the-hpc-cluster-software-stack/
    See you guys at the next meeting!

    1 · May 30

  • James C.

    Great job Eliot - that was a whole bunch of fun. Good to swap some war stories and have a bit of a chuckle about them!

    May 30

  • Rob P.

    I know there were some discussions around Ceph last night. Inktank is having a Ceph Day in Boston on June 10th. Here is a link: http://www.inktank.com/cephdays/boston/

    2 · May 30

  • Dionis E.

    I am so new to HPC that it was great to hear all the technologies discussed with real-world experience.

    1 · May 30

  • Jilang M.

    It's a pity I couldn't find the place at NERD. I hope to make the next meetup.

    May 30

  • Kun L.

    It was interesting, and I gained some new insights into HPC.

    May 29

  • Peter J. L.

    Great meeting, and thanks to Eliot again for organizing!

    May 29

  • Steve Z.

    I am a senior software engineer at Trilion. I am going to use CUDA and OpenCV to do image processing and analysis.

    May 29

  • Eliot E.

    I'm sorry that we've filled up the registration already. I'm waiting to hear if Microsoft NERD can give us more space.

    May 23, 2014

  • Theodore O.

    Ah, forgot one: a colo salesperson saying they can house any gear, and then, when we provide a 60kW rack design, the facility manager replying that the data center is designed for 8kW racks....

    As a matter of fact, we have never been able to find a US colo that can support a 60kW rack.

    May 23, 2014

  • Theodore O.

    Bad memory DIMMs in a 500-node cluster.....

    cable management to route a fat tree

    Does it really cost $25k for a single cable?

    VPN reliability between data centers in an emerging market

    What do you mean that I can't connect my Arista TOR to my Juniper firewall?

    and many more....

    May 23, 2014

  • Eliot E.

    Here are some of the stories we have on tap. Please bring your own (and send me photos/screenshots in advance)!


    Enter vs. Backslash

    That time it rained on the Head Node.

    The over-heating datacenter #1 and #2 (I'm sure we'll have tons of these).

    D@#% over-zealous provisioning scripts/servers!

    Where are the hard drives?

    May 23, 2014
