Friday, December 20, 2013

Avoiding the CGminer BSOD/Crash on Exit for R9 290/290X

One of the major annoyances with mining via cgminer on R9 290/290X hardware is that it will hard crash your system when you exit "gracefully" -- by pressing "Q" or "CTRL+C". Some people will tell you that the problem is your clock speeds, or that you're using unstable settings, but this is absolutely false. The reality is that the crash is caused by outdated code in cgminer.

You see, cgminer's developer has decided to stop supporting GPUs. I don't blame him -- he's probably earning a ton of money working on developing and maintaining his program for ASICs, and since he's only one man he can't really keep doing both. But it means that code that worked fine in the past may now have difficulties. And that's exactly what's happening.

With their Hawaii architecture, AMD updated some things for the R9 290/290X. Specifically, they've updated their Overdrive engine from version 5 to 6, which helps with things like preventing the GPUs from frying if you run them too hard. This is part of the reason why you need to carefully tune your mining rigs for each GPU -- one R9 290X may run great at 975/1550 clocks while another will totally fail at those speeds. You can often see that you've pushed the GPU engine clocks too far by watching the clocks in MSI Afterburner; if you've set the engine for 950MHz and it's bouncing around between 800-950MHz, AMD's hardware is doing its job and dropping clocks to reduce power draw and protect the chips. Drop the engine clocks by 25MHz (or 50MHz, 75MHz, etc.) and see if the clocks the card actually holds go up. But I digress....
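As a rough illustration of the "bouncing clocks" check described above, here is a minimal Python sketch. It assumes you have already collected engine clock samples (in MHz) from a monitoring tool by some other means; the function names, the 10MHz tolerance, and the "more than a quarter of samples" rule are hypothetical choices for the example, not anything taken from Afterburner or cgminer.

```python
def looks_throttled(samples_mhz, target_mhz, tolerance_mhz=10):
    """Return True if a noticeable share of samples sits below the target clock.

    samples_mhz: engine clock readings collected elsewhere (assumed input).
    tolerance_mhz: hypothetical slack so tiny fluctuations are ignored.
    """
    below = [s for s in samples_mhz if s < target_mhz - tolerance_mhz]
    # Hypothetical rule of thumb: more than a quarter of the samples dipping
    # below the target is treated as the card dropping its clocks.
    return len(below) > len(samples_mhz) / 4


def next_target(target_mhz, step_mhz=25):
    """Suggest the next engine clock to try, one step lower."""
    return target_mhz - step_mhz


samples = [950, 800, 875, 950, 820, 940]
if looks_throttled(samples, 950):
    print("Try", next_target(950), "MHz next")
```

The idea is only to make the manual procedure explicit: watch the real clocks, and if they keep dipping, lower the requested clock one step and watch again.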

So the problem with R9 290/290X (Hawaii) and BSOD/crash on exit is that cgminer is tying into older software/hardware paths that don't behave quite the same on the latest GPUs. When you exit normally and cgminer tries to release things, there's a glitch and a BSOD is the result. Note that this happens on Windows as well as Linux (though without the BSOD on Linux, obviously), so it's not just an OS problem. But what's the solution?

Update: Forget everything below this point! There's a recompiled version of cgminer available that fixes the problem, and I recommend using it.


[Old text follows so you can see how things have developed.]

Simple: don't use any of cgminer's GPU overclocking and/or temperature monitoring features. That means you have to avoid the following (as far as I can tell):
  • auto-fan
  • gpu-engine
  • gpu-fan
  • gpu-memclock
  • gpu-powertune
  • gpu-vddc
  • temp-cutoff
  • temp-overheat
  • temp-target
Okay, those are easy enough to remove, but there's a problem: without tuning your GPUs, performance, hash rates, temperatures, and stability are all compromised. Yuck. The temperature targets in particular are important, because if you leave AMD's drivers to manage everything you'll have severe throttling. So you need to turn to other utilities to manage temperatures and clocks as best you can...or just live with the inability to do a normal exit from cgminer.
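If your cgminer command line is long, removing those options by hand is error-prone. The following Python sketch filters them out of an argument list. It assumes the long `--option` spellings from the list above, and that every option except `--auto-fan` takes exactly one value (either as the next argument or as `--option=value`); check that against your cgminer version's help output before relying on it. The function name is made up for the example.

```python
# Options from the list above, written as long command-line flags.
# Assumption: --auto-fan is a bare switch, all others take one value.
FLAG_ONLY = {"--auto-fan"}
WITH_VALUE = {
    "--gpu-engine", "--gpu-fan", "--gpu-memclock", "--gpu-powertune",
    "--gpu-vddc", "--temp-cutoff", "--temp-overheat", "--temp-target",
}


def strip_gpu_tuning_args(args):
    """Return a copy of args without the GPU tuning/monitoring options."""
    cleaned = []
    skip_next = False
    for arg in args:
        if skip_next:            # this is the value of a removed option
            skip_next = False
            continue
        name = arg.split("=", 1)[0]
        if name in FLAG_ONLY:
            continue
        if name in WITH_VALUE:
            # "--opt value" form: also drop the following argument.
            skip_next = "=" not in arg
            continue
        cleaned.append(arg)
    return cleaned


print(strip_gpu_tuning_args(
    ["--scrypt", "-I", "13", "--gpu-engine", "1000", "--gpu-powertune", "20"]
))
```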

For what it's worth, I'm still using cgminer to control things. I tried playing with MSI Afterburner to get things to work "properly", but getting everything to persist between reboots isn't something I've fully figured out yet. When I do, I'll post instructions.

36 comments:

  1. You can also close the process in your task manager to avoid BSOD with R9's.

    I have a problem with my cgminer: after installing a new card (R9 290) I cannot run cgminer without the -T parameter; it just won't show me any stats. On the 7870 everything worked perfectly, but on the 290 it's text-mode only. I tried different versions and different drivers. Any suggestions?

    1. Hi,

      I also had this problem at the beginning. It always happened to me on a 'clean' install of Windows 7. After updating with Windows Updates, it never happened again.
      Had this problem on 2 machines, I hope this helps.

  2. I noticed the same problem but I found that if I just close the cgminer window it does not BSOD. I only wish I could adjust voltage to the cards, although after installing some aftermarket coolers these puppies are humming at 60 and 80 degrees respectively. I could probably get the second one to run at 60 if I got a riser and didn't have the first one venting to the underside of the second.

    1. Yes, you can close out the window to avoid the BSOD, but the problem with that is there's no way to use something like CGWatcher to keep cgminer running properly.

    2. Because it can't restart cgminer without bsod? I guess worst case it restarts the pc due to bsod rather than just cgminer which is not all bad either. I have CGWatcher running on my headless mining pc with 2 of these cards and so far so good.

    3. Yeah, CGWatcher should just cause a BSOD/restart, which is a bit much for a simple restart of cgminer, and it might cause issues with Windows as well (if you get frequent BSODs). Linux however often seems to hang/get stuck and need a manual restart (often via a hard power reset).

    4. The new CGWatcher (for sure 1.3.4.6, but I think earlier too) has the option to "Always kill miner process instead of sending 'quit' command" under Settings->Miner. I have that and the option above it enabled, which works well for me (R9 270Xs and 290s)
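For anyone scripting their own watchdog, the same "kill instead of quit" idea can be sketched in Python. This is not how CGWatcher is implemented (its internals aren't described here); it is just a minimal example of terminating a child process directly instead of asking it to shut down. The `stop_miner` name is hypothetical.

```python
import subprocess
import sys


def stop_miner(proc, timeout=10):
    """Terminate the miner process outright rather than sending it 'quit'.

    proc is a subprocess.Popen handle for the miner. Killing the process
    means the miner's own shutdown code never runs.
    """
    if proc.poll() is None:          # still running
        proc.kill()                  # hard kill, no graceful shutdown
        proc.wait(timeout=timeout)   # reap the process
    return proc.returncode


# Stand-in for cgminer so the example runs anywhere: a child that sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
print("miner stopped, exit code:", stop_miner(child))
```

In a real script you would start cgminer with `subprocess.Popen` in place of the sleeping stand-in and call `stop_miner` whenever a restart is needed.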

  3. Any ideas on how to get 6 290's to run in Windows (7 or 8)? I have 5 running in Win8.1 now, but the 6th comes up with the "code 43" issues in device manager. Tried the downgrade driver method (https://bitcointalk.org/index.php?topic=193695.msg2054705#msg2054705) but the 12.6 drivers don't support the R9 290.

    Thanks for any help! :D

    1. I have not tried more than four GPUs in a rig (mostly due to power constraints), so I'm not sure what specifically is required. I do know that AMD's drivers can be pretty particular, so you might need to uninstall and clean the drivers and reinstall to see if that helps.

    2. Ah ok.... To get around the power issue I'm running 2 PSU'S on it w/ powered risers. Think I'll try a fresh windows install and see if it helps, thanks!

  4. I think a bigger annoyance than this is headless Windows systems. When you reboot a machine and try to start mining through RDP (no monitor connected to the machine, hence headless), you get some errors initializing and the fans don't start. Things overheat, you get lower hash, and things are just crappy.

    There is also no way to trick the 290 into thinking there is a monitor connected (through a dummy plug or otherwise). No software, etc.

    I don't mind force quitting the miner, but this is truly annoying if there is a crash.

    1. I managed to get my pc to bypass those errors by changing gpu choice from auto to pcie slot but also by allowing the on board gpu to stay functional as well. After making that change the pc would boot to windows, start cgwatcher and start mining without the monitor attached.

    2. Don't use Windows RDP, use Teamviewer. I think it uses a different video/display virtualisation method, so it doesn't cause the monitoring info to disappear. Plus it's handy to see your rig in the "devices" list to see that it's still at least online.

  5. I can't contribute anything useful for Windows, but on Linux something like this might be useful:
    aticonfig --pplib-cmd "set fanspeed $adapter $fanspeedinpercent"
    For instance, for a single card:
    aticonfig --pplib-cmd "set fanspeed 0 80"

  6. If anyone is willing to recompile cgminer I think I might have a fix for the crash on exit (the newer drivers do not like setting a fan speed without being told exactly what type of speed it is, RPM or percent).

    I posted the details here:
    https://litecointalk.org/index.php?topic=8859.msg78496#msg78496

    All the best,
    tkg

  7. BFGminer seems more stable than CGminer for my 290x's. Anyone expert at BFGminer?

    1. I have only played with BFGminer a bit, but I need to look into it more and see if performance is the same or potentially better these days. I do hate that the hash rates reported in BFGminer are often incorrect for scrypt mining, at least in my experience.

  8. There is a fix for the people that get a blank screen with only a blinking cursor. This is usually fixed by using -T or --text-only, but that disables the entire ncurses interface and the control options it gives.

    Instead of -T, you can use --no-restart option and get the full ncurses interface. This is true for cgminer 3.7.2 and my Sapphire R9 290 card at least.

    It is a bug in adl.c, in the gpu_fanpercent() function, where some R9 290/290X cards return a negative value for fan speed, where earlier models returned the default value (usually something like 33%). This triggers the protective code to believe the card has crashed and tries to restart.

    I've experienced the BSOD too, but I just added --no-adl to cgminer parameters and used http://code.google.com/p/overdrive5/ instead.

    1. Can you set the voltage with overdrive on the 290??

  9. Hi Martin,

    You are quite right about the no restart option.
    I have also found today (using the 13.12 drivers on windows) that on a system using a water cooled Sapphire R9 290x (with no fan connected) and a standard Sapphire R9 290 with the original fan I was getting a black terminal screen as cgminer reported that the fan speed for the first card was wrong and cgminer tried to restart.
    By disabling the restart, cgminer works and shows the RPM just for the card that still has a fan.

    Also as a side note, the fix I proposed on litecoin (see my previous post) works for the BSOD on quit on both Linux and Windows.

    I still think that the crash on exit is an AMD driver bug: the driver should simply ignore a fan speed command that does not specify a proper fan speed type instead of crashing. Even so, cgminer should also be fixed to send the proper data, as I did.
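    To make the idea concrete, here is a small Python model of a fan speed request that always carries an explicit speed type. It is only an illustration of the principle described above, not the actual patch (that is in the linked post) and not real driver code. The constant values are what I recall from the ADL SDK headers and should be treated as assumptions; the class and function names are invented for the example.

```python
from dataclasses import dataclass

# Assumed values, modeled on the ADL SDK's fan speed type constants.
SPEED_TYPE_PERCENT = 1
SPEED_TYPE_RPM = 2


@dataclass
class FanSpeedRequest:
    """Hypothetical stand-in for the structure passed to the driver."""
    speed_type: int = 0   # 0 models "type left unspecified"
    fan_speed: int = 0


def make_percent_request(percent):
    """Build a request that states explicitly that the value is a percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("fan percentage must be between 0 and 100")
    return FanSpeedRequest(speed_type=SPEED_TYPE_PERCENT, fan_speed=percent)


def is_well_formed(request):
    """True only if the request says whether its value is RPM or percent."""
    return request.speed_type in (SPEED_TYPE_PERCENT, SPEED_TYPE_RPM)


print(is_well_formed(make_percent_request(80)))        # explicit type
print(is_well_formed(FanSpeedRequest(fan_speed=80)))   # type never set
```

    The second request is the kind described as problematic: a speed value with no indication of whether it means RPM or percent.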

    I can confirm this fix works as I succeeded to compile cgminer on MinGW (which was a bit of a struggle using the latest mingw as it messed up the winsock headers and also llround is kind of missing) and also on Linux and both versions are exiting fine on pressing q.

    Also I can confirm that the only thing that does not work properly is voltage reporting and control as the newer cards apparently use a different voltage control mechanism (percentage) and maybe od6 (overdrive6) needs to be used.

    Thus, with the current implementation of OD5 you can still control the gpu and mem clocks (I have set new clocks and they were updated), the fans (if you have them ..., I have not tried to control them, but I can see the RPM if the fan is present) and the powertune level but not the voltage.

    tkg

    1. You were correct about the fix, I just recompiled cgminer and I can enter and exit cgminer without it crashing my machine (Windows 7 64-bit with latest drivers) - at least it worked the couple of times I tried it.

      Good find and thank you!

  10. May I get a copy of your recompiled version?

    1. I've put the recompiled version up here: http://k-dev.net/cgminer/
      Since I've been messing with other stuff, see changes here: http://k-dev.net/cgminer/md-changes.txt

  11. Thank you for the compiled version. Should I still use the no restart option with your version?

    1. Yes you still need to use the 'no-restart' parameter if you encounter the blank screen with a blinking cursor, as I haven't changed the code that triggers the restart.

      I suppose I could do so, but I'm not sure how best to approach it. Maybe check if there ever was a fan speed value from the card or something. I'll look into it, but no promises.
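      The "check if there ever was a fan speed value" idea could look something like the Python sketch below. This is purely a hypothetical illustration of that one sentence, not the code in cgminer's adl.c and not the change that ended up in the recompiled build. The class name and the convention that a negative number means "no reading" are assumptions based on the description of the bug earlier in this thread.

```python
class FanWatch:
    """Track whether a card has ever reported a usable fan speed."""

    def __init__(self):
        self.seen_valid_reading = False

    def should_restart(self, fan_percent):
        """Decide whether a fan reading should trigger a miner restart.

        A negative value is treated as "no reading". It only counts as a
        sign of trouble if this card returned a valid reading earlier.
        """
        if fan_percent >= 0:
            self.seen_valid_reading = True
            return False
        return self.seen_valid_reading


watch = FanWatch()
print(watch.should_restart(-1))   # card never reported a speed: ignore
print(watch.should_restart(33))   # valid reading: remember it
print(watch.should_restart(-1))   # reading lost after working: suspicious
```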

    2. Replying to myself, good fun!

      Anyways, turns out the fix was easier than I thought, so I've updated the change log and the zip archive with the latest version.

      As usual, it hasn't been tested outside my own computer (Win7 64-bit with latest drivers + R9 290 and 6970), so your mileage may vary.

    3. I just tried it and got a BSOD, tried again and the computer locked up. I'm simply using a batch file with cgminer parameters to run it, which worked fine before. Before i saw your version i was trying to compile it myself, but it's a hard task since i barely touched code or compilers before.

      If you only changed those few things, i can't see why it's crashing. Can't be linked to the batch file, right? All it includes is:

      del C:\Users\Rig\Desktop\Dropbox\((Stuff))\Apps\Mining\Scripts\*.bin
      del C:\Users\Rig\Desktop\Dropbox\((Stuff))\Apps\Mining\CGminer\*.bin
      setx GPU_MAX_ALLOC_PERCENT 100
      setx GPU_USE_SYNC_OBJECTS 1
      ..\cgminer\cgminer.exe --scrypt -u Gamdag.Rig -p x -o stratum+tcp://stratum01.hashco.ws:8888 -I 13 -g 2 -w 256 --thread-concurrency 20480 --gpu-engine 1000 --gpu-memclock 1500 --gpu-powertune 20

    4. OK, i tried it again, but this time with no overclocking options, and it works fine. When i add OC options (see last 3 parameters above) it just dies.

      Also on Win 7 x64 like yourself, running 3 x 290 cards.

    5. The batch file looks fine, only you don't need to run the 'setx' commands more than once. Also what kind of speeds are you getting with -I 13 -g 2? I've always run my R9 290 with -I 20 -g 1.

      What is the brand of your R9 290s? And can you set all three to 1000/1500/20 manually without any issues? And you get the BSOD on exit of cgminer right? Not when you start it.

    6. Hi Martin, thanks for the reply.

      When you say you only need to run "setx" once, is that just once ever? I thought it was required after a reboot, so i left it in. Are they just environment variables? I never looked into them. :)

      My 290s are Sapphire. They overclocked fine just before, but i'll try them again tonight with your version.

      The -I 13 was just for testing, i like to test with lower intensity. Normally i run -I 20 -g 2 and get around 700-750 per card with no overclock, and 850-900 per card with 1000/1500/20 settings.

      I also find myself getting many more rejects when using these settings while not overclocked and then have to put in lower settings to avoid them. So either way, the loss tends to be around 25%+ (lower speed and/or higher rejects), which is why i'd like to fix this issue! :)

    7. setx stores the value in the registry, you can find them under HKEY_CURRENT_USER in the Environment folder. There's no harm in calling setx more than once, it's just redundant. :)
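      If you would rather not touch the registry at all, an alternative is to hand the two variables to the miner process only. The Python sketch below shows that approach; it is a different technique from `setx`, offered as an illustration, and the helper name is made up. The variable names and values are the ones from the batch file above.

```python
import os

# Same variables the batch file sets with setx, as plain strings.
MINER_ENV = {
    "GPU_MAX_ALLOC_PERCENT": "100",
    "GPU_USE_SYNC_OBJECTS": "1",
}


def build_miner_env(base=None):
    """Return a copy of the environment with the miner variables added.

    base defaults to the current process environment. Nothing is written
    to the registry; the values exist only for the process started with it.
    """
    env = dict(os.environ if base is None else base)
    env.update(MINER_ENV)
    return env


env = build_miner_env()
print(env["GPU_MAX_ALLOC_PERCENT"], env["GPU_USE_SYNC_OBJECTS"])
```

      The resulting dictionary can be passed as the `env` argument of `subprocess.Popen` when launching cgminer from a script.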

      I think you should try '-I 20 -g 1' at 1000/1500/20 and see what it gives you, although if you get above 850Kh/s per R9 290 you're already pretty close to optimal speeds. With these settings and a thread-concurrency of 28456, I get ~880Kh/s with zero hardware errors and 1-2% rejects using the middlecoin european stratum.

      But I'm more interested in why you're still experiencing blue screens upon exit, since tkg's fix worked both for him and myself. I may make a debug build with an extended log, so I can see what is going on. But it'll have to wait a few days, family is demanding my presence for some weird holiday thing.

    8. Ah yes of course, i forgot whether it was the reg or an environment variable. So much to take in when learning how to mine that some things just go over my head.

      I will try those settings when i get home (currently visiting family for christmas) as i don't want to fiddle with it remotely now it's running. If it crashes, i'm screwed for the week! With the settings i gave plus the overclock, i was getting 2.5-2.6k total with zero errors and rejects when i tested for 2 hours. The rejects only turned up (around 6%) now that i'm using no overclock, maybe because it misses the powertune option. :)
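      The trade-off between raw speed and rejects can be put into numbers with a little arithmetic. The Python sketch below uses midpoints of the per-card ranges quoted in this thread (700-750 and 850-900) and the roughly 6% reject rate mentioned above; picking midpoints is my simplification, and the function names are invented for the example.

```python
def effective_rate(raw_khs, reject_fraction):
    """Hash rate that actually counts once rejected shares are discarded."""
    return raw_khs * (1.0 - reject_fraction)


def relative_loss(baseline_khs, actual_khs):
    """Fraction of the baseline rate that is being lost."""
    return 1.0 - actual_khs / baseline_khs


# Assumed midpoints of the ranges quoted in this thread, per card.
stock = effective_rate(725, 0.06)        # no overclock, about 6% rejects
overclocked = effective_rate(875, 0.0)   # 1000/1500/20, no rejects

print("stock effective Kh/s:", round(stock, 1))
print("loss versus overclocked:", round(relative_loss(overclocked, stock), 3))
```

      With these assumed inputs the loss comes out in the low twenties of percent, which is in the same range as the 25% estimate given earlier.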

      Understandable that you have no time, i have similar issues. Bah humbug. I won't be able to test it for another 5 days anyway, no rush!

    9. Call me sad, but i was just tweaking my rig! :)

      I tried the settings you mentioned, though with no overclock (since i'm still not near the rig and don't want to crash it). What i have noticed in the past 10 minutes is zero rejects (before it was a clear 5-8% in 24 hours) which i thought was simply due to no overclocking/powertune being present, which may be the case, but this runs stable even without those... plus i'm getting around an extra 5% base speed. Now i'm at 820KH average per card, stock clock speeds, 81 degrees.

      I can't wait to try this out with some overclocking!

      Have a nice Christmas Day!

    10. What version of the AMD-APP SDK are you guys running? I was using 1.2 then upgraded to the latest 2.9 then it fixed the blank screen issue.

      http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/

  12. It is a non-issue with the new update of CGWatcher. It has an option to send a kill command instead of quit, which closes cgminer without a BSOD. It can be found under the Settings -> Miner tab.

    1. Hi Sead,

      How do you use cgwatcher under Linux (does it work with mono) ? ;)

      My point was that just because you use Windows it does not mean everybody does (a significant number of miners are using Linux).

      Also, if you change the fan speed using cgminer and then use the kill command, the fan speeds will not be restored to what they were before running cgminer, because the close_adl function will not be called.

      So yes you do avoid the BSOD using the kill option but that is just a partial fix for the outcome and not the cause. ;)
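      The point about cleanup can be illustrated with a small Python analogy. This is not cgminer code: the dictionary stands in for the card's state and the context manager plays the role that close_adl is described as playing, restoring the previous fan setting on a normal exit. All names here are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def fan_override(card_state, percent):
    """Temporarily override the fan setting and restore it on normal exit."""
    previous = card_state["fan_percent"]
    card_state["fan_percent"] = percent
    try:
        yield card_state
    finally:
        # Cleanup step, analogous to restoring settings at shutdown.
        # It runs on a normal exit, but not if the process is killed.
        card_state["fan_percent"] = previous


card = {"fan_percent": 33}
with fan_override(card, 80):
    print("while mining:", card["fan_percent"])
print("after clean exit:", card["fan_percent"])
```

      If the process is killed from outside while inside the `with` block, the `finally` part never runs, which is the same reason a killed miner leaves the fans at whatever speed it last set.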

      Regards,
      tkg
